ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

Viet Bac Le*, Laurent Besacier*, Tanja Schultz**
* CLIPS-IMAG Laboratory, UMR CNRS 5524, BP 53, 38041 Grenoble Cedex 9, FRANCE
** Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA

Presenter: Hsu-Ting Wei

Mar 17, 2016


Outline

• 1. Introduction
• 2. Acoustic-phonetic unit similarities
  – 2.1 Phoneme Similarity
  – 2.2 Polyphone Similarity
  – 2.3 Clustered Polyphonic Model Similarity
• 3. Experiments and results
• 4. Conclusion

1. Introduction

• There are at least 6,900 languages in the world
• We are interested in new techniques and tools for rapid portability of speech recognition systems (e.g., ASR) when only limited resources are available

1. Introduction (cont.)

• In crosslingual acoustic modeling, previous approaches have been limited to context-independent models
  – Monophonic acoustic models in the target language were initialized using seed models from the source language
  – These initial models could then be rebuilt or adapted using training data from the target language
• Crosslingual portability and adaptation of context-dependent acoustic models can therefore be investigated

1. Introduction (cont.)

• 1996, J. Kohler
  – J. Kohler used HMM distances to calculate the similarity between two monophonic models
  – Method
    • To measure the distance or the similarity of two phoneme models, we use a relative-entropy-based distance metric
    • During training, the parameters of the mixture Laplacian density phoneme models are estimated
    • Further, for each phoneme, a set of phoneme tokens X is extracted from a test or development corpus
    • Given two phoneme models $\lambda_i$ and $\lambda_j$, and given the sets of phoneme tokens or observations $X_i$ and $X_j$, the distance between model $\lambda_i$ and $\lambda_j$ is defined by

$$
d(\lambda_i, \lambda_j) = \log p(X_i \mid \lambda_i) - \log p(X_i \mid \lambda_j)
$$

$$
d(\lambda_j, \lambda_i) = \log p(X_j \mid \lambda_j) - \log p(X_j \mid \lambda_i)
$$

    • The symmetric distance is the average:

$$
d(\lambda_i; \lambda_j) = \frac{1}{2}\left( d(\lambda_i, \lambda_j) + d(\lambda_j, \lambda_i) \right)
$$
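
The relative-entropy-based distance on this slide can be illustrated with a minimal sketch. It assumes one-dimensional tokens and a single Laplacian per phoneme instead of a mixture, and it averages the log-likelihood over tokens; the names `avg_loglik`, `directed_distance`, and `symmetric_distance` and the toy data are illustrative, not taken from the slides.

```python
import math

def laplace_logpdf(x, mu, b):
    """Log density of a Laplace distribution with location mu and scale b."""
    return -math.log(2.0 * b) - abs(x - mu) / b

def avg_loglik(tokens, model):
    """Average log-likelihood of scalar tokens under a (mu, b) model (assumed normalization)."""
    mu, b = model
    return sum(laplace_logpdf(x, mu, b) for x in tokens) / len(tokens)

def directed_distance(tokens_i, model_i, model_j):
    """d(i, j) = log p(X_i | model_i) - log p(X_i | model_j)."""
    return avg_loglik(tokens_i, model_i) - avg_loglik(tokens_i, model_j)

def symmetric_distance(tokens_i, model_i, tokens_j, model_j):
    """Average of the two directed distances."""
    d_ij = directed_distance(tokens_i, model_i, model_j)
    d_ji = directed_distance(tokens_j, model_j, model_i)
    return 0.5 * (d_ij + d_ji)

# Toy tokens and models for two hypothetical phonemes "a" and "b".
X_a = [0.1, -0.2, 0.3, 0.0]
X_b = [2.1, 1.8, 2.4, 2.0]
model_a = (0.0, 0.5)
model_b = (2.0, 0.5)

print(symmetric_distance(X_a, model_a, X_b, model_b))
```

Tokens that are well explained by their own model and poorly explained by the other model give a large distance, while comparing a model with itself gives zero.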

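1. Introduction (cont.)

• 2000, B. Imperl
  – proposed a triphone similarity estimation method based on phoneme distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
  – Method
    • The similarity of two triphones $L_i - c_i - R_i$ and $L_j - c_j - R_j$ ($L$ and $R$ denote the left context phoneme and the right context phoneme, and $c$ the central phoneme):

$$
S_{TRI}(L_i - c_i - R_i,\; L_j - c_j - R_j) = W_L\, S(L_i, L_j) + W_C\, S(c_i, c_j) + W_R\, S(R_i, R_j)
$$

    • The phoneme similarity $S(i, j)$, $i, j = 1, \ldots, N$, is computed from a confusion matrix with entries $c$

[Equation: S(i, j) as 1/2 times a sum over k = 1..N of confusion-matrix terms; the exact terms are not recoverable from the transcript]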
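
A triphone similarity of this kind, a weighted sum of the left-context, central, and right-context phoneme similarities, can be sketched as follows. The weights, the toy similarity table, and the function names are assumptions for illustration; the slide does not give the weight values, and the confusion-matrix formula behind the phoneme similarity is replaced here by a lookup table.

```python
# Hypothetical symmetric phoneme similarities for unequal phoneme pairs.
TOY_SIMILARITY = {
    frozenset(("t", "d")): 0.8,
    frozenset(("a", "o")): 0.6,
}

def phoneme_sim(p, q):
    """Similarity in [0, 1]: 1.0 for identical phonemes, table lookup otherwise."""
    if p == q:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((p, q)), 0.0)

def triphone_similarity(tri_i, tri_j, weights=(0.25, 0.5, 0.25)):
    """Weighted sum of left, central and right phoneme similarities.

    Triphones are (left, central, right) tuples; the weights
    (W_L, W_C, W_R) are assumed values that sum to 1.
    """
    (l_i, c_i, r_i), (l_j, c_j, r_j) = tri_i, tri_j
    w_l, w_c, w_r = weights
    return (w_l * phoneme_sim(l_i, l_j)
            + w_c * phoneme_sim(c_i, c_j)
            + w_r * phoneme_sim(r_i, r_j))

print(triphone_similarity(("a", "t", "i"), ("a", "d", "i")))
```

Giving the central phoneme the largest weight is a design choice of this sketch, so that a mismatch in the central phoneme costs more than a mismatch in either context.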

1. Introduction (cont.)

• 2001, T. Schultz
  – proposed PDTS (Polyphone Decision Tree Specialization) to overcome the problem that the context mismatch across languages increases dramatically for wider contexts
  – PDTS: data-driven method (↓)
    • In PDTS, the clustered multilingual polyphone decision tree is adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language

1. Introduction (cont.)

• In this paper, we investigate a new method for this crosslingual transfer process
• We do not use the existing decision tree in the source language, but build a new decision tree with just a small amount of data from the target language
• Then, based on the acoustic-phonetic unit similarities, some crosslingual transfer and adaptation processes are applied

2. Acoustic-phonetic unit similarities

• 2.1 Phoneme Similarity
  – 2.1.1 Data-driven methods
  – 2.1.2 Proposed knowledge-based method
• 2.1.1 Data-driven methods (↓)
  – The acoustic similarity between two phonemes can be obtained automatically by calculating the distance between two acoustic models
    • HMM distance (Entropy)
    • Kullback-Leibler distance
    • Bhattacharyya distance
    • Euclidean distance
    • Confusion matrix
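
Two of the data-driven distances listed above have closed forms when each acoustic model is reduced to a single one-dimensional Gaussian. The sketch below makes that simplifying assumption (real acoustic models are multivariate HMM/mixture models); the function names and the example means and variances are illustrative.

```python
import math

def kl_divergence(mu0, var0, mu1, var1):
    """Kullback-Leibler divergence KL(N(mu0, var0) || N(mu1, var1))."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(mu0, var0, mu1, var1):
    """Symmetrized KL: sum of the two directed divergences."""
    return kl_divergence(mu0, var0, mu1, var1) + kl_divergence(mu1, var1, mu0, var0)

def bhattacharyya(mu0, var0, mu1, var1):
    """Bhattacharyya distance between two univariate Gaussians."""
    mean_term = 0.25 * (mu0 - mu1) ** 2 / (var0 + var1)
    var_term = 0.5 * math.log((var0 + var1) / (2.0 * math.sqrt(var0 * var1)))
    return mean_term + var_term

def euclidean(mu0, mu1):
    """Euclidean distance between the model means (variances ignored)."""
    return abs(mu0 - mu1)

# Two hypothetical phoneme models described by (mean, variance).
print(kl_divergence(0.0, 1.0, 1.0, 1.0))
print(bhattacharyya(0.0, 1.0, 1.0, 1.0))
```

The KL divergence is not symmetric in general, which is why a symmetrized variant is usually needed before it can serve as a phoneme distance.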

2. Acoustic-phonetic unit similarities (cont.)

• 2.1.2 Proposed knowledge-based method (↑)
  – Step 1: Top-down classification
  – Step 2: Bottom-up estimation
• Step 1: Top-down classification
  – Figure 1 shows a hierarchical graph where each node is classified into different layers
    • k = number of layers
    • G_i = user-defined similarity value for layer i (i = 0..k-1)
  – In our work, we investigated several settings of k and G_i and set G = {0.9; 0.45; 0.25; 0.1; 0.0} with k = 5, based on a cross-evaluation in crosslingual acoustic modeling experiments

[Figure 1: hierarchical graph of phonemes with layer values G0 = 0.9, G1 = 0.45, G2 = 0.25, G3 = 0.1, G4 = 0.0]

2. Acoustic-phonetic unit similarities (cont.)

• Step 2: Bottom-up estimation
  – d([i], [u]) = G2 (= 0.25 in this experiment)

[Figure 1, repeated: layer values G0 = 0.9, G1 = 0.45, G2 = 0.25, G3 = 0.1, G4 = 0.0]
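
The two steps of the knowledge-based method can be sketched as a lookup in a class hierarchy: each phoneme is placed top-down into nested classes, and the similarity of two phonemes is the G value of the lowest layer at which they still share a class. The layer values come from the slide; the class hierarchy below is a hypothetical stand-in for Figure 1 (which is not in the transcript), and returning 1.0 for identical phonemes is an assumption.

```python
# Layer values from the slide: G0 is the most specific layer, G4 the root.
G = [0.9, 0.45, 0.25, 0.1, 0.0]

# Hypothetical top-down classification: one class label per layer below the
# root, from layer 3 (most general) down to layer 0 (most specific).
CLASS_PATHS = {
    "i": ("vowel", "close", "front", "unrounded"),
    "y": ("vowel", "close", "front", "rounded"),
    "u": ("vowel", "close", "back", "rounded"),
    "a": ("vowel", "open", "front", "unrounded"),
    "p": ("consonant", "plosive", "bilabial", "voiceless"),
    "b": ("consonant", "plosive", "bilabial", "voiced"),
}

def phoneme_similarity(a, b):
    """Bottom-up estimation: G value of the lowest layer shared by a and b."""
    if a == b:
        return 1.0  # assumption: identical phonemes are maximally similar
    shared = 0
    for class_a, class_b in zip(CLASS_PATHS[a], CLASS_PATHS[b]):
        if class_a != class_b:
            break
        shared += 1
    # No shared class -> root layer (G4); all classes shared -> layer 0 (G0).
    return G[len(G) - 1 - shared]

print(phoneme_similarity("i", "u"))
```

With this toy hierarchy, [i] and [u] share only the "vowel" and "close" classes, so their similarity is G2 = 0.25, matching the example on the slide.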

2. Acoustic-phonetic unit similarities (cont.)

• 2.2 Polyphone Similarity
  – Let S be the phoneme set in the source language and T be the phoneme set in the target language
  – Finally, find the nearest polyphones P_s

[Equation: d(s, t) as 1/2 times a sum over k = 1..N; the individual terms are not recoverable from the transcript]
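
The search for the nearest source-language polyphone can be sketched as a nearest-neighbor lookup. Because the slide's formula did not survive in the transcript, this sketch assumes that the polyphone similarity is the unweighted average of position-wise phoneme similarities; that choice, the toy similarity table, and all names are assumptions rather than the authors' definition.

```python
# Hypothetical similarities between unequal source/target phoneme pairs.
TOY_SIMILARITY = {
    frozenset(("t", "d")): 0.45,
    frozenset(("i", "u")): 0.25,
}

def phoneme_sim(s, t):
    """Similarity in [0, 1] between a source phoneme s and a target phoneme t."""
    if s == t:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((s, t)), 0.0)

def polyphone_similarity(p_s, p_t):
    """Assumed definition: mean of position-wise phoneme similarities."""
    if len(p_s) != len(p_t):
        raise ValueError("polyphones must have the same context width")
    return sum(phoneme_sim(s, t) for s, t in zip(p_s, p_t)) / len(p_s)

def nearest_source_polyphone(p_t, source_polyphones):
    """Return the source polyphone most similar to the target polyphone p_t."""
    return max(source_polyphones, key=lambda p_s: polyphone_similarity(p_s, p_t))

source_polyphones = [("a", "d", "i"), ("o", "k", "u"), ("a", "d", "u")]
target = ("a", "t", "i")
print(nearest_source_polyphone(target, source_polyphones))
```

An exhaustive search like this is only practical for small polyphone inventories; a larger inventory would need the candidates restricted first, for example to those sharing a similar central phoneme.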

2. Acoustic-phonetic unit similarities (cont.)

• 2.3 Clustered Polyphonic Model Similarity
  – Since the number of polyphones in a language is very large, a limited training corpus usually does not cover enough occurrences of every polyphone
  – A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model

2. Acoustic-phonetic unit similarities (cont.)

• 2.3 Clustered Polyphonic Model Similarity (cont.)

3. Experiments and results

• Syllable-based ASR system
• Data used to build a polyphonic decision tree and to adapt the crosslingual acoustic models
  – Baseline
    • 2.25 hours of data spoken by 8 speakers were used (Vietnamese speech)
  – Crosslingual methods
    • Speech data from six languages were used to build these models: Arabic, Chinese, English, German, Japanese and Spanish
    • Adapted with Vietnamese speech

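3. Experiments and results (cont.)

[Figure: experimental setup. Labels: "Initial model"; "Vietnamese speakers speaking Vietnamese (2.5 hours)"; "Speakers of six languages speaking their own native languages (perhaps 100 hours)"; "Adaptation with added Vietnamese speech from Vietnamese speakers (2.5 hours)"; "New models"; "test"; "test"]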

3. Experiments and results (cont.)

• As the number of clustered sub-models increases, the syllable error rate (SER) of the baseline system increases proportionally, since the amount of data per model decreases due to the limited training data
• However, the crosslingual system is able to overcome this problem by indirectly using data in other languages

3. Experiments and results (cont.)

• The influence of adaptation data size and the number of speakers on the baseline system and on two methods of phoneme similarity estimation:
  – knowledge-based
  – data-driven, using a confusion matrix
• We find that the knowledge-based method outperforms the data-driven method

4. Conclusion

• This paper presents different methods of estimating the similarities between two acoustic-phonetic units
• The potential of our method is demonstrated, even though the use of trigrams in syllable-based language modeling might be insufficient to obtain acceptable error rates

Page 5: ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

5

1 Introduction (cont)

bull 1996 J Kohlerndash J Kohler used HMM distances to calculate the similarity between two mo

nophonic modelsndash Method

bull To measure the distance or the similarity of two phoneme models we use a relative entropy-based distance metric

bull During training the parameters of the mixture Laplacian density phoneme models are estimated

bull Further for each phoneme a set of phoneme tokens X is extracted from a test or development corpus

bull Given two phoneme models and and given the sets of phoneme tokens or observations Xi and Xj the distance between model and is defined by

i j

i j

ijjiji

ijjjij

jiiiji

ddd

XpXpd

XpXpd

21

is average thedictance symmetric

|log|log

|log|log

6

1 Introduction (cont)

bull 2000 B Imperlndash proposed that a triphone similarity estimation method based on phoneme

distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones

ndash Method

jiNji

ccccS

SWSWSW

S

N

k

RCL

TRI

1 )( where

)()()()(21)(

)()()(

) - -(

phoneme) central theand

phonemecontext right phonemecontext left thedenote and (

- and - triphones twoof similarity The

ji

1kjkikjkiji

Rj

Riji

Lj

Li

Rjj

Lj

Rii

Li

RL

Rjj

Lj

Rii

Li

Confusion matrix

7

1 Introduction (cont)

bull 2001 T Schultzndash proposed PDTS (Polyphone Decision Tree Specialization) to

overcome the problem which the context mismatch across languages increases dramatically for wider contexts

ndash PDTS data-driven method (darr)bull In PDTS the clustered multilingual polyphone decision tree is

adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language

8

1 Introduction (cont)

bull In this paper we investigate a new method for this crosslingual transfer process

bull We do not use the existing decision tree in source language but build a new decision tree just with a small amount of data from the target language

bull Then based on the acoustic-phonetic unit similarities some crosslingual transfer and adaptation processes are applied

9

2 Acoustic-phonetic unit similarities

bull 21 Phoneme Similarityndash 211 Data-driven methodsndash 212 Proposed knowledge-based method

bull 211 Data-driven methods (darr)ndash The acoustic similarity between two phonemes can be obtained

automatically by calculating the distance between two acoustic models

bull HMM distance (Entropy)bull Kullback-Leibler distance bull Bhattacharyya distancebull Euclidean distance bull Confusion matrix

10

2 Acoustic-phonetic unit similarities (cont)

bull 212 Proposed knowledge-based method (uarr)ndash Step 1 Top-down classificationndash Step 2 Bottom-up estimation

bull Step 1 Top-down classificationndash Figure 1 shows a hierarchical graph where each node is classified into different layers

bull k = number of layersbull Gi = user defined similarity value for layer i (i = 0k-1)

ndash In our work we investigated several settings of k and G i and set G = 09 045 025 01 00 with k = 5 based on a cross-evaluation in crosslingual acoustic modeling experiments

G0 = 09

G1 = 045

G4 = 00

G3 = 01

G2 = 025

11

2 Acoustic-phonetic unit similarities (cont)

bull Step 2 Bottom-up estimationndash d( [i] [u] ) = G2 (=025 in this experiment)

G0 = 09

G1= 045

G4 = 00

G3 = 01

G2 = 025

12

2 Acoustic-phonetic unit similarities (cont)

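• 2.2 Polyphone Similarity
  – Let S be the phoneme set of the source language and T be the phoneme set of the target language.
  – [Equation for the distance d between source and target units s and t, written as a sum over k = 1, ..., N; the expression is not recoverable from this transcript]
  – Finally, find the nearest source polyphones P_S.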
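The final search step can be sketched as follows. Because the slide's polyphone formula is not recoverable here, the sketch assumes a position-weighted sum of phoneme similarities, with the centre phoneme weighted most; the weights, the toy similarity table and all names are assumptions made for illustration, not the authors' definition.

```python
# Toy phoneme similarities (hypothetical values); unlisted pairs count as 0.0.
TOY_SIM = {("b", "p"): 0.45, ("a", "e"): 0.45}


def toy_phoneme_sim(s, t):
    """Symmetric lookup in the toy table; identical symbols get 1.0."""
    if s == t:
        return 1.0
    return TOY_SIM.get((s, t), TOY_SIM.get((t, s), 0.0))


def polyphone_similarity(p_s, p_t, phoneme_sim, weights):
    """Assumed form: weighted sum of per-position phoneme similarities."""
    if not (len(p_s) == len(p_t) == len(weights)):
        raise ValueError("polyphones and weights must have the same length")
    return sum(w * phoneme_sim(s, t) for w, s, t in zip(weights, p_s, p_t))


def nearest_source_polyphone(p_t, source_polyphones, phoneme_sim, weights):
    """Return the source polyphone most similar to the target polyphone p_t."""
    return max(
        source_polyphones,
        key=lambda p_s: polyphone_similarity(p_s, p_t, phoneme_sim, weights),
    )


# Usage: one target triphone and three candidate source triphones.
WEIGHTS = (0.25, 0.5, 0.25)  # left context, centre, right context (assumed)
SOURCES = [("p", "a", "t"), ("b", "e", "t"), ("k", "o", "s")]
BEST = nearest_source_polyphone(("b", "a", "t"), SOURCES, toy_phoneme_sim, WEIGHTS)
```

Here `BEST` is `("p", "a", "t")`: it matches the target on the heavily weighted centre phoneme, whereas `("b", "e", "t")` only matches on the contexts.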

13

2 Acoustic-phonetic unit similarities (cont)

• 2.3 Clustered Polyphonic Model Similarity
  – Since the number of polyphones in a language is very large, a limited training corpus usually does not contain enough occurrences of every polyphone.
  – A decision tree-based clustering (figure 3) or an agglomerative clustering procedure (figure 1) is therefore needed to cluster similar polyphones and model them with a clustered polyphonic model.
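The agglomerative (bottom-up) option can be sketched as below: every polyphone starts in its own cluster and the two closest clusters are merged until the requested number remains. Average linkage and the toy distance (number of differing positions) are assumptions chosen for illustration; the slides do not specify either.

```python
def agglomerative_clusters(items, distance, num_clusters):
    """Merge the two closest clusters (average linkage) until num_clusters remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[item] for item in items]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def mismatch_distance(p, q):
    """Toy distance: number of positions at which two polyphones differ."""
    return sum(1 for a, b in zip(p, q) if a != b)


# Usage: four hypothetical triphones grouped into two clustered models.
TRIPHONES = [("b", "a", "t"), ("p", "a", "t"), ("k", "o", "s"), ("g", "o", "s")]
CLUSTERS = agglomerative_clusters(TRIPHONES, mismatch_distance, 2)
```

Each resulting cluster would then be modeled by one shared acoustic model, so that rarely seen polyphones borrow training data from similar ones.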

14

2 Acoustic-phonetic unit similarities (cont)

• 2.3 Clustered Polyphonic Model Similarity (cont)

15

3 Experiments and results

• Syllable-based ASR system

• In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models:
  – Baseline
    • 2.25 hours of Vietnamese speech data spoken by 8 speakers were used
  – Crosslingual methods
    • Speech data from six languages were used to build these models: Arabic, Chinese, English, German, Japanese and Spanish
    • Adapted with Vietnamese speech

16

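3 Experiments and results (cont)

[Diagram labels: Initial model; Vietnamese speakers speaking Vietnamese (2.5 hours); speakers of six languages each speaking their own language (perhaps 100 hours); adaptation by adding Vietnamese speakers speaking Vietnamese (2.5 hours); New models; test; test]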

17

3 Experiments and results (cont)

• As the number of clustered sub-models increases, the syllable error rate (SER) of the baseline system increases proportionally, since the amount of data per model decreases with the limited training data.

• However, the crosslingual system is able to overcome this problem by indirectly using data from other languages.

18

3 Experiments and results (cont)

• Influence of the adaptation data size and the number of speakers on the baseline system and on two methods of phoneme similarity estimation:
  – knowledge-based
  – data-driven, using a confusion matrix

• We find that the knowledge-based method outperforms the data-driven method.

19

4 Conclusion

• This paper presents different methods of estimating the similarities between two acoustic-phonetic units.

• The potential of our method is demonstrated, even though the use of trigrams in the syllable-based language modeling might be insufficient to obtain acceptable error rates.


6

1 Introduction (cont)

• 2000, B. Imperl
  – proposed a triphone similarity estimation method based on phoneme distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones
  – Method: the similarity of two triphones $L_i - c_i - R_i$ and $L_j - c_j - R_j$ ($L$ and $R$ denote the left-context and right-context phonemes, and $c$ the central phoneme) is

$$S_{TRI}(L_i - c_i - R_i,\; L_j - c_j - R_j) = W_L\, S(L_i, L_j) + W_C\, S(c_i, c_j) + W_R\, S(R_i, R_j)$$

    where the phoneme similarity $S(c_i, c_j)$ is estimated from a confusion matrix as a sum over $k = 1, \dots, N$ (the exact expression is not recoverable from this transcript).
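The weighted sum above can be sketched directly in code. The weights W_L, W_C, W_R and the phoneme similarity values used here are hypothetical; in the method described, the phoneme similarities would come from a confusion matrix, which is not available in this transcript.

```python
# Hypothetical phoneme similarities; a real system would derive these
# from a phoneme confusion matrix.
PHONEME_SIM = {("b", "p"): 0.45, ("a", "e"): 0.45}


def phoneme_similarity(x, y):
    """S(x, y): 1.0 for identical phonemes, else a symmetric table lookup."""
    if x == y:
        return 1.0
    return PHONEME_SIM.get((x, y), PHONEME_SIM.get((y, x), 0.0))


def triphone_similarity(tri_i, tri_j, w_left=0.25, w_centre=0.5, w_right=0.25):
    """S_TRI = W_L * S(L_i, L_j) + W_C * S(c_i, c_j) + W_R * S(R_i, R_j)."""
    (l_i, c_i, r_i), (l_j, c_j, r_j) = tri_i, tri_j
    return (w_left * phoneme_similarity(l_i, l_j)
            + w_centre * phoneme_similarity(c_i, c_j)
            + w_right * phoneme_similarity(r_i, r_j))
```

Triphones are written as (left, centre, right) tuples, so `triphone_similarity(("b", "a", "t"), ("p", "a", "t"))` combines S(b, p), S(a, a) and S(t, t). In a bottom-up clustering, the pair of triphones with the highest such similarity would be the first candidate for merging.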

7

1 Introduction (cont)

• 2001, T. Schultz
  – proposed PDTS (Polyphone Decision Tree Specialization) to overcome the problem that the context mismatch across languages increases dramatically for wider contexts
  – PDTS: data-driven method (↓)
    • In PDTS, the clustered multilingual polyphone decision tree is adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language

3 Experimental and results

bull Syllable-based ASR system

bull In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models ndash Baseline

bull 225 hours of data spoken by 8 speakers were used (Vietnamese speech)

ndash Crosslingual methodsbull Speech data from six languages were used to build these models Arabi

c Chinese English German Japanese and Spanishbull Adapted by Vietnamese speech

16

3 Experimental and results (cont)

Initial model

越南人說越南話 (25 小時 )

六國人說自己本國話( 也許 100 小時 )

加入越南人說越南話 (25 小時 ) 調適New models

test

test

17

3 Experimental and results (cont)

bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data

bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages

18

3 Experimental and results (cont)

bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix

bull We find that the knowledge based method outperforms the data-driven method

19

4 Conclusion

bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units

bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 16: ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

16

3 Experimental and results (cont)

Initial model

越南人說越南話 (25 小時 )

六國人說自己本國話( 也許 100 小時 )

加入越南人說越南話 (25 小時 ) 調適New models

test

test

17

3 Experimental and results (cont)

bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data

bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages

18

3 Experimental and results (cont)

bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix

bull We find that the knowledge based method outperforms the data-driven method

19

4 Conclusion

bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units

bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 17: ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

17

3 Experimental and results (cont)

bull As the number of clustered sub-models increases SER of the baseline system increases proportionally since the amount of data per model decreases due to the limited training data

bull However the crosslingual system is able to overcome this problem by indirectly using data in other languages

18

3 Experimental and results (cont)

bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix

bull We find that the knowledge based method outperforms the data-driven method

19

4 Conclusion

bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units

bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 18: ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

18

3 Experimental and results (cont)

bull The influence of adaptation data size and the number of speakers on the baseline system and two methods of phoneme similarity estimation ndash knowledge-based ndash data-driven using confusion matrix

bull We find that the knowledge based method outperforms the data-driven method

19

4 Conclusion

bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units

bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 19: ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY

19

4 Conclusion

bull This paper presents different methods of estimating the similarities between two acoustic-phonetic units

bull The potential of our method is demonstrated even though the use of trigrams the in syllable-based language modeling might be insufficient to obtain acceptable error rates

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19