Center for Speech and Language Technologies, Tsinghua University
Dialectal Chinese Speech Recognition
Thomas Fang Zheng
Aug. 24, 2007 @ Cambridge University, UK
Motivation Goal Knowledge Data Collection Workshop Conclusion I SDPBMM Conclusion II
Outline
Motivation
Dialectal Chinese database collection
- Wu
- Min
- Chuan
Approaches
- Chinese syllable mapping
- Lexicon adaptation
- State-dependent phoneme-based model merging (SDPBMM)
- Integration of SDPBMM with adaptation
Remarks
Motivation
Chinese ASR faces an issue that is more severe than in any other language: dialect.
There are 8 major dialectal regions in addition to Mandarin (Northern China):
- Wu (Southern Jiangsu, Zhejiang, and Shanghai);
- Yue (Guangdong, Hong Kong, Nanning Guangxi);
- Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan);
- Hakka (Meixian Guangdong, Hsin-chu Taiwan);
- Gan (Jiangxi);
- Xiang (Hunan);
- Hui (Anhui);
- Jin (Shanxi, Hohhot Inner Mongolia).
These can be further divided into over 40 sub-categories.
Chinese dialects share the same written language:
- the same Chinese pinyin set (canonically),
- the same Chinese character set (canonically), and
- the same vocabulary (canonically).
Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China.
However, speech is strongly influenced by native dialects. Most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese: Putonghua influenced by the native dialect.
In dialectal Chinese:
- Word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect.
- ASR relies to a great extent on consistent pronunciation and usage of words within a language.
- ASR systems built to process PTH perform poorly for the great majority of the population.
Research Goal
To develop a general framework for modeling, in dialectal Chinese ASR tasks:
- phonetic variability,
- lexical variability, and
- pronunciation variability.
To find suitable methods to modify the baseline PTH recognizer to obtain a dialectal Chinese recognizer for the specific dialect of interest, employing:
- dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and
- training/adaptation data (in relatively small quantities).
Expectation: the resulting recognizer should also work for PTH; in other words, it should perform well on a mixture of PTH and dialectal Chinese.
This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop, from tens of proposals submitted by universities and companies around the world; the project was postponed to 2004 due to SARS.
Dialectal Chinese Speech Recognition Framework
[Figure: Standard Chinese Speech Recognizer + Dialectal Chinese Related Knowledge & Resources → Dialectal Chinese Speech Recognizer]
For practical reasons, during the summer workshop we focused on one specific dialect, the Wu dialect (Shanghai area); the target language was Wu dialectal Chinese (WDC for short).
Why the Wu dialect?
- Population: more than 70 million people speak the Wu dialect, the second most widely spoken dialect in China;
- Economy: Shanghai is one of the most advanced cities in China;
- The Wu dialect is a fully developed language:
  – the syntax of the Wu dialect is very complex;
  – the vocabulary is even larger than that of Mandarin;
  – many literary masterpieces were historically influenced by the Wu dialect.

Number of phonemes:
Wu | Mandarin | Cantonese
50 | 37 | <33
Useful Dialect-Related Knowledge
Chinese Syllable Mapping (CSM)
This CSM is dialect-related.
Two types:
- Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on;
- Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' changes to 'gui0' in the word '中国 (China)', but only the tone changes in the word '过去 (past)'.
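The two CSM types can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the workshop's implementation: the rule tables contain only the examples named above, the helper names (`split_syllable`, `map_syllable`) are hypothetical, and the Initial inventory is simplified (y and w are treated as Initials for convenience).

```python
# Word-independent rules apply to the Initial or Final of any syllable.
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

# Word-dependent rules are keyed by (word, canonical syllable).
WORD_DEPENDENT = {("中国", "guo2"): "gui0"}

# Simplified Initial inventory (assumption); longest first so "zh" wins over "z".
INITIALS = sorted(
    ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
     "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)


def split_syllable(syl):
    """Split a pinyin syllable such as 'zhong1' into (Initial, Final, tone)."""
    tone = syl[-1] if syl[-1].isdigit() else ""
    body = syl[:-1] if tone else syl
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-Initial syllable


def map_syllable(syl, word=None):
    """Apply a word-dependent rule if one exists, else word-independent rules."""
    if word is not None and (word, syl) in WORD_DEPENDENT:
        return WORD_DEPENDENT[(word, syl)]
    ini, fin, tone = split_syllable(syl)
    return INITIAL_MAP.get(ini, ini) + FINAL_MAP.get(fin, fin) + tone


print(map_syllable("sheng1"))         # Initial and Final both mapped
print(map_syllable("guo2", "中国"))    # word-dependent rule takes priority
```

Checking the word-dependent table first reflects the idea that a word-specific mapping overrides the general one; whether the original system ordered the rules this way is not stated in the slides.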
The CSM is not exact. For any mapping A→B, the resulting pronunciation is usually not exactly B, but something quite similar to B, and more similar to B than to any other syllable.
[Figure: syllable A mapped to B, which is surrounded by its variations B1, B2, B3, B4]
Bi is a variation of B, such as: nasalization, centralization, voiced, voiceless, rounding, syllabic, pharyngealization, aspiration.
[Figure: mapping between the Standard Chinese Syllable Set and the Chuan Dialect, involving the syllables ke, kei, kuo, kui, ... and the example words 上[课], [克]服, [扩]大, [魁]梧]
The CSM could be N→1, 1→N, or crossed.
Lexicon
Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60-70%.
A dialect-related lexicon contains two parts:
- a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and
- a dialect-related part (several hundred words).
In this lexicon:
- each word has one pinyin string for its standard Chinese pronunciation and a representation of its dialectal Chinese pronunciation, and
- each dialect-related word corresponds to a word in the common part with the same meaning.
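One possible way to hold such a two-part lexicon in memory is sketched below. The class and field names (`LexiconEntry`, `pth_pron`, `dialect_pron`, `pth_synonym`) are hypothetical and chosen only for illustration; the entry for 本科 uses the pronunciations from this talk's e-dictionary examples, and the dialectal pronunciations of the other entries are left unset because the slides do not give them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    word: str
    pth_pron: str                       # pinyin string, standard Chinese
    dialect_pron: Optional[str] = None  # dialectal pronunciation, if known
    pth_synonym: Optional[str] = None   # set only for dialect-related words


# Common part: shared by standard Chinese and dialectal Chinese.
common = {
    "本科": LexiconEntry("本科", "ben3 ke1", "b en3 k u1"),
    "做饭": LexiconEntry("做饭", "zuo4 fan4"),
}

# Dialect-related part: each word points to a common word of the same meaning.
dialect = {
    "烧饭": LexiconEntry("烧饭", "shao1 fan4", pth_synonym="做饭"),
}


def to_common(word):
    """Return the common-part synonym of a dialect word, or the word itself."""
    entry = dialect.get(word)
    return entry.pth_synonym if entry is not None else word


print(to_common("烧饭"))
```

Keeping the synonym link on the dialect entry makes the constraint stated above easy to check: every dialect-related word should resolve to a key of the common part.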
Language
Although dialect texts are difficult to collect, dialect-related lexical entry replacement rules can be learned in advance; therefore, language post-processing or language model adaptation techniques can be adopted.
[Figure: word sequences w1 w2 w3 … illustrating (1) a dialectal word V2 substituting for w2, and (2) w2 and w3 changing order]
Dialectal words substitute for some words:
我 做饭 给 你 吃 (PTH)
我 烧饭 给 你 吃 (Wu)
("I cook for you to eat")
Word-order changes:
你 先 走 (PTH)
你 走 先 (Wu)
("You go first")
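The two phenomena above suggest a simple rule-based post-processing step. The sketch below is only an illustration under assumptions: the rule tables hold just the two examples from this slide, the function name `to_pth` is hypothetical, and a real system would learn many more rules and would need context to decide when a reordering applies.

```python
# Lexical substitution rules: Wu word -> PTH synonym.
SUBSTITUTIONS = {"烧饭": "做饭"}

# Word-order rules on adjacent word pairs: Wu order -> PTH order.
REORDERINGS = {("走", "先"): ("先", "走")}


def to_pth(words):
    """Rewrite a segmented Wu-dialectal word sequence into PTH form."""
    words = [SUBSTITUTIONS.get(w, w) for w in words]
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in REORDERINGS:
            out.extend(REORDERINGS[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


print(" ".join(to_pth("我 烧饭 给 你 吃".split())))
print(" ".join(to_pth("你 走 先".split())))
```

The same rule tables could be applied in the opposite direction to generate dialectal variants of PTH text for language model adaptation.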
AM0 = AM for standard Chinese; AM1 = AM with accent; AM2 = AM with dialect
LM0 = LM for standard Chinese; LM1 = LM with dialectal lexicon; LM2 = LM with dialectal lexicon/syntax
[Figure: grid of acoustic models (AM0, AM1, AM2) against language models (LM0, LM1, LM2), spanning standard Chinese to dialect, with regions marked "Seldom seen in dialectal Chinese" and "Our focus"]
Data Creation for WDC
[Figure: data creation comprises database collection and an e-dictionary.
Database collection: read speech (PTH words only; PTH + Wu words), spontaneous speech (topics), and speech transcription.
e-Dictionary: IF & syllable set definition (IFs/GIFs, syllables, Chinese characters), PTH words (PTH pronunciation, Wu dialect pronunciation), Wu dialect words (PTH pronunciation, PTH synonym, Wu dialect pronunciation), and miscellaneous information.]
IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese
Wu Dialectal Chinese (WDC) Database Collection (1)
Collection: 11 hours in total, half read (R) and half spontaneous (S):
– 100 Shanghai speakers × (3R + 3S) minutes per speaker
– 10 Beijing speakers × 6S minutes per speaker
Read speech with well-balanced prompting sentences:
– Type I: each sentence contains PTH words only (5-6k)
– Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words
Spontaneous speech with pre-defined talking topics:
– Conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology
Balanced speakers (gender, age, education, PTH level, …)
Actual WDC Data Diversity
Num of speakers | Male | Female | Total
Age 26-40 | 27 | 25 | 52
Age 41-50 | 23 | 25 | 48
Education: Well | 41 | 41 | 82
Education: Ordinary | 9 | 9 | 18

Goal
Gender | Male: 50% | Female: 50%
Age | 26-40: 50% | 41-50: 50%
Education | Ordinary: 20% | Well: 80%
Accent Assessment by experts
[Figure: chart over the accent levels 1A, 1B, 2A, 2B, 3A, 3B]
Accent levels: 1A. CCTV-level radio broadcaster; 1B. Province-level radio broadcaster; 2A. Quite good; 2B. Less accented; 3A. More accented; 3B. Hard to understand, but recognizably PTH
[Figure: Accent Assessment according to age (26-40 vs. 41-50), over accent levels 1A to 3B]
[Figure: Accent Assessment according to education level (Ordinary vs. Well), over accent levels 1A to 3B]
[Figure: Accent Assessment according to gender (Male vs. Female), over accent levels 1A to 3B]
Wu Dialectal Chinese (WDC) Database Collection (2)
Transcriptions include:
For the 100 Wu Dialectal Chinese speakers:
– Canonical Chinese Initial/Final labels, and
– Generalized IF (GIF) labels.
For the 10 Beijing speakers:
– Chinese character and pinyin transcriptions only
Dialectal Lexicon Construction
Establish a 50k-word electronic dialect dictionary, with each word having:
- PTH pronunciation as a PTH IF string
- Wu dialect pronunciation as a Wu IF string
Purpose: summarizing dialect-related knowledge
Figure out Chinese syllable mappings:
– Same written form (character), different pronunciations;
– Both word-independent and word-dependent;
Find dialect-related word variations:
– Same meanings in Chinese language;
– Different written forms (character);
– Uttered in standard Chinese manner;
– For LM adaptation/modification
e-Dictionary Word Examples
Word No. | Word | Pronunciation in PTH | Pronunciation in Wu Dialect
1644 | 本金 | ben3 jin1 | b en3 j in1 (unchanged)
1646 | 本科 | ben3 ke1 | b en3 k u1 (Final changed only)
1652 | 本领 | ben3 ling3 | b en3 l in2 (Final & tone)
1656 | 本末倒置 | ben3 mo4 dao4 zhi4 | b en3 m ek5 d o^3 z ii3 (Entering Sound, Final change, CI Initial change, CD Final change)
1659 | 本票 | ben3 piao4 | b en3 p voe3 (Final & tone changes)
1660 | 本期 | ben3 qi1 | b en3 jj i2 (Voiced Initial, tone change)
1661 | 本钱 | ben3 qian2 | b en3 jj i2 (1660 and 1661: different in PTH, same in Wu)
1662 | 本人 | ben3 ren2 | b en3 n in2 (CD Initial & Final change)
Post-workshop Database Collection -- Min and Chuan
* With the aid of the Chinese Academy of Social Sciences (CASS)
Name: Min-dialectal Chinese database
Dialectal accent: Xiamen city, Fujian province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 18-30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final
Name: Chuan-dialectal Chinese database
Dialectal accent: Chengdu city, Sichuan province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 20-30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final
Accent distribution for Min/Chuan-dialectal Chinese corpora
Workshop Experiments
Experiment conditions:
Using HTK 3.2.1;
Data set division:
- Using spontaneous speech data only
- Data were split according to age (younger, older), education (higher, lower), and PTH level into
  – Training Set: 80 speakers
  – devTest Set: 20 speakers (a part of devTrain)
  – Test Set: 20 speakers
Acoustic model:
- Trained from Mandarin Broadcast News (MBN);
- 39-dimensional MFCC_E_D_A_Z;
- diagonal covariance matrix;
- 4 states per unit;
- 103,041 units (triIF), 10,641 real units (triIF);
- 3,063 different states (after state tying);
- 16 mixtures per state, 28 mixtures per state for the silence unit;
Language model:
- Built on HKUST 100 hour CTS data, plus Hub5, plus Wu-Dialectal Training Data Transcriptions
Observation on WDC Data
IF-mapping / Syllable-mapping:
– Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces any of a certain set of IFs as another IF, and there are rules to follow, such as zh -> z, ch -> c, sh -> s, and so on.
Observations on three sets, Train (80 speakers), devTest (20), and Test (20):
– Mapping pairs are almost the same among all three sets;
– Mapping pairs are almost identical to experts' knowledge;
– Mapping probabilities are also almost equal;
Remarks:
– Experts' knowledge could be useful;
– Mapping rules can be learned from a small amount of data.
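Mapping pairs and their probabilities can be estimated by counting aligned (canonical IF, surface IF) pairs, for example from the canonical IF labels and the GIF labels of a transcribed set. The sketch below assumes the alignment has already been done; the function name `mapping_probabilities` and the toy pairs are illustrative, not data from the workshop.

```python
from collections import Counter, defaultdict


def mapping_probabilities(aligned_pairs):
    """Estimate P(surface IF | canonical IF) by relative frequency."""
    counts = defaultdict(Counter)
    for canonical, surface in aligned_pairs:
        counts[canonical][surface] += 1
    return {
        canonical: {s: n / sum(ctr.values()) for s, n in ctr.items()}
        for canonical, ctr in counts.items()
    }


# Toy alignment (hypothetical): "zh" is realized as "z" two times out of three.
pairs = [("zh", "z"), ("zh", "z"), ("zh", "zh"),
         ("sh", "s"), ("sh", "sh"), ("an", "an")]
probs = mapping_probabilities(pairs)
print(probs["zh"])
```

Running the same estimate on the Train, devTest, and Test sets and comparing the resulting tables is one way to check the observation that the mapping probabilities are almost equal across sets.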
Using only devTest set + dialect-based knowledge
Step 1: Apply PTH-IF mapping rules;
Step 2: Apply WDC-IF mapping rules;
Step 3: Apply syllable-dependent mapping rules;
Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.
Why try this method?
- "IF-mapping" in dialectal Chinese is a fact (humans use it);
- "In-domain data training" will surely give a good result, but collecting data is a huge task, especially for the 40 sub-dialects of Chinese;
- "Mere adaptation" will be easier and better, but might make it hard to distinguish the mapping pairs, since each pair tends to become a single IF;
- This is not practical in applications such as call centers, where no further information about the speakers is available and a mixture of WDC and PTH is used;
- It is expected that a knowledge-based method would result in good overall performance for both WDC and PTH.
Step 1: Applying PTH-IF mapping rules
Rules are based on experts' knowledge (with AM unchanged): (zh, z), (z, zh), (ch, c), (c, ch), (sh, s), (s, sh), (eng, en), (en, eng), (ing, in), (in, ing), (r, l)
Gain is not significant: 0.5% Chinese Character Error Rate (CER) reduction
Pronunciation entry probabilities do not help improve performance
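With the acoustic model unchanged, rules like these act on the lexicon: each canonical IF sequence gains alternative pronunciations. A minimal sketch, assuming a pronunciation is a list of IF symbols and that every rule may fire independently at each position (the function name `expand_pronunciation` is hypothetical):

```python
from itertools import product

# The expert rule pairs listed above, as (canonical IF, alternative IF).
PTH_IF_RULES = [("zh", "z"), ("z", "zh"), ("ch", "c"), ("c", "ch"),
                ("sh", "s"), ("s", "sh"), ("eng", "en"), ("en", "eng"),
                ("ing", "in"), ("in", "ing"), ("r", "l")]

ALTERNATIVES = {}
for src, dst in PTH_IF_RULES:
    ALTERNATIVES.setdefault(src, []).append(dst)


def expand_pronunciation(ifs):
    """Return all pronunciation variants of an IF sequence, canonical first."""
    options = [[unit] + ALTERNATIVES.get(unit, []) for unit in ifs]
    return [list(variant) for variant in product(*options)]


for variant in expand_pronunciation(["sh", "eng"]):
    print(" ".join(variant))
```

The number of variants grows multiplicatively with the number of affected IFs in a word, which is one reason to restrict expansion to a subset of words, as Step 4 does.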
Step 2: Applying WDC-IF mapping rules
There are indeed some Wu dialectal Chinese specific IFs, such as iao -> io^;
Rules learned from devTest
Newly introduced WDC-specific IFs trained from devTest using an adaptation method
8.66% absolute CER reduction
MLLR adaptation outperforms MLLR+MAP
– About 10% difference
– Possibly due to less data
We refer to this as surface-form (WDC) MLLR adaptation; for comparison, base-form (PTH) MLLR adaptation, where only canonical IFs are used, is also evaluated.
Step 3: Apply syllable-dependent mapping rules
Assumption: most IF-mappings are context-independent, but some are syllable-dependent (such as iii|(sh iii) -> ii|(s ii)), and we believe there are others
Rules learned from devTest
We did not succeed in improving the accuracy; on the contrary, the character accuracy dropped by about 6%
We do not have a clear explanation yet
So we keep using context-free mapping rules
Step 4: Multi-pronunciation expansion (MPE) based on unigram probability
Motivation: more pronunciations help model pronunciation variations but lead to more confusion, so there should be a tradeoff;
Accumulated unigram probability (AccProb) used as the criterion
Only words with higher unigram probabilities will have multiple pronunciations each;
Words with lower unigram probabilities will have a single standard pronunciation each;
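The AccProb criterion can be sketched as follows: sort words by unigram probability, accumulate, and expand only the words encountered before the accumulated probability reaches the chosen value. The function name and the toy vocabulary are assumptions for illustration; ties between equally probable words around the cut-off are resolved here simply by sort order.

```python
def select_multi_pron_words(unigram_probs, acc_prob):
    """Return the words that receive multiple pronunciations.

    unigram_probs: dict mapping word -> unigram probability.
    acc_prob: accumulated-probability cut-off in [0, 1];
              0 selects no words, values above the total mass select all.
    """
    ranked = sorted(unigram_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected, accumulated = set(), 0.0
    for word, prob in ranked:
        if accumulated >= acc_prob:
            break
        selected.add(word)
        accumulated += prob
    return selected


# Toy unigram distribution (hypothetical values).
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(select_multi_pron_words(toy, 0.7)))
```

Words outside the returned set keep their single standard pronunciation, so the lexicon grows only where frequent words justify the added confusability.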
The Multi-Pronunciation Expansion Criterion
Word | Prob. (descending) | Acc. Prob.
(start) | | 0.000
</s> | 0.10782136 | 0.108
的 | 0.03608752 | 0.144
你 | 0.02161165 | 0.194
是 | 0.01907339 | 0.213
…
标准 | 0.00005742 | 0.899 (Actual minimum)
… | 0.00005742 |
团 | 0.00005742 | 0.900 (Desired point)
… | 0.00005742 |
最多 | 0.00005742 | 0.901 (Actual maximum)
…
鲫鱼 | 0.00000124 | 1.000-
黛 | 0.00000124 | 1.000-
Words above the cut-off point receive multi-pronunciation expansion; words below it keep a single (standard) pronunciation.
Base-form MLLR + PTH-IF mapping + MPE (CER)
AccProb | VocSizeRatio | CER-B
0% | 1.00 | 63.9
80% | 1.01 | 62.98
90% | 1.05 | 62.95
92% | 1.07 | 62.97
94% | 1.10 | 63.07
96% | 1.15 | 63.15
100% | 1.87 | 63.55
AccProb: 0% means no multiple pronunciation expansion, while 100% means full expansion;
Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.10
[Figure: CER-B and VocSizeRatio plotted against AccProb]
Surface-form MLLR + WDC-IF mapping + MPE (CER)
AccProb | VocSizeRatio | CER-S
0% | 1.00 | 65.47
80% | 1.04 | 62.32
90% | 1.12 | 62.23
92% | 1.17 | 62.29
94% | 1.24 | 62.15
96% | 1.35 | 62.38
100% | 3.03 | 63.77
AccProb: 0% means no multiple pronunciation expansion, while 100% means full expansion;
Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24
[Figure: CER-S and VocSizeRatio plotted against AccProb]
[Figure: CER against AccProb (0%, 80%, 90%, 92%, 94%, 96%, 100%) for Base-form MLLR + PTH-IF mapping + MPE and for Surface-form MLLR + WDC-IF mapping + MPE]
AccProb: 0% means no multiple pronunciation expansion, while 100% means full expansion;
Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24
[Figure: Performance improvement comparison, overall and in terms of speaker clusters. CER% for the methods Baseline, PTH-Mapping, WDC-Mapping, and MPE, shown for the clusters AO, AY, GM, GF, EL, EH, MA, MS, and Total]
Q: How about recognizing PTH using the resulting WDC recognizer?
We obtain the WDC recognizer from the PTH recognizer;
We get a CER reduction of over 10% on average when recognizing WDC;
How about using it to recognize PTH?
[Figure: the IFs sh and s under Adaptation (Conventional Method) versus Rule+MPE (Our method)]
We can expect that performance will degrade when the WDC recognizer is used to recognize PTH;
But we would expect that it will not degrade too much;
Results: using the WDC recognizer, we get
- over 10% CER reduction when recognizing WDC;
- 0.62% CER increase when recognizing PTH.
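For reference, the character error rate used throughout these results is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below is a generic implementation of that standard definition, not the scoring tool used in the workshop; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against a non-empty reference string."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of six.
print(cer("我做饭给你吃", "我烧饭给你吃"))
```

Because insertions count as errors, CER can exceed 100% on badly recognized utterances.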
Conclusions:
Using dialect-related knowledge is effective.
In this project, there are several problems to solve: channel, speaking style, dialect background, and domain.
It is easier to address all of these problems at once by simply using the adaptation method;
Our method focuses only on the dialect problem;
The results of our method could be better if we integrated methods that address channel and speaking style.
State-Dependent Phoneme-Based Model Merging (SDPBMM)
At the acoustic level, approaches include:
- Retraining the AM based on the standard speech and a certain amount of dialectal speech
- Interpolation between standard speech-based HMMs and their corresponding dialectal speech-based HMMs
- Combination of AM with state-level pronunciation modeling
- Adaptation with a certain amount of dialectal speech based on the standard speech-based AM
Existing problems:
- A large amount of dialectal speech is needed to build dialect-specific acoustic models
- The acoustic model cannot demonstrate good performance in standard speech as well as dialectal speech recognition
- Some acoustic modeling methods are too complicated to be deployed readily
What we proposed:
Taking a precise context-dependent HMM from the standard speech and its corresponding less precise context-independent HMM from dialectal speech into consideration simultaneously
Merging HMMs on a state-level basis according to certain criteria
[Figure: Illustration for SDPBMM. Decision tree for the Standard Chinese Tri-XIF state *-an+*[2], with questions L_Stop?, R_Nasal?, L_Bilabial?, L_Labial? and leaf states such as b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], b-an+m[2], alongside the Dialectal Chinese Mono-XIF states an[2] / ang[2]]
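pdf for merged state:
$$
p'\left(x \mid s_i^{(sc)}\right) = \lambda\, p\left(x \mid s_i^{(sc)}\right) + (1-\lambda) \sum_{m=1}^{M} p\left(x \mid s_{im}^{(dc)}\right) P\left(s_{im}^{(dc)} \mid s_i^{(sc)}\right),
$$
$$
p(x \mid s_i) = \sum_{k=1}^{K} w_{ik}\, N(x; \mu_{ik}, \Sigma_{ik}),
$$
$$
p'\left(x \mid s_i^{(sc)}\right) = \lambda \sum_{k=1}^{K} w_{ik}^{(sc)} N_{ik}^{(sc)}(\cdot) + (1-\lambda) \sum_{m=1}^{M} \sum_{n=1}^{N} P\left(s_{im}^{(dc)} \mid s_i^{(sc)}\right) w_{imn}^{(dc)} N_{imn}^{(dc)}(\cdot)
$$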
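In terms of data structures, the merged-state pdf is a larger Gaussian mixture whose components are the standard-Chinese components scaled by an interpolation weight plus the dialectal components scaled by the complementary weight and by the state-level mapping probability. The sketch below assumes a mixture is a list of (weight, Gaussian identifier) pairs; the names `merge_state` and `lam` and the toy values are illustrative only.

```python
def merge_state(sc_mixture, dc_mixtures, lam):
    """Merge one standard-Chinese state with its dialectal counterparts.

    sc_mixture:  [(weight, gaussian_id), ...] for the standard-Chinese state.
    dc_mixtures: [(prior, [(weight, gaussian_id), ...]), ...], one entry per
                 dialectal state, where prior = P(dialectal state | sc state).
    lam:         interpolation weight in [0, 1] (assumed name).
    """
    merged = [(lam * w, g) for w, g in sc_mixture]
    for prior, mixture in dc_mixtures:
        merged.extend(((1.0 - lam) * prior * w, g) for w, g in mixture)
    return merged


# Toy example: one tri-XIF state merged with mono-XIF states an[2] and ang[2].
sc = [(0.6, "sc_g0"), (0.4, "sc_g1")]
dc = [(0.7, [(0.5, "an_g0"), (0.5, "an_g1")]),
      (0.3, [(1.0, "ang_g0")])]
merged = merge_state(sc, dc, lam=0.5)
print(len(merged), sum(w for w, _ in merged))
```

If the input weights and the mapping probabilities each sum to one, the merged weights also sum to one, so the result is again a valid mixture; the cost is the larger number of Gaussians per state.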
The disadvantage seen so far: the number of Gaussian mixtures in the merged state is expanded.
Is it possible to reduce this number? A straightforward criterion is a distance measure:
- The larger the distance, the greater the acoustic coverage;
- merging, if distance(d, s) ≥ threshold;
- no merging, if distance(d, s) < threshold.
Pseudo-divergence (PD) based distance measure between two states is defined as follows:
$$
distance(A, B) = \frac{PD(A, B) + PD(B, A)}{2}, \qquad PD(P, Q) = \frac{Dispersion(P, Q)}{Dispersion(P, P)} - 1,
$$
where
$$
Dispersion(P, Q) = \sum_{i=1}^{M} \sum_{j=1}^{N} w_{Pi}\, w_{Qj}\, d(P_i, Q_j),
$$
and
$$
d(P, Q) = \frac{1}{8} (\mu_P - \mu_Q)^T \left[\frac{\Sigma_P + \Sigma_Q}{2}\right]^{-1} (\mu_P - \mu_Q) + \frac{1}{2} \ln \frac{\left|(\Sigma_P + \Sigma_Q)/2\right|}{|\Sigma_P|^{1/2}\, |\Sigma_Q|^{1/2}}
$$
is the Bhattacharyya distance measure.
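The two building blocks of this measure, the Bhattacharyya distance between Gaussians and the weighted dispersion between two mixture states, can be sketched in plain Python. The code assumes diagonal covariance matrices (as used for the acoustic models in this talk) and represents a state as a list of (weight, means, variances) components; the function names are illustrative, and the final PD normalization is deliberately left out.

```python
import math


def bhattacharyya(mu_p, var_p, mu_q, var_q):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    quad, log_det = 0.0, 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        v = 0.5 * (vp + vq)                      # averaged variance
        quad += (mp - mq) ** 2 / v               # Mahalanobis-like term
        log_det += math.log(v) - 0.5 * (math.log(vp) + math.log(vq))
    return quad / 8.0 + 0.5 * log_det


def dispersion(state_p, state_q):
    """Weighted sum of component-to-component distances between two states."""
    return sum(wp * wq * bhattacharyya(mp, vp, mq, vq)
               for wp, mp, vp in state_p
               for wq, mq, vq in state_q)


# Toy one-dimensional states (hypothetical values).
p = [(1.0, [0.0], [1.0])]
q = [(0.5, [2.0], [1.0]), (0.5, [0.0], [1.0])]
print(dispersion(p, q))
```

Working with per-dimension variances avoids any matrix inversion, since the determinant of a diagonal matrix is the product of its entries.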
[Figure: Distinguishable states. Decision tree for the Standard Chinese Tri-XIF state *-an+*[2] (questions L_Stop?, R_Nasal?, L_Bilabial?, L_Labial?; leaf states such as b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], b-an+m[2]) and the Dialectal Chinese Mono-XIF states an[2] / ang[2]]
Data division
Data set | Database | Details | Usage
PTH_Train | Standard Chinese | 120 speakers, 20 hours, 24,000 long sentences | To build Putonghua AM
PTH_Test | Standard Chinese | 12 speakers, 2.5 hours, 2,400 long sentences | Putonghua test set
Min_Dev | Min-dialectal Chinese | 20 speakers, 1.0 hour, 1,000 long sentences | Adaptation/SDPBMM/pronunciation modeling etc.
Min_Test | Min-dialectal Chinese | 16 speakers, 50 minutes, 800 long sentences | Dialectal Chinese test set
Wu_Dev | Wu-dialectal Chinese | 10 speakers, 40 minutes, 510 long sentences | Adaptation/SDPBMM/pronunciation modeling etc.
Wu_Test | Wu-dialectal Chinese | 20 speakers, 1.0 hour, 910 long sentences | Dialectal Chinese test set
Standard Chinese-based HMMs (Baseline)
Training set: approximately 30 hours from MBN (HUB-4); 34,493 utterances in total
Modeling method: HMM-based, decision-tree-based state-clustered cross-word tri-XIF
Topology: 3 left-to-right states per tri-XIF, 14 mixtures per state
Number of tri-XIFs: 7,411
Number of states: 3,230
Number of mixtures: 45,220
Features: 39-dimensional, MFCC-based, with CMN
Lexicon: 406 toneless Chinese syllables
Evaluations on Putonghua and Wu-dialectal Chinese
Acoustic model | Putonghua | SDPBMM+PDBDM
Gaussians | 45,220 | 58,786
SER on Wu_Test | 49.8% | 43.9% (-5.9%)
SER on PTH_Test | 30.5% | 31.1% (+0.6%)
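The table reports absolute differences. The corresponding relative changes follow directly from the same figures and can be checked with a few lines of arithmetic; the helper name `relative_change` is illustrative.

```python
def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


gaussians = relative_change(45220, 58786)   # model size
ser_wu = relative_change(49.8, 43.9)        # SER on Wu_Test
ser_pth = relative_change(30.5, 31.1)       # SER on PTH_Test

print(f"Gaussians:       {gaussians:+.1f}%")
print(f"SER on Wu_Test:  {ser_wu:+.1f}%")
print(f"SER on PTH_Test: {ser_pth:+.1f}%")
```

In relative terms, the merged model has about 30% more Gaussians than the Putonghua baseline, reduces SER on Wu_Test by about 11.8%, and increases SER on PTH_Test by about 2.0%.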
Integration of SDPBMM with adaptation
Conclusions:
- A simple but effective acoustic modeling approach using only a small amount of dialectal speech data
- Significantly effective for dialectal Chinese speech recognition
- Good performance for both standard and dialectal speech recognition
- Comparable to adaptation methods
- Additive and complementary to adaptation methods
Thanks !
http://cslt.riit.tsinghua.edu.cn/~fzheng