An Introduction to Mandarin Speech Recognition John Steinberg, Temple University
Feb 24, 2016
Speech Recognition Applications
• Mobile Phone Technology
• Translators & Prostheses
• Automotive / GPS Devices
• Intelligence Collection
Speech Recognition: Basic Process
[5]
Importance of Mandarin
[1]
Importance of Mandarin (Mandarin vs. English)
• English speakers in the world: ~350 million [11]
• Estimated number of current English learners in China: 200-350 million [12]
• Estimated number of native Mandarin speakers: 1+ billion
[2]
Importance of Mandarin
[3]
Mandarin Chinese
• Tonal language (inflection matters!)
  • 1st tone – High, constant pitch (like saying “aaah”)
  • 2nd tone – Rising pitch (“Huh?”)
  • 3rd tone – Low pitch (“ugh”)
  • 4th tone – High pitch with a rapid descent (“No!”)
  • “5th tone” – Neutral; used for de-emphasized syllables
• Characters
  • 8000+ characters compose 80k-200k common words
  • Act as morphemes
  • Are primarily monosyllabic
  • Have a single associated tone
• Coarticulation: context can cause changes in tone
  • Bu4 + Dui4 = Bu2 Dui4 (wrong)
  • Ni3 + Hao3 = Ni2 Hao3 (hello)
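The two tone changes above can be written as simple rewrite rules. The sketch below is only an illustration of those two examples, not a complete model of Mandarin tone change: the function name and the (pinyin, tone) pair representation are assumptions made here, and real systems must also handle longer third-tone sequences, which depend on phrase structure.

```python
def apply_tone_sandhi(syllables):
    """Apply two context rules to a list of (pinyin, tone) pairs.

    Rule 1: a 3rd tone followed by another 3rd tone becomes a 2nd tone.
    Rule 2: "bu" in the 4th tone followed by a 4th tone becomes a 2nd tone.
    Decisions are based on the original (citation) tone of the next syllable.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        pinyin, tone = syllables[i]
        next_tone = syllables[i + 1][1]
        if tone == 3 and next_tone == 3:
            result[i] = (pinyin, 2)
        elif pinyin == "bu" and tone == 4 and next_tone == 4:
            result[i] = (pinyin, 2)
    return result


print(apply_tone_sandhi([("ni", 3), ("hao", 3)]))
print(apply_tone_sandhi([("bu", 4), ("dui", 4)]))
```

Because the surface tone differs from the dictionary tone, a recognizer that models tone has to account for the neighboring syllables and not only the syllable itself.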
Mandarin Chinese
• Heavily contextual language
• Monosyllabic
• Relatively few syllables compared to English [3]
  • English: ~10,000 syllables
  • Mandarin: ~1300 syllables including tones (400 excluding tones)
• High number of homophones
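To make the homophone problem concrete, the sketch below groups a small hand-picked sample of characters by pronunciation, first with tone and then without. The sample table and the function name are assumptions chosen for illustration; a real lexicon would cover thousands of characters.

```python
from collections import defaultdict

# Small hand-picked sample of characters pronounced "shi" (pinyin, tone number).
READINGS = {
    "是": ("shi", 4), "事": ("shi", 4), "市": ("shi", 4),
    "十": ("shi", 2), "时": ("shi", 2),
    "师": ("shi", 1), "诗": ("shi", 1),
    "史": ("shi", 3),
}


def group_homophones(readings, use_tone=True):
    """Map each syllable (with or without its tone) to the characters sharing it."""
    groups = defaultdict(list)
    for char, (pinyin, tone) in readings.items():
        key = f"{pinyin}{tone}" if use_tone else pinyin
        groups[key].append(char)
    return dict(groups)


print(group_homophones(READINGS))                  # four toned syllables
print(group_homophones(READINGS, use_tone=False))  # one toneless syllable
```

Even with tone, several characters share each syllable; without tone, all eight collapse onto one. This is why both tone modeling and context matter for character recognition.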
Challenges in Mandarin Recognition
• Requires a highly developed language model due to the highly contextual nature of Mandarin
  • Tone modeling
  • Coarticulation
  • Large number of homophones
• Chinese text is unsegmented
  • No standard lexicon
• Chinese sentence/word structure is very flexible
  • Ex: Beijing Da Xue -> Bei Da
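Because written Chinese has no spaces between words, text must be segmented before word-level models can be trained. The sketch below uses greedy forward maximum matching, a simple baseline technique chosen here for illustration (it is not a method named in these slides). The lexicon is a hypothetical toy example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon containing both the full name and its abbreviation.
LEXICON = {"北京大学", "北京", "大学", "北大", "学生"}

print(segment("北京大学学生", LEXICON))
print(segment("北大学生", LEXICON))
```

The result depends entirely on the lexicon: if the abbreviation "北大" (Bei Da) is missing, the second input falls apart into unrelated pieces. That is one reason the lack of a standard lexicon is a practical problem.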
Modeling Methods
• Prosodic Features
  • Describe tone (question vs. statement), rhythm, and focus of speech
• Pitch Extraction
  • Yields more precise character recognition
• Stronger Language Models
  • Determine context more accurately
Prosodic Units
• Different prosodic units (labels) have been suggested [4]
  • Ex: Syllable (SYL), Prosodic Word (PW), Minor Prosodic Phrase (MIP), Major Prosodic Phrase (MAP), and Intonation Group (IG)
• Past labeling systems are primarily based on auditory perception
  • Prosodic break labeling is subjective and inconsistent
  • Auditory perception approach loses quantitative information
  • Impossible to replicate identical prosodic labels for an original speech signal
Prosodic Units
• New, more objective prosodic cues include [4]:
  • Pause duration (directly measured)
  • Segment/syllable duration (directly measured)
  • F0 reset
• F0 carries utterance-long intonation information, which must be separated from the local tones within the utterance
• Quantitative description of F0: log(F0) = log(baseline frequency) + phrase components + accent or tone components
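The additive decomposition above can be sketched numerically. The code below assumes a command-response formulation in the style of the Fujisaki model, in which phrase commands are impulses and tone commands are on/off steps, each smoothed by a second-order response. The function names and the parameter values (alpha, beta, gamma, command amplitudes and timings) are illustrative assumptions, not values from the cited work.

```python
import math


def phrase_response(t, alpha=3.0):
    """Response to a phrase (impulse) command; zero before the command onset."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t, beta=20.0, gamma=0.9):
    """Response to a tone (step) command, capped at gamma; zero before onset."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, baseline_hz, phrase_cmds, tone_cmds):
    """log F0(t) = log(baseline) + phrase components + tone components.

    phrase_cmds: list of (amplitude, onset_time)
    tone_cmds:   list of (amplitude, onset_time, offset_time)
    """
    phrase = sum(a * phrase_response(t - t0) for a, t0 in phrase_cmds)
    tone = sum(a * (tone_response(t - t1) - tone_response(t - t2))
               for a, t1, t2 in tone_cmds)
    return math.log(baseline_hz) + phrase + tone


# One phrase command at t = 0 s and one positive tone command from 0.1 s to 0.3 s.
for t in (0.0, 0.1, 0.2, 0.3, 0.4):
    f0 = math.exp(log_f0(t, 100.0, [(0.5, 0.0)], [(0.3, 0.1, 0.3)]))
    print(f"t = {t:.1f} s, F0 = {f0:.1f} Hz")
```

Working in the log domain is what makes the components additive: the slowly varying phrase contour and the fast local tone movements can then be estimated and removed separately.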
Language Models
• N-grams – 3 steps: Syllable -> Character -> Word
• Neural Networks – Better suited to high dimensionality
• Random Forests – May be able to include morphology in the language model [7]
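As a rough illustration of the syllable-to-character step, the sketch below uses a bigram model with add-one smoothing and a Viterbi search to pick the most likely characters for a syllable sequence. The candidate table, the counts, and the function names are hypothetical toy values invented for this example; a real system would estimate them from a large corpus.

```python
import math

# Hypothetical candidate characters per toned syllable.
CANDIDATES = {"ni3": ["你", "拟"], "hao3": ["好", "郝"]}

# Hypothetical bigram counts; "<s>" marks the start of the utterance.
BIGRAM_COUNTS = {
    ("<s>", "你"): 8, ("<s>", "拟"): 1,
    ("你", "好"): 9, ("你", "郝"): 1, ("拟", "好"): 1,
}
VOCAB_SIZE = 4  # number of distinct characters in this toy example


def bigram_logprob(prev, cur):
    """Add-one smoothed log P(cur | prev) from the toy counts."""
    total = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev)
    count = BIGRAM_COUNTS.get((prev, cur), 0)
    return math.log((count + 1) / (total + VOCAB_SIZE))


def decode(syllables):
    """Viterbi search: keep the best-scoring path ending in each candidate."""
    best = {"<s>": (0.0, [])}  # last character -> (log probability, path)
    for syllable in syllables:
        step = {}
        for cur in CANDIDATES[syllable]:
            step[cur] = max(
                (score + bigram_logprob(prev, cur), path + [cur])
                for prev, (score, path) in best.items()
            )
        best = step
    return "".join(max(best.values())[1])


print(decode(["ni3", "hao3"]))
```

The same search structure can be repeated one level up, with characters as input and words as output, which is how the three-step chain can be read.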
Timeline
• 1980’s: Individual syllable recognition research begins
• 1986: 863 program begins funding collection of speech corpuses in China
• 1992: LDC is founded
• 1993: Golden Mandarin is developed
  • 1st speaker dependent dictation system for Mandarin
  • Single syllable recognizer
  • Designed for typing Mandarin
  • 8% CER [10]
• 1994: Golden Mandarin (II) yields 5% CER on word based speaker independent system [11]
• 1995: Golden Mandarin (III) [12]
  • Prosodic unit based
  • User independent
  • 10% CER
  • Dictation system
• Early 2000’s: Research in prosodic segmentation, tone modeling, and new language models for CTS yields ~40% CER
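The CER (character error rate) figures in the timeline are not defined on the slides. A common definition, assumed here, is the character-level edit distance between the recognizer output and the reference transcript, divided by the reference length. The function name below is chosen for illustration.

```python
def character_error_rate(reference, hypothesis):
    """Edit distance (substitutions, insertions, deletions) / reference length.

    Assumes a non-empty reference string.
    """
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(reference)


# One substituted character out of three.
print(character_error_rate("你好吗", "你号吗"))
```

Scoring by character avoids the word segmentation problem entirely, which is one reason CER is a natural metric for Mandarin.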
Benchmark History [8]
Recent Experimentation
Broadcast News and Conversational Telephone Speech [9]
Future Studies
• Continue studying current baseline systems/data sets
• Further investigate possible language models
• Compare effectiveness of prosodic features
Questions?
References
[1] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007
[2] Wikipedia
[3] W. Gu, K. Hirose, H. Fujisaki, “Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech”, ISCSLP, 2006
[4] J. Picone, A. Harati, "Why Study Engineering at Temple?," Temple University College of Engineering Open House, October 9, 2010
[5] Lee, C-H. “Advances in Chinese spoken language processing”, World Scientific Publishing Co., Singapore, 2007
[6] F.H. Liu, M. Picheny, P. Srinivasa, M. Monkowski, et al, “Speech Recognition on Mandarin Call Home: A Large Vocabulary, Conversational, and Telephone Speech Corpus” ICASSP, 1996
[7] I. Oparin, L. Lamel, J. Gauvain, “Improving Mandarin Chinese STT system with Random Forests language models”, IEEE Xplore, 2010
[8] “The History of Automatic Speech Recognition Evaluations at NIST,” 2009, http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html
[9] Schwartz, R.; Colthurst, T.; Duta, N.; Gish, H.; Iyer, R.; Kao, C.-L.; Liu, D.; Kimball, O.; Ma, J.; Makhoul, J.; Matsoukas, S.; Nguyen, L.; Noamany, M.; Prasad, R.; Xiang, B.; Xu, D.-X.; Gauvain, J.-L.; Lamel, L.; Schwenk, H.; Adda, G.; Chen, L., "Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system," Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, vol. 3, pp. iii-753-6, 17-21 May 2004, doi: 10.1109/ICASSP.2004.1326654
[10] Lee, L.S.; Tseng, C.Y.; Gu, H.Y.; Liu, F.H.; Chang, C.H.; Lin, Y.H.; Lee, Y.; Tu, S.L.; Hsieh, S.H.; Chen, C.H., "Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary," Speech and Audio Processing, IEEE Transactions on, vol. 1, no. 2, pp. 158-179, Apr 1993
[11] Lin-Shan Lee; Keh-Jiann Chen; Chiu-Yu Tseng; Renyuan Lyu; Lee-Feng Chien; Hsin-Min Wang; Jia-Lin Shen; Sung-Chien Lin; Yen-Ju Yang; Bo-Ren Bai; Chi-Ping Nee; Chun-Yi Liao; Shueh-Sheng Lin; Chung-Shu Yang; I-Jung Hung; Ming-Yu Lee; Rei-Chang Wang; Bo-Shen Lin; Yuan-Cheng Chang; Rung-Chiung Yang; Yung-Chi Huang; Chen-Yuan Lou; Tung-Sheng Lin, "Golden Mandarin(II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions," Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN '94., 1994 International Symposium on, vol. 1, pp. 155-159, 13-16 Apr 1994
[12] Ren-Yuan Lyu; Lee-Feng Chien; Shiao-Hong Hwang; Hung-Yun Hsieh; Rung-Chiuan Yang; Bo-Ren Bai; Jia-Chi Weng; Yen-Ju Yang; Shi-Wei Lin; Keh-Jiann Chen; Chiu-Yu Tseng; Lin-Shan Lee, "Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary," Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, vol. 1, pp. 57-60, 9-12 May 1995