An Introduction to Mandarin Speech Recognition

John Steinberg, Temple University

Speech Recognition Applications

• Mobile Phone Technology
• Translators & Prostheses
• Automotive / GPS Devices
• Intelligence Collection

Speech Recognition: Basic Process

[5]

Importance of Mandarin

[1]

Legend: Mandarin, English

Importance of Mandarin

• English speakers in the world: ~350 million [11]
• Estimated number of current English learners in China: 200-350 million [12]
• Estimated number of native Mandarin speakers: 1+ billion

[2]


[3]


[3]

Mandarin Chinese

Tonal language (inflection matters!)
• 1st tone: high, constant pitch (like saying “aaah”)
• 2nd tone: rising pitch (“Huh?”)
• 3rd tone: low pitch (“ugh”)
• 4th tone: high pitch with a rapid descent (“No!”)
• “5th tone”: neutral, used for de-emphasized syllables

Characters
• 8000+ characters compose 80k-200k common words
• Act as morphemes
• Are primarily monosyllabic
• Have a single associated tone

Coarticulation: context can cause changes in tone
• Bu4 + Dui4 = Bu2 Dui4 (“wrong”)
• Ni3 + Hao3 = Ni2 Hao3 (“hello”)
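The two tone changes above (usually described as tone sandhi) can be written as simple rewrite rules over numbered pinyin. The sketch below is a minimal illustration, not a complete sandhi model: the function names are hypothetical, only these two rules are covered, and each syllable is checked against the underlying tone of the syllable that follows it.

```python
import re

def split_syllable(syl):
    """Split numbered pinyin such as 'ni3' into ('ni', 3)."""
    m = re.fullmatch(r"([A-Za-z]+)([1-5])", syl)
    if m is None:
        raise ValueError(f"expected pinyin with a tone digit, got {syl!r}")
    return m.group(1).lower(), int(m.group(2))

def apply_sandhi(syllables):
    """Apply two common tone-change rules to a list of syllables.

    Rule 1: a 3rd tone before another 3rd tone becomes a 2nd tone.
    Rule 2: 'bu4' before a 4th tone becomes 'bu2'.
    """
    parsed = [split_syllable(s) for s in syllables]
    out = []
    for i, (base, tone) in enumerate(parsed):
        # Underlying tone of the following syllable, if there is one.
        next_tone = parsed[i + 1][1] if i + 1 < len(parsed) else None
        if tone == 3 and next_tone == 3:
            tone = 2
        elif base == "bu" and tone == 4 and next_tone == 4:
            tone = 2
        out.append(f"{base}{tone}")
    return out

print(apply_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
print(apply_sandhi(["bu4", "dui4"]))  # ['bu2', 'dui4']
```

A recognizer has to undo this mapping: the acoustics carry the surface tone (Ni2), while the lexicon stores the underlying one (Ni3).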

Mandarin Chinese

• Heavily contextual language
• Monosyllabic
• Relatively few syllables compared to English [3]
    • English: ~10,000 syllables
    • Mandarin: ~1300 syllables including tones (400 excluding)
• High number of homophones

Challenges in Mandarin Recognition

• Requires a highly developed language model due to the highly contextual nature of Mandarin
• Tone modeling
• Coarticulation
• Large number of homophones
• Chinese text is unsegmented
• No standard lexicon
• Chinese sentence/word structure is very flexible (Ex: Beijing Da Xue -> Bei Da)
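To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based segmenter. The slides do not name this technique; it is used here only for illustration, and the function name and toy lexicons are assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    that starts at the current position, falling back to one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

# Two hypothetical lexicons give two different segmentations of the same text.
print(forward_max_match("北京大学", {"北京", "大学"}))
print(forward_max_match("北京大学", {"北京", "大学", "北京大学"}))
```

The two calls return different word sequences for the same input, which is why the lack of a standard lexicon matters for word-based language models.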

Modeling Methods

• Prosodic Features: describe tone (question vs. statement), rhythm, and focus of speech
• Pitch Extraction: yields more precise character recognition
• Stronger Language Models: determine context more accurately

Prosodic Units

• Different prosodic units (labels) have been suggested [4]
    • Ex: Syllable (SYL), Prosodic Word (PW), Minor Prosodic Phrase (MIP), Major Prosodic Phrase (MAP), and Intonation Group (IG)
• Past labeling systems are primarily based on auditory perception
    • Prosodic break labeling is subjective and inconsistent
    • Auditory perception approach loses quantitative information
    • Impossible to replicate identical prosodic labels for an original speech signal

Prosodic Units

• New, more objective prosodic cues include [4]:
    • Pause duration (directly measured)
    • Segment/syllable duration (directly measured)
    • F0 reset
• F0 carries utterance-long intonation information, which must be separated from the tones of the individual syllables within the utterance
• Quantitative description: log F0 = phrase components + accent or tone components + log(baseline frequency)
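The additive description of F0 above (phrase components, accent or tone components, and a log baseline) can be sketched numerically, assuming the sum is taken in the log-F0 domain as in the Fujisaki command-response model. The response shapes follow that model's usual form; the constants `alpha`, `beta`, `gamma` and the example commands are illustrative assumptions, not values from the slides.

```python
import math

def phrase_response(t, alpha=3.0):
    """Response to a phrase command: slow rise and decay, zero before onset."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta=20.0, gamma=0.9):
    """Response to an accent/tone command onset: fast rise, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, base_hz, phrase_cmds, tone_cmds):
    """log F0(t) = log(baseline) + phrase components + tone components.

    phrase_cmds: list of (amplitude, onset_time)
    tone_cmds:   list of (amplitude, start_time, end_time); amplitudes may be
                 negative, which is how falling tone targets are expressed.
    """
    phrase = sum(a * phrase_response(t - t0) for a, t0 in phrase_cmds)
    tone = sum(a * (tone_response(t - t1) - tone_response(t - t2))
               for a, t1, t2 in tone_cmds)
    return math.log(base_hz) + phrase + tone

# One phrase command and two tone commands over one second, sampled at 100 Hz.
contour_hz = [
    math.exp(log_f0(0.01 * i, 120.0, [(0.5, 0.0)],
                    [(0.4, 0.2, 0.4), (-0.3, 0.5, 0.7)]))
    for i in range(100)
]
```

Separating the slow phrase term from the fast tone term is what lets utterance-level intonation be modeled apart from the syllable tones.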

Language Models

• N-grams: 3 steps: Syllable -> Character -> Word
• Neural Networks: better suited to high dimensionality
• Random Forests: may be able to include morphology into language model [7]
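As a rough illustration of the syllable-to-character step, the sketch below uses a bigram model to choose among homophone characters with a Viterbi search. Everything in it is a toy assumption: the candidate table, the bigram counts, the add-one smoothing, and the function names are hypothetical and not taken from any system cited here.

```python
import math

# Hypothetical toy data: candidate characters (homophones) per toned syllable.
CANDIDATES = {
    "ni3": ["你", "拟"],
    "hao3": ["好", "郝"],
}
# Hypothetical bigram counts; "<s>" marks the start of the sentence.
BIGRAM_COUNTS = {
    ("<s>", "你"): 8, ("<s>", "拟"): 1,
    ("你", "好"): 9, ("你", "郝"): 1,
    ("拟", "好"): 1, ("拟", "郝"): 1,
}

def bigram_logprob(prev, cur, vocab_size):
    """Add-one smoothed log P(cur | prev) from the toy counts."""
    num = BIGRAM_COUNTS.get((prev, cur), 0) + 1
    den = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev) + vocab_size
    return math.log(num / den)

def decode(syllables):
    """Viterbi search for the most likely character sequence."""
    vocab_size = len({ch for chars in CANDIDATES.values() for ch in chars})
    # best[ch] = (log-probability, path) of the best path ending in ch
    best = {"<s>": (0.0, [])}
    for syl in syllables:
        new_best = {}
        for ch in CANDIDATES[syl]:
            new_best[ch] = max(
                (score + bigram_logprob(prev, ch, vocab_size), path + [ch])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return "".join(max(best.values())[1])

print(decode(["ni3", "hao3"]))  # 你好
```

With only about 1300 toned syllables mapping to thousands of characters, this context term is what resolves most of the ambiguity.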

Timeline

1980s: Individual syllable recognition research begins

1986: 863 Program begins funding collection of speech corpora in China

1992: LDC is founded

1993: Golden Mandarin is developed
• 1st speaker-dependent dictation system for Mandarin
• Single-syllable recognizer
• Designed for typing Mandarin
• 8% CER [10]

1994: Golden Mandarin (II) yields 5% CER on a word-based, speaker-independent system [11]

1995: Golden Mandarin (III) [12]
• Prosodic unit based
• User independent
• 10% CER
• Dictation system

Early 2000s: Research in prosodic segmentation, tone modeling, and new language models for CTS yields ~40% CER
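The CER figures in the timeline are presumably character error rates: the number of character substitutions, deletions, and insertions needed to turn the recognizer output into the reference, divided by the reference length. A minimal sketch of that computation follows; the function names are hypothetical, and real evaluations normally use a standard scoring tool rather than code like this.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # reference character deleted
                cur[j - 1] + 1,          # extra character inserted
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# One wrong character out of four (a homophone error) gives 25% CER.
print(cer("北京大学", "北京大雪"))  # 0.25
```

Because insertions are counted, CER can exceed 100% on very noisy output.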

Benchmark History [8]

Recent Experimentation

Broadcast News and Conversational Telephone Speech [9]

Future Studies

• Continue studying current baseline systems/data sets
• Further investigate possible language models
• Compare effectiveness of prosodic features

Questions?

References

[1] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007

[2] Wikipedia

[3] W. Gu, K. Hirose, H. Fujisaki, “Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech”, ISCSLP, 2006

[4] J. Picone, A. Harati, “Why Study Engineering at Temple?,” Temple University College of Engineering Open House, October 9, 2010, http://www.isip.piconepress.com/publications/seminars/external/2010/coe_open_house_04/

[5] Lee, C-H., “Advances in Chinese spoken language processing”, World Scientific Publishing Co., Singapore, 2007

[6] F.H. Liu, M. Picheny, P. Srinivasa, M. Monkowski, et al, “Speech Recognition on Mandarin Call Home: A Large Vocabulary, Conversational, and Telephone Speech Corpus”, ICASSP, 1996

[7] I. Oparin, L. Lamel, J. Gauvain, “Improving Mandarin Chinese STT system with Random Forests language models”, IEEE Xplore, 2010

[8] “The History of Automatic Speech Recognition Evaluations at NIST,” 2009, http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html

[9] Schwartz, R.; Colthurst, T.; Duta, N.; Gish, H.; Iyer, R.; Kao, C.-L.; Liu, D.; Kimball, O.; Ma, J.; Makhoul, J.; Matsoukas, S.; Nguyen, L.; Noamany, M.; Prasad, R.; Xiang, B.; Xu, D.-X.; Gauvain, J.-L.; Lamel, L.; Schwenk, H.; Adda, G.; Chen, L., “Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system,” Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, vol. 3, pp. iii-753-6, 17-21 May 2004, doi: 10.1109/ICASSP.2004.1326654

[10] Lee, L.S.; Tseng, C.Y.; Gu, H.Y.; Liu, F.H.; Chang, C.H.; Lin, Y.H.; Lee, Y.; Tu, S.L.; Hsieh, S.H.; Chen, C.H., “Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary,” Speech and Audio Processing, IEEE Transactions on, vol. 1, no. 2, pp. 158-179, Apr 1993

[11] Lin-Shan Lee; Keh-Jiann Chen; Chiu-Yu Tseng; Renyuan Lyu; Lee-Feng Chien; Hsin-Min Wang; Jia-Lin Shen; Sung-Chien Lin; Yen-Ju Yang; Bo-Ren Bai; Chi-Ping Nee; Chun-Yi Liao; Shueh-Sheng Lin; Chung-Shu Yang; I-Jung Hung; Ming-Yu Lee; Rei-Chang Wang; Bo-Shen Lin; Yuan-Cheng Chang; Rung-Chiung Yang; Yung-Chi Huang; Chen-Yuan Lou; Tung-Sheng Lin, “Golden Mandarin(II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions,” Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN '94., 1994 International Symposium on, vol. 1, pp. 155-159, 13-16 Apr 1994

[12] Ren-Yuan Lyu; Lee-Feng Chien; Shiao-Hong Hwang; Hung-Yun Hsieh; Rung-Chiuan Yang; Bo-Ren Bai; Jia-Chi Weng; Yen-Ju Yang; Shi-Wei Lin; Keh-Jiann Chen; Chiu-Yu Tseng; Lin-Shan Lee, “Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary,” Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, vol. 1, pp. 57-60, 9-12 May 1995
