International Journal of Computer Science and Applications c Technomathematics Research Foundation Vol. 10 No. 1, pp. 31 - 45, 2013 Voice-driven Computer Game in Noisy Environments ARTUR JANICKI Institute of Telecommunications, Warsaw University of Technology ul. Nowowiejska 15/19, 00-665 Warsaw, Poland. [email protected]DARIUSZ WAWER Institute of Telecommunications, Warsaw University of Technology ul. Nowowiejska 15/19, 00-665 Warsaw, Poland. [email protected]The paper describes the performance of a task-oriented continuous automatic speech recognition (ASR) system in the computer game interface in noisy conditions. First, the process of designing the ASR system for Polish, based on CMU Sphinx4, is presented. Then, the concept of the computer game called Rally Navigator is described. The exper- iments were first run for the clean speech, and then repeated with added environmental noise with various signal-to-noise (SNR) ratios. Results of experiments with clean speech show that as little as 15 minutes of audio material is enough to produce a highly effec- tive single-speaker command-and-control ASR system for the computer game, providing the sentence recognition accuracy of 97.6%. Results of the tests under noisy conditions show that minor degradation of performance was observed in car environment, however, accuracy decreased severely for babble and factory noises for SNR below 20 dB. Keywords : speech recognition; voice interface; computer game; noise. 1. Introduction Automatic speech recognition (ASR) systems become more and more popular as an input mechanism in various applications - especially in word processors where dictation software is being introduced. But there are also trials being conducted to replace pads and keyboards as computer games controlling devices, thus mak- ing the games more interesting and enabling multimodal input. ASR systems can be successfully utilized in computer games, under the condition that they provide satisfactory recognition accuracy and short processing time. To ensure realistic conditions, an ASR system in the computer game should be able to recognize continuous speech, which is how people usually talk. Such speech is much more difficult to recognize than isolated words. Apart from a few cases, for example [Ziolko et al. (2008)] and [Szymanski et al. (2008)], continuous speech recognition systems barely exist for the Polish language, which is highly inflective and thus difficult to recognize. Recognizing speech in Polish under noisy conditions 31
15
Embed
Voice-driven Computer Game in Noisy Environments · Automatic Speech Recognition in Voice Interfaces ... (e.g. whether a noun is a subject or an object) ... Voice-driven Computer
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
The paper describes the performance of a task-oriented continuous automatic speechrecognition (ASR) system in the computer game interface in noisy conditions. First, the
process of designing the ASR system for Polish, based on CMU Sphinx4, is presented.
Then, the concept of the computer game called Rally Navigator is described. The exper-iments were first run for the clean speech, and then repeated with added environmental
noise with various signal-to-noise (SNR) ratios. Results of experiments with clean speech
show that as little as 15 minutes of audio material is enough to produce a highly effec-tive single-speaker command-and-control ASR system for the computer game, providing
the sentence recognition accuracy of 97.6%. Results of the tests under noisy conditions
show that minor degradation of performance was observed in car environment, however,accuracy decreased severely for babble and factory noises for SNR below 20 dB.
Most of the phonemes (plosives, fricatives and affricates) exist in pairs: unvoiced-
voiced, e.g. [tS] - [dZ].
The Polish grammar is complex. It is highly inflective - nouns are inflected
according to seven cases and two numbers, verbs are inflected according to gender,
tense and number, adjectives and numerals are inflected, too [Feldstein (2001)].
There are three genders in singular and two genders in plural. Thanks to the high
inflection, the word order in a Polish sentence is loose, as the function of the word
(e.g. whether a noun is a subject or an object) is determined not by the position of
the word within the sentence, but by the form of the word. For the same reason,
subjects are often dropped.
These properties make continuous speech recognition for Polish quite a demand-
ing challenge. Lack of articles makes detecting nouns difficult. High inflection some-
times causes a single word to have dozens of forms, which often sound similarly.
Voice-driven Computer Game in Noisy Environments 35
Fig. 1. A view of Rally Navigator’s main window
Loose word order increases the complexity of language models. Pronunciation vari-
ation, which happen to be advantageous in speech synthesis [Janicki et al. (2008)],
is a disturbing factor in speech recognition. Luckily these problems have smaller
impact on the system complexity if we consider a task-oriented, small vocabulary
recognition, which is usually the case for a computer game.
Moreover, in such applications some speech recognition errors are negligible for
the game’s command interpreter. For example, changing the case of a numeral will
result in decoding it as the same number. E.g., for the game considered, some non-
significant prepositions are ignored.
4.2. Concept of the Computer Game
We decided to create a game called Rally Navigator in which the player would
compete in races – not as a driver, but as a navigator. The player’s task would
be to provide the driver with information about the route and track elements like
curves and straights. To make the game more difficult (and the ASR system more
complex) we also decided to include speed control and gear switching. The aim
of the game is to win the rally. The more precise and well-timed the information
supplied by the player, the quicker the car reaches the finish line.
We wanted to make the ASR system an integral part of the game, not an addi-
36 Artur Janicki and Dariusz Wawer
tional or alternative way of controlling it.
The ASR system in the game is activated manually, by pressing the space bar.
This both decreases the chance of recognizing background noise as a command and
simplifies the recognition process by determining command boundaries. Fig. 1 shows
the game prototype’s main window. In the upper left corner there are three lines of
text. The first line shows the car’s current position and speed, the second line shows
the sentence recognized by our ASR system while the third line show the command
that was issued based on the recognition. In this case the command was Accelerate
to 50 and the issued command was Set speed 50.
4.3. Developing the Acoustic Model
The following elements were required to train a new acoustic model:
• audio data with recorded speech;
• transcription of each audio file;
• dictionary with phonetic representations of all words appearing in the tran-
scriptions;
• list of phonemes (and sounds) appearing in the transcriptions.
For a simple command-and-control one-speaker system (the one we began from)
the amount of training audio data can be fairly low. For multi-speaker systems
the amount of required audio increases, and increases even further for dictation
purposes.
The training audio signal was recorded in a home environment, using 16 kHz
sampling. The speaker read the following:
• 114 CORPORA [Grocholewski (1997)] sentences, which were phonetically
balanced, i.e., contained all Polish phonemes and diphones;
• the same set of sentences, but spoken faster, in order to get prepared for
quickly spoken commands, for example for a sequence of tight curves. The
CORPORA sentences in total formed ca. 11 minutes of our training data;
• sample commands, which will be used in game, and numbers, which must
be correctly recognized for the game to work (needed for the curve angles,
gears and distances).
In total, we prepared 25 minutes of audio for a single-speaker acoustic model.
The audio files were then transcribed, including special silence marks for silence
appearing in the file. The transcription allowed the training algorithm to better
align the speech and, in turn, produce better acoustic models. The transcription
involved also non-speech sounds, such as mouse clicking or breathing. It is important
to include such sounds in the transcriptions, so that they are not confused with
phonemes during training. If the non-speech sounds are properly labeled in the
training data, the ASR system may successfully recognize and omit them.
Voice-driven Computer Game in Noisy Environments 37
The phonetic dictionary was prepared in such a way that it contained all ex-
pected words with possible variants of their pronunciation. Careful preparation of
phonetic dictionary turned out to be crucial for the robustness of the speech recog-
nition.
We also generated multi-speaker acoustic models based on CORPORA speech
database, using the audio data originating from 45 speakers: 28 adult males, 3 young
males, 11 adult females and 3 young females. In total, the multi-speaker model was
trained based on c.a. 180 minutes of speech.
To train our acoustic model we used applications and scripts from SphinxTrain,
which is a part of CMU Sphinx.
4.4. Training the Language Model
We decided to use a tool supplied by CMU Sphinx called Sphinx Knowledge Base
Tool [Rudnicky (2010)], which generates the language model from a list of sentences.
The tool calculates all n-grams appearing in the training data and then converts the
results to an n-gram language model format compatible with Sphinx. The quality
of the model therefore depends on how well it reflects the commands which will be
issued in the system. In our work we decided to split our commands into at least
three word long parts, each part with at most two parameters and generated all
possible variations of each of such fragments. Some of these fragments overlapped,
so that all possible bigrams and trigrams were included. This way also allowed us
to repeat the less-parametrized fragments, since the amount of repetitions did not
have to be large. This whole operation caused our training file to become relatively
small, simple and quick to generate. After using the Sphinx Knowledge Base Tool
on this file, the resulting model required some fine-tuning. Most importantly all
impossible silence-starting and silence-ending N -grams were removed.
In addition we decided to include, as we called them, negative n-grams, which are
artificially created n-grams with near-zero probability. Their purpose is to disallow
word sequences which we consider invalid in our systems, but in preliminary tests
were sometimes incorrectly recognized by the system.
Our JSGF grammar was created manually. First we defined groups of words
describing parameters in our commands, like distances, directions and angles. Then
we defined constant fragments of commands using these groups, some of these in a
few variations, each correct from the Polish language point of view. Next step was
grouping these fragments into commands, and some commands into sequences of
commands. It is important to note, that we have tried to relax the strict commands
by making some words optional, and providing alternatives to some of the words.
After the grammar was complete we ran the tests, and then repeated the whole
process for sentences from the tests which did not match any of the grammar rules.
38 Artur Janicki and Dariusz Wawer
(a) Spectrogram of the ”babble” noise
(b) Spectrogram of the ”factory” noise
(c) Spectrogram of the ”Volvo” noise
Fig. 2. Spectrograms of the noise signals used for testing: (a) babble, (b) factory, (c) Volvo.
4.5. Testing Speech Recognition in Noisy Environments
The ASR system was tested using a set of 85 recordings, containing commands
which are likely to occur in the Rally Navigator game. It contained 2-16 words long
phrases, total number of tested words was 422. Two male speakers were recorded
in quiet environment, speaker A whose voice was used the acoustic model training,
and speaker B used only for testing. Total time of test recordings was 489 s.
The game ASR system was first tested with the original test recordings, i.e., in
quiet environment, and then the ASR system was challenged with noisy conditions.
Voice-driven Computer Game in Noisy Environments 39
We decided to conduct the experiments using tree types of environmental noises,
taken from the NOISEX-92 corpus [Varga and Steeneken (1993)]:
• babble: sound of a 100 people speaking in a canteen, with individual voices
are slightly audible; use of this noise can imitate background voices in the
room where the computer game is played;
• factory : a noise was recorded in a factory, near plate-cutting and electri-
cal welding equipment; this sound can imitate a highly disturbing noisy
environment;
• Volvo: a noise of car interior; the Volvo 340 noise recorded at the speed of
120 kmph, in 4th gear, on an asphalt road in rainy conditions; this is the
case of the computer game played using a mobile device in a car.
Figure 2 presents spectrograms of the used noises. One can observe that the
car noise (”Volvo”) is a lower-band stationary noise, whilst the babble and factory
noise have much wider bandwidth and show frequent changes in time. The factory
noise contains sudden sharp sounds, resembling hitting or cutting.
The noise samples were added to the recordings of test commands in various
proportions, controlling the SNR values. This way several sets of recordings were
obtained, with the SNR ranging from ca. 5 dB to 40 dB, which were submitted to
the ASR system of the game.
5. Results
This section presents the most significant results of the carried out experiments.
Following other research on ASR system tests, in this study the following metrics
were evaluated:
• substitutions (Sub): the number of words which were recognized as other
words;
• insertions (Ins): the number of words which were wrongly added to the
recognized words;
• deletions (Del): the number of words which omitted in recognition;
• word error rate (WER): the ratio of word recognition errors (such as sub-
stitutions, insertions and deletions) against the total number of words;
• word accuracy (WA): the ratio of correctly recognized words against the
total number of words; it is strongly related to WER;
• sentence accuracy (SA): the ratio of correctly recognized sentences against
the total number of sentences.
5.1. Tests for clean speech
Figure 3 shows the recognition performance for various language models for clean
speech using the single-speaker acoustic model for speaker A. The JSGF grammar
yielded the worst results (WA = 79.15%, SA = 61.2% for speaker A). More detailed
40 Artur Janicki and Dariusz Wawer
Fig. 3. Word accuracy and sentence accuracy for recognition using various language modeling for
clean speech.
Fig. 4. Word error rate against the duration of audio signal used in the training process.
analysis of the recognition logs showed that it worked very well for short sentences.
However, it had trouble correctly interpreting long word sequences, especially if
they consisted of more than one short command. The recognizer would in such cases
completely ignore the data after the first command, in spite of the used grammar
definition which allowed compound commands.
After switching to the N -gram language models, the recognition improved. Even
unigram models enabled accuracy better than the one for JSFG, but after adding
information about probabilities of bigrams and trigrams, the results improved sig-
nificantly, yielding word accuracy of 98.58% and sentence accuracy of 92.9%. The
use of negative trigrams turned out to be a successful move, giving for speaker A
the final result of WA = 99.29% and SA = 96.5% .
Figure 4 displays the performance of recognition for clean speech when the single-
Voice-driven Computer Game in Noisy Environments 41
Table 1. Recognition results for speakers A and B with various numbers of sentences used for
acoustic model adaptation. No noise added.
speaker
/ adaptationWER WA SA Sub Ins Del
A 0.9% 99.29% 96.5% 2 1 1
B, no adapt. 8.3% 92.91% 72.9% 20 5 10
B, 10 sentences 9.69% 90.78% 69.4% 28 2 11
B, 20 sentences 6.62% 94.09% 77.6% 18 3 7
B, 30 sentences 5.44% 95.27% 81.2% 13 3 7
B, 40 sentences 4.49% 95.98% 82.4% 9 2 8
B, 60 sentences 3.55% 96.93% 84.7% 6 2 7
B, 80 sentences 1.42% 98.58% 92.9% 3 0 3
speaker acoustic model was trained with different amount of audio data. The word
error rate for speaker A after training the acoustic model with 114 CORPORA sen-
tences (ca. 7 minutes of recording) was almost 4%, what is considered high, taking
into account that it is a small-vocabulary task-oriented recognition. After adding
4 minutes more, containing the same sentences, but uttered faster, WER became
slightly below 3% and the sentence accuracy increased from 88.2% to 90.6%. When
adding recordings containing numerals and control commands, the performance con-
tinued to improve until the training set contained 15 minutes of recordings, where
WER equaled to 0.7%. Sentence accuracy at this moment reached 97.6%, what
was considered a satisfactory result. Further enlargement of audio data resulted in
worsening of WER up to 2.1%, due to slight increase of substitutions and deletions.
Training using 25 minutes of recordings yielded again a low value of WER - 0.9%.
It is noteworthy that not every word deletion, substitution or insertion resulted
in a wrong command. E.g. omitting w (here meaning ’to’) in a sentence skrec w
prawo (’turn to the right’) caused the command change into ’turn right’, being
actually the same. So the semantic accuracy was actually higher than the sentence
accuracy.
Table 1 gives information about the recognition performance both for speakers A
and B. When challenging the system with the voice of speaker B, who was not used
to create the single-speaker acoustic model, the ASR system was able to recognize
correctly 72.9% sentences at the WER rate of 8.3%. After adapting the models
with 10 CORPORA sentences of the speaker B, WER even slightly increased, but
adaptation using 20 sentences decreased WER down to 6.62%. Further enlargement
of the adaptation session was steadily improving the recognition performance, but
80 sentences were required to make WER as low as 1.42%, with sentence accuracy
of almost 93%.
42 Artur Janicki and Dariusz Wawer
(a) Speaker A (b) Speaker B
Fig. 5. Number of substitutions, insertions and deletions against the SNR value of the tested
speech, for speaker A and B.
5.2. Tests for noisy environments
All noisy environments tests were performed using multi-speaker acoustic model.
Neither speaker’s voice was used to generate or adapt the model.
Figure 5 shows the number of substitutions, insertions and deletions for babble-
noised data against the SNR value. As SNR decreases, number of substitutions
increases rapidly, while number of deletions increases at a slower rate. As SNR
decreases further, number of substitutions stops its increase and deletions start to
occur more frequently. There is only a small range of SNR where insertions occur,
though it is different for each speaker. Interestingly, insertions did not occur in
larger numbers in either factory or Volvo-noised data.
Figure 6 shows the Word Error Rate against the SNR for the three types of
noise. As the SNR decreased, up to a certain point WER remained low, and then
started to rise. For the babble noise, the rate at which WER increased was almost
linear. For the factory noise, the WER increase started earlier, but up to a certain
SNR value it was slower. At the SNR between 15 dB and 20 dB WER started to
increase rapidly, then, at about 10 dB the increase rate slowed. Audio data noised
by factory sounds tended to have highest WER, though for speaker B there was a
range of the SNR where babble noise had the highest WER. On the tested SNR
range the Volvo noise had none or very low influence on WER. Up to a certain
SNR value the Volvo noise did not increase WER. As SNR decreased further, WER
started to increase gradually, but its increase rate was significantly slower than for
both babble and factory noise.
Figure 7 shows the Sentence Accuracy against the SNR. At the SNR of 5 dB,
from data noised with babble and factory none of the test sentences could be rec-
ognized, while sentences noised with the Volvo noise still could be recognized with
about 80% correctness.
Voice-driven Computer Game in Noisy Environments 43
(a) Speaker A (b) Speaker B
Fig. 6. Word error rate against the SNR for various types of noise, for speaker A and B.
(a) Speaker A (b) Speaker B
Fig. 7. Sentence accuracy against the SNR for various types of noise, for speaker A and B.
(a) The ”babble” noise (b) The ”factory” noise
Fig. 8. WER comparison for the two speakers in the presence of (a) babble, and (b) factory noise.
6. Conclusion
This paper described a task-oriented continuous speech recognition system for Pol-
ish working in a computer game called Rally Navigator under clean and noisy
44 Artur Janicki and Dariusz Wawer
conditions. The process of designing and implementing the system, based on CMU
Sphinx4, is described. We presented the steps undertaken to create the acoustic
model and the language model, using both the grammar and the statistic N -gram
model.
As for the language model, we showed that the best results were achieved if
the statistic trigram model was used. We improved it by adding negative trigrams,
what decreased the number of misrecognized words.
Initial experiments for clean speech showed that the audio material as short as
15 minutes is enough to produce a highly effective single-speaker command-and-
control ASR system, providing the sentence recognition accuracy of 97.6%. What
was expected, such a model required adaptation for another speaker. 20 sentences
of the new speaker enabled partial adaptation of ASR, so that it reached word
accuracy of 94.09%, but better results (WER below 4%) were obtained if the model
was adapted with 60 or 80 sentences. Obviously using such a long audio material
for adaptation of each new user would be impractical, so the acoustic model needs
to be improved.
Experiments with speech recognition in noisy conditions showed that the recog-
nition accuracy was hardly affected if SNR exceeded 25 dB. Only minor degradation
of performance was observed if recognition was tested in car-interior environment
(the Volvo noise), even for the SNR as low as 5 dB. However, major degradation
of accuracy was observed if speech signal was distorted with babble and factory
noises and the SNR decreased below 20 dB. A game using a voice interface would
need to be expanded by, e.g. a noise suppression system, if planned to be used in
environments with loud background speech or factory-like noises.
References
B. Ziolko, S. Manandhar, R. C. Wilson, M. Ziolko, J. Galka, Application of HTK to thePolish Language, In Proceedings of IEEE International Conference on Audio, Languageand Image Processing ICALIP2008, Shanghai, 2008, pp.1759-1764.
M. Szymanski, J. Ogorkiewicz, M. Lange, K. Klessa, S. Grocholewski, G. Demenko, Firstevaluation of Polish LVCSR acoustic models obtained from the JURISDIC database,Speech and Language Technology, vol. 11, 2008.
S. Furui, Selected topics from 40 years of research in speech and speaker recognition, inProc. Interspeech 2009, Brighton UK, 2009.
L. Rabiner and B.-H. Juang, Historical Perspective of the Filed of ASR/NLU, SpringerHandbook of Speech Processing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.
S. Young, HMMs and Related Speech Recognition Technologies, Springer Handbook ofSpeech Processing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.
Sun Microsystems,Java Speech Grammar Format, version 1, 1998, http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/
Ubisoft Shanghai, Tom Clancy’s Endwar, 2009, http://endwargame.us.ubi.com/Ubisoft Romania, Tom Clancy’s H.A.W.X, 2009, http://www.hawxgame.com/A. Janicki and D. Wawer Automatic Speech Recognition for Polish in a Computer Game
Voice-driven Computer Game in Noisy Environments 45
Interface, in Proc. of the Federated Conference on Computer Science and InformationSystems (FedCSIS 2011), pp. 711-716, Szczecin, 2011
W. Jassem, Podstawy fonetyki akustycznej (in Polish), Warsaw, Poland : PWN, 1973.R.F. Feldstein, A Concise Polish Grammar, Duke University, Durham, USA: Slavic and
Eurasian Language Resource Center, 2001.A. Janicki, P. Meus, M. Topczewski, Taking Advantage of Pronunciation Variation in Unit
Selection Speech Synthesis for Polish, 3rd International Symposium on Communica-tions, Control and Signal Processing (ISCCSP 2008), St.Julians, Malta, 2008
S. Grocholewski, CORPORA - Speech Database for Polish Diphones, 5th European Con-ference on Speech Communication and Technology Eurospeech ’97, Rhodes, Greece,1997.
A. Rudnicky, Sphinx Knowledge Base Tool, 2010,http://www.speech.cs.cmu.edu/tools/lmtool.html
J. Droppo and A. Acero, Environmental Robustness, in: Springer Handbook of SpeechProcessing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.
H. Yasui, K. Shinoda, S. Furui, and K. Iwano, Noise robust speech recognition using spec-tral subtraction and F0 information extracted by Hough transform, APSIPA AnnualSummit and Conference 2009, TP-P1-4, pp.631-634, 2009.
D.S. Kim, S.Y. Lee, R.M. Kil, Auditory Processing of Speech Signals for Robust SpeechRecognition in Real-World Noisy Environments, IEEE Transactions on Speech andAudio Processing, 7, pp. 55-69, 1999.
J.H.L. Hansen, R. Sarikaya, U. Yapanel, B. Pellom, Robust Speech Recognition in Noise:An Evaluation using the SPINE Corpus, in Proc. Eurospeech 2001, Aalborg, Denmark,2001.
L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, High-Performance Robust SpeechRecognition Using Stereo Training Data, in Proc. ICASSP, Salt Lake City, Utah, 2001.
J.F. Gemmeke, T. Virtanen, A. Hurmalainen and Y. Sun, Exemplar-based sparse repre-sentations for noise robust automatic speech recognition, IEEE Transactions on Audio,Speech and Language Processing, 19(7):2067 2080.
S. Haque, R.Togneri and A. Zaknich, Perceptual features for automatic speech recognitionin noisy environments, Speech Communications, 51 (1), pp. 58-75, 2009.
A. Schmidt-Nielsen, et al., Speech in Noisy Environments (SPINE) Training Audio, Lin-guistic Data Consortium, Philadelphia, 2000.
A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92:A database and an experiment to study the effect of additive noise on speech recognitionsystems, Speech Communication, 12 (3), pp. 247-251, 1993.