Voice-driven Computer Game in Noisy Environments · Automatic Speech Recognition in Voice Interfaces ... (e.g. whether a noun is a subject or an object) ... Voice-driven Computer

International Journal of Computer Science and Applicationsc© Technomathematics Research Foundation

Vol. 10 No. 1, pp. 31 - 45, 2013

Voice-driven Computer Game in Noisy Environments

ARTUR JANICKI

Institute of Telecommunications, Warsaw University of Technology

ul. Nowowiejska 15/19, 00-665 Warsaw, Poland.

[email protected]

DARIUSZ WAWER

Institute of Telecommunications, Warsaw University of Technology

ul. Nowowiejska 15/19, 00-665 Warsaw, [email protected]

The paper describes the performance of a task-oriented continuous automatic speechrecognition (ASR) system in the computer game interface in noisy conditions. First, the

process of designing the ASR system for Polish, based on CMU Sphinx4, is presented.

Then, the concept of the computer game called Rally Navigator is described. The exper-iments were first run for the clean speech, and then repeated with added environmental

noise with various signal-to-noise (SNR) ratios. Results of experiments with clean speech

show that as little as 15 minutes of audio material is enough to produce a highly effec-tive single-speaker command-and-control ASR system for the computer game, providing

the sentence recognition accuracy of 97.6%. Results of the tests under noisy conditions

show that minor degradation of performance was observed in car environment, however,accuracy decreased severely for babble and factory noises for SNR below 20 dB.

Keywords: speech recognition; voice interface; computer game; noise.

1. Introduction

Automatic speech recognition (ASR) systems become more and more popular as

an input mechanism in various applications - especially in word processors where

dictation software is being introduced. But there are also trials being conducted

to replace pads and keyboards as computer games controlling devices, thus mak-

ing the games more interesting and enabling multimodal input. ASR systems can

be successfully utilized in computer games, under the condition that they provide

satisfactory recognition accuracy and short processing time.

To ensure realistic conditions, an ASR system in the computer game should be

able to recognize continuous speech, which is how people usually talk. Such speech

is much more difficult to recognize than isolated words. Apart from a few cases,

for example [Ziolko et al. (2008)] and [Szymanski et al. (2008)], continuous speech

recognition systems barely exist for the Polish language, which is highly inflective

and thus difficult to recognize. Recognizing speech in Polish under noisy conditions

31

32 Artur Janicki and Dariusz Wawer

poses an additional challenge.

This paper describes experiments with a small-vocabulary task-oriented auto-

matic continuous speech recognition system for Polish, working in the voice interface

of a computer game under clean and noisy conditions. The impact of noise on speech

recognition performance will be analyzed and discussed.

2. Voice-driven Computer Games

Nowadays computer games are a large and well developing industry, often generating

high revenues. To be successful in the market, a game has to be easy to play,

intuitive, must be fun and interesting and must be overall player-friendly.

While the first three of the requirements apply rather to the game design, the

last one refers as well to the user interface, i.e., in this case, to the ASR input

system. It imposes some demanding requirements: firstly, the system must recognize

the command in a short time; the maximal acceptable amount of time depends on

how fast the action is in a game, but should be lower than one second. Secondly,

the accuracy of recognition must be very high. For our work, we have aimed at

90% sentence recognition accuracy. Should a player issue several misinterpreted

commands in a row, the game would fail at the friendliness requirement.

An ideal game using speech recognition should work robustly for every speaker

in most of the common acoustic conditions, including people with various accents,

non-native speakers and even people with speech deficiencies.

There are many factors that can have negative impact on speech recognition

robustness:

• a low quality microphone;

• noisy environments;

• high processing power and memory requirements of an ASR system.

It is, however, advantageous that games usually require command-and-control

(task-oriented) systems, which are easier to implement and tend to have better

accuracy than dictation systems with a dictionary of similar size.

A computer game is a specific environment for a speech recognition system. It

can be treated as a special type of a dialog system, in which in some cases the game

may actually predict what the user might want to say, knowing the game scenario.

A slight on-the-fly modification to the language model may be made to reflect the

scenario.

The usage of speech recognition systems in computer games is yet unexplored.

There have been a few attempts to utilize ASR systems, particularly in strategy

games [Tom Clancy’s Endwar (2009)] and flight simulators [Tom Clancy’s H.A.W.X

(2009)]. ASR systems implemented in these games did work properly, but have not

been enthusiastically accepted, mostly because they were tedious to use.

Voice-driven Computer Game in Noisy Environments 33

3. Automatic Speech Recognition in Voice Interfaces

3.1. Continuous Speech Recognition

Early works on ASR systems, starting in the 1950s, concerned recognition of isolated

phonemes, or at best - a few words. One example is the isolated digits recognizer

constructed at Bell Labs in 1952 [Furui (2009)]. The invention of Dynamic Time

Warping (DTW) in the late 60s allowed for projects on larger-vocabulary word

recognition and for processing of connected words [Rabiner (2008)]. In 1971 the

ARPA SUR (Speech Understanding Research) program started, aiming at creating a

reliable large vocabulary ASR for continuous speech. One of its results was HARPY -

an ASR system developed at the Carnegie Mellon University, working with semantic

accuracy of 95% at the processing speed of 80 times real-time [Rabiner (2008)].

Studies on continuous speech recognition continued intensively in the 1980s, when

the usage of statistic acoustic modeling and statistic language modeling advanced,

and they have continued till nowadays.

The key to successful continuous speech recognition is a combination of a highly

accurate acoustic modeling (AM), based usually on MFCC parameters and Hidden

Markov Models (HMM), and a precise and yet flexible language modeling (LM),

based most often on N -grams [Young (2008)] or grammars, such as Java Speech

Grammar Format (JSGF ) [JSGF, Sun Microsystems (1998)].

3.2. Speech Recognition in Noisy Environments

Performance of speech recognition systems used in noisy environments is usually

degraded - this phenomenon has been observed in many studies [Droppo and Acero

(2008)].

Various methods of improving robustness against noise were researched. One of

them is the use of speech enhancement algorithms - in this method speech signal,

before submitting to an ASR system, undergoes a denoising process, e.g. by spectral

subtraction [Yasui et al. (2009)] or Wiener filtering. Other way is to design new

auditory models which are less sensitive to noise [Doh-Suk Kim et al. (1999)].

Other researchers propose advanced feature processing, such as cepstral normal-

ization techniques (e.g., cepstral mean normalization - CMN, variable cepstral mean

normalization - VCMN), see [Hansen et al. (2001)], or other techniques which try to

estimate cepstral parameters of undistorted speech, given cepstral parameters of the

noisy speech [Deng et al. (2008)]; this is sometimes combined with multi-condition

training, i.e., training acoustic models with speech distorted with different types of

noise and signal-to-noise (SNR) ratios. Using sparse representation based classifi-

cation allows for improving robustness, though it requires a lot of processing power

[Gemmeke et al. (2011)]. For some types of noise using perceptual properties proved

to improve results of the ASR system [Haque et al. (2008)]

Many corpora with noisy speech exist to be used for testing the ASR systems,

such as SPINE [Schmidt-Nielsen et al. (2008)], unfortunately they are dedicated


mostly for English. However, researchers use often use corpora of noise sounds, such

as NOISEX-92 [Varga and Steeneken (1993)] - such noise signals can be added to

clean speech data, thus emulating noisy conditions. Unfortunately this method does

not take into account phenomena such as the Lombard effect - changing speaker

articulation in noisy environments in order to be better heard through the noise.

3.3. The Sphinx Framework

CMU Sphinx framework was created and has been continuously developed at

Carnegie Mellon University (CMU). In our work we used Sphinx4 recognizer and

SphinxTrain toolset for preparing training data and accoustic and language models.

Sphinx4 is a multi-threaded CPU-based recognizer written in Java, which uti-

lizes HMMs for speech recognition. For our research we chose 13 parameter MFCC

acoustic models, an N -gram language model and a JSGF grammar.

The Sphinx framework was originally designed for English, but nowadays it

supports several languages, e.g. Spanish, French and Mandarin. However, there are

no acoustic nor language models available for Polish.

To use it with Polish, the package required some modifications to work properly,

e.g. several internal classes needed to be adapted to correctly read names containing

non-ASCII characters, which were found in the language models and dictionaries.

4. Designing the Game Voice Interface

In this section we will characterize briefly the Polish language, describe the concept

of the voice-driven computer game Rally Navigator, originally proposed in [Janicki

and Wawer (2011)], and we will describe the consecutive steps of designing the

speech recognition system for this game, as well as the testing procedure under

noisy conditions.

4.1. The Polish Language

Spoken Polish contains 38 phonemes: 6 vowels and 32 consonants [Jassem (1973)].

Most of the phonemes (plosives, fricatives and affricates) exist in pairs: unvoiced-

voiced, e.g. [tS] - [dZ].

The Polish grammar is complex. It is highly inflective - nouns are inflected

according to seven cases and two numbers, verbs are inflected according to gender,

tense and number, adjectives and numerals are inflected, too [Feldstein (2001)].

There are three genders in singular and two genders in plural. Thanks to the high

inflection, the word order in a Polish sentence is loose, as the function of the word

(e.g. whether a noun is a subject or an object) is determined not by the position of

the word within the sentence, but by the form of the word. For the same reason,

subjects are often dropped.

These properties make continuous speech recognition for Polish quite a demand-

ing challenge. Lack of articles makes detecting nouns difficult. High inflection some-

times causes a single word to have dozens of forms, which often sound similarly.


Fig. 1. A view of Rally Navigator’s main window

Loose word order increases the complexity of language models. Pronunciation vari-

ation, which happen to be advantageous in speech synthesis [Janicki et al. (2008)],

is a disturbing factor in speech recognition. Luckily these problems have smaller

impact on the system complexity if we consider a task-oriented, small vocabulary

recognition, which is usually the case for a computer game.

Moreover, in such applications some speech recognition errors are negligible for

the game’s command interpreter. For example, changing the case of a numeral will

result in decoding it as the same number. E.g., for the game considered, some non-

significant prepositions are ignored.

4.2. Concept of the Computer Game

We decided to create a game called Rally Navigator in which the player would

compete in races – not as a driver, but as a navigator. The player’s task would

be to provide the driver with information about the route and track elements like

curves and straights. To make the game more difficult (and the ASR system more

complex) we also decided to include speed control and gear switching. The aim

of the game is to win the rally. The more precise and well-timed the information

supplied by the player, the quicker the car reaches the finish line.

We wanted to make the ASR system an integral part of the game, not an addi-


tional or alternative way of controlling it.

The ASR system in the game is activated manually, by pressing the space bar.

This both decreases the chance of recognizing background noise as a command and

simplifies the recognition process by determining command boundaries. Fig. 1 shows

the game prototype’s main window. In the upper left corner there are three lines of

text. The first line shows the car’s current position and speed, the second line shows

the sentence recognized by our ASR system while the third line show the command

that was issued based on the recognition. In this case the command was Accelerate

to 50 and the issued command was Set speed 50.

4.3. Developing the Acoustic Model

The following elements were required to train a new acoustic model:

• audio data with recorded speech;

• transcription of each audio file;

• dictionary with phonetic representations of all words appearing in the tran-

scriptions;

• list of phonemes (and sounds) appearing in the transcriptions.

For a simple command-and-control one-speaker system (the one we began from)

the amount of training audio data can be fairly low. For multi-speaker systems

the amount of required audio increases, and increases even further for dictation

purposes.

The training audio signal was recorded in a home environment, using 16 kHz

sampling. The speaker read the following:

• 114 CORPORA [Grocholewski (1997)] sentences, which were phonetically

balanced, i.e., contained all Polish phonemes and diphones;

• the same set of sentences, but spoken faster, in order to get prepared for

quickly spoken commands, for example for a sequence of tight curves. The

CORPORA sentences in total formed ca. 11 minutes of our training data;

• sample commands, which will be used in game, and numbers, which must

be correctly recognized for the game to work (needed for the curve angles,

gears and distances).

In total, we prepared 25 minutes of audio for a single-speaker acoustic model.

The audio files were then transcribed, including special silence marks for silence

appearing in the file. The transcription allowed the training algorithm to better

align the speech and, in turn, produce better acoustic models. The transcription

involved also non-speech sounds, such as mouse clicking or breathing. It is important

to include such sounds in the transcriptions, so that they are not confused with

phonemes during training. If the non-speech sounds are properly labeled in the

training data, the ASR system may successfully recognize and omit them.


The phonetic dictionary was prepared in such a way that it contained all ex-

pected words with possible variants of their pronunciation. Careful preparation of

phonetic dictionary turned out to be crucial for the robustness of the speech recog-

nition.

We also generated multi-speaker acoustic models based on CORPORA speech

database, using the audio data originating from 45 speakers: 28 adult males, 3 young

males, 11 adult females and 3 young females. In total, the multi-speaker model was

trained based on c.a. 180 minutes of speech.

To train our acoustic model we used applications and scripts from SphinxTrain,

which is a part of CMU Sphinx.

4.4. Training the Language Model

We decided to use a tool supplied by CMU Sphinx called Sphinx Knowledge Base

Tool [Rudnicky (2010)], which generates the language model from a list of sentences.

The tool calculates all n-grams appearing in the training data and then converts the

results to an n-gram language model format compatible with Sphinx. The quality

of the model therefore depends on how well it reflects the commands which will be

issued in the system. In our work we decided to split our commands into at least

three word long parts, each part with at most two parameters and generated all

possible variations of each of such fragments. Some of these fragments overlapped,

so that all possible bigrams and trigrams were included. This way also allowed us

to repeat the less-parametrized fragments, since the amount of repetitions did not

have to be large. This whole operation caused our training file to become relatively

small, simple and quick to generate. After using the Sphinx Knowledge Base Tool

on this file, the resulting model required some fine-tuning. Most importantly all

impossible silence-starting and silence-ending N -grams were removed.

In addition we decided to include, as we called them, negative n-grams, which are

artificially created n-grams with near-zero probability. Their purpose is to disallow

word sequences which we consider invalid in our systems, but in preliminary tests

were sometimes incorrectly recognized by the system.

Our JSGF grammar was created manually. First we defined groups of words

describing parameters in our commands, like distances, directions and angles. Then

we defined constant fragments of commands using these groups, some of these in a

few variations, each correct from the Polish language point of view. Next step was

grouping these fragments into commands, and some commands into sequences of

commands. It is important to note, that we have tried to relax the strict commands

by making some words optional, and providing alternatives to some of the words.

After the grammar was complete we ran the tests, and then repeated the whole

process for sentences from the tests which did not match any of the grammar rules.


(a) Spectrogram of the ”babble” noise

(b) Spectrogram of the ”factory” noise

(c) Spectrogram of the ”Volvo” noise

Fig. 2. Spectrograms of the noise signals used for testing: (a) babble, (b) factory, (c) Volvo.

4.5. Testing Speech Recognition in Noisy Environments

The ASR system was tested using a set of 85 recordings, containing commands

which are likely to occur in the Rally Navigator game. It contained 2-16 words long

phrases, total number of tested words was 422. Two male speakers were recorded

in quiet environment, speaker A whose voice was used the acoustic model training,

and speaker B used only for testing. Total time of test recordings was 489 s.

The game ASR system was first tested with the original test recordings, i.e., in

quiet environment, and then the ASR system was challenged with noisy conditions.


We decided to conduct the experiments using tree types of environmental noises,

taken from the NOISEX-92 corpus [Varga and Steeneken (1993)]:

• babble: sound of a 100 people speaking in a canteen, with individual voices

are slightly audible; use of this noise can imitate background voices in the

room where the computer game is played;

• factory : a noise was recorded in a factory, near plate-cutting and electri-

cal welding equipment; this sound can imitate a highly disturbing noisy

environment;

• Volvo: a noise of car interior; the Volvo 340 noise recorded at the speed of

120 kmph, in 4th gear, on an asphalt road in rainy conditions; this is the

case of the computer game played using a mobile device in a car.

Figure 2 presents spectrograms of the used noises. One can observe that the

car noise (”Volvo”) is a lower-band stationary noise, whilst the babble and factory

noise have much wider bandwidth and show frequent changes in time. The factory

noise contains sudden sharp sounds, resembling hitting or cutting.

The noise samples were added to the recordings of test commands in various

proportions, controlling the SNR values. This way several sets of recordings were

obtained, with the SNR ranging from ca. 5 dB to 40 dB, which were submitted to

the ASR system of the game.

5. Results

This section presents the most significant results of the carried out experiments.

Following other research on ASR system tests, in this study the following metrics

were evaluated:

• substitutions (Sub): the number of words which were recognized as other

words;

• insertions (Ins): the number of words which were wrongly added to the

recognized words;

• deletions (Del): the number of words which omitted in recognition;

• word error rate (WER): the ratio of word recognition errors (such as sub-

stitutions, insertions and deletions) against the total number of words;

• word accuracy (WA): the ratio of correctly recognized words against the

total number of words; it is strongly related to WER;

• sentence accuracy (SA): the ratio of correctly recognized sentences against

the total number of sentences.

5.1. Tests for clean speech

Figure 3 shows the recognition performance for various language models for clean

speech using the single-speaker acoustic model for speaker A. The JSGF grammar

yielded the worst results (WA = 79.15%, SA = 61.2% for speaker A). More detailed


Fig. 3. Word accuracy and sentence accuracy for recognition using various language modeling for

clean speech.

Fig. 4. Word error rate against the duration of audio signal used in the training process.

analysis of the recognition logs showed that it worked very well for short sentences.

However, it had trouble correctly interpreting long word sequences, especially if

they consisted of more than one short command. The recognizer would in such cases

completely ignore the data after the first command, in spite of the used grammar

definition which allowed compound commands.

After switching to the N -gram language models, the recognition improved. Even

unigram models enabled accuracy better than the one for JSFG, but after adding

information about probabilities of bigrams and trigrams, the results improved sig-

nificantly, yielding word accuracy of 98.58% and sentence accuracy of 92.9%. The

use of negative trigrams turned out to be a successful move, giving for speaker A

the final result of WA = 99.29% and SA = 96.5% .

Figure 4 displays the performance of recognition for clean speech when the single-


Table 1. Recognition results for speakers A and B with various numbers of sentences used for

acoustic model adaptation. No noise added.

speaker

/ adaptationWER WA SA Sub Ins Del

A 0.9% 99.29% 96.5% 2 1 1

B, no adapt. 8.3% 92.91% 72.9% 20 5 10

B, 10 sentences 9.69% 90.78% 69.4% 28 2 11

B, 20 sentences 6.62% 94.09% 77.6% 18 3 7

B, 30 sentences 5.44% 95.27% 81.2% 13 3 7

B, 40 sentences 4.49% 95.98% 82.4% 9 2 8

B, 60 sentences 3.55% 96.93% 84.7% 6 2 7

B, 80 sentences 1.42% 98.58% 92.9% 3 0 3

speaker acoustic model was trained with different amount of audio data. The word

error rate for speaker A after training the acoustic model with 114 CORPORA sen-

tences (ca. 7 minutes of recording) was almost 4%, what is considered high, taking

into account that it is a small-vocabulary task-oriented recognition. After adding

4 minutes more, containing the same sentences, but uttered faster, WER became

slightly below 3% and the sentence accuracy increased from 88.2% to 90.6%. When

adding recordings containing numerals and control commands, the performance con-

tinued to improve until the training set contained 15 minutes of recordings, where

WER equaled to 0.7%. Sentence accuracy at this moment reached 97.6%, what

was considered a satisfactory result. Further enlargement of audio data resulted in

worsening of WER up to 2.1%, due to slight increase of substitutions and deletions.

Training using 25 minutes of recordings yielded again a low value of WER - 0.9%.

It is noteworthy that not every word deletion, substitution or insertion resulted

in a wrong command. E.g. omitting w (here meaning ’to’) in a sentence skrec w

prawo (’turn to the right’) caused the command change into ’turn right’, being

actually the same. So the semantic accuracy was actually higher than the sentence

accuracy.

Table 1 gives information about the recognition performance both for speakers A

and B. When challenging the system with the voice of speaker B, who was not used

to create the single-speaker acoustic model, the ASR system was able to recognize

correctly 72.9% sentences at the WER rate of 8.3%. After adapting the models

with 10 CORPORA sentences of the speaker B, WER even slightly increased, but

adaptation using 20 sentences decreased WER down to 6.62%. Further enlargement

of the adaptation session was steadily improving the recognition performance, but

80 sentences were required to make WER as low as 1.42%, with sentence accuracy

of almost 93%.


(a) Speaker A (b) Speaker B

Fig. 5. Number of substitutions, insertions and deletions against the SNR value of the tested

speech, for speaker A and B.

5.2. Tests for noisy environments

All noisy environments tests were performed using multi-speaker acoustic model.

Neither speaker’s voice was used to generate or adapt the model.

Figure 5 shows the number of substitutions, insertions and deletions for babble-

noised data against the SNR value. As SNR decreases, number of substitutions

increases rapidly, while number of deletions increases at a slower rate. As SNR

decreases further, number of substitutions stops its increase and deletions start to

occur more frequently. There is only a small range of SNR where insertions occur,

though it is different for each speaker. Interestingly, insertions did not occur in

larger numbers in either factory or Volvo-noised data.

Figure 6 shows the Word Error Rate against the SNR for the three types of

noise. As the SNR decreased, up to a certain point WER remained low, and then

started to rise. For the babble noise, the rate at which WER increased was almost

linear. For the factory noise, the WER increase started earlier, but up to a certain

SNR value it was slower. At the SNR between 15 dB and 20 dB WER started to

increase rapidly, then, at about 10 dB the increase rate slowed. Audio data noised

by factory sounds tended to have highest WER, though for speaker B there was a

range of the SNR where babble noise had the highest WER. On the tested SNR

range the Volvo noise had none or very low influence on WER. Up to a certain

SNR value the Volvo noise did not increase WER. As SNR decreased further, WER

started to increase gradually, but its increase rate was significantly slower than for

both babble and factory noise.

Figure 7 shows the Sentence Accuracy against the SNR. At the SNR of 5 dB,

from data noised with babble and factory none of the test sentences could be rec-

ognized, while sentences noised with the Volvo noise still could be recognized with

about 80% correctness.



Fig. 6. Word error rate against the SNR for various types of noise, for speaker A and B.


Fig. 7. Sentence accuracy against the SNR for various types of noise, for speaker A and B.

(a) The ”babble” noise (b) The ”factory” noise

Fig. 8. WER comparison for the two speakers in the presence of (a) babble, and (b) factory noise.

6. Conclusion

This paper described a task-oriented continuous speech recognition system for Pol-

ish working in a computer game called Rally Navigator under clean and noisy


conditions. The process of designing and implementing the system, based on CMU

Sphinx4, is described. We presented the steps undertaken to create the acoustic

model and the language model, using both the grammar and the statistic N -gram

model.

As for the language model, we showed that the best results were achieved if

the statistic trigram model was used. We improved it by adding negative trigrams,

what decreased the number of misrecognized words.

Initial experiments for clean speech showed that the audio material as short as

15 minutes is enough to produce a highly effective single-speaker command-and-

control ASR system, providing the sentence recognition accuracy of 97.6%. What

was expected, such a model required adaptation for another speaker. 20 sentences

of the new speaker enabled partial adaptation of ASR, so that it reached word

accuracy of 94.09%, but better results (WER below 4%) were obtained if the model

was adapted with 60 or 80 sentences. Obviously using such a long audio material

for adaptation of each new user would be impractical, so the acoustic model needs

to be improved.

Experiments with speech recognition in noisy conditions showed that the recog-

nition accuracy was hardly affected if SNR exceeded 25 dB. Only minor degradation

of performance was observed if recognition was tested in car-interior environment

(the Volvo noise), even for the SNR as low as 5 dB. However, major degradation

of accuracy was observed if speech signal was distorted with babble and factory

noises and the SNR decreased below 20 dB. A game using a voice interface would

need to be expanded by, e.g. a noise suppression system, if planned to be used in

environments with loud background speech or factory-like noises.

References

B. Ziolko, S. Manandhar, R. C. Wilson, M. Ziolko, J. Galka, Application of HTK to thePolish Language, In Proceedings of IEEE International Conference on Audio, Languageand Image Processing ICALIP2008, Shanghai, 2008, pp.1759-1764.

M. Szymanski, J. Ogorkiewicz, M. Lange, K. Klessa, S. Grocholewski, G. Demenko, Firstevaluation of Polish LVCSR acoustic models obtained from the JURISDIC database,Speech and Language Technology, vol. 11, 2008.

S. Furui, Selected topics from 40 years of research in speech and speaker recognition, inProc. Interspeech 2009, Brighton UK, 2009.

L. Rabiner and B.-H. Juang, Historical Perspective of the Filed of ASR/NLU, SpringerHandbook of Speech Processing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.

S. Young, HMMs and Related Speech Recognition Technologies, Springer Handbook ofSpeech Processing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.

Sun Microsystems,Java Speech Grammar Format, version 1, 1998, http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/

Ubisoft Shanghai, Tom Clancy’s Endwar, 2009, http://endwargame.us.ubi.com/Ubisoft Romania, Tom Clancy’s H.A.W.X, 2009, http://www.hawxgame.com/A. Janicki and D. Wawer Automatic Speech Recognition for Polish in a Computer Game


Interface, in Proc. of the Federated Conference on Computer Science and InformationSystems (FedCSIS 2011), pp. 711-716, Szczecin, 2011

W. Jassem, Podstawy fonetyki akustycznej (in Polish), Warsaw, Poland : PWN, 1973.R.F. Feldstein, A Concise Polish Grammar, Duke University, Durham, USA: Slavic and

Eurasian Language Resource Center, 2001.A. Janicki, P. Meus, M. Topczewski, Taking Advantage of Pronunciation Variation in Unit

Selection Speech Synthesis for Polish, 3rd International Symposium on Communica-tions, Control and Signal Processing (ISCCSP 2008), St.Julians, Malta, 2008

S. Grocholewski, CORPORA - Speech Database for Polish Diphones, 5th European Con-ference on Speech Communication and Technology Eurospeech ’97, Rhodes, Greece,1997.

A. Rudnicky, Sphinx Knowledge Base Tool, 2010,http://www.speech.cs.cmu.edu/tools/lmtool.html

J. Droppo and A. Acero, Environmental Robustness, in: Springer Handbook of SpeechProcessing, ed. J.Benesty et al.,Berlin Heidelberg: Springer-Verlag, 2008.

H. Yasui, K. Shinoda, S. Furui, and K. Iwano, Noise robust speech recognition using spec-tral subtraction and F0 information extracted by Hough transform, APSIPA AnnualSummit and Conference 2009, TP-P1-4, pp.631-634, 2009.

D.S. Kim, S.Y. Lee, R.M. Kil, Auditory Processing of Speech Signals for Robust SpeechRecognition in Real-World Noisy Environments, IEEE Transactions on Speech andAudio Processing, 7, pp. 55-69, 1999.

J.H.L. Hansen, R. Sarikaya, U. Yapanel, B. Pellom, Robust Speech Recognition in Noise:An Evaluation using the SPINE Corpus, in Proc. Eurospeech 2001, Aalborg, Denmark,2001.

L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, High-Performance Robust SpeechRecognition Using Stereo Training Data, in Proc. ICASSP, Salt Lake City, Utah, 2001.

J.F. Gemmeke, T. Virtanen, A. Hurmalainen and Y. Sun, Exemplar-based sparse repre-sentations for noise robust automatic speech recognition, IEEE Transactions on Audio,Speech and Language Processing, 19(7):2067 2080.

S. Haque, R.Togneri and A. Zaknich, Perceptual features for automatic speech recognitionin noisy environments, Speech Communications, 51 (1), pp. 58-75, 2009.

A. Schmidt-Nielsen, et al., Speech in Noisy Environments (SPINE) Training Audio, Lin-guistic Data Consortium, Philadelphia, 2000.

A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92:A database and an experiment to study the effect of additive noise on speech recognitionsystems, Speech Communication, 12 (3), pp. 247-251, 1993.

Voice-driven Computer Game in Noisy Environments · Automatic Speech Recognition in Voice Interfaces ... (e.g. whether a noun is a subject or an object) ... Voice-driven Computer

Documents