
Proceedings of the 17th International Conference on Natural Language Processing, pages 409–419, Patna, India, December 18–21, 2020. ©2020 NLP Association of India (NLPAI)


End-to-End Automatic Speech Recognition for Gujarati

Deepang Raval, Vyom Pathak, Muktan Patel, Brijesh Bhatt
Computer Engineering Department, Dharmsinh Desai University, Nadiad

[email protected], [email protected], [email protected], [email protected]

Abstract

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help in understanding our ASR system for a particular language (Gujarati) and can also guide improvements to ASR systems for other low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11% decrease in Word Error Rate (WER) with respect to the base-model WER.

1 Introduction

ASR is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform. Speech recognition has been an active area of research for many decades. Initial work in ASR was based on statistical modeling techniques like the Hidden Markov Model (HMM) (Baker, 1975) and used phonemes to represent the distinct sounds that make up a word. With the rise of deep learning based techniques and the increasing availability of data, End-to-End speech recognition systems started showing competitive results. Initial deep learning based ASR models, based on Recurrent Neural Networks (RNN) and CTC (Graves et al., 2006), overcame the issues of statistical systems and provided a mapping from variable length input to output. With further advancements in algorithms and resources, various complex deep learning architectures have been introduced for effective End-to-End speech recognition. End-to-End speech recognition for low resource languages has not gained significantly from the advancements in deep learning due to the lack of training data compared to high resource languages. Linguistic diversity¹ also makes it difficult to adopt models across languages.

In this paper, we present a speech recognition system for the Gujarati language. Gujarati is a rich language consisting of 34 consonants and 13 vowels. While a larger number of vowels may reduce homophones, a larger number of consonants may increase ambiguity.

The key contributions of this paper are as follows:

• We have adopted the state-of-the-art ASR model described in (Amodei et al., 2015) for the Gujarati language.

• We present a novel approach of combining two language models, a 4-gram word-level language model (WLM) and a bi-gram character-level language model (CLM), to improve the performance of prefix decoding.

• We propose a Spell Corrector BERT based post-processing technique to correct erroneous predictions and further improve the performance of the ASR system.

¹ http://www.cs.cmu.edu/~ytsvetko/jsalt-part1.pdf


The proposed system reduced the WER to 65.54% from the initial 70.65%. We analyzed the system using the test hypotheses and derived many useful insights on the performance of the system as well as the causes of the errors in the hypotheses. We found that the errors produced in the Gujarati language are mainly due to interchanged/mismatched diacritics (‘િ◌’, ‘◌ી’, etc.), consonants (‘ધ’, ‘જ’, ‘ય’, etc.), independents (‘આ’, ‘અ’, ‘ઇ’, ‘ઈ’, etc.), and some homophones.

The remainder of the paper is organized as follows: Section 2 describes the literature survey of ASR system architectures. The proposed approach is described in Section 3. Section 4 covers the experiments conducted, and the observations follow in Section 5. Section 6 provides the conclusion of our work.

2 Related Work

Since the first ASR circuit developed by Bell Laboratories (Davis et al., 1952) in the 1950s, ASR has remained an active area of research. In the early 1960s, (Kenichi et al., 1966) presented a phoneme based speech recognition system which involved the first use of a speech segmenter on different portions of the input utterances. (Vintsyuk, 1968; Sakoe and Chiba, 1978) introduced the concept of a non-uniform time scale for the alignment of speech patterns (dynamic warping). Both of these works led to the Viterbi algorithm (Viterbi, 1967), which has been an indispensable technique in ASR for decades. By the mid-1970s, the basic ideas of applying fundamental pattern recognition technology to speech recognition, based on Linear Predictive Coding (LPC) (Atal and Hanauer, 1971) methods, were proposed by Itakura (Itakura, 1975). CMU's Harpy System (Lowerre and Reddy, 1976) was the first ever system to use a Finite State Network (FSN) to reduce computation for matching in speech recognition. However, methods which optimized the resulting FSN did not come about until the early 1990s (Mohri, 1997), and these were limited to small to medium vocabulary electronic based solutions for ASR.

The earlier approaches of electronics based ASR were eventually replaced by statistical approaches with the introduction of HMM based speech recognition. The basic implementation of an HMM based speech recognition model was first published in 1975 by Baker (Baker, 1975) at CMU. Further work on HMMs continued with the first ever use of HMMs for continuous speech recognition in 1976 (Jelinek, 1976). As the research continued, the HMM model was tried with various machine learning techniques, including the HMM/ANN architecture in 1990 (Bourlard and Wellekens, 1990), the HMM/GMM architecture in 1997 (Rodríguez et al., 1997), and the HMM/SVM architecture in 1998 (Golowich and Sun, 1998). This wave of HMM research continued until the introduction of RNN based approaches in early 2005.

The above approaches had major drawbacks such as:

• They require high task-specific knowledge, e.g., to design the state models for HMMs.

• They require fairly complex parameter tuning as the pipeline contains multiple configurations.

(Graves et al., 2006) introduced a novel method for training RNNs to label unsegmented sequences directly, using CTC, thereby eliminating the above drawbacks and creating an End-to-End ASR system. With further enhancements in algorithms and resources, deep learning based End-to-End ASR systems kept improving and started outperforming traditional ASR systems (Graves and Jaitly, 2014; Hannun et al., 2014). End-to-End ASR systems with an encoder-decoder have shown competitive results (Chan et al., 2016). The RNN encoder-decoder paradigm uses an encoder RNN to map the input to a fixed-length vector and a decoder network to expand the fixed-length vector into a sequence of output predictions (Cho et al., 2014; Sutskever et al., 2014). Adding an attention mechanism to the decoder greatly improves the performance of the system, particularly with long inputs or outputs.

As we saw, End-to-End ASR gives great results but at the cost of the large amount of data required to train it, which is not feasible for low resource languages. According to the Interspeech 2018 Low Resource Automatic Speech Recognition Challenge², TDNN-based systems (Peddinti et al., 2015) are efficient in modelling long temporal context and performed well even in the low-resource setting (Fathima et al., 2018; Pulugundla et al., 2018). Even with a smaller amount of data, with some enhancements, End-to-End systems showed promising results (Billa, 2018).

² https://tiny.cc/Interspeech2018

Figure 1: End-to-End Automatic Speech Recognition Process

Work similar to our approach is presented in the speech recognition primer³ and Gujarati Automatic Speech Recognition⁴. The first approach is based on a combination of CNN (Chua and Yang, 1988) and BiLSTM (Schuster and Paliwal, 1997), while the latter approach uses a combination of 3 Gated Recurrent Units (GRUs). The first approach is designed for English, while the second is for Gujarati.

Our approach differs from the above two as follows:

• The model architecture described in our paper consists of 1 CNN, 3 BiLSTM, and 3 Dense layers.

• We present a more effective approach to decode the output using the prefix beam search algorithm along with a combination of language models.

• Our approach introduces a post-processing technique to further improve the performance of the system.

BERT is a neural network based technique for natural language processing pre-training. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications (Devlin et al., 2018). Multilingual BERT uses a representation that is able to incorporate information from multiple languages (Pires et al., 2019).

³ https://github.com/apoorvnandan/speech-recognition-primer
⁴ https://github.com/niteya-shah/Gujarati-Automatic-Speech-Recognition

3 Proposed Approach

Figure 1 describes the End-to-End speech recognition system proposed in the paper. This section describes the processing involved in the various stages.

3.1 Audio Features

We have used mel-frequency cepstral coefficients (MFCC) (Motlíček, 2002) as features to represent the input audio signal. We convert the input audio signal into MFCCs. The dimension of these features is (Time_Steps, MFCCs), and for a given batch size it is (Batch_Size, Time_Steps, MFCCs). These features serve as the input to the deep learning model.
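As a concrete illustration of this feature pipeline, the following is a minimal sketch of MFCC extraction and batching, assuming the librosa library; the number of coefficients and the sampling rate are illustrative choices, not values reported in the paper.

    # Minimal sketch of MFCC feature extraction for a batch of utterances.
    # n_mfcc and the sampling rate are placeholder values.
    import numpy as np
    import librosa

    def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
        """Return an array of shape (Time_Steps, MFCCs) for one utterance."""
        audio, _ = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return mfcc.T  # -> (Time_Steps, MFCCs)

    def make_batch(wav_paths, n_mfcc=13):
        """Zero-pad utterances to a common length: (Batch_Size, Time_Steps, MFCCs)."""
        feats = [extract_mfcc(p, n_mfcc=n_mfcc) for p in wav_paths]
        max_len = max(f.shape[0] for f in feats)
        batch = np.zeros((len(feats), max_len, n_mfcc), dtype=np.float32)
        for i, f in enumerate(feats):
            batch[i, : f.shape[0]] = f
        return batch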

3.2 Model Architecture

The model architecture incorporates four major components: CNN layer, BiLSTM layers, Dense layers, and CTC. Each component has its own importance, and components like the CNN layer, BiLSTM layers, and Dense layers have to be tuned as per the size of the input data.

Convolutions in the frequency and time domains, when applied to the spectral input features prior to any other processing, can slightly improve ASR performance (Abdel-Hamid et al., 2012; Sainath et al., 2013). Convolution also attempts to model spectral variance due to speaker variability, which is another reason to use a convolution as the first layer (Amodei et al., 2015). We have used a single 1D convolution layer with 200 filters, a ReLU activation function, a kernel size of 11, and a stride of 2. The extracted features are then passed to a deep BiLSTM RNN (Schuster and Paliwal, 1997).

We have used 3 BiLSTM layers, each layer consisting of 200 BiLSTM units (400 LSTM units) with tanh as the activation function of each unit. When provided input from the convolution layer, the BiLSTM layer gives us output of shape (Batch_Size, Convolved_Time_Steps, LSTM_blocks). The extracted features are then passed to a DNN (Hinton et al., 2006), which consists of 3 layers, where the first 2 layers consist of 200 units each. The number of units in the last layer is equal to the number of characters in the language.

Figure 2: Working of the decoding algorithm using WLM and CLM

While training, a common technique for mapping variable-length audio input to variable-length output is the CTC algorithm (Graves et al., 2006) coupled with an RNN. The CTC-RNN model performs well in End-to-End speech recognition with grapheme outputs (Graves and Jaitly, 2014; Hannun et al., 2014; Maas et al., 2014, 2015). Given the network outputs, CTC maximizes the probabilities of the correct labelings. The CTC objective function is differentiable, so the network can be trained with standard backpropagation through time (Werbos, 1990).
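To make the layer configuration above concrete, the following is a minimal Keras sketch of the 1 CNN - 3 BiLSTM - 3 Dense architecture trained with a CTC loss. The padding mode, the exact number of output characters, and the use of TensorFlow are assumptions for illustration, not details taken from the paper.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(n_mfcc=13, n_chars=70):
        # n_chars would be the number of Gujarati output characters plus the CTC
        # blank; both values here are placeholders.
        inputs = layers.Input(shape=(None, n_mfcc), name="mfcc")       # (B, T, MFCCs)
        x = layers.Conv1D(200, kernel_size=11, strides=2, padding="same",
                          activation="relu")(inputs)                   # 1D convolution
        for _ in range(3):                                             # 3 BiLSTM layers
            x = layers.Bidirectional(
                layers.LSTM(200, return_sequences=True, activation="tanh"))(x)
        x = layers.Dense(200, activation="relu")(x)                    # Dense layer 1
        x = layers.Dense(200, activation="relu")(x)                    # Dense layer 2
        logits = layers.Dense(n_chars)(x)                              # per-frame character scores
        return tf.keras.Model(inputs, logits)

    def ctc_loss(labels, logits, label_len, logit_len):
        # CTC loss over (B, T', n_chars) outputs; the blank is taken as the last class.
        return tf.reduce_mean(
            tf.nn.ctc_loss(labels=labels, logits=logits,
                           label_length=label_len, logit_length=logit_len,
                           logits_time_major=False, blank_index=-1))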

3.3 Decoding

We propose an enhanced language model based prefix decoding which uses our custom built 4-gram word-level language model (WLM) and a bi-gram character-level language model (CLM) (Brown et al., 1992). Here we refer to it as Prefix with LMs’. Both of these models were created using the whole of Gujarati Wikipedia⁵. Using prefix decoding with two language models (WLM and CLM) makes decent corrections, as it introduces language constraints at both the word and character scope. We have compared this approach with greedy decoding (Maas et al., 2014) and prefix decoding (Maas et al., 2014).

⁵ https://gu.wikipedia.org/wiki/

As shown in Figure 2, the output from the network is passed through the prefix decoding algorithm, which incorporates a WLM and a CLM. The WLM scores the last word given the prefix word sequence; this score's influence can be controlled by the WLM weight (wlm). Similarly, the CLM scores the last character to be appended given its previous character; this score's influence on the new prefix is controlled by the CLM weight (clm). We incorporated an insertion bonus by multiplying by the count of words in the prefix, to avoid a bias towards shorter prefixes. The influence of the insertion bonus is controlled by beta (β).
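The following is a minimal sketch of how such a CTC prefix beam search could combine the acoustic posteriors with the two language models and the insertion bonus. The exact weighting scheme, the blank index, and the helper functions wlm_score and clm_score are assumptions for illustration, not the authors' implementation.

    from collections import defaultdict

    def ctc_prefix_beam_search(probs, alphabet, wlm_score, clm_score,
                               wlm=1.0, clm=0.5, beta=1.5, beam_width=50, space=' '):
        # probs: (T, C) per-frame character posteriors (e.g., a NumPy array), with
        # index 0 assumed to be the CTC blank; alphabet lists the output symbols
        # with the blank at index 0. wlm_score(words) returns the WLM probability
        # of the last word given its predecessors; clm_score(prev, ch) returns the
        # CLM probability of ch following prev.
        T, C = probs.shape

        def score(item):
            prefix, (p_b, p_nb) = item
            # insertion bonus: weight by the word count of the prefix, raised to beta
            return (p_b + p_nb) * (len(prefix.split()) + 1) ** beta

        beams = {'': (1.0, 0.0)}  # prefix -> (prob ending in blank, prob ending in non-blank)
        for t in range(T):
            new_beams = defaultdict(lambda: (0.0, 0.0))
            for prefix, (p_b, p_nb) in beams.items():
                for c in range(C):
                    p = probs[t, c]
                    if c == 0:  # blank: the prefix stays the same
                        nb_b, nb_nb = new_beams[prefix]
                        new_beams[prefix] = (nb_b + p * (p_b + p_nb), nb_nb)
                        continue
                    ch = alphabet[c]
                    new_prefix = prefix + ch
                    char_bonus = clm_score(prefix[-1] if prefix else space, ch) ** clm
                    nb_b, nb_nb = new_beams[new_prefix]
                    if prefix and ch == prefix[-1]:
                        # a repeated character only extends the prefix after a blank
                        new_beams[new_prefix] = (nb_b, nb_nb + p * p_b * char_bonus)
                        ob_b, ob_nb = new_beams[prefix]
                        new_beams[prefix] = (ob_b, ob_nb + p * p_nb)
                    else:
                        word_bonus = 1.0
                        if ch == space and prefix:
                            # a completed word is rescored by the word-level LM
                            word_bonus = wlm_score(prefix.split()) ** wlm
                        new_beams[new_prefix] = (
                            nb_b, nb_nb + p * (p_b + p_nb) * char_bonus * word_bonus)
            beams = dict(sorted(new_beams.items(), key=score, reverse=True)[:beam_width])
        return max(beams.items(), key=score)[0]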

3.4 Post Processing

We propose a BERT based post-processing technique to improve the output of the speech recognition system. The BERT model is used to correct the spelling of the predicted output words; here we term this technique Spell Corrector BERT. Figure 3 describes the working of the sentence correction algorithm using BERT. For a sentence, we iterate through all the words, and for each word we find its replacements by finding the corresponding zero, one, or two edit substitutes from the Wikipedia corpus. Using the list produced by this approach, we can verify whether the predicted word is correct: if it is, it will already be present in the list and need not be replaced.
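A minimal sketch of the zero/one/two edit candidate generation is given below; it follows the standard edit-distance candidate idea, and the placeholder alphabet and vocabulary are assumptions rather than the authors' exact implementation.

    GUJARATI_CHARS = "અઆઇઈઉઊ..."  # placeholder: the full Gujarati character set would go here

    def one_edits(word, alphabet=GUJARATI_CHARS):
        # all strings one edit (delete, replace, insert, transpose) away from `word`
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
        inserts = [l + c + r for l, r in splits for c in alphabet]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        return set(deletes + replaces + inserts + transposes)

    def replacements(word, vocab):
        # zero-, one-, and two-edit substitutes of `word` present in the corpus vocabulary
        candidates = {word} & vocab
        edits1 = one_edits(word)
        candidates |= edits1 & vocab
        candidates |= {w2 for w1 in edits1 for w2 in one_edits(w1)} & vocab
        return candidates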

Figure 3: Working of Spell Corrector BERT

If the current word is not present in the replacements list, then that word is replaced with [MASK] and sentences are generated by replacing the [MASK] with the word replacements from the replacements list. Further, the sentences are tokenized and passed to the BERT model. As output, BERT returns the list of replaced words and their respective probabilities w.r.t. the sentence. From this list, we select the K word replacements with the highest probability and append this list of word replacements to the output_list. Once all the words are iterated, a final output_list is generated which contains, for each word, a list of at most K replacements. Given a sentence containing 3 words, અમદાવાદર એપોટર્ પર, where અમદાવાદર and એપોટર્ are incorrect, this process gives output_list = [ [અમદાવાદ, અમદાવાદમા, ...], [એરપોટર્, પોટર્, ...], [પર] ]. This output_list is passed to a combinator which generates sentences by combining the words. If a list contains a single word, it is selected as is, and if it is a list of word replacements, then various combinations are produced using the combinator. The output of this process is output_sentences = [અમદાવાદ એરપોટર્ પર, અમદાવાદ પોટર્ પર, અમદાવાદમા એરપોટર્ પર, ...]. To select the best sentence out of all the sentences, a WLM is used for sentence scoring, and the sentence with the highest score is selected as the final_output = અમદાવાદ એરપોટર્ પર.
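As an illustration of the masking and scoring step, the sketch below uses the Hugging Face transformers API with Google's multilingual BERT; the choice of library, the model name, and the single word-piece scoring simplification are assumptions, since the paper only states that a pre-trained multilingual BERT model is used.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    def score_replacements(words, idx, candidates, top_k=3):
        # Mask word `idx` in the sentence and rank the candidate replacements by
        # the probability BERT assigns to them at the masked position.
        masked = words[:idx] + [tokenizer.mask_token] + words[idx + 1:]
        inputs = tokenizer(" ".join(masked), return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos]
        probs = torch.softmax(logits, dim=-1)
        scored = []
        for cand in candidates:
            ids = tokenizer(cand, add_special_tokens=False)["input_ids"]
            if ids:  # simplification: score a candidate by its first word-piece only
                scored.append((cand, probs[ids[0]].item()))
        return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

The combinator step could then be a simple product over the per-word candidate lists (e.g., itertools.product), with the 4-gram WLM scoring each combined sentence to pick the final output.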

4 Experiments

4.1 Dataset

We have used the Microsoft Speech Corpus available for Gujarati⁶, which contains approximately 22,807 training examples and 3,075 testing examples. The dataset contains an eclectic collection of speakers, where the length of a single audio utterance is 6.35±2.33 seconds. Out of the 22,807 training examples, we have used 16,000 utterances (≈28.2 hours) for training and 4,807 utterances (≈8.5 hours) for validation, and all the testing examples, i.e., 3,075 utterances (≈5.4 hours), for inference.

⁶ https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e

Figure 4: Training and validation loss for the model

4.2 Training

The training data, as well as the validation data, was divided into 7 batches, and the model was trained for 24 epochs over 92 hours on a T4 GPU with 16 GB of GPU memory. The model consists of 2,744,676 total parameters. We used the Adam optimizer for gradient descent, and for calculating the loss we used the CTC loss function.
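A minimal training-step sketch with Adam and the CTC loss is shown below, reusing the build_model and ctc_loss helpers sketched in Section 3.2; the batching and length bookkeeping are assumptions for illustration.

    import tensorflow as tf

    model = build_model()
    optimizer = tf.keras.optimizers.Adam()

    @tf.function
    def train_step(mfcc_batch, labels, label_len, logit_len):
        # logit_len: per-utterance output lengths, i.e., the input frame counts
        # reduced by the stride-2 convolution (roughly halved), computed when the
        # batch is built.
        with tf.GradientTape() as tape:
            logits = model(mfcc_batch, training=True)
            loss = ctc_loss(labels, logits, label_len, logit_len)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss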

4.3 Decoding

We have used Gujarati data scraped from Wikipedia containing 2,501,841 words with a vocabulary size of 163,170 words. We used this corpus to create the statistical word-level language model, producing 1,570,614 4-grams. We also used the same corpus to create the statistical character-level language model, with bi-grams for each letter of the Gujarati language. For prefix decoding, the beam width was taken as 50, and all other parameters were decided using cross-validation. The Prefix with LMs’ algorithm recorded a 2.42% decrease in WER w.r.t. the system using the greedy decoding technique.
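A minimal count-based sketch of building these two language models from a tokenized corpus is shown below. The add-one smoothing and the plain-Python counting are assumptions for illustration; the paper does not state which toolkit or smoothing method was used.

    from collections import Counter

    def build_word_lm(sentences, n=4):
        # simple n-gram (default 4-gram) word-level LM with add-one smoothing
        counts, context_counts, vocab = Counter(), Counter(), set()
        for sent in sentences:
            words = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
            vocab.update(words)
            for i in range(n - 1, len(words)):
                counts[tuple(words[i - n + 1:i + 1])] += 1
                context_counts[tuple(words[i - n + 1:i])] += 1
        def prob(context, word):
            # P(word | previous n-1 words)
            ctx = tuple((["<s>"] * (n - 1) + list(context))[-(n - 1):])
            return (counts[ctx + (word,)] + 1) / (context_counts[ctx] + len(vocab))
        return prob

    def build_char_lm(sentences):
        # bi-gram character-level LM with add-one smoothing
        bigrams, unigrams = Counter(), Counter()
        for sent in sentences:
            chars = ["<s>"] + list(sent)
            for prev, cur in zip(chars, chars[1:]):
                bigrams[(prev, cur)] += 1
                unigrams[prev] += 1
        def prob(prev, cur):
            # P(cur | prev)
            return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(unigrams) + 1)
        return prob

The returned probability functions could serve as the wlm_score and clm_score hooks in the prefix beam search sketched in Section 3.3, e.g., wlm_score = lambda words: word_lm(words[:-1], words[-1]).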


Table 1: Sentence and its corresponding output through various decoding techniques.

Actual: અમદાવાદ એરપોટર્ પર સુરક્ષાને લઈ તમામ તૈયારીઓ કરીદેવાઈ છે
Greedy: અમદાવાદાર પર પણ સુરક્ષાઅને લઈને તમમ ટેરે કરીજવોાય છે
Prefix with LMs’: અમદાવાદાર પર પણ સુરક્ષાને લઈને તમમ કેરે કરી જવાયછે
Prefix with LMs’ & Spell Corrector BERT: અમદાવાદ પર પણ સુરક્ષાને લઈને તમામ કેરે કરી જવાયછેે

Table 2: Distribution of single letter error words

Technique | Consonants | Diacritics | Independents
Greedy | 66.41% (1,962) | 28.23% (834) | 5.34% (158)
Prefix with LMs’ | 66.52% (1,824) | 27.97% (767) | 5.50% (151)
Prefix with LMs’ & Spell Corrector BERT | 52.90% (829) | 38.67% (606) | 8.40% (132)

Table 3: Techniques and their corresponding WER

Technique | Word Error Rate (%)
Greedy | 70.65
Prefix without Language Model | 69.95
Prefix with WLM | 69.53
Prefix with CLM | 68.64
Prefix with LMs’ | 68.23
Prefix with LMs’ & Spell Corrector BERT | 65.54


4.4 Post Processing

Spell Corrector BERT is used to further improve the output produced by Prefix with LMs’. We have used a pre-trained BERT multilingual model by Google⁷ combined with a 4-gram WLM as the core components of Spell Corrector BERT. The Spell Corrector BERT algorithm recorded a 2.69% decrease in WER w.r.t. standalone Prefix with LMs’. Table 3 shows the comparison of WER for the different techniques.

⁷ https://github.com/google-research/bert/blob/master/multilingual.md
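For reference, WER is computed from the word-level edit distance between the reference and hypothesis transcriptions; the paper does not name its scoring tool, so the short implementation below is only an illustrative sketch.

    def wer(reference: str, hypothesis: str) -> float:
        # word error rate = word-level edit distance / number of reference words
        ref, hyp = reference.split(), hypothesis.split()
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)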

5 Observations

5.1 Comparison of various decoding and post-processing techniques

We have tested the performance of the decoding technique and post-processing by observing the distribution and frequency of single letter error words. Table 1 shows a sample test sentence as well as the hypotheses generated by our model using the various decoding and post-processing techniques. Table 2 describes the distribution of single-letter error words observed in the different approaches. Here, the count of errors due to consonants/diacritics/independents w.r.t. the total count of single-letter error words in each decoding technique is shown as a percentage. It shows that Prefix with LMs’ and Prefix with LMs’ & Spell Corrector BERT post-processing both help in reducing single letter error words.

Table 2 also shows the counts of single letter error words. From these counts we can conclude that the percentage decrease of the errors in consonants, diacritics, and independents using Prefix with LMs’ is 7%, 8%, and 4% respectively w.r.t. greedy decoding, while using Prefix with LMs’ & Spell Corrector BERT, the percentage decrease of the errors in consonants, diacritics, and independents is 57.74%, 27.00%, and 16.45% respectively w.r.t. greedy decoding.

We observe that, with a smaller number of incorrect characters, Spell Corrector BERT either retains the WER or, in the majority of cases, improves the WER significantly. Table 4 shows sample sentences with different numbers of erroneous words for comparing the performance of Spell Corrector BERT.

5.2 System Analysis

The system analysis is performed on the model hypotheses decoded using Prefix with LMs’ & Spell Corrector BERT. We have evaluated the performance of the proposed model on 3,075 test examples. Out of the total erroneous words, 7.19% have a one letter error in which similar sounding letters are interchanged, e.g., ‘શ’ → ‘સ’, ‘ઇ’ → ‘ઈ’, ‘િ◌’ → ‘◌ી’. The interchange of consonants/diacritics/independents is due to factors like noise, channel variability, speaker variability, anatomy of the vocal tract, speed of speech, regional and social dialects, and homophones (Forsberg, 2003). Any incorrect word in the sentence is replaced on the basis of probability; hence the WLM, CLM, or BERT is not solely responsible for the selection of any word, it is the combination of their probabilities that results in the final output.

5.2.1 Error due to consonants/independents/diacritics

Table 5 displays examples of words which have a single-letter error due to consonants, independents, and diacritics. Ref. denotes the actual word in the sentence, Hyp. denotes the output word, Ref. Freq. shows the count of the reference word in the corpus, Hyp. Freq. is the count of the hypothesis word in the corpus, Ref. → Hyp. shows the character that is replaced, and Type shows the type of error in the inference word. From this table, we can observe that, even when the hypothesis word is infrequent in the corpus, similar sounding letters in the reference word get replaced. This supports the idea that the replacement of similar sounding letters within words is one of the factors inducing errors in the system, irrespective of the words’ frequency in the corpus.

Out of the total 1,567 one letter error words, 52.90% of the errors are due to a single consonant mismatch. The top three incorrectly predicted consonants are ‘ક’, ‘ર’, and ‘ત’, with frequencies 102, 92, and 79 respectively. Together they contribute 32.93% of the total errors due to consonants.

In Figure 5, a connection between two consonants represents the error of interchanging/misplacing these consonants with each other. For example, the connection between ‘શ’ and ‘સ’ indicates that these consonants are generally interchanged/misplaced with each other in a word predicted by the system.

Out of the total 1,567 one letter error words, 8.40% of the errors are due to a single-independent mismatch. The top three incorrectly predicted independents are ‘ઈ’, ‘ઇ’, and ‘અ’, with frequencies 81, 21, and 13 respectively. Together they contribute 87% of the total errors due to independents. We observe that ([‘ઇ’, ‘ઈ’], [‘ઉ’, ‘ઊ’]) are more vulnerable to being misplaced/interchanged.

Out of the total 1,567 one letter error words, 38.67% of the errors are due to a single-diacritic mismatch. The top three incorrectly predicted diacritics are ‘◌ી’, ‘◌ે’, and ‘◌ા’, with frequencies 213, 120, and 76 respectively. We observe that the combined frequency of the diacritics ‘◌ી’ and ‘◌ે’ is greater than the sum of the frequencies of the remaining diacritics, and they constitute 67% of the total incorrectly predicted diacritics.

The outputs ([શરૂ, સરૂ], [સાડા, સાળા], [કોઈ, કોએ]) are interesting cases, as the hypothesis words are not present in the corpus. The predicted output replaced ([‘શ’, ‘ડ’, ‘ઈ’]) with the similar sounding ([‘સ’, ‘ળ’, ‘અ’]) respectively, without any prior knowledge of the word. This is due to the character by character prediction approach of our model.

Figure 5: Interchanging consonants which results in erroneous prediction.


Table 4: Sample sentences for comparing the performance of BERT

Actual: અમદાવાદ એરપોટર્ પર સુરક્ષાને લઈ તમામ તૈયારીઓ કરીદેવાઈ છે

At most one error per word: અમદાવદ એપોટર્ પર સુરક્ષામે લઈ તમાન તૈયરીઓ જરીદેરાઈ છેે
Spell Corrector BERT Output: અમદાવાદ એરપોટર્ પર સુરક્ષાને લઈ તમામ તૈયારીઓ કરીદેવાઈ છેે

At most two errors per word: અમદવા એરપર્ પર સરક્ષને લઈ તમ તૈયરઓ કરી દેવ છે
Spell Corrector BERT Output: અમદાવાદ એરપોટર્ પર સુરક્ષાને લઈ તમે તૈયારીઓ કરી દેવ છે

At least 1 word with error greater than 2: અમદાવાદાર પર પણ સુરક્ષાઅને લઈને તમમ ટેરે કરીજવોાય છે
Spell Corrector BERT Output: અમદાવાદ પર પણ સુરક્ષાને લઈને તમામ કેરે કરી જવાયછેે

Table 5: Examples of words which have a single letter error

Ref. | Hyp. | Ref. Freq. | Hyp. Freq. | Ref. → Hyp. | Type
શરૂ | સરૂ | 282 | 0 | શ → સ | Consonant
ત્યારે | ક્યારે | 377 | 15 | ત → ક | Consonant
ધરાઈ | ધરાઇ | 9 | 15 | ઈ → ઇ | Independent
ઊમેયુ | ઉમેયુ | 2 | 11 | ઊ → ઉ | Independent
પ્રારંભિક | પ્રારંભીક | 4 | 0 | િ◌ → ◌ી | Diacritic
ચૂંટણીમાં | ચુંટણીમાં | 50 | 12 | ◌ૂ → ◌ુ | Diacritic

5.2.2 Error due to homophones

Out of a total of 606 words that had a single diacritic error, only 2.97% of the errors were due to homophones, which is a small fraction of the total number of errors in the inferences from the ASR system. This might be because Gujarati is mostly a phonetic language with only a few exceptions. Also, the number of letters (vowels and consonants) in Gujarati is greater than in English. By reducing diacritic errors, we can resolve errors due to homophones. This helps us understand that our system is not much affected by errors due to homophones. Table 6 shows the incorrectly predicted homophones.

5.2.3 Effect of word frequency on error

To understand the effect of frequency on error, we calculated the frequency of the predicted words in the test dataset. We categorized the words into three different categories based on the correctness of the words with reference to their occurrence in the test dataset. The three categories are:

• ACPW: Words that are always predicted correctly.

• AIPW: Words which are always predicted incorrectly.

• CAIPW: Words which are predicted correctly as well as incorrectly at times.

The counts of ACPW, AIPW, and CAIPW are 1,069, 7,809, and 1,604 respectively. The mean frequencies/occurrences of ACPW, AIPW, and CAIPW are 1.11, 1.34, and 15.38 respectively. Words which are ACP, AIP, and CAIP in testing appear in training with mean frequencies/occurrences of 4.75, 4.86, and 77.81 respectively, with counts of 857, 5,764, and 1,594 respectively. This gives us a rationale for the fact that our system is able to learn from the utterances shown in training and can infer unseen examples too.
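A minimal sketch of this word-level categorization over aligned (reference, hypothesis) word pairs is given below; the alignment step itself (e.g., by edit-distance backtracking) and the pair format are assumptions for illustration.

    from collections import defaultdict

    def categorize_words(aligned_pairs):
        # aligned_pairs: iterable of (reference_word, hypothesis_word) tuples
        right, wrong = defaultdict(int), defaultdict(int)
        for ref, hyp in aligned_pairs:
            if ref == hyp:
                right[ref] += 1
            else:
                wrong[ref] += 1
        acpw = {w for w in right if w not in wrong}    # always predicted correctly
        aipw = {w for w in wrong if w not in right}    # always predicted incorrectly
        caipw = set(right) & set(wrong)                # sometimes correct, sometimes not
        return acpw, aipw, caipw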


Table 6: Examples of words which are incorrectly predicted homophones

Reference Word | કતાર્ (Actor) | રિવ (Sun) | પીતા (Drinking)
Hypothesis Word | કરતા (Than) | રવી (Winter Crop) | પતા (Father)

Table 7: Sample words which are predicted correctly as well as incorrectly

Words in Testing | Total Count | Wrong Count | Right Count | Correctness (%)
આજે | 81 | 32 | 49 | 60.49
રાજકોટના | 2 | 1 | 1 | 50.00
તાકાત | 2 | 1 | 1 | 50.00
વિભાગ | 7 | 4 | 3 | 42.86
આવે | 67 | 24 | 43 | 64.18
Average | | | | 53.50

We also analyzed the correctness of the words from the set CAIPW. On average, 8.32 out of the 15.38 mean occurrences of a CAIPW word are correct and the remaining 7.06 out of 15.38 are incorrect. Table 7 shows some examples of words predicted correctly as well as incorrectly, along with their degree of correctness. This gives an explanation for how words are predicted both correctly and incorrectly in roughly the same proportion.

5.2.4 Error due to half-conjugates

This type of error occurs due to a mismatch in the speed of utterance. A fast-conjugate error occurs when a word is uttered too quickly but the hypothesis word is predicted slow, e.g., (મુખજીર્ → મુખરજી). When a word is uttered slowly but the hypothesis word is predicted fast, this type of error is called a slow-conjugate error (ગયો → ગ્યો). Out of the total erroneously predicted words, 2.4% have a half-conjugate error, and out of those 2.4%, 10% consist of purely half-conjugate erroneous words. This justifies the fact that the error due to half-conjugates is trivial, and thus the variation in speaker speed is not a significant factor in the errors that occur in inference by our system.

6 Conclusion

In this paper, we have presented an End-to-End speech recognition system for Gujarati. We propose a prefix decoding technique that uses two language models to improve the performance of the system. We have also used a BERT based spelling corrector model in a post-processing step to further improve the performance. We observe that the proposed approach reduces the overall WER by 5.11%.

While deep learning models require a lot of training data for better results, in this paper we showed that we can improve the performance without increasing the training data. This is particularly important for a resource-constrained language like Gujarati. We are optimistic that with an increase in data our optimizations would perform even better.

We explored and analyzed the inferences from our ASR system to gain key insights, which include checking the correctness of word errors due to consonants, diacritics, independents, half-conjugates, and homophones. These insights can help in understanding our ASR system for a particular language (Gujarati) and can also guide improvements to ASR systems for other low resource languages.

Acknowledgement

The Param Shavak supercomputer was used to perform some experiments of this research work. We are thankful to GUJCOST and the Department of Science and Technology, GoG, for establishing the supercomputer facility.

References

O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn. 2012. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4277–4280.

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse H. Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. 2015. Deep speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595.

Bishnu S. Atal and Suzanne L. Hanauer. 1971. Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637–655.

J. Baker. 1975. The DRAGON system: an overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):24–29.

Jayadev Billa. 2018. ISI ASR system for the low resource speech recognition challenge for Indian languages. Pages 3207–3211.

H. Bourlard and C. J. Wellekens. 1990. Links between Markov models and multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(12):1167–1178.

Peter F. Brown, Vincent J. Della Pietra, Peter V. Desouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

L. O. Chua and L. Yang. 1988. Cellular neural networks: theory. IEEE Transactions on Circuits and Systems, 35(10):1257–1272.

Ken H. Davis, R. Biddulph, and Stephen Balashek. 1952. Automatic recognition of spoken digits. The Journal of the Acoustical Society of America, 24(6):637–642.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Noor Fathima, Tanvina Patel, C. Mahima, and Anuroop Iyengar. 2018. TDNN-based multilingual speech recognition system for low resource Indian languages. In INTERSPEECH, pages 3197–3201.

Markus Forsberg. 2003. Why is speech recognition difficult.

Steven E. Golowich and Don X. Sun. 1998. A support vector/hidden Markov model approach to phoneme recognition. In Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong, pages 125–130.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Volume 2006, pages 369–376.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1764–II–1772. JMLR.org.

Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567.

Geoffrey E. Hinton, Simon Osindero, and Y. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.

Fumitada Itakura. 1975. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72.

F. Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4):532–556.

Maeda Kenichi, Sakai Toshiyuki, and Doshita Shuji. 1966. Phonetic typewriter system. US Patent 3,265,814.

B. Lowerre and R. Reddy. 1976. The Harpy speech recognition system: performance with large vocabularies. The Journal of the Acoustical Society of America, 60(S1):S10–S11.

Andrew Maas, Ziang Xie, Dan Jurafsky, and Andrew Ng. 2015. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 345–354, Denver, Colorado. Association for Computational Linguistics.

Andrew L. Maas, Awni Y. Hannun, Daniel Jurafsky, and Andrew Y. Ng. 2014. First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. CoRR, abs/1408.2873.

Mehryar Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311.

Petr Motlíček. 2002. Feature extraction in speech coding and recognition. Technical Report of PhD research internship in ASP Group, OGI-OHSU, <http ….

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? CoRR, abs/1906.01502.

Bhargav Pulugundla, Murali Karthick Baskar, Santosh Kesiraju, Ekaterina Egorova, Martin Karafiát, Lukás Burget, and Jan Cernockỳ. 2018. BUT system for low resource Indian language ASR. In INTERSPEECH, pages 3182–3186.

Elena Rodríguez, Belén Ruíz, Ángel García-Crespo, and Fernando García. 1997. Speech/speaker recognition using a HMM/GMM hybrid model. In Audio- and Video-based Biometric Person Authentication, pages 227–234, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tara Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. 2013. Deep convolutional neural networks for LVCSR. Pages 8614–8618.

Hiroaki Sakoe and Seibi Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Taras K. Vintsyuk. 1968. Speech discrimination by dynamic programming. Cybernetics, 4(1):52–57.

Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

P. J. Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.