NATLID: Native Language Identification

Ankita Bihani
Stanford University

Anupriya Gagneja
Stanford University

Mohana Prasad Sathya Moorthy
Stanford University

{ankitab,anupriya,mohanas}@stanford.edu

Abstract

Native Language Identification is the task of identifying the native language of a user based on a sample of their writing, their speech, or both. This project presents several models for native language identification from spoken and/or written responses, using the dataset released by ETS. In this paper, we compare several approaches to this task and build several models: an RNN over words, a deep RNN with fully connected layers, a CNN over characters, a CNN over words, deeper CNNs, and an RNN over a CNN. Among these, the CNN over words model worked best. We therefore tried several approaches to iteratively improve it, such as adding pre-trained GloVe word vectors and normalizing i-vectors. Finally, we used an ensemble of all the models, which gave a further bump in precision. Our best results were obtained with this ensemble, which achieved a precision of 62.07 for the essay task, 52.33 for the speech task, and 84.12 for the speech task with i-vectors.

1 Introduction

In this project, we worked on the native language identification task, which can be modelled as a sequence classification task. Sequence classification is a predictive modeling problem where the input is a sequence over space or time and the task is to predict a category for the sequence. What makes this problem challenging is that the sequences can vary in length, can be composed of a very large vocabulary of input symbols, and may require the model to learn long-term context or dependencies between symbols in the input sequence.

Non-native speakers show different degrees of reading competence and pronunciation (Mengel, 1993); that is, their knowledge of the grapheme-to-phoneme conventions of the foreign language may vary a lot, as may their ability to pronounce sounds that are not part of their native sound inventory. Similarly, non-native speakers construct their sentences in semantically different ways. Hence, it is interesting to solve this problem and tackle these challenges.

2 Motivation

Most of the work in the area of Native Language Identification has focused on identifying the native language of writers learning English as a second language. This is a challenging task even for humans; hence it is interesting to see how well a machine can infer a connection between a person's native language and his/her spoken or written responses. It is also interesting to analyze whether the written or the spoken language component correlates more strongly with the native language of the speaker. The task is typically framed as a multi-class classification problem where the set of languages is known a priori.

This problem is also exciting because it allows us to build a system that can be used anywhere automatic native language identification is useful. Among its many applications, such a system could draw meaningful analytics from forum interactions and social networks, or support opinion mining for people with common native-language ties.

Another potential application of NLI is in the field of forensic linguistics (Gibbons, 2003), a juncture where the legal system and linguistic stylistics intersect (Prakasam, 2004). In this context, native language identification can be used as a tool for authorship profiling (Grant, 2007), providing evidence about the linguistic background of an author.


3 Related Work

We reviewed several research works in this area and present the most relevant papers below.

In (Mitra et al., 2005), the authors provide a unique approach to solving a multiclass classification problem for a sequence of text. They present a model integrating a recurrent neural network and a least-squares support vector machine for classifying document titles into predetermined categories. They implement a system based on this Neuro-SVM model, using Latent Semantic Indexing (LSI) to generate probabilistic coefficients from document titles as input to the system. On a corpus of 96,956 words from the University of Denver's Penrose Library catalogue, the system's accuracy was remarkable.

The paper (Arevian, 2007) explores the application of recurrent neural networks to the task of robust text classification. It presents research on the capability of extended simple recurrent neural network models (xRNN) to classify real-world news titles from the well-known Reuters-21578 corpus. The results demonstrate that these recurrent networks can be a viable addition to the many techniques used in web intelligence for tasks such as context-sensitive email classification and website indexing.

In (Kim, 2014), the author shows that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. A series of experiments is reported with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks. This work is similar in philosophy to (Sharif Razavian et al., 2014), which showed that for image classification, feature extractors obtained from a pre-trained deep learning model perform well on a variety of tasks, including tasks very different from the one for which the feature extractors were originally trained.

In (Ma and Hovy, 2016), the authors introduce a novel neural network architecture that benefits from both word- and character-level representations by combining a bidirectional LSTM, a CNN, and Conditional Random Fields (CRF). Their system is unique in that it does not require large amounts of task-specific knowledge in the form of handcrafted features and data pre-processing, making it applicable to a wide variety of sequence labelling tasks. Previous studies have shown that CNNs are effective at extracting information from the characters of words and encoding it into neural representations. Figure 1 shows the CNN architecture used in that paper to extract the character-level representation of a given word.

Figure 1: Character-level representations of words, from (Ma and Hovy, 2016). Dashed arrows indicate a dropout layer applied before character embeddings are input to the CNN.

In (Abu-Jbara et al., 2013), the authors took on the Native Language Identification shared task on that year's ETS dataset. They achieved an accuracy of 43% on the test data and improved it to 63% with feature normalization. For their model, they trained an SVM classifier on a set of features extracted from the training data, normalizing each feature value by dividing it by the (max - min) of the respective feature. Table 1 lists the features they used. Although their model seems to capture a lot of relevant information for the task using hand-crafted features, in this project we intend to learn such feature representations automatically using neural networks.

Table 1: Features used by (Abu-Jbara et al., 2013) in their model for the Native Language Identification task

Character and Word N-grams      Missing Punctuation
Part-of-Speech N-grams          Average Number of Syllables
Function Words                  Arc Length
Use of Punctuation              Downtoners and Intensifiers
Number of Unique Stems          Production Rules
Misuse of Articles              Subject Agreement
Capitalization                  Words per Sentence
Tense and Aspect Frequency      Topic Scores

4 Dataset

For this project, we use the Educational Testing Service (ETS) dataset, which includes test responses from 13,200 test takers (one essay and one spoken-response transcription per test taker) covering 11 native languages (L1s), with 1,200 test takers per language. The 11 native languages covered by the corpus are: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Since the audio files are not provided for this task, we use the i-vectors provided along with the dataset to model a more realistic sense of the performance of a speech-based NLI system.

5 Evaluation Metric

We evaluate all our models' performance using precision scores on the following three tasks:
1. Essay Task: the identification of an individual's native language based on an essay written by him/her in English.
2. Speech Task with transcriptions only: the classification of an individual's native language based on an English spoken response.
3. Speech Task with transcriptions and i-vectors: the classification of an individual's native language based on an English spoken response, using both transcriptions and i-vectors.
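As a minimal sketch of the evaluation (the paper reports precision per task; averaging each of the 11 classes equally is our assumption, and the function below is illustrative):

# Hypothetical sketch: averaged precision over the 11 L1 classes,
# assuming y_true / y_pred are sequences of integer language labels.
from sklearn.metrics import precision_score

def evaluate(y_true, y_pred):
    # average="macro" weights each of the 11 native languages equally,
    # which matches the balanced 1,200-per-language corpus
    return precision_score(y_true, y_pred, average="macro")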

6 Approaches / Methodology

6.1 Baseline model: Linear SVC

The baseline model for speech-transcript and written-essay classification trains a linear support vector classifier (using the scikit-learn library) on unigram features computed from the tokenized versions of the data. i-vectors are incorporated by concatenating them to the unigram features. The results of the baseline model are presented in Table 2 below.

Table 2: Average Precision, Recall and F1 scores for the baseline model (baseline given by ETS for the Native Language Identification task)

Task                             Precision   Recall   F1
Essay Task                       0.72        0.72     0.72
Speech Task without I-Vectors    0.52        0.52     0.52
Speech Task with I-Vectors       0.76        0.76     0.76
Fusion Task without I-Vectors    0.75        0.75     0.75
Fusion Task with I-Vectors       0.79        0.78     0.78
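A minimal sketch of this baseline (the vectorizer settings and variable names train_texts, train_ivecs, labels are illustrative assumptions):

# Sketch of the ETS-style baseline: unigram counts + LinearSVC, with
# i-vectors concatenated to the unigram features for the i-vector variant.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer(ngram_range=(1, 1))   # unigram features
X_text = vectorizer.fit_transform(train_texts)

# i-vector variant: append the dense i-vectors to the sparse unigram counts
X = hstack([X_text, np.asarray(train_ivecs)])

clf = LinearSVC()
clf.fit(X, labels)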

6.2 Recurrent Neural Network over words

One of the key advantages of using a recurrent neural network over traditional classifiers is that it eliminates the need for human-crafted features and automatically captures contextual information when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. A recurrent neural network incorporates dynamic temporal behavior and can use its hidden layer to process and capture arbitrary sequence inputs.

We use an LSTM encoder layer to encode all the information in the text and feed the output of the LSTM to a feedforward neural network for classification. In this approach, we build a vocabulary of words from the training data and also learn an embedding matrix for representing each of these words during the training phase. Since backpropagating over very long sequences is hard, we consider only the first 150 words of each data point for our prediction.
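A minimal sketch of this model (the framework, PyTorch, and the embedding and hidden sizes are illustrative assumptions; the 150-word truncation is as described above):

# LSTM over words: embed the first 150 word ids, encode with an LSTM,
# and classify the final hidden state with a feedforward layer.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # learned from scratch
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)         # feedforward classifier

    def forward(self, tokens):            # tokens: (batch, 150) word ids
        emb = self.embed(tokens)
        _, (h_n, _) = self.lstm(emb)      # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])           # logits over the 11 L1 classes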

We experimented with various hidden-unit sizes, embedding dimensions, etc., and the best results we arrived at are shown in Table 3 below. In our initial experimental runs, the results were worse than the baseline classifier (SVM over unigram features) for both speech-transcript and essay classification. However, with extensive hyperparameter tuning we were able to get better results with this model, as shown in Table 3.

Table 3: Average Precision scores for the RNN over words model

Task                             Precision
Essay Task                       48.65
Speech Task without I-Vectors    46.19
Speech Task with I-Vectors       63.48

6.3 Deep RNN with fully connected layers

To improve our RNN model further, we stacked layers of LSTM cells to help the model learn a better representation of the text. We experimented with 1-, 2- and 3-layer LSTMs, and also added more fully connected layers to improve classification performance. We observed that the best accuracy was achieved with a 2-layer RNN and a 2-layer FC head. The results obtained with this model are summarized in Table 4 below.


Table 4: Average Precision for the Deep RNN with fully connected layers model

Model                        Essay task   Speech task
2 layer RNN                  49.87        47.98
3 layer RNN                  47.30        45.74
2 layer RNN + 2 layer FC     49.91        49.42
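A sketch of the best variant above, the 2-layer LSTM with a 2-layer FC head (PyTorch and the layer sizes are illustrative assumptions):

# Stacked LSTM + two fully connected layers; the top layer's final
# hidden state feeds the classification head.
import torch.nn as nn

class DeepLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=2, batch_first=True)  # 2-layer LSTM
        self.fc = nn.Sequential(                             # 2-layer FC head
            nn.Linear(hidden_dim, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return self.fc(h_n[-1])   # final hidden state of the top LSTM layer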

6.4 CNN over characters

From our literature survey, we concluded that character-level CNNs give good performance on many related tasks in this domain (Jozefowicz et al., 2016; Zhang et al., 2015; Hwang and Sung, 2016), so we decided to explore this route as well. Our alphabet set was composed of the characters 'a' to 'z', 'A' to 'Z' and the numerals 0-9. To capture special discriminative features, we used convolutional filters of different sizes, experimenting with different numbers of filters per filter size and with maxpooling layers. We concatenated the outputs from the different filters and used a fully connected layer to make the final predictions. In this model as well, we learnt our own embeddings for the alphabet. We also observed that larger filter sizes work well for this model, probably because filters of size 6 or 7 capture whole words (close to the average length of an English word) while even larger filters capture word interactions. We believe that a deeper CNN with smaller filter sizes could achieve similar results, as its field of view would be better than that of the current model.
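As a small illustration of the character-level input pipeline (the padding id and fixed length are our assumptions):

# Character vocabulary for the char-level CNN: 'a'-'z', 'A'-'Z' and 0-9,
# with id 0 reserved for padding (an assumption).
import string

alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits
char_to_id = {ch: i + 1 for i, ch in enumerate(alphabet)}

def encode(text, max_len=1000):
    # Map characters to ids, dropping out-of-alphabet characters and
    # padding/truncating to a fixed length for batching.
    ids = [char_to_id[ch] for ch in text if ch in char_to_id][:max_len]
    return ids + [0] * (max_len - len(ids))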

The results of this model are shown in Table 5 below. We observe that character-level models perform poorly compared to our word-based models. Our conjecture is that character-based models may be suitable for short-text classification tasks (like predicting the sentiment of chat messages) but not for native language identification from lengthy essays or spoken responses, as we may require a great many filters to exhaustively capture all the important features in a response.

Table 5: Average Precision for the CNN over characters model

Task                             Precision
Essay Task                       27.0
Speech Task without I-Vectors    21.5
Speech Task with I-Vectors       30.1

6.5 Convolutional Neural Network over words

Since our character-level CNNs performed worse than all our models that used word-level representations, we decided to explore a CNN over words. CNNs have been found to be very effective in solving many active research problems in this domain. One reason CNNs have surpassed traditional machine learning approaches is their ability to automatically learn values for their filters based on the task we want the model to perform. Although it is hard for a human to come up with such rules/filters qualitatively, it may be easier for a CNN to learn them from data. We can think of each filter as looking for a specific composition of words in the sentence; filters at higher layers capture more complex compositions and trends that differentiate the essay writing or speech transcriptions of speakers with different native languages. Convolutional networks are also faster and have better feature representations than traditional n-gram models.

Our best model uses a convolutional neural network with a single convolution layer. We experimented with various filter sizes and found that filter sizes of 2 and 3, with 1024 filters per filter size, performed best for our task. We use a stride of 1 for the convolution layer, followed by a ReLU activation and a maxpool layer that takes the maximum over the entire region, giving one value per filter. Since we have 1024 * 2 filters overall (size-2 and size-3 filters), we end up with a 2048-element vector. This is followed by a single-layer feedforward network with 11 output units. We use dropout and train our model for 20 epochs with a batch size of 128, using cross-entropy loss and the Adam optimizer. Figure 2 shows the detailed architecture of our model. The results of this model are summarized in Table 6 below.
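A minimal sketch of this architecture (assuming PyTorch; the embedding dimension and dropout rate are illustrative, while the filter sizes and counts, output units, optimizer, and learning rate follow the description above and Section 7.2):

# CNN over words: filter sizes 2 and 3 with 1024 filters each, ReLU,
# max-over-time pooling into a 2048-d vector, dropout, and one FC layer
# with 11 outputs, trained with cross-entropy loss and Adam.
import torch
import torch.nn as nn

class WordCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, n_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, 1024, kernel_size=k, stride=1)
            for k in (2, 3)])
        self.dropout = nn.Dropout(0.5)             # dropout rate assumed
        self.fc = nn.Linear(2 * 1024, n_classes)   # 2048-element pooled vector

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        emb = self.embed(tokens).transpose(1, 2)   # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(emb)).max(dim=2).values   # max over time
                  for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = WordCNN(vocab_size=20000)                  # vocabulary size assumed
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)  # lr from Sec. 7.2
loss_fn = nn.CrossEntropyLoss()   # trained for 20 epochs, batch size 128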

In our experiments, we find that smaller filter sizes (2, 3) give better results than larger ones. Our conjecture is that people with a common native-language tie may construct short sequences in similar ways (such as the positioning of certain articles), but may not overlap much on longer sequences.


Figure 2: Architecture diagram of our Convolutional Neural Network over words model

Figure 3: We incrementally improved our model by adding more layers on top of the core CNN over words model.

6.6 Deeper CNNs

We also experimented with deeper CNNs with smaller convolutional filters. For instance, two size-3 filters stacked on top of each other have a field of view similar to a single size-7 filter but use far fewer parameters. This is probably why deep, narrow networks work better than shallow, broad ones. We tried 2-layer CNNs and multi-layer fully connected heads; however, these did not improve performance significantly. Figure 3 shows how we incrementally experimented with more layers on top of the core CNN over words model to improve the final model.

Table 6: Average Precision for the CNN over words model

Task                             Precision
Essay Task                       61.6
Speech Task without I-Vectors    51.5
Speech Task with I-Vectors       83.2

6.7 RNN over CNN

To further improve performance, we used a CNN with maxpooling to reduce the sequence length and then ran an RNN over the result (see the sketch after Table 7). This performed better than the vanilla RNN model but was not able to beat our earlier CNN model. The reason could be that reaching optimal results would require extensive hyperparameter tuning, a non-trivial task given the number of hyperparameters in this model. The results for this model are shown in Table 7 below.

Table 7: Average Precision for the RNN over CNN model

Task                             Precision
Essay Task                       47.30
Speech Task without I-Vectors    45.57
Speech Task with I-Vectors       58.1
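A sketch of the CNN-then-RNN pipeline described above (PyTorch; the filter count, kernel size, pooling width, and hidden size are illustrative assumptions):

# Convolution + pooling halves the sequence length before the LSTM,
# so the recurrence runs over a shorter sequence of local features.
import torch
import torch.nn as nn

class CNNThenRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, n_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)                 # halves the sequence length
        self.lstm = nn.LSTM(256, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):
        x = self.embed(tokens).transpose(1, 2)      # (batch, embed_dim, seq)
        x = self.pool(torch.relu(self.conv(x)))     # shorter feature sequence
        _, (h_n, _) = self.lstm(x.transpose(1, 2))  # back to (batch, seq, feat)
        return self.fc(h_n[-1])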

6.8 Using pre-trained word vectors: GloVe

A problem we anticipated with our model was that we had only about 11,000 data points, which might be too few to learn good word embeddings for our vocabulary. We therefore used pre-trained GloVe word vectors (Pennington et al., 2014) to initialize our word embeddings and then fine-tuned them on our data and task. We also tried keeping the word embeddings static, using the GloVe vectors as-is. However, neither variant improved our model's performance significantly, and both were computationally very expensive. Table 8 shows the results obtained by using GloVe vectors with the CNN over words model.
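A sketch of the GloVe initialization (glove_path, word_to_id, and the out-of-vocabulary initialization range are illustrative assumptions):

# Build an embedding matrix from a GloVe text file; rows for words not in
# the GloVe vocabulary stay randomly initialized.
import numpy as np
import torch

def load_glove(glove_path, word_to_id, embed_dim=100):
    weights = np.random.uniform(-0.05, 0.05, (len(word_to_id), embed_dim))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in word_to_id:
                weights[word_to_id[word]] = np.asarray(vec, dtype=np.float32)
    return torch.tensor(weights, dtype=torch.float)

# model.embed.weight.data.copy_(load_glove("glove.6B.100d.txt", word_to_id))
# Setting model.embed.weight.requires_grad = False keeps the vectors static.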

Table 8: Average Precision for the CNN model + GloVe pre-trained vectors

Task                             Precision
Essay Task                       61.9
Speech Task without I-Vectors    51.56
Speech Task with I-Vectors       83.56

6.9 Leveraging i-vectors

Our CNN model performed well, but we still wanted to leverage the i-vectors to make up for the lack of an audio component in our Speech Task. i-vectors (Dehak et al., 2011; Martínez et al., 2011) convey information such as speaker characteristics, transmission channel, acoustic environment, and the phonetic content of speech segments. i-vector extraction can be seen as a probabilistic compression process that reduces the dimensionality of speech-session supervectors according to a linear-Gaussian model. i-vectors were initially introduced for speech recognition but are now used widely for speaker identification, language recognition, etc. Before the final fully connected layer, we concatenate the i-vectors to the output of the hidden states coming from the RNN or CNN network.

Using the i-vectors indeed made a large difference in the model's performance, which is also evident in the baseline results: the baseline precision for the Speech Task using only the transcriptions is 0.52, as against 0.76 when the i-vectors are included. Similarly, for our model, adding the i-vectors improved the precision on the Speech Task from 51.5 to 76.9. Further, on normalizing the i-vectors, the precision for the speech task with i-vectors jumped to 83.2, significantly higher than the SVM baseline.
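A sketch of the fusion step (the helper, L2 normalization as the specific normalization, and the dimensions are illustrative assumptions; the fc layer must accept 2048 + ivec_dim inputs):

# Normalize each i-vector and concatenate it with the pooled CNN features
# before the final fully connected layer.
import torch
import torch.nn.functional as F

def fuse(text_feats, ivecs, fc):
    # text_feats: (batch, 2048) pooled CNN output; ivecs: (batch, ivec_dim)
    ivecs = F.normalize(ivecs, p=2, dim=1)   # normalization gave the big jump
    return fc(torch.cat([text_feats, ivecs], dim=1))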

6.10 Ensemble model

For a final performance boost, we used an ensemble of all the previously trained models, combining their individual predictions with a weighted voting scheme in which each model's vote is proportional to its accuracy (a sketch follows Table 9). The results are shown in Table 9 below.

Table 9: Average Precision for the Ensemble model

Task                             Precision
Essay Task                       62.07
Speech Task without I-Vectors    52.33
Speech Task with I-Vectors       84.12
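A minimal sketch of the weighted-voting scheme (the function and its inputs are illustrative, not the exact code used):

# Combine per-model class probabilities with weights proportional to each
# model's validation accuracy, then take the highest-scoring class.
import numpy as np

def ensemble_predict(model_probs, model_accuracies):
    # model_probs: list of (n_examples, 11) probability arrays, one per model
    weights = np.asarray(model_accuracies) / np.sum(model_accuracies)
    combined = sum(w * p for w, p in zip(weights, model_probs))
    return combined.argmax(axis=1)   # predicted L1 class per example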

7 Experimental Analysis

7.1 Effect of changing the number of filters per filter size and the filter sizes

The number of filters equals the number of neurons/kernels, so changing this parameter affects the results significantly. As each filter convolves over the input, it multiplies its values with the original features of the input; the result of applying a filter to the input is called a feature map.

From our experiments, we observed that smaller filter sizes worked better in most cases. This is intuitive because, with smaller filter sizes, the patterns common to a certain native language are more likely to be captured in a convolution filter of size 2 or 3, as explained in Section 6.5. As we increase the filter size, we decrease the chance of capturing patterns unique to a particular native language. The effect of changing filter sizes is shown in Figure 4.

Figure 4: Plot showing the effect of changing filter sizes and the number of filters per filter size

7.2 Effect of changing the learning rate

The learning rate controls the size of the steps the model takes towards convergence. A small learning rate means the system takes "baby" steps: it makes fewer mistakes but takes much longer to train. With a larger learning rate, the model takes bigger steps and learns faster, but it is more prone to taking big wrong steps; too high a learning rate gives the system too much kinetic energy, and the parameter vector bounces around chaotically, unable to settle into the deeper but narrower parts of the loss function or to reach the global optimum. Hence, we need a stable learning rate that neither makes the model slow nor traps it in a local minimum. We ran several experiments to understand the effect of the learning rate on our model; the optimal learning rate for most of our models was 2e-3. The effect of changing the learning rate is shown in Figure 5.

Figure 5: Plot showing the effect of changing the learning rate of the model

7.3 Effect of using different input types to the model

In this section, we present an analysis of how a model's performance changes with different types of inputs:

1. Essay / written response only: Feeding in just the written responses of test takers, we get an average precision of 62.07% with our best model. It is evident from the results that written text provides a number of reliable cues for native language identification, such as the grammar and spelling idiosyncrasies typical of non-native English writers. Transfer of linguistic knowledge from one's native language into English yields patterns that can be captured well for this task.

2. Speech transcriptions only: Feeding in just the speech transcriptions, we get a precision of 52.33% with our best model. This is because the speech transcriptions alone do not encode information such as speaker characteristics, transmission channel, acoustic environment, or the phonetic content of speech segments.

3. Speech transcriptions with i-vectors: Normalizing the i-vectors and adding them to the speech transcriptions, we get a precision of 84.12% with our best model.

7.4 Effect of changing the model

In this section, we compare how the different models perform relative to each other on the native language identification task. From our experiments, the best model, which significantly outperformed all the others, is the (deep) CNN over words, followed by the deep RNN, then the RNN over CNN, and lastly the CNN over characters. Figure 6 lists the various models in decreasing order of performance; the average precision across all the tasks decreases as we go down the pyramid. Figure 7 shows the precision of the different models.

Figure 6: Performance of our models arranged in decreasing order of average precision across all the tasks

Figure 7: Performance of our models across all the tasks

8 Conclusion and Analysis

One of the interesting observations from our experiments is that the accuracy of native language identification from written responses is better than from spoken responses. This was contrary to what we had originally expected: the spoken response, in our understanding, should have had a higher correlation with the native language of the speaker, because written responses are expected to be more formal and well thought out in a timed test environment compared to the spoken responses. However, based on manual error analysis, we concluded that this could be because an essay correlates strongly with a person's vocabulary: people tend to use simpler words when conversing, whereas when they have longer to construct an essay they tend to use more complex words from their vocabulary, the choice of which is strongly influenced by their native language. Also, since essays are lengthier than the spoken responses, they encode more information about the semantics of the constructions.

Another observation, from analyzing the confusion matrix of the baseline results, was that the most misclassification happens between Hindi and Telugu, which makes sense intuitively because both languages share a common country of origin and thus influence the English speaking/writing styles of their speakers similarly. This is observed in both written and spoken responses. We also find higher misclassification between Spanish and French. It is interesting to see how the analytics cluster together people in the non-English-speaking communities with similar native languages. Since English essays and speech transcripts from people with similar native languages are harder to classify, we can conclude that one's native language has a strong impact on the way a person thinks, speaks, and writes in a non-native language. Figure 8 shows the confusion matrix on the speech with i-vectors task for our CNN over words model.

Figure 8: Confusion matrix for the speech with i-vectors task, using the CNN over words model

The results also show that clearly distinct languages, such as Arabic and Korean, French and Telugu, or Turkish and German, are almost never misclassified. Given that languages like Chinese and Telugu, or German and Turkish, are in stark contrast and have no common elements, we believe our model learns to identify patterns that correctly classify these very different languages. Hence, we conclude that our model generalizes quite well.

Also, as expected, adding i-vectors (which encode the speaker's characteristics, the phonetic content of speech segments, the acoustic environment, etc.) gives significantly higher performance than using just the speech transcripts.

As a potential and interesting next step, we propose combining the written and spoken responses, a "fusion" of the two, to predict the native language of a given user more precisely.

Overall, native language identification is a high-impact problem with many nuances and hard-to-solve issues, but there are many areas to work on that can yield better systems overall, and new advances show a lot of promise for where these systems can go.

References

Amjad Abu-Jbara, Rahul Jha, Eric Morley, and Dragomir Radev. 2013. Experimental results on the native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 82-88.

Garen Arevian. 2007. Recurrent neural networks for robust real-world text classification. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, pages 326-329.

Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19(4):788-798.

John Gibbons. 2003. Forensic Linguistics: An Introduction to Language in the Legal System.

Tim Grant. 2007. Quantifying evidence in forensic authorship analysis. International Journal of Speech, Language & the Law 14(1).

Kyuyeon Hwang and Wonyong Sung. 2016. Character-level language modeling with hierarchical recurrent neural networks. arXiv preprint arXiv:1609.03777.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

David Martínez, Oldřich Plchot, Lukáš Burget, Ondřej Glembek, and Pavel Matějka. 2011. Language recognition in ivectors space. In Proceedings of Interspeech, Firenze, Italy, pages 861-864.

Andreas Mengel. 1993. Transcribing names - a multiple choice task: mistakes, pitfalls and escape routes. In Proc. 1st ONOMASTICA Research Colloquium, pages 5-9.

Vikramjit Mitra, Chia-Jiu Wang, and Satarupa Banerjee. 2005. A neuro-SVM model for text classification using latent semantic indexing. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), IEEE, volume 1, pages 564-569.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532-1543.

V. Prakasam. 2004. The Indian Evidence Act 1872: a lexicogrammatical study. In J. Gibbons et al., pages 17-23.

Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806-813.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657.

Appendix

Code: https://drive.google.com/drive/folders/0B1PBgJkY6Miwa1hweDA1a2t5MTQ?usp=sharing