Language Identification: a Neural Network approach
Alberto Simões (1), José João Almeida (2), Simon D. Byers (3)
(1) CEHUM, Minho's University
(2) CCTC, Minho's University
(3) AT&T Labs, Bedminster NJ
SLATE2014, 19-20th June 2014
A presentation on some experiments in language identification with Perl and neural networks.
In which languages are these texts?
Malgranda Sablodezerto estas dezerto de Okcidenta Aŭstralio
Esperanto
Po nepavykusių pirmųjų bandymų su kukurūzais
Lithuanian
In which languages are these texts?
俄罗斯眼下不具备航母建造、停泊和维护所需的基础设施和条件
Simplified Chinese
임금체계 개편은 기본적으로 노사 합의 또는
Korean
In which languages are these texts?
جلوگیری کردند. گروه دوم هم به
Persian
আবেদনকারীদের পক্ষে শুনানি করেন ফিদা
Bengali
In which languages are these texts?
ဦးသနိး္စနိအ္စိုးရရ �ဲဝန�္ကးီအမာ်းစဟုာ စစဗ္ုိလန္�ဲ
စစဗ္ိုလလ္ထူြကေ္တြ
Burmese
આ રસ મ લ િનચોડી સારીરી િમકસ કરો અ લાસમ
Gujarati
Approaches
- Using a dictionary of words for each language:
  - Problem: amount of word forms!
- Using language features:
  - compute unigrams, bigrams, trigrams, …;
  - compute short words;
  - compute word beginnings or terminations;
- Then use language models:
  - Naïve Bayes;
  - Hidden Markov Models (HMM);
  - Support Vector Machines (SVM);
  - Neural Networks (NN);
Motivation for a new tool
- lack of a decent identification tool for Perl;
- use of the Chrome Language Detection library is limited:
  - how to add new languages?
  - how to restrict results to specific languages?
- there are tools for other programming languages:
  - language interoperability can be a hassle;
  - not clear how to add new languages;
Why use a Neural Network?
- learn how Neural Networks work!
- an approach where:
  - training is tedious and slow;
  - identification is easy to implement;
  - identification is efficient when BLAS is available;
- therefore:
  - it is possible to use the trained data in different programming languages;
  - it is easy to restrict the analysis to a set of languages;
  - identification probabilities are comparable;
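The point about restricting the analysis can be illustrated with a minimal sketch, assuming the network emits one score per language. The function names, language codes and scores are hypothetical, and rescaling the kept scores so they sum to 1 is one possible choice, not something the slides specify.

```python
def restrict_languages(scores, allowed):
    """Keep only the scores of the allowed languages and rescale them to sum to 1."""
    kept = {lang: s for lang, s in scores.items() if lang in allowed}
    total = sum(kept.values())
    if total == 0:
        return kept
    return {lang: s / total for lang, s in kept.items()}


def best_language(scores, allowed):
    """Return the allowed language with the highest score."""
    restricted = restrict_languages(scores, allowed)
    return max(restricted, key=restricted.get)


# Made-up output scores for four languages.
scores = {"pt": 0.40, "es": 0.35, "it": 0.20, "fr": 0.05}
print(best_language(scores, {"es", "it"}))
```

Because each output unit is scored independently, dropping units does not require retraining; only the comparison set changes.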
Neural Network Architecture
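[Figure: three-layer feed-forward network. Input layer (features) x1, x2, x3, …, xn; hidden layer a(2)1, a(2)2, a(2)3, …, a(2)s2; output layer y1, y2, …, yK. The weight matrix Θ(1) connects the input layer to the hidden layer, and Θ(2) connects the hidden layer to the output layer.]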
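Identification with such a network is a forward pass through the two weight matrices. The sketch below assumes a sigmoid activation and a bias unit prepended to each layer, neither of which is stated on the slides; the weights are toy values chosen only to make the example run.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, inputs):
    """Apply one weight matrix; theta[j][0] is the (assumed) bias weight of unit j."""
    with_bias = [1.0] + list(inputs)
    return [sigmoid(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, x):
    hidden = layer(theta1, x)       # a(2): hidden layer activations
    return layer(theta2, hidden)    # y: one score per output class


# Toy, made-up weights: 2 features, 2 hidden units, 2 output classes.
theta1 = [[0.0, 1.0, -1.0],
          [0.0, -1.0, 1.0]]
theta2 = [[0.0, 2.0, -2.0],
          [0.0, -2.0, 2.0]]
y = forward(theta1, theta2, [1.0, 0.0])
print(y)
```

The whole computation is two matrix-vector products plus an element-wise function, which is why it ports easily to other languages and benefits from BLAS.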
Preparing Training Data
- texts from the TED website;
- more than 105 languages available!
- English texts were matched against an English dictionary;
- OOV items are removed from the English texts and from the other language texts (trying to remove named entities written in their English form from other texts).
Example
…began spoken word poet Sarah Kay, in a talk that inspired two standing ovations at TED2011. She tells the story of her metamorphosis - from a wide-eyed teenager soaking in verse at New York's Bowery Poetry Club to a teacher connecting kids with the power of self-expression through Project V.O.I.C.E. - and gives two breathtaking performances of "B" and "Hiroshima."
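The out-of-vocabulary (OOV) filtering can be sketched as follows. This is only an illustration: whitespace tokenization, the toy dictionary, the function names and the Portuguese sentence are all assumptions made for the example, not the authors' data or code.

```python
def oov_words(english_tokens, dictionary):
    """Words of the English text that are not found in the English dictionary."""
    return {w for w in english_tokens if w.lower() not in dictionary}


def remove_oov(tokens, oov):
    """Drop the OOV items (mostly named entities) from any version of the text."""
    return [w for w in tokens if w not in oov]


dictionary = {"spoken", "word", "poet", "in", "a", "talk"}   # toy dictionary
english = "spoken word poet Sarah Kay in a talk".split()
portuguese = "a poetisa Sarah Kay numa palestra".split()     # invented sentence

oov = oov_words(english, dictionary)
print(remove_oov(english, oov))
print(remove_oov(portuguese, oov))
```

The names "Sarah" and "Kay" are dropped from both texts, so they cannot become shared features between English and the other language.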
Two Kinds of Features
- Used Alphabet
  - Which characters are used in the text?
  - Are they usually used in Asiatic, Arabic or Latin text?
- Used Sequences of Characters
  - Which unigrams, bigrams or trigrams are used?
  - Which are the most common for each language?
Alphabet Features
Count the number of Unicode characters in the following classes:
C1  Latin characters, only a-z, without diacritics;
C2  Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
C3  Hiragana and Katakana characters (0x3040-0x30FF);
C4  Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF, 0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
C5  Kanji characters (0x4E00-0x9FAF);
C6  Simplified Chinese characters (2877 hand-defined characters);
C7  Traditional Chinese characters (2663 hand-defined characters);
C8  Arabic characters (0x0600-0x06FF);
C9  Thai characters (0x0E00-0x0E7F);
C10 Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
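Counting characters per class amounts to a range lookup on code points. The sketch below uses the ranges listed above; the class names and function name are illustrative, and the hand-defined Simplified and Traditional Chinese character sets (C6, C7) are left out because they are not given on the slides.

```python
# Code point ranges as listed on the slide (inclusive on both ends).
RANGES = {
    "latin": [(0x61, 0x7A)],                        # a-z only, as stated
    "cyrillic": [(0x0410, 0x042F), (0x0430, 0x044F)],
    "kana": [(0x3040, 0x30FF)],
    "hangul": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
               (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],
    "kanji": [(0x4E00, 0x9FAF)],
    "arabic": [(0x0600, 0x06FF)],
    "thai": [(0x0E00, 0x0E7F)],
    "greek": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
}


def count_alphabets(text):
    """Count how many characters of the text fall in each class."""
    counts = {name: 0 for name in RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
    return counts


counts = count_alphabets("abc привет 한국")
print(counts)
```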
Binarization of Alphabet Features
To reduce entropy in the NN, the alphabet features are binarized using a set of rules.
[Rules not preserved in this transcript.]
Trigram Features
Why Trigrams?
- bigrams would be too small to tell apart very close languages like Portuguese and Spanish;
- tetragrams would be too big for some languages (such as Asian ones), where some glyphs represent words or morphemes;
- as punctuation and numbers were removed and spaces normalized, trigrams can also capture the beginnings and ends of words, as well as single-character words surrounded by spaces.
Trigram Features: example
Für mich war das eine neue Erkenntnis. Und ich denke, mit der Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur eine Form kultureller Integration. Wir haben erkannt, dass seit kurzem immer mehr Leute

Top occurring trigrams (␣ marks a space)
en␣ 0.02299   er␣ 0.02682   ␣de 0.01533   abe 0.01533   der 0.01149
hab 0.01149   ich 0.01149   ir␣ 0.01149   it␣ 0.01149   r␣h 0.01149
␣wi 0.01149   ben 0.01149   ch␣ 0.01149   den 0.01149   wir 0.01149
␣ha 0.01149   ine 0.00766   ler 0.00766   lle 0.00766   n␣k 0.00766
mme 0.00766   ne␣ 0.00766   nnt 0.00766   r␣l 0.00766   r␣m 0.00766
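Relative trigram frequencies like these can be computed with a short sketch. Several details here are assumptions rather than the authors' exact procedure: lowercasing, deleting every character in a Unicode punctuation or number category, and padding the text with one space on each side.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and numbers, collapse runs of whitespace."""
    kept = [ch for ch in text.lower()
            if unicodedata.category(ch)[0] not in "PN"]
    return " ".join("".join(kept).split())


def compute_trigrams(text):
    """Map each character trigram to its relative frequency in the text."""
    padded = " " + normalize(text) + " "   # assumed padding at both ends
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


freqs = compute_trigrams("Wir haben, wir haben 2!")
for trigram, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(trigram), round(freq, 5))
```

Trigrams that contain a space, such as `' wi'` or `'en '`, are the ones that record word beginnings and endings.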
Trigram Features: Merging
features ← {};
for L ∈ ℒ do
    trigrams ← ∅;
    for file ∈ Files_L do
        T ← computeTrigrams(file);        // Str → ℕ
        T ← mostOccurring(T);             // Top 30 trigrams
        for t ∈ keys(T) do
            trigrams[t] ← trigrams[t] + 1;
    T ← mostOccurring(T);
    features ← features ∪ keys(trigrams);
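A runnable reading of this merging step is sketched below. Two points are interpretations: the final `mostOccurring` call is taken to prune the per-language `trigrams` counter, and since the slide gives no cutoff for it, the hypothetical `per_language` parameter defaults to keeping everything. The language codes and texts are made up.

```python
from collections import Counter


def count_trigrams(text):
    """Count the character trigrams of a text (no normalization in this sketch)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def most_occurring(counter, n):
    """Keep the n most frequent entries; n=None keeps all of them."""
    return dict(counter.most_common(n))


def merge_features(texts_by_language, per_file=30, per_language=None):
    features = set()
    for language, texts in texts_by_language.items():
        trigrams = Counter()
        for text in texts:
            top = most_occurring(count_trigrams(text), per_file)
            for t in top:
                trigrams[t] += 1    # number of files where t is in the top list
        if per_language is not None:
            trigrams = Counter(most_occurring(trigrams, per_language))
        features |= set(trigrams)
    return sorted(features)


corpus = {"xx": ["abab", "abab"], "yy": ["cdcd"]}   # hypothetical languages
print(merge_features(corpus))
```

The result is the union of every language's frequent trigrams, which fixes the trigram columns of the training matrix.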
Training Data Matrix (excerpt)
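[Table: the columns are the Alphabet Features (Latin, Greek, Cyril.) followed by the Trigram Features (␣pa, ới␣, par, nia, ест, ати., ата); the row values are not preserved in this transcript.]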
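One row of such a matrix can be assembled by concatenating the alphabet columns with the trigram columns. In this sketch the function name, the 0/1 alphabet flags and the frequency values are hypothetical, and using 0.0 for trigrams absent from a text is an assumption.

```python
def feature_row(alphabet_flags, trigram_freqs, alphabet_names, trigram_features):
    """Build one training row: alphabet columns first, then trigram columns."""
    row = [float(alphabet_flags.get(name, 0)) for name in alphabet_names]
    row += [trigram_freqs.get(t, 0.0) for t in trigram_features]
    return row


alphabet_names = ["Latin", "Greek", "Cyril."]
trigram_features = [" pa", "par", "nia"]

# Made-up values for a single Latin-script text.
row = feature_row({"Latin": 1}, {"par": 0.02, " pa": 0.01},
                  alphabet_names, trigram_features)
print(row)
```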
Exp1: Accuracy
Language            1500 iters.   4000 iters.   Notes
ar, bg, de          100%          100%
el, es, fa          100%          100%
fr, he, hu          100%          100%
it, ja, ko          100%          100%
nl, pl              100%          100%
pt                  5%            52%           wrongly classifies as pt-br
pt-br               100%          76%           wrongly classifies as pt
ro, ru, sr          100%          100%
th, tr, uk          100%          100%
vi, zh-cn, zh-tw    100%          100%
Exp1: Comparison of PT variants
[Figure: comparison of the PT and PT-BR variants.]
Experiment 2: 55 languages
Afrikaans
Albanian
Arabic
Bulgarian
Bengali
Catalan
Czech
Danish
German
Modern Greek
English
Esperanto
Spanish
Estonian
Persian
Finnish
French
Galician
Gujarati
Hebrew
Hindi
Hungarian
Armenian
Indonesian
Italian
Japanese
Georgian
Kannada
Korean
Kurdish
Lithuanian
Latvian
Macedonian
Malayalam
Marathi
Burmese
Nepali
Dutch
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Somali
Serbian
Swedish
Tamil
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Chinese (simplified)
Chinese (traditional)
Exp 2: Results
- 55 languages,
- 1126 features,
- the Θ(l) matrices take 11 MB on disk (binary format),
- running 7500 iterations of the learning algorithm,
- taking 6574 minutes and 50.386 seconds (more than 4.5 days),
- still 21 test files per language,
- 46 seconds to run over the 1155 test files,
- accuracy of 99.740%,
- mis-identifications:
  - 2 Bulgarian texts detected as Macedonian,
  - 1 Danish text detected as Dutch.
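The reported figures are mutually consistent, as a quick check using only the numbers above shows:

```python
languages = 55
test_files_per_language = 21
errors = 2 + 1   # 2 Bulgarian texts and 1 Danish text

test_files = languages * test_files_per_language
accuracy = 100 * (test_files - errors) / test_files
training_days = (6574 * 60 + 50.386) / 86400   # seconds in a day

print(test_files)                 # 1155
print(round(accuracy, 3))         # 99.74
print(round(training_days, 2))    # 4.57
```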
Conclusions
- Up to 96% accuracy when testing a few languages, including two Portuguese variants;
- Over 99.7% accuracy for 55 languages;
- NNs are able to grow, but training time grows excessively;
- The choice of features is relevant (if we know that a specific detail is good at distinguishing a language, add it to the network!);
- The obtained results are not "deterministic": although the same proportion of results is expected, the random initialization of the network may lead to somewhat different results in a different number of iterations.
Future Work
- Reduce the number of trigrams per language and include unigrams;
- Compute distribution differences between near languages;
- Experiment with training a different neural network for each alphabet;
- Include a regularization coefficient (λ ≠ 0);
- Experiment with Deep Neural Networks;
- Test language identification on short texts (namely Twitter tweets).