Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8308–8319, July 5–10, 2020. ©2020 Association for Computational Linguistics.


Phonetic and Visual Priors for Decipherment of Informal Romanization

Maria Ryskina 1   Matthew R. Gormley 2   Taylor Berg-Kirkpatrick 3

1 Language Technologies Institute, Carnegie Mellon University
2 Machine Learning Department, Carnegie Mellon University
3 Computer Science and Engineering, University of California, San Diego

{mryskina,mgormley}@cs.cmu.edu   [email protected]

Abstract

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages—namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments. [1]

1 Introduction

Written online communication poses a number of challenges for natural language processing systems, including the presence of neologisms, code-switching, and the use of non-standard orthography. One notable example of orthographic variation in social media is informal romanization [2]—speakers of languages written in non-Latin alphabets encoding their messages in Latin characters, for convenience or due to technical constraints (improper rendering of native script or keyboard layout incompatibility).

[1] The code and data are available at https://github.com/ryskina/romanization-decipherment

[2] Our focus on informal transliteration excludes formal settings such as pinyin for Mandarin where transliteration conventions are well established.

horosho    [Phonetically romanized]
хорошо     [Underlying Cyrillic]
xopowo     [Visually romanized]

Figure 1: Example transliterations of a Russian word хорошо [horošo, 'good'] (middle) based on phonetic (top) and visual (bottom) similarity, with character alignments displayed. The phonetic-visual dichotomy gives rise to one-to-many mappings such as ш /ʃ/ → sh / w.

An example of such a sentence can be found in Figure 2. Unlike named entity transliteration, where the change of script represents the change of language, here Latin characters serve as an intermediate symbolic representation to be decoded by another speaker of the same source language, calling for a completely different transliteration mechanism: instead of expressing the pronunciation of the word according to the phonetic rules of another language, informal transliteration can be viewed as a substitution cipher, where each source character is replaced with a similar Latin character.

In this paper, we focus on decoding informally romanized texts back into their original scripts. We view the task as a decipherment problem and propose an unsupervised approach, which allows us to save annotation effort since parallel data for informal transliteration does not occur naturally. We propose a weighted finite-state transducer (WFST) cascade model that learns to decode informal romanization without parallel text, relying only on transliterated data and a language model over the original orthography. We test it on two languages, Egyptian Arabic and Russian, collecting our own dataset of romanized Russian from the Russian social network website vk.com.


4to mowet bit’ ly4we? [Romanized]Qto mo�et byt~ luqxe? [Latent Cyrillic]Cto možet byt’ lucše? [Scientific]/Sto "moZ1t b1tj "lu

>tSS1/ [IPA]

What can be better? [Translated]

Figure 2: Example of an informally romanizedsentence from the dataset presented in this paper,containing a many-to-one mapping � / x → w.Scientific transliteration, broad phonetic transcrip-tion, and translation are not included in the datasetand are presented for illustration only.

Since informal transliteration is not standardized, converting romanized text back to its original orthography requires reasoning about the specific user's transliteration preferences and handling many-to-one (Figure 2) and one-to-many (Figure 1) character mappings, which is beyond traditional rule-based converters. Although user behaviors vary, there are two dominant patterns in informal romanization that have been observed independently across different languages, such as Russian (Paulsen, 2014), dialectal Arabic (Darwish, 2014) or Greek (Chalamandaris et al., 2006):

Phonetic similarity: Users represent source characters with Latin characters or digraphs associated with similar phonemes (e.g. м /m/ → m, л /l/ → l in Figure 2). This substitution method requires implicitly tying the Latin characters to the phonetic system of an intermediate language (typically, English).

Visual similarity: Users replace source characters with similar-looking symbols (e.g. ч /t͡ʃʲ/ → 4, у /u/ → y in Figure 2). Visual similarity choices often involve numerals, especially when the corresponding source language phoneme has no English equivalent (e.g. Arabic ع /ʕ/ → 3).

Taking that consistency across languages into account, we show that incorporating these style patterns into our model as priors on the emission parameters—also constructed from naturally occurring resources—improves the decoding accuracy on both languages. We compare the proposed unsupervised WFST model with a supervised WFST, an unsupervised neural architecture, and commercial systems for decoding romanized Russian (translit) and Arabic (Arabizi). Our unsupervised WFST outperforms the unsupervised neural baseline on both languages.

2 Related work

Prior work on informal transliteration uses supervised approaches with character substitution rules either manually defined or learned from automatically extracted character alignments (Darwish, 2014; Chalamandaris et al., 2004). Typically, such approaches are pipelined: they produce candidate transliterations and rerank them using modules encoding knowledge of the source language, such as morphological analyzers or word-level language models (Al-Badrashiny et al., 2014; Eskander et al., 2014). Supervised finite-state approaches have also been explored (Wolf-Sonkin et al., 2019; Hellsten et al., 2017); these WFST cascade models are similar to the one we propose, but they encode a different set of assumptions about the transliteration process, since they are designed for abugida scripts (using consonant-vowel syllables as units) rather than alphabets. To our knowledge, there is no prior unsupervised work on this problem.

Named entity transliteration, a task closely related to ours, is better explored, but there is little unsupervised work on this task as well. In particular, Ravi and Knight (2009) propose a fully unsupervised version of the WFST approach introduced by Knight and Graehl (1998), reframing the task as a decipherment problem and learning cross-lingual phoneme mappings from monolingual data. We take a similar path, although it should be noted that named entity transliteration methods cannot be straightforwardly adapted to our task due to the different nature of the transliteration choices. The goal of the standard transliteration task is to communicate the pronunciation of a sequence in the source language (SL) to a speaker of the target language (TL) by rendering it appropriately in the TL alphabet; in contrast, informal romanization emerges in communication between SL speakers only, and TL is not specified. If we picked any specific Latin-script language to represent TL (e.g. English, which is often used to ground phonetic substitutions), many of the informally romanized sequences would still not conform to its pronunciation rules: the transliteration process is character-level rather than phoneme-level and does not take possible TL digraphs into account (e.g. Russian сх /sx/ → sh), and it often involves eclectic visual substitution choices such as numerals or punctuation (e.g. Arabic تحت [tHt, 'under'] [3] → ta7t, Russian для [dlja, 'for'] → dl9|).

Finally, another relevant task is translating between closely related languages, possibly written in different scripts. An approach similar to ours is proposed by Pourdamghani and Knight (2017). They also take an unsupervised decipherment approach: the cipher model, parameterized as a WFST, is trained to encode the source language character sequences into the target language alphabet as part of a character-level noisy-channel model, and at decoding time it is composed with a word-level language model of the source language. Recently, unsupervised neural architectures (Lample et al., 2018, 2019) have also been used for related language translation and similar decipherment tasks (He et al., 2020), and we extend one of these neural models to our character-level setup to serve as a baseline (§5).

3 Methods

We train a character-based noisy-channel model that transforms a character sequence o in the native alphabet of the language into a sequence of Latin characters l, and use it to decode the romanized sequence l back into the original orthography. Our proposed model is composed of separate transition and emission components, as discussed in §3.1, similarly to an HMM. However, an HMM assumes a one-to-one alignment between the characters of the observed and the latent sequences, which does not hold for our task. One original script character can be aligned to two consecutive Latin characters or vice versa: for example, when a phoneme is represented with a single symbol on one side but with a digraph on the other (Figure 1), or when a character is omitted on one side but explicitly written on the other (e.g. short vowels, which are not written in unvocalized Arabic but are written in transliteration, or the Russian soft sign ь, representing palatalization, which is often omitted in the romanized version). To handle those alignments, we introduce insertions and deletions into the emission model and modify the emission transducer to limit the number of consecutive insertions and deletions. In our experiments, we compare the performance of the model with and without the informative phonetic and visual similarity priors described in §3.2.

[3] The square brackets following a foreign word show its linguistic transliteration (using the scientific and the Buckwalter schemas for Russian and Arabic respectively) and its English translation.

3.1 Model

If we view the process of romanization as encoding a source sequence o into Latin characters, we can consider each observation l to have originated via o being generated from a distribution p(o) and then transformed to Latin script according to another distribution p(l|o). We can write the probability of the observed Latin sequence as:

    p(l) = Σ_o p(o; γ) · p(l|o; θ) · p_prior(θ; α)    (1)

The first two terms in (1) correspond to the probabilities under the transition model (the language model trained on the original orthography) and the emission model respectively. The third term represents the prior distribution on the emission model parameters, through which we introduce human knowledge into the model. Our goal is to learn the parameters θ of the emission distribution, with the transition parameters γ being fixed.

We parameterize the emission and transition distributions as weighted finite-state transducers (WFSTs):

Transition WFSA  The n-gram weighted finite-state acceptor (WFSA) T represents a character-level n-gram language model of the language in the native script, producing the native alphabet character sequence o with the probability p(o; γ). We use the parameterization of Allauzen et al. (2003), with the states encoding conditioning history, arcs weighted by n-gram probabilities, and failure transitions representing backoffs. The role of T is to inform the model of what well-formed text in the original orthography looks like; its parameters γ are learned from a separate corpus and kept fixed during the rest of the training.

Emission WFST  The emission WFST S transduces the original script sequence o to a Latin sequence l with the probability p(l|o; θ). Since there can be multiple paths through S that correspond to the input-output pair (o, l), this probability is summed over all such paths (i.e. it is a marginal over all possible monotonic character alignments):

    p(l|o; θ) = Σ_e p(l, e|o; θ)    (2)

We view each path e as a sequence of edit operations: substitutions of original characters with Latin ones (c_o → c_l), insertions of Latin characters (ε → c_l), and deletions of original characters (c_o → ε). Each arc in S corresponds to one


of the possible edit operations; an arc representing the edit c_o → c_l is characterized by the input label c_o, the output label c_l, and the weight −log p(c_l|c_o; θ). The emission parameters θ are the multinomial conditional probabilities of the edit operations p(c_l|c_o); we learn θ using the algorithm described in §3.3.
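The marginal in Eq. (2) can be computed without any FST machinery by a forward-style dynamic program over monotonic edit alignments. The sketch below assumes dictionaries p_sub, p_ins, and p_del holding the multinomial edit probabilities; these names are ours, not the paper's.

```python
def emission_marginal(o, l, p_sub, p_ins, p_del):
    """p(l | o; theta) summed over all monotonic edit alignments (Eq. 2).
    f[i][j] is the total probability of transducing o[:i] into l[:j]."""
    n, m = len(o), len(l)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:              # deletion: o[i-1] -> epsilon
                f[i][j] += f[i - 1][j] * p_del.get(o[i - 1], 0.0)
            if j > 0:              # insertion: epsilon -> l[j-1]
                f[i][j] += f[i][j - 1] * p_ins.get(l[j - 1], 0.0)
            if i > 0 and j > 0:    # substitution: o[i-1] -> l[j-1]
                f[i][j] += f[i - 1][j - 1] * p_sub.get(o[i - 1], {}).get(l[j - 1], 0.0)
    return f[n][m]
```

Composing T, S, and A(l) into a lattice plays the same role as this chart, with the added bookkeeping that paths also carry LM scores.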

3.2 Phonetic and visual priors

To inform the model of which pairs of symbols are close in the phonetic or visual space, we introduce priors on the emission parameters, increasing the probability of an original alphabet character being substituted by a similar Latin one. Rather than attempting to operationalize the notions of phonetic or visual similarity, we choose to read the likely mappings between symbols off human-compiled resources that use the same underlying principle: phonetic keyboard layouts and visually confusable symbol lists. Examples of mappings that we encode as priors can be found in Table 1.

Phonetic similarity  Since we think of informal romanization as a cipher, we aim to capture the phonetic similarity between characters based on association rather than on the actual grapheme-to-phoneme mappings in specific words. We approximate it using phonetic keyboard layouts, one-to-one mappings built to bring together "similar-sounding" characters in different alphabets. We take the character pairs from a union of multiple layouts for each language, two for Arabic [4] and four for Russian. [5] The main drawback of using keyboard layouts is that they require every character to have a Latin counterpart, so some mappings will inevitably be arbitrary; we compensate for this effect by averaging over several layouts.

Visual similarity  The strongest example of visual character similarity would be homoglyphs—symbols from different alphabets represented by the same glyph, such as Cyrillic а and Latin a. The fact that homoglyph pairs can be made indistinguishable in certain fonts has been exploited in phishing attacks, e.g. when Latin characters are replaced by virtually identical Cyrillic ones (Gabrilovich and Gontmakher, 2002). This led the Unicode Consortium to publish a list of symbols and symbol combinations similar enough to be potentially confusing to the human eye (referred to as confusables). [6]

[4] http://arabic.omaralzabir.com/, https://thomasplagwitz.com/2013/01/06/imrans-phonetic-keyboard-for-arabic/

[5] http://winrus.com/kbd_e.htm

Original         Phon.     Vis.
р /r/            r         p
б /b/            b         b, 6
в /v/            v, w      b
و /w, uː, oː/    w, u      —
خ /x/            k, x      —

Table 1: Example Cyrillic–Latin and Arabic–Latin mappings encoded in the visual and phonetic priors respectively.

This list contains not only exact homoglyphs but also strongly homoglyphic pairs such as Cyrillic ю and Latin lO.

We construct a visual prior for the Russian model from all Cyrillic–Latin symbol pairs in the Unicode confusables list. [7] Although this list does not cover more complex visual associations used in informal romanization, such as partial similarity (Arabic Alif with Hamza أ → 2, due to Hamza ء resembling an inverted 2) or similarity conditioned on a transformation such as reflection (Russian л → v), it makes a sensible starting point. However, this restrictive definition of visual similarity does not allow us to create a visual prior for Arabic—the two scripts are dissimilar enough that the confusables list does not contain any Arabic–Latin character pairs. Proposing a more nuanced definition of visual similarity for Arabic and the associated prior is left for future work.

We incorporate these mappings into the model as Dirichlet priors on the emission parameters: θ ~ Dir(α), where each dimension of the parameter α corresponds to a character pair (c_o, c_l), and the corresponding element of α is set to the number of times these symbols are mapped to each other in the predefined mapping set.
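As an illustration, the prior pseudo-counts can be assembled as below. The pair lists shown are a tiny subset taken from Table 1, and the uniform base count is an assumption of ours, not a detail from the paper.

```python
from collections import Counter

# A few Cyrillic-Latin pairs read off the resources described above (Table 1);
# the real lists cover every character in each layout / confusables entry.
phonetic_pairs = [("р", "r"), ("б", "b"), ("в", "v"), ("в", "w")]
visual_pairs = [("р", "p"), ("б", "b"), ("б", "6"), ("в", "b")]

def dirichlet_alpha(pair_lists, base=1.0):
    """alpha[(c_o, c_l)] = base + number of resource mappings c_o -> c_l,
    so that theta ~ Dir(alpha) favours the attested substitutions."""
    counts = Counter(pair for pairs in pair_lists for pair in pairs)
    return {pair: base + c for pair, c in counts.items()}

alpha = dirichlet_alpha([phonetic_pairs, visual_pairs])
# e.g. alpha[("б", "b")] == 3.0: the pair is counted in both lists
```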

3.3 Learning

We learn the emission WFST parameters in an unsupervised fashion, observing only the Latin side of the training instances. The marginal likelihood of a romanized sequence l can be computed by summing over the weights of all paths through a lattice obtained by composing T ∘ S ∘ A(l).

[6] https://www.unicode.org/Public/security/latest/confusables.txt

[7] In our parameterization, we cannot introduce a mapping from one symbol to multiple symbols or vice versa, so we map all possible pairs instead: (ю, lo) → (ю, l), (ю, o).


[Figure 3 shows states labeled −2, −1, 0, 1, 2 connected by substitution self-loops ∗o : ∗l, deletion arcs ∗o : ε, and insertion arcs ε : ∗l.]

Figure 3: Schematic of the emission WFST with limited delay (here, up to 2), with states labeled by their delay values. ∗o and ∗l represent an arbitrary original or Latin symbol respectively. Weights of the arcs are omitted for clarity; weights of arcs with the same input-output label pairs are tied.

Here A(l) is an unweighted acceptor of l, which, when composed with a lattice, constrains all paths through the lattice to produce l as the output sequence. The expectation-maximization (EM) algorithm is commonly used to maximize marginal likelihood; however, the size of the lattice would make the computation prohibitively slow. We combine online learning (Liang and Klein, 2009) and curriculum learning (Bengio et al., 2009) to achieve faster convergence, as described in §3.3.1.

3.3.1 Unsupervised learning

We use a version of the stepwise EM algorithm described by Liang and Klein (2009), reminiscent of stochastic gradient descent in the space of the sufficient statistics. Training data is split into mini-batches, and after processing each mini-batch we update the overall vector of the sufficient statistics μ and re-estimate the parameters based on the updated vector. The update is performed by interpolating between the current value of the overall vector and the vector of sufficient statistics s_k collected from the k-th mini-batch:

    μ^(k+1) ← (1 − η_k) μ^(k) + η_k s_k

The stepsize is gradually decreased, causing the model to make smaller changes to the parameters as the learning stabilizes. Following Liang and Klein (2009), we set it to η_k = (k + 2)^(−β).
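A minimal rendering of this interpolated update (with β = 0.9, the value given in Appendix B); storing the sufficient statistics as a plain dictionary over edit operations is our simplification.

```python
def stepwise_update(mu, s_k, k, beta=0.9):
    """Stepwise EM update (Liang and Klein, 2009):
    mu <- (1 - eta_k) * mu + eta_k * s_k, with stepsize eta_k = (k + 2) ** -beta.
    mu and s_k map edit operations to expected counts."""
    eta = (k + 2) ** (-beta)
    ops = set(mu) | set(s_k)
    return {op: (1.0 - eta) * mu.get(op, 0.0) + eta * s_k.get(op, 0.0) for op in ops}
```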

However, if a mini-batch contains long sequences, summing over all paths in the corresponding lattices could still take a long time. The character substitutions are not arbitrary: each original alphabet symbol is likely to be mapped to only a few Latin characters, which means that most of the paths through the lattice have very low probabilities. We therefore prune the improbable arcs in the emission WFST while training on batches of shorter sentences. Doing this eliminates up to 66% and up to 76% of the emission arcs for Arabic and Russian respectively.

We discourage excessive use of insertions and deletions by keeping the corresponding probabilities low at the early stages of training: during the first several updates, we freeze the deletion probabilities at a small initial value and disable insertions completely to keep the model locally normalized. We also iteratively increase the language model order as learning progresses. Once most of the emission WFST arcs have been pruned, we can afford to compose it with a larger language model WFST without the size of the resulting lattice rendering the computation impractical. The two steps of the EM algorithm are performed as follows:

E-step  At the E-step we compute the sufficient statistics for updating θ, which in our case are the expected numbers of traversals of each of the emission WFST arcs. For ease of bookkeeping, we compute those expectations using finite-state methods in the expectation semiring (Eisner, 2002). Summing over all paths in a lattice is usually performed via shortest distance computation in the log semiring; in the expectation semiring, we augment the weight of each arc with a basis vector, where the only non-zero element corresponds to the index of the emission edit operation associated with the arc (i.e. the input-output label pair). This way the shortest distance algorithm yields not only the marginal likelihood but also the vector of sufficient statistics for the input sequence.
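The expectation semiring can be sketched in a few lines: each weight carries a probability together with a sparse vector of expected edit counts, and the semiring operations propagate both through the shortest-distance computation. This is an illustrative reimplementation, not OpenFst's.

```python
class ExpWeight:
    """Weight in the expectation semiring (Eisner, 2002): a path probability p
    plus a (sparse) vector r of expected edit-operation counts."""
    def __init__(self, p, r=None):
        self.p, self.r = p, dict(r or {})

    def __add__(self, other):   # oplus: sum over alternative paths
        r = dict(self.r)
        for k, v in other.r.items():
            r[k] = r.get(k, 0.0) + v
        return ExpWeight(self.p + other.p, r)

    def __mul__(self, other):   # otimes: extend a path; r obeys the product rule
        r = {k: v * other.p for k, v in self.r.items()}
        for k, v in other.r.items():
            r[k] = r.get(k, 0.0) + self.p * v
        return ExpWeight(self.p * other.p, r)

# An arc for the edit (c_o -> c_l) with probability q carries
# ExpWeight(q, {(c_o, c_l): q}); the shortest distance over the lattice then
# yields both the marginal likelihood and the expected counts for the E-step.
```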

To speed up the shortest distance computation, we shrink the lattice by limiting the delay of all paths through the emission WFST. The delay of a path is defined as the difference between the number of epsilon labels on the input and output sides of the path. Figure 3 shows the schema of the emission WFST with limited delay. Substitutions are performed without a state change, and each deletion or insertion arc transitions to the next or previous state respectively. When the first (last) state is reached, further deletions (insertions) are no longer allowed.
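Limiting delay has a simple counterpart in the dynamic program from §3.1: at a chart cell (i, j), the delay of any alignment prefix is exactly j − i (insertions minus deletions), so the limit corresponds to skipping cells with |j − i| above the bound. A sketch, reusing the hypothetical probability tables from before:

```python
def emission_marginal_limited(o, l, p_sub, p_ins, p_del, max_delay=2):
    """Like emission_marginal, but paths whose running delay (insertions minus
    deletions, i.e. j - i) exceeds max_delay are pruned, mirroring the
    delay-limited WFST of Figure 3."""
    n, m = len(o), len(l)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0 or abs(j - i) > max_delay:
                continue               # start cell, or delay limit exceeded
            total = 0.0
            if i > 0:                  # deletion
                total += f[i - 1][j] * p_del.get(o[i - 1], 0.0)
            if j > 0:                  # insertion
                total += f[i][j - 1] * p_ins.get(l[j - 1], 0.0)
            if i > 0 and j > 0:        # substitution
                total += f[i - 1][j - 1] * p_sub.get(o[i - 1], {}).get(l[j - 1], 0.0)
            f[i][j] = total
    return f[n][m]
```

Note that the result is 0 whenever the lengths of o and l differ by more than max_delay, which matches the composability restriction mentioned in §5.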

M-step  The M-step then corresponds to simply re-estimating θ by appropriately normalizing the obtained expected counts.


              Arabic            Russian
              Sent.    Char.    Sent.    Char.
LM train      49K      935K     307K     111M
Train         5K       104K     5K       319K
Validation    301      8K       227      15K
Test          1K       20K      1K       72K

Table 2: Splits of the Arabic and Russian data used in our experiments. All Arabic data comes from the LDC BOLT Phase 2 corpus, in which all sentences are annotated with their transliteration into the Arabic script. For the experiments on Russian, the language model is trained on a section of the Taiga corpus, and the train, validation, and test portions are collected by the authors; only the validation and test sentences are annotated.

3.3.2 Supervised learning

We also compare the performance of our model with the same model trained in a supervised way, using the annotated portion of the data that contains parallel o and l sequences. In the supervised case we can additionally constrain the lattice with an acceptor of the original orthography sequence: A(o) ∘ T ∘ S ∘ A(l). However, the alignment between the symbols in o and l is still latent. To optimize this marginal likelihood we still employ the EM algorithm. As this constrained lattice is much smaller, we can run standard EM without the modifications discussed in §3.3.1.

3.4 Decoding

Inference at test time is also performed using finite-state methods and closely resembles the E-step of the unsupervised learning: given a Latin sequence l, we construct the machine T ∘ S ∘ A(l) in the tropical semiring and run the shortest path algorithm to obtain the most probable path e; the source sequence o is read off the obtained path.
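For intuition, decoding can again be written as a dynamic program instead of an FST composition. The sketch below uses a bigram source LM for tractability (the paper uses up to six-grams) and bounds runs of consecutive deletions as a simplified stand-in for the delay limit; all scorer arguments (lm_lp, sub_lp, ins_lp, del_lp) are hypothetical log-probability functions.

```python
import math

NEG = -math.inf

def _relax(cell, key, score, back, emit):
    """Keep the best-scoring way of reaching a chart entry."""
    if score > cell.get(key, (NEG, None, None))[0]:
        cell[key] = (score, back, emit)

def viterbi_decode(l, alphabet, lm_lp, sub_lp, ins_lp, del_lp, max_del=2):
    """Most probable source string o for a romanized string l, i.e. the
    shortest path through T o S o A(l) with a bigram LM as T.
    chart[j][d][prev] = (score, backpointer, source char emitted on arrival),
    where j indexes l and d counts consecutive deletions."""
    m = len(l)
    chart = [[{} for _ in range(max_del + 1)] for _ in range(m + 1)]
    chart[0][0]["<s>"] = (0.0, None, None)
    for j in range(m + 1):
        for d in range(max_del + 1):
            for prev, (score, _, _) in chart[j][d].items():
                if d < max_del:       # deletion: emit a source char, consume nothing
                    for c in alphabet:
                        _relax(chart[j][d + 1], c,
                               score + lm_lp(prev, c) + del_lp(c), (j, d, prev), c)
                if j < m:
                    # insertion: consume l[j] without emitting a source char
                    _relax(chart[j + 1][0], prev,
                           score + ins_lp(l[j]), (j, d, prev), None)
                    for c in alphabet:   # substitution: c -> l[j]
                        _relax(chart[j + 1][0], c,
                               score + lm_lp(prev, c) + sub_lp(c, l[j]), (j, d, prev), c)
    # pick the best final state, including the end-of-string LM term
    best, key = NEG, None
    for d in range(max_del + 1):
        for prev, (score, _, _) in chart[m][d].items():
            if score + lm_lp(prev, "</s>") > best:
                best, key = score + lm_lp(prev, "</s>"), (m, d, prev)
    out = []
    while key is not None:            # follow backpointers
        j, d, prev = key
        _, key, emit = chart[j][d][prev]
        if emit is not None:
            out.append(emit)
    return "".join(reversed(out))
```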

4 Datasets

Here we discuss the data used to train the unsupervised model. Unlike Arabizi, which has been explored in prior work due to its popularity in the modern online community, a dataset of informally romanized Russian was not available, so we collect and partially annotate our own dataset from the Russian social network vk.com.

4.1 Arabic

We use the Arabizi portion of the LDC BOLT Phase 2 SMS/Chat dataset (Bies et al., 2014; Song et al., 2014), a collection of written informal conversations in romanized Egyptian Arabic annotated with their Arabic script representation. To prevent the annotators from introducing orthographic variation inherent to dialectal Arabic, compliance with the Conventional Orthography for Dialectal Arabic (CODA; Habash et al., 2012) is ensured. However, the effects of some of the normalization choices (e.g. expanding frequent abbreviations) would pose difficulties for our model.

To obtain a subset of the data better suited for our task, we discard any instances which are not originally romanized (5% of all data), ones where the Arabic annotation contains Latin characters (4%), or where emoji/emoticon normalization was performed (12%). Information about the splits is provided in Table 2. Most of the data is allocated to the language model training set in order to give the unsupervised model enough signal from the native script side. We choose to train the transition model on the annotations from the same corpus to make the language model specific to both the informal domain and the CODA orthography.

4.2 Russian

We collect our own dataset of romanized Russian text from the social network website vk.com, adopting an approach similar to the one described by Darwish (2014). We take a list of the 50 most frequent Russian lemmas (Lyashevskaya and Sharov, 2009), filter out those shorter than 3 characters, and produce a set of candidate romanizations for each of them to use as queries to the vk.com API. In order to encourage diversity of romanization styles in our dataset, we generate the queries by defining all plausible visual and phonetic mappings for each Cyrillic character and applying all possible combinations of those substitutions to the underlying Russian word, as in the sketch below. We scrape public posts on the user and group pages, retaining only the information about which posts were authored by the same user, and manually go over the collected set to filter out coincidental results.
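The query generation step amounts to a Cartesian product over per-character substitution sets. The variants dictionary below shows only the three characters of для, with mappings taken from the examples in this paper; the real mapping covers the whole Cyrillic alphabet.

```python
from itertools import product

# Plausible per-character romanizations (tiny illustrative subset)
variants = {"д": ["d"], "л": ["l"], "я": ["ya", "9|"]}

def candidate_romanizations(word):
    """Every combination of per-character substitutions, used as vk.com
    API search queries."""
    options = [variants.get(c, [c]) for c in word]
    return {"".join(choice) for choice in product(*options)}

print(candidate_romanizations("для"))   # {'dlya', 'dl9|'}
```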

Our dataset consists of 1796 wall posts from 1681 users and communities. Since the posts are quite long on average (248 characters; the longest ones run up to 15K), we split them into sentences using the NLTK sentence tokenizer, with manual correction when needed.


The obtained sentences are used as data points, split into training, validation and test according to the numbers in Table 2. The average length of an obtained sentence is 65 characters, which is 3 times longer than an average Arabizi sentence; we believe this is due to the different nature of the data (social media posts vs. SMS). Sentences collected from the same user are distributed across different splits so that we observe a diverse set of romanization preferences in both training and testing. Each sentence in the validation and test sets is annotated by one of two native speaker annotators, following guidelines similar to those designed for the Arabizi BOLT data (Bies et al., 2014). For more details on the annotation guidelines and inter-annotator agreement, see Appendix A.

Since we do not have enough annotations to train the Russian language model on the same corpus, we use a separate in-domain dataset. We take a portion of the Taiga dataset (Shavrina and Shapovalova, 2017), containing 307K comments scraped from the same social network vk.com, and apply the same preprocessing steps as in the collection process.

5 Experiments

Here we discuss the experimental setup used to determine how much information relevant to our task is contained in the character similarity mappings, and how it compares to the amount of information encoded in the human annotations. We compare them by evaluating the effect of the informative priors (described in §3.2) on the performance of the unsupervised model and comparing it to the performance of the supervised model.

Methods  We compare the performance of our model trained in three different setups: unsupervised with a uniform prior on the emission parameters, unsupervised with informative phonetic and visual priors (§3.2), and supervised. We additionally compare them to a commercial online decoding system for each language (directly encoding human knowledge about the transliteration process) and a character-level unsupervised neural machine translation architecture (encoding no assumptions about the underlying process at all).

We train the unsupervised models with the stepwise EM algorithm as described in §3.3.1, performing stochastic updates and making only one pass over the entire training set. The supervised models are trained on the validation set with five iterations of EM and a six-gram transition model. It should be noted that only a subset of the validation data is actually used in the supervised training: if the absolute value of the delay of the emission WFST paths is limited by n, we cannot compose a lattice for any data points where the input and output sequences differ in length by more than n (those constitute 22% of the Arabic validation data and 33% of the Russian validation data, for n = 5 and n = 2 respectively). Since all of the Arabic data comes annotated, we can perform the same experiment using the full training set; surprisingly, the performance of the supervised model does not improve (see Table 3).

The online transliteration decoding systems we use are translit.net for Russian and Yamli [8] for Arabic. The Russian decoder is rule-based; what algorithm the Arabic decoder uses is not disclosed.

We take the unsupervised neural machine translation (UNMT) model of Lample et al. (2018) as the neural baseline, using the implementation from the codebase of He et al. (2020), with one important difference: since the romanization process is known to be strictly character-level, we tokenize the text into characters rather than words.

Implementation  We use the OpenFst library (Allauzen et al., 2007) for the implementation of all the finite-state methods, in conjunction with the OpenGrm NGram library (Roark et al., 2012) for training the transition model specifically. We train the character-level n-gram models with Witten–Bell smoothing (Witten and Bell, 1991) of orders from two to six. Since the WFSTs encoding full higher-order models become very large (for example, the Russian six-gram model has 3M states and 13M arcs), we shrink all the models except the bigram one using relative entropy pruning (Stolcke, 1998). However, since pruning decreases the quality of the language model, we observe most of the improvement in accuracy while training with the unpruned bigram model, and the subsequent order increases lead to relatively minor gains. Hyperparameter settings for training the transition and emission WFSTs are described in Appendix B.

We optimize the delay limit for each language separately, obtaining best results with 2 for Russian and 5 for Arabic. To approximate the monotonic

[8] https://www.yamli.com/


                                   Arabic    Russian
Unsupervised: uniform prior        0.735     0.660
Unsupervised: phonetic prior       0.377     0.222
Unsupervised: visual prior         —         0.372
Unsupervised: combined prior       —         0.212
Supervised                         0.225*    0.140
UNMT                               0.791     0.242
Commercial                         0.206     0.137

Table 3: Character error rate for different experimental setups. We compare unsupervised models with and without informative priors to the supervised model (trained on validation data) and a commercial online system. We do not have a visual prior for Arabic because Arabic–Latin visual character similarity is not captured by the restrictive confusables list that defines the prior (see §3.2). Each supervised and unsupervised experiment is performed with 5 random restarts. *The Arabic supervised result is for the model trained on the validation set; training on the 5K training set yields 0.226.

word-level alignment between the original and Latin sequences, we restrict the operations on the space character to only three: insertion, deletion, and substitution with itself. We apply the same restriction to punctuation marks (with specialized substitutions for certain Arabic symbols, such as ? → ؟). This substantially reduces the number of arcs in the emission WFST, as punctuation marks make up over half of each of the alphabets.

Evaluation  We use character error rate (CER) as our evaluation metric. We compute CER as the ratio of the character-level edit distance between the predicted original script sequence and the human annotation to the length of the annotation sequence in characters.
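Concretely, CER is the Levenshtein distance divided by the reference length; a compact implementation with a rolling one-row table:

```python
def cer(prediction, reference):
    """Character error rate: Levenshtein distance between prediction and
    reference, normalized by the reference length."""
    n, m = len(prediction), len(reference)
    dist = list(range(m + 1))          # row for the empty prediction prefix
    for i in range(1, n + 1):
        prev, dist[0] = dist[0], i     # prev holds D[i-1][j-1]
        for j in range(1, m + 1):
            cur = dist[j]              # D[i-1][j]
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            dist[j] = min(dist[j] + 1,      # deletion
                          dist[j - 1] + 1,  # insertion
                          prev + cost)      # substitution / match
            prev = cur
    return dist[m] / m if m else float(n > 0)

print(cer("хорошо", "хорошо"))   # 0.0
```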

6 Results and analysis

The CER values for the models we compare are presented in Table 3. One trend we notice is that the error rate is lower for Russian than for Arabic in all the experiments, including the uniform prior setting, which suggests that decoding Arabizi is an inherently harder task. Some of the errors of the Arabic commercial system could be explained by the decoder predictions being plausible but not matching the CODA orthography of the reference.

Original         Latin
р /r/            r (.93), p (.05)
б /b/            b (.95), 6 (.02)
в /v/            v (.87), 8 (.05), w (.05)
و /w, uː, oː/    w (.48), o (.33), u (.06)
خ /x/            5 (.76), k (.24)

Table 4: Emission probabilities learned by the supervised model (compare to Table 1). All substitutions with probability greater than 0.01 are shown.

Effect of priors  The unsupervised models without an informative prior perform poorly for either language, which means that there is not enough signal in the language model alone under the training constraints we enforce. Possibly, the algorithm could have converged to a better local optimum if we did not use the online algorithm and did not prune both the language model and the emission model; however, that experiment would be infeasibly slow. Incorporating a phonetic prior reduces the error rate by 0.36 and 0.44 for Arabic and Russian respectively, which provides a substantial improvement while maintaining the efficiency advantage. The visual prior for Russian appears slightly less helpful, improving CER by 0.29. We attribute the better performance of the phonetic prior either to the sparsity and restrictiveness of the visually confusable symbol mappings, or to phonetic substitutions being more popular with users. Finally, combining the two priors for Russian leads to a slight additional improvement in accuracy over the phonetic prior alone.

We additionally verify that phonetic and visual similarity-based substitutions are prominent in informal romanization by inspecting the emission parameters learned by the supervised model with a uniform prior (Table 4). We observe that: (a) the highest-probability substitutions can be explained by either phonetic or visual similarity, and (b) the external mappings we use for our priors are indeed appropriate, since the supervised model recovers the same mappings from the annotated data.

Error analysis  Figure 4 shows some of the elements of the confusion matrices for the test predictions of the best-performing unsupervised models in both languages. We see that many of the frequent errors are caused by the model failing to disambiguate between two plausible decodings of a Latin character, either mapped to it


through different types of similarity (н /n/ [phonetic] → n ← [visual] п, н [visual] → h ← [phonetic] х /x/), or through the same one (visual в → 8 ← ь, phonetic ه /h/ → h ← ح /ħ/); such cases could be ambiguous for humans to decode as well.

Other errors in Figure 4 illustrate the limitations of our parameterization and of the resources we rely on. Our model does not allow one-to-many alignments, which leads to digraph interpretation errors such as س /s/ + ه /h/ → sh ← ش /ʃ/. Some artifacts of the resources our priors are based on also pollute the results: for example, the confusion between ь and х in Russian is explained by the Russian soft sign ь, which has no English phonetic equivalent, being arbitrarily mapped to the Latin x in one of the phonetic keyboard layouts.

Comparison to UNMT  The unsupervised neural model trained on Russian performs only marginally worse than the unsupervised WFST model with an informative prior, demonstrating that with a sufficient amount of data the neural architecture is powerful enough to learn the character substitution rules without the inductive bias. However, we cannot say the same about Arabic—with a smaller training set (see Table 2), the UNMT model is outperformed by the unsupervised WFST even without an informative prior. The main difference in performance between the two models comes down to the trade-off between structure and power: although the neural architecture captures long-range dependencies better due to having a stronger language model, it does not provide an easy way of enforcing character-level constraints on the decoding process, which the WFST model encodes by design. As a result, we observe that while the UNMT model can recover whole words more successfully (for Russian it achieves a 45.8 BLEU score, while the best-performing unsupervised WFST is at 20.4), it also tends to arbitrarily insert or repeat words in the output, which leads to higher CER.

7 Conclusion

This paper tackles the problem of decoding non-standardized informal romanization used in social media into the original orthography without parallel text. We train a WFST noisy-channel model to decode romanized Egyptian Arabic and Russian to their original scripts with the stepwise EM algorithm combined with curriculum learning, and demonstrate that while the unsupervised model by itself performs poorly, introducing an informative prior that encodes the notion of phonetic or visual character similarity brings its performance substantially closer to that of the supervised model.

[Figure 4 shows two confusion matrix fragments; the Arabic fragment (left) involves the characters ه, س, ع, إ, ش, ح, and the Russian fragment (right) involves н, ь, х, п, в.]

Figure 4: Fragments of the confusion matrix comparing test-time predictions of the best-performing unsupervised models for Arabic (left) and Russian (right) to human annotations. Each number represents the count of the corresponding substitution in the best alignment (edit distance path) between the predicted and gold sequences, summed over the test set. Rows stand for predictions, columns correspond to ground truth.


The informative priors used in our experiments are constructed from sets of character mappings compiled for other purposes but built on the same underlying principle (phonetic keyboard layouts and the Unicode confusable symbol list). While these mappings provide a convenient way to avoid formalizing the complex notions of phonetic and visual similarity, they are restrictive and do not capture all the diverse aspects of similarity found in idiosyncratic romanization, so designing more suitable priors by operationalizing the concept of character similarity could be a promising direction for future work. Another research avenue that could be explored is modeling specific user preferences: since each user likely favors a certain set of character substitutions, allowing user-specific parameters could improve decoding and be useful for authorship attribution.

Acknowledgments

This project is funded in part by the NSF under grants 1618044 and 1936155, and by the NEH under grant HAA256044-17. The authors thank John Wieting, Shruti Rijhwani, David Mortensen, Nikita Srivatsan, and Mahmoud Al Ismail for helpful discussion, Junxian He for help with the UNMT experiments, Stas Kashepava for data annotation, and the three anonymous reviewers for their valuable feedback.


References

Mohamed Al-Badrashiny, Ramy Eskander, Nizar Habash, and Owen Rambow. 2014. Automatic transliteration of Romanized dialectal Arabic. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 30–38, Ann Arbor, Michigan. Association for Computational Linguistics.

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47, Sapporo, Japan. Association for Computational Linguistics.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11–23. Springer. http://www.openfst.org.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. Association for Computing Machinery.

Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash, Ramy Eskander, and Owen Rambow. 2014. Transliteration of Arabizi into Arabic orthography: Developing a parallel annotated Arabizi-Arabic script SMS/chat corpus. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 93–103, Doha, Qatar. Association for Computational Linguistics.

Aimilios Chalamandaris, Athanassios Protopapas, Pirros Tsiakoulis, and Spyros Raptis. 2006. All Greek to me! An automatic Greeklish to Greek transliteration system. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Aimilios Chalamandaris, Pirros Tsiakoulis, Spyros Raptis, G. Giannopoulos, and George Carayannis. 2004. Bypassing Greeklish! In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal. European Language Resources Association (ELRA).

Kareem Darwish. 2014. Arabizi detection and conversion to Arabic. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 217–224, Doha, Qatar. Association for Computational Linguistics.

Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 2427–2430, New York, NY, USA. Association for Computing Machinery.

Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1–8, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash, and Owen Rambow. 2014. Foreign words and the automatic processing of Arabic social media text written in Roman script. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 1–12, Doha, Qatar. Association for Computational Linguistics.

Evgeniy Gabrilovich and Alex Gontmakher. 2002. The homograph attack. Communications of the ACM, 45(2):128.

Nizar Habash, Mona Diab, and Owen Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 711–718, Istanbul, Turkey. European Language Resources Association (ELRA).

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.

Lars Hellsten, Brian Roark, Prasoon Goyal, Cyril Allauzen, Françoise Beaufays, Tom Ouyang, Michael Riley, and David Rybach. 2017. Transliterated mobile keyboard input via weighted finite-state transducers. In Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017), pages 10–19, Umeå, Sweden. Association for Computational Linguistics.

Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.


Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611–619, Boulder, Colorado. Association for Computational Linguistics.

Olga N. Lyashevskaya and Sergey A. Sharov. 2009. Frequency dictionary of modern Russian based on the Russian National Corpus [Chastotnyy slovar' sovremennogo russkogo jazyka (na materiale Nacional'nogo korpusa russkogo jazyka)]. Azbukovnik, Moscow.

Martin Paulsen. 2014. Translit: Computer-mediated digraphia on the Runet. Digital Russia: The Language, Culture and Politics of New Media Communication.

Nima Pourdamghani and Kevin Knight. 2017. Deciphering related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2513–2518, Copenhagen, Denmark. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45, Boulder, Colorado. Association for Computational Linguistics.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea. Association for Computational Linguistics.

Tatiana Shavrina and Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: Taiga syntax tree corpus and parser. In Proc. CORPORA 2017 International Conference, pages 78–84, St. Petersburg.

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan, and Ann Sawyer. 2014. Collecting natural SMS and chat conversations in multiple languages: The BOLT phase 2 corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1699–1704, Reykjavik, Iceland. European Language Resources Association (ELRA).

Andreas Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, pages 270–274.

Ian H. Witten and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Lawrence Wolf-Sonkin, Vlad Schogol, Brian Roark, and Michael Riley. 2019. Latin script keyboards for South Asian languages with finite-state normalization. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing, pages 108–117, Dresden, Germany. Association for Computational Linguistics.


A Data collection and annotation

Preprocessing  We generate a set of 270 candidate transliterations of 26 Russian words to use as queries. However, many of the produced combinations are highly unlikely and yield no results, and some happen to share their spelling with words in other languages (most often other Slavic languages that use Latin script, such as Polish). We scrape public posts on user and group pages, retaining only the information about which posts were authored by the same user, and manually go over the collected set to filter out coincidental results. We additionally preprocess the collected data by normalizing punctuation and removing non-ASCII characters and emoji. We also collapse all substrings of the same character repeated more than twice to exactly two repetitions, as suggested by Darwish et al. (2012), since these repetitions are more likely to be a written expression of emotion than to be explained by the underlying Russian sentence. The same preprocessing steps are applied to the original script side of the data (the annotations and the monolingual language model training corpus) as well.
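The repetition-collapsing step is a one-line regular expression, for example:

```python
import re

def squeeze_repeats(text):
    """Collapse runs of 3+ identical characters to two, following
    Darwish et al. (2012): 'koooool' -> 'kool'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(squeeze_repeats("koooool"))   # kool
```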

Annotation guidelines  While transliterating, annotators perform orthographic normalization wherever possible, correcting typos and errors in word boundaries; grammatical errors are not corrected. Tokens that do not require transliteration (foreign words, emoticons) or that the annotator fails to identify (proper names, badly misspelled words) are removed from the romanized sentence and not transliterated. Although this means that some of the test set sentences will not exactly represent the original romanized sequence, it helps us ensure that we are only testing our model's ability to transliterate rather than to make word-by-word normalization decisions.

In addition, 200 of the validation sequences are dually annotated to measure inter-annotator agreement. We evaluate it using character error rate (CER; edit distance between the two sequences normalized by the length of the reference sequence), the same metric we use to evaluate the model's performance. In this case, since neither of the annotations is the ground truth, we compute CER in both directions and average. Despite the discrepancies caused by the annotators deleting unknown words at their discretion, the average CER is only 0.014, which indicates a very high level of agreement.

B Hyperparameter settings

WFST model  The Witten–Bell smoothing parameter for the language model is set to 10, and the relative entropy pruning threshold is 10^−5 for the trigram model and 2·10^−5 for higher-order models. Unsupervised training is performed in batches of size 10, and the language model order is increased every 100 batches. While training with the bigram model, we disallow insertions and freeze all the deletion probabilities at e^−100. The EM stepsize decay rate is β = 0.9. The emission arc pruning threshold is gradually decreased from 5 to 4.5 (in negative log probability space). We perform multiple random restarts for each experiment, initializing the emission distribution to uniform plus random noise.

UNMT baseline  Our unsupervised neural baseline uses a single-layer LSTM with hidden state size 512 for both the encoder and the decoder. The embedding dimension is set to 128. For the denoising autoencoding loss, we adopt the default noise model and hyperparameters as described by Lample et al. (2018). The autoencoding loss is annealed over the first 3 epochs.

We tune the maximum training sequence length (controlling how much training data is used) and the maximum allowed decoding length by optimizing the validation set CER. In our case, the maximum output length is important because the evaluation metric penalizes any discrepancy in length between the prediction and the reference; we observe the best results when setting it to 40 characters for Arabic and 180 for Russian. At training time, we filter out sequences longer than 100 characters for either language, which constitute 1% of the available Arabic training data (the Arabic-only LM training set and the Latin-only training set combined) but almost 70% of the Russian data. Surprisingly, the Russian model trained on the remaining 30% achieves better results than the one trained on the full data; we hypothesize that the improvement comes from having a more balanced training set, since the full data is otherwise heavily skewed towards the Cyrillic side (the LM training set; see Table 2).