Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, University of Illinois
[email protected]
Katherine J. Zhang, Carnegie Mellon University
[email protected]
Coleman Haley, Johns Hopkins University
[email protected]
Kenneth Steimel, Indiana University
[email protected]
Han Liu, University of Chicago∗
[email protected]
Lane Schwartz, University of Illinois
[email protected]
Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.1 We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.
1 Introduction
With most research in Natural Language Processing (NLP) directed at a small subset of the world's languages, whether the techniques developed are truly language-agnostic is often not known. Because the vast majority of research focuses on English with Chinese a distant second (Mielke, 2016), neither of which is morphologically rich, the impact of morphology on NLP tasks for various languages is not entirely understood.

Several studies have investigated this issue in the context of language modeling by comparing a number of languages, but found conflicting results. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text like the number of types explain differences in modeling difficulty, rather than morphological measures.

∗Work done while at University of Colorado Boulder
1https://github.com/hayleypark/MorphologyMatters
This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We train language models for 92 languages from a corpus of Bibles fully aligned at the verse level and measure language modeling performance using surprisal (the negative log-likelihood) per verse (see §4.5). We investigate how this measure is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.
Additionally, we contend that the relation between segmentation method, morphology, and language modeling performance needs further investigation. Byte-Pair Encoding (BPE; Shibata et al., 1999) is widely used in NLP tasks including machine translation (Sennrich et al., 2016) as an unsupervised information-theoretic method for segmenting text data into subword units. Variants of BPE or closely related methods such as WordPiece (Kudo, 2018) are frequently employed by state-of-the-art pretrained language models (Liu et al., 2019; Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019). However, BPE and other segmentation methods may vary in how closely they capture morphological segments for a given language, which may affect language modeling performance.

Therefore, this paper focuses on the following two research questions:

1. Does a language's morphology influence language modeling difficulty?
2. If so, how do different segmentation methods interact with morphology?
arXiv:2012.06262v1 [cs.CL] 11 Dec 2020

In order to answer the first question, we train models using data sets segmented by characters and BPE units. Our results show that BPE language modeling surprisal is significantly correlated with measures of morphological typology and complexity. This suggests that BPE segments are ineffective in mitigating the effect of morphology in language modeling.
As for the second question, we consider more linguistically-motivated segmentation methods to compare with BPE: Morfessor (Creutz and Lagus, 2007) and Finite-State Transducers (FSTs) (see §4.3). Our comparison of the models using the different segmentation methods shows that Morfessor reduces the impact of morphology for more languages than BPE. FST-based segmentation methods outperform the other segmentation methods when available. These results suggest that morphologically motivated segmentations improve cross-linguistic language modeling.
2 Modeling difficulty across languages
Studies have demonstrated that different languages may be unequally difficult to model and have tested the relations between such modeling difficulty and morphological properties of languages, using different segmentation methods.
Vania and Lopez (2017) compared the effectiveness of word representations based on different segmentation methods in modeling 10 languages with various morphological typologies. They trained word-level language models, but utilized segmentation methods to create word embeddings that include segment-level information. Comparing character, BPE, and Morfessor segmentations, they concluded that character-based representations were most effective across languages, with BPE always outperforming Morfessor. However, models based on hand-crafted morphological analyses outperformed all other segmentation methods by a wide margin.
Gerz et al. (2018) trained n-gram and neural language models over 50 languages and argued that the type of morphological system is predictive of model performance. Their results show that languages differ with regard to modeling difficulty. They attributed the differences among languages to four types of morphological systems: isolating, fusional, introflexive, and agglutinative. While they found a significant association between the morphological type and modeling difficulty, Type-Token Ratio (TTR) was the most predictive of language modeling performance.

Cotterell et al. (2018) arrived at a similar conclusion modeling 21 languages using the Europarl corpus (Koehn, 2005). When trained with n-gram and character-based Long Short-Term Memory (LSTM) models, the languages showed different modeling difficulties, which were correlated with a measure of morphology, Morphological Counting Complexity (MCC) or the number of inflectional categories (Sagot, 2013).
However, Mielke et al. (2019) failed to reproduce the correlation with MCC when they increased the scope to 69 languages, utilizing a Bible corpus (Mayer and Cysouw, 2014). They also reported no correlation with measures of morphosyntactic complexity such as head-POS entropy (Dehouck and Denis, 2018) and other linguist-generated features (Dryer and Haspelmath, 2013). Rather, they found that simpler statistics, namely the number of types and number of characters per word, correlate with language model surprisal using BPE and character segmentation, respectively.
3 Morphological measures
Different kinds of measures can be used to represent a language's morphology.
3.1 Linguist-generated measures
The most linguistically-informed measures of morphology involve expert descriptions of languages. The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) has been used frequently in the literature to provide typological information. WALS is a large database of linguistic features gathered from descriptive materials, such as reference grammars. It contains 144 chapters in 11 areas including phonology, morphology, and word order. Each chapter describes a feature with categorical values and lists languages that have each value. However, not all languages in the database have data for all the features, and for some languages there is no data at all.
The studies reviewed in §2 all relied on this expert-description approach to quantify morphological properties. Gerz et al. (2018) focused on WALS descriptions of inflectional synthesis of verbs, fusion, exponence, and flexivity, while Mielke et al. (2019) looked at two WALS features, 26A "Prefixing vs. Suffixing in Inflectional Morphology" and 81A "Order of Subject, Object and Verb." Cotterell et al. (2018) used UniMorph (Kirov et al., 2018), instead of WALS, to calculate MCC. Vania and Lopez (2017) did not cite any databases but provided descriptions of four morphological types (fusional, agglutinative, root-and-pattern, and reduplication) and categorized 10 languages into these types.
A major issue with this approach to representing morphology is that there is not enough expert data available to enable comparisons across many different languages. In fact, Mielke et al. (2019) chose their two WALS features because data for these features existed for most of their languages. Moreover, Bentz et al. (2016) showed that their WALS-based measure had lower correlations with other measures of morphological complexity due to this issue of missing data.
3.2 Corpus-based measures
In contrast, corpus-based measures of morphology can be easily calculated on a given data set. These measures include the number of types, Type-Token Ratio (TTR), Moving-Average TTR (MATTR; Covington and McFall, 2010), and Mean Length of Words (MLW). The exact definition of the measures may vary depending on studies, but we define them as in Table 1, where a word token is a string separated by spaces in the training set after tokenization but before segmentation.
While some studies (e.g., Mielke et al., 2019) consider these measures as simple statistics of a corpus, other studies have found that they can be used as approximate measures of morphological complexity. Kettunen (2014) showed that TTR, MATTR, and MLW can capture the overall ranking of morphological complexity generated by information-theoretic and expert-generated measures of morphological complexity. Bentz et al. (2016) compared different measures of morphological complexity for 519 languages across 101 families and showed a strong correlation between all measures, which were based on corpus statistics, linguistic expertise, information theory, and translation alignment. They argued that corpus-based measures, including TTR, and other measures of morphological complexity can be used interchangeably. In addition, Gerz et al. (2018) showed that TTR is influenced by the morphological typology of a language. According to them, isolating languages tend to have small TTR values and are often easier to model while the opposite is true for agglutinative languages.

Measure   Definition
Types     Number of unique word tokens
TTR       Number of unique word tokens divided by total number of word tokens
MATTR     Average TTR calculated over a moving window of 500 word tokens
MLW       Average number of characters per word token

Table 1: Corpus-based measures of morphology defined for this study. These measures are calculated on tokenized data sets before applying any segmentation method.

Given the previous literature, we utilize these corpus-based measures, as well as expert-generated WALS features, as a proxy for morphological differences among languages in our study.
4 Methods
We design our experiments to test if a language's morphology is correlated with language model performance, depending on the segmentation method. We represent a language's morphology using WALS features and corpus statistics. We train language models for Bible translations in 92 languages based on five different segmentation methods: character, BPE, Morfessor, and FST with BPE or Morfessor back-off strategies (FST+BPE & FST+Morfessor). We use surprisal per verse (Mielke et al., 2019) as the evaluation metric to compare language modeling performance across different languages and different segmentation methods. Additionally, we quantify the difference in surprisal per verse between segmentation methods to compare the relative strength of each segmentation method with regard to morphological complexity.
4.1 Data

Our data consist of 145 Bible translations in 92 languages covering 22 language families,2 fully aligned at the verse level. The majority of the data came verse-aligned from Mielke et al. (2019) (original data from Mayer and Cysouw, 2014). We added more Bibles from another corpus (Christodoulopoulos and Steedman, 2014) and from online Bible resources (see Appendix A for more information). We refer to each language by ISO 639-3 code when applicable.

2For each language, we report the family assigned by WALS (Dryer and Haspelmath, 2013): 6 Afro-Asiatic, 1 Algic, 1 Altaic, 2 Austro-Asiatic, 6 Austronesian, 1 Aymaran, 3 Dravidian, 4 Eskimo-Aleut, 1 Guaicuruan, 33 Indo-European, 1 Japanese, 1 Korean, 1 Mande, 6 Mayan, 6 Niger-Congo, 4 Quechuan, 5 Sino-Tibetan, 1 Songhay, 1 Tai-Kadai, 2 Tupian, 2 Uralic, 2 Uto-Aztecan, 2 creoles.
We followed Mielke et al. (2019)'s method to split the data into training, development, and test sets: the verse-aligned data were divided into blocks of 30 verses, with the first five verses being assigned to the development set, the next five to the test set, and the rest to the training set. The resulting training set had 16,926 verses while the development and test sets had 4,225 verses each.
It should be noted that both Mielke et al. (2019) and Christodoulopoulos and Steedman (2014) provided tokenized data. We tokenized the newly added Bibles using Mielke and Eisner (2019)'s tokenizer, following Mielke et al. (2019). When both tokenized and untokenized versions were available, we included the tokenized versions only.
We chose to replace characters that only occurred one time with a special UNK symbol. Mielke et al. (2019) applied this procedure to characters that appear less than 25 times in the training set except for Chinese, where only singleton characters were replaced. Because we added several languages where the original strategy would have resulted in removing too much data, we preprocessed singleton characters across the board.
We also corrected several errors present in the data. For example, the Bible translations in Shona (sna) and Telugu (tel) were mis-coded as Shan (shn) and Tecpatlán Totonac (tcw), respectively.
4.2 Morphological measures selected
In this paper, we adopt two approaches to representing a language's morphology. First, we rely on expert descriptions of languages in WALS, manually augmenting the database to rectify the issue of missing data. Second, we utilize corpus-based measures like TTR to represent the morphological complexity of a given language.
WALS features  While some previous studies (e.g., Gerz et al., 2018; Vania and Lopez, 2017) categorized relatively well-known languages into a small number of morphological types, such categorization is not always clear. Some other studies (e.g., Cotterell et al., 2018; Mielke et al., 2019) selected a small number of available typological features to compare, but their conclusions were at odds, possibly calling for exploration of other measures. Therefore, we consider all available morphological features described by WALS to explore which features affect language modeling and how. Instead of making theoretical claims about morphological typology, we explore which typological features make a language's morphology more complex for LSTM language models.

ID    Name
20A   Fusion of Selected Inflectional Formatives
21A   Exponence of Selected Inflectional Formatives
21B   Exponence of Tense-Aspect-Mood Inflection
22A   Inflectional Synthesis of the Verb
23A   Locus of Marking in the Clause
24A   Locus of Marking in Possessive Noun Phrases
25A   Locus of Marking: Whole-language Typology
25B   Zero Marking of A and P Arguments
26A   Prefixing vs. Suffixing in Inflectional Morphology
27A   Reduplication
28A   Case Syncretism
29A   Syncretism in Verbal Person/Number Marking

Table 2: The 12 morphological features in WALS.
To that end, we augmented the existing WALS database by consulting reference grammars for each language. Of the 92 languages in our corpus, six were not in the WALS database.3 In addition, many of the languages in the database had missing data for some features. For example, we had no data for any of the morphological features of Afrikaans (afr). We manually assigned missing features where possible following the descriptions in the relevant WALS chapters regarding the procedures used to assign feature values to languages.

Of the almost 200 features in WALS, the editors of the database labeled 12 of them as morphological features. Therefore, we considered these 12 features, listed in Table 2 and described below,4 to test the hypothesis that morphological complexity correlates with modeling difficulty.

3ikt, lat, nch, tbz, wbm, zom
4See https://wals.info/chapter for more details and examples of these features.
Feature 20A describes how closely grammatical markers (inflectional formatives) are phonologically connected to a host word or stem. The markers can be isolating, concatenative, or even nonlinear (i.e., ablaut and tone).

Features 21A and 21B measure the exponence of selected grammatical markers. Exponence refers to the number of categories that a single morpheme expresses. For 21A, the selected grammatical markers were case markers. For 21B, they were tense-aspect-mood (TAM) markers.

Feature 22A measures how many grammatical categories may appear on verbs in a language. These categories include tense-aspect-mood, negation, voice, and agreement.

Features 23A through 25B describe the existence and locus of marking in different kinds of phrases. A phrase may have marking on either its head, its dependent(s), both, or neither. In full clauses, the verb is the head, and the subject and object arguments are dependents. In possessive noun phrases, the possessed noun is the head while the possessor is dependent.

Feature 26A measures the degree to which languages use prefixes versus suffixes in their inflectional morphology. Feature 27A describes which languages use reduplication productively and whether or not both full and partial reduplication are used.

Both Features 28A and 29A measure syncretism. Syncretism occurs when a single inflected form corresponds to more than one function. 28A measures case syncretism specifically while 29A measures syncretism in the subject agreement marking of verbs.
Types, TTR, MATTR, and MLW  We calculated the number of types, TTR, Moving-Average TTR, and Mean Length of Word using an adapted script from the Python module LexicalRichness.5 We used a window size of 500 for Moving-Average TTR, following previous studies (e.g., Kettunen, 2014). The definitions of the measures are found in Table 1. All measures were calculated based on the word tokens in the training set before applying any segmentation method.

5https://github.com/LSYS/LexicalRichness
4.3 Segmentation methods
We chose to train only open-vocabulary language models for fair comparison. Word-level models will predict UNK for out-of-vocabulary word tokens and cannot be fairly compared with character- and subword-level models as a result. Specifically, we trained language models using five segmentation methods: character, BPE, Morfessor, FST+BPE, and FST+Morfessor. These segmentation methods provide a way to segment any given text into smaller pieces, some of which approximate morphemes.
A morpheme is the smallest meaning-bearing morphological unit while a morph is the surface representation of one or more morphemes. Linguistically-motivated methods like Morfessor and FSTs are designed with the goal of producing subword segments that are closely aligned to the true morphs comprising a word. While BPE was not designed with morpheme segmentation in mind, its resulting subwords are commonly believed to align with morphs to some degree due to morph subsequences being frequent in the data.

Segmenting words into morphs may reduce the impact of rich morphology as highly inflected words can be broken into smaller pieces that are likely to contribute similar meanings across contexts in the corpus. Table 3 provides examples of the segmentation methods we used to train language models. The original verse is provided for reference only and not used to train any models.
Character  We trained character-based language models, following previous studies (Mielke et al., 2019; Gerz et al., 2018; Cotterell et al., 2018). Character language models are trained to predict the next character given the preceding context, and the vocabulary includes an underscore 〈_〉 to denote word boundaries.
BPE  We trained BPE-based language models, following Mielke et al. (2019). Starting with character segmentation, BPE operations combine characters into larger chunks based on their frequencies to create units somewhere between characters and words, with the number of merge operations as the hyperparameter (Sennrich et al., 2016). We used 0.4 × types as the number of merges, as Mielke et al. (2019) reported that to be most effective with their corpus.6 BPE language models

6Additional static numbers of merge operations were also tested, with nearly identical results.
Segmentation    Example
Tokenized       Yuhannanın kardeşi Yakubu kılıçla öldürdü .
Character       Y u h a n n a n ı n _ k a r d e ş i _ Y a k u b u _ k ı l ı ç l a _ ö l d ü r d ü .
BPE             Yuhan@@ nanın kardeşi Yakubu kılıçla öldürdü .
Morfessor       Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öldürdü .
FST+BPE         Yuhan@@ nanın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .
FST+Morfessor   Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .

Table 3: Turkish examples for different segmentation methods. An English translation is "And he killed James the brother of John with the sword" (Acts 12:2). FST does not produce analyses for Yuhannanın ("John's"), for which BPE or Morfessor back-off was used. The segmentation created by human experts was the same as FST+Morfessor. 〈@@〉 denotes subword segmentation while 〈_〉 encodes space between word tokens for character segmentation.
are trained to predict the next BPE unit. The double at sign 〈@@〉 is used to indicate segments that are not word-final.
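For illustration, the merge-learning step of BPE can be sketched as follows (a toy re-implementation in the spirit of Sennrich et al., 2016, not the authors' pipeline; `word_freqs` maps whole words to counts):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent
    adjacent symbol pair across the (weighted) vocabulary."""
    # represent each word as a tuple of symbols (initially characters)
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

In the paper's setup, `num_merges` would be 0.4 × the number of word types in the training set.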
Morfessor  Morfessor (Creutz and Lagus, 2007) is a word segmentation method explicitly designed for morphological segmentation. The default implementation utilizes a unigram language model to find morph-like constructs. While like BPE this approach is information-theoretic, it selects segments top-down and includes a prior term for the length of segments, regularizing segments to be more plausible morphemes.

Using the default settings with Morfessor 2.0 (Virpioja et al., 2013), we trained Morfessor on the training set and applied the segmentation to all data sets. Just like BPE, the language models are trained to predict the next morph unit.
FST  While segmentation based on BPE and Morfessor may or may not resemble actual morphemes, morpheme segmentation from Finite-State Transducers (FSTs) provides a knowledge-based method to segment a text into morphemes. Finite-state morphological analyzers are rule-based systems that take a surface string as input and produce all possible morphological analyses as output. To use FSTs for segmentation, we changed existing morphological analyzers into segmenters and developed a heuristic to select one analysis for a given word token. FSTs for Plains Cree (Arppe et al., 2014–2019), German (Schmid et al., 2004), English (Axelson et al., 2015), Finnish (Pirinen, 2015), Indonesian (Larasati et al., 2011), Cuzco Quechua (Vilca et al., 2012), and Turkish (Çağrı Çöltekin, 2014, 2010) were used as morphological segmenters.
Most FSTs are designed to provide analyses for surface forms, not morphological segmentations. Fortunately, morpheme boundaries are frequently part of FSTs due to their relevance for lexico-phonological phenomena. By modifying the FST before the cleanup rules that remove morpheme boundaries can apply, we create a morphological segmenter that takes in a surface form and returns the surface form with morpheme boundary markers. If the analyzer provides segmentations, the transducer is used as-is.
For example, the Turkish FST produces a morphological analysis for the surface form kılıçla ("with the sword") in the example in Table 3: kılıç. Instead of producing such an analysis for the given word, the segmenter instead produces the segmented surface form kılıç@@ la, which is used in the FST segmentation methods.
Because a FST may return multiple analyses or segmentations given a single word, a heuristic method was used to determine which segmentation to select. In general, we chose the segmentation with the fewest segments. However, the English segmenter based on Axelson et al. (2015) always returns the input string itself as a possible segmentation if covered by the analyzer. For example, walks would produce two segmentations in the English segmenter: walks and walk@@ s. For this segmenter, we selected the fewest number of segments excluding the input string itself (e.g., choosing walk@@ s over walks).
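The selection heuristic can be sketched as follows (our illustration; the optional `surface` argument handles the English-segmenter case described above):

```python
def pick_segmentation(candidates, surface=None):
    """Choose one FST output per word: the candidate with the fewest
    segments, optionally excluding the unsegmented surface form itself.
    Each candidate is a list of segment strings."""
    if surface is not None:
        filtered = [c for c in candidates if c != [surface]]
        if filtered:
            candidates = filtered
    return min(candidates, key=len)
```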
When a FST produces no analyses for a given word, as in the case of Yuhannanın ("John's") in Table 3, we adopt the FST-augmented BPE segmentation (FST+BPE) and FST-augmented Morfessor segmentation (FST+Morfessor), where we fall back to BPE or Morfessor segmentation whenever FST segmentation is unavailable. As shown in the table, FST+BPE and FST+Morfessor only differ in the segmentation of the unanalyzed word. For this particular verse, the human segmentation agrees with the FST+Morfessor segmentation. FST+BPE and FST+Morfessor models are trained just like BPE or Morfessor models to predict the next subword unit.
4.4 Models
Following Mielke et al. (2019), we trained Long Short-Term Memory (LSTM) models introduced by Merity et al. (2018) for each of the segmentation methods. Three LSTM models using character, BPE, and Morfessor segmentation were trained for all languages. For a select group of languages, we also trained models using FST+BPE and FST+Morfessor units. The neural architecture consisted of an initial embedding layer, multiple LSTM layers, and a linear decoder layer. For our particular experiments, we adopted the hyperparameters from Mielke et al. (2019) (see Merity et al., 2018, for their character PTB settings). The batch size used for character models was 128 with 500 epochs of training. All other models used a batch size of 40 and were trained for 200 epochs.
4.5 Metrics
Surprisal per verse  One major evaluation metric for language models is the negative log-likelihood on a test set. The negative log-likelihood, or surprisal, is the amount of information a language model needs to generate the next unit. Following Mielke et al. (2019), we define the surprisal at the verse level, where NLL(v_ij) = −log2 p(v_ij) for verse v_ij (the ith verse in language j). Since each verse is intended to express the same meaning across languages, differences in per-verse surprisal across languages primarily indicate differences in cross-linguistic language model quality (rather than differences in meaning content).

For each language j, we average the negative log-likelihood across the 4,225 verses in the test set, making L_j = (1/4225) Σ_{i=1}^{4225} NLL(v_ij).
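Computed from per-unit model probabilities, the metric looks like this (a sketch; the unit granularity depends on the segmentation method):

```python
import math

def surprisal_per_verse(verse_unit_probs):
    """Average negative log-likelihood per verse, L_j.
    `verse_unit_probs` is a list of verses, each a list of the model's
    probabilities for the verse's gold units, so that
    NLL(v) = -sum(log2 p(unit))."""
    nlls = [-sum(math.log2(p) for p in probs) for probs in verse_unit_probs]
    return sum(nlls) / len(nlls)
```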
Surprisal difference  Additionally, we quantify the difference between segmentation methods in language modeling performance as shown in Equation 1. This quantity compares the relative strength of one segmentation method to another.

∆_{S_j1, S_j2} = (L_j1 − L_j2) / ((1/2)(L_j1 + L_j2))    (1)

S_j1 and S_j2 are two segmentation methods to compare, and L_j1 and L_j2 represent the surprisal per verse for the language models based on the two segmentation methods. If ∆_{S_j1,S_j2} is positive, S_j1 resulted in a higher surprisal than S_j2, and S_j2 was more effective in modeling a given language.
5 Results
We now present results from our experiments. We report the strong association between several morphological features and surprisal per verse for BPE language models, compared to language models based on other segmentation methods. Then, we show the trade-offs between different segmentation methods and how they interact with morphological complexity. Our assumption is that, if a segmentation method reduces the impact of morphology, the surprisal values of language models based on that segmentation will have weaker correlations with measures of morphology.
5.1 Correlation studies with character and BPE models

We investigated correlations between surprisal per verse and various measures of morphology (i.e., WALS features, number of types, TTR, Moving-Average TTR, Mean Length of Word). Benjamini and Hochberg (1995)'s procedure was used to control the false discovery rate, so only p ≤ (8/15) · 0.05 (≈ 0.027) is considered significant.
WALS features  We tested for association between surprisal and each selected WALS feature with the Kruskal–Wallis test, or one-way ANOVA on ranks. This non-parametric test was chosen because the distribution of surprisal values did not meet the assumption of normality. A significant test result in this context means that there are significant differences in the median surprisal values between categories for a given feature. In order for the test to be effective, only feature values with a sample size ≥ 5 were tested.
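The test statistic itself can be sketched in a few lines (a simplified H without tie correction; an actual analysis would use a standard statistics package such as `scipy.stats.kruskal`):

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic: one-way ANOVA on the ranks of the
    pooled observations. `groups` is a list of lists of surprisal
    values, one list per category of a WALS feature."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
```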
For the character models, no features showed significant association with surprisal. However, for the BPE models, half of the morphological features had significant association with surprisal. These features were 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 24A "Locus of Marking in Possessive Noun Phrases," 25A "Locus of Marking: Whole-language Typology," 25B "Zero Marking of A and P Arguments," and 29A "Syncretism in Verbal Person/Number Marking."

Segmentation  ID   p-value  η2
BPE           21A  1.3e-05  0.28
              23A  6.7e-06  0.28
              24A  2.2e-04  0.228
              25A  6.5e-05  0.253
              25B  0.014    0.06
              29A  2.0e-04  0.198
Morfessor     21A  0.009    0.109
              23A  0.002    0.135
              26A  0.022    0.064
              29A  0.024    0.072

Table 4: p-values and effect sizes of WALS features that showed significant effect on surprisal per verse. Large effect sizes (≥ 0.14) are in bold.
For the features shown to have an effect on the BPE surprisal, we calculated the effect sizes and performed post-hoc comparisons to determine which categories were significantly different. In this context, effect size (η2) indicates the proportion of variance in surprisal per verse explained by each WALS feature, and η2 ≥ 0.14 is considered a large effect (Tomczak and Tomczak, 2014). The p-values and effect sizes are summarized in Table 4. The effect size was large for all of the significant features except for 25B.
For Feature 21A, the median surprisal value for languages with no case was significantly lower than the median value for other types. Similarly, for 23A, the median surprisal value for languages with no marking was significantly lower than the value for other types. In the cases of both 24A and 25A, languages with double marking had higher surprisal values than those with single or no marking. For 25B, languages with non-zero marking had slightly higher surprisal values than those with zero marking. Lastly, for 29A, languages without syncretism had higher surprisal values than those with syncretism or with no marking.

In general, less inflectional morphology was associated with lower surprisal while more inflectional morphology was associated with higher surprisal.
Segmentation  Measure  Spearman's ρ
Character     Types    0.19∗
              TTR      0.15
              MATTR    0.17∗
              MLW      0.06
BPE           Types    0.80∗∗∗
              TTR      0.76∗∗∗
              MATTR    0.68∗∗∗
              MLW      0.61∗∗∗
Morfessor     Types    0.50∗∗∗
              TTR      0.44∗∗∗
              MATTR    0.39∗∗∗
              MLW      0.30∗∗∗

Table 5: Correlation between surprisal per verse per segmentation method and morphological complexity measures. ∗p < 0.027, ∗∗∗p < 0.0005.
Corpus-based measures  A similar trend emerged for corpus-based measures of morphological complexity. The surprisal per verse of BPE models was highly correlated with type count, TTR, Moving-Average TTR (MATTR), and Mean Length of Word (MLW). Yet with character models, the strength of the correlation was weak and often insignificant. These results suggest that BPE segmentation was ineffective in reducing the impact of morphological complexity.

Table 5 summarizes the correlation coefficients and corresponding p-values. For the character-based models, only the number of types and MATTR showed a significant correlation in Spearman’s rank-order correlation, and those correlations were rather weak. In contrast, the BPE models presented strong correlations with all of the corpus-based measures at any reasonable alpha value (p < 10⁻¹⁶). The number of types showed the strongest correlation, followed by TTR, MATTR, and MLW in that order.
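These statistics are simple to compute from a tokenized corpus. A minimal sketch (the whitespace tokenization and the MATTR window size here are illustrative assumptions, not necessarily the paper’s settings):

```python
def ttr(tokens):
    """Type-token ratio: distinct word forms divided by total word forms."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=500):
    """Moving-Average TTR (Covington and McFall, 2010): the mean TTR over
    every contiguous window of fixed size, which removes plain TTR's
    sensitivity to text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

def mlw(tokens):
    """Mean length of word, in characters."""
    return sum(len(t) for t in tokens) / len(tokens)

tokens = "the dog chased the dogs".split()
print(len(set(tokens)), ttr(tokens), mlw(tokens))  # 4 types, TTR 0.8, MLW 3.8
```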
5.2 Comparison with Morfessor and Finite-State Transducer models
We trained language models using three additional segmentation methods: Morfessor, FST+BPE, and FST+Morfessor. Because Morfessor is an unsupervised method, we were able to use it to segment all languages, but we were able to generate FST segmentations for only a few languages. As such, we compare the character, BPE, and Morfessor models for all languages before looking into a subset of them where the FST methods were available.

Figure 1: Pairwise comparisons of surprisal per verse values for character, BPE, and Morfessor models. For the majority of the languages, Morfessor segmentation resulted in lower surprisal per verse than character or BPE segmentation.
Morfessor models  Morfessor segmentation performed better than both character and BPE segmentation for the majority of languages. Figure 1 shows the pairwise comparisons of the surprisal per verse values of a given language under different segmentation strategies. As shown in the plot on the left, the relative strength of BPE versus character segmentation is not clear. BPE segmentation produced slightly better results for 49 of the 92 languages, but character segmentation produced much lower surprisal values for the rest of the languages. In contrast, Morfessor clearly outperformed character and BPE for most of the languages, as shown in the middle and right plots. Only 12 out of the 92 languages had higher surprisal values with Morfessor segmentation than with character segmentation, while a total of 66 languages performed better with Morfessor segmentation than with BPE.
In addition, Morfessor models’ surprisal per verse showed weaker correlations with measures of morphology. Only four WALS features showed a significant association with the Morfessor models: 21A “Exponence of Selected Inflectional Formatives,” 23A “Locus of Marking in the Clause,” 26A “Prefixing vs. Suffixing in Inflectional Morphology,” and 29A “Syncretism in Verbal Person/Number Marking.” The effect sizes were also much smaller than those for the BPE models, as shown in Table 4.
Just as with the BPE models, the median surprisal for languages with no marking was much lower than the surprisal for other types for Features 21A, 23A, and 29A. For 26A, there was only a significant difference between weakly suffixing languages and strongly prefixing languages, with strongly prefixing languages having a lower median surprisal per verse.
As shown in Table 5, corpus-based statistics still showed significant correlations with the surprisal per verse of Morfessor models, but the correlations were moderate compared to those of the BPE models.
FST models  When available, an FST segmentation method resulted in the best performance. The graph in Figure 2 displays the surprisal of FST+BPE and FST+Morfessor models in comparison to the segmentation methods discussed above. For all seven languages, either FST+BPE or FST+Morfessor segmentation (or both) shows a clear decrease in surprisal per verse compared to the BPE and Morfessor segmentations.
5.3 Surprisal difference and morphological complexity
In order to examine the effect of morphological complexity on the relative strength of a given segmentation method, we conducted correlation studies with the difference between the surprisal per verse for pairs of segmentation methods (the ∆ values as defined in §4.5). We considered only the measures of morphological complexity that are continuous variables (i.e., number of types, TTR, Moving-Average TTR, and Mean Length of Word).
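A minimal sketch of this correlation study, assuming ∆ A,B is simply the per-language difference in surprisal per verse between methods A and B (all values below are hypothetical; ties get average ranks):

```python
def average_ranks(xs):
    """Ranks starting at 1, with tied values given the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-language values:
surprisal_bpe = [300.0, 420.0, 510.0]
surprisal_char = [310.0, 400.0, 450.0]
mattr_values = [0.35, 0.55, 0.70]
delta_bpe_char = [b - c for b, c in zip(surprisal_bpe, surprisal_char)]
print(spearman_rho(delta_bpe_char, mattr_values))  # near 1: the gap grows with MATTR
```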
Figure 2: Surprisal per verse per segmentation method, including FST segmentation methods. FST+BPE or FST+Morfessor models outperform all other models.

As shown in Table 6, all of the corpus-based statistics were highly correlated with the ∆ values. The correlations range from moderate to high using Spearman’s ρ (0.50 < ρ < 0.95). Even though the strength of the correlations varied slightly, number of types, TTR, MATTR, and MLW all showed a similar correlation with the difference statistics. They all had a positive correlation with ∆ BPE, char. This indicates that the more morphologically complex a language is, the better it is modeled with character segmentation compared to BPE segmentation. Similarly, there were positive correlations between the morphological measures and ∆ Morfessor, char, suggesting that character segmentation works better than Morfessor in modeling morphologically complex languages. ∆ BPE, Morfessor also had positive correlations with the complexity measures. This means that languages with higher morphological complexity tend to record lower surprisal values with Morfessor segmentation than with BPE. While BPE and Morfessor models outperformed character models on average, as shown in §5.2, the positive correlations with ∆ Morfessor, char and ∆ BPE, char suggest that character segmentation outperformed BPE and Morfessor segmentation for languages with very rich morphology.
These results are supported by Figure 3, where the surprisal per verse for different segmentation models is plotted against MATTR.7 For languages with lower MATTR, BPE and Morfessor perform better than character segmentation. However, for languages with higher MATTR, character and Morfessor models outperform BPE.

7The same trend was captured when we plotted against the other corpus-based measures.

Difference          Measure   Spearman’s ρ
∆ BPE, char         Types     0.95∗∗∗
                    TTR       0.92∗∗∗
                    MATTR     0.77∗∗∗
                    MLW       0.74∗∗∗
∆ Morfessor, char   Types     0.71∗∗∗
                    TTR       0.66∗∗∗
                    MATTR     0.50∗∗∗
                    MLW       0.53∗∗∗
∆ BPE, Morfessor    Types     0.86∗∗∗
                    TTR       0.86∗∗∗
                    MATTR     0.80∗∗∗
                    MLW       0.75∗∗∗

Table 6: Correlation between surprisal differences and morphological complexity measures for character, BPE, and Morfessor models. All p-values < 10⁻¹¹.
6 Discussion
Our results show that BPE models’ surprisal per verse is highly correlated with a language’s morphology, represented by several WALS features and corpus-based measures. Morfessor shows weaker correlations with such measures and records better performance for most of the languages. FST-based models outperform the others when available. In this section, we discuss the implications of these findings in the context of previous work and future research.
6.1 Morphology and surprisal
In accordance with the prior work discussed in §2, we found differences in modeling difficulty between languages. The correlation studies in §5 provide evidence that morphology is a substantial contributing factor to these differences. Six WALS (Dryer and Haspelmath, 2013) morphology features showed an association with the surprisal per verse of BPE language models. Corpus-based statistics like number of types and MATTR showed strong correlations with BPE surprisal, supporting the relationship between modeling difficulty and morphological complexity.
Our conclusion that a language’s morphology impacts language modeling difficulty agrees with Cotterell et al. (2018) and Gerz et al. (2018), but is at odds with Mielke et al. (2019). We included languages known for their rich morphology, such as Western Canadian Inuktitut (ikt) and Central Alaskan Yup’ik (esu), which may have increased the variation in morphological complexity in the corpus. We also augmented the WALS data by consulting reference grammars, so we were able to consider 11 more morphological WALS features than Mielke et al. (2019). We found that the morphological feature Mielke et al. (2019) considered, 26A “Prefixing vs. Suffixing in Inflectional Morphology,” indeed showed no correlation with BPE surprisal. However, our results show that there are aspects of morphology that affect surprisal that were not considered before.

Figure 3: Surprisal per verse plotted against Moving-Average TTR for character, BPE, and Morfessor segmentation methods. Lines indicate the regression estimate with 95% confidence intervals.
Previous work, such as Gerz et al. (2018), focused only on aspects of morphology that they believed a priori would predict language model performance. In contrast, our study tested all of the morphological features listed in WALS and also tested each of them individually. We found that two of the four features in Gerz et al. (2018), 20A “Fusion of Selected Inflectional Formatives” and 22A “Inflectional Synthesis of the Verb,” showed no association with language model performance. Additionally, we found several features that affected language modeling performance, specifically locus of marking and syncretism, which were not mentioned in the literature. These results show that the features tied to morphological complexity in previous work are not necessarily the same features that affect language modeling.
In addition to differences in results, our interpretation of corpus-based statistics like TTR also diverges from previous work. While Mielke et al. (2019) reported high correlations between language model performance and such statistics, they considered them only as simple statistics of the data. In fact, our results replicate Mielke et al. (2019) in that the number of types was the most predictive of BPE language model surprisal among all the variables considered. However, we argue that corpus-based statistics can be used as an approximate measure of morphological complexity based on previous studies. These corpus-based measures of morphology are reported to capture the overall ranking of morphological complexity (Kettunen, 2014; Bentz et al., 2016) and can be interpreted in relation to morphological typology (Gerz et al., 2018). We also believe our results indicate that TTR and the WALS features capture similar information. For example, the positive correlation of ∆ BPE, Morfessor with corpus-based measures corresponds to the smaller effect sizes of WALS features found for Morfessor compared to BPE. This indicates a lesser effect of rich morphology on Morfessor models compared to BPE.
6.2 Segmentation methods
While the primary goal of this work is to analyze the relation of a language’s morphology to language modeling performance, we found this to be entangled with the level and method of segmentation. Our results show that there is significant variation in the effectiveness of segmentation methods cross-linguistically, and suggest challenges to the status quo methods of subword segmentation in particular. While the subword segmentation methods we used generally outperformed character-level segmentation, the higher the TTR, the smaller the difference in surprisal for both BPE and Morfessor, suggesting that these methods are less effective at segmenting languages with highly complex morphology. Of the pre-existing methods, we found Morfessor to have the lowest surprisal per verse for most of the languages considered. Morfessor’s weaker correlations with WALS features and other measures like TTR suggest that its better performance may be due to a better ability to model languages with a wider range of morphological attributes. This is in line with Bostrom and Durrett (2020), who showed that Unigram LM (Kudo, 2018), a segmentation algorithm similar to Morfessor, often outperforms BPE and produces more morph-like segmentations in the context of language model pretraining in English and Japanese.
However, Morfessor was significantly outperformed by character segmentation for a small subset of languages.8 Many of these languages have been classified as polysynthetic, suggesting that perhaps Morfessor is ill-suited for such languages (see Klavans, 2018; Tyers and Mishchenkova, 2020; Mager et al., 2018, for discussions of the challenges polysynthetic languages pose for NLP tasks).
Additionally, for a typologically diverse subset of languages for which we could obtain FST morphological segmenters, we considered novel segmentation methods: FST+BPE and FST+Morfessor. We found that this simple extension of BPE and Morfessor with morphological information achieved the lowest surprisal per verse in all available languages. The overall success of combining statistical segmentations with FSTs further confirms the impact of morphology on language modeling and yields significant promise for the use of segmentation based on linguistic morphological information.
7 Conclusion
A language’s morphology is strongly associated with language modeling surprisal for BPE-segmented language models. BPE model surprisal is associated with 6 out of the 12 studied WALS morphology features, indicating that there are aspects of some languages’ morphology that BPE does not help mitigate. Strong correlations with corpus-based measures of morphology such as TTR further suggest that the more types available in a language (often by means of rich morphology), the harder it is to model based on BPE units. Morfessor, which was designed with morpheme induction in mind, performs better for most languages and shows less association with morphological features. When available, the linguistically-informed method of FST-augmented BPE or Morfessor segmentation performs best, indicating further promise for using linguistic knowledge to combat the effects of morphology on language model surprisal.

8amh, arz, ayr, cmn, esu, heb, ike, ikt, kal, quh, tel, xho. BPE outperformed Morfessor for cmn and heb.
These conclusions were only possible through manual augmentation of typological databases and expansion of the set of studied languages. Future efforts could adopt our approach for other areas of language. Using linguistically-informed resources across many languages is an avenue for improving neural models in NLP in both design and analysis.
Acknowledgments
This paper builds on our prior work for the 2019 Sixth Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology (JSALT 2019) (Schwartz et al., 2020). We thank the organizers of the workshop and the members of our workshop team on Neural Polysynthetic Language Modeling for inspiring us to pursue this research direction. Our special thanks to Rebecca Knowles, Christo Kirov, Lori Levin, Chi-kiu (Jackie) Lo, and the TACL reviewers and editors for their feedback on our manuscript. We thank Ata Tuncer for his assistance with Turkish segmentation. This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.
References
Antti Arppe, Atticus Harrigan, Katherine Schmirler, Lene Antonsen, Trond Trosterud, Sjur Nørstebø Moshagen, Miikka Silfverberg, Arok Wolvengrey, Conor Snoek, Jordan Lachler, Eddie Antonio Santos, Jean Okimāsis, and Dorothy Thunder. 2014–2019. Finite-state transducer-based computational model of Plains Cree morphology. https://giellalt.uit.no/lang/crk/PlainsCreeDocumentation.html
Eric Axelson, Sam Hardwick, Krister Lindén, Kimmo Koskenniemi, Flammie Pirinen, Miikka Silfverberg, and Senka Drobac. 2015. Helsinki finite-state technology resources.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan. The COLING 2016 Organizing Committee.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, cs.CL/2004.03720v1.

Çağrı Çöltekin. 2010. A freely available morphological analyzer for Turkish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Association (ELRA).

Çağrı Çöltekin. 2014. A set of open source tools for Turkish natural language processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Christos Christodoulopoulos and Mark Steedman. 2014. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49:1–21.
Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

Mathieu Dehouck and Pascal Denis. 2018. A framework for understanding the role of morphology in universal dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2864–2870, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3):223–245.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Judith L. Klavans. 2018. Computational challenges for polysynthetic languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, pages 119–129. Springer Berlin Heidelberg, Berlin, Heidelberg.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, cs.CL/1907.11692v1.

Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Vladimir Meza Ruiz, and Katharina Kann. 2018. Lost in translation: Analysis of information loss during machine translation between polysynthetic and fusional languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 73–83, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, cs.CL/1803.08240v1.

Sabrina J. Mielke. 2016. Language diversity in ACL 2004–2016.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Sabrina J. Mielke and Jason Eisner. 2019. Spell once, summon anywhere: A two-level open-vocabulary language model. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6843–6850.

Tommi A. Pirinen. 2015. Omorfi — free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris, France. Surrey Morphology Group.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), pages 1–263, Lisbon, Portugal. European Language Resources Association (ELRA).
Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud’hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. CoRR, cs.CL/2005.05477v2.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Department of Informatics, Kyushu University.

Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21):19–25.

Francis Tyers and Karina Mishchenkova. 2020. Dependency annotation of noun incorporation in polysynthetic languages. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 195–204, Barcelona, Spain (Online). Association for Computational Linguistics.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027, Vancouver, Canada. Association for Computational Linguistics.

Hugo David Calderon Vilca, Flor Cagniy Cárdenas Mariñó, and Edwin Fredy Mamani Calderon. 2012. Analizador morfológico de la lengua quechua basado en software libre Helsinki finite-state transducer (HFST).

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.
A Data
We began with the data used in Mielke et al. (2019). This was originally a subset of a Bible corpus (Mayer and Cysouw, 2014), which is no longer publicly available. We excluded constructed languages (epo, tlh) from the data, keeping a total of 104 verse-aligned Bibles in 60 languages9 in 12 language families. To increase the number of languages and language families represented, we added 41 Bibles in 32 languages to the data. 13 Bible translations in 13 languages10 were sourced from Christodoulopoulos and Steedman (2014). In addition, we included 28 Bible translations in 21 languages scraped from various online sources. Two of the scraped Bibles were in Spanish (spa) and Telugu (tel), languages which were already included in the Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2014). These translations were included because the new Spanish Bible was a parallel source of the Paraguayan Guaraní (gug) translation, and the Telugu Bible obtained from Mielke et al. (2019) was originally mislabeled as Tecpatlán Totonac (tcw). The Central Alaskan Yup’ik (esu) Bible was from https://bibles.org. 26 Bibles in 19 languages11 were from http://bible.com. The Greenlandic (kal) Bible was obtained from http://old.bibelselskabet.dk.
9afr, aln, arb, arz, ayr, bba, ben, bqc, bul, cac, cak, ceb, ces, cmn, cnh, cym, dan, deu, ell, eng, fin, fra, guj, gur, hat, hrv, hun, ind, ita, kek, kjb, lat, lit, mah, mam, mri, mya, nld, nor, plt, poh, por, qub, quh, quy, quz, ron, rus, som, tbz, tel, tgl, tpi, tpm, ukr, vie, wal, wbm, xho, zom

10als, amh, dje, heb, isl, jpn, kor, pck, slk, slv, spa, swe, tha

11crk, gug, gui, hin, ike, ikt, kan, mal, mar, nch, nep, nhe, pes, pol, sna, spa, tel, tob, tur