Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 273–291, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics

Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation

Benjamin Heinzerling†* and Michael Strube‡
†RIKEN AIP & Tohoku University
‡Heidelberg Institute for Theoretical Studies gGmbH
[email protected] | [email protected]

Abstract

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic evaluations makes it difficult for practitioners to choose between them. In this work, we conduct an extensive evaluation comparing non-contextual subword embeddings, namely FastText and BPEmb, and a contextual representation method, namely BERT, on multilingual named entity recognition and part-of-speech tagging. We find that overall, a combination of BERT, BPEmb, and character representations works well across languages and tasks. A more detailed analysis reveals different strengths and weaknesses: Multilingual BERT performs well in medium- to high-resource languages, but is outperformed by non-contextual subword embeddings in a low-resource setting.

1 Introduction

Rare and unknown words pose a difficult challenge for embedding methods that rely on seeing a word frequently during training (Bullinaria and Levy, 2007; Luong et al., 2013). Subword segmentation methods avoid this problem by assuming a word's meaning can be inferred from the meaning of its parts. Linguistically motivated subword approaches first split words into morphemes and then represent word meaning by composing morpheme embeddings (Luong et al., 2013). More recently, character-ngram approaches (Luong and Manning, 2016; Bojanowski et al., 2017) and Byte Pair Encoding (BPE) (Sennrich et al., 2016) have grown in popularity, likely due to their computational simplicity and language-agnosticity.1

* Work done while at HITS.
1 While language-agnostic, these approaches are not language-independent. See Appendix B for a discussion.

Figure 1: A high-performing ensemble of subword representations encodes the input using multilingual BERT (yellow, bottom left), an LSTM with BPEmb (pink, bottom middle), and a character-RNN (blue, bottom right). A meta-LSTM (green, center) combines the different encodings before classification (top). Horizontal arrows symbolize bidirectional LSTMs.

Sequence tagging with subwords. Subword information has long been recognized as an important feature in sequence tagging tasks such as named entity recognition (NER) and part-of-speech (POS) tagging. For example, the suffix -ly often indicates adverbs in English POS tagging, and English NER may exploit that professions often end in suffixes like -ist (journalist, cyclist) or companies in suffixes like -tech or -soft. In early systems, these observations were operationalized with manually compiled lists of such word endings or with character-ngram features (Nadeau and Sekine, 2007). Since the advent of neural sequence tagging (Graves, 2012;
Method          Subword segmentation and token transformation
Original text   Magnus Carlsen played against Viswanathan Anand
Characters      M a g n u s   C a r l s e n   p l a y e d   a g a i n s t   V i s w a n a t h a n   A n a n d
Word shape      Aa Aa a a Aa Aa
FastText        magnus+mag+... carlsen+car+arl+... played+... against+... vis+isw+...+nathan ana+...
BPE vs1000      m ag n us car l s en play ed against v is w an ath an an and
BPE vs3000      mag n us car ls en played against vis w an ath an an and
BPE vs5000      magn us car ls en played against vis wan ath an an and
BPE vs10000     magn us car ls en played against vis wan athan an and
BPE vs25000     magnus car ls en played against vis wan athan an and
BPE vs50000     magnus carls en played against vis wan athan anand
BPE vs100000    magnus carlsen played against viswan athan anand
BERT            Magnus Carl ##sen played against V ##is ##wana ##than Anand

Table 1: Overview of the subword segmentations and token transformations evaluated in this work.
Huang et al., 2015), the predominant way of incorporating character-level subword information is learning embeddings for each character in a word, which are then composed into a fixed-size representation using a character-CNN (Chiu and Nichols, 2016) or character-RNN (char-RNN) (Lample et al., 2016). Moving beyond single characters, pretrained subword representations such as FastText, BPEmb, and those provided by BERT (see §2) have become available.
While there now exist several pretrained subword representations in many languages, a practitioner faced with these options has a simple question: Which subword embeddings should I use? In this work, we answer this question for multilingual named entity recognition and part-of-speech tagging and make the following contributions:

• We present a large-scale evaluation of multilingual subword representations on two sequence tagging tasks;
• We find that subword vocabulary size matters and give recommendations for choosing it;
• We find that different methods have different strengths: Monolingual BPEmb works best in medium- and high-resource settings, multilingual non-contextual subword embeddings are best in low-resource languages, while multilingual BERT gives good or best results across languages.
2 Subword Embeddings
We now introduce the three kinds of multilingual subword embeddings compared in our evaluation: FastText and BPEmb are collections of pretrained, monolingual, non-contextual subword embeddings available in many languages, while BERT provides contextual subword embeddings for many languages in a single pretrained language model with a vocabulary shared among all languages. Table 1 shows examples of the subword segmentations these methods produce.
2.1 FastText: Character-ngram Embeddings
FastText (Bojanowski et al., 2017) represents a word w as the sum of the learned embeddings z⃗_g of its constituting character-ngrams g and, in case of in-vocabulary words, an embedding z⃗_w of the word itself:

w⃗ = z⃗_w + Σ_{g∈G_w} z⃗_g

where G_w is the set of all constituting character n-grams for 3 ≤ n ≤ 6. Bojanowski et al. provide embeddings trained on Wikipedia editions in 294 languages.2
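As an illustration, this composition can be sketched with toy in-memory lookup tables (a simplified sketch, not the fastText library itself; the real implementation hashes n-grams into a fixed-size bucket table, which is omitted here):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with the boundary markers used by
    Bojanowski et al. (2017): 'play' -> '<play>' -> <pl, pla, lay, ..."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def fasttext_vector(word, ngram_emb, word_emb, dim=100):
    """Word vector = sum of n-gram embeddings, plus the word's own
    embedding if it is in-vocabulary."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_emb:
            vec += ngram_emb[g]
    if word in word_emb:  # in-vocabulary case: add the word embedding
        vec += word_emb[word]
    return vec
```

An out-of-vocabulary word thus still receives a vector as long as some of its character n-grams were seen during pretraining.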
2.2 BPEmb: Byte-Pair Embeddings
Byte Pair Encoding (BPE) is an unsupervised segmentation method which operates by iteratively merging frequent pairs of adjacent symbols into new symbols. E.g., when applied to English text, BPE merges the characters h and e into the new byte-pair symbol he, then the pair consisting of the character t and the byte-pair symbol he into the new symbol the, and so on. These merge operations are learned from a large background corpus. The set of byte-pair symbols learned in this fashion is called the BPE vocabulary.
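The merge-learning procedure can be sketched as follows (a toy re-implementation of the algorithm described above, not the reference subword-nmt or SentencePiece code; `corpus_tokens` is assumed to be a whitespace-tokenized corpus):

```python
from collections import Counter

def learn_bpe(corpus_tokens, num_merges):
    """Learn BPE merge operations: start from characters, then repeatedly
    merge the most frequent pair of adjacent symbols into a new symbol."""
    vocab = Counter(tuple(word) for word in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

On a tiny corpus containing "the", "then", and "he", the first two merges reproduce the example from the text: h + e → he, then t + he → the.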
Applying BPE, i.e. iteratively performing learned merge operations, segments a text into subwords (see BPE segmentations for vocabulary sizes vs1000 to vs100000 in Table 1). By employing an embedding algorithm, e.g. GloVe (Pennington et al., 2014), to train embeddings on such a subword-segmented text, one obtains embeddings for all byte-pair symbols in the BPE vocabulary. In this work, we evaluate BPEmb (Heinzerling and Strube, 2018), a collection of byte-pair embeddings trained on Wikipedia editions in 275 languages.3
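Segmentation then replays the learned merges in order on each word (again a minimal sketch; real implementations also add word-boundary markers such as the `_` prefix visible in Figure 1):

```python
def apply_bpe(word, merges):
    """Segment one word by replaying learned merge operations in order."""
    symbols = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

With the merges (h, e) and (t, he) learned above, "then" segments into the subwords the and n; a word sharing no learned symbols stays split into single characters.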
2.3 BERT: Contextual Subword Embeddings
One of the drawbacks of the subword embeddings introduced above, and of pretrained word embeddings in general, is their lack of context. For example, with a non-contextual representation, the embedding of the word play will be the same both in the phrase a play by Shakespeare and the phrase to play Chess, even though play in the first phrase is a noun with a distinctly different meaning than the verb play in the second phrase. Contextual word representations (Dai and Le, 2015; Melamud et al., 2016; Ramachandran et al., 2017; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018) overcome this shortcoming via pretrained language models.

Instead of representing a word or subword by a lookup of a learned embedding, which is the same regardless of context, a contextual representation is obtained by encoding the word in context using a neural language model (Bengio et al., 2003). Neural language models typically employ a sequence encoder such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017). In such a model, each word or subword in the input sequence is encoded into a vector representation. With a bidirectional LSTM, this representation is influenced by its left and right context through state updates when encoding the sequence from left to right and from right to left. With a Transformer, context influences a word's or subword's representation via an attention mechanism (Bahdanau et al., 2015).
In this work we evaluate BERT (Devlin et al., 2019), a Transformer-based pretrained language model operating on subwords similar to BPE (see last row in Table 1). We choose BERT among the pretrained language models mentioned above since it is the only one for which a multilingual version is publicly available. Multilingual BERT4 has been trained on the 104 largest Wikipedia editions, so that, in contrast to FastText and BPEmb, many low-resource languages are not supported.
Table 2: Number of languages supported by the three subword embedding methods compared in our evaluation, as well as the NER baseline system (Pan17).
3 Multilingual Evaluation
We compare the three different pretrained subword representations introduced in §2 on two tasks: NER and POS tagging. Our multilingual evaluation is split in four parts. After devising a sequence tagging architecture (§3.1), we investigate an important hyper-parameter in BPE-based subword segmentation: the BPE vocabulary size (§3.2). Then, we conduct NER experiments on two sets of languages (see Table 2): 265 languages supported by FastText and BPEmb (§3.3) and the 101 languages supported by all methods including BERT (§3.4). Our experiments conclude with POS tagging on 27 languages (§3.5).

Data. For NER, we use WikiAnn (Pan et al., 2017), a dataset containing named entity mention and three-class entity type annotations in 282 languages. WikiAnn was automatically generated by extracting and classifying entity mentions from inter-article links on Wikipedia. Because of this, WikiAnn suffers from problems such as skewed entity type distributions in languages with small Wikipedias (see Figure 6 in Appendix A), as well as wrong entity types due to automatic type classification. These issues notwithstanding, WikiAnn is the only available NER dataset that covers almost all languages supported by the subword representations compared in this work. For POS tagging, we follow Plank et al. (2016); Yasunaga et al. (2018) and use annotations from the Universal Dependencies project (Nivre et al., 2016). These annotations take the form of language-universal POS tags (Petrov et al., 2012), such as noun, verb, adjective, determiner, and numeral.
3.1 Sequence Tagging Architecture

Our sequence tagging architecture is depicted in Figure 1. The architecture is modular and allows encoding text using one or more subword embedding methods. The model receives a sequence of tokens as input, here Magnus Carlsen played. After subword segmentation and an embedding lookup, subword embeddings are encoded with an encoder specific to the respective subword method. For BERT, this is a pretrained Transformer, which is finetuned during training. For all other methods we train bidirectional LSTMs. Depending on the particular subword method, input tokens are segmented into different subwords. Here, BERT splits Carlsen into two subwords, resulting in two encoder states for this token, while BPEmb with an LSTM encoder splits this word into three. FastText (not depicted) and character RNNs yield one encoder state per token. To match subword representations with the tokenization of the gold data, we arbitrarily select the encoder state corresponding to the first subword in each token. A meta-LSTM combines the token representations produced by each encoder before classification.5
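The first-subword selection step can be sketched as follows (a simplified illustration with NumPy arrays in place of the PyTorch tensors of the actual implementation; the example offsets are hypothetical):

```python
import numpy as np

def select_first_subword_states(encoder_states, first_subword_idx):
    """Pick one encoder state per token, namely the state at the position
    of the token's first subword, so that the output aligns with the
    gold tokenization. encoder_states: (num_subwords, hidden)."""
    return encoder_states[np.asarray(first_subword_idx)]

# "Magnus Carl ##sen played": 4 subword states for 3 tokens;
# the tokens start at subword positions 0, 1, and 3
states = np.arange(4 * 6, dtype=float).reshape(4, 6)
token_states = select_first_subword_states(states, [0, 1, 3])
```

The result has one row per token and can be fed to the meta-LSTM alongside the per-token outputs of the other encoders.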
Decoding the sequence of a neural model's pre-classification states with a conditional random field (CRF) (Lafferty et al., 2001) has been shown to improve NER performance by 0.7 to 1.8 F1 points (Ma and Hovy, 2016; Reimers and Gurevych, 2017) on a benchmark dataset. In our preliminary experiments on WikiAnn, CRFs considerably increased training time but did not show consistent improvements across languages.6 Since our study involves a large number of experiments comparing several subword representations with cross-validation in over 250 languages, we omit the CRF in order to reduce model training time.

Implementation details. Our sequence tagging architecture is implemented in PyTorch (Paszke et al., 2017). All model hyper-parameters for a given subword representation are tuned in preliminary experiments on development sets and then kept the same for all languages (see Appendix D). For many low-resource languages, WikiAnn provides only a few hundred instances with skewed entity type distributions. In order to mitigate the impact of variance from random train-dev-test splits in such cases, we report averages of n-fold cross-validation runs, with n=10 for low-resource, n=5 for medium-resource, and n=3 for high-resource languages.7 For experiments in-
5 In preliminary experiments (results not shown), we found that performing classification directly on the concatenated token representation without such an additional LSTM on top does not work well.
6 The system we compare to as baseline (Pan et al., 2017) includes a CRF but did not report an ablation without it.
7 Due to high computational resource requirements, we set n=1 for finetuning experiments with BERT.
Figure 2: The best BPE vocabulary size varies with dataset size. For each of the different vocabulary sizes, the box plot shows means and quartiles of the dataset sizes for which this vocabulary size is optimal, according to the NER F1 score on the respective development set in WikiAnn. E.g., the bottom, pink box records the sizes of the datasets (languages) for which BPE vocabulary size 1000 was best, and the top, blue box the dataset sizes for which vocabulary size 100k was best.
volving FastText, we precompute a 300d embedding for each word and update embeddings during training. We use BERT in a finetuning setting, that is, we start training with a pretrained model and then update that model's weights by backpropagating through all of BERT's layers. Finetuning is computationally more expensive, but gives better results than feature extraction, i.e. using one or more of BERT's layers for classification without finetuning (Devlin et al., 2019). For BPEmb, we use 100d embeddings and choose the best BPE vocabulary size as described in the next subsection.
3.2 Tuning BPE
In subword segmentation with BPE, performing only a small number of byte-pair merge operations results in a small vocabulary. This leads to oversegmentation, i.e., words are split into many short subwords (see BPE vs1000 in Table 1). With more merge operations, both the vocabulary size and the average subword length increase. As the byte-pair vocabulary grows larger, it adds symbols corresponding to frequent words, resulting in such words not being split into subwords. Note, for example, that the common English preposition against is not split even with the smallest vocabulary size, or that played is split into the stem play and suffix ed with a vocabulary of size 1000, but is not split with larger vocabulary sizes.
The choice of vocabulary size involves a trade-off. On the one hand, a small vocabulary re-
Table 3: NER results on WikiAnn. The first row shows macro-averaged F1 scores (%) for all 265 languages in the Intersect. 1 setting. Rows two to four break down scores for 188 low-resource languages (<10k instances), 48 medium-resource languages (10k to 100k instances), and 29 high-resource languages (>100k instances).
quires less data for pre-training subword embeddings since there are fewer subwords for which embeddings need to be learned. Furthermore, a smaller vocabulary size is more convenient for model training since training time increases with vocabulary size (Morin and Bengio, 2005) and hence a model with a smaller vocabulary trains faster. On the other hand, a small vocabulary results in less meaningful subwords and longer input sequence lengths due to oversegmentation.

Conversely, a larger BPE vocabulary tends to yield longer, more meaningful subwords so that subword composition becomes easier – or in case of frequent words even unnecessary – in downstream applications, but a larger vocabulary also requires a larger text corpus for pre-training good embeddings for all symbols in the vocabulary. Furthermore, a larger vocabulary size requires more annotated data for training larger neural models and increases training time.
Since the optimal BPE vocabulary size for a given dataset and a given language is not a priori clear, we determine this hyper-parameter empirically. To do so, we train NER models with varying BPE vocabulary sizes8 for each language and record the best vocabulary size on the language's development set as a function of dataset size (Figure 2). This data shows that larger vocabulary sizes are better for high-resource languages with more training data, and smaller vocabulary sizes are better for low-resource languages with smaller datasets. In all experiments involving byte-pair embeddings, we choose the BPE vocabulary size for the given language according to this data.9
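As a purely illustrative stand-in for the selection procedure (the actual procedure is in Appendix C, which is not reproduced here), one could map dataset size to vocabulary size with a lookup table; the bucket boundaries below are our own guesses following the rule of thumb above, not values from the paper:

```python
# Assumed, illustrative bucket boundaries: larger datasets get larger vocabularies.
VOCAB_BY_MAX_DATASET_SIZE = [
    (3_000, 1_000),       # up to ~3k instances  -> BPE vocab size 1000
    (30_000, 10_000),     # up to ~30k instances -> BPE vocab size 10000
    (100_000, 50_000),    # up to ~100k          -> BPE vocab size 50000
    (float("inf"), 100_000),
]

def pick_bpe_vocab_size(num_train_instances):
    """Return the BPE vocabulary size for a dataset of the given size."""
    for max_size, vocab_size in VOCAB_BY_MAX_DATASET_SIZE:
        if num_train_instances <= max_size:
            return vocab_size
```

Any real use would calibrate the buckets against development-set scores, as Figure 2 does.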
3.3 NER with FastText and BPEmb

In this section, we evaluate FastText and BPEmb on NER in 265 languages. As baseline, we com-
9 The procedure for selecting BPE vocabulary size is given in Appendix C.
Figure 3: Impact of word shape embeddings on NER performance in a given language as function of the capitalization ratio in a random Wikipedia sample.
pare to Pan et al. (2017)'s system, which combines morphological features mined from Wikipedia markup with cross-lingual knowledge transfer via Wikipedia language links (Pan17 in Table 3). Averaged over all languages, FastText performs 4.1 F1 points worse than this baseline. BPEmb is on par overall, with higher scores for medium- and high-resource languages, but a worse F1 score on low-resource languages. BPEmb combined with character embeddings (+char) yields the overall highest scores for medium- and high-resource languages among monolingual methods.

Word shape. When training word embeddings, lowercasing is a common preprocessing step (Pennington et al., 2014) that on the one hand reduces vocabulary size, but on the other loses information in writing systems with a distinction between upper and lower case letters. As a more expressive alternative to restoring case information via a binary feature indicating capitalized or lowercased words (Curran and Clark, 2003), word shapes (Collins, 2002; Finkel et al., 2005) map
Figure 4: The distribution of byte-pair symbol lengths varies with BPE vocabulary size.
BPE vocabulary size   100k   320k   1000k
Dev. F1               87.1   88.7   89.3

Table 4: Average WikiAnn NER F1 scores on the development sets of 265 languages with shared vocabularies of different size.
characters to their type and collapse repeats. For example, Magnus is mapped to the word shape Aa and G.M. to A.A. Adding such shape embeddings to the model (+shape in Table 3) yields similar improvements as character embeddings.
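The shape mapping can be sketched as follows (a minimal illustration of the character-class-plus-collapse idea, not the authors' code):

```python
import re

def word_shape(token):
    """Map characters to their type and collapse repeated types:
    'Magnus' -> 'Aa', 'G.M.' -> 'A.A.', 'played' -> 'a'."""
    def char_class(c):
        if c.isupper():
            return "A"
        if c.islower():
            return "a"
        if c.isdigit():
            return "0"
        return c  # punctuation and other symbols map to themselves
    shape = "".join(char_class(c) for c in token)
    # collapse runs of the same class character into a single occurrence
    return re.sub(r"(.)\1+", r"\1", shape)
```

The outputs match the Word shape row of Table 1, where Magnus Carlsen played against Viswanathan Anand becomes Aa Aa a a Aa Aa.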
Since capitalization is not important in all languages, we heuristically decide whether shape embeddings should be added for a given language or not. We define the capitalization ratio of a language as the ratio of upper case characters among all characters in a written sample. As Figure 3 shows, capitalization ratios vary between languages, with shape embeddings tending to be more beneficial in languages with higher ratios. By thresholding on the capitalization ratio, we only add shape embeddings for languages with a high ratio (+someshape). This leads to an overall higher average F1 score of 85.3 among monolingual models, due to improved performance (81.9 vs. 81.5) on low-resource languages.

One NER model for 265 languages. The reduction in vocabulary size achieved by BPE is a crucial advantage in neural machine translation (Johnson et al., 2017) and other tasks which involve the costly operation of taking a softmax over the entire output vocabulary (see Morin and Bengio, 2005; Li et al., 2019). BPE vocabulary sizes between 8k and 64k are common in neural machine translation. Multilingual BERT operates on a subword vocabulary of size 100k which is shared among 104 languages. Even with shared symbols among languages, this allots at best only a few thousand byte-pair symbols to each language. Given that sequence tagging does not involve taking a softmax over the vocabulary, much larger vocabulary sizes are feasible, and as §3.2 shows, a larger BPE vocabulary is better when enough training data is available. To study the effect of a large BPE vocabulary size in a multilingual setting, we train BPE models and byte-pair embeddings with subword vocabularies of up to 1000k BPE symbols, which are shared among all languages in our evaluation.10
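The capitalization-ratio heuristic described above can be sketched as follows (the exact normalization and the threshold value are not specified in this excerpt; both are assumptions of this sketch):

```python
def capitalization_ratio(sample):
    """Ratio of upper-case characters among all characters in a written
    sample. Whether whitespace is excluded is our assumption here."""
    chars = [c for c in sample if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isupper() for c in chars) / len(chars)

def use_shape_embeddings(sample, threshold=0.03):
    # threshold is illustrative; the paper thresholds on the ratio but
    # does not state the cutoff in this excerpt
    return capitalization_ratio(sample) >= threshold
```

Languages whose scripts have no case distinction yield a ratio of zero and therefore never receive shape embeddings under this heuristic.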
The shared BPE vocabulary and corresponding byte-pair embeddings allow training a single NER model for all 265 languages. To do so, we first encode WikiAnn in all languages using the shared BPE vocabulary and then train a single multilingual NER model in the same fashion as a monolingual model. As the vocabulary size has a large effect on the distribution of BPE symbol lengths (Figure 4, also see §3.2) and model quality, we determine this hyper-parameter empirically (Table 4). To reduce the disparity between dataset sizes of different languages, and to keep training time short, we limit training data to a maximum of 3000 instances per language.11 Results for this multilingual model (MultiBPEmb) with shared character embeddings (+char) and without further finetuning (-finetune) show a strong improvement in low-resource languages (89.7 vs. 81.9 with +someshape), while performance degrades drastically on high-resource languages. Since the 188 low-resource languages in WikiAnn are typologically and genealogically diverse, the improvement suggests that low-resource languages not only profit from cross-lingual transfer from similar languages (Cotterell and Heigold, 2017), but that multilingual training brings other benefits, as well. In multilingual training, certain aspects of the task at hand, such as tag distribution and BIO constraints, have to be learned only once, while they have to be separately learned on each language in monolingual training. Furthermore, multilingual training may prevent overfitting to biases in small monolingual datasets, such as a skewed tag distri-
10 Specifically, we extract up to 500k randomly selected paragraphs from articles in each Wikipedia edition, yielding 16GB of text in 265 languages. Then, we train BPE models with vocabulary sizes 100k, 320k, and 1000k using SentencePiece (Kudo and Richardson, 2018), and finally train 300d subword embeddings using GloVe.
11 With this limit, training takes about a week on one NVIDIA P40 GPU.
Figure 5: Shared multilingual byte-pair embedding space pretrained (left) and after NER model training (right), 2-d UMAP projection (McInnes et al., 2018). As there is no 1-to-1 correspondence between BPE symbols and languages in a shared multilingual vocabulary, it is not possible to color BPE symbols by language. Instead, we color symbols by Unicode code point. This yields a coloring in which, for example, BPE symbols consisting of characters from the Latin alphabet are green (large cluster in the center), symbols in Cyrillic script blue (large cluster at 11 o'clock), and symbols in Arabic script purple (cluster at 5 o'clock). Best viewed in color.
Table 5: NER F1 scores for the 101 WikiAnn languages supported by all evaluated methods.
butions. A visualization of the multilingual subword embedding space (Figure 5) gives evidence for this view. Before training, distinct clusters of subword embeddings from the same language are visible. After training, some of these clusters are more spread out and show more overlap, which indicates that some embeddings from different languages appear to have moved "closer together", as one would expect embeddings of semantically-related words to do. However, the overall structure of the embedding space remains largely unchanged. The model maintains language-specific subspaces and does not appear to create an interlingual semantic space which could facilitate cross-lingual transfer.
Having trained a multilingual model on all languages, we can further train this model on a single language (Table 3, +finetune). This finetuning further improves performance, giving the best overall score (91.4) and an 8.8 point improvement over Pan et al. on low-resource languages (90.4 vs. 81.6). These results show that multilingual training followed by monolingual finetuning is an effective method for low-resource sequence tagging.
3.4 NER with Multilingual BERT
Table 5 shows NER results on the intersection of languages supported by all methods in our evaluation. As in §3.3, FastText performs worst overall, monolingual BPEmb with character embeddings performs best on high-resource languages (93.6 F1), and multilingual BPEmb best on low-resource languages (91.1). Multilingual BERT outperforms the Pan17 baseline and shows strong results in comparison to monolingual BPEmb. The combination of multilingual BERT, monolingual BPEmb, and character embeddings is best overall (92.0) among models trained only on monolingual NER data. However, this ensemble of contextual and non-contextual subword embeddings is inferior to MultiBPEmb (93.2), which was first trained on multilingual data from all languages collectively, and then separately finetuned to each language. Score distributions and detailed NER results for each language and method are shown in Appendix E and Appendix F.
Table 7: POS tagging accuracy on low-resource languages in UD 1.2.
3.5 POS Tagging in 27 Languages
We perform POS tagging experiments in the 21 high-resource (Table 6) and 6 low-resource languages (Table 7) from the Universal Dependencies (UD) treebanks on which Yasunaga et al. (2018) report state-of-the-art results via adversarial training (Adv.). In high-resource POS tagging, we also compare to the BiLSTM by Plank et al. (2016). While differences between methods are less pronounced than for NER, we observe similar patterns. On average, the combination of multilingual BERT, monolingual BPEmb, and character embeddings is best for high-resource languages and outperforms Adv. by 0.2 percent (96.8 vs. 96.6). For low-resource languages, multilingual BPEmb with character embeddings and finetuning is the best method, yielding an average improvement of 0.8 percent over Adv. (92.4 vs. 91.6).
4 Limitations and Conclusions
Limitations. While extensive, our evaluation is not without limitations. Throughout this study, we have used a Wikipedia edition in a given language as a sample of that language. The degree to which this sample is representative varies, and low-resource Wikipedias in particular contain large fractions of "foreign" text and noise, which propagates into embeddings and datasets. Our evaluation did not include other subword representations, most notably ELMo (Peters et al., 2018) and contextual string embeddings (Akbik et al., 2018), since, even though they are language-agnostic in principle, pretrained models are only available in a few languages.

Conclusions. We have presented a large-scale study of contextual and non-contextual subword embeddings, in which we trained monolingual and multilingual NER models in 265 languages and POS-tagging models in 27 languages. BPE vocabulary size has a large effect on model quality, both in monolingual settings and with a large vocabulary shared among 265 languages. As a rule of thumb, a smaller vocabulary size is better for small datasets and larger vocabulary sizes better for larger datasets. Large improvements over monolingual training showed that low-resource languages benefit from multilingual model training with shared subword embeddings. Such improvements are likely not solely caused by cross-
lingual transfer, but also by the prevention of overfitting and mitigation of noise in small monolingual datasets. Monolingual finetuning of a multilingual model improves performance in almost all cases (compare -finetune and +finetune columns in Table 9 in Appendix F). For high-resource languages, we found that monolingual embeddings and monolingual training perform better than multilingual approaches with a shared vocabulary. This is likely due to the fact that a high-resource language provides large background corpora for learning good embeddings of a large vocabulary and also provides so much training data for the task at hand that little additional information can be gained from training data in other languages. Our experiments also show that even a large multilingual contextual model like BERT benefits from character embeddings and additional monolingual embeddings.
Finally, and while asking the reader to bear the above limitations in mind, we make the following practical recommendations for multilingual sequence tagging with subword representations:
• Choose the largest feasible subword vocabulary size when a large amount of data is available.
• Multilingual BERT is a robust choice across tasks and languages if the computational requirements can be met.
• With limited computational resources, use small monolingual, non-contextual representations, such as BPEmb combined with character embeddings.
• Combine different subword representations for better results.
• In low-resource scenarios, first perform multilingual pretraining with a shared subword vocabulary, then finetune to the language of interest.
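The fourth recommendation can be illustrated with a minimal sketch. The paper combines encodings with a meta-LSTM (Figure 1); the plain per-token concatenation below, operating on hypothetical encoder outputs, is only a stand-in for that combination step, not the paper's implementation:

```python
from typing import Dict, List

def combine_representations(per_encoder: Dict[str, List[List[float]]]) -> List[List[float]]:
    """Concatenate per-token vectors from several subword encoders.

    per_encoder maps an encoder name (e.g. "bert", "bpemb", "char")
    to a sequence of per-token vectors. All sequences must be aligned
    to the same tokenization, i.e. have equal length. In the paper a
    meta-LSTM combines the encodings; concatenation is a simple proxy.
    """
    names = sorted(per_encoder)  # fix a deterministic encoder order
    lengths = {len(per_encoder[n]) for n in names}
    assert len(lengths) == 1, "encoders must produce one vector per token"
    combined = []
    for token_vecs in zip(*(per_encoder[n] for n in names)):
        vec: List[float] = []
        for v in token_vecs:
            vec.extend(v)  # concatenate along the feature dimension
        combined.append(vec)
    return combined
```

The combined per-token vectors would then feed a downstream tagger (e.g. a BiLSTM followed by a classification layer).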
5 Acknowledgements
We thank the anonymous reviewers for insightful comments. This work has been funded by the Klaus Tschira Foundation, Heidelberg, Germany, and partially funded by the German Research Foundation as part of the Research Training Group “Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) under grant No. GRK 1994/1.
References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Emily M. Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Michael Collins. 2002. Ranking algorithms for named entity extraction: Boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Ryan Cotterell and Georg Heigold. 2017. Cross-lingual character-level neural morphological tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 748–759. Association for Computational Linguistics.

James Curran and Stephen Clark. 2003. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 363–370. Association for Computational Linguistics.

Alex Graves. 2012. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University of Munich.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, Mass., 28 June – 1 July 2001, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Efficient contextual representation learning without softmax layer. CoRR.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063. Association for Computational Linguistics.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61. Association for Computational Linguistics.

Frédéric Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Autodiff Workshop, NIPS 2017.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2089–2096, Istanbul, Turkey. European Language Resources Association (ELRA).

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. 2018. Robust multilingual part-of-speech tagging via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 976–986. Association for Computational Linguistics.
A Analysis of NER tag distribution and baseline performance in WikiAnn
[Figure 6 shown here: three panels over the shared x-axis “Tag entropy rank” (1–265); y-axes: dataset size (log scale), Pan17 NER F1 (%), and tag relative frequency (0.0–1.0); legend: O, I-PER, B-PER, I-ORG, B-ORG, I-LOC, B-LOC.]
Figure 6: WikiAnn named entity tag distribution for each language (top) in comparison to Pan et al. NER F1 scores (middle) and each language's dataset size (bottom). Languages are sorted from left to right from highest to lowest tag distribution entropy. That is, the NER tags in WikiAnn for the language in question are well-balanced for higher-ranked languages on the left and become more skewed for lower-ranked languages towards the right. Pan et al. achieve NER F1 scores of up to 100 percent on some languages, which can be explained by the highly skewed, i.e. low-entropy, tag distribution in these languages (compare F1 scores >99% in the middle subfigure with the skewed tag distributions in the top subfigure). Better balance, i.e. higher entropy, of the tag distribution tends to be found in languages for which WikiAnn provides more data (compare top and bottom subfigures).
B BPE and character-ngrams are not language-independent
Some methods proposed in NLP are claimed to be language-independent without justification (Bender, 2011). Subword segmentation with BPE or character-ngrams is language-agnostic, i.e., such a segmentation can be applied to any sequence of symbols, regardless of the language or meaning of these symbols. However, BPE and character-ngrams are based on the assumption that meaningful subwords consist of adjacent characters, such as the suffix -ed indicating past tense in English or the copular negation nai in Japanese. This assumption does not hold in languages with non-concatenative morphology. For example, Semitic roots in languages such as Arabic and Hebrew are patterns of discontinuous sequences of consonants which form words by insertion of vowels and other consonants. For instance, words related to writing are derived from the root k-t-b: kataba “he wrote” or kitab “book”. BPE and character-ngrams are not suited to efficiently capture such patterns of non-adjacent characters, and hence are not language-independent.
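The adjacency assumption can be made concrete with a toy BPE learner, a minimal sketch of the merge procedure of Sennrich et al. (2016), not their implementation. Because merges always join neighboring symbols, no sequence of merges ever produces the discontinuous root k-t-b as a single unit:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges over a toy corpus.

    Each word is treated as a sequence of single-character symbols.
    Merges always join the most frequent pair of ADJACENT symbols,
    which is exactly the adjacency assumption discussed above.
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # merge adjacent pair
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# On the Arabic root example, BPE can only merge contiguous substrings
# such as "ta" or "tab"; the discontinuous root k-t-b never becomes a unit.
merges, vocab = learn_bpe_merges(["kataba", "kitab"], 3)
```

Every learned subword is a contiguous substring of some training word, so a segmenter built on these merges cannot represent the shared root of kataba and kitab directly.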
C Procedure for selecting the best BPE vocabulary size

We determine the best BPE vocabulary size for each language according to the following procedure.
1. For each language l in the set of all languages L and each BPE vocabulary size v ∈ V, run n-fold cross-validation, with each fold comprising a random split into training, development, and test sets.12

2. Find the best BPE vocabulary size v_l for each language, according to the mean evaluation score on the development set of each cross-validation fold.

3. Determine the dataset size, measured in the number of instances N_l, for each language.

4. For each vocabulary size v, compute the median number of training instances of the languages for which v gives the maximum evaluation score on the development set, i.e. N_v = median({N_l | l ∈ L, v_l = v}).
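Steps 3 and 4 can be sketched as follows. The final step of matching a new language's dataset size against N_v is our extrapolation of the procedure, and all variable names are illustrative:

```python
from statistics import median

def pick_vocab_size(dataset_size, best_vocab_by_lang, dataset_size_by_lang):
    """Suggest a BPE vocabulary size for a new language.

    best_vocab_by_lang maps each language l to its best size v_l
    (step 2); dataset_size_by_lang maps l to N_l (step 3). For each
    candidate v we compute N_v = median({N_l | v_l = v}) (step 4),
    then return the v whose N_v is closest to the new language's
    dataset size (our extrapolation, not part of the paper's steps).
    """
    sizes_for = {}
    for lang, v in best_vocab_by_lang.items():
        sizes_for.setdefault(v, []).append(dataset_size_by_lang[lang])
    n_v = {v: median(sizes) for v, sizes in sizes_for.items()}
    return min(n_v, key=lambda v: abs(n_v[v] - dataset_size))
```

This reflects the rule of thumb from the conclusions: languages with small datasets end up with small vocabulary sizes, and large datasets with large ones.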
Table 8: Hyper-parameters used in our experiments.
E NER score distributions on WikiAnn
[Figure 7 shown here: two panels plotting NER F1 (0.0–1.0) against method performance rank; top panel (ranks 1–265): Pan17, FastText, BPEmb+char, MultiBPEmb+char; bottom panel (ranks 1–101): BERT, BERT+char+BPEmb, BPEmb+char, MultiBPEmb+char.]
Figure 7: NER results for the 265 languages represented in Pan et al. (2017), FastText, and BPEmb (top), and the 101 languages constituting the intersection of these methods and BERT (bottom). Per-language F1 scores achieved by each method are sorted in descending order from left to right. The data points at rank 1 show the highest score among all languages achieved by the method in question, at rank 2 the second-highest score, etc.
F Detailed NER Results on WikiAnn

[Table 9 shown here; column groups: BPEmb, BERT, MultiBPEmb+char.]