Morphological Analyzer and Generator for Russian and Ukrainian Languages Mikhail Korobov AIST 2015
Jul 18, 2015
Morphological Analysis: word -> possible grammatical tags
• стали: VERB,perf,intr plur,past,indc (ГЛ,сов,неперех мн,прош,изъяв);
• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs] (СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн])
• бутявка: NOUN,inan,femn sing,nomn (СУЩ,неод,жр ед,им)
Morphological Generation
• lemmatization: стали -> стать, ежом -> ёж
• inflection: стали -> (sing,3per,fut) -> станет
• inflection: ёж -> (datv) -> ежу
pymorphy2: features
• morphological analysis of Russian words;
• morphological generation: lemmatization, inflection, number agreement;
• P(tag | word) estimates;
• out-of-vocabulary words handling;
• experimental support for Ukrainian language.
pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code, Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB Python interpreter
• Speed: 20-100K words per second with an optional C++ extension
Analysis of Vocabulary Words
• OpenCorpora dictionary for Russian (5M word forms, 400K lemmas);
• a dictionary for Ukrainian based on LanguageTool data (2.5M word forms) by Andrey Rysin, Dmitry Chaplinsky, Mariana Romanyshyn, Vladimir Sevastyanov & others.
Analysis of Vocabulary Words
Source dictionaries provide lexemes:
ёж     NOUN,anim,masc sing,nomn
ежа    NOUN,anim,masc sing,gent
ежу    NOUN,anim,masc sing,datv
...
ежами  NOUN,anim,masc plur,ablt
ежах   NOUN,anim,masc plur,loct
Tasks
• Analyze: get a word from dictionary, return its tag
• Lemmatize: find a word in dictionary, get 1st word from its lexeme
• Inflect: find a word in dictionary, get a compatible word from its lexeme
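The three tasks above can be sketched over a toy lexeme. This is only an illustration with a plain Python dict, not how pymorphy2 stores its data; the names `LEXEME`, `INDEX`, `analyze`, `lemmatize` and `inflect` are hypothetical.

```python
# Toy lexeme for "ёж"; the first entry is treated as the lemma.
LEXEME = [
    ("ёж", "NOUN,anim,masc sing,nomn"),
    ("ежа", "NOUN,anim,masc sing,gent"),
    ("ежу", "NOUN,anim,masc sing,datv"),
    ("ежом", "NOUN,anim,masc sing,ablt"),
]

# Hypothetical index: word form -> list of (lexeme, position in lexeme).
INDEX = {}
for pos, (form, _tag) in enumerate(LEXEME):
    INDEX.setdefault(form, []).append((LEXEME, pos))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,datv' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Return the tags of every dictionary entry matching the word."""
    return [lexeme[pos][1] for lexeme, pos in INDEX.get(word, [])]


def lemmatize(word):
    """Return the first form of each lexeme the word belongs to."""
    return [lexeme[0][0] for lexeme, _pos in INDEX.get(word, [])]


def inflect(word, required):
    """Return the first form in each lexeme whose tag contains all required grammemes."""
    results = []
    for lexeme, _pos in INDEX.get(word, []):
        for form, tag in lexeme:
            if required <= grammemes(tag):
                results.append(form)
                break
    return results


print(analyze("ежу"))
print(lemmatize("ежом"))
print(inflect("ёж", {"datv"}))
```

Each function returns a list because a real word form can belong to several lexemes (as "стали" does).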
Efficiency considerations
• OpenCorpora XML dictionary is 400MB on disk
• Looking a word up by scanning the XML is O(N)
• When loaded into an in-memory hash table (Python dict), the dictionary takes several GB of RAM
Solution
• Extract paradigms from lexemes; encode words as DAFSA.
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM
Lexeme
word          tag
хомяковый     ADJF,Qual masc,sing,nomn
хомякового    ADJF,Qual masc,sing,gent
...
хомяковы      ADJS,Qual plur
хомяковее     COMP,Qual
хомяковей     COMP,Qual V-ej
похомяковее   COMP,Qual Cmp2
похомяковей   COMP,Qual Cmp2,V-ej
Lexeme
prefix  stem     suffix  tag
        хомяков  ый      ADJF,Qual masc,sing,nomn
        хомяков  ого     ADJF,Qual masc,sing,gent
...
        хомяков  ы       ADJS,Qual plur
        хомяков  ее      COMP,Qual
        хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej
Paradigm
prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej
Paradigm, encoded
prefix_id  suffix_id  tag_id
0          66         78
0          67         79
...
0          37         94
0          82         95
0          121        96
1          82         97
1          121        98
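The lexeme-to-paradigm steps can be sketched as follows. The slides do not say how the stem is chosen; this sketch assumes it is the longest substring shared by all forms, and the helper names (`longest_common_substring`, `extract_paradigm`, `encode`) are hypothetical. The ids it produces are assigned in order of first appearance, so they will not match the numbers on the slide.

```python
def longest_common_substring(words):
    """Brute-force search for the longest substring present in every word."""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in w for w in words):
                return candidate
    return ""


def extract_paradigm(lexeme):
    """Split each form into prefix + stem + suffix; return (stem, paradigm)."""
    stem = longest_common_substring([form for form, _tag in lexeme])
    paradigm = []
    for form, tag in lexeme:
        if stem:
            prefix, _, suffix = form.partition(stem)
        else:  # no shared stem: keep the whole form as the suffix
            prefix, suffix = "", form
        paradigm.append((prefix, suffix, tag))
    return stem, paradigm


def encode(paradigm, prefixes, suffixes, tags):
    """Replace strings with ids into shared lookup tables (tables grow as needed)."""
    def intern(table, value):
        if value not in table:
            table.append(value)
        return table.index(value)

    return [
        (intern(prefixes, p), intern(suffixes, s), intern(tags, t))
        for p, s, t in paradigm
    ]


lexeme = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]
stem, paradigm = extract_paradigm(lexeme)
prefixes, suffixes, tags = [""], [], []
encoded = encode(paradigm, prefixes, suffixes, tags)
print(stem, encoded)
```

Because many lexemes inflect the same way, they share one paradigm, and each word form then only needs a paradigm id and an index into it.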
DAFSA
[Figure: DAFSA graph for the encoded (word, paradigm_id, form_index) keys]
(word, paradigm_id, form_index) triples:
(двор, 103, 0); (ёж, 104, 0); (дворник, 101, 2); (дворник, 102, 2); (ёжик, 101, 2); (ёжик, 102, 2)
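To show how such triples can be retrieved by word, here is a minimal sketch that stores each triple as a "word + separator + payload" key and answers lookups with a prefix search. It uses a sorted list with `bisect` as a stand-in, not a DAFSA; the separator character and the textual payload encoding are assumptions made for the example.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator that never occurs inside a word

TRIPLES = [
    ("двор", 103, 0), ("ёж", 104, 0),
    ("дворник", 101, 2), ("дворник", 102, 2),
    ("ёжик", 101, 2), ("ёжик", 102, 2),
]

# One key per triple: "<word><sep><paradigm_id>,<form_index>", kept sorted.
KEYS = sorted(f"{word}{SEP}{pid},{idx}" for word, pid, idx in TRIPLES)


def lookup(word):
    """Return all (paradigm_id, form_index) pairs stored for the word."""
    prefix = word + SEP
    results = []
    i = bisect_left(KEYS, prefix)  # first key that could start with the prefix
    while i < len(KEYS) and KEYS[i].startswith(prefix):
        pid, idx = KEYS[i][len(prefix):].split(",")
        results.append((int(pid), int(idx)))
        i += 1
    return results


print(lookup("дворник"))
print(lookup("двор"))
```

A DAFSA answers the same prefix query while sharing common beginnings and endings of keys between words, which is where the memory savings come from.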
Out of Vocabulary Words
Common prefix removal: language-specific lists of common immutable prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу - a known word
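The chain above can be sketched as a recursive search. The prefix list, the toy dictionary, the `min_rest` limit and the function name `strip_prefixes` are assumptions for illustration only.

```python
# Hypothetical excerpt of a language-specific prefix list.
KNOWN_PREFIXES = ["недо", "не", "псевдо", "авиа"]

# Toy dictionary: word -> tag.
DICTIONARY = {"шоу": "NOUN,inan,neut"}


def strip_prefixes(word, min_rest=3):
    """Peel known prefixes until a dictionary word remains.

    Returns (removed_prefixes, known_word), or None if no split works.
    """
    if word in DICTIONARY:
        return [], word
    for prefix in KNOWN_PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and len(rest) >= min_rest:
            result = strip_prefixes(rest, min_rest)
            if result is not None:
                prefixes, base = result
                return [prefix] + prefixes, base
    return None


print(strip_prefixes("недопсевдоавиашоу"))
```

The unknown word can then reuse the analysis of the known word, since the removed prefixes are immutable and do not change the grammatical form.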
Words Ending with Other Dictionary Words
Example: котопсина
• a word being analyzed has another word from a dictionary as a suffix;
• the length of this "suffix" word is no less than 3;
• the length of the word without the "suffix" is no greater than 5;
• "suffix" word is of an open class (noun, verb, adjective, participle, gerund)
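A minimal sketch of these four conditions, assuming a toy dictionary and assuming the open classes map to the OpenCorpora part-of-speech names listed in `OPEN_CLASSES`; the function name is hypothetical.

```python
# Toy dictionary: word -> list of tags.
DICTIONARY = {
    "псина": ["NOUN,anim,femn sing,nomn"],
    "под": ["PREP"],
}

# Assumed mapping of "noun, verb, adjective, participle, gerund" to POS names.
OPEN_CLASSES = {"NOUN", "VERB", "INFN", "ADJF", "ADJS", "PRTF", "PRTS", "GRND"}


def analyze_by_known_suffix(word, min_suffix_len=3, max_prefix_len=5):
    """Find dictionary words that the unknown word ends with."""
    results = []
    for cut in range(1, max_prefix_len + 1):
        unknown_part, suffix_word = word[:cut], word[cut:]
        if len(suffix_word) < min_suffix_len:
            break  # longer cuts only make the "suffix" word shorter
        for tag in DICTIONARY.get(suffix_word, []):
            pos = tag.replace(" ", ",").split(",")[0]
            if pos in OPEN_CLASSES:
                results.append((unknown_part, suffix_word, tag))
    return results


print(analyze_by_known_suffix("котопсина"))
```

The closed-class check keeps a word that merely ends in a preposition or particle from inheriting that word's analysis.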
Endings Matching
Example: бурбуляторовый
• words with common endings often have the same grammatical form
• pymorphy2 builds an index of all 1-5 char word endings and their analyses
• (frequency, paradigm_id, form_index) triple is stored for each ending
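The ending index can be sketched with a counter per ending. The word list and paradigm ids below are made up, and preferring the longest matching ending is an assumption of this sketch rather than something stated on the slide.

```python
from collections import Counter, defaultdict

# Toy dictionary entries: (word, paradigm_id, form_index); ids are hypothetical.
KNOWN = [
    ("хомяковый", 7, 0),
    ("дворовый", 7, 0),
    ("новый", 7, 0),
    ("ковбой", 12, 0),
]


def build_ending_index(entries, max_len=5):
    """Count (paradigm_id, form_index) pairs for every 1..max_len char ending."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in entries:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def guess(word, index, max_len=5):
    """Return (frequency, paradigm_id, form_index) triples for the longest known ending."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return [(freq, pid, idx) for (pid, idx), freq in counts.most_common()]
    return []


index = build_ending_index(KNOWN)
print(guess("бурбуляторовый", index))
```

Here "бурбуляторовый" matches the 5-character ending "ровый" seen in "дворовый", so it is assigned that word's paradigm and form.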
Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук
P(tag | word) estimation
• Based on partially disambiguated OpenCorpora data;
• MLE with Laplace smoothing
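As a rough illustration of the estimate, the sketch below applies add-one (Laplace) smoothing over the tags the dictionary allows for a word. The counts are invented for the example, and the exact normalisation is an assumption, not taken from the slides.

```python
def tag_probabilities(word, possible_tags, corpus_counts):
    """Estimate P(tag | word) with add-one smoothing over the word's possible tags.

    corpus_counts maps word -> {tag: number of disambiguated occurrences}.
    """
    counts = corpus_counts.get(word, {})
    total = sum(counts.get(tag, 0) for tag in possible_tags) + len(possible_tags)
    return {tag: (counts.get(tag, 0) + 1) / total for tag in possible_tags}


# Made-up counts, not real OpenCorpora statistics.
TOY_COUNTS = {
    "стали": {"VERB,perf,intr plur,past,indc": 8, "NOUN,inan,femn sing,gent": 1},
}
possible = [
    "VERB,perf,intr plur,past,indc",
    "NOUN,inan,femn sing,gent",
    "NOUN,inan,femn sing,datv",
]
probs = tag_probabilities("стали", possible, TOY_COUNTS)
for tag, p in probs.items():
    print(f"{p:.3f}  {tag}")
```

Smoothing gives a tag that never occurred in the corpus a small non-zero probability, and a word absent from the corpus gets a uniform distribution over its tags.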
Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data
Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora ("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)
Evaluation: errors (full grammatical tags, recall, errors in hyphenated words are not considered errors)
[Figure: bar chart of error counts for pymorphy2 and Mystem 3.0 on microcorpus and ruscorpora; bar labels in extraction order: 8, 9, 15, 10]
Evaluation: errors
[Figure: bar chart of error counts by category (Abbreviations, People Names, Regular Words, Other, Hyphenated Words*) for pymorphy2 and Mystem 3.0; bar labels in extraction order: 11, 2, 6, 1, 14, 0, 2, 4, 4, 9]
Evaluation: results
• Both pymorphy2 and Mystem had an error rate below 1% (without disambiguation); most errors are in special cases.
• Hard to draw a conclusion; interpretation of evaluation results is important.
• 6 errors in the ruscorpora.ru gold results were found by parsing it with pymorphy2; 1 error in the microcorpus gold results was found by parsing it with Mystem.
Future work
• improve parsing of people's names, abbreviations and hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• nicer command-line utility;
• ideas?