RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
Preslav Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Workshop “Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages”, RANLP 2009
Introduction
Objective
  Measure the extent to which a Bulgarian word and a Russian word are perceived as similar by a person who is fluent in both languages
Orthographic similarity
  Modified to account for typical cross-lingual correspondences between Bulgarian and Russian, e.g. transformations of inflections
Example
  Bulgarian афектирахме and Russian аффектировались are orthographically different but perceived as similar
Orthographic Similarity
Minimum Edit Distance Ratio (MEDR)
  MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations needed to transform s1 into s2 (Levenshtein distance)
  MEDR is also known as normalized edit distance (NED)
Longest Common Subsequence Ratio (LCSR)
  Length of the longest subsequence common to both words, normalized by the length of the longer word
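The two baseline measures above can be sketched in Python as follows. The slide defines MED and states that LCSR is normalized by the longer word. The formula MEDR = 1 - MED / max(|s1|, |s2|) is the usual normalized-edit-distance form and is an assumption here, since the slide does not spell it out. All function names are illustrative.

```python
def med(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and replace."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace, free when letters match
            ))
        prev = curr
    return prev[-1]


def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """Assumed form: 1 - MED / length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - med(s1, s2) / longest


def lcsr(s1: str, s2: str) -> float:
    """LCS length divided by the length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else lcs_len(s1, s2) / longest
```

For процесс and процес both ratios come out to 6/7, because the words differ by a single deleted letter.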
Modified Minimum Edit Distance Ratio (MMEDR)
Our MMEDR similarity algorithm
  Reduces the Russian word to an intermediate Bulgarian-sounding form
    Applies a set of linguistically motivated transformation rules
  Orthographically compares the modified Russian word with the Bulgarian word
    Calculates a weighted Levenshtein distance
Linguistic Motivation behind the MMEDR Algorithm
Linguistic Motivation
Transliteration from Cyrillic to Cyrillic
  Full coincidence (equality) of letters
  Regular letter transitions
Transformations of n-grams
Lemmatization
Transformation Weights
Transliteration
What is transliteration?
  Rendering the sounds of one language, and the letters that represent them, with the letters of another language
Russian → Bulgarian transliteration
  Full coincidence (equality) of letters, e.g. а → а (азбука – азбука)
  Russian letters missing in Bulgarian, e.g. ы → и, э → е (рыба – риба, поэт – поет)
  Removing a Russian letter, e.g. пальто → палто (ь is dropped)
  Regular letter transitions, e.g. муж → мъж, хлеб → хляб, сон → сън
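The unconditional letter-level steps above can be illustrated with a small character map. This is a minimal sketch: the mapping contains only the letters shown in the slide's examples, the name `transliterate` is hypothetical, and dropping every ь is a simplification. The regular transitions (у → ъ, е → я, о → ъ) are left out because the slide does not say in which contexts they apply.

```python
# Partial, illustrative mapping built only from the slide's examples.
# None as a value tells str.translate to delete the character.
LETTER_MAP = str.maketrans({
    "ы": "и",   # Russian letter missing in Bulgarian
    "э": "е",   # Russian letter missing in Bulgarian
    "ь": None,  # assumption: the soft sign is always removed
})


def transliterate(ru_word: str) -> str:
    """Map a Russian word to a Bulgarian-looking spelling, letter by letter."""
    return ru_word.lower().translate(LETTER_MAP)


for ru in ["азбука", "рыба", "поэт", "пальто"]:
    print(ru, "→", transliterate(ru))
```

Letters that coincide in the two alphabets are not listed in the map, so they pass through unchanged.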
Transformation of n-grams
Regular sound-letter transitions from Russian to Bulgarian
Transformations originating from spelling
  Double consonants, e.g. процесс → процес
  Voiceless to voiced consonants, e.g. бессмертный → безсмъртен
Transformations of morphological origin
  Removing agglutinative morphemes (ся and сь), e.g. веселиться → веселить
  Transforming endings, e.g. стенной → стенен
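Two of the n-gram rules above lend themselves to a short regular-expression sketch. The function names and the consonant inventory are assumptions made for illustration. Collapsing every doubled consonant is a simplification, and the voiceless-to-voiced rule is not covered because the slide gives only one example of it.

```python
import re

# Assumed consonant inventory for the doubled-consonant rule.
CONSONANTS = "бвгджзйклмнпрстфхцчшщ"
DOUBLE_CONSONANT = re.compile(rf"([{CONSONANTS}])\1")
REFLEXIVE_SUFFIX = re.compile(r"(ся|сь)$")


def strip_reflexive(ru_word: str) -> str:
    """Remove a word-final reflexive morpheme ся or сь, if present."""
    return REFLEXIVE_SUFFIX.sub("", ru_word)


def collapse_double_consonants(word: str) -> str:
    """Replace each pair of identical adjacent consonants with one letter."""
    return DOUBLE_CONSONANT.sub(r"\1", word)


print(collapse_double_consonants("процесс"))   # процес
print(strip_reflexive("веселиться"))           # веселить
```

Applied in sequence to аффектировались from the introduction, the two functions give афектировали, which is already much closer to the Bulgarian афектирахме.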
Transformation of Russian Adjectives
Russian ending    Bulgarian ending    Example
-нный             -нен                военный → военен
-ный              -ен                 вечный → вечен
-нний             -нен                ранний → ранен
-ний              -ен                 вечерний → вечерен
-ский             -ски                вражеский → вражески
-ый               -и                  стрелковый → стрелкови
-нной             -нен                стенной → стенен
-ной              -ен                 родной → роден
-ой               -и                  деловой → делови
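The adjective table can be applied as an ordered list of suffix rewrites. The sketch below copies the nine rules from the table. The function name and the rule ordering are assumptions: longer endings are tried first so that -нный takes precedence over -ный and -ый.

```python
# (Russian ending, Bulgarian ending), longest Russian endings first.
ADJECTIVE_ENDINGS = [
    ("нный", "нен"),
    ("нний", "нен"),
    ("нной", "нен"),
    ("ский", "ски"),
    ("ный", "ен"),
    ("ний", "ен"),
    ("ной", "ен"),
    ("ый", "и"),
    ("ой", "и"),
]


def rewrite_adjective(ru_word: str) -> str:
    """Replace the first matching Russian adjective ending; else return as is."""
    for ru_ending, bg_ending in ADJECTIVE_ENDINGS:
        if ru_word.endswith(ru_ending):
            return ru_word[: -len(ru_ending)] + bg_ending
    return ru_word


for ru in ["военный", "вечный", "стрелковый", "стенной"]:
    print(ru, "→", rewrite_adjective(ru))
```

Without the longest-first ordering, военный would match -ный and become the incorrect военнен.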
Transformation of Russian Verbs
Russian ending    Bulgarian ending    Example
-овать            -ам                 декорировать → декорирам
-ить              -я                  бродить → бродя
-ять              -я                  блеять → блея
-ать              -ам                 давать → давам
-уть              -а                  гаснуть → гасна
-еть              -ея                 белеть → белея
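The verb table maps a Russian infinitive ending to a Bulgarian citation-form ending. A minimal sketch, with the rules taken from the table and stored in a dictionary: sorting the endings by length at lookup time is an assumed design choice, and `rewrite_verb` is a hypothetical name.

```python
# Russian infinitive ending -> Bulgarian ending, as listed in the table.
VERB_ENDINGS = {
    "овать": "ам",
    "ить": "я",
    "ять": "я",
    "ать": "ам",
    "уть": "а",
    "еть": "ея",
}

# Try longer endings first so that -овать is matched before -ать.
ORDERED_ENDINGS = sorted(VERB_ENDINGS, key=len, reverse=True)


def rewrite_verb(ru_infinitive: str) -> str:
    """Swap a Russian infinitive ending for its Bulgarian counterpart."""
    for ru_ending in ORDERED_ENDINGS:
        if ru_infinitive.endswith(ru_ending):
            stem = ru_infinitive[: -len(ru_ending)]
            return stem + VERB_ENDINGS[ru_ending]
    return ru_infinitive


for ru in ["декорировать", "бродить", "гаснуть", "белеть"]:
    print(ru, "→", rewrite_verb(ru))
```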
Lemmatization
Bulgarian and Russian are highly inflectional languages
  A variety of endings express the different forms of the same word
What is lemmatization?
  Replacement of inflected wordforms with their lemmata
  E.g. късният → късен (Bulgarian), равняющимся → равнять (Russian)
Lemmatization can handle inflections
Transformation Weights
We use weights for letter substitutions when measuring Levenshtein distance
  We account for regular phonetic and spelling letter correspondences
Some substitutions are unlikely
  E.g. о → у is more likely than о → щ
Replacing a letter with itself has cost 0
A regular letter substitution has cost 1
Consonants and vowels with similar sequences of distinctive phonetic features have a lower substitution cost (e.g. б → в)
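The weighting scheme above can be sketched as a Levenshtein distance with a pluggable substitution cost. Costs 0 and 1 come from the slide. The reduced cost of 0.5, the choice of similar pairs (б/в and о/у, both mentioned in the slide), the unit cost for insertions and deletions, and all names are hypothetical, since the slide does not give the actual weights.

```python
# Hypothetical set of phonetically similar letter pairs and their cost.
SIMILAR_PAIRS = {frozenset("бв"), frozenset("оу")}
SIMILAR_COST = 0.5  # assumption: any value below 1 would illustrate the idea


def substitution_cost(a: str, b: str) -> float:
    """0 for identical letters, reduced cost for similar ones, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SIMILAR_PAIRS:
        return SIMILAR_COST
    return 1.0


def weighted_med(s1: str, s2: str, indel_cost: float = 1.0) -> float:
    """Levenshtein distance with weighted substitutions."""
    prev = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + indel_cost,                     # delete c1
                curr[j - 1] + indel_cost,                 # insert c2
                prev[j - 1] + substitution_cost(c1, c2),  # substitute
            ))
        prev = curr
    return prev[-1]


print(weighted_med("сон", "сун"))  # similar vowels: 0.5
print(weighted_med("сон", "сщн"))  # unrelated letters: 1.0
```

With these illustrative weights, a word pair that differs only in phonetically close letters ends up nearer than a pair with the same number of arbitrary substitutions.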