Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng
Jan 13, 2016
Overview
ACL’2011 : Preslav Nakov & Hwee Tou Ng
Statistical machine translation systems typically assume that the word is the basic token unit of translation.
Problem: data sparseness for languages with rich morphology.
Our solution: a paraphrase-based approach to translating morphological variants.
Introduction
Morphology in Statistical Machine Translation (SMT)
Traditionally, the word was the basic token unit of translation.
The earliest SMT models (a.k.a. IBM models) were proposed for French and English, which have little morphology.
Most subsequent models remain word-atomic:
phrase-based
hierarchical
treelet
syntactic
Morphology in Statistical Machine Translation (SMT)
The word as an atomic token unit of translation is:
Fine for languages with little morphology:
English, French, Spanish
Chinese (almost no morphology)
Inadequate for morphologically rich languages:
Arabic, Turkish, Finnish: word inflections, word-attached clitics
German: compounds
The Case of Malay
The Malay language has rich derivational morphology but is poor in:
word inflections (unlike Arabic, Turkish, Finnish)
word-attached clitics (unlike Arabic, Turkish, Finnish)
concatenated compounds (unlike German, Finnish)
Problem: classic methods do not work for Malay
Solution: paraphrasing techniques at the word level, phrase level, and sentence level
Related Work
Two general lines of research
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation:
stemming (Yang and Kirchhoff, 2006)
lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
direct clustering (Talbot and Osborne, 2006)
factored models (Koehn and Hoang, 2007)
2. Word segmentation:
compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
clitics attached to the preceding word (Habash and Sadat, 2006)
morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
it has very little inflectional morphology, if any
compounds are not concatenated
clitics are rare
Malay Morphology
The Malay Language
Malay:
Austronesian language
~180M speakers
official in Malaysia, Indonesia, Singapore, and Brunei
two major standard versions (mutually intelligible):
Bahasa Malaysia (lit. ‘language of Malaysia’)
Bahasa Indonesia (lit. ‘language of Indonesia’).
The Malay Language
Malay: an agglutinative language with very rich derivational morphology but nearly non-existent inflectional morphology
Inflectionally, Malay is like Chinese:
no grammatical gender, number or tense,
verbs are not marked for person, etc.
Malay Morphology
New word formation processes:
affixation
compounding
reduplication
Other morphological processes:
clitic attachment
New Word Formation Processes in Malay
Affixation: attaching affixes, which are not words, to a word
prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
suffixes (e.g., ajar → ajaran/‘teachings’)
circumfixes (e.g., ajar → pengajaran/‘lesson’)
infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
Compounding: putting two or more existing words together
e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
typically not concatenated
Reduplication: word repetition
e.g., pelajar-pelajar/‘students’
Clitics in Malay
Examples:
duduk/‘sit down’ + lah → duduklah/‘please, sit down’
kereta + nya → keretanya/‘his car’
Notes: Clitics are not affixes. Clitic attachment is NOT
a word inflection process
a word derivation process
Translating Malay Morphology
A Paraphrase-based Approach to Translating from Malay
Paraphrase-based Approach to Morphology
Given a complex Malay word, we generate:
morphologically simpler words from which it can be derived
alternative word segmentations
We treat these forms as potential paraphrases of the original word.
We use paraphrasing techniques at three levels:
word-level
phrase-level
sentence-level
Generating Simpler Morphological Variants
Given a complex Malay word, we generate
1. words obtainable by affix stripping, e.g., pelajaran → pelajar, ajaran, ajar
2. words that are part of a compound word, e.g., kerjasama → kerja, sama
3. words appearing on either side of a dash, e.g., adik-beradik → adik, beradik
4. words without clitics, e.g., keretanya → kereta
5. clitic-segmented word sequences, e.g., keretanya → kereta nya
6. dash-segmented wordforms, e.g., aceh-nias → aceh - nias
7. combinations of the above.
Examples:
adik-beradiknya → adik-beradiknya, adik-beradik nya, adik-beradik, beradiknya, beradik nya, beradik, adik nya, adik
berpelajaran → berpelajaran, pelajaran, pelajar, ajaran, ajar
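The variant generation above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the affix and clitic inventories and the lexicon are tiny hypothetical samples, Malay sound changes at affix boundaries are ignored, and compound and dash splitting are left out.

```python
# Hypothetical, deliberately tiny inventories (assumptions, not the real lists).
PREFIXES = ("ber", "pel")
SUFFIXES = ("an",)
CLITICS = ("nya", "lah")


def strip_affixes(word, lexicon=None, min_stem=3):
    """Return the word plus every form reachable by repeatedly stripping one affix.

    If a lexicon is given, stripped forms that are not in it are discarded.
    """
    seen = {word}
    stack = [word]
    while stack:
        current = stack.pop()
        candidates = [current[len(p):] for p in PREFIXES if current.startswith(p)]
        candidates += [current[:-len(s)] for s in SUFFIXES if current.endswith(s)]
        for cand in candidates:
            if len(cand) >= min_stem and cand not in seen:
                seen.add(cand)
                stack.append(cand)
    if lexicon is None:
        return seen
    return {w for w in seen if w == word or w in lexicon}


def clitic_variants(word, min_stem=3):
    """Return the clitic-free form and the clitic-segmented sequence, if any."""
    out = set()
    for clitic in CLITICS:
        if word.endswith(clitic) and len(word) - len(clitic) >= min_stem:
            base = word[:-len(clitic)]
            out.add(base)                # e.g. keretanya -> kereta
            out.add(f"{base} {clitic}")  # e.g. keretanya -> kereta nya
    return out


# Hypothetical lexicon used to filter out non-words such as "berpelajar".
LEXICON = {"pelajaran", "pelajar", "ajaran", "ajar"}
variants = strip_affixes("berpelajaran", lexicon=LEXICON)
```

Filtering against a lexicon is one plausible way to avoid over-generation; without it, blind stripping also produces forms such as berpelajar.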
Word-Level Paraphrases
Given a dev/test sentence:
1. We generate a list of variants {w’} for each Malay word w.
2. We add them to the sentence, thus forming a lattice.
Word-Level Paraphrases (cont.)
The lattice requires a weight for each arc. We set 1.0 for the original word w. For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English:
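The formula from the slide is not preserved in this transcript. Pivoting over English is commonly written as below; this is the standard formulation and an assumption about what the slide showed, with e ranging over the English words aligned to w:

```latex
\Pr(w' \mid w) = \sum_{e} \Pr(w' \mid e)\,\Pr(e \mid w)
```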
Word-Level Paraphrases (cont.)
Estimating the probability Pr(w’|w):
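The estimation details are not preserved in this transcript. A minimal sketch, assuming both pivoting factors are relative frequencies over word-alignment links, could look like this; `pivot_probs` and the toy alignment counts are hypothetical.

```python
from collections import Counter, defaultdict


def pivot_probs(aligned_pairs):
    """Build pr(w_prime, w) = sum over e of Pr(w_prime | e) * Pr(e | w).

    aligned_pairs is an iterable of (malay_word, english_word) alignment links;
    both factors are estimated by relative frequency of those links.
    """
    pair_count = Counter(aligned_pairs)
    malay_total = Counter()
    english_total = Counter()
    english_of = defaultdict(set)
    for (m, e), c in pair_count.items():
        malay_total[m] += c
        english_total[e] += c
        english_of[m].add(e)

    def pr(w_prime, w):
        return sum(
            (pair_count[(w_prime, e)] / english_total[e])  # Pr(w' | e)
            * (pair_count[(w, e)] / malay_total[w])        # Pr(e | w)
            for e in english_of.get(w, ())
        )

    return pr


# Toy alignment links (made up for illustration only).
links = [("keretanya", "car")] * 2 + [("kereta", "car")] * 6 + [("keretanya", "his")] * 2
pr = pivot_probs(links)
p = pr("kereta", "keretanya")
```

With these toy counts, the probability mass of keretanya is split between itself and kereta, and the two values sum to one.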
Sentence-Level Paraphrases
The word-level paraphrases added to dev/test sentences need matching phrases in the phrase table.
We therefore paraphrase the training data at the sentence level:
For each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’.
We pair each paraphrased sentence with the original target.
Paraphrased bi-text:
dia mahu membeli keretanya . || she wants to buy his car .
dia mahu beli keretanya . || she wants to buy his car .
dia mahu membeli kereta . || she wants to buy his car .
dia mahu membeli kereta nya . || she wants to buy his car .
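The construction of the paraphrased bi-text can be sketched as below, substituting one word at a time and keeping the target side unchanged. The function name and the `variants` mapping are hypothetical; how variants are obtained is outside this sketch.

```python
def paraphrase_bitext(bitext, variants):
    """Return the original pairs plus one extra pair per (word, paraphrase) substitution."""
    out = []
    for src, tgt in bitext:
        out.append((src, tgt))
        tokens = src.split()
        for i, w in enumerate(tokens):
            for v in variants.get(w, []):
                # Replace only position i; a variant may span several tokens.
                new_src = " ".join(tokens[:i] + v.split() + tokens[i + 1:])
                out.append((new_src, tgt))
    return out


bitext = [("dia mahu membeli keretanya .", "she wants to buy his car .")]
variants = {"membeli": ["beli"], "keretanya": ["kereta", "kereta nya"]}
paraphrased = paraphrase_bitext(bitext, variants)
```

For the example sentence this yields the original pair followed by the three paraphrased pairs listed in the paraphrased bi-text.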
Sentence-Level Paraphrases (cont.)
We build two phrase tables:
Torig from the original training bi-text
Tpar from the paraphrased bi-text
We merge these tables:
1. Keep all entries from Torig.
2. Add those phrase pairs from Tpar that are not in Torig.
3. Add extra features:
F1: 1 if the entry came from Torig, 0.5 otherwise.
F2: 1 if the entry came from Tpar, 0.5 otherwise.
F3: 1 if the entry was in both tables, 0.5 otherwise.
The feature weights are set using MERT, and the number of features is optimized on the development set.
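The merging procedure can be sketched as follows, with each phrase table modelled as a dictionary from a (source phrase, target phrase) pair to a list of scores. The representation and the function name are assumptions, and so is the reading that an entry present in both tables gets F1 = F2 = F3 = 1.

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables and append the indicator features F1, F2, F3."""
    merged = {}
    for pair, scores in t_orig.items():
        in_both = pair in t_par
        f2 = 1.0 if in_both else 0.5  # entry also occurs in T_par
        f3 = 1.0 if in_both else 0.5  # entry occurs in both tables
        merged[pair] = scores + [1.0, f2, f3]  # F1 = 1: entry came from T_orig
    for pair, scores in t_par.items():
        if pair not in merged:  # only pairs missing from T_orig are added
            merged[pair] = scores + [0.5, 1.0, 0.5]
    return merged


# Toy tables with a single made-up score per entry.
t_orig = {("kereta", "car"): [0.6], ("keretanya", "his car"): [0.4]}
t_par = {("kereta", "car"): [0.7], ("kereta nya", "his car"): [0.3]}
merged = merge_phrase_tables(t_orig, t_par)
```

Entries shared by both tables keep their Torig scores here, following step 1 of the merge.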
Phrase-Level Paraphrases
We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
1, for phrase pairs coming from Torig
max_p Pr(p’|p), for phrase pairs coming from Tpar,
where p’ is a paraphrase of some original Malay phrase p
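The extra feature could be computed along these lines. The function name, the `pr` callback (standing in for a phrase-level pivoting probability), and the fallback value of 0.0 when no original phrase is known are all assumptions made for illustration.

```python
def phrase_pivot_feature(src_phrase, from_orig, original_phrases, pr):
    """Return 1 for T_orig entries, else the maximum of pr(src_phrase, p) over p."""
    if from_orig:
        return 1.0
    # original_phrases: the Malay phrases p that src_phrase is a paraphrase of.
    return max((pr(src_phrase, p) for p in original_phrases), default=0.0)


# Made-up pivoting probabilities for illustration only.
toy_pr = {("kereta nya", "keretanya"): 0.3, ("kereta nya", "kereta-nya"): 0.2}
score = phrase_pivot_feature(
    "kereta nya",
    from_orig=False,
    original_phrases=["keretanya", "kereta-nya"],
    pr=lambda p_prime, p: toy_pr.get((p_prime, p), 0.0),
)
```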
Experiments and Evaluation
Data
Training bi-text: 350K sentence pairs
English: 10.4M words
Malay: 9.7M words
Development bi-text: 2,000 sentence pairs
English: 63.4K words
Malay: 58.5K words
Testing bi-text: 1,420 sentences
Malay: 28.8K words
English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
LM: 49.8M English words
Evaluation Results: BLEU
Detailed BLEU
Improvement for all n-grams used in BLEU
Evaluation With 5 Measures
Consistent improvement for 5 measures
Example Translations
Conclusion
We presented a novel approach to translating from a morphologically complex language, which uses paraphrases at three levels of translation:
word-level
phrase-level
sentence-level
We demonstrated the potential of the approach on Malay, which is derivationally rich but has almost no inflectional morphology.
Future Work
Improve the paraphrasing models:
use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
Try phrase table paraphrasing instead of sentence-level paraphrasing (Nakov, 2008)
Try other:
morphologically complex languages
SMT models
The presented work is supported by research grant POD0713875.