Morphological Preprocessing for Statistical Machine Translation
Nizar Habash, Columbia University
[email protected]
NLP Meeting, 10/19/2006
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006)
• Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)
Why Hybrid MT?
• StatMT and RuleMT have complementary advantages
– RuleMT: Handling of possible but unseen word forms
– StatMT: Robust translation of seen words
– RuleMT: Better global target syntactic structure
– StatMT: Robust local phrase-based translation
– RuleMT: Cross-genre generalizations/robustness
– StatMT: Robust within-genre translation
• StatMT and RuleMT use complementary resources
– Parallel corpora vs. dictionaries, parsers, analyzers, linguists
• Hybrids can potentially improve over either approach
Hybrid MT Challenges
• Linguistic phrase versus StatMT phrase, e.g. “. on the other hand , the”
• Meaningful probabilities for linguistic resources
• Increased system complexity
• The potential to produce the combined worst rather than the combined best
• Low Arabic parsing performance (~70% Parseval F-score)
• Statistical hallucinations
Hybrid MT Continuum
• “Hybrid” is a moving target
  – StatMT systems use some rule-based components
    • Orthographic normalization, number/date translation, etc.
  – RuleMT systems nowadays use statistical n-gram language modeling
• Hybrid MT systems
  – Different mixes of statistical/rule-based components
    • Resource availability
  – General approach directions
    • Adding rules/linguistics to StatMT systems
    • Adding statistics/statistical resources to RuleMT systems
  – Depth of hybridization
    • Morphology, syntax, semantics
Columbia MT Projects
• Arabic-English MT focus
• Different hybrid approaches
System      Approach                            Collaborations
Virgo       Syntax-aware SMT                    Halim Abbas (Columbia student)
Qumqum      Stat Enriched Generation Heavy MT   Bonnie Dorr & Necip Ayan (University of Maryland);
                                                Christof Monz (University of London)
SMT+MADA    Morphology Enriched SMT             Fatiha Sadat, Roland Kuhn & George Foster (NRC Canada)
Columbia MT Projects
• Arabic-English MT focus
• Different hybrid approaches
System      Approach                            MTEval Submissions
Virgo       Syntax-aware SMT                    First time 2006 – Primary
Qumqum      Stat Enriched Generation Heavy MT   Second time (2005, 2006) – Contrast;
                                                2005 submission was part of UMD’s
SMT+MADA    Morphology Enriched SMT             MADA used in MTEval 2006 submissions
                                                by NRC and RWTH
System Overview
[Figure: Columbia systems on the Koehn hybrid scale (0 = RuleMT … 6 = StatMT): GHMT+SMT (Columbia Contrast), Syntax+SMT (Columbia Primary), SMT+Morph; resources: parallel corpus (5M words), language model (1G words), morphology, syntax, rule-based reordering]
Research Directions
• Syntactic SMT preprocessing
• Syntax-aware phrase extraction
• Statistical linearization using richer CFGs
• Creation and integration of rule-generated phrase-tables
• Lowering dependence on source language resources
• Extension to other languages and dialects
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Arabic Linguistic Issues
• Rich Morphology
  – Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
    w+ l+ Al+ mktb → ‘and+ for+ the+ office’
  – Morphotactics: w+l+Al+mktb → wllmktb (و+ل+ال+مكتب → وللمكتب)
• Ambiguity: wjd وجد
  – ‘he found’
  – w+ jd و+جد ‘and + grandfather’
Previous Work
• Morphological & syntactic preprocessing for SMT
  – French-English (Berger et al., 1994)
  – German-English (Nießen and Ney, 2000; 2004)
  – Spanish, Catalan and Serbian to English (Popović and Ney, 2004)
  – Czech-English (Goldwater and McClosky, 2005)
  – Arabic-English (Lee, 2004)
• We focus on morphological preprocessing
  – Larger set of conditions: schemes, techniques, learning curve, genre variation
  – No additional kinds of preprocessing (e.g. dates, numbers)
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Preprocessing Schemes

Input: wsyktbhA? ‘and he will write it?’
ST  wsyktbhA ?
D1  w+ syktbhA ?
D2  w+ s+ yktbhA ?
D3  w+ s+ yktb +hA ?
BW  w+ s+ y+ ktb +hA ?
EN  w+ s+ ktb/VBZ S:3MS +hA ?

• ST  Simple Tokenization
• D1  Decliticize CONJ+
• D2  Decliticize CONJ+, PART+
• D3  Decliticize all clitics
• BW  Morphological stem and affixes
• EN  D3, Lemmatize, English-like POS tags, Subj
• ON  Orthographic Normalization
• WA  wa+ decliticization
• TB  Arabic Treebank
• L1  Lemmatize, Arabic POS tags
• L2  Lemmatize, English-like POS tags
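To make the decliticization schemes concrete, here is a toy regex-based splitter in the spirit of the REGEX technique discussed later. This is an illustrative sketch, not the actual implementation: the clitic and pronoun inventories are abbreviated, and real segmentation is ambiguous (wjd ‘he found’ vs. w+ jd ‘and grandfather’), which is exactly why the BAMA and MADA techniques exist.

```python
import re

# Hypothetical, abbreviated clitic inventories for illustration only.
CONJ = r"[wf]"                    # w+ 'and', f+ 'so'
PART = r"[bkls]"                  # b+, k+, l+, s+ particles
PRON = r"(?:hA|hm|nA|km|h|k|y)"   # a few enclitic pronouns

def d1(token: str) -> str:
    """D1: split off conjunction clitics only."""
    return re.sub(rf"^({CONJ})(?=\w\w)", r"\1+ ", token)

def d2(token: str) -> str:
    """D2: D1 plus particle clitics."""
    t = d1(token)
    return re.sub(rf"(?:^|(?<=\+ ))({PART})(?=\w\w)", r"\1+ ", t, count=1)

def d3(token: str) -> str:
    """D3: D2 plus Al+ determiner and pronominal enclitics.
    Note: morphotactic changes (e.g. wllmktb) are NOT handled here."""
    t = d2(token)
    t = re.sub(r"(?:^|(?<=\+ ))(Al)(?=\w)", r"\1+ ", t)
    return re.sub(rf"({PRON})$", r" +\1", t)

print(d1("wsyktbhA"))  # w+ syktbhA
print(d2("wsyktbhA"))  # w+ s+ yktbhA
print(d3("wsyktbhA"))  # w+ s+ yktb +hA
```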
Preprocessing Schemes
• MT04: 1,353 sentences, 36,000 words
[Figure: OOV counts and perplexity (0–1400) per scheme: ST, ON, WA, D1, L2, D2, L1, TB, D3, EN, BW]
Preprocessing Schemes
• Scheme accuracy, measured against the Penn Arabic Treebank
[Figure: accuracy (90–100%) per scheme: ST, ON, WA, D1, D2, TB, D3, MR, L1, L2, EN]
Preprocessing Techniques
• REGEX: Regular Expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
  – Pick first analysis
  – Use TOKAN (Habash 2006)
    • A generalized tokenizer
    • Assumes disambiguated morphological analysis
    • Declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
  – Multiple SVM classifiers + combiner
  – Selects BAMA analysis
  – Use TOKAN
TOKAN
• A generalized tokenizer
• Assumes disambiguated morphological analysis
• Declarative specification of any tokenization scheme
  – D1   w+ f+ REST
  – D2   w+ f+ b+ k+ l+ s+ REST
  – D3   w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:
  – TB   w+ f+ b+ k+ l+ REST +P: +O:
  – BW   MORPH
  – L1   LEXEME + POS
  – ENG  w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:
• Uses generator (Habash 2006)
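The declarative idea can be sketched as follows. This is a toy model with an assumed interface, not TOKAN’s actual specification format: a scheme simply declares which clitic slots to detach from one disambiguated analysis, and everything undeclared stays glued to the base.

```python
# Disambiguated analysis of wsyktbhA 'and he will write it'
# (slot names are illustrative, not TOKAN's real feature names).
analysis = {"conj": "w", "part": "s", "base": "yktb", "pron": "hA"}

def tokenize(analysis: dict, scheme: dict) -> str:
    """Apply a declarative tokenization scheme to one analysis."""
    pro = scheme.get("proclitics", [])
    enc = scheme.get("enclitics", [])
    out = [analysis[s] + "+" for s in pro if analysis.get(s)]
    # Undetached clitics re-attach to the base; the real TOKAN uses a
    # morphological generator here to get morphotactics right (wllmktb).
    base = "".join(analysis[s] for s in ("conj", "part")
                   if s not in pro and s in analysis)
    base += analysis["base"]
    base += "".join(analysis[s] for s in ("pron",)
                    if s not in enc and s in analysis)
    out.append(base)
    out += ["+" + analysis[s] for s in enc if analysis.get(s)]
    return " ".join(out)

D1 = {"proclitics": ["conj"]}
D3 = {"proclitics": ["conj", "part"], "enclitics": ["pron"]}

print(tokenize(analysis, D1))  # w+ syktbhA
print(tokenize(analysis, D3))  # w+ s+ yktb +hA
```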
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Experiments
• Portage Phrase-based MT (Sadat et al., 2005)
• Training Data: 5 million words of parallel text only
  – All in news genre
  – Learning curve: 1%, 10% and 100%
• Language Modeling: 250 million words
• Development Tuning Data: MT03 Eval Set
• Test Data:
  – MT04 (mixed genre: news, speeches, editorials)
  – MT05 (all news)
Experiments (cont’d)
• Metric: BLEU (Papineni et al., 2001)
  – 4 references, case insensitive
• Each experiment:
  – Select a preprocessing scheme
  – Select a preprocessing technique
• Some combinations do not exist
  – e.g. REGEX and EN
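For concreteness, a minimal single-reference BLEU sketch: the geometric mean of modified 1–4-gram precisions times a brevity penalty. The actual evaluation uses 4 references and the standard scoring script; this is only to make the metric tangible.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Single-reference sentence BLEU (smoothing via a tiny floor)."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(1, sum(h.values()))
        log_prec += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(log_prec / max_n)

print(round(bleu("and he will write it", "and he will write it"), 3))  # 1.0
```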
MT04 Results
[Figure: BLEU (0–45) by scheme (ST, BW, D1, D2, D3, EN) and technique (MADA, BAMA, REGEX) at 1%, 10%, and 100% training data]
MT05 Results
[Figure: BLEU (0–45) by scheme (ST, BW, D1, D2, D3, EN) and technique (MADA, BAMA, REGEX) at 1%, 10%, and 100% training data]
MT04 Genre Variation
Best scheme + technique: EN+MADA @ 1%, D2+MADA @ 100%
[Figure: BLEU (0–45), Baseline (ST) vs. MADA, for 1% News, 1% NoNews, 100% News, 100% NoNews; relative gains +71%, +105%, +2%, +12% respectively]
Other Results
• Orthographic normalization generally did better than the baseline ST
  – statistically significant at 1% training data only
• wa+ decliticization was generally similar to D1
• The Arabic Treebank scheme was similar to D2
• Full lemmatization schemes behaved like EN but were always worse
• 50% training data
  – D2 @ 50% data >= ST @ 100% data
• A larger phrase size (14) did not differ significantly from the size 8 we used
Latest Results (July 2006)

        Training Size
         5M      111M
MT04    37.1     43.7
MT05    38.56    48.87
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
• Combination of Preprocessing Schemes
Oracle Combination
• Preliminary study: oracle combination
• MT04, 100% data, MADA technique, 11 schemes, sentence-level selection
• Achieved 46.0 BLEU
  – 24% improvement over the best single system (37.1)
[Figure: per-scheme selection rate in the oracle (0–16%): D2, TB, BW, ON, ST, EN, D3, WA, L2, D1, L1]
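The sentence-level oracle can be sketched as: for each input sentence, pick the scheme whose one-best output scores highest against the references. The scoring function below is a stand-in word-overlap score with toy data, not the BLEU-based selection used in the study.

```python
def oracle_select(outputs_by_scheme: dict, score) -> list:
    """outputs_by_scheme: {scheme: [hyp_1 .. hyp_N]} (same N for all).
    score(i, hyp) -> float rates hypothesis hyp for sentence i.
    Returns the per-sentence (scheme, hypothesis) oracle choices."""
    n = len(next(iter(outputs_by_scheme.values())))
    chosen = []
    for i in range(n):
        best = max(outputs_by_scheme,
                   key=lambda s: score(i, outputs_by_scheme[s][i]))
        chosen.append((best, outputs_by_scheme[best][i]))
    return chosen

# Toy example: two schemes, two sentences, reference-overlap scoring.
refs = ["a b c", "x y"]
outputs = {"D2": ["a b c", "x z"], "EN": ["a q c", "x y"]}
overlap = lambda i, hyp: len(set(hyp.split()) & set(refs[i].split()))

print(oracle_select(outputs, overlap))  # [('D2', 'a b c'), ('EN', 'x y')]
```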
System Combination
• Exploit scheme complementarity to improve MT quality
• Explore two methods of system combination
  – Rescoring-Only Combination (ROC)
  – Decoding-plus-Rescoring Combination (DRC)
• We use all 11 schemes with the MADA technique
Rescoring-Only Combination (ROC)
• Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice
• Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights
Rescoring-Only Combination (ROC)
• Standard combo
  – Trigram language model, phrase translation model, distortion model, and sentence length
  – IBM model 1 and 2 probabilities in both directions
• Other combo: add more features
  – Perplexity of the source sentence (PPL) against a source LM (in the same scheme)
  – Number of out-of-vocabulary words in the source sentence (OOV)
  – Source sentence length (SL)
  – An encoding of the specific scheme (SC)
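A minimal sketch of how such rescoring picks among the scheme-specific one-best outputs: each candidate carries a feature vector, and a tuned weight vector selects the highest-scoring one. Feature names and weights below are illustrative, not the paper’s tuned values.

```python
def roc_select(candidates: list, weights: dict) -> str:
    """candidates: [(scheme, {feature: value})]; returns the best scheme
    under a linear combination of weighted feature scores."""
    def total(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda c: total(c[1]))[0]

# Hypothetical feature values (log-domain model scores, counts, lengths).
candidates = [
    ("D2", {"lm": -12.3, "tm": -8.1, "oov": 0, "len": 7}),
    ("EN", {"lm": -11.9, "tm": -9.4, "oov": 1, "len": 7}),
]
weights = {"lm": 1.0, "tm": 1.0, "oov": -0.5, "len": 0.1}

print(roc_select(candidates, weights))  # D2
```

In the full systems these weights are tuned on the development set (MT03) rather than set by hand.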
Decoding-plus-Rescoring Combination (DRC)
• Step 1: Decode
  – For each preprocessing scheme:
    • Use the union of phrase tables from all schemes
    • Optimize and decode (with the same scheme)
• Step 2: Rescore
  – Rescore the one-best outputs of each preprocessing scheme
Results
• MT04 set
• Best single scheme D2 scores 37.1
Combination                 All schemes   4 best
ROC  Standard                  34.87      37.12
ROC  +PPL+SC                   37.58      37.45
ROC  +PPL+SC+OOV               37.40        -
ROC  +PPL+SC+OOV+SL            37.39        -
ROC  +PPL+SC+SL                37.15        -
DRC  +PPL+SC                   38.67      37.73
Results
• Statistical significance using bootstrap resampling (Koehn, 2004)

 DRC    ROC    D2     TB     D1     WA     ON
 100      0     0      0      0      0      0
        97.7   2.2    0.1     0      0      0
               92.1   7.9     0      0      0
                      98.8   0.7    0.3    0.2
                             53.8   24.1   22.1
                                    59.3   40.7
Conclusions
• For large amounts of training data, splitting off conjunctions and particles performs best
• For small amounts of training data, an English-like tokenization performs best
• A suitable choice of preprocessing scheme and technique yields an important increase in BLEU score if
  – there is little training data
  – there is a change in genre between training and test
• System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes
Future Work
• Study additional variant schemes that current results support
• Factored translation modeling
• Decoder extension to use multiple schemes in parallel
• Syntactic preprocessing
• Investigate combination techniques at the sentence and sub-sentence levels