Morphological Preprocessing for Statistical Machine Translation
Nizar Habash, Columbia University
[email protected]
NLP Meeting, 10/19/2006
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006)
• Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)
Why Hybrid MT?
• StatMT and RuleMT have complementary advantages
– RuleMT: Handling of possible but unseen word forms
– StatMT: Robust translation of seen words
– RuleMT: Better global target syntactic structure
– StatMT: Robust local phrase-based translation
– RuleMT: Cross-genre generalizations/robustness
– StatMT: Robust within-genre translation
• StatMT and RuleMT use complementary resources
– Parallel corpora vs. dictionaries, parsers, analyzers, linguists
• Hybrids can potentially improve over either approach
Hybrid MT Challenges
• Linguistic phrase versus StatMT phrase, e.g. “. on the other hand , the”
• Meaningful probabilities for linguistic resources
• Increased system complexity
• The potential to produce the combined worst rather than the combined best
• Low Arabic parsing performance (~70% Parseval F-score)
• Statistical hallucinations
Hybrid MT Continuum
• “Hybrid” is a moving target
  – StatMT systems use some rule-based components
    • Orthographic normalization, number/date translation, etc.
  – RuleMT systems nowadays use statistical n-gram language modeling
• Hybrid MT systems
  – Different mixes of statistical/rule-based components
    • Resource availability
  – General approach directions
    • Adding rules/linguistics to StatMT systems
    • Adding statistics/statistical resources to RuleMT systems
  – Depth of hybridization
    • Morphology, syntax, semantics
Columbia MT Projects
• Arabic-English MT focus
• Different hybrid approaches
System      Approach                            Collaborations
Virgo       Syntax-aware SMT                    Halim Abbas (Columbia student)
Qumqum      Stat Enriched Generation Heavy MT   Bonnie Dorr & Necip Ayan (University of Maryland);
                                                Christof Monz (University of London)
SMT+MADA    Morphology Enriched SMT             Fatiha Sadat, Roland Kuhn & George Foster (NRC Canada)
Columbia MT Projects
• Arabic-English MT focus
• Different hybrid approaches
System      Approach                            MTEval Submissions
Virgo       Syntax-aware SMT                    First time 2006 – Primary
Qumqum      Stat Enriched Generation Heavy MT   Second time (2005, 2006) – Contrast;
                                                2005 submission was part of UMD’s
SMT+MADA    Morphology Enriched SMT             MADA used in MTEval 2006 submissions
                                                by NRC and RWTH
System Overview
[Figure: Columbia systems on the Koehn hybrid scale (0 = RuleMT … 6 = StatMT): GHMT+SMT (Columbia Contrast), Syntax+SMT (Columbia Primary), SMT+Morph; resources: parallel corpus (5M words), language model (1G words), morphology, syntax, rule-based reordering]
Research Directions
• Syntactic SMT preprocessing
• Syntax-aware phrase extraction
• Statistical linearization using richer CFGs
• Creation and integration of rule-generated phrase-tables
• Lowering dependence on source language resources
• Extension to other languages and dialects
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Arabic Linguistic Issues
• Rich Morphology
  – Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
    w+ l+ Al+ mktb → ‘and+ for+ the+ office’
  – Morphotactics: w+l+Al+mktb → wllmktb (و+ل+ال+مكتب → وللمكتب)
• Ambiguity: wjd وجد
  – ‘he found’
  – w+ jd و+جد ‘and + grandfather’
Previous Work
• Morphological & syntactic preprocessing for SMT
  – French-English (Berger et al., 1994)
  – German-English (Nießen and Ney, 2000; 2004)
  – Spanish, Catalan and Serbian to English (Popović and Ney, 2004)
  – Czech-English (Goldwater and McClosky, 2005)
  – Arabic-English (Lee, 2004)
• We focus on morphological preprocessing
  – Larger set of conditions: schemes, techniques, learning curve, genre variation
  – No additional kinds of preprocessing (e.g. dates, numbers)
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Preprocessing Schemes

Input: wsyktbhA? ‘and he will write it?’
ST  wsyktbhA ?
D1  w+ syktbhA ?
D2  w+ s+ yktbhA ?
D3  w+ s+ yktb +hA ?
BW  w+ s+ y+ ktb +hA ?
EN  w+ s+ ktb/VBZ S:3MS +hA ?

• ST  Simple Tokenization
• D1  Decliticize CONJ+
• D2  Decliticize CONJ+, PART+
• D3  Decliticize all clitics
• BW  Morphological stem and affixes
• EN  D3, Lemmatize, English-like POS tags, Subj
• ON  Orthographic Normalization
• WA  wa+ decliticization
• TB  Arabic Treebank
• L1  Lemmatize, Arabic POS tags
• L2  Lemmatize, English-like POS tags
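To make the decliticization schemes concrete, here is a toy regex-based splitter in the spirit of the REGEX technique discussed later. This is an illustrative sketch, not the actual implementation: the clitic and pronoun inventories are abbreviated, and real segmentation is ambiguous (wjd ‘he found’ vs. w+ jd ‘and grandfather’), which is exactly why the BAMA and MADA techniques exist.

```python
import re

# Hypothetical, abbreviated clitic inventories for illustration only.
CONJ = r"[wf]"                    # w+ 'and', f+ 'so'
PART = r"[bkls]"                  # b+, k+, l+, s+ particles
PRON = r"(?:hA|hm|nA|km|h|k|y)"   # a few enclitic pronouns

def d1(token: str) -> str:
    """D1: split off conjunction clitics only."""
    return re.sub(rf"^({CONJ})(?=\w\w)", r"\1+ ", token)

def d2(token: str) -> str:
    """D2: D1 plus particle clitics."""
    t = d1(token)
    return re.sub(rf"(?:^|(?<=\+ ))({PART})(?=\w\w)", r"\1+ ", t, count=1)

def d3(token: str) -> str:
    """D3: D2 plus Al+ determiner and pronominal enclitics.
    Note: morphotactic changes (e.g. wllmktb) are NOT handled here."""
    t = d2(token)
    t = re.sub(r"(?:^|(?<=\+ ))(Al)(?=\w)", r"\1+ ", t)
    return re.sub(rf"({PRON})$", r" +\1", t)

print(d1("wsyktbhA"))  # w+ syktbhA
print(d2("wsyktbhA"))  # w+ s+ yktbhA
print(d3("wsyktbhA"))  # w+ s+ yktb +hA
```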
Preprocessing Schemes
• MT04: 1,353 sentences, 36,000 words
[Figure: OOV counts and perplexity (0–1400) per scheme: ST, ON, WA, D1, L2, D2, L1, TB, D3, EN, BW]
Preprocessing Schemes
• Scheme accuracy, measured against the Penn Arabic Treebank
[Figure: accuracy (90–100%) per scheme: ST, ON, WA, D1, D2, TB, D3, MR, L1, L2, EN]
Preprocessing Techniques
• REGEX: Regular Expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
  – Pick first analysis
  – Use TOKAN (Habash 2006)
    • A generalized tokenizer
    • Assumes disambiguated morphological analysis
    • Declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
  – Multiple SVM classifiers + combiner
  – Selects BAMA analysis
  – Use TOKAN
TOKAN
• A generalized tokenizer
• Assumes disambiguated morphological analysis
• Declarative specification of any tokenization scheme
  – D1   w+ f+ REST
  – D2   w+ f+ b+ k+ l+ s+ REST
  – D3   w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:
  – TB   w+ f+ b+ k+ l+ REST +P: +O:
  – BW   MORPH
  – L1   LEXEME + POS
  – ENG  w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:
• Uses generator (Habash 2006)
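The declarative idea can be sketched as follows. This is a toy model with an assumed interface, not TOKAN’s actual specification format: a scheme simply declares which clitic slots to detach from one disambiguated analysis, and everything undeclared stays glued to the base.

```python
# Disambiguated analysis of wsyktbhA 'and he will write it'
# (slot names are illustrative, not TOKAN's real feature names).
analysis = {"conj": "w", "part": "s", "base": "yktb", "pron": "hA"}

def tokenize(analysis: dict, scheme: dict) -> str:
    """Apply a declarative tokenization scheme to one analysis."""
    pro = scheme.get("proclitics", [])
    enc = scheme.get("enclitics", [])
    out = [analysis[s] + "+" for s in pro if analysis.get(s)]
    # Undetached clitics re-attach to the base; the real TOKAN uses a
    # morphological generator here to get morphotactics right (wllmktb).
    base = "".join(analysis[s] for s in ("conj", "part")
                   if s not in pro and s in analysis)
    base += analysis["base"]
    base += "".join(analysis[s] for s in ("pron",)
                    if s not in enc and s in analysis)
    out.append(base)
    out += ["+" + analysis[s] for s in enc if analysis.get(s)]
    return " ".join(out)

D1 = {"proclitics": ["conj"]}
D3 = {"proclitics": ["conj", "part"], "enclitics": ["pron"]}

print(tokenize(analysis, D1))  # w+ syktbhA
print(tokenize(analysis, D3))  # w+ s+ yktb +hA
```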
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
• Combination of Preprocessing Schemes
Experiments
• Portage Phrase-based MT (Sadat et al., 2005)
• Training Data: 5 million words of parallel text only
  – All in news genre
  – Learning curve: 1%, 10% and 100%
• Language Modeling: 250 million words
• Development Tuning Data: MT03 Eval Set
• Test Data:
  – MT04 (mixed genre: news, speeches, editorials)
  – MT05 (all news)
Experiments (cont’d)
• Metric: BLEU (Papineni et al., 2001)
  – 4 references, case insensitive
• Each experiment:
  – Select a preprocessing scheme
  – Select a preprocessing technique
• Some combinations do not exist
  – e.g. REGEX and EN
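For concreteness, a minimal single-reference BLEU sketch: the geometric mean of modified 1–4-gram precisions times a brevity penalty. The actual evaluation uses 4 references and the standard scoring script; this is only to make the metric tangible.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Single-reference sentence BLEU (smoothing via a tiny floor)."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(1, sum(h.values()))
        log_prec += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(log_prec / max_n)

print(round(bleu("and he will write it", "and he will write it"), 3))  # 1.0
```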
MT04 Results
[Figure: BLEU (0–45) by scheme (ST, BW, D1, D2, D3, EN) and technique (MADA, BAMA, REGEX) at 1%, 10%, and 100% training data]
MT05 Results
[Figure: BLEU (0–45) by scheme (ST, BW, D1, D2, D3, EN) and technique (MADA, BAMA, REGEX) at 1%, 10%, and 100% training data]
MT04 Genre Variation
Best scheme + technique: EN+MADA @ 1%, D2+MADA @ 100%
[Figure: BLEU (0–45), Baseline (ST) vs. MADA, for 1% News, 1% NoNews, 100% News, 100% NoNews; relative gains +71%, +105%, +2%, +12% respectively]
Other Results
• Orthographic normalization generally did better than the baseline ST
  – statistically significant at 1% training data only
• wa+ decliticization was generally similar to D1
• The Arabic Treebank scheme was similar to D2
• Full lemmatization schemes behaved like EN but were always worse
• 50% training data
  – D2 @ 50% data >= ST @ 100% data
• A larger phrase size (14) did not differ significantly from the size 8 we used
Latest Results (July 2006)

        Training Size
         5M      111M
MT04    37.1     43.7
MT05    38.56    48.87
Road Map
• Hybrid MT Research @ Columbia
• Morphological Preprocessing for SMT
• Combination of Preprocessing Schemes
Oracle Combination
• Preliminary study: oracle combination
• MT04, 100% data, MADA technique, 11 schemes, sentence-level selection
• Achieved 46.0 BLEU
  – 24% improvement over the best single system (37.1)
[Figure: per-scheme selection rate in the oracle (0–16%): D2, TB, BW, ON, ST, EN, D3, WA, L2, D1, L1]
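The sentence-level oracle can be sketched as: for each input sentence, pick the scheme whose one-best output scores highest against the references. The scoring function below is a stand-in word-overlap score with toy data, not the BLEU-based selection used in the study.

```python
def oracle_select(outputs_by_scheme: dict, score) -> list:
    """outputs_by_scheme: {scheme: [hyp_1 .. hyp_N]} (same N for all).
    score(i, hyp) -> float rates hypothesis hyp for sentence i.
    Returns the per-sentence (scheme, hypothesis) oracle choices."""
    n = len(next(iter(outputs_by_scheme.values())))
    chosen = []
    for i in range(n):
        best = max(outputs_by_scheme,
                   key=lambda s: score(i, outputs_by_scheme[s][i]))
        chosen.append((best, outputs_by_scheme[best][i]))
    return chosen

# Toy example: two schemes, two sentences, reference-overlap scoring.
refs = ["a b c", "x y"]
outputs = {"D2": ["a b c", "x z"], "EN": ["a q c", "x y"]}
overlap = lambda i, hyp: len(set(hyp.split()) & set(refs[i].split()))

print(oracle_select(outputs, overlap))  # [('D2', 'a b c'), ('EN', 'x y')]
```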
System Combination
• Exploit scheme complementarity to improve MT quality
• Explore two methods of system combination
  – Rescoring-Only Combination (ROC)
  – Decoding-plus-Rescoring Combination (DRC)
• We use all 11 schemes with the MADA technique
Rescoring-Only Combination (ROC)
• Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice
• Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights
Rescoring-Only Combination (ROC)
• Standard combo
  – Trigram language model, phrase translation model, distortion model, and sentence length
  – IBM model 1 and 2 probabilities in both directions
• Other combo: add more features
  – Perplexity of the source sentence (PPL) against a source LM (in the same scheme)
  – Number of out-of-vocabulary words in the source sentence (OOV)
  – Source sentence length (SL)
  – An encoding of the specific scheme (SC)
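A minimal sketch of how such rescoring picks among the scheme-specific one-best outputs: each candidate carries a feature vector, and a tuned weight vector selects the highest-scoring one. Feature names and weights below are illustrative, not the paper’s tuned values.

```python
def roc_select(candidates: list, weights: dict) -> str:
    """candidates: [(scheme, {feature: value})]; returns the best scheme
    under a linear combination of weighted feature scores."""
    def total(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda c: total(c[1]))[0]

# Hypothetical feature values (log-domain model scores, counts, lengths).
candidates = [
    ("D2", {"lm": -12.3, "tm": -8.1, "oov": 0, "len": 7}),
    ("EN", {"lm": -11.9, "tm": -9.4, "oov": 1, "len": 7}),
]
weights = {"lm": 1.0, "tm": 1.0, "oov": -0.5, "len": 0.1}

print(roc_select(candidates, weights))  # D2
```

In the full systems these weights are tuned on the development set (MT03) rather than set by hand.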
Decoding-plus-Rescoring Combination (DRC)
• Step 1: Decode
  – For each preprocessing scheme:
    • Use the union of phrase tables from all schemes
    • Optimize and decode (with the same scheme)
• Step 2: Rescore
  – Rescore the one-best outputs of each preprocessing scheme
Results
• MT04 set
• Best single scheme D2 scores 37.1
Combination                 All schemes   4 best
ROC  Standard                  34.87      37.12
ROC  +PPL+SC                   37.58      37.45
ROC  +PPL+SC+OOV               37.40        -
ROC  +PPL+SC+OOV+SL            37.39        -
ROC  +PPL+SC+SL                37.15        -
DRC  +PPL+SC                   38.67      37.73
Results
• Statistical significance using bootstrap resampling (Koehn, 2004)

 DRC    ROC    D2     TB     D1     WA     ON
 100      0     0      0      0      0      0
        97.7   2.2    0.1     0      0      0
               92.1   7.9     0      0      0
                      98.8   0.7    0.3    0.2
                             53.8   24.1   22.1
                                    59.3   40.7
Conclusions
• For large amounts of training data, splitting off conjunctions and particles performs best
• For small amounts of training data, an English-like tokenization performs best
• A suitable choice of preprocessing scheme and technique yields an important increase in BLEU score if
  – there is little training data
  – there is a change in genre between training and test
• System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes
Future Work
• Study additional variant schemes that current results support
• Factored translation modeling
• Decoder extension to use multiple schemes in parallel
• Syntactic preprocessing
• Investigate combination techniques at the sentence and sub-sentence levels