Course Summary
LING 575
Fei Xia
03/06/07
Outline
• Introduction to MT: 1
• Major approaches
  – SMT: 3
  – Transfer-based MT: 2
  – Hybrid systems: 2
• Other topics
Introduction to MT
Major challenges
• Translation is hard.
• Getting the right words:
  – Choosing the correct root form
  – Getting the correct inflected form
  – Inserting “spontaneous” words
• Putting the words in the correct order:
  – Word order: SVO vs. SOV, …
  – Unique constructions
  – Divergence
Lexical choice
• Homonymy/polysemy: bank, run
• Concept gap: no corresponding concept in the other language: go Greek, go Dutch, fen sui, lame duck, …
• Coding (concept–lexeme mapping) differences:
  – More distinctions in one language: e.g., kinship vocabulary
  – Different division of conceptual space
Major approaches
• Transfer-based
• Interlingua
• Example-based (EBMT)
• Statistical MT (SMT)
• Hybrid approach
The MT triangle
[Figure: the Vauquois triangle. Source word at the bottom left, target word at the bottom right, meaning (interlingua) at the apex; analysis climbs the left side, synthesis descends the right. Word-based SMT and EBMT operate near the bottom, phrase-based SMT and EBMT higher, transfer-based MT higher still, and interlingua at the top.]
Comparison of resource requirements

                    Transfer-based   Interlingua      EBMT        SMT
dictionary          +                +                +
transfer rules      +
parser              +                +                + (?)
semantic analyzer                    +
parallel data                                         +           +
others                               universal        thesaurus
                                     representation,
                                     generator
Evaluation
• Unlike many NLP tasks (e.g., tagging, chunking, parsing, IE, pronoun resolution), there is no single gold standard for MT.
• Human evaluation: accuracy, fluency, …
  – Problems: expensive, slow, subjective, non-reusable
• Automatic measures:
  – Edit distance
  – Word error rate (WER), position-independent WER (PER)
  – Simple string accuracy (SSA), generation string accuracy (GSA)
  – BLEU
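Of the automatic measures above, WER is the simplest to implement: the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (the tokenization by whitespace and the example sentences are illustrative assumptions, not from the course):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.33
```

PER is the same idea but ignores word order (comparing bags of words), which is why it is always ≤ WER.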
Major approaches
Word-based SMT
• IBM Models 1–5
• Main concepts:
  – Source-channel model
  – Hidden word alignment
  – EM training
Source channel model for MT

E* = argmax_E P(E) · P(F | E)

Eng sent → [noisy channel] → Fr sent
   P(E)        P(F | E)

Two types of parameters:
• Language model: P(E)
• Translation model: P(F | E)
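The argmax above can be illustrated with a toy candidate set. The probabilities and sentences below are made-up numbers for illustration, not output of a real model; a real decoder searches a huge hypothesis space rather than a fixed list:

```python
import math

def best_translation(f, candidates, lm, tm):
    """Pick E* = argmax_E P(E) * P(F|E) over a fixed candidate list.
    Work in log space to avoid underflow on realistic sentence lengths."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

# Hypothetical model scores: both candidates explain "la maison" equally
# well under the translation model, so the language model breaks the tie.
lm = {"the house": 0.01, "house the": 0.0001}   # P(E)
tm = {("la maison", "the house"): 0.2,
      ("la maison", "house the"): 0.2}          # P(F | E)

print(best_translation("la maison", list(lm), lm, tm))  # the house
```

This is the standard division of labor in the source-channel model: the translation model ensures adequacy, the language model ensures fluency.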
Modeling P(F | E) with alignment

P(F | E) = Σ_a P(F, a | E)
         = Σ_a P(a | E) · P(F | a, E)
Modeling

Model 1:
P(F | E) = P(m | l) / (l + 1)^m · Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

Model 2:
P(F | E) = P(m | l) · Π_{j=1..m} Σ_{i=0..l} d(i | j, m, l) · t(f_j | e_i)

Parameters:
• Length prob: P(m | l)
• Translation prob: t(f_j | e_i)
• Distortion prob (for Model 2): d(i | j, m, l)
Training

• Model 1 (EM):

Expected counts:
C(e, f) = Σ_{(E,F)} [ t'(f | e) / Σ_{i'=0..|E|} t'(f | e_i') ] · Σ_{i=0..|E|} δ(e, e_i) · Σ_{j=1..|F|} δ(f, f_j)

Re-estimation:
t(f | e) = C(e, f) / Σ_{x ∈ V_F} C(e, x)
Finding the best alignment

Given E and F, we are looking for
a* = argmax_a P(a | E, F)

Model 1:
a* = argmax_a Π_{j=1..m} t(f_j | e_{a_j})
Since the factors are independent, each link can be chosen separately:
a_j* = argmax_i t(f_j | e_i),  a* = (a_1*, …, a_m*)
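Because Model 1 factors over the French positions, the Viterbi alignment needs no search, just a per-word argmax. A sketch (the small `t` table is a hypothetical example):

```python
def best_alignment(F, E, t):
    """Viterbi alignment under Model 1: a_j* = argmax_i t(f_j | e_i).
    E[0] is the NULL word; returns one English index per French word."""
    return [max(range(len(E)), key=lambda i: t.get((f, E[i]), 0.0))
            for f in F]

# Hypothetical trained translation probabilities.
t = {("la", "the"): 0.9, ("maison", "house"): 0.8, ("la", "house"): 0.1}
print(best_alignment(["la", "maison"], ["NULL", "the", "house"], t))  # [1, 2]
```

For Models 2 and up the distortion terms still factor per position, but for fertility-based models (3–5) the Viterbi alignment requires heuristic search.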
Clump-based SMT
• The unit of translation is a clump.
• Training stage:
  – Word alignment
  – Extracting clump pairs
• Decoding stage:
  – Try all segmentations of the src sent and all the allowed permutations
  – For each src clump, try the top-N tgt clumps
  – Prune the hypotheses
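The decoding loop above can be sketched in simplified form. This toy decoder makes two strong simplifying assumptions not in the slide: it only considers monotone order (no permutations, so no pruning is needed), and it scores with clump translation probabilities alone (no language model). The phrase table is hypothetical:

```python
def monotone_decode(src, phrase_table, max_len=3):
    """Try every segmentation of src into clumps of up to max_len words,
    translating monotonically; return the best (translation, score).
    phrase_table: {source_clump_tuple: [(target_string, prob), ...]}."""
    best = {0: ("", 1.0)}                 # best hypothesis covering src[:i]
    for i in range(1, len(src) + 1):
        for k in range(max(0, i - max_len), i):
            clump = tuple(src[k:i])
            if k not in best or clump not in phrase_table:
                continue
            prefix, score = best[k]
            for tgt, p in phrase_table[clump]:
                cand = ((prefix + " " + tgt).strip(), score * p)
                if i not in best or cand[1] > best[i][1]:
                    best[i] = cand
    return best.get(len(src))

# Hypothetical clump pairs: the two-word clump outscores composing
# the single-word translations.
table = {("la",): [("the", 0.5)],
         ("maison",): [("house", 0.4)],
         ("la", "maison"): [("the house", 0.6)]}
print(monotone_decode(["la", "maison"], table))  # ('the house', 0.6)
```

A real decoder interleaves segmentation, reordering, and language-model scoring in one beam search, pruning hypotheses as the slide describes.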
Transfer-based MT
• Analysis, transfer, generation:
  – Example: (Quirk et al., 2005)
  1. Parse the source sentence
  2. Transform the parse tree with transfer rules
  3. Translate the source words
  4. Get the target sentence from the tree
• Translation as parsing:
  – Example: (Wu, 1995)
Hybrid approaches
• Preprocessing with transfer rules: (Xia and McCord, 2004), (Collins et al., 2005)
• Postprocessing with taggers, parsers, etc: JHU 2003 workshop
• Hierarchical phrase-based model: (Chiang, 2005)
• …
Other topics
Other issues
• Resources
  – MT for low-density languages
  – Using comparable corpora and Wikipedia
• Special translation modules
  – Identifying and translating named entities and abbreviations
  – …
To build an MT system (1)
• Gather resources
  – Parallel corpora, comparable corpora
  – Grammars, dictionaries, …
• Process data
  – Document alignment, sentence alignment
  – Tokenization, parsing, …
To build an MT system (2)
• Modeling
• Training
  – Word alignment and extracting clump pairs
  – Learning transfer rules
• Decoding
  – Identifying entities and translating them with special modules (optional)
  – Translation as parsing, or parse + transfer + translate
  – Segmenting the src sentence, replacing src clumps with target clumps, …
To build an MT system (3)
• Post-processing
  – System combination
  – Reranking
• Using the system for other applications:
  – Cross-lingual IR
  – Computer-assisted translation
  – …
Misc
• Grades
  – Assignments (hw1–hw3): 30%
  – Class participation: 20%
  – Project:
    • Presentation: 25%
    • Final paper: 25%