Machine Translation Evaluation for MT Development: Improving MT Quality with Evaluation-Oriented Methods
Bogdan Babych
Centre for Translation Studies
University of Leeds
QTLaunchPad Workshop @ EAMT 2014
Overview
• MT evaluation & MT development: science vs. engineering perspectives
• Re-engineering models of automated evaluation for MT improvement
• Limitations of automated metrics
• Usability of MT for usage scenarios
  o engineers' favorite toy vs. translators' tool
• Future: realistic scenarios & evaluation-guided MT
MT evaluation & MT development
• MT evaluation: a field in its own right
  o 'bigger' than MT development?
  o 'science' methodology: quantifying & understanding the natural phenomena behind engineering advances
  o understanding well-formedness, the mechanisms of language, communication, cognition; how to move forward
  o engineering context: a test for theories & models
• Human evaluation: the ultimate benchmark?
• Absolute vs. purpose-related quality
  o skopos of human translation
  o technical definitions of quality & MT performance
Automated MT evaluation: a source of models for MT
• What works for automated evaluation also improves MT
  o evaluation uses computable text features that characterize MT quality
  o features are calibrated against human judgments or task performance (see the sketch below)
  o new models based on the same features can improve MT quality
• Examples: Named Entities, Information Extraction, MWE & Terminology Translation
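As an illustration of the calibration step, a minimal sketch in Python, with invented feature values and human scores (the three features and the scikit-learn linear model are assumptions for illustration, not the method used in the experiments reported here):

```python
# Minimal sketch: calibrating computable MT-output features against
# human adequacy judgments. All values below are invented.
from sklearn.linear_model import LinearRegression

# One row per MT output: [NE-match rate, content-word overlap, length ratio]
features = [[0.82, 0.61, 0.97],
            [0.55, 0.40, 1.10],
            [0.91, 0.75, 1.01],
            [0.33, 0.28, 0.80]]
adequacy = [4.1, 2.9, 4.6, 1.8]   # human judgments on a 1-5 scale

model = LinearRegression().fit(features, adequacy)
print(model.predict([[0.70, 0.55, 1.00]]))  # predicted adequacy for a new output
```

The same fitted weights then indicate which features are worth building into the MT system itself.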
NER: evaluation and improvement
• MT errors more often destroy relevant contexts for NEs than create spurious ones
• The difficulty automatic tools have in finding NEs is roughly proportional to relative "quality" (the amount of MT degradation)
• NER system (ANNIE, www.gate.ac.uk):
  o the number of extracted Organization Names gives an indication of Adequacy
ORI: … le chef de la diplomatie égyptienne
HT: the <Title>Chief</Title> of the <Organization>Egyptian Diplomatic Corps</Organization>
MT (Systran): the <JobTitle>chief</JobTitle> of the Egyptian diplomacy
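A minimal sketch of the underlying comparison, assuming a generic tagger (`ner_tag` is a hypothetical placeholder returning (entity, type) pairs, not the ANNIE API): count the Organization names found in the MT output against those found in a human translation.

```python
# Sketch: NE precision/recall as an adequacy indicator.
# ner_tag is a hypothetical NER function standing in for ANNIE.
def org_overlap(mt_text, ref_text, ner_tag):
    mt_orgs  = {e for e, t in ner_tag(mt_text)  if t == "Organization"}
    ref_orgs = {e for e, t in ner_tag(ref_text) if t == "Organization"}
    if not ref_orgs:
        return None                      # nothing to measure against
    precision = len(mt_orgs & ref_orgs) / len(mt_orgs) if mt_orgs else 0.0
    recall    = len(mt_orgs & ref_orgs) / len(ref_orgs)
    return precision, recall
```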
[Figure: MT Evaluation with Named Entity Recognition — counts of extracted NEs per type (Organization, Title, JobTitle, {Job}Title, FirstPerson, Person, Date, Location, Money, Percent) for the Reference, Expert, Candide, Globalink, Metal, Reverso and Systran outputs; y-axis: 0–700]
[Figure: Precision & Recall of Organization Names vs. Human Adequacy & Fluency — P and R of Organization Names for HT-expert, HT-reference, Candide, Globalink, MS, Reverso and Systran; y-axis: 0–0.7]

Correlation with human judgments:
            BLEU/Ade   BLEU/Flu   R(Org)/Ade   R(Org)/Flu
r           0.9535     0.995      0.8682       0.9806
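Such system-level correlations are straightforward to reproduce once per-system metric scores and human judgments are available; a minimal sketch with invented values (the scores below are not the ones behind the table above):

```python
# Sketch: correlating system-level metric scores with human judgments.
from scipy.stats import pearsonr

bleu     = [0.28, 0.22, 0.31, 0.19, 0.35]  # invented per-system BLEU scores
adequacy = [3.1, 2.6, 3.4, 2.2, 3.8]       # invented per-system human adequacy

r, p = pearsonr(bleu, adequacy)
print(f"r = {r:.4f}, p = {p:.4f}")
```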
NER for MT Improvement: RBMT (later generalized to SMT & Hybrid MT)
Mark      ProMT 1998 E-R      ProMT 2001 E-F      Systran 2000 E-F
          N     Score         N     Score         N     Score
+1 ×      28    +28.0         23    +23.0         18    +18.0
+0.5 ×    2     +1.0          5     +2.5          24    +12.0
0 ×       4     0             7     0             8     0
–0.5 ×    3     –1.5          1     –0.5          1     –0.5
–1 ×      13    –13.0         14    –14.0         10    –10.0
SUM       50    +14.5         50    +11.0         61    +19.5
Gain            +29%                +22%                +32%
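The gain figure is simply the mark-weighted sum over the evaluated contexts, normalised by their number. A minimal sketch of the computation, reproducing the +29% for ProMT 1998 E-R from the table above:

```python
# Sketch: improvement gain from a distribution of marks {mark: count}.
def gain(marks):
    total = sum(marks.values())                 # number of evaluated contexts
    score = sum(m * n for m, n in marks.items())  # mark-weighted sum
    return score / total

# ProMT 1998 E-R column: 28 * +1, 2 * +0.5, 4 * 0, 3 * -0.5, 13 * -1
print(f"{gain({1.0: 28, 0.5: 2, 0.0: 4, -0.5: 3, -1.0: 13}):+.0%}")  # +29%
```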
NER improvement example
Original:
The agreement was reached by a coalition of four
of Pan Am's five unions.
Baseline translation (ProMT E-F):
L'accord a été atteint par une coalition de quatre de casserole cinq unions d'Am.
('The agreement was reached by a coalition of four of saucepan five unions of Am.')
DNT-processed translation:
L'accord a été atteint par une coalition de quatre
de cinq unions de Pan Am.
(‘The agreement was reached by a coalition of
four of five unions of Pan Am.’)
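A minimal sketch of the DNT ('do not translate') preprocessing illustrated above, assuming a generic `translate` function and a list of NEs supplied by an NER step (both are placeholders, not a specific system's API):

```python
# Sketch: DNT preprocessing — mask NEs before MT, restore them afterwards.
# In practice the placeholders must be tokens the MT engine passes through
# unchanged; __DNTi__ is just an illustrative choice.
def translate_with_dnt(source, entities, translate):
    masked = source
    for i, ent in enumerate(entities):
        masked = masked.replace(ent, f"__DNT{i}__")   # shield the NE from MT
    translated = translate(masked)
    for i, ent in enumerate(entities):
        translated = translated.replace(f"__DNT{i}__", ent)  # restore verbatim
    return translated

# e.g. translate_with_dnt("... four of Pan Am's five unions.", ["Pan Am"], mt)
```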
MWEs: MT and automated evaluation of concordances
• Sent-N (de-ori): source text
• Sent-N (en-mt): machine translation output
• Sent-N (en-ht): human translation
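As a sketch of how such concordance triples can be scored, assuming pre-aligned (source, MT, HT) sentence triples and NLTK's sentence-level BLEU (the triple format and the `mwe_bleu` helper are illustrative assumptions): collect every sentence whose human translation contains a given MWE and average the scores, which yields the per-MWE values plotted below.

```python
# Sketch: average sentence-level BLEU over the concordance of one MWE.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mwe_bleu(mwe, triples):
    """triples: iterable of (source, mt_output, human_translation) strings."""
    smooth = SmoothingFunction().method1   # avoid zero scores on short sentences
    scores = [sentence_bleu([ht.split()], mt.split(), smoothing_function=smooth)
              for src, mt, ht in triples
              if mwe in ht.lower()]        # filter the concordance by the MWE
    return sum(scores) / len(scores) if scores else None
```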
Systematically mistranslated MWEs: x = log(freq); y = exp(BLEU)
[Scatter plot of MWEs by corpus frequency and BLEU; labelled points include: depleted uranium, San Suu Kyi, hazardous substances, vis-à-vis, electrical and electronic, Nobel prize winner, death penalty, arms exports, animal feed, drink driving, renewable energy sources, foot and mouth, raw materials, mad cow disease, sized enterprises, transportation of radioactive, atomic energy agency, intellectual property, animal welfare, constitutional legality, Bosnia Herzegovina, interception of telecommunications, Irish referendum, nuclear materials, lawful interception]
Evaluation feature potential vs. useful MT models
• Evaluation works as feature selection
  o e.g., the performance of Information Extraction (template filling)
  o slot-filling performance on MT output is indicative of MT quality
• New models for MT require more work
  o Perpetrator, Target, Instrument… slots are filled from multiple sentences
  o terrorist killing of Attorney General
  o != killing of a terrorist (by analogy with "tourist killing" or "farmer killing"); = killing by terrorists
  o ? "…just pretending to be a terrorist killing war machine…"
  o ? "… who is working for the police on a terrorist killing mission…"
  o ? "…merged into the "TKA" (Terrorist Killing Agency), they would … proceed to wherever terrorists operate and kill them…"
  o "X's defeat" == X's loss
  o "X's defeat of Y" == X's victory
• ORI: Swedish playmaker scored a hat-trick in the 4-2 defeat of Heusden-Zolder
  o … its defeat of last night
  o … their FA Cup defeat of last season
  o … their defeat of last season's Cup winners
  o … last season's defeat of Durham
• 'Non-monotonic' filling from text could resolve such ambiguity
  o semantic role labelling beyond the sentence level
• Unclear how to use text-level disambiguation in MT models
Limitations of metrics
• 'Used' or 'ideologically related' evaluation metrics overestimate MT performance
  o e.g., BLEU (Callison-Burch et al., 2006)
  o a 'surprise' metric is needed
• Metrics require re-calibration to determine usability levels (text types, target languages)
  o regression parameters for predicting Adequacy/Fluency
• Metric sensitivity depends on the quality level
  o BLEU correlates less well for MT between closely related languages & for non-native HT
Non-native HT vs. MT
Sensitivity of BLEU vs. NER-based metrics: MT for distant & closely-related languages
Quality parameters for MT usage scenarios
• "There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended." (Sager 1989: 91)
• Purpose-oriented MT quality definition
  o QTLaunchPad's flagship examples: post-editing…
Usage scenarios and usability parameters
Moving beyond Ade, Flu, Inf … & BLEU
Evaluating & Improving MT for different target scenarios
MT usage scenarios
• Place in the workflow
  o pre-/post-editing
  o controlled authoring, sublanguage
  o fully automatic & unrestricted authoring
• Purpose & deadlines (a continuous range)
  o high-quality publication
  o internal communication (one-off translation)
  o comprehension, getting information (on-line reading)
  o performing tasks (following technical instructions, etc.)
  o multilingual research (legal, medical advice, events)
  o automatic processing (Information Extraction, Text Classification)
Evaluation metrics for usability levels and thresholds?
• If cheaper etc. than task-based evaluation, these would be of real interest to the translation industry
• Establishing a project-specific usability threshold for MT system performance
  o a metric should match the usage scenario
  o the minimal value required to benefit from MT
• Predicting productivity gains from metrics
  o different thresholds for different scenarios
  o the comparative usability of MT systems across scenarios may not coincide
• Usability thresholds with BLEU? No, unless we calibrate for a specific TL & text type (see the sketch below)
[Figure: z-scores for the intercepts of regression lines, BLEU vs. Adequacy, by TL & text type (line = significance at 99.9%)]
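A minimal sketch of such a calibration, with invented per-document scores: fit adequacy against BLEU for one TL/text-type combination, then read off the BLEU value that corresponds to the adequacy level a scenario requires. One regression per combination makes the shifting intercepts visible.

```python
# Sketch: calibrating a scenario-specific BLEU usability threshold.
# All scores below are invented; fit one regression per TL / text type.
from scipy.stats import linregress

bleu     = [0.15, 0.22, 0.28, 0.35, 0.41]   # per-document BLEU
adequacy = [2.0, 2.6, 3.0, 3.7, 4.1]        # per-document human adequacy (1-5)

fit = linregress(bleu, adequacy)
required = 3.5                               # adequacy the scenario demands
threshold = (required - fit.intercept) / fit.slope
print(f"adequacy = {fit.slope:.2f} * BLEU + {fit.intercept:.2f}; "
      f"BLEU threshold = {threshold:.2f}")
```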
TAUS Dynamic Quality Framework: PE productivity test
TAUS DQF reporting
Post-editing: productivity
• Post-editing increases translators' productivity (tested on IT documentation; no results for other combinations of translation direction, subject domain and text type)
  o improvements of 74% on average
  o the figure varies widely between translators (from 20% to 131%), depending on their attitude to and experience with post-editing MT output
  o Plitt & Masselot (2010)
• Importance of training in acquiring post-editing skills
Productivity = processing speed & cognitive effort
• Time to post-edit single segments
• Fixation intervals detected by an eye-tracking system (O'Brien, 2011)
• Integration of MT into TM settings: translating unmatched segments (Federico et al., 2012; effort = number of changes)
  o http://amta2012.amtaweb.org/AMTA2012Files/papers/123.pdf
Federico et al., 2012: PE effort
Federico et al., 2012: PE speed
Lessons for translators
• An individual translator's performance is the major factor in the variation in the data
  o training matters
• Impact of subject domain & translation direction
• What lies behind the variation? The importance of understanding the purpose: publication or in-house use?
  o semantically correct but poor in style?
  o semantically, stylistically & terminologically correct?
• Attitude issues
  o spending time on a good-enough vs. a perfect translation
  o quality standards influenced by the suggestions received?
• Best approach: purpose-based MT evaluation(?)
  o 'chess clock' strategy: acceptable quality for the purpose stated in the brief & within the time allowed
  o treating MT-translated texts in their own right: minimal changes for acceptable quality
Automated metric for usability levels
• Unresolved problem: usability-oriented automated MT evaluation
  o task-based evaluation metrics (success in using MT, not proximity to a reference)
  o calibrated for realistic usage scenarios
  o a foundation for new MT models
• Candidates
  o performance on automated annotation tasks (parsing, terminology, data mining)
  o selective or weighted lexical overlap (see the sketch below)
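A minimal sketch of the second candidate (the function and its weighting scheme are illustrative assumptions, not an established metric): overlap between MT output and reference counted only over selected words, each weighted, e.g. by an idf-style salience score.

```python
# Sketch: selective/weighted lexical overlap between MT output and reference.
# `weights` maps selected words (e.g. content words) to salience weights;
# words outside the selection get weight 0 and are effectively ignored.
def weighted_overlap(mt_tokens, ref_tokens, weights):
    mt, ref = set(mt_tokens), set(ref_tokens)
    shared = sum(weights.get(w, 0.0) for w in mt & ref)
    total  = sum(weights.get(w, 0.0) for w in ref)
    return shared / total if total else 0.0

# e.g. weighted_overlap(["the", "treaty", "signed"],
#                       ["the", "treaty", "ratified"],
#                       {"treaty": 2.3, "ratified": 1.7, "signed": 1.5})
```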
Evaluation of MT vs. for MT
• MT resource evaluation
  o understanding the composition of the training corpora
• Performance benchmarking on text types
  o understanding which training resources perform best for which project
  o project-specific model creation, selection and combination, optimized for a specific usage scenario
Evaluation-guided MT
• Evaluation has always been part of the development workflow; nowadays there are new ways of integrating it
• Systematic prioritizing of models, features and workflow
  o what translators need: interface, functionality, quality
• Accounting for the different interests of stakeholders
• Provides an understanding of useful linguistic features that can be integrated into new models

"The prima facie case against operational machine translation from the linguistic point of view will be to the effect that there is unlikely to be adequate engineering where we know there is no adequate science." (Martin Kay, 1980)
FT2MT – Project idea: Fast-Track and Fine-Tune MT
Enabling Better Translation "FT2MT"
1. Models: language, translation and reordering models instead of 'raw' translation data
2. Framework: allowing users to try out and optimize combinations of models
3. Data: built on the TAUS Data repository of (now) 54 billion words in 2,200 language pairs
4. Linguistic annotation: rich linguistic annotation of language data helps to match texts with models for training
5. Evaluation: automatic task-oriented evaluation (correlated with human evaluation) to support the selection of model combinations
6. Translator UI & API: an interface for translators/users to test and fine-tune model selection, plus an API for developers to integrate new interfaces
7. Education & outreach: showcases, tutorials, and global open access for all developers, users and researchers
CTS/Leeds, TAUS, U Edinburgh, FBK, Translated, RACAI, Lingenio
Thanks!
• Questions…