
Predicting Human Translation Quality

Lucia Specia

University of Sheffield

QTLaunchPad Workshop, Dubrovnik, 15 June 2014

(Joint work with Kashif Shah)



Outline

1 Quality Estimation

2 Method

3 HT vs MT

4 Data analysis

5 Conclusions






Overview

Translation quality estimation (QE): automatic metrics that provide an estimate of the quality of a translated text

No access to reference translations; for MT systems in use

So far, only applied to machine translated (MT) texts









Applications

Can a reader get the gist?

Is it worth post-editing it?

How much time to fix it?

Can we publish it as is?

Does it need human checking?






Learning

Supervised machine learning to build models based on training data:

annotated with quality labels (human input at “training” time)

described by features

“Quality” defined according to the problem (and data):

Post-editing time for a sentence

MQM issue for a word

Models predict such quality scores
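The learning setup above can be sketched in a few lines. This is only an illustration under stated assumptions: the talk does not name a learning algorithm here, so a k-nearest-neighbour regressor stands in for whatever model is actually used, and the two features, the post-editing times, and the names `knn_predict`, `TRAIN_FEATURES` and `TRAIN_LABELS` are all hypothetical.

```python
import math


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a quality score as the mean label of the k nearest training items."""
    # Euclidean distance between the query and every training feature vector.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical training data: each sentence is described by two features
# (source length and target length in tokens) and annotated with an
# invented post-editing time in seconds.
TRAIN_FEATURES = [(5, 6), (6, 6), (20, 24), (22, 25), (40, 47)]
TRAIN_LABELS = [10.0, 12.0, 45.0, 50.0, 120.0]

# Predict the post-editing time of an unseen sentence from its features.
predicted_time = knn_predict(TRAIN_FEATURES, TRAIN_LABELS, query=(21, 24), k=2)
print(predicted_time)
```

The same pattern applies whichever quality label is chosen: only the annotation attached to each training item changes, not the training and prediction steps.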














Can we predict HT quality?

Key objective in QTLP: (automated) metrics to evaluate and estimate translation quality of human and machine translations

MT quality estimation works well (at sentence-level):

WMT12-14 shared tasks

QuEst framework: www.quest.dcs.shef.uk

Large number of recent papers

Commercial adoption: Multilizer, SDL-LW, Yandex

Question: Can we apply the same framework to predict the quality of human translations?




















Motivation: Automate/sample for quality assurance

Encouraging fact: hard to distinguish MT and HT (EAMT-Tuesday). But:

1 Do (professional) human translators make mistakes?

2 Are HT errors the same/similar to MT errors?

3 Are current quality estimation tools good for HT?

Data analysis to answer these questions: sentence- and (partially) word-level
















































QTLP datasets

Not possible before QTLP: no large enough dataset available with both MTs and HTs

Our data is different from existing ’learner’ corpora:

HTs produced by professionals

2-3 state-of-the-art MT systems (RBMT, SMT, hybrid)

Both HT and MT annotated by professional translators

4 language-pairs

News and ’customer’ data

Datasets also used for WMT14 QE shared tasks






QTLP datasets - sentence-level

Labels:

1 = Perfect translation, no post-editing needed at all

2 = Near miss translation: translation contains a maximum of 2-3 errors, and possibly additional errors that can be easily fixed (capitalisation, punctuation, etc.)

3 = Very low quality translation, cannot be easily fixed

Sentences:

# Source         # HT+MTs   # Target
1,104 English    4          4,416 Spanish
500 English      4          2,000 German
500 German       3          1,500 English
500 Spanish      3          1,500 English



QTLP datasets - word-level

Labels: Core MQM issue types

Sentences: subset of 2s, from 2-3 MT systems + HT

#        Source-target
2,339    English-Spanish
865      English-German
450      German-English
1,050    Spanish-English













Translators are humans...


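Sentence-level: de-en

[Figure: distribution of sentence labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1 and MT-2]

Sentence-level: en-es

[Figure: distribution of sentence labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2 and MT-3]

Sentence-level: en-de

[Figure: distribution of sentence labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1, MT-2 and MT-3]

Sentence-level: es-en

[Figure: distribution of sentence labels (1 - perfect, 2 - few errors, 3 - too bad) for HT, MT-1 and MT-2]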





Word-level: OK vs BAD

Lang     Words tagged
         HT       MT
de-en    808      7420
en-es    8933     48089
en-de    1241     12406
es-en    1206     21818

[Figure: proportion of words tagged OK in HT and in MT, per language pair (de-en, en-es, en-de, es-en)]



Humans and machines are not the same


Word-level: Accuracy vs Fluency

[Figure: proportion of Fluency vs Accuracy issues per language pair (de-en, en-es, en-de, es-en), one panel for HT and one for MT]





Word-level: Types of errors de-en

[Figure: relative frequency of each issue type, MT vs HT. Issue types: Addition, Agreement, Capitalization, Function_words, Grammar, Mistranslation, Morphology, Part_of_speech, Punctuation, Spelling, Style/register, Tense/aspect/mood, Terminology, Typography, Unintelligible, Untranslated, Word_order]





Word-level: Types of errors en-es

[Figure: relative frequency of each issue type, MT vs HT. Issue types: Addition, Capitalization, Function_words, Mistranslation, Omission, Punctuation, Style/register, Terminology, Unintelligible, Word_order]





Word-level: Types of errors en-de

[Figure: relative frequency of each issue type, MT vs HT. Issue types: Accuracy, Addition, Agreement, Capitalization, Function_words, Grammar, Mistranslation, Morphology, Omission, Part_of_speech, Punctuation, Spelling, Style/register, Tense/aspect/mood, Terminology, Typography, Unintelligible, Untranslated, Word_order]





Word-level: Types of errors es-en

[Figure: relative frequency of each issue type, MT vs HT. Issue types: Accuracy, Agreement, Fluency, Grammar, Morphology, Part_of_speech, Spelling, Tense/aspect/mood, Typography, Untranslated]



Humans are harder to predict


[Figure: increase in performance (Delta Baseline-Quest) per language pair (en-de, de-en, en-es, es-en), shown separately for MT and HT]



Humans are harder to predict - Why?

Common (baseline) features (sentence-level):

no. of tokens in the source & target texts

average source token length

average no. of occurrences of target words in target text

no. of punctuation marks in source & target texts

language model probability of source & target texts

avg. no. of translations per source word

% of 1-grams, 2-grams & 3-grams in frequency quartiles 1 & 4 (lower/higher frequency) in source language corpus

% of 1-grams in source text seen in source language corpus

Word-level: requires more labelled data
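To make the sentence-level baseline features above concrete, here is a minimal sketch of the ones that need no external resources. It is not the QuEst implementation: whitespace tokenisation, ASCII-only punctuation counting, and reading "average no. of occurrences of target words" as tokens divided by distinct tokens are all assumptions, and `baseline_features` is a hypothetical name. The language model, translation-count and n-gram frequency features are left out because they require corpora and trained models.

```python
import string


def baseline_features(source, target):
    """Compute a few surface features for one source/target sentence pair."""
    # Assumption: tokens are whitespace-separated.
    src_tokens = source.split()
    tgt_tokens = target.split()

    def count_punct(text):
        # Assumption: only ASCII punctuation characters are counted.
        return sum(1 for ch in text if ch in string.punctuation)

    return {
        "src_tokens": len(src_tokens),
        "tgt_tokens": len(tgt_tokens),
        "avg_src_token_len": (
            sum(len(t) for t in src_tokens) / len(src_tokens) if src_tokens else 0.0
        ),
        # Tokens per distinct token: 1.0 means no word is repeated.
        "tgt_avg_occurrences": (
            len(tgt_tokens) / len(set(tgt_tokens)) if tgt_tokens else 0.0
        ),
        "src_punct": count_punct(source),
        "tgt_punct": count_punct(target),
    }


# Invented example pair, already tokenised with spaces.
features = baseline_features(
    "the cat sat on the mat .",
    "el gato se sienta en el tapete .",
)
for name, value in features.items():
    print(name, value)
```

None of these features looks at whether the target conveys the meaning of the source, so they cannot register a fluent but wrong translation.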






Conclusions

Human translation quality is harder to estimate:

Need more labelled data - hard to collect

Need more linguistically motivated features, to capture e.g. mistranslation - hard to generalise, require tools

On-going work within QTLP: collecting larger sets of annotated data


