Human Evaluation of Machine Translation Systems

MODL5003 Principles and applications of machine translation
Lecture 13/03/2006

Bogdan Babych
bogdan@comp.leeds.ac.uk

(Slides: Debbie Elliott, Tony Hartley)

Outline
• MT Evaluation – general perspective
• Purposes of MT evaluation
• Why evaluating translation quality is difficult
• A brief history of MT evaluation
• Examples of MT evaluation methods for users
• Where next?

MT Evaluation – a big space
• Requirements
  – Task: assimilation, dissemination, ...
  – Text: type, provenance, ...
  – User: translators, consumers, ...
• Quality attributes
  – Internal: architecture, resources, ...
  – External: readability, fidelity, well-formedness, ...

What is evaluated
• “It looks good to me” evaluation
• Test suites (see the sketch below)
  – Syntactic coverage, degradation
• Corpus-based evaluation
  – Real texts (you don't have to read them all!)
  – Coverage of typical problems
  – Interaction between different levels
  – General performance (bird's-eye view)
• Aspects more/less important for overall quality
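
To make the test-suite idea concrete, here is a minimal sketch of an automated coverage check, assuming a `translate` callable and a handful of invented sentence pairs; real suites are much larger and target specific grammatical constructions.

```python
# Minimal sketch of a test-suite check for syntactic coverage.
# The sentence pairs and expected substrings are invented for
# illustration; a real suite targets specific constructions.

test_suite = [
    # (source sentence, substring expected in the English output)
    ("Der Hund beisst den Mann.", "bites the man"),   # object case
    ("Der Mann wird gebissen.", "is bitten"),         # passive voice
]

def run_test_suite(translate, suite):
    """Run `translate` (any callable from source string to target
    string) over the suite and report which constructions fail."""
    failures = []
    for source, expected in suite:
        output = translate(source)
        if expected not in output.lower():
            failures.append((source, expected, output))
    print(f"{len(suite) - len(failures)}/{len(suite)} constructions covered")
    return failures
```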

Purposes of MT evaluation
International Standards for Language Engineering (ISLE)
http://www.mpi.nl/ISLE/
Framework for the Evaluation of Machine Translation in ISLE (FEMTI)
http://www.issco.unige.ch/projects/isle/femti/framed-glossary.html

Defines 7 types of MT evaluation:
1. Feasibility testing
2. Requirements elicitation
3. Internal evaluation
4. Diagnostic evaluation
5. Declarative evaluation
6. Operational evaluation
7. Usability evaluation

Purposes of MT evaluation
1. Feasibility testing
“An evaluation of the possibility that a particular approach has any potential for success after further research and implementation.” (White 2000)
(E.g. sub-problems connected to a particular language pair)

Purpose: To decide whether to invest in further research into a particular approach

For: Researchers, sponsors of research

Purposes of MT evaluation
2. Requirements elicitation
Researchers and developers create prototypes designed to demonstrate particular functional capabilities that might be implemented.

Purpose: To elicit reactions from potential investors before implementing new approaches

For: Researchers and developers, project managers, end-users

Purposes of MT evaluation
3. Internal evaluation
Researchers and developers test components of a prototype or pre-release system. This can involve the use of test suites to evaluate output quality during the course of system modifications.

Purpose:
• To measure how well each component performs its function
• To test linguistic coverage (that a new grammar rule works in all circumstances)
• Iterative testing: to check that particular modifications do not have adverse effects elsewhere

For: Researchers, developers, investors

Purposes of MT evaluation
4. Diagnostic evaluation
Researchers and developers of prototype systems evaluate functionality characteristics and analyse intermediate results produced by the system.

Purpose: To discover why a system did not give the expected results

For: Researchers and developers

Purposes of MT evaluation
5. Declarative evaluation
Evaluators rate the quality of MT output.

Purpose:
• To measure how well a system translates
• To measure fidelity (how much of the source text content is correctly conveyed in the target text)
• To measure the fluency of the target text
• To measure the usability of MT output for a particular purpose
• To evaluate a system's improvability (to what extent can dictionary update improve output quality?)
• To help decide which system to buy
• To indicate whether buying a system will be cost-effective (will post-editing MT output be cheaper than translating from scratch?)

For: End-users, researchers, developers, managers, investors, vendors

Purposes of MT evaluation
6. Operational evaluation
Managers calculate purchase and running costs and compare them with benefits.

Purpose: To determine the cost-benefit of an MT system in a particular operational environment, and whether a system will serve its required purpose

For: Managers, investors, vendors

Purposes of MT evaluation
7. Usability evaluation
• Evaluators test how easy the application is to use
• Systems are evaluated using questionnaires on usability
• Evaluators may record how long it takes to complete particular tasks

Purpose:
• To measure how useful the product will be for the end-user in a specific context
• To evaluate user-friendliness

For: End-users, researchers, developers, managers, investors, vendors

Why evaluating translation quality is difficult
• No perfect standard exists for comparison
• Scoring is subjective (so several evaluators and texts are needed)
• The "training effect" can influence results
• Bilingual evaluators or human reference translations are usually required
• Different evaluations are needed depending on the use of MT output (e.g. filtering, gisting, information gathering, post-editing for internal or external use)

A brief history of MT evaluation: the 1950s and 1960s
• 1954: First public demonstration of MT (Georgetown University/IBM)
• Research in the USA, Western Europe, the Soviet Union and Japan
• 1966: ALPAC Report (funded by US Government sponsors of MT to advise on further R&D) ...
  – advised against further investment in MT
  – concluded that MT was slower, less accurate and more expensive than human translation
  – recommended research into:
    - practical methods for the evaluation of translations
    - evaluation of the quality and cost of various sources of translations
    - evaluation of the relative speed and cost of various sorts of machine-aided translation

A brief history of MT evaluation: the 1970s and 1980s
• 1976: The EC bought a version of Systran and, in 1978, began to develop its own system, Eurotra
• The EC needed recommendations for evaluation: the "Van Slype report" (Critical Methods for Evaluating the Quality of Machine Translation), published in 1979
  Aims of the report:
  – to establish the state of MT evaluation
  – to advise the EC on evaluation methodology and research
  – to provide examples of evaluation methods and their applications
  Available online: http://issco-www.unige.ch/projects/isle/van-slype.pdf
• 1980s: Greater need for MT evaluation: MT attracting commercial interest, tailor-made systems designed for large corporations
• 1987: First MT Summit (an opportunity to publish research on evaluation)

A brief history of MT evaluation: the 1990s
• 1992: AMTA workshop "MT Evaluation: Basis for Future Directions". The JEIDA Report (Japan Electronic Industry Development Association), Methodology and Criteria on Machine Translation Evaluation, was presented; it stressed the importance of judging systems according to context of use and user requirements
• 1993: Special issue of the Machine Translation journal devoted to MT evaluation
• 1992-1994: DARPA (Defense Advanced Research Projects Agency) MT evaluations
• 1993-1999: EAGLES (Expert Advisory Group on Language Engineering Standards) set up by the European Commission. One of its aims: to propose standards, guidelines and recommendations for good practice in the evaluation of language engineering products
  The EAGLES 7-step recipe for evaluation:
  http://www.issco.unige.ch/projects/eagles/ewg99/7steps.html

The ISLE Project and FEMTI
International Standards for Language Engineering
Framework for the Evaluation of Machine Translation in ISLE
• ISLE Evaluation Working Group set up in response to EAGLES
• Funded by the EC, the National Science Foundation of the USA and the Swiss Government
• Established a classification scheme of quality characteristics of MT systems and a set of measures to use when evaluating these characteristics
• Scheme designed to help developers, users and evaluators to select evaluation criteria according to their needs
• Workshops organised to involve hands-on evaluation exercises to test the reliability of metrics
• Latest research involves the investigation of automated evaluation methods: quicker and cheaper (see the sketch below)
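
As a concrete illustration of what automated methods compute, here is a minimal sketch of a BLEU-style n-gram precision against a single human reference; it is a generic toy score under simplifying assumptions, not any specific metric studied in ISLE. Real metrics add brevity penalties, multiple references and other refinements.

```python
# Minimal sketch of automated MT evaluation by n-gram overlap with a
# human reference translation (simplified, BLEU-like clipped precision).

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision(candidate, reference, max_n=2):
    """Average clipped n-gram precision of `candidate` against `reference`."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        matched = sum(min(count, r[g]) for g, count in c.items())
        scores.append(matched / max(sum(c.values()), 1))
    return sum(scores) / len(scores)

print(overlap_precision("the cat sat on the mat",
                        "the cat is on the mat"))  # ~0.72
```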

Evaluation methods: Carroll 1966
Source: Carroll, J. B. (1966). An experiment in evaluating the quality of translations. In Pierce, J. (Chair), Language and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC). Publication 1416. National Academy of Sciences, National Research Council, pp. 67-75. http://www.nap.edu/books/ARC000005/html/

• Evaluation of scientific Russian texts translated into English
• 3 human translations and 3 machine translations of 4 texts evaluated
• Evaluators: 18 monolingual English speakers and 18 native English speakers with a good understanding of scientific Russian

Intelligibility:
• Each sentence scored on a 9-point scale with no reference to the source text

Informativeness (fidelity):
• Original Russian sentences rated for informativeness compared with the translation
• Monolinguals used human reference translations instead of source texts for comparison
(a score-aggregation sketch follows below)
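
A minimal sketch of how sentence-level ratings of this kind can be aggregated to compare systems, assuming simple averaging over evaluators and sentences; the systems and scores are invented, and this does not reconstruct Carroll's actual statistical analysis.

```python
# Minimal sketch of aggregating Carroll-style sentence ratings:
# several evaluators score each sentence (here on a 9-point
# intelligibility scale) and systems are compared by mean score.
# All ratings below are invented for illustration.

from statistics import mean

# ratings[system][sentence_id] = scores from different evaluators
ratings = {
    "human_translation_1": {0: [9, 8, 9], 1: [8, 9, 9]},
    "mt_system_1":         {0: [5, 4, 6], 1: [3, 5, 4]},
}

def system_score(per_sentence):
    """Mean over sentences of the mean evaluator rating per sentence."""
    return mean(mean(scores) for scores in per_sentence.values())

for system, per_sentence in ratings.items():
    print(f"{system}: {system_score(per_sentence):.2f}")
```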

Evaluation methods: Carroll 1966
Extracts from the 9-point intelligibility scale

9. Perfectly clear and intelligible. Reads like ordinary text: has no stylistic infelicities

5. The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly "noise" through which the main idea is still perceptible

1. Hopelessly unintelligible. It appears that no amount of study and reflection would reveal the thought of the sentence.

Evaluation methods: Carroll 1966

Rating original sentences
Extracts from the 10-point informativeness scale

9. Extremely informative. Makes "all the difference in the world" in comprehending the meaning intended. (A rating of 9 should always be assigned when the original completely changes or reverses the meaning conveyed by the translation)

4. In contrast to 3, adds a certain amount of information about the sentence structure and syntactical relationships; it may also correct minor misapprehensions about the general meaning of the sentence or the meaning of individual words

0. The original contains, if anything, less information than the translation. The translator has added certain meanings, apparently to make the passage more understandable.

Evaluation methods: Crook & Bishop 1979

Source: Crook & Bishop (reported by T. C. Halliday). Measurement of readability by the cloze test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 65.

• Evaluators rate the readability of translations using a cloze test
• Human and machine translations produced
• Every eighth word of the machine translation omitted (see the sketch below)
• Evaluators fill in the gaps
• The more intelligible the MT output, the easier the "test" is to complete
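
The deletion-and-scoring procedure can be sketched directly in code: blank out every eighth word, then score evaluators by the share of gaps they restore. Exact-match scoring is an assumption here; in practice acceptable synonyms may also be credited.

```python
# Minimal sketch of a cloze test as described above.

def make_cloze(text, step=8):
    """Return the gapped text and the list of removed words."""
    words = text.split()
    removed = []
    for i in range(step - 1, len(words), step):  # every `step`-th word
        removed.append(words[i])
        words[i] = "_____"
    return " ".join(words), removed

def cloze_score(answers, removed):
    """Share of gaps restored exactly (real scoring may accept synonyms)."""
    correct = sum(a.lower() == r.lower() for a, r in zip(answers, removed))
    return correct / len(removed) if removed else 0.0
```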

Evaluation methods: Sinaiko 1979

Source: Sinaiko, H. W. Measurement of usefulness by performance test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 91.

• Aim: to evaluate the English-Vietnamese LOGOS system
• All source texts contained instructions
• Evaluators (native speakers of the TL) used machine-translated instructions to perform tasks
• Errors in performance were measured (a weighting system was used; see the sketch below)
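
A minimal sketch of a weighted performance-error score of the kind described above; the error categories and weights are invented for illustration and do not reproduce Sinaiko's actual weighting scheme.

```python
# Minimal sketch of weighted performance-error scoring.
# Categories and weights are hypothetical.

ERROR_WEIGHTS = {
    "task_failed": 5,    # instruction could not be carried out at all
    "wrong_action": 3,   # evaluator performed the wrong step
    "hesitation": 1,     # step completed, but with difficulty
}

def performance_score(observed_errors):
    """Sum of weighted errors; lower suggests more usable output.
    `observed_errors` maps an error category to its count."""
    return sum(ERROR_WEIGHTS[cat] * n for cat, n in observed_errors.items())

print(performance_score({"task_failed": 1, "hesitation": 4}))  # -> 9
```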

Evaluation methods: Nagao 1985
Source: Nagao, M., Tsujii, J. & Nakamura, J. (1985). The Japanese government project for machine translation. Computational Linguistics 11, 91-109.

• Aim: to test the feasibility of using MT to translate abstracts of scientific papers
• 1,682 sentences from a Japanese scientific journal were machine translated into English
• Intelligibility: 2 native speakers of English (with no knowledge of Japanese) scored each sentence using a 5-point scale
• Accuracy: 4 Japanese-English translators evaluated how much of the meaning of the original text was conveyed in the MT output

Evaluation methods: Nagao 1985

Extracts from the 5-point intelligibility scale

1. The meaning of the sentence is clear, and there are no questions. Grammar, word usage, and style are all appropriate, and no rewriting is needed.

3. The basic thrust of the sentence is clear, but the evaluator is not sure of some detailed parts because of grammar and word usage problems. The problems cannot be resolved by any set procedure; the evaluator needs the assistance of a Japanese evaluator to clarify the meaning of those parts in the Japanese original.

5. The sentence cannot be understood at all. No amount of effort will produce any meaning.

Evaluation methods: Nagao 1985
Extracts from the 7-point accuracy scale

0. The content of the input sentence is faithfully conveyed to the output sentence. The translated sentence is clear to a native speaker and no rewriting is needed.

3. While the content of the input sentence is generally conveyed faithfully to the output sentence, there are some problems with things like relationships between phrases and expressions, and with tense, voice, plurals, and the positions of adverbs. There is some duplication of nouns in the sentence.

6. The content of the input sentence is not conveyed at all. The output is not a proper sentence; subjects and predicates are missing. In noun phrases, the main noun (the noun positioned last in the Japanese) is missing, or a clause or phrase acting as a verb and modifying a noun is missing.

Evaluation methods: DARPA 1992-94
Adequacy, fluency and informativeness

Sources:
White, J., O'Connell, T. & O'Mara, F. (1994). The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the 1994 Conference, Association for Machine Translation in the Americas, Columbia, Maryland.
White, J. (Forthcoming). How to evaluate machine translation. In H. Somers (ed.), Machine Translation: A Handbook for Translators. Benjamins, Amsterdam.

• Aim: to compare prototype systems funded by DARPA
• Evaluators: 100 monolingual native English speakers
• The largest evaluation resulted in:
  – a corpus of 100 news articles (of c. 400 words) in each SL: French, Spanish and Japanese
  – 2 English human translations of each
  – English machine translations of each text by several systems
  – detailed evaluation results

Evaluation methods: DARPA 1992-94
DARPA: Adequacy
Segments of MT output were compared with equivalent human reference translations and scored on a 5-point scale according to how much of the original content was preserved (regardless of imperfect English):

5 – All meaning expressed in the source fragment appears in the translation fragment
4 – Most of the source fragment meaning is expressed in the translation fragment
3 – Much of the source fragment meaning is expressed in the translation fragment
2 – Little of the source fragment meaning is expressed in the translation fragment
1 – None of the meaning expressed in the source fragment is expressed in the translation fragment

Evaluation methods: DARPA 1992-94
DARPA: Fluency
• Each sentence scored for intelligibility without reference to the source text or human reference translation
• Simple 5-point scale used

DARPA: Informativeness
• Designed to test whether enough information was conveyed in MT output to enable evaluators to answer questions on its content
• Each translation accompanied by 6 multiple-choice questions on content
• 6 choices for each question (see the scoring sketch below)
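
A minimal sketch of scoring this informativeness test, assuming the score is simply the share of the six questions answered correctly (the slides do not specify the exact scoring formula); the answer keys are invented for illustration.

```python
# Minimal sketch of informativeness scoring: fraction of the
# comprehension questions an evaluator answers correctly.

import random

def informativeness(answers, answer_key):
    """Fraction of questions answered correctly."""
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / len(answer_key)

answer_key = ["b", "e", "a", "c", "f", "d"]   # 6 questions, 6 choices each
print(informativeness(["b", "e", "c", "c", "f", "a"], answer_key))  # -> 0.667

# With 6 choices per question, pure guessing scores about 1/6 on
# average, a useful baseline when interpreting results.
print(sum(random.choice("abcdef") == k for k in answer_key) / 6)
```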

DARPA-inspired evaluation methods
Many subsequent evaluations have followed in the footsteps of DARPA ...

Fluency and adequacy using 5-point scales:
Source: Elliott, D., Atwell, E. & Hartley, A. (2004). Compiling and using a shareable parallel corpus for machine translation evaluation. In Proceedings of the Workshop on the Amazing Utility of Parallel and Comparable Corpora, Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.

Usability: 5-point scale

Where next?
• Beyond similarity metrics
  – FEMTI offers a rich palette of techniques
• Beyond adequacy and fluency
  – Too generic/abstract for specific tasks?
  – Consider MT output in its own right
• Beyond conventional uses of MT as surrogate human translation (emulation)
  – MT as a component in a workflow

Restore a sense of purpose

"Texts are meant to be used. There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended." (Sager 1989: 91)

Revisit MT proficiency (White 2000)
• View MT output as a genre
  – Characterise inadequacy, disfluency, ill-formedness
• Embed MT and adapt (to) the environment
  – in IE (information extraction), CLIR (cross-language information retrieval), CLQA (cross-language question answering), Speech2Speech translation
  – in pre- and post-editing

Human Evaluation of Machine Translation Systems

Any questions…?