This document is part of the Coordination and Support Action “Preparation and Launch of a Large-scale Action for Quality Translation Technology (QTLaunchPad)”. This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 296347.
Supplement 1
Practical Guidelines for the Use of MQM in
Scientific Research on Translation Quality
Author(s): Aljoscha Burchardt and Arle Lommel (DFKI)
Dissemination Level: Public
Date: 19.11.2014
This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Grant agreement no.: 296347
Project acronym: QTLaunchPad
Project full title: Preparation and Launch of a Large-scale Action for Quality Translation Technology
Funding scheme: Coordination and Support Action
Coordinator: Prof. Hans Uszkoreit (DFKI)
Start date, duration: 1 July 2012, 24 months
Distribution: Public
Contractual date of delivery: —
Actual date of delivery: 18 November 2014
Supplement number: 1
Supplement title: Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality
Type: Report
Status and version: Final, v1.0
Number of pages:
Contributing partners: DFKI
Authors: Aljoscha Burchardt, Arle Lommel
EC project officer: Aleksandra Wesolowska

The partners in QTLaunchPad are:
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
Dublin City University (DCU), Ireland
Institute for Language and Speech Processing, R.C. “Athena” (ILSP/ATHENA RC), Greece
The University of Sheffield (USFD), United Kingdom
This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Table of Contents
1 Executive Summary
2 MQM Process
2.1 Selecting a metric
2.2 Selecting an Annotation Environment
2.3 Selection of Annotators and Training
2.4 Evaluation
2.5 Analysis
3 Costs
4 Amount of text required
5 Training materials
5.1 Decision trees
5.1.1 A generalized decision tree
5.2 Annotation guidelines
1 Executive Summary

This report provides practical guidelines for the use of the Multidimensional Quality Metrics (MQM) framework for assessing translation quality in scientific research projects. It does not systematically address the use of MQM in production environments, although notes concerning such environments are provided. It covers the process for using MQM, the costs, the required amounts of text, training methods, and other relevant factors. MQM can provide detailed insights into translation issues/errors at different levels of granularity, down to the word/phrase level, as input for systematic approaches to overcoming translation quality barriers. Like the common practice of post-editing, it requires manual work, which will hopefully become less labor-intensive in the future through (partial) automation.
2 MQM Process

This section outlines the process for using MQM in a research scenario. It covers selection of a metric, training, the evaluation task itself, and analysis of results.
2.1 Selecting a metric

The Multidimensional Quality Metrics (MQM) framework does not provide a translation quality metric; rather, it provides a framework for defining task-specific translation metrics. Thus, rather than speaking of or using MQM itself for a specific quality evaluation task, one uses an MQM-compliant metric. To create an MQM-compliant metric, one must determine which issues will be checked and to what level of granularity. At the coarsest level, it is possible to have an MQM-compliant metric that identifies as few as two error types: Accuracy and Fluency. (If only the target text is evaluated, it is even possible to have a single-issue metric with Fluency alone, but such a metric could not be said to assess translation quality in any meaningful sense.) Generally, however, additional detail is desirable and a more detailed metric is needed. For example, the issue type hierarchy of the metric used for annotating corpus data in the QTLaunchPad project’s shared task can be graphically represented as shown in Figure 1. This particular metric was designed to provide analytic insight into the problems encountered in high-quality MT. With 19 issue types, it is considerably more granular than would be used in many production evaluation environments, but the detail was needed to support the QTLaunchPad evaluation tasks. Note that it extends the MQM issue set by adding three custom subtypes to Function words: Extraneous, Incorrect, and Missing. These issues provide additional insight into one aspect of translation that proved to be particularly difficult for MT.
Figure 1. MQM-compliant error hierarchy for diagnostic MT evaluation
This metric would not be suitable for all cases and is presented here as an example. In general, an MQM-compliant metric designed for a research task should have the following qualities:
• It should be granular enough to address the relevant research questions. For example, a simple Accuracy-Fluency metric that emulates traditional Adequacy/Fluency evaluations in MT research would provide no insight into the specific nature of issues within those categories. The metric selected should therefore be certain to cover the research agenda. (In the case of QTLaunchPad, the research agenda was broad and focused on discovery of patterns, so the metric is fairly complex.)
• The metric should not contain extraneous categories or ask annotators to mark issues irrelevant to the research question. For example, it does not make sense to use Terminology in addition to general Mistranslation when working on news data where no defined terminology exists. Adding categories can increase “noise” in the data and also raises the cost of annotation. However, if there are “borderline” categories that may be relevant, they should be included, since adding them retroactively would generally not be possible.
• The metric should be small enough to be maintained in the memory of the annotator. General psychometric guidelines suggest that categorizations used in evaluation should target six to seven items. For detailed evaluation such a small set may not be possible (the 19 categories of the MQM shared task are probably pushing the outer limit of what is cognitively possible for annotators to keep in mind).
• Annotators must be given heuristics for selecting issues in ambiguous cases. Ways to provide this guidance are covered in Section 5 (Training materials) below.
For translation production evaluation, the QTLaunchPad website’s section on MQM contains useful information on creating relevant metrics based on project specifications. Research projects, by contrast, typically have a clearer set of requirements (those needed to answer the research question at hand), but will also often be more complex than is recommended for production evaluation.

After selecting the MQM issue types to be used in the evaluation task, an appropriate annotation environment needs to be configured to support the issue type selection. Both translate5 and the MQM scorecard are configured using a simple XML file (see Figure 2) that identifies the issues to be used. Other environments that could be configured to use MQM categories may use other mechanisms to declare a metric.
Figure 2. XML MQM metric definition file for use in translate5 and the MQM scorecard.
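As an illustrative sketch of such a definition, the following Python snippet generates a nested issue-type listing as XML. The element and attribute names here are hypothetical; the actual schema expected by translate5 and the MQM scorecard may differ:

```python
import xml.etree.ElementTree as ET

def build_metric(issues, parent):
    # Recursively turn a nested dict of issue types into <issue> elements.
    for name, children in issues.items():
        node = ET.SubElement(parent, "issue", type=name)
        build_metric(children, node)

# Hypothetical metric: Accuracy/Fluency with a few subtypes.
metric = {
    "Accuracy": {
        "Mistranslation": {"Terminology": {}},
        "Omission": {},
    },
    "Fluency": {"Grammar": {}, "Spelling": {}},
}

root = ET.Element("issues")
build_metric(metric, root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point of the nesting is that the file mirrors the issue hierarchy of the metric, so the tool can offer both specific subtypes and their parent types to the annotator.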
2.2 Selecting an Annotation Environment

There are a number of types of annotation environments:
• At the coarsest level are questionnaires, spreadsheets, and simple scorecard tools that simply count errors (but do not indicate their location within the text) or evaluate texts as a whole. These tools are useful for looking at features of the text as a whole, but do not provide detailed insight into specific errors. Such systems are generally not advisable for translation research tasks that involve error analysis (but they may be suitable in some production environments or for research projects where finer granularity is not needed).
• At a finer level of granularity are scorecard systems that store annotations at the segment level. They allow users to attach errors to specific segments, but not to specific words, although they may support adding notes or highlighting text. These systems are typically easy to use and are useful for quick annotation where it is sufficient to know which segments have which problems, but they do not tie issues to specific locations. The MQM Scorecard tool provides this functionality.
• Span-level annotation tools provide the ability to tie errors to particular spans in the text. Using them requires more training and care than is needed for the other tools, since issues have to be associated with spans of text. These tools provide the greatest insight into errors. The translate5 tool used for most QTLaunchPad tasks is this sort of tool.
The environment selected must support the analysis intended for the annotated data. In general, it is wise to err on the side of caution and ask for more detail rather than less. After selecting the annotation environment, it must be configured with the text(s) to be annotated and the appropriate metric definition.
2.3 Selection of Annotators and Training

Annotation is an intellectually demanding task. Three typical layers of annotation in MT development are:
1. The phenomenological level (target errors/issues)
2. The linguistic level (source or target POS, phrases, etc.)
3. The explanatory level (source/system-related causes for certain errors)
MQM annotation targets the phenomenological level. Depending on the complexity of the metric, it may require expert-level skill in both translation theory and linguistics. Within the QTLaunchPad project, it was found that expert human translators represented ideal annotators. However, not all translators were equally capable. In general, those with formal training in linguistics or with previous experience in error annotation (e.g., using a company-specific error scorecard system) were the best prepared for MQM annotation.

As inter-annotator agreement (IAA) did not exceed 50%, even with training, it is important in research environments to have multiple annotators in order to control for variability between individuals. Based on experience in the QTLaunchPad project, it is recommended that three annotators be used, if possible. It is anticipated that IAA would increase with experience and feedback, but in most research scenarios it is unlikely that annotators will work with MQM for an extended period.

Training is vital since the task and the specific details of how to work with MQM-compliant metrics and tools are not immediately apparent, even to highly skilled individuals. In general, the following training steps and materials are required:
• A live demo of the annotation environment. This step is vital to ensure that annotators understand how to use the tool and are aware of all relevant features. Since annotation tools can be relatively complex, this demo should focus on a step-by-step explanation of the relevant process. It is recommended that the demo be recorded, if possible, for future reference.
• A decision tree and written annotation guidelines. A decision tree provides a relatively objective tool that helps guide the annotator to selection of the right issue. Written guidelines help annotators determine correct behavior in cases where the appropriate action is not self-evident (e.g., which portion of a text to mark when word order is wrong and multiple portions could be moved to fix the problem). These tools are discussed in Section 5 (Training materials) below.
• A calibration set. In this phase annotators are asked to work with a set whose error properties are well known to the researchers. The data in such a set could be “real” data or could be data with known errors introduced into it. Comparing the annotators’ results for the calibration set with the ideal profile allows the researcher to identify any problems or confusions with the evaluation and provide corrective guidance before the research data is considered. Note that the calibration set should be representative of the data to be annotated, and it is highly recommended that it in fact be drawn from the same data set as the data to be annotated. (E.g., if 1000 segments out of 1500 are to be evaluated, 150 might be set aside for calibration, with the 1000 used for the research question then taken from the remaining 1350.)
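The calibration split described above can be sketched as follows (a minimal example; the function and variable names are illustrative, not part of any tool):

```python
import random

def split_for_calibration(segments, calibration_size, sample_size, seed=0):
    # Shuffle once, set aside the calibration set, then draw the research
    # sample from the remaining segments, so both come from the same data.
    rng = random.Random(seed)
    pool = list(segments)
    rng.shuffle(pool)
    calibration = pool[:calibration_size]
    research = pool[calibration_size:calibration_size + sample_size]
    return calibration, research

# 1500 segments: 150 set aside for calibration, 1000 drawn for the research task.
segments = [f"seg-{i}" for i in range(1500)]
calibration, research = split_for_calibration(segments, 150, 1000)
print(len(calibration), len(research))  # 150 1000
```

A fixed seed keeps the split reproducible, which matters if the annotation is later repeated or audited.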
2.4 Evaluation

The evaluation/annotation task may proceed after training is completed and the results of the calibration set are verified. Based on experience, it is recommended that the annotators work in short sessions (perhaps 30 minutes) with frequent breaks. The amount that can be evaluated in a given time frame depends on the number of errors present in the text: “cleaner” texts are faster to evaluate and annotate than are “dirty” texts with many errors. For MT evaluation, there is often a significant portion of the text that has so many errors that annotation is counter-productive, since the nature of the errors may not be clear or the entire
text may be unintelligible. Therefore it is recommended that the annotators conduct an initial “triage” phase in which segments are quickly categorized into one of three categories:
• perfect segments (which do not need to be annotated),
• segments to be annotated (the QTLaunchPad project targeted segments with 1–3 errors), and
• “garbage” segments, which contain too many errors to be annotated.

Annotation can then focus on the second category without worrying about the other two. If the triage task is conducted by more than one individual, appropriate policies for reconciling differences of opinion should be established (e.g., if one annotator marks a sentence as perfect and another as needing annotation, it is probably wise to circulate it for annotation by all annotators).
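The triage step can be sketched as a simple bucketing function (illustrative only; in practice the error counts come from the annotators’ quick pass over the data):

```python
def triage(error_counts, max_errors=3):
    # Bucket segments into the three triage categories described above.
    buckets = {"perfect": [], "annotate": [], "garbage": []}
    for seg_id, n_errors in error_counts.items():
        if n_errors == 0:
            buckets["perfect"].append(seg_id)
        elif n_errors <= max_errors:
            buckets["annotate"].append(seg_id)  # 1-3 errors: worth annotating
        else:
            buckets["garbage"].append(seg_id)   # too degraded to annotate
    return buckets

counts = {"s1": 0, "s2": 2, "s3": 7, "s4": 1}
buckets = triage(counts)
print(buckets["annotate"])  # ['s2', 's4']
```

The `max_errors=3` threshold follows the QTLaunchPad practice of targeting segments with 1–3 errors; other projects may choose a different band.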
2.5 Analysis

Multiple types of analysis are possible. Aggregate figures are often useful if multiple MT systems are being compared, as they can reveal system-level differences across engines. For determining the causes of specific errors, detailed analysis of specific issues is required. Whatever analysis is intended, it is important that data be preserved at all stages of transformation (e.g., if errors are extracted, the process should make a copy of the original data), since it is easy to make mistakes that can result in irretrievable data loss.
3 Costs

Based on the QTLaunchPad tasks, which focused on “near miss” translations, the direct costs of annotation, including triage selection of data to annotate, were approximately €1.50/segment.2 With previously trained annotators, the amount would probably drop to €1.00–1.25/segment. However, costs for MQM-based analysis are highly variable: for text with few errors, annotation would be quite inexpensive, while for text with many errors it would be much more expensive. This variability is one of the reasons why a triage phase is strongly recommended, since it allows the researcher to select segments with relatively predictable costs. The cost per issue in the QTLaunchPad tasks was approximately €0.75. Since the number of issues will vary between tasks, cost per issue cannot predict total costs, but it gives an idea of the productivity of evaluators. Finally, from the QTLaunchPad tasks the cost per word of annotation comes out to around €0.07–0.09/word. Accordingly, for pre-selected items with relatively few errors, the cost per thousand words would be around €7–9. If multiple annotations are factored in, the costs are multiplied by the number of annotators. In order to obtain sound data, then, the best estimate at present is that the cost is between €20 and €30 per 1000 words (assuming triple annotation). These figures do not include management or analysis, which can easily add 100–200% on top of the direct costs.
2 These figures are based on a payment system that paid a flat fee for a certain amount of text. An hourly fee was not used in QTLaunchPad because there was no previous experience on which to estimate time. However, if a fee of €50/hour is used with trained annotators, the figures presented here would be 15–20% lower for the sorts of text evaluated in the project.
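The arithmetic behind the €20–30 estimate can be made explicit. The sketch below uses the per-thousand-word figure for pre-selected, low-error text cited above and multiplies by the number of annotators; the function name is illustrative:

```python
def cost_per_1000_words(n_annotators, low=7.0, high=9.0):
    # Direct annotation cost in euros per 1000 words of pre-selected,
    # low-error text, multiplied across annotators.
    return low * n_annotators, high * n_annotators

lo, hi = cost_per_1000_words(3)
print(f"EUR {lo:.0f}-{hi:.0f} per 1000 words with triple annotation")
```

With three annotators this gives roughly €21–27 per 1000 words, matching the €20–30 estimate; management and analysis overhead would come on top.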
4 Amount of text required

There is no firm guidance for the amount of material needed for annotation. Based on QTLaunchPad results, it is possible to detect trends and identify major issue types with as few as 100–150 segments. Identifying rarer phenomena would require more data, since interesting phenomena would be expected to display a “long-tail” distribution, with certain kinds of errors (and causes) accounting for the bulk of problems, while other errors are less common. If the goal is just to identify the high-level distribution, small data sets may suffice, but in the QTLaunchPad project a concerted effort was made to identify the causes of problems, a task which required many more segments. As is typical, the more data one has, the better.
5 Training materials

The most useful training materials are annotation guidelines and decision trees. A set of annotation guidelines and a decision tree initially developed in QTLaunchPad and updated for use in the QTLeap project are included at the end of this document. The following subsections describe these resources and how to create them.
5.1 Decision trees

Decision trees are useful tools for learning a specific MQM metric’s issue types and distinguishing between them. They are especially useful as a learning tool and as an aid in determining which issue applies in cases where the answer is not immediately apparent. There are at least as many possible decision trees as there are MQM metrics (more, in fact, because decision trees can present issues in multiple orders). This document provides some guidance for making decision trees.

Decision trees should work through branches of the hierarchy, with a single question separating each branch from other branches. This requirement is important because all children of a particular issue type (an issue and its children constitute a branch) could be classified as the parent type, so a single question is needed that can distinguish all of them as a group from other issues.

After determining which node an issue is contained within, it is important to resolve more specific issue types before more general ones. This guideline works on the principle of exclusion: by eliminating specific cases, the general case is what remains. For example, if an MQM metric has the following structure for Accuracy:
• Accuracy
  • Mistranslation
    • Terminology
      • Company terminology
    • Number
  • Omission
    • Omitted variable
The process to work through the hierarchy is as follows:
• Determine whether the issue is a type of Mistranslation or, if it is not, if it is a type of Omission. Since these are the specific types of Accuracy, they need to be eliminated before declaring the issue a general Accuracy issue.
• If it is one of the subtypes, questions must determine if the issue is one of their children (or grandchildren). For example, if an issue is a type of Mistranslation, the question “Was a number mistranslated?” would identify (or rule out) a mistranslation of a number; the question “Is a term translated incorrectly?” would identify (or rule out) Terminology. If the answer to both of those questions is “No” then the issue is Mistranslation.
• A similar principle would ask the evaluator to rule out Company terminology before using a general Terminology issue type.
In accordance with the above, a decision tree for the Accuracy branch in this metric might look like the following:
1. Is content present in the source inappropriately omitted from the target? [This question selects or excludes Omission]
   • Yes: Go to question 2 [We know it is a type of Omission]
   • No: Go to question 3 [Omission has been excluded, so now we need to see if it is another type of Accuracy]
2. Is a variable omitted from the target content? [Tells us if the specific subtype of Omission should be selected]
   • Yes: Omitted variable
   • No: Omission [We have excluded the subtype of Omission, leaving the general type]
3. Are words or phrases translated incorrectly (i.e., is meaning conveyed by the source changed in the target)? [This question selects or excludes Mistranslation]
   • Yes: Go to question 4 [The issue is a type of Mistranslation]
   • No: Accuracy [Both Omission and Mistranslation have now been excluded, leaving only Accuracy]
4. Were numbers translated incorrectly? [Selects or excludes Number]
   • Yes: Number
   • No: Go to question 5 [Number is excluded, so we move on]
5. Is a domain- or organization-specific word or phrase translated incorrectly? [Selects or excludes Terminology]
   • Yes: Go to question 6 [We know it is a type of Terminology]
   • No: Mistranslation [We have excluded every other option]
6. Is the word or phrase translated contrary to company-specific terminology guidelines? [Identifies or excludes Company terminology]
   • Yes: Company terminology
   • No: Terminology [Company terminology has been excluded, leaving the more general Terminology]

Note that the order in which the children of an element are selected is theoretically unimportant. For example, question 1 above could have served to select or exclude Mistranslation, with a later question focused on Omission. The important aspect is that subtypes are excluded before a general type is selected. Although there is no theoretical principle for placing one issue type before another, there may be practical reasons to do so: decision trees should be optimized for efficiency. If one issue type is expected to be quite rare while its sibling is more common, the more common sibling should be placed first to make it easier to find.
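The Accuracy decision tree above can also be expressed in code as a chain of yes/no questions, which makes the exclusion principle explicit. The sketch below is illustrative; the question keys and data structure are hypothetical:

```python
def classify(answers):
    """Walk the Accuracy decision tree.

    answers: dict mapping question key -> bool (the annotator's yes/no).
    Each tree entry pairs a question key with its (yes, no) outcomes; an
    outcome is either a final issue type or the next question to ask.
    """
    tree = {
        "omitted":          ("variable_omitted", "mistranslated"),
        "variable_omitted": ("Omitted variable", "Omission"),
        "mistranslated":    ("number", "Accuracy"),
        "number":           ("Number", "domain_term"),
        "domain_term":      ("company_term", "Mistranslation"),
        "company_term":     ("Company terminology", "Terminology"),
    }
    issues = {"Omitted variable", "Omission", "Accuracy", "Number",
              "Mistranslation", "Company terminology", "Terminology"}
    node = "omitted"
    while node not in issues:
        yes, no = tree[node]
        node = yes if answers[node] else no
    return node

# A mistranslated number: not omitted, mistranslated, is a number.
print(classify({"omitted": False, "mistranslated": True, "number": True}))
```

Note how the specific subtypes (Omitted variable, Number, Company terminology) are reached or ruled out before the general types (Omission, Mistranslation, Terminology, Accuracy) are assigned, mirroring the exclusion principle.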
5.1.1 A generalized decision tree
The attached decision tree covers the full MQM hierarchy. It is not expected that the entire tree will be used, but individual questions can be taken from this decision tree to build specific trees. (Note that, due to its complexity, this tree is optimally printed on A0 paper. An A4-sized version is included in this document for reference. The full-sized version is available at http://qt21.eu/downloads/fullDecisionTreeComplete.pdf.)

Note that portions of the tree generally should not be used without their parent issue unless the decision tree is intended to document only specific errors and not general types. For example, selecting Company terminology without its ancestor nodes Terminology, Mistranslation, and Accuracy would result in a tree that cannot identify more general error types. This approach might be appropriate if the only issue being assessed is adherence to company terminology guidelines. If a metric is created that identifies only specific subtypes (e.g., a metric that counts only terminology violations and distinguishes between company and normative terminology), a decision tree is still possible, but could not be made from this resource without modification.

To extract a portion of the tree that is less granular than the full tree, it is necessary to remove any unneeded children of the types to be assessed. Guidance on removing these issues is beyond the scope of this description, but doing so is relatively straightforward if the MQM hierarchy is understood. Note that the specific questions may vary from those presented in the decision tree as long as they are capable of identifying the appropriate issues. The specific questions presented here are not to be treated as normative.
5.2 Annotation guidelines

Annotation guidelines provide practical guidance for the annotator. They need to provide a definition of the metric, instructions for how to realize the metric in the chosen annotation environment, and any specific items that need special attention. The guidelines are used in training, but also for reference during annotation. Therefore they need to be short and accessible. It may be advisable to maintain the guidelines in an accessible format where changes can be made to address queries and concerns that arise during annotation. A sample set of annotation guidelines is included at the end of this document. (The provided guidelines were given to annotators working on the MQM corpora analyzed in D1.3.1.)
Guide to selecting MQM issues for the MT Evaluation Metric
version 1.4 (2014 November 17)
Selecting issues can be a complex task. To assist evaluators, a decision tree is provided to help select appropriate issues. Use the decision tree not only for learning about MQM issues, but to guide your annotation efforts and resolve any questions or concerns you may have.
Start at the upper left corner of the decision tree and then answer the questions and follow the arrows to find appropriate issues.
If using translate5, note that the decision tree is organized a bit differently than the hierarchy in translate5 because it eliminates specific issue types before moving to general ones, so familiarize yourself with how issues are organized in translate5 before beginning annotation.
Add notes in translate5 or the scorecard to explain any decisions that you feel need clarification, to ask questions, or to provide information needed to understand issues, such as notes about what has been omitted in a translation.
In addition to using the decision tree, please understand and follow the guidelines in this document. Email us at [email protected] if you have questions that the decision tree and other content in this document do not address.
1. What is an error?

An error represents any issue you may find with the translated text that either does not correspond to the source or is considered incorrect in the target language. The list of language issues upon which you are to base your annotation is described in detail below and provides a range of examples.
The list is divided into two main issue categories, Accuracy and Fluency, each of which contains relevant, more detailed subcategories. Whenever possible, the correct subcategory should be chosen; however, if in doubt, please do not guess. Instead, select the category level about which you are most certain in order to avoid inconsistencies in the results.
Example: The German term Zoomfaktor was incorrectly translated as zoom shot factor, and you are unsure whether this represents a Mistranslation or an Addition. In this case, categorize the error as an Accuracy error, since it is unclear whether content has been added or a term mistranslated.
2. The Annotation Process

The translations you annotate should be a set of “near miss” (i.e., “almost perfect”) translations. Please follow these rules when selecting errors and tagging the respective text in the translations:
1. Use the examples in this documentation to understand specific classes.
2. If multiple types could be used to describe an issue (e.g., Agreement, Word form, Grammar, and Fluency), select the first one that the decision tree guides you to. The tree is organized along the following principles:
a. It prefers more specific types (e.g., Part of speech) to general ones (e.g., Grammar). However, if a specific type does not apply, it guides you to use the general type.
b. General types are used where the problem is of a general nature or where the specific problem does not have a precise type. For example He slept the baby exhibits what is technically known as a valency error, but because there is no specific type for this error available, it is assigned to Grammar.
3. Less is more. Only tag the relevant text. For example, if a single word is wrong in a phrase, tag only the single word rather than the entire phrase. If two words, separated by other words, constitute an error, mark only those two words separately. (See the section on “minimal markup” below.)
4. If correcting one error would take care of others, tag only that error. For example, if fixing an Agreement error would fix other related issues that derive from it, tag only the Agreement error, not the errors that result from it.
Examples
Source: Importfilter werden geladen
Translation: Import filter are being loaded
Correct: Import filters are being loaded
In this example, the only error is the translation of filter in the singular rather than the plural (as made clear by the verb form in the source text). This case should be classified as Mistranslation, even though it shows problems with agreement: if the subject had been translated properly the agreement problem would be resolved. In this case only filter should be tagged as a Mistranslation.
Source: im Dialog Exportieren
Translation: in the dialog export
Correct: in the Export dialog
In this example, only Mistranslation should be marked. While Word order and Spelling (capitalization) would be considered errors in other contexts, this would not be the case here, as these two words constitute one term that has been incorrectly translated.
5. If one word contains two errors (e.g., it has a Spelling issue and is also an Extraneous function word), enter both errors separately and mark the respective word in both cases.
6. If in doubt, choose a more general category. The categories Accuracy and Fluency can be used if the nature of an error is unclear. In such cases, providing notes to explain the problem will assist the QTLaunchPad team in its research.
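For readers building annotation tooling, the fallback behaviour in rules 2 and 6 can be sketched as a parent lookup over the issue hierarchy. This is a minimal illustration under our own assumptions: the `PARENT` map and the function name are ours, not part of the MQM specification.

```python
# Illustrative sketch of "select the category level about which you are
# most certain". The parent map mirrors the hierarchy described in this
# document; it is an assumption for demonstration, not normative MQM.

PARENT = {
    "Mistranslation": "Accuracy",
    "Terminology": "Accuracy",
    "Omission": "Accuracy",
    "Addition": "Accuracy",
    "Untranslated": "Accuracy",
    "Grammar": "Fluency",
    "Spelling": "Fluency",
    "Typography": "Fluency",
    "Word order": "Fluency",
    "Function words": "Fluency",
    "Unintelligible": "Fluency",
    "Word form": "Grammar",
    "Part of speech": "Grammar",
    "Agreement": "Grammar",
    "Tense/aspect/mood": "Grammar",
}

def fallback(category: str, certain: bool) -> str:
    """Keep the specific category if the annotator is certain of it;
    otherwise climb one level toward the top (e.g. Agreement -> Grammar,
    Grammar -> Fluency). Top-level categories stay as they are."""
    if certain or category not in PARENT:
        return category
    return PARENT[category]
```

Applied repeatedly, this reproduces the guideline that an unclear Agreement issue becomes Grammar, and an unclear Grammar issue becomes plain Fluency.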
3. Tricky cases
The following examples are ones that have been encountered in practice and that we wish to clarify.
• Function words: In some cases issues related to function words break the accuracy/fluency division seen in the decision tree because they are listed under Fluency even though they may impact meaning. Despite this issue, please categorize them as the appropriate class under Function words.
Example: The ejector may be found with the external case (should be on in this case). Even though this error changes the meaning, it should be classified as Function words: incorrect in the Fluency branch.
• Word order: Word order problems often affect long spans of text. When encountering word order errors, mark the smallest possible portion that could be moved to correct the problem.
Example: He has the man with the telescope seen. Here only seen should be marked as moving this one word would fix the problem.
• Hyphenation: Hyphenation issues sometimes occur in untranslated content and should be classified as such. Otherwise they should be classified as Spelling.
Example: Load the XML-files (Spelling) Nützen Sie die macro-lens (Untranslated, if the source has macro-lens as well)
• Number: An error of number (plural vs. singular) is classified as Mistranslation.
• Terminology: Terminology covers the inappropriate use of domain-specific terms, as distinct from general-language Mistranslation.
Example: An English translation uses the term thumb drive to translate the German USB Speicherkarte. This translation is intelligible, but if the translation mandated in specifications or a relevant termbase is USB memory stick, the use of thumb drive constitutes a Terminology error, even if thumb drive would be acceptable in everyday usage. However, if USB Speicherkarte were to be translated as USB Menu, this would be a Mistranslation since the words would be translated incorrectly, regardless of whether the original phrase is a term.
NOTE: Because no terminology list is provided, please use your understanding of relevant IT terminology for the evaluation task.
• Unintelligible: Use Unintelligible if content cannot be understood and the reason cannot be analyzed according to the decision tree. This category is used as a last resort for text where the nature of the problem is not clear at all.
Example: In the sentence “You can also you can use this tab to precision, with the colours are described as well as the PostScript Level,” there are enough errors that the meaning is unclear and the precise nature of the errors that lead to its unintelligibility cannot be easily determined.
• Agreement: This category generally refers to agreement between subject and predicate or gender and case.
Examples: The boy was playing with her own train
I is at work
• Untranslated: Many words may look as if they were translated, with only proper capitalization or hyphenation rules left unapplied. In most cases this represents an untranslated term, not a Spelling error. If the target word or phrase is identical to the source word or phrase, it should be treated as Untranslated, even if a Spelling error could also account for the problem.
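The identity test described above can be sketched as a small normalizing comparison. This is an illustrative helper, not part of the guidelines; treating only capitalization and hyphenation as ignorable surface changes is our assumption based on the examples in this section.

```python
def looks_untranslated(source_token: str, target_token: str) -> bool:
    """Treat a target token as Untranslated when it matches the source
    token up to capitalization and hyphenation -- the two surface
    changes this section says do not amount to a translation."""
    def normalize(word: str) -> str:
        return word.lower().replace("-", "")
    return normalize(source_token) == normalize(target_token)

print(looks_untranslated("macro-lens", "Macro-Lens"))  # -> True
print(looks_untranslated("Zoomfaktor", "zoom factor"))  # -> False
```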
4. Minimal markup
It is vital in creating error markup that errors be marked up with the shortest possible spans. Markup must identify only that area needed to specify the problem. In some cases this requirement means that two separate spans must be identified.
The following examples help clarify the general principles:
Incorrect markup: Double click on the number faded in the status bar. [Mistranslation, marked on “number faded in”]
Problem: Only the single word faded is problematic, but the markup indicates that number faded in is incorrect.
Correct minimal markup: Double click on the number faded in the status bar. [Mistranslation, marked on “faded”]

Incorrect markup: The standard font size for dialogs is 12pt, which corresponds to a standard of 100%. [Mistranslation]
Problem: Only the term Maßstab has been translated incorrectly. The larger span indicates that text that is perfectly fine has a problem.
Correct minimal markup: The standard font size for dialogs is 12pt, which corresponds to a standard of 100%.

Incorrect markup: The in 1938 nascent leader with flair divined %temp_name eating lonely. [Unintelligible]
Problem: The entire sentence is Unintelligible and should be marked as such.
Correct minimal markup: The in 1938 nascent leader with flair divined %temp_name eating lonely.
As noted above, Word order can be problematic because it is often unclear what portion(s) of the text should be marked. In cases of word order, mark the shortest portion of text (in number of words) that could be moved to fix the problem. If two portions of the text could resolve the problem and are equal in length, mark the one that occurs first in the text. The following examples provide guidance:
Incorrect markup: The telescope big observed the operation
Problem: Moving the word telescope would solve the problem and only this word should be marked (since it occurs first in the text).
Correct minimal markup: The telescope big observed the operation

Incorrect markup: The eruption by many instruments was recorded.
Problem: Although this entire portion shows word order problems, moving was recorded would resolve the problem (and is the shortest span that would resolve the problem).
Correct minimal markup: The eruption by many instruments was recorded.

Incorrect markup: The given policy in the manual user states that this action voids the warranty.
Problem: This example actually has two separate issues that should be marked separately.
Correct minimal markup: The given policy in the manual user states that this action voids the warranty.
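The tie-breaking rule for Word order spans (shortest movable span wins; ties go to the earliest occurrence) can be captured in a few lines. This is a hypothetical helper of our own, not part of the guidelines or of any annotation tool.

```python
def pick_word_order_span(candidates):
    """Choose among candidate spans, each given as
    (start_word_index, word_count): prefer the fewest words,
    then the earliest position in the text."""
    return min(candidates, key=lambda span: (span[1], span[0]))

# "The telescope big observed the operation": moving "telescope"
# (word index 1) or "big" (word index 2) would each fix the sentence;
# both are one word long, so the earlier one, "telescope", is chosen.
print(pick_word_order_span([(1, 1), (2, 1)]))  # -> (1, 1)
```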
Agreement poses special challenges because portions that disagree may be widely separated. To select appropriate minimal spans, consider the following guidelines:
• If two items disagree and it is readily apparent which should be fixed, mark only the portion that needs to be fixed. E.g., in “The man and its companion were business partners” it is readily apparent that its should be his and the wrong grammatical gender has been used, so only its should be marked.
• If two items disagree and it is not clear which portion is incorrect, mark both items separately and classify each as Agreement, as shown in the example in the table below.
The following examples demonstrate how to mark Agreement:
Incorrect markup: The man and its companion were business partners. [Agreement]
Problem: In this example, it is clear that its is the problematic portion, and that man is correct, so only its should be marked.
Correct minimal markup: The man and its companion were business partners. [Agreement, marked on “its”]

Incorrect markup: The man whom they saw on Friday night at the store were very big. [Agreement]
Problem: In this example it is not clear whether man or were is the error since there is nothing to indicate whether singular or plural is intended. Here the highlighted portion identifies only a single word, insufficient to identify the agreement problem. The correct version highlights both words as separate issues. In such cases use the Notes field to explain the decision.
Correct minimal markup: The man whom they saw on Friday night at the store were very big. [Agreement, marked separately on “man” and “were”]
In the event of questions about the scope of markup that should be used, utilize the Notes field to make a query or explain your choice.
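A minimal data-structure sketch of such markup (our own illustration; MQM does not prescribe a storage format) records one record per marked span, so the two-span Agreement case above becomes two records sharing one explanatory note:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int      # character offset where the marked span begins
    end: int        # character offset just past the end of the span
    category: str   # issue type from the decision tree, e.g. "Agreement"
    note: str = ""  # free-text explanation for the QTLaunchPad team

text = "The man whom they saw on Friday night at the store were very big."
note = "Unclear whether 'man' or 'were' carries the error."

# Two separate minimal spans, one record each, sharing the same note.
annotations = [
    Annotation(text.index("man"), text.index("man") + 3, "Agreement", note),
    Annotation(text.index("were"), text.index("were") + 4, "Agreement", note),
]

for a in annotations:
    print(a.category, repr(text[a.start:a.end]))
```

Keeping offsets rather than whole phrases enforces the minimal-markup rule mechanically: a tool can reject any annotation whose span contains text that the note does not account for.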
The error corpus uses the following issue categories:
• Accuracy. Accuracy addresses the extent to which the target text accurately renders the meaning of the source text. For example, if a translated text tells the user to push a button when the source tells the user not to push it, there is an accuracy issue.
• Mistranslation. The target content does not accurately represent the source content.
Example: A source text states that a medicine should not be administered in doses greater than 200 mg, but the translation states that it should not be administered in doses less than 200 mg.
Note(s): Mistranslation can be used for both words and phrases.
• Terminology. Domain- or industry-specific terms (including multi-word terms) are translated incorrectly.
Example: In a musicological text the term dog is encountered and translated into German as Hund rather than the domain-specific term Schnarre.
Note(s): Terminology errors may be valid translations for the source word in general language, but are incorrect for the specific domain or organization.
• Omission. Content is missing from the translation that is present in the source.
Example: A source text refers to a “mouse pointer” but the translation does not mention it.
Note(s): Omission should be reserved for those cases where content present in the source and essential to its meaning is not found in the target text.
• Addition. The target text includes text not present in the source.
Example: A translation includes portions of another translation that were inadvertently pasted into the document.
• Untranslated. Content that should have been translated has been left untranslated.
Example: A sentence in a Japanese document translated into English is left in Japanese.
Note(s): As noted above, if a term is passed through untranslated, it should be classified as Untranslated rather than as Mistranslation.
• Fluency. Fluency relates to the monolingual qualities of the source or target text, relative to agreed-upon specifications, but independent of the relationship between source and target. In other words, fluency issues can be assessed without regard to whether the text is a translation or not. For example, a spelling error or a problem with register remains an issue regardless of whether the text is translated or not.
• Spelling. Issues related to spelling of words (including capitalization)
Examples: The German word Zustellung is spelled Zustetlugn. The name John Smith is written as “john smith”.
• Typography. Issues related to the mechanical presentation of text. This category should be used for any typographical errors other than spelling.
Examples: Extra, unneeded carriage returns are present in a text. A semicolon is used in place of a comma.
• Grammar. Issues related to the grammar or syntax of the text, other than spelling and orthography.
Example: An English text reads “The man was in seeing the his wife.”
Note(s): Use Grammar only if no subtype accurately describes the issue.
• Word form. The wrong form of a word is used. Subtypes should be used when possible.
Example: An English text has comed instead of came.
• Part of speech. A word is the wrong part of speech
Example: A text reads “Read these instructions careful” instead of “Read these instructions carefully.”
• Agreement. Two or more words do not agree with respect to case, number, person, or other grammatical features
Example: A text reads “They was expecting a report.”
• Tense/aspect/mood. A verbal form inappropriate for the context is used
Example: An English text reads “Yesterday he sees his friend” instead of “Yesterday he saw his friend”; an English text reads “The button must be pressing” instead of “The button must be pressed”.
• Word order. The word order is incorrect
Example: A German text reads “Er hat gesehen den Mann” instead of “Er hat den Mann gesehen.”
• Function words. Linguistic function words such as prepositions, particles, and pronouns are used incorrectly
Example: An English text reads “He beat him around” instead of “he beat him up.”
Note(s): Function words is used for cases where individual words with a grammatical function are used incorrectly. The most common problems will have to do with prepositions and particles. For languages where verbal prefixes play a significant role in meaning (as in German), they should be included here, even if they are not independent words.
There are three subtypes of Function words. These are used to indicate whether an unneeded function word is present (Extraneous), a needed function word is missing (Missing), or an incorrect function word is used (Incorrect). Evaluators should use the Notes field to specify details for missing function words.
• Unintelligible. The exact nature of the error cannot be determined. Indicates a major breakdown in fluency.
Example: The following text appears in an English translation of a German automotive manual: “The brake from whe this કુતારો િસ S149235 part numbr,,."
Note(s): Use this category sparingly for cases where further analysis is too uncertain to be useful. If an issue is categorized as Unintelligible no further categorization is required. Unintelligible can refer to texts where a significant number of issues combine to create a text for which no further determination of error type can be made or where the relationship of target to source is entirely unclear.