How to Measure the Success of Machine Translation

Dion Wiggins, Chief Executive Officer
[email protected]
Feb 25, 2016
Copyright © 2012, Asia Online Pte Ltd
The World We Live In

Every day 15 petabytes of new information are generated.
• A petabyte is one million gigabytes.
• 8x more than the information stored in all US libraries.
• The equivalent of 20 million four-drawer filing cabinets filled with text.
• In 2012 we have 5 times more data stored than we did in 2008.
• The volume of data is growing exponentially and is expected to increase 20-fold by 2020.
• We now have access to more data than at any time in human history.
The World We Live In
• We live in a world that is increasingly instrumented and interconnected.
• The number of “smart” devices is growing every day, and the volume of data they produce is growing exponentially – doubling every 18 months.
• All these devices create new demand for access to information – access now, on demand, in real time.

By 2015 there will be more than 15 billion devices connected to the internet.
The World We Live In
• Google’s message to the market has long been that its business is making the world’s information searchable and that MT is part of that mission.
Google translates more in one day than all human translators do in one year.
The World We Live In

How much new text information should be translated?
• Common Sense Advisory calculates:
  – US$31.4 billion earned for language services in 2011
  – Divide by 365 days
  – Divide by 10 cents per word
• LSPs translate a mere 0.00000067% of the text information created every day.
• Even if only 1% of new text information created each day should be translated, that still means only 0.000067% is translated by LSPs.
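The arithmetic behind these percentages can be sketched as follows. The revenue figure and the 10-cents-per-word rate are the slide's stated assumptions, not exact Common Sense Advisory data:

```python
# Rough re-derivation of the figures above; the revenue and per-word
# rate are the slide's assumptions, not exact CSA data.
revenue_usd = 31.4e9        # language services revenue, 2011
price_per_word = 0.10       # assumed average rate of 10 cents per word

words_per_year = revenue_usd / price_per_word
words_per_day = words_per_year / 365
print(f"LSPs translate roughly {words_per_day:,.0f} words per day")
```

That works out to under a billion words per day, against the petabytes of new text created daily — which is where the tiny percentages come from.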
The World We Live In
Translation demand is increasing.
Translator supply is decreasing.
The Impact of a Translator Shortage
• It is already clear that, at 2,000-3,000 words per day per translator, demand is many multiples of supply.
• LSPs are having trouble finding qualified and skilled translators
  – In part due to lower rates in the market and more competition for resources
• Wave of new LSPs and translators
  – Many will try to capitalize on the market opportunity created by the translator shortage, but will deliver sub-standard services
  – Lack of experience – both new LSPs and translators
  – Lower-quality translations will become more commonplace

Skilled translator supply is shrinking.
Expanding the Reach of Translation

[Chart: content types vs. example word volumes, from corporate content (partly multilingual, human translation, existing markets worth $31.4B) down to user generated content (machine translation, new markets):]
• Corporate – Corporate Brochures (~2,000 words)
• Products – Product Brochures (~10,000 words)
• User Interface – Software Products (~50,000 words)
• User Documentation – Manuals / Online Help (~200,000 words)
• Enterprise Information – HR / Training / Reports (~500,000 words)
• Communications – Email / IM (~10,000,000 words)
• Support / Knowledge Base – Call Center / Help Desk (20,000,000+ words)
• User Generated Content – Blogs / Reviews (50,000,000+ words)
Interesting Question

Vinod Bhaskar: Machine vs. Human Translation
Are machine translations gaining ground? Can they put translators out of circulation like cars did to horses and cart drivers?
Evolution of the Automotive Industry

Early Innovation → Production Line → Mass Production, spawning:
• Fuel and Energy Industry
• Transportation
• Service Industries
• Research and Development
• Customization and Parts
Translation is Evolving Along a Similar Path

Early Innovation: human translator, Single Language Vendors
Production Line: translation memory, dictionary / glossary, Multi Language Vendors
Service Industries:
• Specialization
• Editing / Proofing
• Translation / Localization
• Global work groups
• Quality assurance
• Managed crowd and community
• Professionally managed amateur workforce
Research and Development:
• Research funding & grants
• Natural Language Processing
• Language technologies – search, voice, text, etc.
• Preservation of dying languages
• Automated & enhanced post-editing
• Euro-to-Asian language pair automation
Customization:
• Custom engine development
• Mass translation at high quality
• Newly justifiable business models
Communications:
• Data, reports
• Phone, mobile data
• Internet, broadband
• Website translation
Translation for the Masses:
• World Lingo
• Google Translate
• Yahoo Babelfish
Technology Industry:
• Mass translation
• Knowledge dissemination
• Data automation
• Information processing
• Multilingual software
Evolution of Machine Translation Quality

[Chart: quality over time, from Experimental through Gist to Near Human, against a Good Enough Threshold. Key annotations:]
• Early RBMT improved rapidly as new techniques were discovered (Babelfish).
• Quality plateaued as RBMT reached its limits in many languages – only marginal improvement.
• 9/11 → research funding; processors became powerful enough; large volumes of digital data became available.
• Early SMT vendors emerged; Google switched from Systran to SMT.
• Google drives MT acceptance; businesses start to consider MT; LSPs start to adopt MT; new skills develop in editing MT.
• New techniques mature; hybrid MT platforms cross the Good Enough Threshold.
Machine Translation Hype Cycle*

[Chart: visibility vs. time/maturity along the hype-cycle curve – Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity. Milestones:]
• 1947-1954: The “Translation” memorandum; Georgetown experiment
• 1966: ALPAC report
• 1990: IBM Research; move from mainframe to PC
• 2001: 9/11; Babelfish
• 2007: Google switches to SMT; Moses; early LSP adopters; notable quality improvement
• 2011: Microsoft & Google announce paid API; near human quality examples emerge
• 2015: Mainstream LSP use

*Not an official Gartner Hype Cycle
Top 5 Reasons For Not Adopting MT

1. Perception of Quality
   – Many believe Google Translate is as good as it gets / state of the art.
   – This is true for scale, but not for quality.
2. Perception of Quality
   – Perfect quality is expected from the outset, and tests using Google or other out-of-the-box translation tools are disappointing.
   – When combined with #1, other MT is quickly ruled out as an option.
3. Perception of Quality
   – The opposite of #2: human resistance to MT. The “a machine will never be able to deliver human quality” mindset.
4. Perception of Quality
   – Few understand that out-of-the-box or free MT and customized MT are different.
   – They don’t see why they should pay for commercial MT, as quality is perceived to be the same.
5. Perception of Quality
   – Quality is not good enough as raw MT output.
   – The equation is not MT OR Human. It is MT AND Human.
What Is Different This Time Around?

Machine Translation (MT): 50 Years of eMpTy Promises

Q: Why does an industry that has spent 50 years failing to deliver on its promises still exist?
A: An infinite demand – a well-defined and growing problem that has always been looking for a solution. What was missing was QUALITY.

Definition of Quality: Whatever the customer says it is!
Quality Depends on the Purpose

• Document Search and Retrieval
  – Purpose: To find and locate information
  – Quality: Understandable; technical terms key
  – Technique: Raw MT + terminology work
• Knowledge Base
  – Purpose: To allow self-support via the web
  – Quality: Understandable; can follow the directions provided
  – Technique: MT & human for key documents
• Search Engine Optimization (SEO)
  – Purpose: To draw users to the site
  – Quality: Higher quality, near human
  – Technique: MT + human (student, monolingual)
• Magazine Publication
  – Purpose: To publish in a print magazine
  – Quality: Human quality
  – Technique: MT + human (domain specialist, bilingual)

Establish Clear Quality Goals
Step 1 – Define the purpose
Step 2 – Determine the appropriate quality level
Reality Check – What Do You Really Get?

[Chart: translation speed in words per day per translator – human translation (~3,000 words/day) vs. typical MT + post-editing vs. fastest MT + post-editing (28,000 words/day*).]

*Fastest MT + post-editing speed reported by clients.

The average person reads 200-250 words per minute – 96,000-120,000 words in 8 hours, roughly 35 times faster than human translation.
Success Factors: Understanding Return On Investment

• Cost – Did we lower overall project costs?
• Time – Did we deliver more quickly while achieving the desired quality?
• Resources – Were we able to do the job with fewer resources?
• Quality – Did we deliver a quality level that met or exceeded a human-only approach?
• Profit – Less important in early projects, but the key reason we are in business.
Success Factors: Understanding Return On Investment

• Customer – Is the customer satisfied? Have we met or exceeded their quality requirements?
• Asset Building – Did we expand our linguistic assets? If we did the same kind of job again, would it be easier?
• New Business – What business opportunities have been created that would not otherwise have been possible? What barriers have been removed by leveraging MT?
Objective Measurement is Essential

Targets should be defined, set and managed from the outset.
Why Measure?

“The understanding of positive change is only possible when you understand the current system in terms of efficiency. …
Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.”
– Kevin Nelson, Managing Director, Omnilingua Worldwide
What to Measure?

• Automated metrics
  – Useful to some degree, but not enough on their own.
• Post-editor feedback
  – Useful for sentiment, but not a reliable metric. When compared to technical metrics, reality is often very different.
• Number of errors
  – Useful, but can be misleading. The complexity of error correction is often overlooked.
• Time to correct
  – On its own useful for productivity metrics, but not enough when more depth and understanding is required.
• Difference between projects
  – Combined, the above allow an understanding of each project, but are much more valuable when compared over several similar projects.

Objective measurement is the only means to understand.
Rapid MT Quality Assessment

Automated:
• BLEU is the most commonly used metric: “… the closer the machine translation is to a professional human translation, the better it is.”
• METEOR, TERp and many others are in development.
• Limited, but still useful for MT engine development if properly used.

Human Assessments:
• Long-term consistency, repeatability and objectivity are important.
• Butler Hill Group has developed a protocol that is widely accepted and used.
• Can be based on error categorization such as SAE J2450.
• Should be used together with automated metrics.
• Will focus more on post-editing characteristics in future.
Different Automated Metrics

All four metrics compare a machine translation to human translations.

• BLEU (Bilingual Evaluation Understudy)
  – BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular.
  – Scores are calculated for individual translated segments – generally sentences – by comparing them with a set of good-quality reference translations.
  – Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality.
  – Intelligibility and grammatical correctness are not taken into account.
  – BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences.
  – More: http://en.wikipedia.org/wiki/BLEU
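The clipped n-gram precision and brevity penalty that BLEU combines can be illustrated with a deliberately simplified sketch. Real BLEU uses n-grams up to 4 and pools counts over a whole corpus; this toy version stops at bigrams and scores a single sentence:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, max_n=2):
    """Clipped n-gram precision (n = 1..max_n) combined by a geometric
    mean, times a brevity penalty that punishes short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Each candidate n-gram only counts as often as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * geo_mean
```

An exact match scores 1.0; a too-short candidate is discounted by the brevity penalty even if every n-gram it contains matches.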
• NIST
  – The name comes from the US National Institute of Standards and Technology.
  – It is based on the BLEU metric, but with some alterations:
    • Where BLEU simply calculates n-gram precision, giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.
    • NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
  – More: http://en.wikipedia.org/wiki/NIST_(metric)
Different Automated Metrics

• F-Measure (F1 Score or F-Score)
  – In statistics, the F-Measure is a measure of a test’s accuracy.
  – It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned.
  – The F-Measure can be interpreted as a weighted average of precision and recall, where a score reaches its best value at 1 and worst at 0.
  – More: http://en.wikipedia.org/wiki/F1_Score
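The precision/recall/F1 definitions above translate directly into code. Bag-of-words overlap is one common simplification for translation output; variants differ in how matches are counted:

```python
from collections import Counter

def f_measure(candidate_tokens, reference_tokens):
    """F1 over bag-of-words overlap between a candidate translation
    and a reference translation."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())          # correct results
    p = overlap / max(sum(cand.values()), 1)      # precision
    r = overlap / max(sum(ref.values()), 1)       # recall
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```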
• METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  – The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
  – It also has several features not found in other metrics, such as stemming and synonymy matching, along with standard exact word matching.
  – The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level.
  – This differs from BLEU, which seeks correlation at the corpus level.
  – More: http://en.wikipedia.org/wiki/METEOR
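The recall-weighted harmonic mean at the heart of METEOR can be sketched as below. The alpha=0.9 weighting mirrors the classic METEOR 9:1 recall emphasis; full METEOR additionally applies stemming, synonymy matching and a fragmentation penalty not shown here:

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall, with
    recall weighted far above precision (alpha = 0.9)."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

Note how an output with high recall but modest precision outscores the reverse, which is the intended bias.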
Copyright © 2012, Asia Online Pte Ltd
Usability and Readability Criteria

Evaluation criteria for MT output:
• Excellent (4) – Read the MT output first, then read the source text (ST). Your understanding is not improved by reading the ST, because the MT output is satisfactory and would not need to be modified (grammatically correct, proper terminology used; perhaps not stylistically perfect, but it fulfils the main objective of transferring all information accurately).
• Good (3) – Read the MT output first, then read the ST. Your understanding is not improved by reading the ST, even though the MT output contains minor grammatical mistakes. You would not need to refer to the ST to correct these mistakes.
• Medium (2) – Read the MT output first, then read the ST. Your understanding is improved by reading the ST, due to significant errors in the MT output. You would have to re-read the ST a few times to correct these errors.
• Poor (1) – Read the MT output first, then read the ST. Your understanding derives only from reading the ST, as you could not understand the MT output. It contained serious errors. You could only produce a translation by dismissing most of the MT output and/or re-translating from scratch.

Human evaluators can develop a custom error taxonomy to help identify key error patterns, or use an error taxonomy from standards such as the LISA QA Model or SAE J2450.
Sample Metric Report From Language Studio™
Multiple References Increase Scores
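The effect is mechanical: with more references, each candidate word has more chances to find a match. A sketch using clipped unigram precision (the sentences are invented examples):

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Unigram precision where each candidate word is clipped against
    the BEST count seen in ANY reference."""
    cand = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

mt = "the engine produced a fast translation"
one_ref = ["the engine generated a quick translation"]
two_refs = one_ref + ["the engine produced a fast rendering"]
print(clipped_unigram_precision(mt, one_ref))   # fewer matches
print(clipped_unigram_precision(mt, two_refs))  # more matches
```

The same candidate scores higher against two references than against one, even though the translation itself did not change — which is why reference counts must match when comparing scores.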
Why do you need rules?

Normalization: variant forms should be normalized to a single preferred term, e.g.
• 2 Port Switch
• Double Port Switch
• Dual Port Switch
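A minimal normalization rule pass might look like the following. The rule table and the preferred form are illustrative, taken from the variants above:

```python
# Hypothetical normalization rules collapsing terminology variants
# to a single preferred form before (or after) translation.
NORMALIZATION_RULES = {
    "Double Port Switch": "2 Port Switch",
    "Dual Port Switch": "2 Port Switch",
}

def normalize(text, rules=NORMALIZATION_RULES):
    for variant, preferred in rules.items():
        text = text.replace(variant, preferred)
    return text

print(normalize("Install the Dual Port Switch near the rack."))
```

Production systems would typically match on token boundaries and handle casing, but the principle is the same: one preferred term in, one preferred term out.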
Why do you need rules?

Terminology Control and Management:
• Non-translatable terms
• Glossaries, such as product names
• Job-specific preferred terminology
BLEU scores and other translation quality metrics will vary based upon:

1. The test set being measured:
   – Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The test set should be of gold-standard quality; lower-quality test set data will give a less meaningful score.
2. How many human reference translations were used:
   – If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
   – Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese.
   – Typically, if the source or target language is more complex, the BLEU score will be lower.
4. The complexity of the domain:
   – A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
   – When comparing metrics, the most common form of measurement is case-insensitive. However, when publishing, case-sensitive measurement is also important and may also be used.
6. The measurement software:
   – There are many measurement tools for translation quality. Each may vary slightly with respect to how a score is calculated, or the settings of the measurement tools may not be the same.
   – The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.

It is clear from the above list of variations that a BLEU score by itself has no real meaning.
Metrics Will Vary – Even the Same Metrics!

What is your BLEU score?
This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked.
Basic Test Set Criteria Checklist

• Test set data should be very high quality:
  – If the test set data are of low quality, then the metric delivered cannot be relied upon.
  – Proofread the test set. Don’t just trust existing translation memory segments.
• Test set should be in domain:
  – The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not produce a useful metric.
• Test set data must not be included in the training data:
  – If you are creating an SMT engine, you must make sure that the data you are testing with (or very similar data) are not in the data that the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality that will be output when other data are translated.

The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.
Basic Test Set Criteria Checklist

• Test set data should be data that can be translated:
  – Test set segments should contain a minimal number of dates, times, numbers and names. While a valid part of a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus of a test set should be on words that are to be translated.
• Test set segments should be between 8 and 15 words in length:
  – Short segments will artificially raise the quality scores, as most metrics do not take segment length into account. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation so much as a 100% translation memory match. The longer the segment, the more opportunity there is for variation in what is being translated; this will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be very few.
• Test set should be at least 1,000 segments:
  – While it is possible to get a metric from a smaller test set, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there are only a small number of segments, small anomalies in one or two segments can raise or lower the test set score artificially.
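The checklist lends itself to a quick automated validation pass. The function name and thresholds below are illustrative; the 8-15 word range and 1,000-segment minimum follow the slides' recommendations, and the 5% tolerance is an assumption:

```python
def check_test_set(segments, training_segments,
                   min_len=8, max_len=15, min_count=1000, tolerance=0.05):
    """Flag test sets that break the basic criteria: segments outside
    the 8-15 word range, overlap with training data, or too few segments."""
    training = set(training_segments)
    issues = []
    bad_length = [s for s in segments
                  if not min_len <= len(s.split()) <= max_len]
    if len(bad_length) > tolerance * len(segments):
        issues.append(f"{len(bad_length)} segments outside "
                      f"{min_len}-{max_len} words")
    leaked = [s for s in segments if s in training]
    if leaked:
        issues.append(f"{len(leaked)} segments also appear in training data")
    if len(segments) < min_count:
        issues.append(f"only {len(segments)} segments (want >= {min_count})")
    return issues
```

An exact-match check only catches verbatim leakage; "very similar" training data, as the checklist warns, needs fuzzy matching beyond this sketch.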
Comparing Translation Engines: Initial Assessment Checklist

• Test set must be consistent:
  – The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines.
• Test sets must be “blind”:
  – If the MT engine has seen the test set before, or included the test set data in the training data, then the quality of the output will be artificially high and will not represent a true metric.
• Tests must be carried out transparently:
  – Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data.
  – If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email.
  – This removes any possibility of the MT vendor tampering with the output or fine-tuning the engine based on the output.

All conditions of the Basic Test Set Criteria must be met. If any condition is not met, the results of the test could be flawed and not meaningful or reliable.
Comparing Translation Engines: Initial Assessment Checklist

• Word segmentation and tokenization must be consistent:
  – If word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
• Provide each MT vendor with a sample of at least 20 in-domain documents:
  – This allows each vendor to better understand the type of document and customize accordingly.
  – This sample should not be the same as the test set data.
• Test in three stages:
  – Stage 1: Starting quality without customization
  – Stage 2: Initial quality after customization
  – Stage 3: Quality after the first round of improvement
    • This should include post-editing of at least 5,000 segments, preferably 10,000.
Understanding and Comparing Improvements
The Language Studio™ 4 Step Quality Plan

1. Customize – Create a new custom engine using foundation data and your own language assets.
2. Measure – Measure the quality of the engine for rating and future improvement comparisons.
3. Improve – Provide corrective feedback, removing the potential for translation errors.
4. Manage – Manage translation projects while generating corrective data for quality improvement.
A Simple Recipe for Quality Improvement

• Asia Online develops a specific improvement roadmap for each custom engine.
  – This ensures the fastest possible development path to quality.
  – You can start from any level of data.
• The roadmap is developed based on the following:
  – Your quality goals
  – Amount of data available in the foundation engine
  – Amount of data that you can provide
  – Quality expectations are set from the outset
  – Asia Online performs the majority of the tasks; many are fully automated
The Data Path To Higher Quality

[Chart: five entry points, from highest quality (least editing) to least (most editing). Quality constantly improves; data can come from post-editing feedback on the initial custom engine.]
1. High volume, high quality translation memories; rich glossaries; large high quality monolingual data
2. Some high quality translation memories; some high quality monolingual data; glossaries
3. Limited high quality translation memories; some high quality monolingual data; glossaries
4. Limited high quality monolingual data; glossaries
5. Limited high quality monolingual data

You can start your custom engine with just monolingual data, improving over time as data becomes available.
Quality requires an understanding of the data. There is no exception to this rule.
High Quality: Human Translation Project vs. Machine Translation Project

When preparing for a high quality human translation project, many core steps are performed to ensure that the writing style and vocabulary are designed for a customer's target audience. Almost identical information and data are required in order to customize a high quality machine translation system.

Human Only:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide • Quality Requirements • Translate • Edit • Proof • Project Management

MT + Human:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide** • Quality Requirements • Customize MT • Translate • Edit • Proof • Project Management

MT Only:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide** • Quality Requirements • Customize MT • Translate • Edit • Proof • Project Management
Ability to Improve is More Important than Initial Translation Engine Quality

Language Studio™ is designed to deliver translated output that requires the least amount of editing in order to publish.
• The initial scores of a machine translation engine, while indicative of quality, should be viewed as a starting point for rapid improvement.
• Depending on the volume and quality of data provided to the SMT vendor to learn from, the quality may be lower or higher.
• Frequently a new translation engine will have gaps in vocabulary and grammatical coverage.
• All MT vendors should offer a clear improvement path. Most do not.
  – Many simply tell you to post-edit and add data… or worse, to get more data from other non-trusted sources.
  – Most do not tell you how much data is required.
  – Many MT vendors do not improve at all, or improve very little, unless huge volumes of data are added to the initial training data.
Data Required to Improve Quality

• Competitors require 20% or more additional data relative to the initial training data to show notable improvements.
  – This could take years for most LSPs.
  – This is the dirty little secret of the Dirty Data SMT approach that is rarely acknowledged.
• Asia Online has reference customers that have seen notable improvements with just one day's worth of post-editing.
  – Only possible with Clean Data SMT.

[Chart: typical Dirty Data SMT engines have between 2 million and 20 million sentences in the initial training data; improvement requires 20% or more on top (200,000 to 2 million additional sentences), whereas improvements based on edits of < 0.1% of the training data can be applied daily.]

With Language Studio™, Every Edit Counts
Results of Refinements

[Chart 1: raw MT quality vs. post-editing effort across engine learning iterations 1-6, against a publication quality target.]
Post-editing effort reduces over time. The post-editing and cleanup effort gets easier as the MT engine improves. Initial efforts should focus on error analysis and correction of a representative sample data set. Each successive project should get easier and more efficient.

[Chart 2: cost per word of post-editing (human translation) across engine learning iterations 1-6.]
MT learns from post-editing feedback, and translation quality constantly improves. The cost of post-editing progressively reduces as MT quality increases after each engine learning iteration.

GOAL: Progressively develop engine quality to a level that exceeds the equivalent productivity of an 85% translation memory fuzzy match.
Comparing Translation Engines: Translation Quality Improvement Assessment
• Comparing Versions:
– When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
• Comparing Machine Translation Vendors:
– When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that the vendor did not adapt its system to suit the test set and in doing so deliver an artificially high score. The proofread test set data may also have been added to the engine’s training data, which will likewise bias the score.
• Use a Second Blind Test Set:
– As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required.
– When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.
How Test Sets are Measured
• Original Source (S): the original sentences that are to be translated.
• Human Reference (R): the gold standard of what a high quality human translation would look like.
• Translation Candidate (C): the translated output from the machine translation system that you are comparing.
Workflow: machine translate the Original Source, then compare the Translation Candidate against the Human Reference and score it.
Note: multiple machine translation candidates can be scored at one time to compare against each other, e.g. Asia Online, Google, Systran.
3 Measurement Tools:
• Human Quality Assessment
• Automated Quality Metrics
• Sentence Evaluation
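The compare-and-score step is typically automated with a metric such as BLEU. As an illustration only (real evaluations should use a standard tool such as sacreBLEU), here is a minimal single-reference BLEU sketch showing how a candidate and a reference are compared n-gram by n-gram:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of modified
    n-gram precisions (n = 1..4) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "Today is Friday , the sky is blue and the weather is cold ."
print(bleu(ref, ref))  # identical output scores 1.0
print(bleu("The sky is blue , the weather is cold and today is Friday .", ref))
```

Because the score is built from n-gram overlap, a candidate that reorders an otherwise correct sentence loses higher-order n-gram matches and scores lower. This is also why the Human Reference should mirror the source word order where accuracy allows.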
Defining a Tuning Set and Test Set
• Tuning Set – 2,000-3,000 segments
– Used to guide the engine to the most optimized settings.
• Test Set – 500-1,000 segments
– Used to measure the quality of a translation engine.
– Can be run against multiple translation engines for comparison purposes.
• Preparation
– Original Source and Human Reference must be of a gold standard.
– This requires human checking and typically takes a linguist 1 day per 1,000 lines to prepare and check; for complex text, 1 day per 500 lines.
– Failure to prepare a true gold standard test set will result in metrics that cannot be trusted.
• File Formats
– All files are plain text.
– Each line should have just 1 sentence.
– Each line in the Original Source should match the corresponding line in the Human Reference and the Translation Candidate. Each line is separated by a carriage return.
– There should be exactly the same number of lines in each file.
Asia Online can provide detailed guidance and training if required.
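A quick consistency check of the plain-text files can catch most preparation mistakes before any metric is run. A minimal sketch (the labels and dictionary layout are illustrative, not part of Language Studio™):

```python
def check_parallel(named_line_lists):
    """Verify that test-set files are parallel: every file has the same
    number of lines and no line (segment) is empty.
    named_line_lists maps a label to that file's lines, e.g.
    {"source": [...], "reference": [...], "candidate": [...]}.
    Returns the segment count if everything is consistent."""
    n = None
    for name, lines in named_line_lists.items():
        if n is None:
            n = len(lines)
        elif len(lines) != n:
            raise ValueError(f"{name}: {len(lines)} lines, expected {n}")
        for i, line in enumerate(lines, 1):
            if not line.strip():
                raise ValueError(f"{name}: empty segment at line {i}")
    return n

# Example with in-memory data; with real files you would pass
# {name: open(path, encoding="utf-8").read().splitlines() for ...}.
print(check_parallel({
    "source": ["Hoy es viernes.", "Hace frío."],
    "reference": ["Today is Friday.", "The weather is cold."],
}))
```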
Tuning and Test Set Data
• Each line must be a single sentence only.
• Each line should be an exact translation.
– Not a summary or partial translation.
– There should not be any extra information or phrases in the Original Source that are not in the Human Reference, and vice versa.
– There should be the same general word order between the Original Source and the Human Reference.
S: Hoy es viernes, el cielo es azul y hace frío.
R: Today is Friday, the sky is blue and the weather is cold. (Good: will score well.)
R: The sky is blue, the weather is cold and today is Friday. (Not as good: will not score as well.)
– Scores are calculated not just using correct words, but words in sequence.
• A different word sequence from the Original Source to the Human Reference will result in a lower score.
• This is not about writer discretion to determine different word orders; this is about system accuracy. If it is accurate to have the same word order, then the reference should show the same word order. With some languages this is not possible, but the general word order, such as in lists, should still be adhered to.
Warning Signs
Some simply don’t know how to measure properly. Some don’t want to measure properly.
Red flags for detecting when MT has been measured incorrectly
• Whenever a BLEU score is too high (over 75):
– It is possible, but unusual, and should be carefully scrutinized.
– A typical human translator will rarely score above 75.
– Claims of scores in the 90s are highly suspect and almost always a sign of incorrect measurement.
– Anyone who says “I got 99.x% accuracy” or similar is not using valid metrics.
• Primary Causes
– Training data contains the Tuning Set / Test Set, or data that is very similar.
– Improvements were focused specifically on test set issues and not general engine issues.
– Test set was not blind and the MT vendor adjusted the engine or data to score better.
– Sample size very small (< 1,000 segments).
– Segments too short in length.
– Highly repetitive segments.
– Wrong file was used in metrics.
– Output was modified by a human.
– Made-up metrics.
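The first cause, test data leaking into the training corpus, is also the easiest to check mechanically. A rough sketch (exact-match only; catching near-duplicates would need fuzzier matching):

```python
def contamination_rate(test_sources, training_sources):
    """Fraction of test segments that appear verbatim in the training data.
    Anything above zero means scores on this test set are inflated."""
    train = {s.strip().lower() for s in training_sources}
    hits = sum(1 for s in test_sources if s.strip().lower() in train)
    return hits / len(test_sources)

training = ["the cat sat on the mat", "click the Start button"]
test = ["Click the Start button", "select the File menu"]
print(contamination_rate(test, training))  # 1 of 2 segments leaked: 0.5
```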
Different SMT Data Approaches
Dirty Data SMT Model
• Data
– Gathered from as many sources as possible.
– Domain of knowledge does not matter.
– Data quality is not important.
– Data quantity is important.
• Theory
– Good data will be more statistically relevant.
Clean Data SMT Model
• Data
– Gathered from a small number of trusted quality sources.
– Domain of knowledge must match the target.
– Data quality is very important.
– Data quantity is less important.
• Theory
– Bad or undesirable patterns cannot be learned if they don’t exist in the data.
Quality Data Makes A Difference
• Clean and Consistent Data: A statistical engine learns from the data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training.
• Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
• Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the system.
• Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to current style.
Rock Concert Audience Evolution
[Photos: rock concert audiences in the 1960s, 1980s, 1990s and 2012]
Controlling Style and Grammar
Different needs for every customer
• Translated text can be stylized based on the style of the monolingual data.
• Bilingual data (millions of EN-ES sentence pairs) provides the possible vocabulary.
• Monolingual data provides the writing style and grammar, e.g.:
– Business news: The Economist, New York Times, Forbes.
– Children’s books: Harry Potter, Rupert the Bear, Famous Five.
Example (newspaper article):
• Spanish original before translation:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
• Translated in the style of business news:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
• Translated in the style of children’s books:
A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.
All MT Engines are Not Equal
How do you pay post-editors fairly if each engine is different? The user needs tools for:
• Quality metrics
– Automated
– Human
• Confidence scores
– Scores on a 0-100 scale.
– Can be mapped to fuzzy TM match equivalents.
• Post Edit Quality Analysis
– After editing is complete, or even while editing is in progress, effort can be easily measured.
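One way such a mapping can work is sketched below; the bands and discount factors are hypothetical examples, not Asia Online’s actual grid:

```python
# Hypothetical payment bands, analogous to common TM fuzzy-match grids.
# (threshold, factor): a confidence at or above the threshold pays
# factor * the full per-word translation rate.
BANDS = [
    (95, 0.30),  # treat like a 95-100% fuzzy match: 30% of full rate
    (85, 0.50),  # 85-94%: 50% of full rate
    (75, 0.70),  # 75-84%: 70% of full rate
    (0, 1.00),   # below 75%: pay the full translation rate
]

def post_edit_rate(confidence, full_rate=0.15):
    """Map an MT confidence score (0-100) to a per-word post-editing rate."""
    for threshold, factor in BANDS:
        if confidence >= threshold:
            return round(full_rate * factor, 4)

print(post_edit_rate(97), post_edit_rate(80), post_edit_rate(50))
```

In practice each LSP negotiates its own bands, and the mapping should be re-validated whenever the engine is retrained, since a better engine shifts more segments into the cheaper bands.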
Post Editing Investment
• Training of post-editors – new skills
– MT post editing is different from HT post editing.
• Different error patterns and different ways to resolve issues.
• Several LSPs have now created their own e-learning courses for post editors.
• These include beginner, intermediate and advanced level courses.
• 3 Kinds of Post Editors
– Monolingual Post Editors:
• Experts in the domain, but not bilingual.
• With a mature engine, this approach will often deliver the best, most natural sounding results.
– Professional Bilingual MT Post Editors:
• Often with domain expertise, these editors have been trained to understand issues with MT and not only correct the error in the sentence, but work to create rules for the MT engine to follow.
– Early Career Post Editors:
• Editing work only, focused on corrections.
Measurement will be Essential
“The understanding of positive change is only possible when you understand the current system in terms of efficiency. …
Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements; otherwise conjecture and subjectivity can steer efforts in the wrong direction.”
– Kevin Nelson, Managing Director, Omnilingua Worldwide
Reality for LSPs
• How Omnilingua Measures Quality
– Triangulate to find the data.
– Raw MT J2450 vs. historical human quality J2450.
– Time study measurements.
– OmniMT EffortScore™
• Everything must be measured by effort first
– All other metrics support effort metrics.
– Productivity is key.
∆ Effort > MT System Cost + Value Chain Sharing
SAE J2450
• Built as a Human Assessment System:
– Provides 7 defined and actionable error classifications.
– 2 severity levels to identify serious and minor errors.
• Provides a Measurement Score Between 0 and 1:
– A lower score indicates fewer errors.
– The objective is to achieve a score as close to 0 (no errors/issues) as possible.
• Provides Scores at Multiple Levels:
– Composite scores across an entire set of data.
– Scores for logical units such as sentences and paragraphs.
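The composite score is essentially weighted error points divided by word count. A sketch of that calculation; the category weights below follow commonly published SAE J2450 values but should be verified against the standard itself before use:

```python
# Severity weights per J2450 category as (serious, minor) points.
WEIGHTS = {
    "wrong_term": (5, 2), "syntactic": (4, 2), "omission": (4, 2),
    "word_structure": (4, 2), "misspelling": (3, 1),
    "punctuation": (2, 1), "miscellaneous": (3, 1),
}

def j2450_score(errors, word_count):
    """errors: list of (category, 'serious' | 'minor') tuples found in review.
    Returns weighted error points per word; 0.0 means no errors."""
    points = sum(WEIGHTS[cat][0 if sev == "serious" else 1]
                 for cat, sev in errors)
    return points / word_count

errors = [("wrong_term", "serious"), ("punctuation", "minor")]
print(j2450_score(errors, 100))  # (5 + 1) / 100 = 0.06
```

The same function works at any level: pass one sentence’s errors and word count for a sentence score, or the whole data set’s for a composite score.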
Comparing MT Systems: Omnilingua SAE J2450
Asia Online vs. Competing MT System, by factor:
• Total Raw J2450 Errors – 2x fewer
• Raw J2450 Score – 2x better
• Total PE J2450 Errors – 5.3x fewer
• PE J2450 Score – 4.8x better
• PE Rate – 32% faster
Case Study 1: Small Project
• LSP: a mid-sized European LSP.
• First engine – customized, without any additional engine feedback.
• Domain: IT / Engineering
• Words: 25,000
• Measurements: cost, timeframe, quality.
• Quality of client delivery with the machine translation + human approach must be the same as or better than a human-only approach.
Comparison of Time and Cost (25,000 words)
• Human-only workflow: translation 10 days, editing 3 days, proofing 2 days – 15 days total.
• MT + human workflow: translation 1 day, post editing 5 days, proofing 2 days – 8 days total.
• Result: 46% time saving (7 days) and a 27% cost saving.
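The headline time figure follows directly from the stage durations; a quick check of the arithmetic:

```python
# Stage durations in days, as given for the 25,000-word project.
human = {"translation": 10, "editing": 3, "proofing": 2}   # 15 days total
mt = {"translation": 1, "post_editing": 5, "proofing": 2}  # 8 days total

days_saved = sum(human.values()) - sum(mt.values())
saving = days_saved / sum(human.values())
print(days_saved, f"{saving:.0%}")  # 7 days saved, i.e. 7/15 of the schedule
```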
Business Model Analysis
[Chart: per-word revenue split into cost and margin at the same price to the client. Human translation: translation, editing and proofing costs leave a smaller margin. Machine translation + post editing and proofing: lower production costs leave a substantially larger margin.]
Case Study 2: Large Project
• LSP: Sajan
• End Client Profile:
– Large global multinational corporation in the IT domain.
– Has its own proprietary MT system, developed over many years.
• Project Goals:
– Eliminate the need for full translation and limit it to MT + post-editing.
• Language Pairs:
– English -> Simplified Chinese
– English -> European Spanish
– English -> European French
• Domain: IT
• 2nd Iteration of Customized Engine:
– Customized initial engine, followed by an incremental improvement based on client feedback.
• Data:
– Client provided ~3,000,000 phrase pairs.
– 26% were rejected in the cleaning process as unsuitable for SMT training.
• Measurements: cost, timeframe, quality.
Project Results
• Quality
– Client performed their own metrics.
– Asia Online Language Studio™ was considerably better than the client’s own MT solution.
– Significant quality improvement after providing feedback – 65 BLEU score.
– Chinese scored better than first-pass human translation as per the client’s feedback, and was faster and easier to edit.
• Result
– Client extremely impressed with the result, especially when compared to the output of their own MT engine.
– Client has commissioned Sajan to work with more languages.
• 70% time saving, 60% cost saving.
LRC has uploaded Sajan’s slides and video presentation from the recent LRC conference:
Slides: http://bit.ly/r6BPkT
Video: http://bit.ly/trsyhg
Small LSP Taking On the Big Guns
• Small/mid-sized LSP, with offices in the US, Thailand, Singapore, Argentina, Australia and Colombia.
• Competitors – SDL/Language Weaver and TransPerfect
• Projects:
– Travelocity: major travel booking site, wanting to expand its global presence for hotel reservations.
– HolidayCheck: major travel review site, wanting to expand its global presence for hotel reviews.
– Sawadee.com: small travel booking site. Had confidence due to other travel proof points.
• Results:
– Travelocity: won project for 22 language pairs.
– HolidayCheck: won project for 11 language pairs, replacing already-installed competing technology that had not delivered as promised.
– Sawadee.com: won project for 2 language pairs.
• Beat 2 of the largest global LSPs
– Built an initial engine to demonstrate quality capabilities.
– Reused the various engines created for multiple clients.
– Worked on glossaries, non-translatable terms and data cleaning.
– A focus on quality, not on generating more human work.
– Provided a complete solution:
• MT, human translation, editing and copywriting.
• Applying the right level of skill to the right task – kept costs down.
• Workflow management and integration.
• Project management.
• Quality management.
LSPs Must Have Complete Control
• Tools to analyze and refine the quality of training data and other linguistic assets:
– Bilingual data
– Monolingual data
• Tools to rapidly identify errors and make corrections.
• Tools to measure and identify error patterns:
– Human metrics
– Machine metrics
• Tools to manage and gather corrective feedback.
ROI Calculator – Parameters
• 10 projects in the same domain: medical, DE-EN.
• 1.85 million words total.
• Below 85% fuzzy match sent to MT.
• Review of fuzzy match segments: $0.05
• Human translation: $0.15
• Editing / proofing human translation: $0.05
• Editing / proofing MT: $0.07-$0.05
• Human only: 5 translators, 1 editor.
• MT + human: 3 proof readers.
• Cost to client: $0.31
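A sketch of the kind of per-word cost comparison such a calculator performs. The 30% fuzzy-match share and the $0.06 midpoint MT post-editing rate are assumptions for illustration, and engine customization costs are not included:

```python
WORDS = 1_850_000
FUZZY_SHARE = 0.30          # assumed: 30% of words are >= 85% fuzzy matches

RATES = {                   # per-word rates from the slide parameters
    "fuzzy_review": 0.05,   # review of >= 85% fuzzy-match segments
    "human_translation": 0.15,
    "proof_ht": 0.05,
    "proof_mt": 0.06,       # assumed midpoint of the $0.07-$0.05 range
}

fuzzy_words = WORDS * FUZZY_SHARE
new_words = WORDS - fuzzy_words

# Human-only: fuzzy review, plus full translation and proofing of new words.
human_cost = (fuzzy_words * RATES["fuzzy_review"]
              + new_words * (RATES["human_translation"] + RATES["proof_ht"]))

# MT + human: fuzzy review, plus post-editing of machine-translated new words.
mt_cost = (fuzzy_words * RATES["fuzzy_review"]
           + new_words * RATES["proof_mt"])

print(f"human-only ${human_cost:,.0f}  MT+PE ${mt_cost:,.0f}  "
      f"saving {(human_cost - mt_cost) / human_cost:.0%}")
```

Under these assumptions the MT + post-editing workflow costs a fraction of the human-only workflow; the real calculator would also factor in engine fees, editor counts and elapsed time.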
ROI Calculator: Cost Comparison
Cost Savings / Increases
Elapsed Time
Person Time
Margin
Dion Wiggins
Chief Executive Officer
[email protected]
How to Measure the Success of Machine Translation