How to Measure the Success of Machine Translation

Dion Wiggins, Chief Executive Officer
[email protected]
Feb 25, 2016
Copyright © 2012, Asia Online Pte Ltd
The World We Live In

Every day 15 petabytes of new information are generated.
• A petabyte is one million gigabytes.
• 8x more than the information stored in all US libraries.
• The equivalent of 20 million four-drawer filing cabinets filled with text.
• In 2012 we have 5 times more data stored than we did in 2008.
• The volume of data is growing exponentially and is expected to increase 20-fold by 2020.
• We now have access to more data than at any time in human history.
The World We Live In
• We live in a world that is increasingly instrumented and interconnected.
• The number of “smart” devices is growing every day, and the volume of data they produce is growing exponentially – doubling every 18 months.
• All these devices create new demand for access to information – access now, on demand, in real time.

By 2015 there will be more than 15 billion devices connected to the internet.
The World We Live In
• Google’s message to the market has long been that its business is making the world’s information searchable and that MT is part of that mission.
Google translates more in one day than all human translators do in one year.
The World We Live In

How much new text information should be translated?
• Common Sense Advisory calculates:
  – US$31.4 billion earned for language services in 2011
  – Divide by 365 days
  – Divide by 10 cents per word
• LSPs translate a mere 0.00000067% of the text information created every day.
• Even if only 1% of new text information created each day should be translated, that still means only 0.000067% is translated by LSPs.
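The arithmetic behind these percentages can be sketched as follows. The revenue figure and the 10-cents-per-word rate are the slide's stated assumptions, not exact Common Sense Advisory data:

```python
# Rough re-derivation of the figures above; the revenue and per-word
# rate are the slide's assumptions, not exact CSA data.
revenue_usd = 31.4e9        # language services revenue, 2011
price_per_word = 0.10       # assumed average rate of 10 cents per word

words_per_year = revenue_usd / price_per_word
words_per_day = words_per_year / 365
print(f"LSPs translate roughly {words_per_day:,.0f} words per day")
```

That works out to under a billion words per day, against the petabytes of new text created daily — which is where the tiny percentages come from.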
The World We Live In
Translation demand is increasing.
Translator supply is decreasing.
The Impact of a Translator Shortage
• It is already clear that, at 2,000-3,000 words per day per translator, demand is many multiples of supply.
• LSPs are having trouble finding qualified and skilled translators
  – In part due to lower rates in the market and more competition for resources
• Wave of new LSPs and translators
  – Many will try to capitalize on the market opportunity created by the translator shortage, but will deliver sub-standard services
  – Lack of experience – both new LSPs and translators
  – Lower-quality translations will become more commonplace

Skilled translator supply is shrinking.
Expanding the Reach of Translation

[Chart: content types vs. example word volumes, from corporate content (partly multilingual, human translation, existing markets worth $31.4B) down to user generated content (machine translation, new markets):]
• Corporate – Corporate Brochures (~2,000 words)
• Products – Product Brochures (~10,000 words)
• User Interface – Software Products (~50,000 words)
• User Documentation – Manuals / Online Help (~200,000 words)
• Enterprise Information – HR / Training / Reports (~500,000 words)
• Communications – Email / IM (~10,000,000 words)
• Support / Knowledge Base – Call Center / Help Desk (20,000,000+ words)
• User Generated Content – Blogs / Reviews (50,000,000+ words)
Interesting Question

Vinod Bhaskar: Machine vs. Human Translation
Are machine translations gaining ground? Can they put translators out of circulation like cars did to horses and cart drivers?
Evolution of the Automotive Industry

Early Innovation → Production Line → Mass Production, spawning:
• Fuel and Energy Industry
• Transportation
• Service Industries
• Research and Development
• Customization and Parts
Translation is Evolving Along a Similar Path

Early Innovation: human translator, Single Language Vendors
Production Line: translation memory, dictionary / glossary, Multi Language Vendors
Service Industries:
• Specialization
• Editing / Proofing
• Translation / Localization
• Global work groups
• Quality assurance
• Managed crowd and community
• Professionally managed amateur workforce
Research and Development:
• Research funding & grants
• Natural Language Processing
• Language technologies – search, voice, text, etc.
• Preservation of dying languages
• Automated & enhanced post-editing
• Euro-to-Asian language pair automation
Customization:
• Custom engine development
• Mass translation at high quality
• Newly justifiable business models
Communications:
• Data, reports
• Phone, mobile data
• Internet, broadband
• Website translation
Translation for the Masses:
• World Lingo
• Google Translate
• Yahoo Babelfish
Technology Industry:
• Mass translation
• Knowledge dissemination
• Data automation
• Information processing
• Multilingual software
Evolution of Machine Translation Quality

[Chart: quality over time, from Experimental through Gist to Near Human, against a Good Enough Threshold. Key annotations:]
• Early RBMT improved rapidly as new techniques were discovered (Babelfish).
• Quality plateaued as RBMT reached its limits in many languages – only marginal improvement.
• 9/11 → research funding; processors became powerful enough; large volumes of digital data became available.
• Early SMT vendors emerged; Google switched from Systran to SMT.
• Google drives MT acceptance; businesses start to consider MT; LSPs start to adopt MT; new skills develop in editing MT.
• New techniques mature; hybrid MT platforms cross the Good Enough Threshold.
Machine Translation Hype Cycle*

[Chart: visibility vs. time/maturity along the hype-cycle curve – Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity. Milestones:]
• 1947-1954: The “Translation” memorandum; Georgetown experiment
• 1966: ALPAC report
• 1990: IBM Research; move from mainframe to PC
• 2001: 9/11; Babelfish
• 2007: Google switches to SMT; Moses; early LSP adopters; notable quality improvement
• 2011: Microsoft & Google announce paid API; near human quality examples emerge
• 2015: Mainstream LSP use

*Not an official Gartner Hype Cycle
Top 5 Reasons For Not Adopting MT

1. Perception of Quality
   – Many believe Google Translate is as good as it gets / state of the art.
   – This is true for scale, but not for quality.
2. Perception of Quality
   – Perfect quality is expected from the outset, and tests using Google or other out-of-the-box translation tools are disappointing.
   – When combined with #1, other MT is quickly ruled out as an option.
3. Perception of Quality
   – The opposite of #2: human resistance to MT. The “a machine will never be able to deliver human quality” mindset.
4. Perception of Quality
   – Few understand that out-of-the-box or free MT and customized MT are different.
   – They don’t see why they should pay for commercial MT, as quality is perceived to be the same.
5. Perception of Quality
   – Quality is not good enough as raw MT output.
   – The equation is not MT OR Human. It is MT AND Human.
What Is Different This Time Around?

Machine Translation (MT): 50 Years of eMpTy Promises

Q: Why does an industry that has spent 50 years failing to deliver on its promises still exist?
A: An infinite demand – a well-defined and growing problem that has always been looking for a solution. What was missing was QUALITY.

Definition of Quality: Whatever the customer says it is!
Quality Depends on the Purpose

• Document Search and Retrieval
  – Purpose: To find and locate information
  – Quality: Understandable; technical terms key
  – Technique: Raw MT + terminology work
• Knowledge Base
  – Purpose: To allow self-support via the web
  – Quality: Understandable; can follow the directions provided
  – Technique: MT & human for key documents
• Search Engine Optimization (SEO)
  – Purpose: To draw users to the site
  – Quality: Higher quality, near human
  – Technique: MT + human (student, monolingual)
• Magazine Publication
  – Purpose: To publish in a print magazine
  – Quality: Human quality
  – Technique: MT + human (domain specialist, bilingual)

Establish Clear Quality Goals
Step 1 – Define the purpose
Step 2 – Determine the appropriate quality level
Reality Check – What Do You Really Get?

[Chart: translation speed in words per day per translator – human translation (~3,000 words/day) vs. typical MT + post-editing vs. fastest MT + post-editing (28,000 words/day*).]

*Fastest MT + post-editing speed reported by clients.

The average person reads 200-250 words per minute – 96,000-120,000 words in 8 hours, roughly 35 times faster than human translation.
Success Factors: Understanding Return On Investment

• Cost – Did we lower overall project costs?
• Time – Did we deliver more quickly while achieving the desired quality?
• Resources – Were we able to do the job with fewer resources?
• Quality – Did we deliver a quality level that met or exceeded a human-only approach?
• Profit – Less important in early projects, but the key reason we are in business.
Success Factors: Understanding Return On Investment

• Customer – Is the customer satisfied? Have we met or exceeded their quality requirements?
• Asset Building – Did we expand our linguistic assets? If we did the same kind of job again, would it be easier?
• New Business – What business opportunities have been created that would not otherwise have been possible? What barriers have been removed by leveraging MT?
Objective Measurement is Essential

Targets should be defined, set and managed from the outset.
Why Measure?

“The understanding of positive change is only possible when you understand the current system in terms of efficiency. …
Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements, otherwise conjecture and subjectivity can steer efforts in the wrong direction.”
– Kevin Nelson, Managing Director, Omnilingua Worldwide
What to Measure?

• Automated metrics
  – Useful to some degree, but not enough on their own.
• Post-editor feedback
  – Useful for sentiment, but not a reliable metric. When compared to technical metrics, reality is often very different.
• Number of errors
  – Useful, but can be misleading. The complexity of error correction is often overlooked.
• Time to correct
  – On its own useful for productivity metrics, but not enough when more depth and understanding is required.
• Difference between projects
  – Combined, the above allow an understanding of each project, but are much more valuable when compared over several similar projects.

Objective measurement is the only means to understand.
Rapid MT Quality Assessment

Automated:
• BLEU is the most commonly used metric: “… the closer the machine translation is to a professional human translation, the better it is.”
• METEOR, TERp and many others are in development.
• Limited, but still useful for MT engine development if properly used.

Human Assessments:
• Long-term consistency, repeatability and objectivity are important.
• Butler Hill Group has developed a protocol that is widely accepted and used.
• Can be based on error categorization such as SAE J2450.
• Should be used together with automated metrics.
• Will focus more on post-editing characteristics in future.
Different Automated Metrics

All four metrics compare a machine translation to human translations.

• BLEU (Bilingual Evaluation Understudy)
  – BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular.
  – Scores are calculated for individual translated segments – generally sentences – by comparing them with a set of good-quality reference translations.
  – Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality.
  – Intelligibility and grammatical correctness are not taken into account.
  – BLEU is designed to approximate human judgement at a corpus level, and performs badly if used to evaluate the quality of individual sentences.
  – More: http://en.wikipedia.org/wiki/BLEU
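The clipped n-gram precision and brevity penalty that BLEU combines can be illustrated with a deliberately simplified sketch. Real BLEU uses n-grams up to 4 and pools counts over a whole corpus; this toy version stops at bigrams and scores a single sentence:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, max_n=2):
    """Clipped n-gram precision (n = 1..max_n) combined by a geometric
    mean, times a brevity penalty that punishes short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Each candidate n-gram only counts as often as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * geo_mean
```

An exact match scores 1.0; a too-short candidate is discounted by the brevity penalty even if every n-gram it contains matches.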
• NIST
  – The name comes from the US National Institute of Standards and Technology.
  – It is based on the BLEU metric, but with some alterations:
    • Where BLEU simply calculates n-gram precision, giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.
    • NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
  – More: http://en.wikipedia.org/wiki/NIST_(metric)
Different Automated Metrics

• F-Measure (F1 Score or F-Score)
  – In statistics, the F-Measure is a measure of a test’s accuracy.
  – It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned.
  – The F-Measure can be interpreted as a weighted average of precision and recall, where a score reaches its best value at 1 and worst at 0.
  – More: http://en.wikipedia.org/wiki/F1_Score
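The precision/recall/F1 definitions above translate directly into code. Bag-of-words overlap is one common simplification for translation output; variants differ in how matches are counted:

```python
from collections import Counter

def f_measure(candidate_tokens, reference_tokens):
    """F1 over bag-of-words overlap between a candidate translation
    and a reference translation."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())          # correct results
    p = overlap / max(sum(cand.values()), 1)      # precision
    r = overlap / max(sum(ref.values()), 1)       # recall
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```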
• METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  – The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
  – It also has several features not found in other metrics, such as stemming and synonymy matching, along with standard exact word matching.
  – The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level.
  – This differs from BLEU, which seeks correlation at the corpus level.
  – More: http://en.wikipedia.org/wiki/METEOR
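The recall-weighted harmonic mean at the heart of METEOR can be sketched as below. The alpha=0.9 weighting mirrors the classic METEOR 9:1 recall emphasis; full METEOR additionally applies stemming, synonymy matching and a fragmentation penalty not shown here:

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall, with
    recall weighted far above precision (alpha = 0.9)."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

Note how an output with high recall but modest precision outscores the reverse, which is the intended bias.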
Copyright © 2012, Asia Online Pte Ltd
Usability and Readability Criteria

Evaluation criteria for MT output:
• Excellent (4) – Read the MT output first, then read the source text (ST). Your understanding is not improved by reading the ST, because the MT output is satisfactory and would not need to be modified (grammatically correct, proper terminology used; perhaps not stylistically perfect, but it fulfils the main objective of transferring all information accurately).
• Good (3) – Read the MT output first, then read the ST. Your understanding is not improved by reading the ST, even though the MT output contains minor grammatical mistakes. You would not need to refer to the ST to correct these mistakes.
• Medium (2) – Read the MT output first, then read the ST. Your understanding is improved by reading the ST, due to significant errors in the MT output. You would have to re-read the ST a few times to correct these errors.
• Poor (1) – Read the MT output first, then read the ST. Your understanding derives only from reading the ST, as you could not understand the MT output. It contained serious errors. You could only produce a translation by dismissing most of the MT output and/or re-translating from scratch.

Human evaluators can develop a custom error taxonomy to help identify key error patterns, or use an error taxonomy from standards such as the LISA QA Model or SAE J2450.
Sample Metric Report From Language Studio™
Multiple References Increase Scores
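The effect is mechanical: with more references, each candidate word has more chances to find a match. A sketch using clipped unigram precision (the sentences are invented examples):

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Unigram precision where each candidate word is clipped against
    the BEST count seen in ANY reference."""
    cand = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

mt = "the engine produced a fast translation"
one_ref = ["the engine generated a quick translation"]
two_refs = one_ref + ["the engine produced a fast rendering"]
print(clipped_unigram_precision(mt, one_ref))   # fewer matches
print(clipped_unigram_precision(mt, two_refs))  # more matches
```

The same candidate scores higher against two references than against one, even though the translation itself did not change — which is why reference counts must match when comparing scores.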
Why do you need rules?

Normalization: variant forms should be normalized to a single preferred term, e.g.
• 2 Port Switch
• Double Port Switch
• Dual Port Switch
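A minimal normalization rule pass might look like the following. The rule table and the preferred form are illustrative, taken from the variants above:

```python
# Hypothetical normalization rules collapsing terminology variants
# to a single preferred form before (or after) translation.
NORMALIZATION_RULES = {
    "Double Port Switch": "2 Port Switch",
    "Dual Port Switch": "2 Port Switch",
}

def normalize(text, rules=NORMALIZATION_RULES):
    for variant, preferred in rules.items():
        text = text.replace(variant, preferred)
    return text

print(normalize("Install the Dual Port Switch near the rack."))
```

Production systems would typically match on token boundaries and handle casing, but the principle is the same: one preferred term in, one preferred term out.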
Why do you need rules?

Terminology Control and Management:
• Non-translatable terms
• Glossaries, such as product names
• Job-specific preferred terminology
BLEU scores and other translation quality metrics will vary based upon:

1. The test set being measured:
   – Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The test set should be of gold-standard quality; lower-quality test set data will give a less meaningful score.
2. How many human reference translations were used:
   – If there is more than one human reference translation, the resulting BLEU score will be higher, as there are more opportunities for the machine translation to match part of a reference.
3. The complexity of the language pair:
   – Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese.
   – Typically, if the source or target language is more complex, the BLEU score will be lower.
4. The complexity of the domain:
   – A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
5. The capitalization of the segments being measured:
   – When comparing metrics, the most common form of measurement is case-insensitive. However, when publishing, case-sensitive measurement is also important and may also be used.
6. The measurement software:
   – There are many measurement tools for translation quality. Each may vary slightly with respect to how a score is calculated, or the settings of the measurement tools may not be the same.
   – The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge, which measures a variety of quality metrics.

It is clear from the above list of variations that a BLEU score by itself has no real meaning.
Metrics Will Vary – Even the Same Metrics!

What is your BLEU score?
This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked.
Basic Test Set Criteria Checklist

• Test set data should be very high quality:
  – If the test set data are of low quality, then the metric delivered cannot be relied upon.
  – Proofread the test set. Don’t just trust existing translation memory segments.
• Test set should be in domain:
  – The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not produce a useful metric.
• Test set data must not be included in the training data:
  – If you are creating an SMT engine, you must make sure that the data you are testing with (or very similar data) are not in the data that the engine was trained with. If the test data are in the training data, the scores will be artificially high and will not represent the level of quality that will be output when other data are translated.

The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.
Basic Test Set Criteria Checklist

• Test set data should be data that can be translated:
  – Test set segments should contain a minimal number of dates, times, numbers and names. While a valid part of a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus of a test set should be on words that are to be translated.
• Test set segments should be between 8 and 15 words in length:
  – Short segments will artificially raise the quality scores, as most metrics do not take segment length into account. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation so much as a 100% translation memory match. The longer the segment, the more opportunity there is for variation in what is being translated; this will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be very few.
• Test set should be at least 1,000 segments:
  – While it is possible to get a metric from a smaller test set, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there are only a small number of segments, small anomalies in one or two segments can raise or lower the test set score artificially.
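The checklist lends itself to a quick automated validation pass. The function name and thresholds below are illustrative; the 8-15 word range and 1,000-segment minimum follow the slides' recommendations, and the 5% tolerance is an assumption:

```python
def check_test_set(segments, training_segments,
                   min_len=8, max_len=15, min_count=1000, tolerance=0.05):
    """Flag test sets that break the basic criteria: segments outside
    the 8-15 word range, overlap with training data, or too few segments."""
    training = set(training_segments)
    issues = []
    bad_length = [s for s in segments
                  if not min_len <= len(s.split()) <= max_len]
    if len(bad_length) > tolerance * len(segments):
        issues.append(f"{len(bad_length)} segments outside "
                      f"{min_len}-{max_len} words")
    leaked = [s for s in segments if s in training]
    if leaked:
        issues.append(f"{len(leaked)} segments also appear in training data")
    if len(segments) < min_count:
        issues.append(f"only {len(segments)} segments (want >= {min_count})")
    return issues
```

An exact-match check only catches verbatim leakage; "very similar" training data, as the checklist warns, needs fuzzy matching beyond this sketch.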
Comparing Translation Engines: Initial Assessment Checklist

• Test set must be consistent:
  – The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines.
• Test sets must be “blind”:
  – If the MT engine has seen the test set before, or included the test set data in the training data, then the quality of the output will be artificially high and will not represent a true metric.
• Tests must be carried out transparently:
  – Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data.
  – If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email.
  – This removes any possibility of the MT vendor tampering with the output or fine-tuning the engine based on the output.

All conditions of the Basic Test Set Criteria must be met. If any condition is not met, the results of the test could be flawed and not meaningful or reliable.
Comparing Translation Engines: Initial Assessment Checklist

• Word segmentation and tokenization must be consistent:
  – If word segmentation is required (i.e. for languages such as Chinese, Japanese and Thai), then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
• Provide each MT vendor with a sample of at least 20 in-domain documents:
  – This allows each vendor to better understand the type of document and customize accordingly.
  – This sample should not be the same as the test set data.
• Test in three stages:
  – Stage 1: Starting quality without customization
  – Stage 2: Initial quality after customization
  – Stage 3: Quality after the first round of improvement
    • This should include post-editing of at least 5,000 segments, preferably 10,000.
Understanding and Comparing Improvements
The Language Studio™ 4 Step Quality Plan

1. Customize – Create a new custom engine using foundation data and your own language assets.
2. Measure – Measure the quality of the engine for rating and future improvement comparisons.
3. Improve – Provide corrective feedback, removing the potential for translation errors.
4. Manage – Manage translation projects while generating corrective data for quality improvement.
A Simple Recipe for Quality Improvement

• Asia Online develops a specific improvement roadmap for each custom engine.
  – This ensures the fastest possible development path to quality.
  – You can start from any level of data.
• The roadmap is developed based on the following:
  – Your quality goals
  – Amount of data available in the foundation engine
  – Amount of data that you can provide
  – Quality expectations are set from the outset
  – Asia Online performs the majority of the tasks; many are fully automated
The Data Path To Higher Quality

[Chart: five entry points, from highest quality (least editing) to least (most editing). Quality constantly improves; data can come from post-editing feedback on the initial custom engine.]
1. High volume, high quality translation memories; rich glossaries; large high quality monolingual data
2. Some high quality translation memories; some high quality monolingual data; glossaries
3. Limited high quality translation memories; some high quality monolingual data; glossaries
4. Limited high quality monolingual data; glossaries
5. Limited high quality monolingual data

You can start your custom engine with just monolingual data, improving over time as data becomes available.
Quality requires an understanding of the data. There is no exception to this rule.
High Quality: Human Translation Project vs. Machine Translation Project

When preparing for a high quality human translation project, many core steps are performed to ensure that the writing style and vocabulary are designed for a customer's target audience. Almost identical information and data are required in order to customize a high quality machine translation system.

Human Only:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide • Quality Requirements • Translate • Edit • Proof • Project Management

MT + Human:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide** • Quality Requirements • Customize MT • Translate • Edit • Proof • Project Management

MT Only:
• Terminology Definition • Non-Translatable Terms • Historical Translations • Style Guide** • Quality Requirements • Customize MT • Translate • Edit • Proof • Project Management
Ability to Improve is More Important than Initial Translation Engine Quality

Language Studio™ is designed to deliver translated output that requires the least amount of editing in order to publish.
• The initial scores of a machine translation engine, while indicative of quality, should be viewed as a starting point for rapid improvement.
• Depending on the volume and quality of data provided to the SMT vendor to learn from, the quality may be lower or higher.
• Frequently a new translation engine will have gaps in vocabulary and grammatical coverage.
• All MT vendors should offer a clear improvement path. Most do not.
  – Many simply tell you to post-edit and add data… or worse, to get more data from other non-trusted sources.
  – Most do not tell you how much data is required.
  – Many MT vendors do not improve at all, or improve very little, unless huge volumes of data are added to the initial training data.
Data Required to Improve Quality

• Competitors require 20% or more additional data relative to the initial training data to show notable improvements.
  – This could take years for most LSPs.
  – This is the dirty little secret of the Dirty Data SMT approach that is rarely acknowledged.
• Asia Online has reference customers that have seen notable improvements with just one day's worth of post-editing.
  – Only possible with Clean Data SMT.

[Chart: typical Dirty Data SMT engines have between 2 million and 20 million sentences in the initial training data; improvement requires 20% or more on top (200,000 to 2 million additional sentences), whereas improvements based on edits of < 0.1% of the training data can be applied daily.]

With Language Studio™, Every Edit Counts
Results of Refinements

[Chart 1: raw MT quality vs. post-editing effort across engine learning iterations 1-6, against a publication quality target.]
Post-editing effort reduces over time. The post-editing and cleanup effort gets easier as the MT engine improves. Initial efforts should focus on error analysis and correction of a representative sample data set. Each successive project should get easier and more efficient.

[Chart 2: cost per word of post-editing (human translation) across engine learning iterations 1-6.]
MT learns from post-editing feedback, and translation quality constantly improves. The cost of post-editing progressively reduces as MT quality increases after each engine learning iteration.

GOAL: Progressively develop engine quality to a level that exceeds the equivalent productivity of an 85% translation memory fuzzy match.
Comparing Translation Engines: Translation Quality Improvement Assessment
• Comparing Versions:
– When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
• Comparing Machine Translation Vendors:
– When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that the vendor did not adapt its system to suit the test set and in doing so deliver an artificially high score. The proofread test set data may also have been added to the engine’s training data, which will likewise bias the score.
• Use a Second Blind Test Set:
– As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required.
– When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.
How Test Sets are Measured
• Original Source (S): the original sentences that are to be translated.
• Human Reference (R): the gold standard of what a high quality human translation would look like.
• Translation Candidate (C): the translated output from the machine translation system that you are comparing.
Workflow: machine translate the Original Source, then compare the Translation Candidate against the Human Reference and score it.
Note: multiple machine translation candidates can be scored at one time to compare against each other, e.g. Asia Online, Google, Systran.
3 Measurement Tools:
• Human Quality Assessment
• Automated Quality Metrics
• Sentence Evaluation
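The compare-and-score step is typically automated with a metric such as BLEU. As an illustration only (real evaluations should use a standard tool such as sacreBLEU), here is a minimal single-reference BLEU sketch showing how a candidate and a reference are compared n-gram by n-gram:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of modified
    n-gram precisions (n = 1..4) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "Today is Friday , the sky is blue and the weather is cold ."
print(bleu(ref, ref))  # identical output scores 1.0
print(bleu("The sky is blue , the weather is cold and today is Friday .", ref))
```

Because the score is built from n-gram overlap, a candidate that reorders an otherwise correct sentence loses higher-order n-gram matches and scores lower. This is also why the Human Reference should mirror the source word order where accuracy allows.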
Defining a Tuning Set and Test Set
• Tuning Set – 2,000-3,000 segments
– Used to guide the engine to the most optimized settings.
• Test Set – 500-1,000 segments
– Used to measure the quality of a translation engine.
– Can be run against multiple translation engines for comparison purposes.
• Preparation
– Original Source and Human Reference must be of a gold standard.
– This requires human checking and typically takes a linguist 1 day per 1,000 lines to prepare and check; for complex text, 1 day per 500 lines.
– Failure to prepare a true gold standard test set will result in metrics that cannot be trusted.
• File Formats
– All files are plain text.
– Each line should have just 1 sentence.
– Each line in the Original Source should match the corresponding line in the Human Reference and the Translation Candidate. Each line is separated by a carriage return.
– There should be exactly the same number of lines in each file.
Asia Online can provide detailed guidance and training if required.
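A quick consistency check of the plain-text files can catch most preparation mistakes before any metric is run. A minimal sketch (the labels and dictionary layout are illustrative, not part of Language Studio™):

```python
def check_parallel(named_line_lists):
    """Verify that test-set files are parallel: every file has the same
    number of lines and no line (segment) is empty.
    named_line_lists maps a label to that file's lines, e.g.
    {"source": [...], "reference": [...], "candidate": [...]}.
    Returns the segment count if everything is consistent."""
    n = None
    for name, lines in named_line_lists.items():
        if n is None:
            n = len(lines)
        elif len(lines) != n:
            raise ValueError(f"{name}: {len(lines)} lines, expected {n}")
        for i, line in enumerate(lines, 1):
            if not line.strip():
                raise ValueError(f"{name}: empty segment at line {i}")
    return n

# Example with in-memory data; with real files you would pass
# {name: open(path, encoding="utf-8").read().splitlines() for ...}.
print(check_parallel({
    "source": ["Hoy es viernes.", "Hace frío."],
    "reference": ["Today is Friday.", "The weather is cold."],
}))
```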
Tuning and Test Set Data
• Each line must be a single sentence only.
• Each line should be an exact translation.
– Not a summary or partial translation.
– There should not be any extra information or phrases in the Original Source that are not in the Human Reference, and vice versa.
– There should be the same general word order between the Original Source and the Human Reference.
S: Hoy es viernes, el cielo es azul y hace frío.
R: Today is Friday, the sky is blue and the weather is cold. (Good: will score well.)
R: The sky is blue, the weather is cold and today is Friday. (Not as good: will not score as well.)
– Scores are calculated not just using correct words, but words in sequence.
• A different word sequence from the Original Source to the Human Reference will result in a lower score.
• This is not about writer discretion to determine different word orders; this is about system accuracy. If it is accurate to have the same word order, then the reference should show the same word order. With some languages this is not possible, but the general word order, such as in lists, should still be adhered to.
Warning Signs
Some simply don’t know how to measure properly. Some don’t want to measure properly.
Red flags for detecting when MT has been measured incorrectly
• Whenever a BLEU score is too high (over 75):
– It is possible, but unusual, and should be carefully scrutinized.
– A typical human translator will rarely score above 75.
– Claims of scores in the 90s are highly suspect and almost always a sign of incorrect measurement.
– Anyone who says “I got 99.x% accuracy” or similar is not using valid metrics.
• Primary Causes
– Training data contains the Tuning Set / Test Set, or data that is very similar.
– Improvements were focused specifically on test set issues and not general engine issues.
– Test set was not blind and the MT vendor adjusted the engine or data to score better.
– Sample size very small (< 1,000 segments).
– Segments too short in length.
– Highly repetitive segments.
– Wrong file was used in metrics.
– Output was modified by a human.
– Made-up metrics.
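The first cause, test data leaking into the training corpus, is also the easiest to check mechanically. A rough sketch (exact-match only; catching near-duplicates would need fuzzier matching):

```python
def contamination_rate(test_sources, training_sources):
    """Fraction of test segments that appear verbatim in the training data.
    Anything above zero means scores on this test set are inflated."""
    train = {s.strip().lower() for s in training_sources}
    hits = sum(1 for s in test_sources if s.strip().lower() in train)
    return hits / len(test_sources)

training = ["the cat sat on the mat", "click the Start button"]
test = ["Click the Start button", "select the File menu"]
print(contamination_rate(test, training))  # 1 of 2 segments leaked: 0.5
```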
Different SMT Data Approaches
Dirty Data SMT Model
• Data
– Gathered from as many sources as possible.
– Domain of knowledge does not matter.
– Data quality is not important.
– Data quantity is important.
• Theory
– Good data will be more statistically relevant.
Clean Data SMT Model
• Data
– Gathered from a small number of trusted quality sources.
– Domain of knowledge must match the target.
– Data quality is very important.
– Data quantity is less important.
• Theory
– Bad or undesirable patterns cannot be learned if they don’t exist in the data.
Quality Data Makes A Difference
• Clean and Consistent Data: A statistical engine learns from the data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training.
• Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
• Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the system.
• Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to current style.
Rock Concert Audience Evolution
[Photos: rock concert audiences in the 1960s, 1980s, 1990s and 2012]
Controlling Style and Grammar
Different needs for every customer
• Translated text can be stylized based on the style of the monolingual data.
• Bilingual data (millions of EN-ES sentence pairs) provides the possible vocabulary.
• Monolingual data provides the writing style and grammar, e.g.:
– Business news: The Economist, New York Times, Forbes.
– Children’s books: Harry Potter, Rupert the Bear, Famous Five.
Example (newspaper article):
• Spanish original before translation:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
• Translated in the style of business news:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
• Translated in the style of children’s books:
A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.
All MT Engines are Not Equal
How do you pay post-editors fairly if each engine is different? The user needs tools for:
• Quality metrics
– Automated
– Human
• Confidence scores
– Scores on a 0-100 scale.
– Can be mapped to fuzzy TM match equivalents.
• Post Edit Quality Analysis
– After editing is complete, or even while editing is in progress, effort can be easily measured.
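One way such a mapping can work is sketched below; the bands and discount factors are hypothetical examples, not Asia Online’s actual grid:

```python
# Hypothetical payment bands, analogous to common TM fuzzy-match grids.
# (threshold, factor): a confidence at or above the threshold pays
# factor * the full per-word translation rate.
BANDS = [
    (95, 0.30),  # treat like a 95-100% fuzzy match: 30% of full rate
    (85, 0.50),  # 85-94%: 50% of full rate
    (75, 0.70),  # 75-84%: 70% of full rate
    (0, 1.00),   # below 75%: pay the full translation rate
]

def post_edit_rate(confidence, full_rate=0.15):
    """Map an MT confidence score (0-100) to a per-word post-editing rate."""
    for threshold, factor in BANDS:
        if confidence >= threshold:
            return round(full_rate * factor, 4)

print(post_edit_rate(97), post_edit_rate(80), post_edit_rate(50))
```

In practice each LSP negotiates its own bands, and the mapping should be re-validated whenever the engine is retrained, since a better engine shifts more segments into the cheaper bands.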
Post Editing Investment
• Training of post-editors – new skills
– MT post editing is different from HT post editing.
• Different error patterns and different ways to resolve issues.
• Several LSPs have now created their own e-learning courses for post editors.
• These include beginner, intermediate and advanced level courses.
• 3 Kinds of Post Editors
– Monolingual Post Editors:
• Experts in the domain, but not bilingual.
• With a mature engine, this approach will often deliver the best, most natural sounding results.
– Professional Bilingual MT Post Editors:
• Often with domain expertise, these editors have been trained to understand issues with MT and not only correct the error in the sentence, but work to create rules for the MT engine to follow.
– Early Career Post Editors:
• Editing work only, focused on corrections.
Measurement will be Essential
“The understanding of positive change is only possible when you understand the current system in terms of efficiency. …
Any conclusion about consistent, meaningful, positive change in a process must be based on objective measurements; otherwise conjecture and subjectivity can steer efforts in the wrong direction.”
– Kevin Nelson, Managing Director, Omnilingua Worldwide
Reality for LSPs
• How Omnilingua Measures Quality
– Triangulate to find the data.
– Raw MT J2450 vs. historical human quality J2450.
– Time study measurements.
– OmniMT EffortScore™
• Everything must be measured by effort first
– All other metrics support effort metrics.
– Productivity is key.
∆ Effort > MT System Cost + Value Chain Sharing
SAE J2450
• Built as a Human Assessment System:
– Provides 7 defined and actionable error classifications.
– 2 severity levels to identify serious and minor errors.
• Provides a Measurement Score Between 0 and 1:
– A lower score indicates fewer errors.
– The objective is to achieve a score as close to 0 (no errors/issues) as possible.
• Provides Scores at Multiple Levels:
– Composite scores across an entire set of data.
– Scores for logical units such as sentences and paragraphs.
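The composite score is essentially weighted error points divided by word count. A sketch of that calculation; the category weights below follow commonly published SAE J2450 values but should be verified against the standard itself before use:

```python
# Severity weights per J2450 category as (serious, minor) points.
WEIGHTS = {
    "wrong_term": (5, 2), "syntactic": (4, 2), "omission": (4, 2),
    "word_structure": (4, 2), "misspelling": (3, 1),
    "punctuation": (2, 1), "miscellaneous": (3, 1),
}

def j2450_score(errors, word_count):
    """errors: list of (category, 'serious' | 'minor') tuples found in review.
    Returns weighted error points per word; 0.0 means no errors."""
    points = sum(WEIGHTS[cat][0 if sev == "serious" else 1]
                 for cat, sev in errors)
    return points / word_count

errors = [("wrong_term", "serious"), ("punctuation", "minor")]
print(j2450_score(errors, 100))  # (5 + 1) / 100 = 0.06
```

The same function works at any level: pass one sentence’s errors and word count for a sentence score, or the whole data set’s for a composite score.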
Comparing MT Systems: Omnilingua SAE J2450
Asia Online vs. Competing MT System, by factor:
• Total Raw J2450 Errors – 2x fewer
• Raw J2450 Score – 2x better
• Total PE J2450 Errors – 5.3x fewer
• PE J2450 Score – 4.8x better
• PE Rate – 32% faster
Case Study 1: Small Project
• LSP: a mid-sized European LSP.
• First engine – customized, without any additional engine feedback.
• Domain: IT / Engineering
• Words: 25,000
• Measurements: cost, timeframe, quality.
• Quality of client delivery with the machine translation + human approach must be the same as or better than a human-only approach.
Comparison of Time and Cost (25,000 words)
• Human-only workflow: translation 10 days, editing 3 days, proofing 2 days – 15 days total.
• MT + human workflow: translation 1 day, post editing 5 days, proofing 2 days – 8 days total.
• Result: 46% time saving (7 days) and a 27% cost saving.
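The headline time figure follows directly from the stage durations; a quick check of the arithmetic:

```python
# Stage durations in days, as given for the 25,000-word project.
human = {"translation": 10, "editing": 3, "proofing": 2}   # 15 days total
mt = {"translation": 1, "post_editing": 5, "proofing": 2}  # 8 days total

days_saved = sum(human.values()) - sum(mt.values())
saving = days_saved / sum(human.values())
print(days_saved, f"{saving:.0%}")  # 7 days saved, i.e. 7/15 of the schedule
```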
Business Model Analysis
[Chart: per-word revenue split into cost and margin at the same price to the client. Human translation: translation, editing and proofing costs leave a smaller margin. Machine translation + post editing and proofing: lower production costs leave a substantially larger margin.]
Case Study 2: Large Project
• LSP: Sajan
• End Client Profile:
– Large global multinational corporation in the IT domain.
– Has its own proprietary MT system, developed over many years.
• Project Goals:
– Eliminate the need for full translation and limit it to MT + post-editing.
• Language Pairs:
– English -> Simplified Chinese
– English -> European Spanish
– English -> European French
• Domain: IT
• 2nd Iteration of Customized Engine:
– Customized initial engine, followed by an incremental improvement based on client feedback.
• Data:
– Client provided ~3,000,000 phrase pairs.
– 26% were rejected in the cleaning process as unsuitable for SMT training.
• Measurements: cost, timeframe, quality.
Project Results
• Quality
– Client performed their own metrics.
– Asia Online Language Studio™ was considerably better than the client’s own MT solution.
– Significant quality improvement after providing feedback – 65 BLEU score.
– Chinese scored better than first-pass human translation as per the client’s feedback, and was faster and easier to edit.
• Result
– Client extremely impressed with the result, especially when compared to the output of their own MT engine.
– Client has commissioned Sajan to work with more languages.
• 70% time saving, 60% cost saving.
LRC has uploaded Sajan’s slides and video presentation from the recent LRC conference:
Slides: http://bit.ly/r6BPkT
Video: http://bit.ly/trsyhg
Small LSP Taking On the Big Guns
• Small/mid-sized LSP, with offices in the US, Thailand, Singapore, Argentina, Australia and Colombia.
• Competitors – SDL/Language Weaver and TransPerfect
• Projects:
– Travelocity: major travel booking site, wanting to expand its global presence for hotel reservations.
– HolidayCheck: major travel review site, wanting to expand its global presence for hotel reviews.
– Sawadee.com: small travel booking site. Had confidence due to other travel proof points.
• Results:
– Travelocity: won project for 22 language pairs.
– HolidayCheck: won project for 11 language pairs, replacing already-installed competing technology that had not delivered as promised.
– Sawadee.com: won project for 2 language pairs.
• Beat 2 of the largest global LSPs
– Built an initial engine to demonstrate quality capabilities.
– Reused the various engines created for multiple clients.
– Worked on glossaries, non-translatable terms and data cleaning.
– A focus on quality, not on generating more human work.
– Provided a complete solution:
• MT, human translation, editing and copywriting.
• Applying the right level of skill to the right task – kept costs down.
• Workflow management and integration.
• Project management.
• Quality management.
LSPs Must Have Complete Control
• Tools to analyze and refine the quality of training data and other linguistic assets:
– Bilingual data
– Monolingual data
• Tools to rapidly identify errors and make corrections.
• Tools to measure and identify error patterns:
– Human metrics
– Machine metrics
• Tools to manage and gather corrective feedback.
ROI Calculator – Parameters
• 10 projects in the same domain: medical, DE-EN.
• 1.85 million words total.
• Below 85% fuzzy match sent to MT.
• Review of fuzzy match segments: $0.05
• Human translation: $0.15
• Editing / proofing human translation: $0.05
• Editing / proofing MT: $0.07-$0.05
• Human only: 5 translators, 1 editor.
• MT + human: 3 proof readers.
• Cost to client: $0.31
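A sketch of the kind of per-word cost comparison such a calculator performs. The 30% fuzzy-match share and the $0.06 midpoint MT post-editing rate are assumptions for illustration, and engine customization costs are not included:

```python
WORDS = 1_850_000
FUZZY_SHARE = 0.30          # assumed: 30% of words are >= 85% fuzzy matches

RATES = {                   # per-word rates from the slide parameters
    "fuzzy_review": 0.05,   # review of >= 85% fuzzy-match segments
    "human_translation": 0.15,
    "proof_ht": 0.05,
    "proof_mt": 0.06,       # assumed midpoint of the $0.07-$0.05 range
}

fuzzy_words = WORDS * FUZZY_SHARE
new_words = WORDS - fuzzy_words

# Human-only: fuzzy review, plus full translation and proofing of new words.
human_cost = (fuzzy_words * RATES["fuzzy_review"]
              + new_words * (RATES["human_translation"] + RATES["proof_ht"]))

# MT + human: fuzzy review, plus post-editing of machine-translated new words.
mt_cost = (fuzzy_words * RATES["fuzzy_review"]
           + new_words * RATES["proof_mt"])

print(f"human-only ${human_cost:,.0f}  MT+PE ${mt_cost:,.0f}  "
      f"saving {(human_cost - mt_cost) / human_cost:.0%}")
```

Under these assumptions the MT + post-editing workflow costs a fraction of the human-only workflow; the real calculator would also factor in engine fees, editor counts and elapsed time.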
ROI Calculator: Cost Comparison
Cost Savings / Increases
Elapsed Time
Person Time
Margin
Dion Wiggins
Chief Executive Officer
[email protected]
How to Measure the Success of Machine Translation