WeMT Tools and Processes
TAUS Showcase, October 2013
By Olga Beregovaya
Nov 19, 2014
copyright © welocalize 2013. all rights reserved. www.welocalize.com
We’ll talk about:
• MT Programs
• Metrics
• Engines
• Language Tools
Current MT Programs
Dell – 27 languages
Autodesk – 11 languages
PayPal – 8 languages
Cisco – 17 languages across 3 tiers
Intuit – 20+ languages
Microsoft (pre-project support)
McAfee (pilot)
… many more in pilot stage
MT Program: Path-to-Success Components
A set of MT engines – “mix and match”
TMT Selection Mechanisms
Post-editing Environment
Processes and metrics
Data gathering and reporting tool – what, how much, how fast and at what effort
EDUCATION EDUCATION EDUCATION
CHANGE
The recipe for success
Process and Workflow
All aspects of the localization ecosystem are taken into consideration
Selecting the right MT engine
By using our MT engine selection Scorecard we make sure all important KPIs are taken into consideration at selection time
Empowerment through education
Internal, through customized Toolkits; external, through specialised Trainings.
MT KPIs:
• Productivity: Throughputs
• Productivity: Delta
• Quality: LQA
• Quality: Automatic Scores
• Cost
• GlobalSight: Connectivity
• GlobalSight: Tagging
• Human Evaluation
• Customization: Internal/External
• Customization: Time
The feedback loop
Constructive communication from post-editor to MT provider
o Source content classification (i.e. marketing/UI/UA/UGC)
o Length of the source segment
o Source segment morpho-syntactic complexity
o Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, ‘do-not-translate’ and transliteration lists
o Tag density - Metadata attributes and their representation in localization industry standard formats (“tags”)
o ROC – quality levels based on content use (“impact”)
3D Model: Expected productivity mapped to desired quality levels and source content complexity
MT Program Design - Source
Productivity - Throughputs: Number of post-edited words per hour
Productivity - Delta: Percentage difference between translation and post-editing time
Cost: Extrapolation, cost per word
CMS - Connectivity: Is there a connector in place?
Quality/Nature of source
Quality (Final) - LQA: Internal quality verification
Quality (MT) - Automatic Scores: A set of automatic scoring systems is used
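The two productivity metrics can be illustrated with a small sketch. The deck defines throughput as post-edited words per hour and delta as the percentage difference between translation and post-editing time; the choice of translation time as the baseline for the percentage, and the function names, are assumptions made here for illustration only.

```python
def throughput(words: int, hours: float) -> float:
    """Post-edited words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_delta(translation_hours: float, post_editing_hours: float) -> float:
    """Percentage of time saved by post-editing, relative to translating
    the same content from scratch (baseline choice is an assumption)."""
    if translation_hours <= 0:
        raise ValueError("translation_hours must be positive")
    return 100.0 * (translation_hours - post_editing_hours) / translation_hours


if __name__ == "__main__":
    # Hypothetical figures: 1,200 words post-edited in 2 hours,
    # versus 10 hours of translation and 6 hours of post-editing.
    print(throughput(1200, 2.0))
    print(productivity_delta(10.0, 6.0))
```

A negative delta under this definition would mean that post-editing took longer than translating from scratch.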
MT Engine Selection Scorecard
We have tested and used different engines so we’ve seen the good, the bad and the ugly; now we can better appreciate what we have
Scorecard - Metrics
Overall data
KPIs                                 #1  #2  #3  #4  |  #1  #2  #3  #4
Productivity                          4   4   4   4  |   4   5   3   4
Productivity Increase                 5   4   1   3  |   5   5   1   4
Quality - LQA                         2   2   1   2  |   5   3   3   4
Quality - Automatic Scores            3   3   3   3  |   3   4   3   3
Cost                                  4   2   3   3  |   4   2   3   3
GlobalSight - Connectivity            4   3   2   4  |   4   3   2   4
GlobalSight - Tagging                 4   2   4   2  |   4   2   2   2
Human Evaluation                      3   3   3   4  |   3   3   3   3
Customization - Internal/External     4   2   3   3  |   4   2   3   3
Customization - Time                  3   1   2   1  |   3   1   2   1
Total                                36  26  26  29  |  39  30  25  31
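The totals in the scorecard are plain, unweighted sums of the ten KPI scores per engine. The sketch below reproduces them from the published rows; the variable and function names are illustrative, and the "left" and "right" labels simply follow the order in which the two score blocks appear in the deck.

```python
KPIS = [
    "Productivity", "Productivity Increase", "Quality - LQA",
    "Quality - Automatic Scores", "Cost", "GlobalSight - Connectivity",
    "GlobalSight - Tagging", "Human Evaluation",
    "Customization - Internal/External", "Customization - Time",
]

# One tuple per KPI row, one value per engine (#1..#4), as in the scorecard.
LEFT_BLOCK = [
    (4, 4, 4, 4), (5, 4, 1, 3), (2, 2, 1, 2), (3, 3, 3, 3), (4, 2, 3, 3),
    (4, 3, 2, 4), (4, 2, 4, 2), (3, 3, 3, 4), (4, 2, 3, 3), (3, 1, 2, 1),
]
RIGHT_BLOCK = [
    (4, 5, 3, 4), (5, 5, 1, 4), (5, 3, 3, 4), (3, 4, 3, 3), (4, 2, 3, 3),
    (4, 3, 2, 4), (4, 2, 2, 2), (3, 3, 3, 3), (4, 2, 3, 3), (3, 1, 2, 1),
]


def engine_totals(rows):
    """Sum each engine's column across all KPI rows."""
    return [sum(column) for column in zip(*rows)]


def best_engine(rows):
    """Return the 1-based number of the engine with the highest total."""
    totals = engine_totals(rows)
    return totals.index(max(totals)) + 1


if __name__ == "__main__":
    assert len(LEFT_BLOCK) == len(RIGHT_BLOCK) == len(KPIS)
    print(engine_totals(LEFT_BLOCK), best_engine(LEFT_BLOCK))
    print(engine_totals(RIGHT_BLOCK), best_engine(RIGHT_BLOCK))
```

If some KPIs matter more for a given program, a weighted sum would be a natural extension, but the published totals match the unweighted version.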
Scorecard languages: German (left block), French (right block)
Productivity metrics
Automatic Scoring
Human Evaluation
Toolkits and Trainings
Our experience:
Most translators know and have experienced post-editing, but they have limited knowledge of related aspects (automatic scoring, output differences between RBMT and SMT...)
The majority of people who work in localization have heard about MT but most of them still find it a daunting subject.
Our answer:
Continuous MT and PE related trainings and documentation for language providers
Customized Toolkits for different internal departments (Production, Quality, Sales, Vendor Management)
Transparency and Ownership
Theory – knowledge foundations
Practice – customized PE sessions for different client accounts
Transparency – process, engine selection/customization, evaluations
Responsibility – valid evaluations, constructive feedback, quality ownership
Training helps a lot - After I was told some of the background information and tips and tricks for certain engines/outputs, I was much more relaxed and happy to give MT a go.
Legacy data – best prediction tool > Statistics from legacy knowledge base
The feedback loop
Engine retraining significantly improved the handling of tags and spaces around tags; this is a productive achievement, as it saves us a lot of manual corrections.
For me the biggest advantage would be the possibility to implement a client terminology list [in SMT]
I wish we could easily fix the corpus for outdated terminology and characters
Teach the engine to properly cope with sentences containing more than one verb and/or verbs in progressive form
Feedback and Engine Improvement
“Beyond the Engine” Tools
• Teaminology - crowdsourcing platform for centralized term governance; simultaneous concordance search of TMs and term bases => clean training data
• Dispatcher - A global community content translation application that connects user generated content (UGC) including live chats, social media, forums, comments and knowledge bases to customized machine translation (MT) engines for real-time translation
• Source Candidate Scorer – scoring of candidate sentences against historically good and bad sentences based on POS and perplexity
• Corpus Preparation Toolkit – a set of applications that streamline data preparation for MT engine training
Teaminology
Dispatcher
Source Candidate Scorer
Compares your source content to “the good” and “the bad” legacy segments and estimates potential suitability for MT
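The idea of comparing a candidate against "good" and "bad" legacy segments can be sketched with perplexity alone. This is a toy illustration, not the Welocalize tool: it uses an add-one smoothed unigram model where a production scorer would use a stronger language model, it leaves out the POS-based features mentioned in the deck, and every name in it is hypothetical.

```python
import math
from collections import Counter


class UnigramModel:
    """Add-one smoothed unigram language model over whitespace tokens."""

    def __init__(self, sentences):
        self.counts = Counter(
            token for sentence in sentences for token in sentence.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen tokens

    def perplexity(self, sentence):
        tokens = sentence.lower().split()
        if not tokens:
            raise ValueError("sentence has no tokens")
        log_prob = sum(
            math.log((self.counts[token] + 1) / (self.total + self.vocab_size))
            for token in tokens
        )
        return math.exp(-log_prob / len(tokens))


def suitability(sentence, good_model, bad_model):
    """Positive when the sentence looks more like the 'good' legacy
    segments than the 'bad' ones, negative otherwise."""
    return math.log(bad_model.perplexity(sentence)) - math.log(
        good_model.perplexity(sentence)
    )


if __name__ == "__main__":
    good = UnigramModel(["click the save button", "click the print button"])
    bad = UnigramModel(["lol dunno tbh", "omg so broken lol"])
    print(suitability("click the button", good, bad))
    print(suitability("lol so broken", good, bad))
```

The sign of the score gives a coarse suitability estimate; a threshold for routing content to MT would have to be calibrated on real legacy data.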
Corpus Preparation Suite
Variety of tools to prepare corpus for training MT engines such as:
• Deleting formatting tags from TMX
• Removing double spaces
• Removing duplicated punctuation (e.g. commas)
• Deleting segments where source = target
• Deleting segments containing only URLs
• Escaping characters
• Removing duplicate sentences
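The cleaning steps listed above can be sketched as a single pass over source/target pairs. This is a minimal illustration under stated assumptions, not the actual suite: tags are treated as anything in angle brackets, periods are excluded from punctuation collapsing so that ellipses survive, duplicates are detected per sentence pair, and `html.escape` stands in for whatever escaping the target MT engine requires.

```python
import re
from html import escape

TAG_RE = re.compile(r"<[^>]+>")
MULTISPACE_RE = re.compile(r" {2,}")
DUP_PUNCT_RE = re.compile(r"([,;:!?])\1+")
URL_ONLY_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)


def clean_text(text):
    """Strip formatting tags, collapse repeated punctuation and spaces."""
    text = TAG_RE.sub("", text)
    text = DUP_PUNCT_RE.sub(r"\1", text)
    text = MULTISPACE_RE.sub(" ", text)
    return text.strip()


def clean_corpus(pairs):
    """Return cleaned (source, target) pairs, dropping unusable segments."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = clean_text(source), clean_text(target)
        if not source or not target:
            continue  # nothing left after cleaning
        if source == target:
            continue  # untranslated segment
        if URL_ONLY_RE.match(source) or URL_ONLY_RE.match(target):
            continue  # segment contains only a URL
        if (source, target) in seen:
            continue  # duplicate sentence pair
        seen.add((source, target))
        # Escaping step; html.escape is an illustrative stand-in.
        cleaned.append((escape(source, quote=False), escape(target, quote=False)))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Click <b>Save</b>,,  now", "Cliquez sur <b>Enregistrer</b>,  maintenant"),
        ("OK", "OK"),
        ("http://example.com/en", "http://example.com/fr"),
        ("Click <b>Save</b>,,  now", "Cliquez sur <b>Enregistrer</b>,  maintenant"),
        ("R&D", "R et D"),
    ]
    for pair in clean_corpus(sample):
        print(pair)
```

Running the filters before deduplication means that segments which differ only in tags or spacing collapse into one entry.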
Corpus Preparation: TM Creator
Aggregates training data from various relevant sources
Corpus Preparation: TMX Splitter
Extracts the relevant training corpus based on the TMX metadata
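Metadata-based extraction from TMX might look like the sketch below, which keeps only translation units whose `<prop>` element matches a requested value. The property name `x-domain`, the sample data and the function name are hypothetical; a real translation memory may carry its metadata in different properties or attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<tmx version="1.4"><header/><body>
<tu><prop type="x-domain">UI</prop>
  <tuv xml:lang="en-US"><seg>Save</seg></tuv>
  <tuv xml:lang="de-DE"><seg>Speichern</seg></tuv></tu>
<tu><prop type="x-domain">Marketing</prop>
  <tuv xml:lang="en-US"><seg>Work smarter</seg></tuv>
  <tuv xml:lang="de-DE"><seg>Intelligenter arbeiten</seg></tuv></tu>
</body></tmx>"""


def split_tmx(tmx_text, prop_type, prop_value, src_lang, tgt_lang):
    """Return (source, target) pairs from units whose metadata matches."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        props = {p.get("type"): (p.text or "").strip() for p in tu.findall("prop")}
        if props.get(prop_type) != prop_value:
            continue
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang.lower() in segments and tgt_lang.lower() in segments:
            pairs.append((segments[src_lang.lower()], segments[tgt_lang.lower()]))
    return pairs


if __name__ == "__main__":
    print(split_tmx(SAMPLE, "x-domain", "UI", "en-US", "de-DE"))
```

Filtering at the translation-unit level keeps each source and target together, so the extracted subset can be passed straight to corpus cleaning.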
Welocalize Moses Implementation
• Why? Far more control over engine quality since we can control corpus preparation and output post-processing
• Control over metadata handling
• Ties into our company open-source philosophy
• Have experienced personnel in-house
• Can extend and customize Moses functionality as necessary
• Have connector to TMS (GlobalSight)
RESULTS: In our internal tests with Moses/DoMT, we are getting automated scores similar to those of commercial engines for the languages into which we localize most. Human evaluators gave the same feedback.
… And it works!
We are in a position to offer realistic discounts and aggressive timelines while providing quality levels appropriate for the content
“Work-in-progress” Projects
• Ongoing improvements to our adaptation of the iOmegaT tool (Welocalize/CNGL)
• Industry Partner in CNGL “Source Content Profiler” project
• Adoption of TMTPrime (CNGL) - MT vs. Fuzzy Match selection mechanism
• Language and content-specific pre-processing for the in-house Moses deployment
• Teaminology – adding linguistic intelligence
Contact
[email protected]
speak MT - the language of the future
Welocalize, Inc.
www.welocalize.com
Headquarters
241 East 4th St. Suite 207
Frederick, Maryland 21701 USA
[t] +1.301.668.0330
[t] +1.800.370.9515 Toll Free
[f] +1.301.668.0335
[e] [email protected]