Semantic Similarity Measures in Machine Translation Evaluation
Hanna Bechara
ESR12 Expert Project
June 27, 2015
Hanna Bechara June 27, 2015 1 / 21
Machine Translation Evaluation
How do we define translation quality?
Fluency? Grammaticality? Readability?
Post-editing effort?
How well it matches a reference translation?
Meaning Preservation?
Machine Translation Evaluation
How do we define translation quality? Meaning Preservation!
Semantic Textual Similarity
STS Explained
Semantic Textual Similarity (STS) captures the notion that some texts are more similar than others.
5 The two sentences are completely equivalent, as they mean the same thing.
4 The two sentences are mostly equivalent, but some unimportant details differ.
3 The two sentences are roughly equivalent, but some important information differs or is missing.
2 The two sentences are not equivalent, but share some details.
1 The two sentences are not equivalent, but are on the same topic.
0 The two sentences are on different topics.
Semantic Textual Similarity
Examples
Example 1
Sentence 1: A brown dog is attacking another animal in front of the man in pants.
Sentence 2: Two dogs are fighting.
Example 2
Sentence 1: A man is chopping butter into a container.
Sentence 2: A woman is cutting shrimps.
Example 3
Sentence 1: A cat is playing with a watermelon on a floor.
Sentence 2: A man is pouring oil into a pan.
Semantic Textual Similarity
How do we estimate STS?
Crowd-Sourced Similarity Ratings
Created for SemEval Workshops
Expert’s SemEval Submission
SVM Regressor
Estimates score between 0 and 5
Trained on human-annotated sentences provided by the SemEval Shared Tasks
Trained on a variety of features
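The SVM regression setup above can be sketched with scikit-learn. This is a minimal illustration only: the feature vectors and gold scores below are invented, and the real submission uses a much richer feature set.

```python
# Illustrative sketch of an SVM regressor for STS (0-5 scale).
# Features and training scores here are made up for demonstration;
# they stand in for the real SemEval feature set.
import numpy as np
from sklearn.svm import SVR

# Toy feature vectors (e.g. word-overlap ratio, length ratio) with gold STS scores.
X_train = np.array([[0.95, 1.00],
                    [0.70, 0.90],
                    [0.45, 0.80],
                    [0.15, 0.60],
                    [0.05, 0.30]])
y_train = np.array([5.0, 4.0, 3.0, 1.0, 0.0])

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict for a new pair and clamp to the 0-5 STS scale.
raw = model.predict(np.array([[0.85, 0.95]]))[0]
sts_score = min(5.0, max(0.0, raw))
print(round(sts_score, 2))
```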
Methodology
Research Question
Can we estimate the score X as a function of R (relatedness) and b_A (the quality of A)?
Methodology
DGT Translation Memory
DGT-Translation Memory (EN-FR)
500 sentences x 5 most similar matches
Evaluation: S-BLEU (0–1) against reference French translations
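As a rough illustration of the sentence-level metric, here is a minimal pure-Python S-BLEU using only unigrams and bigrams. Real implementations go up to 4-grams and use proper smoothing; the tiny floor below merely avoids log(0), and the example sentences are invented.

```python
# Minimal sentence-level BLEU (S-BLEU) sketch, bigram-level only.
# Illustrative: real S-BLEU uses 4-grams and a proper smoothing scheme.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def s_bleu(hypothesis, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())          # clipped n-gram matches
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # tiny floor avoids log(0)
    # Brevity penalty for hypotheses shorter than the reference.
    bp = math.exp(min(0.0, 1.0 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "le chômage atteint un niveau record".split()
hyp = "le chômage atteint un record".split()
score = s_bleu(hyp, ref)
print(round(score, 3))
```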
Methodology
Machine Learning Task
Features
1 Baseline Experiments: 17 QuEst features
2 STS score for source sentence pair
3 S-BLEU score for Sentence Pair A
4 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0–1
2000 sentences for training – 500 sentences for testing
Methodology
Results
       Mean Baseline  QuEst Baseline (17)  STS (3)  Combined (20)
MAE    0.16           0.12                 0.108    0.09
Table: Predicting the BLEU scores for DGT-TM - Mean Absolute Error
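For reference, the MAE reported in the table is simply the average absolute difference between predicted and gold scores. The values below are invented toy numbers, not the experiment's data:

```python
# Mean absolute error (MAE), the metric used in the table above.
# Predicted/gold S-BLEU values are invented for illustration.
predicted = [0.42, 0.75, 0.10, 0.66]
gold = [0.50, 0.70, 0.20, 0.60]

mae = sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)
print(round(mae, 4))
```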
Methodology
SICK
SICK (Sentences Involving Compositional Knowledge)
4500 sentence pairs
Evaluation: S-BLEU on backtranslations
Methodology
Machine Learning Task
Features
1 STS score for source sentence pair
2 S-BLEU score for Sentence A
3 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0–1
4000 sentences for training – 500 sentences for testing
Methodology
Results
       Mean Baseline  STS (3)
MAE    0.216          0.193
Table: Predicting the S-BLEU scores for SICK’s backtranslations - Mean Absolute Error
Methodology
Designing our Own
Objective
Create a dataset of semantically related sentences, their machine translations, and their quality.
Methodology
Data Preparation
Extracted sentences from the FLICKR images dataset used for previous SemEval tasks
Each pair has a human similarity rating between 0 and 5
Each sentence has a French machine translation and a quality score for each translation, between 1 and 5, assigned through manual evaluation
Each French sentence pair produced by the machine translation is also assigned a similarity rating through manual evaluation.
Methodology
Example
Sentence A: A group of kids is playing in a yard and an old man is standing in the background
Sentence B: A group of boys in a yard is playing and a man is standing in the background
Semantic Similarity between A and B: 4.5
Sentence A - MT Output: Un groupe d’enfants joue dans une cour et un vieil homme est debout dans l’arrière-plan
Sentence B - MT Output: Un groupe de garçons dans une cour joue et un homme est debout dans l’arrière-plan
Semantic Similarity between A - MT Output and B - MT Output: ?
Methodology
Example
Sentence A: eurozone unemployment at record 12 percent
Sentence B: eurozone unemployment hits record 12.1 % in march
Semantic Similarity between A and B: 4.5
Sentence A - MT Output: lors de la zone euro 12 % de chômage record
Sentence B - MT Output: le chômage frappe 12.1 % de la zone euro en marche procès-verbal
Semantic Similarity between A - MT Output and B - MT Output: ?
Methodology
Experiments
Feature Sets
1 Baseline Experiments: 17 QuEst features
2 STS score for sentence pair
3 Human evaluation score for Pair B (MT Output)
4 S-BLEU score comparing Pair A to Pair B (MT outputs)
SVM Regression Model
800 sentences for training – 200 sentences for testing
Predicts a score between 1–5
Methodology
Results
Preliminary results show that STS information can improve over the baseline
       Baseline  Baseline + STS
MAE    0.639     0.575
Methodology
Summing up...
Results show that semantically motivated features can improve over the quality estimation baseline
We can learn the quality of a sentence B if we have a semantically similar sentence A whose quality is known
However, we require access to semantically similar sentences
The End (For Now)
Enjoy the Weekend!