Semantic Similarity Measures in Machine Translation Evaluation
Hanna Bechara
ESR12 Expert Project
June 27, 2015
Hanna Bechara June 27, 2015 1 / 21
Machine Translation Evaluation
How do we define translation quality?
Fluency? Grammaticality? Readability?
Post-editing effort?
How well it matches a reference translation?
Meaning Preservation?
Machine Translation Evaluation
How do we define translation quality? Meaning Preservation!
Semantic Textual Similarity
STS Explained
Semantic Textual Similarity (STS) captures the notion that some texts are more similar than others.
5 The two sentences are completely equivalent, as they mean the same thing.
4 The two sentences are mostly equivalent, but some unimportant details differ.
3 The two sentences are roughly equivalent, but some important information differs or is missing.
2 The two sentences are not equivalent, but share some details.
1 The two sentences are not equivalent, but are on the same topic.
0 The two sentences are on different topics.
Semantic Textual Similarity
Examples
Example 1
Sentence 1: A brown dog is attacking another animal in front of the man in pants.
Sentence 2: Two dogs are fighting.
Example 2
Sentence 1: A man is chopping butter into a container.
Sentence 2: A woman is cutting shrimps.
Example 3
Sentence 1: A cat is playing with a watermelon on a floor.
Sentence 2: A man is pouring oil into a pan.
Semantic Textual Similarity
How do we estimate STS?
Crowd-Sourced Similarity Ratings
Created for SemEval Workshops
Expert’s SemEval Submission
SVM Regressor
Estimates score between 0 and 5
Trained on human-annotated sentences provided by the SemEval Shared Tasks
Trained on a variety of features
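The SVM regression setup above can be sketched with scikit-learn. This is a minimal illustration only: the feature vectors and gold scores below are invented, and the real submission uses a much richer feature set.

```python
# Illustrative sketch of an SVM regressor for STS (0-5 scale).
# Features and training scores here are made up for demonstration;
# they stand in for the real SemEval feature set.
import numpy as np
from sklearn.svm import SVR

# Toy feature vectors (e.g. word-overlap ratio, length ratio) with gold STS scores.
X_train = np.array([[0.95, 1.00],
                    [0.70, 0.90],
                    [0.45, 0.80],
                    [0.15, 0.60],
                    [0.05, 0.30]])
y_train = np.array([5.0, 4.0, 3.0, 1.0, 0.0])

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict for a new pair and clamp to the 0-5 STS scale.
raw = model.predict(np.array([[0.85, 0.95]]))[0]
sts_score = min(5.0, max(0.0, raw))
print(round(sts_score, 2))
```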
Methodology
Research Question
Can we estimate the score X as a function of R (relatedness) and b_A (the quality of A)?
Methodology
DGT Translation Memory
DGT-Translation Memory (EN-FR)
500 sentences x 5 most similar matches
Evaluation: S-BLEU (0–1) against reference French translations
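As a rough illustration of the sentence-level metric, here is a minimal pure-Python S-BLEU using only unigrams and bigrams. Real implementations go up to 4-grams and use proper smoothing; the tiny floor below merely avoids log(0), and the example sentences are invented.

```python
# Minimal sentence-level BLEU (S-BLEU) sketch, bigram-level only.
# Illustrative: real S-BLEU uses 4-grams and a proper smoothing scheme.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def s_bleu(hypothesis, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())          # clipped n-gram matches
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # tiny floor avoids log(0)
    # Brevity penalty for hypotheses shorter than the reference.
    bp = math.exp(min(0.0, 1.0 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "le chômage atteint un niveau record".split()
hyp = "le chômage atteint un record".split()
score = s_bleu(hyp, ref)
print(round(score, 3))
```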
Methodology
Machine Learning Task
Features
1 Baseline Experiments: 17 QuEst features
2 STS score for source sentence pair
3 S-BLEU score for Sentence Pair A
4 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0–1
2000 sentences for training – 500 sentences for testing
Methodology
Results
       Mean Baseline  QuEst Baseline (17)  STS (3)  Combined (20)
MAE    0.16           0.12                 0.108    0.09
Table: Predicting the BLEU scores for DGT-TM - Mean Absolute Error
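For reference, the MAE reported in the table is simply the average absolute difference between predicted and gold scores. The values below are invented toy numbers, not the experiment's data:

```python
# Mean absolute error (MAE), the metric used in the table above.
# Predicted/gold S-BLEU values are invented for illustration.
predicted = [0.42, 0.75, 0.10, 0.66]
gold = [0.50, 0.70, 0.20, 0.60]

mae = sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)
print(round(mae, 4))
```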
Methodology
SICK
SICK (Sentences Involving Compositional Knowledge)
4500 sentence pairs
Evaluation: S-BLEU on backtranslations
Methodology
Machine Learning Task
Features
1 STS score for source sentence pair
2 S-BLEU score for Sentence A
3 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0–1
4000 sentences for training – 500 sentences for testing
Methodology
Results
       Mean Baseline  STS (3)
MAE    0.216          0.193
Table: Predicting the S-BLEU scores for SICK’s backtranslations - Mean Absolute Error
Methodology
Designing our Own
Objective
Create a dataset of semantically related sentences, their machine translations, and their quality.
Methodology
Data Preparation
Extracted sentences from the FLICKR images dataset used for previous SemEval tasks
Each pair has a human similarity rating between 0 and 5
Each sentence has a French machine translation and a quality score for each translation, between 1 and 5, assigned through manual evaluation
Each French sentence pair produced by the machine translation is also assigned a similarity rating through manual evaluation.
Methodology
Example
Sentence A: A group of kids is playing in a yard and an old man is standing in the background
Sentence B: A group of boys in a yard is playing and a man is standing in the background
Semantic Similarity between A and B: 4.5
Sentence A - MT Output: Un groupe d’enfants joue dans une cour et un vieil homme est debout dans l’arrière-plan
Sentence B - MT Output: Un groupe de garçons dans une cour joue et un homme est debout dans l’arrière-plan
Semantic Similarity between A - MT Output and B - MT Output: ?
Methodology
Example
Sentence A: eurozone unemployment at record 12 percent
Sentence B: eurozone unemployment hits record 12.1 % in march
Semantic Similarity between A and B: 4.5
Sentence A - MT Output: lors de la zone euro 12 % de chômage record
Sentence B - MT Output: le chômage frappe 12.1 % de la zone euro en marche procès-verbal
Semantic Similarity between A - MT Output and B - MT Output: ?
Methodology
Experiments
Feature Sets
1 Baseline Experiments: 17 QuEst features
2 STS score for sentence pair
3 Human evaluation score for Pair B (MT Output)
4 S-BLEU score comparing Pair A to Pair B (MT outputs)
SVM Regression Model
800 sentences for training – 200 sentences for testing
Predicts a score between 1–5
Methodology
Results
Preliminary results show that STS information can improve over the baseline
       Baseline  Baseline + STS
MAE    0.639     0.575
Methodology
Summing up...
Results show that semantically motivated features can improve over the quality estimation baseline
We can learn the quality of a sentence B if we have a semantically similar sentence A whose quality is known
However, we require access to semantically similar sentences
The End (For Now)
Enjoy the Weekend!