Page 1

Semantic Similarity Measures in Machine Translation Evaluation

Hanna Béchara

ESR12, EXPERT Project

June 27, 2015


Page 2

Machine Translation Evaluation

How do we define translation quality?

Fluency? Grammaticality? Readability?

Post-editing effort?

How well does it match a reference translation?

Meaning Preservation?


Page 7

Machine Translation Evaluation

How do we define translation quality?

Fluency? Grammaticality? Readability?

Post-editing effort?

How well does it match a reference translation?

Meaning Preservation!!


Page 8

Semantic Textual Similarity

STS Explained

Semantic Textual Similarity (STS) captures the notion that some texts are more similar than others.

5 The two sentences are completely equivalent, as they mean the same thing.

4 The two sentences are mostly equivalent, but some unimportant details differ.

3 The two sentences are roughly equivalent, but some important information differs or is missing.

2 The two sentences are not equivalent, but share some details.

1 The two sentences are not equivalent, but are on the same topic.

0 The two sentences are on different topics.



Page 10

Semantic Textual Similarity

Examples

Example 1
Sentence 1: A brown dog is attacking another animal in front of the man in pants.
Sentence 2: Two dogs are fighting.

Example 2
Sentence 1: A man is chopping butter into a container.
Sentence 2: A woman is cutting shrimps.

Example 3
Sentence 1: A cat is playing with a watermelon on a floor.
Sentence 2: A man is pouring oil into a pan.



Page 13

Semantic Textual Similarity

How do we estimate STS?

Crowd-Sourced Similarity Ratings

Created for SemEval Workshops

EXPERT's SemEval Submission

SVM Regressor

Estimates a score between 0 and 5

Trained on human-annotated sentences provided by the SemEval Shared Tasks

Uses a variety of features
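
As a rough illustration of this setup, here is a minimal sketch assuming scikit-learn; the two surface-overlap features and the training pairs are invented stand-ins for the much richer feature set and SemEval data used in the actual submission.

```python
# Minimal sketch of an SVM regressor for STS, assuming scikit-learn.
# The features below are toy stand-ins, not the submission's real features.
import numpy as np
from sklearn.svm import SVR

def sts_features(s1, s2):
    """Toy features: Jaccard word overlap and length ratio."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(t1 & t2) / max(len(t1 | t2), 1)
    length_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2), 1)
    return [jaccard, length_ratio]

# Hypothetical human-annotated pairs in the style of the SemEval data.
train = [
    ("Two dogs are fighting", "Two dogs are wrestling", 4.0),
    ("A man is playing a guitar", "A woman is cutting shrimps", 0.5),
    ("A cat sits on a mat", "A cat is sitting on a mat", 4.8),
]
X = np.array([sts_features(a, b) for a, b, _ in train])
y = np.array([score for _, _, score in train])

model = SVR(kernel="rbf").fit(X, y)
# Clip predictions to the 0-5 STS scale.
pred = np.clip(model.predict(X), 0.0, 5.0)
```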


Page 14

Methodology

Research Question

Can we estimate the score X as a function of R (relatedness) and Q_A (the quality of A)?
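
In symbols (my notation; the slide's own symbols were garbled in extraction), the question is whether some function f exists such that

\[ \hat{X} \;=\; f\big(R(A, B),\; Q_A\big) \]

where Q_A is the known quality of sentence A, R(A, B) is the semantic relatedness of A and B, and X-hat is the predicted quality score.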


Page 15

Methodology

DGT Translation Memory

DGT Translation Memory (EN–FR)

500 sentences × 5 most similar matches

Evaluation: S-BLEU (0–1) against reference French translations
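
S-BLEU here is sentence-level BLEU in the 0–1 range. A minimal sketch assuming NLTK (the slides do not name an implementation); smoothing is needed because short sentences often have no higher-order n-gram matches.

```python
# Sentence-level BLEU (S-BLEU) in the 0-1 range, using NLTK.
# Which BLEU implementation the project actually used is not stated
# on the slides; this is only one common way to compute it.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def s_bleu(reference, hypothesis):
    # Smoothing avoids zero scores when a higher-order n-gram is unmatched.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

# e.g. scoring an MT output against the reference French translation
score = s_bleu("le chat dort sur le tapis", "le chat dort sur un tapis")
```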


Page 16

Methodology

Machine Learning Task

Features

1 Baseline Experiments: 17 QuEst features

2 STS score for source sentence pair

3 S-BLEU score for Sentence Pair A

4 S-BLEU score comparing A to B (MT outputs)

SVM Regression Model

Predicts a score between 0 and 1

2000 sentences for training – 500 sentences for testing
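
A sketch of this regression setup, assuming scikit-learn; the arrays are random placeholders standing in for the real QuEst, STS, and S-BLEU feature values.

```python
# Sketch of the 20-feature SVM regression described above, with random
# placeholder data in place of the real QuEst/STS/S-BLEU values.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2500                                  # 2000 train + 500 test
quest = rng.random((n, 17))               # 17 QuEst baseline features
sts = rng.random((n, 1))                  # STS score of the source pair
sbleu_a = rng.random((n, 1))              # S-BLEU score for sentence pair A
sbleu_ab = rng.random((n, 1))             # S-BLEU between the MT outputs
X = np.hstack([quest, sts, sbleu_a, sbleu_ab])   # 20 features in total
y = rng.random(n)                         # target S-BLEU scores in 0-1

model = SVR(kernel="rbf").fit(X[:2000], y[:2000])
pred = np.clip(model.predict(X[2000:]), 0.0, 1.0)
mae = mean_absolute_error(y[2000:], pred)
```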



Page 18

Methodology

Results

        Mean Baseline   QuEst Baseline (17)   STS (3)   Combined (20)
MAE     0.16            0.12                  0.108     0.09

Table: Predicting the BLEU scores for DGT-TM (Mean Absolute Error)
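
For reference, the mean absolute error used throughout these tables is

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \]

where y-hat_i is the predicted score and y_i the reference score; lower is better.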


Page 19

Methodology

SICK

SICK (Sentences Involving Compositional Knowledge)

4500 sentence pairs

Evaluation: S-BLEU on back-translations
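
Concretely, each sentence is machine-translated into the target language and back, and the back-translation is scored against the original with S-BLEU. A tiny sketch of the scoring step, reusing the s_bleu helper from the earlier sketch; these (original, back-translation) pairs are invented.

```python
# Scoring back-translations with S-BLEU: the original English sentence
# serves as the reference for its own round-trip translation.
# Reuses the s_bleu helper sketched earlier; the pairs are invented.
pairs = [
    ("A man is playing a guitar", "A man plays the guitar"),
    ("Two dogs are fighting", "Two dogs fight each other"),
]
scores = [s_bleu(original, back) for original, back in pairs]
mean_sbleu = sum(scores) / len(scores)
```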


Page 20

Methodology

Machine Learning Task

Features

1 STS score for source sentence pair

2 S-BLEU score for Sentence A

3 S-BLEU score comparing A to B (MT outputs)

SVM Regression Model

Predicts a score between 0 and 1

4000 sentences for training – 500 sentences for testing



Page 22

Methodology

Results

        Mean Baseline   STS (3)
MAE     0.216           0.193

Table: Predicting the S-BLEU scores for SICK's back-translations (Mean Absolute Error)


Page 23

Methodology

Designing our Own

Objective

Create a dataset of semantically related sentences, their machine translations, and their quality scores.


Page 24

Methodology

Data Preparation

Extracted sentences from the FLICKR images dataset used for previous SemEval tasks

Each pair has a human similarity rating between 0 and 5

Each sentence has a French machine translation and a quality score for that translation, between 1 and 5, assigned through manual evaluation

Each French sentence pair produced by the machine translation is also assigned a similarity rating through manual evaluation.
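
One way to picture a record in this dataset (the field names are mine, not from the slides):

```python
# A possible record layout for the dataset described above;
# field names are illustrative, not taken from the slides.
from dataclasses import dataclass

@dataclass
class PairRecord:
    sentence_a: str        # English sentence from the FLICKR data
    sentence_b: str        # its paired English sentence
    similarity_en: float   # human similarity rating, 0-5
    mt_a: str              # French MT output for sentence A
    mt_b: str              # French MT output for sentence B
    quality_a: int         # manual quality score for mt_a, 1-5
    quality_b: int         # manual quality score for mt_b, 1-5
    similarity_fr: float   # manual similarity rating of the MT pair
```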


Page 25

Methodology

Example

Sentence A: A group of kids is playing in a yard and an old man is standing in the background

Sentence B: A group of boys in a yard is playing and a man is standing in the background

Semantic Similarity between A and B: 4.5

Sentence A - MT Output: Un groupe d'enfants joue dans une cour et un vieil homme est debout dans l'arrière-plan

Sentence B - MT Output: Un groupe de garçons dans une cour joue et un homme est debout dans l'arrière-plan

Semantic Similarity between A - MT Output and B - MT Output: ?


Page 26

Methodology

Example

Sentence A: eurozone unemployment at record 12 percent

Sentence B: eurozone unemployment hits record 12.1 % in march

Semantic Similarity between A and B: 4.5

Sentence A - MT Output: lors de la zone euro 12 % de chômage record

Sentence B - MT Output: le chômage frappe 12.1 % de la zone euro en marche procès-verbal

Semantic Similarity between A - MT Output and B - MT Output: ?


Page 27

Methodology

Experiments

Feature Sets

1 Baseline Experiments: 17 QuEst features

2 STS score for sentence pair

3 Human evaluation score for Pair B (MT Output)

4 S-BLEU score comparing Pair A to Pair B (MT outputs)

SVM Regression Model

800 sentences for training – 200 sentences for testing

Predicts a score between 1 and 5



Page 29

Methodology

Results

Preliminary results show that STS information can improve over the baseline.

        Baseline   Baseline + STS
MAE     0.639      0.575


Page 30

Methodology

Summing up...

Results show that semantically motivated features can improve over the quality estimation baseline

We can learn the quality of a Sentence B if we have a semantically similar Sentence A with a known quality

However, this requires access to semantically similar sentences



Page 33

The End (For Now)

Enjoy the Weekend!
