
Offspring from Reproduction Problems: what replication failure teaches us

Jun 18, 2015



Marieke van Erp

Slides of ACL 2013 presentation of:
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: what replication failure teaches us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1691–1701, Sofia, Bulgaria, August 4-9 2013
Transcript
Page 1: Offspring from Reproduction Problems: what replication failure teaches us

NewsReader is funded by the European Union’s 7th Framework Programme (ICT-316404)

BiographyNet is funded by the Netherlands eScience Center. Partners in BiographyNet are Huygens/ING Institute of the Dutch Academy of Sciences and VU University Amsterdam.

Wednesday, August 7, 13

Page 2

OFFSPRING FROM REPRODUCTION PROBLEMS:
WHAT REPLICATION FAILURE TEACHES US

Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire

Page 3

NER EXPERIMENTS

• Nuno Freire, José Borbinha, and Pável Calado (2012) present an approach to recognising named entities in the cultural heritage domain with a small amount of training data

• The dataset used in Freire et al. (2012) is available

• Their software was not available

• The paper describes the feature set used, the (open source) machine learning package, and the experimental setup

Page 4

NER EXPERIMENTS

• Nuno Freire provided additional feedback, but he had finished his PhD and changed jobs: he no longer had access to his original experimental setup or his code

• The paper did not provide information on tokenisation, the exact preprocessing steps, the cuts for 10-fold cross-validation, the number of decimals used for rounding weights, etc.
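Details like the fold cuts are easy to dismiss, but they move the numbers. A minimal Python sketch, using toy labels and a majority-class baseline (both hypothetical, not the paper's setup): cutting the same data into 10 folds under different shuffles leaves the overall score unchanged while the per-fold scores can shift.

```python
import random

def fold_accuracies(labels, n_folds=10, seed=0):
    """Cut the data into folds and score a majority-class baseline per fold."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)          # the unreported "cut" decision
    folds = [idx[i::n_folds] for i in range(n_folds)]
    majority = max(set(labels), key=labels.count)
    return [sum(labels[i] == majority for i in fold) / len(fold)
            for fold in folds]

# Toy label sequence standing in for per-token NER gold labels.
labels = ["PER"] * 12 + ["O"] * 28

acc_seed0 = fold_accuracies(labels, seed=0)
acc_seed1 = fold_accuracies(labels, seed=1)

# Same data, same baseline: the total is identical either way,
# but individual fold scores depend on how the cut was made.
print(acc_seed0)
print(acc_seed1)
```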

Page 5

NER EXPERIMENTS

                  Freire et al. (2012)           Van Erp and Van der Meij (2013)
                  Precision  Recall  F-score     Precision  Recall  F-score
LOC (388)         92%        55%     69          77.8%      39.2%   52.1
ORG (157)         90%        57%     70          65.8%      30.6%   41.7
PER (614)         91%        56%     69          73.3%      37.6%   49.7
Overall (1,159)   91%        55%     69          73.3%      37.1%   49.5

Page 6

NER EXPERIMENTS

• Variations on tokenisation yielded a 15 point drop in overall F-score

• Results on individual folds differed by up to 25 points in F-score

• Experimenting with a different implementation of the CRF algorithm yielded significantly different scores (almost attaining those of Freire et al. (2012) without the complex features)
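To see why tokenisation alone can move token-level NER scores, consider a small sketch (hypothetical sentence and tokenisers, not the ones from the experiments): two plausible tokenisation choices produce different token sequences for the same text, so token-level evaluations against them are not comparable.

```python
import re

text = "The Rijksmuseum, Amsterdam (est. 1800)."

# Variant 1: whitespace-only tokenisation keeps punctuation glued to words.
simple = text.split()

# Variant 2: split off punctuation as separate tokens.
split_punct = re.findall(r"\w+|[^\w\s]", text)

print(simple)       # 5 tokens, e.g. 'Rijksmuseum,' with trailing comma
print(split_punct)  # 10 tokens, 'Rijksmuseum' and ',' separated
```

Any entity boundary that falls next to punctuation is now counted against different token units, which is enough to change precision and recall.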

Page 7

NER EXPERIMENTS

• Preprocessing of the data set and the extra resources used probably influenced our experiments

• Encoding the features as multivariate vs. boolean as input for the machine learner may have made a difference

• Without additional information (the exact output, the data/resources after preprocessing), it is hard to find out what causes these differences, or which experiment is most indicative of the potential of the approach
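The multivariate vs. boolean distinction can be sketched as follows (hypothetical PoS feature, not the actual feature set of the experiments): the same categorical feature either reaches the learner as one feature with many values, or as several boolean indicator features, and many learners weight the two encodings differently.

```python
# Hypothetical value set for a categorical PoS feature.
pos_values = ["NN", "VB", "JJ"]

def multivariate(tag):
    """One feature, many possible values."""
    return {"pos": tag}

def boolean(tag):
    """One boolean indicator feature per possible value."""
    return {f"pos={v}": tag == v for v in pos_values}

print(multivariate("NN"))  # a single feature/value pair
print(boolean("NN"))       # three indicator features, one of them True
```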

Page 8

WORDNET SIMILARITY EXPERIMENTS

• Marten Postma & Piek Vossen wanted to run a WordNet similarity experiment for Dutch previously done for English by Patwardhan & Pedersen (2006) and Pedersen (2010)

• This experiment ranks word pairs by WordNet similarity measures and compares the rankings to human rankings

• Step 1: replicate the WordNet calculations of the original experiments
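The comparison step can be sketched in a few lines of Python (made-up scores, and a minimal Spearman rho that assumes no ties): each word pair gets a measure score and a human rating, both are converted to ranks, and the two rankings are correlated.

```python
def rank(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

wordnet_sim = [0.9, 0.4, 0.7, 0.1]   # hypothetical measure scores per pair
human = [0.95, 0.6, 0.5, 0.2]        # hypothetical human ratings

print(spearman_rho(wordnet_sim, human))  # -> 0.8
```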

Page 9

WORDNET SIMILARITY EXPERIMENTS

• The code used in the original experiments is open source, but the results still differed from the original

• Pedersen pointed out that the version of WordNet (and possibly of the Perl packages) may influence results

• The experiments were repeated with the exact same versions as the original: the results still differed

Page 10

WORDNET SIMILARITY EXPERIMENTS

• Together with Ted Pedersen, we ran the experiment step by step, comparing the outcomes until we obtained the same results

• We identified the following factors that had led to the differences:
  • Restriction on PoS tags
  • Gold Standard used
  • Ranking coefficient used

• How much can these factors actually influence results?

Page 11

WORDNET SIMILARITY EXPERIMENTS

Page 12

WORDNET SIMILARITY EXPERIMENTS: VARIATIONS IN OUTPUT

Measure   rho    tau    rank
path      0.08   0.07   1-8
wup       0.09   0.08   1-6
lch       0.08   0.07   1-7
res       0.10   0.31   4-11
lin       0.24   0.17   6-10
jcn       0.27   0.23   5,7-11
hso       0.07   0.05   1-3,5-10
vpairs    0.30   0.24   7-11
vector    0.44   0.43   1,2,4,6-11
lesk      0.17   0.63   1-8,11,12

Page 13

WORDNET SIMILARITY EXPERIMENTS

Variation         Spearman rho   Kendall tau   Different rank
WordNet version   0.44           0.42          88%
Gold Standard     0.24           0.21          71%
PoS-tag           0.09           0.08          41%
Configuration     0.08           0.60          41%
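One of the identified factors is the ranking coefficient itself: for the same pair of rankings, Spearman rho and Kendall tau generally report different numbers. A small Python sketch with made-up rankings (no ties assumed):

```python
from itertools import combinations

def spearman_rho(rx, ry):
    """Spearman rho from two rank lists via the rank-difference formula."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(rx, ry):
    """Kendall tau: (concordant - discordant pairs) / total pairs."""
    n = len(rx)
    score = sum(
        1 if (rx[i] - rx[j]) * (ry[i] - ry[j]) > 0 else -1
        for i, j in combinations(range(n), 2)
    )
    return score / (n * (n - 1) / 2)

rx = [1, 2, 3, 4, 5]   # hypothetical ranking by one similarity measure
ry = [2, 1, 3, 5, 4]   # hypothetical ranking by a gold standard

print(spearman_rho(rx, ry))  # -> 0.8
print(kendall_tau(rx, ry))   # -> 0.6
```

The two coefficients agree on perfect correlation but diverge elsewhere, so a paper that does not say which coefficient it used leaves a replication attempt guessing.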

Page 14

WORDNET SIMILARITY EXPERIMENTS

• Performance of similarity measures can vary significantly

• Influential factors interact differently with individual scores (i.e. comparative performance changes)

• Apart from the WordNet version, these factors have (to our knowledge) not been discussed in previous literature, despite the fact that similarity scores are used very frequently

• Open question: what is the impact of influential factors when similarity measures are used for other tasks?

Page 15

DISCUSSION

• The results in this paper point to two main issues:

(1) Our methodological descriptions often do not contain the details needed to reproduce our results

(2) These details can have such a high impact on our results that it is hard to distinguish the contribution of the approach from the contribution of the preprocessing, the exact versions of tools/resources used, the evaluation set chosen, etc.

Page 16

CONCLUSION

• It is easier to find out how an approach really works if you have:
  • the original code (even if it contains hacks, or is unclean and/or undocumented)
  • a clear description of each individual step
  • the exact output on the evaluation data (not just the overall numbers)
  • the preprocessed/modified/improved versions of standard resources

Page 17

CONCLUSION

• Systematic testing can help to gain insight into the expected variation of a specific approach:
  • what is the performance of individual tools?
  • what are the best and worst results using different parameters?
  • how does performance compare using different evaluation metrics?

• As a community, we should know where our approaches fail as much as (if not more than) where they succeed

Page 18

THANK YOU & THANKS

• To Ted Pedersen and Nuno Freire for writing this paper with us!

• To Ruben Izquierdo, Lourens van der Meij, Christoph Zwirello, Rebecca Dridan, and the Semantic Web group at VU University for their help and feedback

• To the anonymous reviewers, who really helped to make this a better paper