Alignment Weighting for Short Answer Assessment
Björn Rudzewitz
University of Tübingen
Presentation of B.A. Thesis
October 30, 2015

Outline:
Introduction
Data
System
Alignment Weighting
General Linguistic Weighting
Task-Specific Weighting
Hybrid Approach
Experimental Testing
Discussion
Conclusion
Appendix
References
2. TA Token Overlap: % aligned TA tokens
3. Learner Token Overlap: % aligned SA tokens
4. TA Chunk Overlap: % aligned TA chunks
5. Learner Chunk Overlap: % aligned SA chunks
6. TA Triple Overlap: % aligned TA dependency triples
7. Learner Triple Overlap: % aligned SA dependency triples
8. Token Match: % token-identical token alignments
9. Similarity Match: % similarity-resolved token alignments
10. Type Match: % type-resolved token alignments
11. Lemma Match: % lemma-resolved token alignments
12. Synonym Match: % synonym-resolved token alignments
13. Variety: number of kinds of token-level alignments (features 8-12)
Table: CoMiC baseline features.
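As a rough illustration of how the token-level features in the table could be derived from a finished alignment, here is a minimal Python sketch. It is not the CoMiC implementation: the function name `overlap_features`, the `(ta_index, sa_index, kind)` triple format, and the kind labels are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical labels for the five token-level alignment kinds (features 8-12).
KINDS = ("token", "similarity", "type", "lemma", "synonym")

def overlap_features(ta_tokens, sa_tokens, alignments):
    """Compute token overlap, match-type and variety features.

    alignments: list of (ta_index, sa_index, kind) triples (assumed format).
    """
    aligned_ta = {ta for ta, _, _ in alignments}
    aligned_sa = {sa for _, sa, _ in alignments}
    kinds = Counter(kind for _, _, kind in alignments)
    total = len(alignments)

    feats = {
        # Percentage of target answer (TA) tokens that take part in an alignment.
        "ta_token_overlap": 100.0 * len(aligned_ta) / len(ta_tokens) if ta_tokens else 0.0,
        # Percentage of student answer (SA) tokens that take part in an alignment.
        "learner_token_overlap": 100.0 * len(aligned_sa) / len(sa_tokens) if sa_tokens else 0.0,
        # Number of distinct alignment kinds that occur at least once.
        "variety": len(kinds),
    }
    for kind in KINDS:
        # Share of all token alignments resolved by this kind of match.
        feats[kind + "_match"] = 100.0 * kinds[kind] / total if total else 0.0
    return feats

# Toy usage with invented data.
ta = ["Er", "wohnt", "in", "Berlin"]
sa = ["Er", "lebt", "in", "der", "Stadt", "Berlin"]
links = [(0, 0, "token"), (1, 1, "synonym"), (2, 2, "token"), (3, 5, "token")]
features = overlap_features(ta, sa, links)
```

The chunk and dependency triple features would follow the same pattern over chunk and triple alignments.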
Alignment Weighting: Motivation
Idea:
- aligned elements have different properties
- alignments between certain elements may be more important
→ weight existing alignments in new dimension of similarity
- tf.idf lemma-based weighting, adapted from Manning and Schütze [1999]
- generally applicable measure, but task-specific training
- document collection: all reading texts in CREG-5K
- for each aligned token, get tf.idf weight in reading text to which the SA refers
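\[
\mathit{ol}_{tf.idf}(A_h) = \sum_{w_j \in W_{A_h}} \mathit{weight}_{tf.idf}(w_j, d_i)
\]

\[
\mathit{weight}_{tf.idf}(w_j, d_i) =
\begin{cases}
0, & \text{if } (w_j \text{ NOT new}) \text{ OR } (w_j \text{ NOT aligned}) \text{ OR } (w_j \notin d_i) \\
(1 + \log(tf_{j,i})) \times \log \frac{N}{df_j}, & \text{otherwise}
\end{cases}
\]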
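The weighting above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the thesis code: the function names, the `is_new` and `is_aligned` flags, and the use of the natural logarithm are choices made for this example (the slides do not state the log base), and the toy corpus is invented.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """df_j: number of documents (lists of lemmas) that contain each lemma."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def tfidf_weight(lemma, doc, df, n_docs, is_new=True, is_aligned=True):
    """weight_tf.idf(w_j, d_i) for one lemma and one reading text."""
    tf = doc.count(lemma)
    # Zero if the lemma is not new, not aligned, or absent from the reading text.
    if not is_new or not is_aligned or tf == 0:
        return 0.0
    # Natural log is an assumption of this sketch.
    return (1.0 + math.log(tf)) * math.log(n_docs / df[lemma])

def tfidf_overlap(aligned_new_lemmas, doc, df, n_docs):
    """ol_tf.idf(A_h): sum of the weights of all aligned, new lemmas."""
    return sum(tfidf_weight(lemma, doc, df, n_docs) for lemma in aligned_new_lemmas)

# Toy collection of three lemmatized "reading texts" (invented data).
docs = [["haus", "garten", "haus"], ["garten", "auto"], ["auto", "auto", "baum"]]
df = document_frequencies(docs)
score = tfidf_overlap(["haus", "garten"], docs[0], df, len(docs))
```

A lemma that occurs often in the relevant reading text but in few other texts receives a high weight, so alignments involving text-specific vocabulary contribute more to the overlap score.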
Experimental Testing
Significance Testing: McNemar’s test (α = 0.05)
H0: The binary classification performance of an alignment-based short answer assessment system does not change if it is augmented with part of speech or tf.idf features.
H1: The binary classification performance of an alignment-based short answer assessment system significantly improves if it is augmented with part of speech or tf.idf features.
Table: System performance for the baseline system augmented with tf.idf features in terms of accuracy. The symbol ∗ denotes a statistically significant improvement over the baseline (α = 0.05).
Table: System performance for the baseline system augmented with question type and STTS group part of speech features and tf.idf weighting in terms of accuracy. The symbol ∗ denotes a statistically significant improvement over the baseline (α = 0.05).
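For readers unfamiliar with McNemar's test, the following Python sketch shows one common variant: the chi-square approximation with continuity correction, applied to paired per-answer correctness of two systems. Which variant the thesis used is not stated on the slides, so this choice, the function name `mcnemar`, and the toy data are assumptions.

```python
import math

def mcnemar(baseline_correct, augmented_correct):
    """McNemar's test on paired boolean outcomes.

    Returns (statistic, p_value) using the chi-square approximation
    (1 degree of freedom) with continuity correction.
    """
    pairs = list(zip(baseline_correct, augmented_correct))
    # b: baseline right, augmented wrong; c: baseline wrong, augmented right.
    b = sum(1 for base, aug in pairs if base and not aug)
    c = sum(1 for base, aug in pairs if not base and aug)
    if b + c == 0:
        # The two systems never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy data: 100 answers, the systems disagree on 20 of them.
baseline = [True] * 5 + [False] * 15 + [True] * 80
augmented = [False] * 5 + [True] * 15 + [True] * 80
stat, p = mcnemar(baseline, augmented)
reject_h0 = p < 0.05
```

Only the discordant pairs enter the statistic, which is why the test suits comparing two classifiers evaluated on the same answers.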
Experimental Testing: Main results
- many more tables with accuracies and test statistics ...
- POS features alone yield the highest accuracy on one data set (90%)
- tf.idf always yields an improvement
- question types alone are not as effective
- best overall result for the combination of all 3 weightings
- linguistically interpretable question-type specific POS alignment patterns (Appendix 1)
- question-type specific macro-averages show improvement over Meurers et al. [2011] (Appendix 2)
Discussion: Related work
- Ziai and Meurers [2014]: CoMiC + information structure
- Horbach et al. [2013]: CoMiC reimplementation + POS alignment criteria + use of reading text
- Hahn and Meurers [2012]: CoSeC
- many other SAA systems (see thesis)
Conclusion
- significant improvements with novel techniques
- results highly competitive with state-of-the-art systems
- no human annotation needed
- linguistically interesting insights from machine learning algorithms
- combination of all feature variants most effective
Appendix 1: q-type pos align patterns
q-type | #inst. | 10 most informative part of speech tags
Alternative | 7 | VVPP, PPOSAT, PPER, PPOS, VMFIN, PRELAT, PIS, PIDAT, PIAT, PDS

Table: Macro-averages of the best system variant on CREG-1032 obtained by grouping results by question type. Boldface indicates an improvement upon the results by Meurers et al. [2011].
References
Jason Baldridge. The OpenNLP Project. URL: http://opennlp.apache.org/index.html (accessed 25 August 2015), 2005.
Walter Daelemans, Jakub Zavrel, Kurt van der Sloot, and Antal Van den Bosch. TiMBL: Tilburg Memory-Based Learner. Tilburg University, 2004.
David Ferrucci and Adam Lally. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10(3-4):327–348, 2004.
David Gale and Lloyd S Shapley. College Admissions and the Stability of Marriage. American Mathematical Monthly, pages 9–15, 1962.
Michael Hahn and Detmar Meurers. Evaluating the Meaning of Answers to Reading Comprehension Questions: A Semantics-Based Approach. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 326–336. Association for Computational Linguistics, 2012.
Birgit Hamp, Helmut Feldweg, et al. GermaNet - a Lexical-Semantic Net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15. Citeseer, 1997.
Andrea Horbach, Alexis Palmer, and Manfred Pinkal. Using the text to evaluate short answers for reading comprehension exercises. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1, pages 286–295, 2013.
Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966.
Christopher D Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Detmar Meurers, Niels Ott, Ramon Ziai, et al. Compiling a Task-Based Corpus for the Analysis of Learner Language in Context. Proceedings of Linguistic Evidence. Tübingen, pages 214–217, 2010.
Detmar Meurers, Ramon Ziai, Niels Ott, and Janina Kopp. Evaluating Answers to Reading Comprehension Questions in Context: Results for German and the Role of Information Structure. In Proceedings of the TextInfer 2011 Workshop on Textual Entailment, pages 1–9. Association for Computational Linguistics, 2011.
Björn Rudzewitz and Ramon Ziai. CoMiC: Adapting a Short Answer Assessment System for Answer Selection. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval, volume 15, 2015.
Anne Schiller, Simone Teufel, and Christine Thielen. Guidelines für das Tagging deutscher Textcorpora mit STTS. Manuscript, Universities of Stuttgart and Tübingen, 66, 1995.
Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, volume 12, pages 44–49. Citeseer, 1994.
Peter Turney. Mining the Web for Synonyms: PMI-IR Versus LSA on TOEFL. 2001.
Ramon Ziai and Detmar Meurers. Focus Annotation in Reading Comprehension Data. LAW VIII, page 159, 2014.