Classification Errors in a Domain-Independent Assessment System
Rodney D. Nielsen1,2, Wayne Ward1,2 and James H. Martin1
1 Center for Computational Language and Education Research, CU, Boulder
2 Boulder Language Technologies
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Reference Answer: A long string produces a low pitch.
(Lawrence Hall of Science 2006, Assessing Science Knowledge)
Learner answer: When the string gets longer it makes the pitch lower.
ACL-BEA, Jun 19, 2008, Rodney D. Nielsen 2
Tailoring the Tutor’s Response
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Reference answer: A long string produces a low pitch.
Learner answers:
  When the string gets longer it makes the pitch lower.
  A long string produces a pitch.
  It makes a loud pitch.
  It makes a high pitch.
  If the string is tighter, the pitch is higher.
Necessity of Finer-Grained Analysis
Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands.
Reference Answer: A long string produces a low pitch.
Break the reference answer down into low-level facets derived from a dependency parse and thematic roles:
  NMod(string, long): The string is long.
  Agent(produces, string): A string is producing something.
  Product(produces, pitch): A pitch is being produced.
  NMod(pitch, low): The pitch is low.
Assess whether an understanding of each facet is implied by the student’s response.
[Figure: dependency parse of "A long string produces a low pitch." with arc labels det, nmod, subject, object]
Follow-up Question: Does a long string produce a higher or lower pitch?
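The facet breakdown above can be written down as a small data structure. This is a minimal sketch, not the authors' representation: the class name `Facet` and its field names are assumptions chosen for illustration, and the first argument of each facet is assumed to be the governor and the second the modifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facet:
    """One low-level facet of a reference answer (hypothetical structure)."""
    relation: str  # dependency or thematic-role label, e.g. "NMod"
    governor: str  # assumed: first argument of the facet
    modifier: str  # assumed: second argument of the facet
    gloss: str     # plain-language reading taken from the slide

    def __str__(self) -> str:
        return f"{self.relation}({self.governor}, {self.modifier})"


REFERENCE_ANSWER = "A long string produces a low pitch."

# The four facets listed on the slide for this reference answer.
REFERENCE_FACETS = [
    Facet("NMod", "string", "long", "The string is long."),
    Facet("Agent", "produces", "string", "A string is producing something."),
    Facet("Product", "produces", "pitch", "A pitch is being produced."),
    Facet("NMod", "pitch", "low", "The pitch is low."),
]

for facet in REFERENCE_FACETS:
    print(f"{facet}: {facet.gloss}")
```

Keeping each facet as a separate record is what lets a tutor ask about one facet at a time, as in the follow-up question about pitch.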
Representing Fine-Grained Semantics
Assess the relationship between the student’s answer and the reference answer facets at a finer grain.
Reference Answer: A long string produces a low pitch.
Answer Annotation Labels
  Understood: Facets that are understood by the student
    Assumed: Assumed to be understood a priori based on the question
    Expressed: Directly expressed or inferred by simple reasoning
    Inferred: Inferred by pragmatics or nontrivial logical reasoning
  Contradicted: Facets contradicted by the learner answer
    Contra-Expr: Directly contradicted by negation, antonymous expressions and their paraphrases
    Contra-Infr: Contradicted by pragmatics or complex reasoning
  Self-Contra: Facets that are both contradicted and implied (self contradictions)
  Diff-Arg: The core relation is expressed, but it has a different modifier or argument
  Unaddressed: Facets that are not addressed at all by the student’s answer
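The label hierarchy above can be sketched as an enumeration plus a lookup from each fine-grained label to its top-level group. The names `AnnotationLabel`, `TOP_LEVEL` and `top_level` are hypothetical, and collapsing to the five top-level groups is an assumption read off the slide's indentation rather than a documented procedure.

```python
from enum import Enum


class AnnotationLabel(Enum):
    """Fine-grained answer annotation labels as listed on the slide."""
    ASSUMED = "Assumed"
    EXPRESSED = "Expressed"
    INFERRED = "Inferred"
    CONTRA_EXPR = "Contra-Expr"
    CONTRA_INFR = "Contra-Infr"
    SELF_CONTRA = "Self-Contra"
    DIFF_ARG = "Diff-Arg"
    UNADDRESSED = "Unaddressed"


# Assumed grouping: each fine-grained label maps to the top-level
# category it is nested under on the slide.
TOP_LEVEL = {
    AnnotationLabel.ASSUMED: "Understood",
    AnnotationLabel.EXPRESSED: "Understood",
    AnnotationLabel.INFERRED: "Understood",
    AnnotationLabel.CONTRA_EXPR: "Contradicted",
    AnnotationLabel.CONTRA_INFR: "Contradicted",
    AnnotationLabel.SELF_CONTRA: "Self-Contra",
    AnnotationLabel.DIFF_ARG: "Diff-Arg",
    AnnotationLabel.UNADDRESSED: "Unaddressed",
}


def top_level(label: AnnotationLabel) -> str:
    """Return the top-level group for a fine-grained annotation label."""
    return TOP_LEVEL[label]


print(top_level(AnnotationLabel.CONTRA_INFR))
```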
Assessment Technology Overview
Start with hand-generated reference answer facets
Automatically parse reference & learner answer and automatically extract representation
Extract a feature vector for each reference answer (RA) facet indicative of the student’s understanding of that facet
  From answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
Train a machine learning classifier on the training set feature vectors
Use classifier to assess the test set answers, assigning one of five Tutor-Labels for each RA facet
Machine Learning Features
Lexical
  The lexical entailment probabilities for the reference answer facet’s governor and modifier (following Glickman, Dagan and Koppel 2005; see also Turney 2001)
  Indicators of whether the reference answer governor’s (modifier’s) stem has an exact match in the learner answer
  The lexical entailment probabilities for the primary constituent facets’ governors and modifiers when the facet in question represents a relation between propositions
Syntactic
  The part of speech (POS) tags for the facet’s governor and modifier
  The dependency or role type labels of the facet and the aligned learner answer dependency
  The edit distance between the dependency path connecting the facet’s governor and modifier and the path connecting the aligned terms in the learner answer
Other
  True if the facet has a negation and the aligned learner dependency path has a single negation, or if neither has a negation
  The number of content words in the reference answer, motivated by the fact that longer answers were more likely to result in spurious alignments
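A few of the features above are simple enough to sketch directly. The code below is illustrative only and is not the system's implementation: all function names are hypothetical, the edit distance is assumed to be unit-cost Levenshtein over sequences of dependency labels, the entailment estimate is a co-occurrence ratio in the style of Glickman, Dagan and Koppel (2005), and the corpus counts are made up.

```python
from typing import Dict, Sequence, Tuple


def exact_stem_match(facet_stem: str, learner_stems: Sequence[str]) -> bool:
    """Indicator: does the facet term's stem appear in the learner answer?"""
    return facet_stem in learner_stems


def lexical_entailment_prob(
    ref_word: str,
    learner_words: Sequence[str],
    doc_count: Dict[str, int],
    cooc_count: Dict[Tuple[str, str], int],
) -> float:
    """Max over learner words v of n(ref_word, v) / n(v) (assumed estimator)."""
    best = 0.0
    for v in learner_words:
        n_v = doc_count.get(v, 0)
        if n_v > 0:
            best = max(best, cooc_count.get((ref_word, v), 0) / n_v)
    return best


def path_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost Levenshtein distance between two dependency-label paths."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def negation_consistent(facet_negated: bool, learner_path_negations: int) -> bool:
    """True if the facet is negated and the aligned path has exactly one
    negation, or if neither the facet nor the path has a negation."""
    if facet_negated:
        return learner_path_negations == 1
    return learner_path_negations == 0


# Toy, made-up corpus statistics for illustration only.
doc_count = {"longer": 40, "string": 100}
cooc_count = {("long", "longer"): 30, ("long", "string"): 20}

print(exact_stem_match("string", ["string", "get", "long", "pitch", "low"]))
print(lexical_entailment_prob("long", ["longer", "string"], doc_count, cooc_count))
print(path_edit_distance(["nmod", "subject"], ["nmod", "object"]))
print(negation_consistent(False, 0))
```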
Results (C4.5 decision tree)
Results on Tutor-Labels (Unseen Answers and Unseen Modules, respectively):
  24.4 and 15.4 percentage points over the majority class baseline
  19.4 and 5.9 percentage points over the lexical baseline

                     # non-Assumed facets   Majority class (%)   Lexical baseline (%)   All features (%)
Training Set 10xCV   54,967                 54.6                 59.7                   77.1
Unseen Answers       3,159                  51.1                 56.1                   75.5
Unseen Modules       30,514                 53.4                 62.9                   68.8
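The improvements quoted for the two test sets follow directly from the table. A small check, with the figures copied from the table and the names `RESULTS` and `gains` introduced only for illustration:

```python
from typing import Dict, Tuple

# (majority class, lexical baseline, all features), in percent, from the table.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "Training Set 10xCV": (54.6, 59.7, 77.1),
    "Unseen Answers": (51.1, 56.1, 75.5),
    "Unseen Modules": (53.4, 62.9, 68.8),
}


def gains(row: Tuple[float, float, float]) -> Tuple[float, float]:
    """Percentage-point gain of the full feature set over each baseline."""
    majority, lexical, full = row
    return round(full - majority, 1), round(full - lexical, 1)


for name, row in RESULTS.items():
    over_majority, over_lexical = gains(row)
    print(f"{name}: +{over_majority} over majority, +{over_lexical} over lexical")
```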
Error Analysis of Domain-Independent Assessment
Leave-one-module-out cross-validation on the 13 training set science modules
  Train on 12 modules, test on the held-out module; do this for each of the 13 modules
  Simulates Unseen Modules (domain-independent) test set
  Trained and tested on all non-Assumed facets
Analyzed a random selection of a subset of errors
  100 Expressed and 100 Unaddressed
  Consistently annotated by all annotators
Consider the factors involved in the decision by humans
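The leave-one-module-out procedure can be sketched as a generator of train/test splits. This is a minimal illustration under stated assumptions: the function name, the module names and the placeholder examples are all hypothetical, and no classifier is trained here.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_module_out(
    examples_by_module: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_module, train_examples, test_examples) for each module."""
    for held_out in examples_by_module:
        # Training data: every example from every module except the held-out one.
        train = [
            ex
            for module, examples in examples_by_module.items()
            if module != held_out
            for ex in examples
        ]
        test = list(examples_by_module[held_out])
        yield held_out, train, test


# Placeholder data: 13 hypothetical modules with 3 facet examples each.
modules = {
    f"module_{i:02d}": [f"facet_{i}_{k}" for k in range(3)]
    for i in range(1, 14)
}

folds = list(leave_one_module_out(modules))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Because no example from the held-out module reaches the training split, each fold approximates assessing answers from a science module the classifier has never seen.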
Errors in Expressed Facets
Four main error factors by frequency: 72% Paraphrases