Page 1: Classification Errors in a Domain-Independent Assessment System

Rodney D. Nielsen 1,2, Wayne Ward 1,2 and James H. Martin 1

1 Center for Computational Language and Education Research, CU, Boulder

2 Boulder Language Technologies

Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.

Reference Answer: A long string produces a low pitch.
(Lawrence Hall of Science 2006, Assessing Science Knowledge)

Learner answer: When the string gets longer it makes the pitch lower.

Page 2: Tailoring the Tutor's Response


Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.

Reference answer: A long string produces a low pitch.

Learner answers:
  When the string gets longer it makes the pitch lower.
  A long string produces a pitch.
  It makes a loud pitch.
  It makes a high pitch.
  If the string is tighter, the pitch is higher.

Page 3: Necessity of Finer-Grained Analysis


Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands.

Reference Answer: A long string produces a low pitch.

Break the reference answer down into low-level facets derived from a dependency parse and thematic roles (a parsing sketch follows the list):
  NMod(string, long): The string is long.
  Agent(produces, string): A string is producing something.
  Product(produces, pitch): A pitch is being produced.
  NMod(pitch, low): The pitch is low.
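A minimal sketch of this decomposition, assuming spaCy as the dependency parser; the parser and thematic-role labeller actually used are not named on this slide, and spaCy's relation labels (amod, nsubj, dobj) differ from the NMod/Agent/Product notation above.

```python
# Hedged sketch: derive governor/modifier facets from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facets(sentence):
    """Return (relation, governor, modifier) triples over content words."""
    facets = []
    for token in nlp(sentence):
        # Skip the root and function words, keeping relations between
        # content-bearing terms, roughly mirroring the slide's facets.
        if token.dep_ == "ROOT" or token.pos_ in {"DET", "PUNCT", "AUX"}:
            continue
        facets.append((token.dep_, token.head.lemma_, token.lemma_))
    return facets

print(extract_facets("A long string produces a low pitch."))
# e.g. [('amod', 'string', 'long'), ('nsubj', 'produce', 'string'),
#       ('amod', 'pitch', 'low'), ('dobj', 'produce', 'pitch')]
```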

Assess whether an understanding of each facet is implied by the student's response.

[Dependency parse of "A long string produces a low pitch.", with arcs: det, nmod, subject, object, det, nmod]

Follow-up Question: Does a long string produce a higher or lower pitch?

Page 4: Representing Fine-Grained Semantics


Assess the relationship between the student's answer and the reference answer facets at a finer grain.

Reference Ans: A long string produces a low pitch.
Facets: NMod(string, long), Agent(produces, string), Product(produces, pitch), NMod(pitch, low)

Learner answer: A long string produces a pitch.
  Understood? Yes, Yes, Yes, No
  Facet labels: Expressed, Expressed, Expressed, Unaddressed
Learner answer: It produces a loud pitch.
  Facet labels: Assumed, Expressed, Expressed, Diff-Arg (different argument)
Learner answer: It produces a high pitch.
  Facet labels: Assumed, Expressed, Expressed, Contra-Expr (contradicted)

Page 5: Answer Annotation Labels


Understood: Facets that are understood by the student
  Assumed: Assumed to be understood a priori based on the question
  Expressed: Directly expressed or inferred by simple reasoning
  Inferred: Inferred by pragmatics or nontrivial logical reasoning
Contradicted: Facets contradicted by the learner answer
  Contra-Expr: Directly contradicted by negation, antonymous expressions and their paraphrases
  Contra-Infr: Contradicted by pragmatics or complex reasoning
Self-Contra: Facets that are both contradicted and implied (self-contradictions)
Diff-Arg: The core relation is expressed, but it has a different modifier or argument
Unaddressed: Facets that are not addressed at all by the student's answer
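As a data-structure summary, the label set as a minimal Python sketch; the enum and its comments only restate the definitions above, and how these collapse into the five Tutor-Labels used later is not specified on this slide.

```python
from enum import Enum

class FacetLabel(Enum):
    # Understood facets
    ASSUMED = "Assumed"          # understood a priori from the question
    EXPRESSED = "Expressed"      # directly expressed or simply inferred
    INFERRED = "Inferred"        # via pragmatics or nontrivial logic
    # Contradicted facets
    CONTRA_EXPR = "Contra-Expr"  # negation, antonyms, their paraphrases
    CONTRA_INFR = "Contra-Infr"  # via pragmatics or complex reasoning
    # Other
    SELF_CONTRA = "Self-Contra"  # both contradicted and implied
    DIFF_ARG = "Diff-Arg"        # core relation, different argument
    UNADDRESSED = "Unaddressed"  # not addressed at all
```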

Page 6: Assessment Technology Overview


1. Start with hand-generated reference answer facets.
2. Automatically parse the reference and learner answers and automatically extract the representation.
3. Extract a feature vector for each reference answer (RA) facet, indicative of the student's understanding of that facet, from the answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics.
4. Train a machine learning classifier on the training set feature vectors.
5. Use the classifier to assess the test set answers, assigning one of five Tutor-Labels to each RA facet (a minimal sketch follows).
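A minimal end-to-end sketch of this pipeline, with scikit-learn's CART decision tree standing in for the C4.5 classifier and deliberately toy features (the real feature set is on the next slide); the Facet type, the miniature training set, and the feature choices are all illustrative assumptions.

```python
from collections import namedtuple
from sklearn.tree import DecisionTreeClassifier

Facet = namedtuple("Facet", "relation governor modifier")

def featurize(facet, answer):
    # Toy stand-ins for the real features: exact-match indicators for the
    # facet's governor and modifier, plus answer length in words.
    toks = answer.lower().split()
    return [int(facet.governor in toks), int(facet.modifier in toks), len(toks)]

# Hypothetical miniature training set: (facet, learner answer, Tutor-Label).
train_items = [
    (Facet("NMod", "pitch", "low"),
     "when the string gets longer it makes the pitch lower", "Expressed"),
    (Facet("NMod", "pitch", "low"),
     "a long string produces a pitch", "Unaddressed"),
]

clf = DecisionTreeClassifier().fit(
    [featurize(f, a) for f, a, _ in train_items],
    [label for _, _, label in train_items],
)

# Assess a new answer: one Tutor-Label per reference-answer facet.
facet = Facet("NMod", "pitch", "low")
print(clf.predict([featurize(facet, "it makes a high pitch")])[0])
```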

Page 7: Machine Learning Features

Lexical:
  The lexical entailment probabilities for the reference answer facet's governor and modifier, following Glickman, Dagan and Koppel (2005); see also Turney (2001). A sketch of this estimate follows the list.
  Indicators of whether the reference answer governor's (modifier's) stem has an exact match in the learner answer.
  The lexical entailment probabilities for the primary constituent facets' governors and modifiers, when the facet in question represents a relation between propositions.
Syntactic:
  The part of speech (POS) tags for the facet's governor and modifier.
  The dependency or role type labels of the facet and the aligned learner answer dependency.
  The edit distance between the dependency path connecting the facet's governor and modifier and the path connecting the aligned terms in the learner answer.
Other:
  True if the facet has a negation and the aligned learner dependency path has a single negation, or if neither has a negation.
  The number of content words in the reference answer, motivated by the fact that longer answers were more likely to result in spurious alignments.
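A hedged sketch of the first feature, a lexical entailment probability in the style of Glickman, Dagan and Koppel (2005): estimate P(reference term | answer term) from document co-occurrence counts and take the maximum over the learner's words. The corpus and any smoothing actually used are not specified here.

```python
from collections import defaultdict

def doc_freqs(corpus):
    """corpus: iterable of token lists. Returns document frequencies for
    words and for word pairs co-occurring within a document."""
    df, codf = defaultdict(int), defaultdict(int)
    for doc in corpus:
        words = set(doc)
        for w in words:
            df[w] += 1
            for u in words:
                if u != w:
                    codf[(u, w)] += 1
    return df, codf

def lexical_entailment_prob(ref_word, answer_words, df, codf):
    # lep(ref | answer) ~= max over answer words u of P(ref_word | u),
    # estimated as codf(u, ref_word) / df(u), following Glickman et al.
    probs = [codf[(u, ref_word)] / df[u] for u in answer_words if df[u]]
    return max(probs, default=0.0)
```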

Page 8: Results (C4.5 decision tree)


Results on Tutor-Labels (Unseen Answers and Unseen Modules, respectively) are 24.4 and 15.4 percentage points over the majority class baseline, and 19.4 and 5.9 points over the lexical baseline.

Accuracy (%) by feature set:

Dataset             | # non-Assumed facets | Majority class | Lexical baseline | All features
Training Set 10xCV  | 54,967               | 54.6           | 59.7             | 77.1
Unseen Answers      | 3,159                | 51.1           | 56.1             | 75.5
Unseen Modules      | 30,514               | 53.4           | 62.9             | 68.8

Page 9: Error Analysis of Domain-Independent Assessment


Leave-one-module-out cross-validation on the 13 training set science modules:
  Train on 12 modules, test on the held-out module; repeat for each of the 13 modules (a sketch of the protocol follows).
  This simulates the Unseen Modules (domain-independent) test set.
Trained and tested on all non-Assumed facets.
Analyzed a random subset of the errors: 100 Expressed and 100 Unaddressed, each consistently annotated by all annotators.
Considered the factors involved in the human annotators' decisions.
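A minimal sketch of the leave-one-module-out protocol, assuming each facet-level example is tagged with its science module and reusing a scikit-learn decision tree as the classifier; the data layout is an illustrative assumption.

```python
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def leave_one_module_out(items):
    """items: list of (module, feature_vector, label); one fold per module."""
    by_module = defaultdict(list)
    for module, x, y in items:
        by_module[module].append((x, y))
    accuracies = {}
    for held_out, test_set in by_module.items():
        # Train on every other module, test on the held-out one.
        train_set = [ex for m, exs in by_module.items()
                     if m != held_out for ex in exs]
        clf = DecisionTreeClassifier().fit(
            [x for x, _ in train_set], [y for _, y in train_set])
        preds = clf.predict([x for x, _ in test_set])
        accuracies[held_out] = (
            sum(p == y for p, (_, y) in zip(preds, test_set)) / len(test_set))
    return accuracies
```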

Page 10: Errors in Expressed Facets


Four main error factors, by frequency:
72% Paraphrases
  43% Phrase-based paraphrasing
  35% Lexical substitution
  26% Coreference
  1% Syntactic alternation (Vanderwende et al. 2005)
22% Logical inference
22% Pragmatics
6% Preprocessing errors

Page 11: Errors in Expressed Facets


43% Phrase-based paraphrasing:
  32 typical paraphrase occurrences
    "in the middle" versus "halfway between"
    "one mineral will leave a scratch" versus "one will scratch the other"
  14 uses of concept definitions
    "circuit" versus "electrical pathway"
  6 negations of antonyms
    "not a lot" for "a little"
    "no one has the same fingerprint" for "everyone has a different print"

Page 12: Errors in Expressed Facets

35% Lexical substitution:
  Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
  Half detectable by a broad-coverage lexical resource (a WordNet-style check is sketched below):
    "tiny" for "small", "CO2" for "gas", "put" for "place", "pen" for "ink", and "push" for "carry"
  Many not easily detectable in lexical resources:
    "put the pennies" for "distribute the pennies", and "have" for "contain"
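A hedged sketch of the broad-coverage check, using WordNet via NLTK as one plausible such resource (the slide does not name the resource used): report whether two words share a synset or stand in a one-step hypernym, similar-to, or meronym relation.

```python
from nltk.corpus import wordnet as wn

def related_in_wordnet(ref_word, learner_word):
    """True if the words share a synset or a one-step WordNet relation."""
    for s1 in wn.synsets(ref_word):
        for s2 in wn.synsets(learner_word):
            if s1 == s2:
                return True  # synonyms share a synset
            if s2 in s1.hypernyms() or s1 in s2.hypernyms():
                return True  # hypernym/hyponym
            if s2 in s1.similar_tos() or s1 in s2.similar_tos():
                return True  # near-synonym adjectives
            if s2 in s1.part_meronyms() or s1 in s2.part_meronyms():
                return True  # meronymy
    return False

print(related_in_wordnet("tiny", "small"))  # expected True via similar_tos
```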

Page 13: Errors in Expressed Facets


26% Coreference resolution:
  15 pronouns (11 "it", 3 "she", 1 "one")
  6 NP term substitutions
    Ref Ans: "clay particles are light"; Learner Ans: "clay is the lightest"
  6 other common noun coreference issues

Page 14: Errors in Expressed Facets

22% Logical inference:
  "no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar" => the two cups have the same amount of water
  "… it is easy to discriminate …" => the two sounds are very different
22% Pragmatics:
  "Because the vibrations" => the rubber band is vibrating
  "… the fulcrum is too close to the earth" => the earth is the load in the system

Page 15: Errors in Expressed Facets


6% Preprocessing errors:
  Normalization issues
  Parser errors

Page 16: Errors in Expressed Facets


Over half of the errors involved more than one of the fine-grained factors:
  "There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb." => the tree blocks the light

Page 17: Errors in Unaddressed Facets


Many are questionable annotations:
  "You could take a couple of cardboard houses and … 1 with thick glazed insulation. …" =/> installing the insulation in the houses
  "Because the darker the color the faster it will heat up" =/> darkest color

Page 18: Errors in Unaddressed Facets


Biggest source of error: lexical similarity
  Ignorance of context:
    "[the electromagnet] has to be iron …" => steel is made from iron
  Antonyms:
    "closer" versus "greater distance", and "absorbs energy" versus "reflects energy"
  Misguided trust:
    "I learned it in class"

Page 19: Conclusion

New assessment paradigm:
  Fine-grained facets and labels
  Corpus of 146K fine-grained inference annotations
Answer assessment system:
  24.4 and 15.4 percentage points over baseline for in-domain and out-of-domain results, respectively
  First successful assessment of Grade 3-6 constructed responses
Error analysis provides insight into where future work is most appropriate.

Page 20: Thanks!

We are grateful to the anonymous reviewers, whose comments improved the paper, and to the Lawrence Hall of Science for the data.

This work was partially funded by Award Numbers: NSF 0551723, IES R305B070434, and NSF DRL-0733323.