Reading and Writing
https://doi.org/10.1007/s11145-020-10026-4
Modeling question‑answer relations: the development
of the integrative inferential reasoning comic
assessment
Alexander Mario Blum¹,² · James M. Mason¹ · Jinho Kim¹,³ · P. David Pearson¹
© Springer Nature B.V. 2020
Abstract We constructed a new taxonomy for inferential thinking, a construct called Integrative Inferential Reasoning (IIR). IIR extends Pearson and Johnson's (1978) framework of text-implicit and script-implicit question-answer relations, and integrates several other prominent literacy theories to form a unified inferential reasoning construct that is positioned as a type of cognitive processing disposition. We validated our construct using a researcher-made IIR instrument, which was administered to 72 primary-grade students. Participants answered open-ended inference questions about various aspects of visual narratives presented in comic-strip format. We categorized participants' responses as exemplifying one of the levels of IIR: text-implicit, script-implicit, or a combination of both. We used item response models to validate the ordinal nature of IIR and its structure. Specifically, we fit Masters' (1982) partial credit model and obtained a Wright map, mean location data, and reliability estimates. Results confirmed that the IIR construct behaves ordinally. Additionally, age was found to be a reliable predictor of IIR, and item types (each modeled as a separate dimension) were found to have reasonable latent correlations between the dimensions. This research provides theoretical and practical insights for the pedagogy and assessment of narrative comprehension.
Keywords Cognitive processing disposition · Comics ·
Inferential thinking · Item response models · Narrative
comprehension · Scale validation
* Alexander Mario Blum
[email protected]

1 Graduate School of Education, University of California, Berkeley, Berkeley, USA
2 Department of Special Education, San Francisco State University, 1600 Holloway Avenue, Burk Hall 156, San Francisco, CA 94132, USA
3 IMEC Research Group at KU Leuven, KU Leuven and ITEC, Kortrijk, Belgium
Introduction
Different theories illuminate the meaning-making process of narrative comprehension, particularly when readers are faced with a question of some sort. This process has been discussed in terms of its question-answer relations (QARs; Pearson & Johnson, 1978), the type of inference probe (Warren, Nicholas, & Trabasso, 1979), the relationship between QAR and inference probe (Chikalanga, 1992), the type of inferential response and associated coherence (Graesser, Singer, & Trabasso, 1994), and the notion of integration that readers create between the text-base they construct from a text and the knowledge-base they bring to their reading (Kintsch, 1988, 1998; Kintsch & van Dijk, 1978). This gives rise to the following questions in literacy research: (a) can these theories be united to form a single construct; (b) how many levels are there to this construct; (c) is there a hierarchical relationship among the levels as they relate to narrative comprehension; and (d) does this construct develop in an orderly way across age levels? We describe the construction and validation of an instrument, through the use of appropriate measurement methodologies, to attempt to answer these questions.
A brief history of the research in inferential
reasoning
In 1978, Pearson and Johnson published a chapter (Pearson & Johnson, 1978) exploring the nature of questions and diving into their relationship with the text and the reader. Questions warrant our attention not only because of their widespread use, but also because of their central role in any classroom discussion activity and in measuring the construct of reading comprehension (Pearson, 1982; Pearson & Johnson, 1978). Questions on their own cannot determine the type of comprehension readers enact. Questions serve as a kind of “invitation” to engage at a particular level of cognitive challenge, but readers may decide to respond at a different level (Pearson & Johnson, 1978). Thus it is only after examining the responses that a particular student offers, in concert with the invitation offered by the question, that we can actually determine the true nature of the act of comprehension that occurred. It is the question-answer relationship that matters. To determine that relationship, we must consult the information provided in the text-base that a reader created from what the author provided, and the information provided from the reader's world knowledge. The Pearson-Johnson scheme of QARs includes three levels of reasoning:

Text-explicit (where both the question and the answer are in the text, and the QAR is rendered explicit by the logic of the grammatical relationship between the text and the question). For example:

1. John rode the horse to improve his riding skills. He saw a championship in his future.
2. Why did John ride the horse?
3. To improve his riding skills.
Text-implicit (both the question and answer are derivable from the text, but there is no logical or grammatical cue explicitly linking the question to the answer; the reader must establish the connection between question and answer to complete the QAR). Point 4 represents a text-implicit QAR (with regard to points 1 and 2) because the reader had to infer the logical link between the proximal goal of improving skills and the distal goal of a championship.

4. To become a champion.

Script-implicit (the question is motivated by the text but the answer comes from the world-knowledge script that the reader brings to the task). Often the logical link between the answer and the text is reasonable, even transparent, but the reader has to supply the link. Point 5 provides such an example because the reader has generalized from improving skills (or perhaps the championship phrase) to the “always do your best” script that guides human behavior.

5. Because he was always driven to do his best.

Thus, these categories cannot be classified as question types in their own right, considering that the question in point 2 could be answered using different sources of information, resulting in different QARs. The logic of QARs continues to this day in terms of instructional routines based on the work of Raphael and Au (2005).
Warren et al. (1979) argued for another set of inferences derived from a different set of relationships, specifically logical and evaluative relations. Logical relations are relationships between events that involve the causes, motivations, and conditions that allow events to happen, and they are explanatory in nature. These are the things that drive actions and events in stories. One particular type of logical relation they argue for is the motivational inference, which involves inferring why an agent engaged in an intentional action and tends to privilege why and how questions. Evaluative relations apply notions of significance, morality, themes, and lessons within the story, and are highly sensitive to, and dependent on, the readers' world knowledge.
Kintsch (1988) describes the notion of a situation model, in which the reader integrates his/her world knowledge with the text-base. The process begins with context: the reader is first situated by the discourse context (e.g., the motive for reading the story). Then the reader engages in the construction process, in which they construct a text-base. To accomplish this, the learner forms concepts directly corresponding to the linguistic input from the narrative. Some of these ideas are based in their world knowledge; others are based on the text-base, such as when the reader is hearing or reading the climax of the narrative (e.g., a macro-proposition). Then the reader goes through the integration process, in which they use their world knowledge to constrain attention to only the relevant ideas, as they pertain to the linguistic input of the narrative. This process continues as the reader comes across new information in the narrative, constantly updating their situation model.
In 1992, Chikalanga proposed his own theoretical levels of comprehension (Chikalanga, 1992), combining both Pearson and Johnson's (1978) and Warren et al.'s (1979) inferential thinking models. Chikalanga argued that, depending on the relationship between the information provided by the text-base and the reader's world knowledge, logical and informational inferences could be either text-based or script-based (see Chikalanga, 1992, p. 706, for an example).
Graesser et al. (1994) suggested a set of knowledge-based inferences that reflect the influence of Kintsch's (1988) Construction-Integration model of comprehension. Recall that Kintsch posited three models (more like levels) of comprehension that entail the text and the reader's knowledge base as primary resources available to assist in model building. The surface code consists of a cursory account of the words and sentence syntax, with very little processing by the reader. The text base requires the reader to make low-level inferential connections required for establishing cohesion among the propositions in the text (e.g., resolving anaphora and establishing explanatory links from one proposition to the next). Graesser et al. posit 13 classes of knowledge-based inferences in the context of narrative comprehension (see Graesser et al., 1994 for a full review). A subset of these inferences can theoretically be mapped onto motivational inferences (Warren et al., 1979), specifically causal antecedent, superordinate goal, thematic, character emotional reaction, causal consequence, and state inferences (Graesser et al., 1994). For example, when asking a motivational inference question, such as “why did he steal the car?”, one could answer based on the character's superordinate goals (“he wanted to get to the store”—the desire to go to the store motivated the action), character emotional reaction (“he was feeling angry”—his emotion was the motivation for his action), causal antecedent (“someone dared him to do it right before”—the event prior was the motivation for the action), state (“he believed he could succeed”—his belief was the motivation for the action), causal consequence (“his friends would celebrate him after”—the forecasted consequential outcome of the event in question was the motivation for his action), or even a thematic inference (“because desperate people do desperate things”—the moral, lesson, or principle was the motivation for the agent's action). Evaluative inferences could be answered based on thematic inferences (Graesser et al., 1994).
Basaraba, Yovanoff, Alonzo, and Tindal (2013) also investigated the notion of levels of comprehension. They mapped item types representing different levels of comprehension onto Pearson and Johnson's (1978) QARs: literal (text-explicit QAR), inferential (text-implicit QAR), and evaluative (script-implicit QAR). Essentially, they mapped items onto different ordinal levels. By placing these levels on an ordinal continuum, they investigated the location of mean item difficulty along a logit scale representing a unidimensional construct. They found that items representing literal comprehension were easier than items representing inferential comprehension, which in turn were easier than items representing evaluative comprehension, supporting the notion of QARs being placed on an ordinal scale. This consistent increase in mean item location provides preliminary evidence for the existence of these categories, although their analyses did not use explicit parameter linking between samples, and the evidence was not conclusive in all cases.
Arguing for a taxonomy of meaning making, Perfetti and Stafura
(2015) described categories of reasoning, akin to QARs, and placed
them into an ordinal structure. Their first level represented
explicit meaning (i.e., text-explicit QARs), the second level
represented implicit meanings bound to text language (i.e.,
text-implicit QARs), and the third level represented inferences not
bound by text language and thus reflecting world knowledge (i.e.,
script-implicit QARs).
Ways of investigating inferential thinking
Over the last several decades, there have been three recurring themes in investigating inferential thinking: in some studies, researchers have made inferences about participants' inferencing using their response times; some treated inferencing as something to be tested using a classical psychometric approach; and some treated inferencing as something to be measured using a modern psychometric approach—some studies used multiple approaches, since each illuminates a different facet of inferencing.
Chronometrics: response times
One theme involves the use of various measures of response time (e.g., explicit reading times, eye tracking, and matching; see Briner, Virtue, & Kurby, 2012; Cozijn, Commandeur, Vonk, & Noordman, 2011; Gernsbacher, Robertson, Palladino, & Werner, 2004; Long & Chong, 2001; Noordman, Vonk, & Kempff, 1992; Singer, 1980). A particularly interesting example of this theme is Ramachandran, Mitchell, and Ropar (2009), who investigated the response time (RT) of participants with and without autism engaging in trait inferences. Participants read trait-implying sentences and were then presented with a pair of words; their task was to choose the word that best related to the sentence they had just read. Malle and Holbrook (2012) investigated whether there is a hierarchical relationship among goal, belief, personality, and intentionality inferences, using both RT and the proportion of correct responses. Participants read different sentences on a screen that were designed to elicit a particular inference, followed by a probe word that represented a type of inference possibly being elicited. Participants then clicked ‘yes' or ‘no' to indicate whether the probe word aligned with the prior sentence. Van Overwalle, Van Duynslaeger, Coomans, and Timmermans (2012) explored the relationship between trait inferences and goal inferences, in particular whether one type of inference results in shorter RT. On a computer screen, participants were shown a series of sentences designed to elicit particular types of inference. After reading each passage, participants were told that a series of probe words would appear; for each, they would click ‘yes' or ‘no' to indicate whether the probe word represented the inference they had made.
Classical psychometrics: score‑based tests
Many approaches to investigating inferential thinking align with a Classical Test Theory approach, in which inferences are tested using a series of items with numeric scores
(e.g., 0/1 for incorrect/correct, 0–3 for a long-form response scored with a rubric, or 0–4 for a 5-point Likert scale). Participants' total (or average) scores are treated as an observed variable representing their inferential thinking, suitable for the application of common statistical models (Barnes, Dennis, & Haefele-Kalvaitis, 1996; Cain, Oakhill, Barnes, & Bryant, 2001; Hagá, Garcia-Marques, & Olson, 2014; Wagner & Rohwer, 1981). Additional exemplars include White, Hill, Happé, and Frith (2009), who utilized five different types of social stories to compare the comprehension performance of individuals with autism to a control group of neurotypically developing counterparts. Narrative comprehension outcomes were scored 0–3 for each item and used in a regression analysis. Nuske and Bavin (2011) sought to measure narrative comprehension skills in individuals with high-functioning autism relative to their typically developing peers. Tasks included literal comprehension items, propositional inference items, and script inference items. Items for main idea and details were scored 0–3, and inference items were scored 0–6. Differences in mean scores between groups were tested using ANCOVA. As mentioned above, Ramachandran et al. (2009) also used the proportion of correct responses in their analysis. Hagá et al. (2014) set out to investigate how learners incorporate added contextual information when generating an inference. They had five groups of participants (kindergartners, 2nd, 6th, and 9th graders, and undergraduates) who responded to Likert-type rating items. They used ANOVA to analyze differences in mean ratings.
Modern psychometrics: latent variable modeling
Some studies used approaches like Rasch models, item response theory, or structural equation models, in which participants' inferential thinking is measured by statistically modeling this type of cognition as a latent variable that contributes to their response patterns (Alonzo, Basaraba, Tindal, & Carriveau, 2009; Baghaei & Ravand, 2015; Embretson & Wetzel, 1987; Pitts & Thompson, 1984). As noted above, Basaraba et al. (2013) investigated levels of comprehension akin to Pearson and Johnson's (1978) QARs. Using Rasch modeling, they found evidence for distinct levels of comprehension based on mean item difficulty. Language and Reading Research Consortium (LARRC) and Muijselaar (2018) investigated the dimensionality of local and global reading comprehension tasks, with item types akin to Pearson and Johnson's (1978) text-explicit and text-implicit QARs, using structural equation modeling. Santos et al. (2016) used Rasch modeling to compare the mean difficulties among item types, where the items were mapped onto an ordinal series of reading comprehension levels (literal comprehension, inferential comprehension, critical comprehension, and reorganization).
Towards a contemporary approach to measuring
inferential thinking
The parable of the measurer
Imagine that a research team was constructing a measure of
inferential thinking. As is typically done (DeVellis, 2006), they
began with a series of items which they and
other experts believed to represent the skill set of inferential thinking. They piloted the measure many times and kept only highly correlated items (i.e., items that have a high proportion of common variance, rather than unique variance). They also kept a broad range of easy items (a high proportion of endorsement) and relatively more difficult items (a low proportion of endorsement). They decided to make some items worth more points than others based on which items they considered more “difficult.”

Specifically, they made two measures, one that measured propositional inferential thinking and another that measured script-implicit inferential thinking (Nuske & Bavin, 2011). They gave both measures to two populations, autistic and neurotypical. They then computed the mean sum-score within each population and performed t-tests for differences in means. They did not find a significant difference between the two groups on the propositional inferential thinking task, but they did find a significant difference on the script-implicit inference task. As a result, they concluded that autistic children have a deficit in script-implicit inferential thinking.
Is this true, particularly in the deterministic manner that this hypothetical finding would suggest (Hambleton & Jones, 1993)? Consider what would happen if the research team had selected script-implicit inference items that were much easier—easy enough that both groups would perform well. In this case no difference would be found: does this mean that autistic children don't have a deficit? Now consider the scenario where the script-implicit inference items were all very difficult. Again, no difference would be found: does this mean that since both populations struggle in this area, it no longer constitutes a deficit? Note how, in this example, the answer to a scientific question about the nature of autism depends on the apparently unrelated issue of the difficulties of the items included in the measure.
Questioning the interpretability of scores
without a construct
How can empirical studies determine the degree to which a respondent has a relative deficit in a latent variable, beyond the raw score on the instrument (e.g., an assessment) used to measure it (Borsboom, 2005a, b, 2008)? Consider two respondents, one with a score of 80% and another with a score of 70%. How can “80%” be an exact measure of a child's inference ability—what does 80% even mean in this context? Similarly, how can the second respondent's “deficit” of 10% be interpreted? Since a respondent's score is based on the difficulty of the test, a respondent could have received a higher score with an easier test (Borsboom, 2005a, b, 2008; Hambleton & Jones, 1993). Furthermore, if a respondent's score is based on the range of item difficulties on the test, how can item “difficulty” be established? If difficulty is operationalized as the proportion of correct answers in a given sample (DeVellis, 2006), then difficulty is specific to that sample (Borsboom, 2005b; Hambleton & Jones, 1993). Finally, how can researchers be certain of someone's ability based only on a score, where the connection between the measurement result (the score) and the measured property (the “ability”) is not apparent in the way height would be when measured using a ruler (Borsboom, 2005a, 2008)?
Measurement for science and measurement
as science
Item response theory, a contemporary approach to measurement, makes it possible to locate both respondents and items on a common scale (Borsboom, 2008; Wilson, 2005). This scale can be established in a way that is not bound to a specific sample (sample independent), nor is it dependent on a specific set of items (test independent). In order to locate a respondent's ability on a given construct (e.g., inferential thinking), rather than on a given test, a measurement model (such as a latent-variable model) is required in order to quantify the construct, in addition to the theoretical model embodied in the construct map. Measurement models include testable assumptions, which allow the researcher to assess the quality of measurement obtained. Additionally, the construct map, as a theoretical model, contains hypotheses that can be empirically tested. In this way, the iterative refinement of theories through empirical research can be applied both to the substantive research area and to the clarification and measurement of its constructs.
The BEAR Assessment System (BAS; Wilson, 2005, 2009; Wilson & Carstensen, 2007) is a principled approach to assessment design that exemplifies measurement as science, where the measurement process is itself a scientific endeavor, rather than merely serving as a necessary step in a scientific investigation. Using BAS, a measurer proceeds through four building blocks, doing work that is very similar to the application of the scientific method in the physical sciences, including building a theory based on extant literature, making hypotheses based on that theory, empirically testing these hypotheses, and refining the hypotheses and the theory as needed based on the results of analysis. The first building block consists of the development of a clear theory and formulation of the variable to be measured. This theory is summarized in a visual representation called a Construct Map, which guides subsequent assessment development. The second and third building blocks, Items Design and Outcome Space, operationalize this theory: Items Design consists of designing stimuli, tasks, or questions to elicit responses on the variable of interest, and the Outcome Space is a set of rules for placing responses to items into levels on the Construct Map. Both of these steps involve the formulation of hypotheses about the variable: items represent hypotheses about contexts that might elicit relevant responses, and the Outcome Space represents local hypotheses about the response processes involved in responding to specific items in particular ways. At this point, data can be collected by administering the assessment to respondents, collecting their responses, and scoring the responses using the outcome space. The fourth building block consists of fitting an appropriate measurement model to the scored responses and analyzing the results. In this stage, special emphasis is placed on the Wright Map, a visual representation of the results from the measurement model that is analogous to the Construct Map: the Wright Map is essentially the empirical version of the Construct Map, and comparing the two maps gives an initial assessment of which hypotheses were confirmed by the data and which were not.
We chose this assessment framework because it places central emphasis on developing the Construct Map, a theoretical representation of a unidimensional construct. This map serves as a metaphor and provides grounding for the other assessment development activities. It is the work that goes into situating the construct being
modeled, through an extensive literature review and consultation with the field, that holds together the items design, outcome space, and statistical model. Other principled assessment frameworks exist (for a review, see Ferrara, Lai, Reilly, & Nichols, 2016), but we selected BAS primarily because of our focus on theory building and iterative refinement. Evidence Centered Design (ECD; Mislevy, Almond, & Lukas, 2003; Riconscente, Mislevy, & Corrigan, 2015) would also have been an acceptable choice—components of ECD include the Student Model, Task Model, Evidence Model, and Measurement Model, which parallel the Construct Map, Items Design, Outcome Space, and Measurement Model in BAS, respectively—but ECD is focused more on the construction of evidentiary arguments for validity than on theory building. Additionally, ECD is more suited to larger assessment projects and includes additional components and processes for addressing the complex test assembly and delivery issues that arise in such contexts.
Towards a revised model of inferential reasoning
In order to investigate the nature of inferential reasoning, we needed to measure it. We used BAS, with multiple iterations of the four building blocks, to (a) formulate a theory of inferential reasoning, (b) develop an instrument based on our theory, (c) collect empirical evidence about our theory by administering this instrument, (d) use the empirical data to assess the quality of measurement obtained using our instrument and to determine areas of refinement, and (e) use the data both to test hypotheses about our theory and to refine the theory as needed.

To formulate our theory of inferential reasoning, we argue that Kintsch's (1988) framework allows for degrees of low integration and high integration of world knowledge along an ordinal continuum. This continuum, which we call Integrative Inferential Reasoning (IIR), is anchored by Pearson and Johnson's (1978) QARs, specifically text-implicit and script-implicit QARs.
Perfetti and Stafura (2015) have argued that taxonomies are less helpful if they do not incorporate some kind of hierarchical structure that holds the framework together. Kintsch's (1988) construction-integration framework provides such a structure, in which degrees of integration provide directionality (i.e., more or less integration). This hierarchy is therefore ordinal and compatible with the concept of a unidimensional variable.
We then integrate this continuum with the work of Chikalanga (1992) and of Graesser et al. (1994). Just as Chikalanga (1992) reported that motivational inferences could be text-implicit or script-implicit, depending on the information available in the text and the reader's knowledge base, so too can those from Graesser et al. (1994), depending on the information provided in the text-base. Chikalanga does not suggest that evaluative inferences can derive from either text or script, suggesting instead that they are all script-implicit. We argue that thematic, or rather evaluative, inferences can also be text-implicit. For example, imagine that a group of young children heard a narrative about kids kicking a three-legged dog and then getting in trouble for doing so. When, at the end of the story, the children were asked what the lesson was, they all said, “don't kick three-legged dogs.” Although
this would be a representation of evaluative thinking, it is also very literal, resembling the explicit nature of the narrative itself. Instead, the lesson could have been more altruistic, such as “it's wrong to bully animals,” which would represent a script-implicit QAR, since it includes information about cultural scripts, such as notions of right, wrong, and what is considered bullying.
As noted above, Pearson and Johnson's (1978) text-implicit and script-implicit QARs form the first two levels of IIR. Text-implicit QARs, which don't require explicit evidence of world knowledge being integrated, and which base their inference solely on establishing links between text-base propositions, are positioned as low integration. Script-implicit QARs, which require the integration of world knowledge, are positioned higher on this continuum. We extend this work by adding a third level even higher on the continuum (i.e., with more explicit integration). This new third level requires the combination of two sources of reasoning: a text-implicit QAR and a script-implicit QAR. Figure 1 demonstrates this in the context of a motivational inference question: “Why did Alex give his toy away?”
As stated earlier, the question does not constrain the level of the response; indeed, this question could be answered at all three proposed levels. One could say, “because the boy was sad and Alex wanted to be kind.” In this case, the first part of the answer represents a text-implicit causal-antecedent inference, since the text-base says the boy is sad, and this was the event prior to the question. The second part of the answer represents a script-implicit superordinate-goal inference, since the goal of the character being kind was not mentioned in the comic, and the notion of kindness is culturally specific (someone could easily see this act of giving the toy away as suspicious, based on their culture and prior experience), but something about this behavior resembles the script of wanting to be kind. The response as a whole represents a combination of script- and text-implicit inferential thinking, and is even more explicitly integrative between the text-base and the respondent's world knowledge.
Fig. 1 Sample narrative vignette

Although we have described the levels of IIR in terms of QARs (i.e., responses), IIR is a theory of reasoning, and therefore it must be possible to place the reasoners (i.e., respondents) on the IIR continuum. We view a respondent's location on IIR as a cognitive processing disposition, not as a level of achievement. That is, a respondent's location does not describe the maximum level of which they are capable, but rather what they do spontaneously and consistently when faced with different types
of items. This also positions IIR as situated by context: in certain contexts, one might respond at high levels of IIR, whereas in other contexts one might respond at lower levels; this is still dispositional, but the disposition may be affected by context. We formulate IIR in this way as a counterpoint to the tendency to study narrative comprehension in terms of achievement. Achievement models align with deficit-oriented perspectives, positioning poor performance as a deficit requiring remediation, and ultimately internalizing the fundamental error within the learner. A situated view, by contrast, externalizes the challenge as environmental and context-driven: essentially, change the environment, not the person (e.g., change the reading modality to tap into the same construct).
This also positions comprehension as a moving target: depending on context, students using IIR to varying degrees may also apply different forms of meaning making and engage differently in QARs. By positioning IIR as a disposition, we hope to shed some light on dispositions in narrative comprehension overall, and on how they contribute to one's mode of experience when engaging with narratives. This may be particularly useful in a formative context. If one narrative is shown to elicit higher degrees of integration compared to other narratives, then the teacher can be informed about what materials are most helpful in tapping into their students' disposition and bringing out their best during individualized instruction.
IIR thus far has been discussed in terms of motivational and evaluative inferences. We propose another category of questions that tap into IIR: meta-reasoning questions, for example, “what made you think of that answer?” While this may not directly involve an inference, it is certainly part of the inferential reasoning process. We argue that the proposed IIR scale can also be applied to this type of question. In Fig. 2, we present a Construct Map (see Wilson, 2005) representing the combination of the theories of levels of comprehension discussed above (Chikalanga, 1992; Graesser et al., 1994; Pearson & Johnson, 1978; Warren et al., 1979).
Perfetti and Stafura (2015), whose hierarchical structure, as described above, is similar to both Pearson and Johnson's (1978) QARs and IIR, point out that notions of local and global coherence were not represented in their taxonomy. The nature of IIR, being anchored both in QARs and in a construction-integration framework, allows notions of local and global coherence (Graesser et al., 1994) to be incorporated as follows. Inferential thinking at the first level of IIR is characterized by low integration and by text-implicit QARs, which are essential in maintaining local coherence. Global coherence, on the other hand, requires the integration of world knowledge, and thus requires the type of inferential reasoning described by the second level of IIR.
Construct map for IIR
Our research questions center around the validation of IIR, hypothesized above as a unidimensional continuum of how explicitly the text-base is integrated with the learner's knowledge base, and also around the development of a method for measuring IIR in a way that may be useful in understanding inferencing. As seen in
Fig. 2, we posit four qualitatively distinct ordinal levels for IIR. Starting with the least integrative level of the Construct Map, at the bottom of the figure, a response at Level 0 makes no use of inferences. At Level 1, respondents demonstrate a text-implicit QAR, where an inference has been made but it comprises propositions from the text-base only. At Level 2, respondents demonstrate a script-implicit QAR, where the response comprises information derived from their world knowledge and implies that it was built upon propositions in the narrative. At the most explicitly integrative level, Level 3, respondents demonstrate a combination of script-implicit and text-implicit QARs, as two sources of reasoning for their
response, demonstrating an explicit integration of the text-base and the respondents' world knowledge.

Fig. 2 Construct map for IIR
Methods
Participants
We recruited 72 students in general education (ages 8–12, see Table 1; gender: 30 male, 34 female, 8 declined to state) from a San Francisco Bay Area K–6 elementary school.¹ Participants took the instrument described below on a computer at their school site, without time constraints.

Table 1 Number of participants by age

Age   N
8     6
9     11
10    21
11    20
12    14

¹ This study was approved by the U.C. Berkeley Committee for the Protection of Human Subjects, and students participated with the informed consent of their parents/guardians.
Materials and procedure
Visual narrative design
In comics, graphics are placed in individual panels, which can be thought of as attention units and are considered the basic unit of a visual narrative (Cohn, 2007). Depending on how they are drawn, panels can also demonstrate characters' points of view, different perspectives on events, and focus points (Pantaleo, 2013). The narrative structure takes these panels and orders them in such a way that a particular pace is set and, based on the sequenced panels, the reader derives meaning from both the graphics inside the panels and the events they engage in (Cohn, 2013). In order for the comic to be best understood, there needs to be some type of structure to the sequenced panels, known as a narrative grammar (Cohn, 2013; Cohn, Paczynski, Jackendoff, Holcomb, & Kuperberg, 2012). A narrative grammar (which illuminates a narrative structure) includes five primary categories: establisher, initial, prolongation, peak, and release. Establishers are the first impressions of the scene: they set up the characters and lay the foundations for new information; initials set the action or event in motion, and typically display the action that leads to a peak (e.g., someone getting ready to run, but not
actually running yet); a prolongation can function as a pause, as a demonstration of the trajectory of an event, as a tool for suspense, as a cliffhanger, or as anything else that delays the peak; peaks represent the culmination of everything built up in the graphic narrative and may include a change of state, an action carried out, or the interruption of an event; the release is the result of the peak and serves to release the tension of the peak, as a wrap-up, or as a conflict resolution (Cohn, 2013).
Fig. 3 Sample comic with questions

Three visual narratives, such as the one shown in Fig. 3, were developed in comic-strip format by the first author and a contracted artist. Each narrative consisted of three panels (antecedent, behavior, and the consequence/result of that behavior). These narratives also align with Cohn's (2013) narrative grammar theory described above: the structure contains an expository panel, followed by a peak, followed by a resolution. By using this narrative arc, we can maintain visual-narrative referential cohesion, at least at the macro-proposition level, similar to the referential cohesion discussed by Kintsch and van Dijk (1978) within the text-base. The comics were each
followed by four items: (a) a motivational inference item (why the character engaged in an intentional action), (b) a meta-reasoning item (what made the participant think of that answer), (c) an evaluative inference item (what the lesson of the story was), and (d) another meta-reasoning item. As mentioned above, regardless of the item type, students can respond to any of the items at any level of IIR (i.e., all levels exist within every item). Students responded to the entire instrument, consisting of all three narratives, along with demographic items (age and gender, but not name), on a Google Forms web document.

Before engaging with the three narratives, participants read a sample comic and answered literal questions about it, in order to familiarize themselves with the format. Participants then read each comic in turn and answered the associated questions before advancing to the next comic.
These comics were visual narratives in a multimodal sequential-image format. They are multimodal because, in addition to a sequence of images, there is print built into the narrative as well, working closely with the images and, in essence, informing them and being informed by them. The modality was also chosen to support the notion that these socio-cognitive literacy theories lend themselves to narratives in any format. In fact, Magliano, Larson, Higgs, and Loschky (2016) investigated bridging inferences, which establish how two or more panels, or elements of a narrative, are semantically related. This is similar to text-implicit QARs and local coherence in visual narrative comprehension. They used a series of sequential images with a beginning-state, bridging-event, and end-state image. The researchers then removed the bridging-event image for participants and examined differences, finding longer viewing times for the end-state image when the middle panel was absent, supporting the claim that an online inference takes place. Just as in the case of reading a sentence, they argue for a shared-systems hypothesis in which both visuospatial and linguistic working memory systems support inferential thinking in visual narratives. This would also support the notion of inferential thinking transcending modality (Kendeou, 2015).
Measurement models
The responses to the items were scored ordinally, with point values corresponding to the level numbers in Fig. 2. Three points were assigned to responses that represented a combination of script-implicit and text-implicit QARs; two points were assigned to responses that represented a script-implicit QAR; one point was assigned to responses that demonstrated a text-implicit QAR; and zero points were assigned to responses that demonstrated no inferential reasoning. We fit three different measurement models to the scored data using ConQuest, version 3.0 (Adams, Wu, & Wilson, 2012).
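To make this outcome space concrete, the following is a minimal Python sketch of the scoring scheme just described; the category labels are our own illustrative shorthand for the Construct Map levels in Fig. 2, not identifiers taken from the actual rubric.

```python
# Minimal sketch of the ordinal outcome space (levels follow Fig. 2).
# Category names are illustrative shorthand, not the rubric's actual wording.
IIR_SCORES = {
    "no_inference": 0,      # Level 0: no inferential reasoning demonstrated
    "text_implicit": 1,     # Level 1: text-implicit QAR
    "script_implicit": 2,   # Level 2: script-implicit QAR
    "combination": 3,       # Level 3: text- and script-implicit combined
}

def score_response(category: str) -> int:
    """Map a coded response category to its ordinal IIR score."""
    return IIR_SCORES[category]
```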
Measurement model 1: partial credit model
We used the Partial Credit Model (PCM; Masters, 1982, 2016) to
estimate both item difficulty and person location parameters. This
is the model that would be most appropriate for using our IIR
instrument, as well as for addressing most of our
research questions (we fit two other models to address specific research questions). As a Rasch-family model, the PCM estimates the item parameters (the step difficulties $\delta_{ij}$ of item $i$ for the levels $j$ within the item) and the person locations $\theta_p$, based on the model-implied likelihood of the observed response patterns for each item and person. Equation 1 shows the likelihood, under the PCM, of a respondent at location $\theta_p$ responding to item $i$ at level $m$ (where the item has $M_i$ steps, with step difficulties $\delta_{i1}, \delta_{i2}, \ldots, \delta_{iM_i}$ for steps $1, 2, \ldots, M_i$):

$$P\bigl(X_{pi} = m \mid \theta_p, \delta_{ij}\bigr) = \frac{\exp \sum_{j=0}^{m} (\theta_p - \delta_{ij})}{\sum_{k=0}^{M_i} \exp \sum_{j=0}^{k} (\theta_p - \delta_{ij})} \qquad (1)$$

where $\theta_p \sim N(0, \sigma_\theta^2)$ and $\theta_p - \delta_{i0} = 0$. For model identification, we constrain the
mean person location to be zero.

The PCM places the respondents and the items on a common scale, often called the logit scale because of the functional form of Eq. 1. Conceptually, each unit on this scale represents an equal “distance,” as inches do on a ruler: one logit represents a difference of one in the log of the odds of a (generic) respondent scoring at a higher level, on a (generic) item, versus scoring at the next level down.
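As a brief worked illustration of this definition (with round numbers, not estimates from our data): for two adjacent levels, Eq. 1 implies

$$\ln \frac{P(X_{pi} = m)}{P(X_{pi} = m - 1)} = \theta_p - \delta_{im},$$

so a respondent located one logit above a step difficulty ($\theta_p - \delta_{im} = 1$) has odds $e^{1} \approx 2.72$ of the higher of the two adjacent levels, a probability of about .73 when only those two levels are in play. This is the same arithmetic behind the 27%/73% example in the Item fit section below.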
A person's location on the logit scale is often interpreted as a “proficiency” or “ability,” a term from academic testing, since in that case it represents their likelihood of responding correctly to more items. In the case of IIR, the person location can be interpreted dispositionally: on any given item, compared to a person lower on the scale, a person higher on the scale would be more likely to respond at a higher level, as defined in the IIR Construct Map (Fig. 2), which represents their disposition to express more explicit integration.

For items (or for levels within an item), being higher on the scale is interpreted as the item being more “difficult” (again, a term from testing). In the IIR case, it means that the item is relatively less conducive to explicitly integrated responses, and so respondents will have a lower likelihood of producing them.
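As a computational illustration of Eq. 1, the following minimal Python sketch evaluates the PCM category probabilities for a single item; the step difficulties used are hypothetical, not estimates from our data.

```python
import numpy as np

def pcm_probabilities(theta, deltas):
    """Category probabilities for one item under the PCM (Eq. 1).

    theta  : person location on the logit scale (theta_p)
    deltas : step difficulties (delta_i1, ..., delta_iMi);
             the j = 0 term of the sum is fixed at zero by convention
    Returns P(X = 0), ..., P(X = Mi), which sum to one.
    """
    steps = np.concatenate(([0.0], theta - np.asarray(deltas, dtype=float)))
    numerators = np.exp(np.cumsum(steps))  # exp of partial sums of (theta - delta)
    return numerators / numerators.sum()

# Hypothetical step difficulties for a four-level (0-3) IIR item:
print(pcm_probabilities(0.5, [-1.8, 0.1, 1.6]).round(3))
```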
Measurement model 2: latent regression PCM
A latent regression PCM is a person-explanatory extension of the PCM (Wilson & De Boeck, 2004). We fit this model to address Research Question (d), using participant age as the regressor. In a latent regression, the person location $\theta_p$ in Eq. 1 is replaced by a regression model in which each person has $J$ covariates. For Measurement Model 2, the regression is given by Eq. 2 ($J = 1$, and $Z_{p1}$ is the person's age):

$$\theta_p = \beta_0 + \sum_{j=1}^{J} \beta_j Z_{pj} + \varepsilon_p \qquad (2)$$

The full likelihood is obtained by replacing $\theta_p$ in Eq. 1 with the right-hand side of Eq. 2; the item parameters are constrained to have the values from Model 1, and $\varepsilon_p \sim N(0, \sigma_\varepsilon^2)$. Here, $Z_{pj}$ is person $p$'s value on covariate $j$, $\beta_j$ is the regression weight for covariate $j$, $\beta_0$ is the intercept, and $\varepsilon_p$ is the residual of the person's location after controlling for the effects of all $J$ covariates. The regression coefficient $\beta_j$ can be interpreted as the effect of (a one-unit change in) covariate $j$ on the construct (i.e.,
on the person's likelihood of responding at higher levels). In the present study, $\beta_1$ is interpreted as the estimated effect of a 1-year difference in age on the participant's IIR.
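To show how Eq. 2 feeds into Eq. 1, here is a small extension of the sketch above. Apart from the age coefficient of 0.282 reported later in the Results, every value (intercept, residual, step difficulties) is hypothetical.

```python
# Illustration of Eq. 2 feeding into Eq. 1 (reuses pcm_probabilities above).
# Only beta1 = 0.282 comes from our results; the other values are hypothetical.
beta0, beta1 = -2.5, 0.282          # intercept and age effect (logits per year)
age, residual = 10, 0.3             # one hypothetical respondent
theta_p = beta0 + beta1 * age + residual
print(pcm_probabilities(theta_p, [-1.8, 0.1, 1.6]).round(3))
```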
Measurement model 3: multi‑dimensional PCM
We fit a multidimensional PCM (Adams, Wilson, & Wang, 1997; Briggs & Wilson, 2003) in order to address part of Research Question (a), as well as to provide validity evidence for the structure of IIR. Since our items design is structured with the same item types (motivational inference, evaluative inference, and meta-reasoning) repeated for each narrative, we fit a between-item multidimensional PCM, in which the item types were coded as separate dimensions, as seen in Fig. 4. (As is common in Rasch modeling, the numeral “1” is omitted from all the factor loadings, as are the item-specific residuals, which are assumed to be uncorrelated.) This extension of the PCM estimates the latent correlations between the dimensions, allowing us to examine whether different item types measure IIR in the same way.

Fig. 4 Path diagram for multidimensional PCM
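The between-item structure can be summarized as a design (loading) matrix in which each item loads on exactly one dimension. The sketch below assumes one possible ordering of the 12 items (the four items per narrative, repeated across the three narratives), which may differ from the actual instrument.

```python
import numpy as np

# Between-item multidimensional design: each item loads on one dimension.
# The item ordering here is an assumption for illustration (4 items x 3 narratives).
DIMS = {"motivational": 0, "evaluative": 1, "meta-reasoning": 2}
item_types = ["motivational", "meta-reasoning", "evaluative", "meta-reasoning"] * 3

Q = np.zeros((len(item_types), len(DIMS)), dtype=int)
for i, item_type in enumerate(item_types):
    Q[i, DIMS[item_type]] = 1  # the loading itself is fixed at 1 (Rasch convention)
```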
Results
As noted above, the PCM is the primary model of interest for measurement using our IIR assessment. Therefore, we examined the performance of the instrument primarily with the results from Measurement Model 1.
Wright map (measurement model 1)
Figure 5 shows both the respondents' and the items' locations on the same scale, allowing their relative distributions to be examined. The left panel represents the distribution of respondents' estimated locations, with the “X” symbols proportional to the number of people located at each point along the scale. The right panel represents the locations of the levels within each item. For example, the
Threshold 2 rectangle above Q10 indicates the point on the logit
scale where a respondent would be equally likely to respond to this
item at Level 1 or below vs. at Level 2 or above (i.e., the second
Thurstonian threshold).
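Operationally, the k-th Thurstonian threshold of an item is the location on the logit scale at which the probability of responding at level k or above equals one half. A minimal sketch, reusing the hypothetical pcm_probabilities function from the Methods section:

```python
# k-th Thurstonian threshold: the theta where P(X >= k) = 0.5.
# Relies on pcm_probabilities (defined earlier); deltas are hypothetical.
def thurstonian_threshold(deltas, k, lo=-8.0, hi=8.0, tol=1e-6):
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pcm_probabilities(mid, deltas)[k:].sum() < 0.5:
            lo = mid  # P(X >= k) still below one half: threshold lies higher
        else:
            hi = mid
    return (lo + hi) / 2.0

print(thurstonian_threshold([-1.8, 0.1, 1.6], k=2))  # second threshold
```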
Note that, as a group, the first thresholds (where people would have equal probability of responding at Level 0 vs. at Level 1 or above) were generally below the second thresholds (where people would have equal probability of responding at Level 1 or below vs. at Level 2 or above), with only two exceptions. All the second thresholds, however, were below all of the third thresholds (where people would have equal probability of responding at Level 2 or below vs. at Level 3). This phenomenon, known as banding, is evidence in favor of the three levels being distinct, since it is the level, and not the item, which seems to have been the primary determinant of threshold locations. This supports the approach of mapping different levels of the construct within different items representing different types of inferential relations, in this case item types.
Fig. 5 Wright map (empirical representation of the construct map) for IIR
Mean location (measurement model 1)
Further evidence that the levels of IIR are distinct can be seen in Table 2, which examines the distribution of respondents, rather than of items, at each level. When levels are well ordered, items should display a mean location increase through successive levels, as this would mean that, within each item, respondents who respond at higher levels are estimated to have more of the construct (i.e., IIR). All 12 items displayed mean location increases across all levels, except Item 12. In this case, however, the level that is out of order also has only three respondents (and thus more data would be required to determine the true ordering of the levels in this item).

Table 2 Mean person locations within each level of each item

Item      Level                 Count   Mean θ (logits)
Item 1    0                     1       −2.94
          1: Text-Implicit      30      −0.32
          2: Script-Implicit    37      0.40
          3: Combination        4       0.54
Item 2    0                     6       −2.07
          1: Text-Implicit      36      −0.16
          2: Script-Implicit    28      0.75
          3: Combination        2       0.78
Item 3    0                     2       −2.30
          1: Text-Implicit      5       −1.34
          2: Script-Implicit    65      0.24
          3: Combination        0       NA
Item 4    0                     8       −2.02
          1: Text-Implicit      14      −0.70
          2: Script-Implicit    46      0.54
          3: Combination        4       1.38
Item 5    0                     3       −1.57
          1: Text-Implicit      36      −0.44
          2: Script-Implicit    25      0.67
          3: Combination        8       1.01
Item 6    0                     9       −1.56
          1: Text-Implicit      31      −0.34
          2: Script-Implicit    31      0.85
          3: Combination        1       2.40
Item 7    0                     5       −2.07
          1: Text-Implicit      31      −0.35
          2: Script-Implicit    34      0.64
          3: Combination        2       1.95
Item 8    0                     10      −1.89
          1: Text-Implicit      26      −0.22
          2: Script-Implicit    32      0.73
          3: Combination        4       1.39
Item 9    0                     2       −2.23
          1: Text-Implicit      21      −0.26
          2: Script-Implicit    43      0.16
          3: Combination        6       1.22
Item 10   0                     14      −1.21
          1: Text-Implicit      27      −0.33
          2: Script-Implicit    27      0.83
          3: Combination        4       1.98
Item 11   0                     6       −2.18
          1: Text-Implicit      2       −0.54
          2: Script-Implicit    63      0.27
          3: Combination        1       1.67
Item 12   0                     14      −1.60
          1: Text-Implicit      13      −0.20
          2: Script-Implicit    42      0.68
          3: Combination        3       0.34
Reliability (measurement model 1)
Coefficient α was .86. We also calculated person separation reliability, which is commonly used in Rasch measurement. This statistic ranges from 0 to 1 and is often
comparable to coefficient α. Person separation reliability represents the degree to which the instrument was able to separate people with varying proficiency levels. The PCM person separation reliability² was .85. These reliability estimates indicate that this instrument measures IIR in a highly consistent manner.
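For readers unfamiliar with this statistic, the following is a minimal sketch of one common way of computing person separation reliability, as the share of observed person-location variance that is not error variance; the inputs are hypothetical, and ConQuest's exact computation (WLE- vs. EAP-based) differs in detail.

```python
import numpy as np

def person_separation_reliability(theta_hat, se):
    """Person separation reliability: proportion of observed person-location
    variance that is not attributable to measurement-error variance."""
    theta_hat, se = np.asarray(theta_hat), np.asarray(se)
    observed_var = np.var(theta_hat, ddof=1)
    return (observed_var - np.mean(se ** 2)) / observed_var

# Hypothetical person estimates and standard errors:
print(person_separation_reliability([-1.2, -0.3, 0.4, 1.1],
                                    [0.4, 0.35, 0.35, 0.4]))
```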
Item fit (measurement model 1)
For a particular item, if that item's estimated “difficulty” parameters are treated as known, then the PCM predicts a specific distribution of response levels from the respondents (e.g., 27% of people located one logit below the item's first parameter would be expected to respond at Level 1, with 73% at Level 0; see Wilson, 2005). The degree to which the observed distribution of responses to the item being examined matches the model-implied distribution is called the weighted mean-square item fit (WMNSQ; Wu & Adams, 2013). The WMNSQ is scaled so that the mean WMNSQ over the whole instrument is one. A low WMNSQ implies that responses to the item were too predictable, separating respondents more sharply than the instrument as a whole. Of more concern, if an item has a large WMNSQ, this means that responses were less predictable than expected, suggesting that the item measures the construct badly, or perhaps measures something different.

In applied Rasch measurement, it is common to flag items with WMNSQ less than 0.75 or greater than 1.33. Items outside this range are examined for any issues that may be causing their responses to be either too predictable or too unpredictable, with the latter being of greater concern.
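A minimal sketch of the usual information-weighted (infit) mean square for one item follows; the inputs are hypothetical, and this is our illustration of the statistic rather than ConQuest's implementation.

```python
import numpy as np

def weighted_mean_square(observed, expected, variances):
    """Infit (weighted) mean square for one item: squared residuals summed
    over respondents, divided by the summed model-implied score variances.
    Values near 1 indicate fit; < 0.75 or > 1.33 are flagged here."""
    observed, expected = np.asarray(observed), np.asarray(expected)
    return np.sum((observed - expected) ** 2) / np.sum(np.asarray(variances))

# Hypothetical scored responses and model-implied moments for one item:
print(weighted_mean_square([3, 0, 3, 2], [1.8, 1.2, 2.6, 2.1],
                           [0.7, 0.6, 0.5, 0.7]))
```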
As seen in Table 3, item WMNSQ ranged from 0.75 to 1.34, with a single item (Item 9) slightly above the upper bound (i.e., “too unpredictable”). Since this item's WMNSQ was only barely above 1.33, and since no defects were found in an examination of the unscored responses, we elected to retain the item.
Table 3 Item fit statistics

Item   WMNSQ   Item   WMNSQ
1      1.24    7      0.88
2      1.02    8      0.75
3      1.13    9      1.34
4      0.82    10     0.87
5      1.00    11     1.01
6      0.89    12     1.03
² WLE person-separation reliability (0.85) applies to conclusions based on individual scores; EAP person-separation reliability (0.87) applies to population-level conclusions.
Latent regression PCM (measurement model 2)
We now examine results from the latent regression PCM, in which respondents' IIR was regressed on age in order to address Research Question (d). Table 4 shows the regression coefficient and its associated standard error. Respondents' IIR appears to increase with age (in this cross-sectional sample): a difference in age of 1 year was estimated to correspond to a mean IIR increase of 0.282 logits, which was significant at the 5% level.

Table 4 Latent regression

Regressor   Coefficient   SE
Age         0.282*        0.122

*p < .05
Multi‑dimensional PCM (measurement model 3)
We fit a multidimensional PCM in which each item type (motivational, evaluative, and meta-reasoning) was modeled as a separate dimension, in order to estimate inter-dimensional (latent) correlations, which are presented in Table 5. Motivational and evaluative inferences had a latent correlation of .767, motivational and meta-reasoning had a latent correlation of .760, and evaluative and meta-reasoning had a latent correlation of .834.

Table 5 Inter-dimension latent correlations

Dimension            1      2
1: Motivational
2: Evaluative        .767
3: Meta-reasoning    .760   .834
Validity
We base our validity argument on the five strands of validity evidence listed in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014).
Evidence based on the instrument content
This instrument was designed based on Wilson's (2005) four building blocks. The first step was operationalizing the construct, based on research by Pearson and Johnson (1978), Warren et al. (1979), Chikalanga (1992), and Graesser et al. (1994). The next step was items design: each comic had a different theme but the same structure and format, with motivational, evaluative, and meta-reasoning items.
Inferential reasoning was required, since there was no logical or grammatical cue explicitly linking the question to the answer, but participants could use any aspect of the narrative and their background knowledge to generate a response. The third step, determining the outcome space (or scoring rubric), was based on the literature (e.g., Pearson & Johnson, 1978). The inferential categories were ranked on an ordinal scale because of the integrative nature of the inference categories: we hypothesized that more integrative responses are more sophisticated than less integrative ones, and thus that the associated inferences also fall on that scale.
Evidence based on response processes
At the end of the IIR instrument, participants completed an exit survey in which they were asked to indicate which questions, if any, they didn't understand. Some students found the meta-reasoning items unfamiliar, but no other items were found to be problematic. Participants were also asked to rate the difficulty of understanding the material using a five-point Likert-type item. On average, students found the material to be fairly easy: the mean was 3.9 on a scale from 1 (most difficult) to 5 (easiest).
Evidence based on internal structure
We checked the trajectory of respondents' estimated mean locations across the levels within each item (Measurement Model 1). All items showed an increase in mean location across all levels, with the exception of one level within one item. We examined the Wright Map (Measurement Model 1) and found a relative banding of the thresholds (by level). This suggests that the levels of IIR are distinct, since the placement of the thresholds was dominated by their levels, rather than by the overall locations of their items. In a multidimensional analysis (Measurement Model 3), we found latent correlations between the item types to be .76 and above, which is consistent with the assumed unidimensional structure of IIR.
Evidence based on relations to other variables
Other studies have found inferential thinking to be positively related to age (e.g., Barnes et al., 1996; Hagá et al., 2014; Wagner & Rohwer, 1981). Therefore, we regressed IIR on age using a latent regression PCM (Measurement Model 2). Consistent with the existing literature, age was found to be a significant predictor of IIR at the 5% level: a 1-year increase in age corresponded to a mean estimated increase of 0.282 logits on the IIR scale.
Evidence based on consequences of using IIR
Since this instrument is not yet in wide-scale use, no data are currently available on the consequences of its use. However, as the developers of IIR, we feel it is incumbent on us to stress that the IIR instrument does not attempt to measure ability in an achievement sense. Rather, it is designed to measure a cognitive processing disposition. It would therefore be inappropriate to interpret IIR results in achievement contexts such as course selection, assignment of grades, or similar decisions affecting the educational trajectories of students.
Discussion
Results from our analyses provided evidence that we were successful in developing an assessment to measure IIR, and they suggest answers to the four questions the instrument was designed to address. Addressing Research Questions (b) and (c), both the number of categories and their ordering are supported by the combined findings of banding in the Wright Map and mean-location increases within almost all items. Additionally, results from the latent regression analysis address Research Question (d): we find respondents’ age to be a significant predictor of IIR, consistent with other studies.
Implications for the study of comprehension
Below, we address how our results support a unification of several meaning-making frameworks, how that unification allows for an important extension, and where the results sit within the larger argument in the field.
Does IIR represent a unification of the five
theories?
Addressing Research Question (a), results from our multidimensional analysis (combined with the evidence of reliability above) revealed that the item types (motivational, evaluative, and meta-reasoning), if considered as separate dimensions, had latent correlations of .76 and above. While the item types may not be measuring IIR in exactly the same way, these results are nevertheless consistent with a unified IIR construct.
Why the combination level is important
IIR extends Pearson and Johnson’s (1978) taxonomy by adding a third “combination” level representing even more explicit integration than either of their QARs. Unlike their text-implicit and script-implicit QARs (our first and second levels, respectively), our third level only makes sense in an integrative construct, and was only possible because we developed a Construct Map first. Indeed, the first two levels have appeared in other contexts (e.g., Basaraba et al., 2013). Additionally, Perfetti and Stafura (2015) argue that taxonomies should be held together by a hierarchical structure, and their taxonomy meets this requirement by being ordinal. IIR is likewise ordinal, but with a stronger sense of an underlying variable. We argue that our third level is not only unique to IIR but essential to IIR as a variable. A line can always be drawn through any two points; it takes a third point to test whether the points truly fall on a single line. In the same way, the first two levels (and the related QARs) have been used without hypothesizing a single underlying variable, but adding a third level requires a hypothesis about how the levels are connected (i.e., do they form a single variable, multiple variables, or merely nominal categories?). We find that the three levels (text-implicit, script-implicit, and combination) are distinct, but that together they form a single underlying variable; this gives evidence supporting the usefulness of IIR as a taxonomy that presents a new way of thinking about comprehension and inferencing.
Importantly, the combination category is the only one that requires the reader to coordinate thoughts that are based on information from two different processing skills (i.e., local and global). In IIR, Level 1 represents local details, Level 2 represents the inclusion of memories and world knowledge, and the coordination of both local and global processing is represented by Level 3. The three levels are ordinal in the sense that local processing supports global processing, and both local and global processing are prerequisite to their coordination.
Connections to related empirical studies
As noted in the introduction, Basaraba et al. (2013) used similar categories, and found preliminary results consistent with the present study. However, our results are more pronounced, and there are many potential reasons for this. First, we formulate IIR as a disposition, whereas Basaraba et al. (2013) used a measure of reading comprehension in an achievement context: these may, in fact, be different variables, with IIR being more stable across assessment contexts than the latter. Second, their approach was more exploratory (they were looking for their categories both as separate dimensions and as levels within a single dimension, and also checking for measurement invariance), whereas our approach was more confirmatory, starting with a clear theory of inferencing (as represented in our Construct Map) and seeking to confirm this theory empirically. This difference in approach also allowed us to use a simpler research design and required us to collect data using a purpose-built instrument; both of these may have contributed to the results being clearer. Third, in our analyses we estimate the parameters of interest directly, rather than using a two-step analysis (e.g., using a latent regression rather than regressing location estimates post-estimation): such direct estimation strategies tend to yield clearer results because they account for measurement error.
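A stylized simulation can illustrate this last point. In the sketch below (ours, under the simplifying assumption that point estimates shrink toward the mean by a factor equal to the measurement reliability, here set arbitrarily to 0.7), regressing shrunken location estimates on a covariate attenuates the slope, whereas regressing the true latent locations recovers it; a latent regression approximates the latter by modeling measurement error directly.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(8, 12, n)
beta = 0.282                                                  # value taken from Table 4
theta = beta * (age - age.mean()) + rng.normal(0.0, 1.0, n)   # true latent locations

reliability = 0.7                 # arbitrary illustrative value
theta_hat = reliability * theta   # stylized shrunken point estimates

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    x = x - x.mean()
    return float(x @ (y - y.mean())) / float(x @ x)

print("slope on true locations: ", round(ols_slope(age, theta), 3))      # ~0.282
print("slope on point estimates:", round(ols_slope(age, theta_hat), 3))  # ~0.197, attenuated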
LARRC and Muijselaar (2018) investigated the dimensionality of local and global inferences. Since their items align with text-explicit and text-implicit QARs, respectively, they correspond to Level 0 and Level 1 of IIR. (IIR Level 0 represents no inference: text-explicit QARs align with this level because no inference is necessary to answer with information explicitly provided in the text.) They found these dimensions to be indistinguishable, which is consistent with our finding that IIR behaves unidimensionally.
Educational implications
Teachers strive to help students connect their world knowledge with the text-base evoked by the text itself. This process is how deeper comprehension is achieved (Graesser et al., 1994) and how rich situation models are formed (Kintsch, 1988). As teachers question students, they hear answers representing different parts of the narrative and the students’ unique world-views. Taxonomies such as the one we constructed are powerful tools for teachers because they provide a roadmap for guiding and developing students’ inferential reasoning, for example by positioning students’ responses on the IIR continuum. IIR is an inferential reasoning taxonomy that allows an educator to examine how a student coordinates different sources of information at both the local and global levels, and also how the student coordinates different types of inferences across the levels of IIR. Furthermore, Kintsch (2012) describes the value for teachers of considering not only the psychometrics that describe levels, but also the dimensions relevant to education. These levels are granular enough to be distinguished, but large enough to capture a wide range of knowledge-based inferences.
Teachers may also benefit from the notion that student performance may reflect a disposition rather than only an achievement. When achievement is the dominating narrative, and educators think of learners in terms of what they can and cannot do, they will tend to adopt a deficit mindset, particularly in an era of high-stakes testing. But if educators view these cognitive abilities as dispositional, then the question is not about locating learners on an ability continuum, but rather about how to shape the context to facilitate the development of those dispositions. For example, should a learner not demonstrate a particular reading comprehension skill (e.g., an optimal level of IIR), this may be due not only to the learner’s ability, but also to the context. This is consistent with a universal design for learning perspective, in which one changes the context, not the learner.
Although respondents tended to demonstrate consistent levels of IIR across items, as supported by the banding of thresholds on the Wright Map (Fig. 5), it is still important to take a situated view when thinking about a learner’s best performance. For example, there is evidence for consistency of response patterns across items based on the banding of thresholds, yet the multidimensional analysis based on item types did not come out perfectly correlated. Instead, the latent correlations were moderately high, suggesting that the different item types might measure IIR in slightly different ways. This may also suggest that some students have more of an interest in the global thematic components of some stories, while other students are more interested in the causal relations, resulting in these students responding at higher levels of IIR on the item types that align with their interests. Even if we had not found banding, IIR levels would still be situated within items, giving a roadmap for interpreting student responses to a variety of inferential thinking questions.
Limitations and future research
Sampling limitations
We did not use a random sample in this study, but rather a geographically limited convenience sample: participants were recruited from schools in the San Francisco Bay Area, from grades 3–6 (ages 8–12), and were self-selected (everyone who volunteered and met the inclusion criteria was included). Of particular concern is the lack of identification of English language learners and students from low-income households. The nature of the sample, coupled with the relatively small sample size (n = 72), limits the generalizability of the findings, which should be treated as tentative pending replication in a larger and more diverse sample.
Assessment design limitations
In this study, we used vignettes, rather than fully developed
narratives—IIR may be manifested differently with longer texts.
Additionally, the assessment design may have induced local
dependence among items, both because of the common stimulus
material (four items attached to each of three stories) and also
because of the parallel nature of the questions (essentially the
same four questions are asked about each story). Future research
should examine the effect, if any, of this issue on our
findings.
IIR and autism
To further investigate how to promote reading comprehension, we need to find ways to foster more sophisticated integrative reasoning, particularly for populations that are known to have narrative comprehension challenges, such as those on the autism spectrum. Seeing IIR as a cognitive processing disposition sheds light on its relationship with individuals on the autism spectrum. The first levels of IIR are considered local, and would be attractive to those on the spectrum since they are known to have a local processing disposition (Frith, 2003; Frith & Happé, 1994; Happé & Booth, 2008; Happé & Frith, 2006; Van der Hallen, Evers, Brewaeys, Van den Noortgate, & Wagemans, 2015). This can impact how they make the inferences needed to gain a coherent understanding of narratives, leading to challenges and differences in comprehension and narrative generation (Capps, Losh, & Thurber, 2000; McIntyre et al., 2018; White et al., 2009). If this preference can be demonstrated in IIR, it would further validate IIR, especially in this population.
In addition, perhaps different modalities of narrative can promote IIR in a non-invasive manner, such as comics. Individuals on the spectrum are known to have a visual processing disposition (Gaffrey et al., 2007; Kamio & Toichi, 2000), and comics may be a good medium to promote IIR.
Integrative inferential reasoning and narratives
It would also be useful to investigate how IIR is situated relative to the architecture of narratives. Cohn (2013) describes the grammar of visual narratives, which includes a hierarchy of categories of panels as they relate to each other (i.e., establisher, initial, prolongation, peak, and release). These categories align with Kintsch and van Dijk’s (1978) macro-propositions, a sort of narrative grammar. Is IIR situated by the macro-propositions of the narrative? Are readers just as likely to engage in higher levels of IIR during the climax as during the resolution or the establisher? Knowing which facets of the narrative provide more affordances and opportunities for relatively higher levels of IIR would indicate where to focus when promoting IIR for a more coherent representation of the narrative.
Noordman, Vonk, Cozijn, and Frank (2015) argued that unfamiliar relations between clauses in a text, maintained by some causal conjunction (e.g., because), do not promote online inferences. Rather, the inference takes place later, when the respondent is asked to engage in some sort of verification phase. On the other hand, readers who were familiar with those relations made the causal inference during reading (i.e., online).
In a similar vein, familiarity with the modality used to present the narrative may also affect when inferences are made, and thus the level of IIR respondents demonstrate. Future studies should investigate the impact of modality, and of respondents’ familiarity with the modality used, on their location on the IIR scale.
Contributions to evolving theories of inferential
reasoning
Above all, we believe that we have provided an empirical demonstration of the value of a new twist in how we think about inferential reasoning, one that acknowledges the contributions of previous scholars, including Pearson and Johnson (1978), Warren et al. (1979), Graesser et al. (1994), Chikalanga (1992), and Kintsch (1998), but moves beyond their work to include the sort of meta-reasoning that promotes a more integrative view of inferential thinking. In a way, each inferential reasoning framework is like a flashlight that illuminates the phenomenon of meaning making in its own way, shedding light on different facets of this phenomenon. IIR brings those flashlights together and provides general illumination, giving an integrated view, from a different vantage point, on the meaning-making system. We hope our work represents an encouraging first step in that direction.
Acknowledgements We would like to thank Karen Draney, Mark
Wilson, and Pamela Wolfberg for their support throughout the
development of this project. We would also like to thank the
elementary school (name withheld for privacy reasons) for their
participation and support in bringing this project to fruition.
References
Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.
Adams, R. J., Wu, M. L., & Wilson, M. (2012). ACER ConQuest: Generalised item response modelling software (Version 3) [Computer software]. Camberwell: Australian Council for Educational Research.
Alonzo, J., Basaraba, D., Tindal, G., & Carriveau, R. S.
(2009). They read, but how well do they understand?: An empirical
look at the nuances of measuring reading comprehension. Assessment
for Effective Intervention: Official Journal of the Council for
Educational Diagnostic Services, 35(1), 34–44.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Baghaei, P., & Ravand, H. (2015). A cognitive processing
model of reading comprehension in English as a foreign language
using the linear logistic test model. Learning and Individual
Differences, 43, 100–105.
Barnes, M. A., Dennis, M., & Haefele-Kalvaitis, J. (1996).
The effects of knowledge availability and knowledge accessibility
on coherence and elaborative inferencing in children from 6 to 15
years of age. Journal of Experimental Child Psychology, 61(3),
216–241.
Basaraba, D., Yovanoff, P., Alonzo, J., & Tindal, G. (2013). Examining the structure of reading comprehension: Do literal, inferential, and evaluative comprehension truly exist? Reading and Writing, 26(3), 349–379.
Borsboom, D. (2005a). Latent variables. Measuring the mind:
Conceptual issues in contemporary psychometrics (pp. 49–84).
Cambridge: Cambridge University Press.
Borsboom, D. (2005b). True scores. Measuring the mind: Conceptual issues in contemporary psychometrics (pp. 11–47). Cambridge: Cambridge University Press.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research and Perspectives, 6(1–2), 25–53.
Briggs, D. C., & Wilson, M. (2003). An introduction to
multidimensional measurement using Rasch models. Journal of Applied
Measurement, 4(1), 87–100.
Briner, S. W., Virtue, S., & Kurby, C. A. (2012). Processing
causality in narrative events: Temporal order matters. Discourse
Processes, 49(1), 61–77.
Cain, K., Oakhill, J. V., Barnes, M. A., & Bryant, P. E. (2001). Comprehension skill, inference-making ability, and their relation to knowledge. Memory & Cognition, 29(6), 850–859.
Capps, L., Losh, M., & Thurber, C. (2000). The frog ate the
bug and made his mouth sad: Narrative competence in children with
autism. Journal of Abnormal Child Psychology, 28(2), 193–204.
Chikalanga, I. (1992). A suggested taxonomy of inferences for the reading teacher. Reading in a Foreign Language, 8, 697.
Cohn, N. (2007). A visual lexicon. Public Journal of Semiotics, 1(1), 35–56.
Cohn, N. (2013). Visual narrative structure. Cognitive Science, 37(3), 413–452.
Cohn, N., Paczynski, M., Jackendoff, R., Holcomb, P. J., & Kuperberg, G. R. (2012). (Pea)nuts and bolts of visual narrative: Structure and meaning in sequential image comprehension. Cognitive Psychology, 65(1), 1–38.
Cozijn, R., Commandeur, E., Vonk, W., & Noordman, L. G. M.
(2011). The time course of the use of implicit causality
information in the processing of pronouns: A visual world paradigm
study. Journal of Memory and Language, 64(4), 381–403.
DeVellis, R. F. (2006). Classical test theory. Medical Care, 44(11 Suppl 3), S50–S59.
Embretson, S. E., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11(2), 175–193.
Ferrara, S., Lai, E., Reilly, A., & Nichols, P. D. (2016). Principled approaches to assessment design, development, and implementation. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment (Vol. 4, pp. 41–74). Hoboken: Wiley.
Frith, U. (2003). Autism: Explaining the enigma (2nd ed.). Malden: Wiley-Blackwell.
Frith, U., & Happé, F. (1994). Autism: Beyond “theory of mind”. Cognition, 50(1), 115–132.
Gaffrey, M. S., Kleinhans, N. M., Haist, F., Akshoomoff, N.,
Campbell, A., Courchesne, E., et al. (2007). Atypical
[corrected] participation of visual cortex during word processing
in autism: an fMRI study of semantic decision. Neuropsychologia,
45(8), 1672–1684.
Gernsbacher, M. A., Robertson, R. R. W., Palladino, P., & Werner, N. K. (2004). Managing mental representations during narrative comprehension. Discourse Processes, 37(2), 145–164.
Graesser, A. C., Singer, M., & Trabasso, T. (1994). Constructing inferences during narrative text comprehension. Psychological Review, 101(3), 371–395.
Hagá, S., Garcia-Marques, L., & Olson, K. R. (2014). Too
young to correct: a developmental test of the three-stage model of
social inference. Journal of Personality and Social Psychology,
107(6), 994–1012.
Hambleton, R. K., & Jones, R. W. (1993). An NCME instructional module on: Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38–47.
Happé, F., & Booth, R. D. L. (2008). The power of the positive: Revisiting weak coherence in autism spectrum disorders. Quarterly Journal of Experimental Psychology, 61(1), 50–63.
Happé, F., & Frith, U. (2006). The weak coherence account: Detail-focused cognitive style in autism spectrum disorders. Journal of Autism and Developmental Disorders, 36(1), 5–25.
Kamio, Y., & Toichi, M. (2000). Dual access to semantics in
autism: is pictorial access superior to verbal access? Journal of
Child Psychology and Psychiatry and Allied Disciplines, 41(7),
859–867.
Kendeou, P. (2015). A general inference skill. In E. J. O’Brien,
A. E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during Reading
(pp. 160–181). Cambridge: Cambridge University Press.
Kintsch, W. (1988). The role of knowledge in discourse
comprehension: a construction-integration model. Psychological
Review, 95(2), 163–182.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge: Cambridge University Press.
Kintsch, W. (2012). Psychological models of reading comprehension and their implications for assessment. In J. Sabatini, E. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 21–38). Lanham: Rowman & Littlefield Education.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85(5), 363–394.
Language and Reading Research Consortium (LARRC), & Muijselaar, M. M. L. (2018). The dimensionality of inference making: Are local and global inferences distinguishable? Scientific Studies of Reading: The Official Journal of the Society for the Scientific Study of Reading, 22(2), 117–136.
Long, D. L., & Chong, J. L. (2001). Comprehension skill and global coherence: A paradoxical picture of poor comprehenders’ abilities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6), 1424–1429.
Magliano, J. P., Larson, A. M., Higgs, K., & Loschky, L. C.
(2016). The relative roles of visuospatial and linguistic working
memory systems in generating inferences during visual narrative
comprehension. Memory & Cognition, 44(2), 207–219.
Malle, B. F., & Holbrook, J. (2012). Is there a hierarchy of
social inferences? The likelihood and speed of inferring
intentionality, mind, and personality. Journal of Personality and
Social Psychology, 102(4), 661–684.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
Masters, G. N. (2016). Partial credit model. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 109–126). New York: Taylor and Francis.
McIntyre, N. S., Oswald, T. M., Solari, E. J., Zajic, M. C., Lerro, L. E., Hughes, C., et al. (2018). Social cognition and reading comprehension in children and adolescents with autism spectrum disorders or typical development. Research in Autism Spectrum Disorders, 54, 9–20.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A
brief introduction to evidence-centered design (ETS Research Report
Series RR-03-16). Princeton: Educational Testing Service.
Noordman, L. G. M., Vonk, W., Cozijn, R., & Frank, S.
(2015). Causal inferences and world knowledge. In E. J. O’Brien, A.
E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during Reading
(pp. 260–289). Cambridge: Cambridge University Press.
Noordman, L. G. M., Vonk, W., & Kempff, H. J. (1992). Causal inferences during the reading of expository texts. Journal of Memory and Language, 31(5), 573–590.
Nuske, H. J., & Bavin, E. L. (2011). Narrative comprehension in 4–7-year-old children with autism: Testing the weak central coherence account. International Journal of Language & Communication Disorders, 46(1), 108–119.
Pantaleo, S. (2013). Paneling “matters” in elementary students’
graphic narratives. Literacy Research and Instruction, 52(2),
150–171.
Pearson, P. D. (1982). Asking questions about stories (Writings in reading and language arts 15). Columbus: Ginn and Company.
Pearson, P. D., & Johnson, D. D. (1978). Questions. Teaching
reading comprehension (pp. 153–178). New York: Holt.
Perfetti, C. A., & Stafura, J. Z. (2015). Comprehending implicit meanings in text without making inferences. In E. J. O’Brien, A. E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during reading (pp. 1–18). Cambridge: Cambridge University Press.
Pitts, M. M., & Thompson, B. (1984). Cognitive styles as mediating variables in inferential comprehension. Reading Research Quarterly, 19(4), 426–435.
Ramachandran, R., Mitchell, P., & Ropar, D. (2009). Do
individuals with autism spectrum disorders infer traits from
behavior? Journal of Child Psychology and Psychiatry and Allied
Disciplines, 50(7), 871–878.
Raphael, T. E., & Au, K. H. (2005). QAR: Enhancing
comprehension and test taking across grades and content areas. The
Reading Teacher, 59(3), 206–221.
Riconscente, M. M., Mislevy, R. J., & Corrigan, S. (2015).
Evidence-centered design. In S. Lane, M. R. Raymond, & T. M.
Haladyna (Eds.), Handbook of test development (2nd ed., pp. 40–63).
New York: Routledge.
Santos, S., Cadime, I., Viana, F. L., Prieto, G., Chaves-Sousa, S., Spinillo, A. G., et al. (2016). An application of the Rasch model to reading comprehension measurement. Psicologia: Reflexão e Crítica, 29(1), 38.
Singer, M. (1980). The role of case-filling inferences in the coherence of brief passages. Discourse Processes, 3(3), 185–201.
Van der Hallen, R., Evers, K., Brewaeys, K., Van den Noortgate,
W., & Wagemans, J. (2015). Global processing takes time: A
meta-analysis on local-global visual processing in ASD.
Psychological Bulletin, 141(3), 549–573.
Van Overwalle, F., Van Duynslaeger, M., Coomans, D., &
Timmermans, B. (2012). Spontaneous goal inferences are often
inferred faster than spontaneous trait inferences. Journal of
Experimental Social Psychology, 48(1), 13–18.
Wagner, M., & Rohwer, W. D., Jr. (1981). Age differences in
the elaboration of inferences from text. Journal of Educational
Psychology, 73(5), 728–735.
Warren, W. H., Nicholas, D. W., & Trabasso, T. (1979). Event
chains and inferences in understanding narratives. In R. O. Freedle
(Ed.), New directions in discourse processing (pp. 23–52). Norwood:
Ablex Publishing Corporation.
White, S., Hill, E., Happé, F., & Frith, U. (2009).
Revisiting the strange stories: Revealing mentalizing impairments
in autism. Child Development, 80(4), 1097–1117.
Wilson, M. (2005). Constructing measures: An item response
modeling approach. Mahwah: Lawrence Erlbaum.
Wilson, M. (2009). Measuring progressions: Assessment structures
underlying a learning progression. Journal of Research in Science
Teaching, 46(6), 716–730.
Wilson, M., & Carstensen, C. (2007). Assessment to improve learning in mathematics: The BEAR Assessment System. In A. H. Schoenfeld (Ed.), Assessing mathematical proficiency (Mathematical Sciences Research Institute Publications) (pp. 311–332). Cambridge: Cambridge University Press.
Wilson, M., & De Boeck, P. (2004). Descriptive and
explanatory item response models. In P. De Boeck & M. Wilson
(Eds.), Explanatory item response models: A generalized linear and
nonlinear approach (pp. 43–74). New York: Springer.
Wu, M. L., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339–355.
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional
affiliations.