Reading and Writing
https://doi.org/10.1007/s11145-020-10026-4
Modeling question‑answer relations: the development
of the integrative inferential reasoning comic
assessment
Alexander Mario Blum¹,² · James M. Mason¹ · Jinho Kim¹,³ · P. David Pearson¹
© Springer Nature B.V. 2020
Abstract We constructed a new taxonomy for inferential thinking, a construct called Integrative Inferential Reasoning (IIR). IIR extends Pearson and Johnson's (1978) framework of text-implicit and script-implicit question-answer relations, and integrates several other prominent literacy theories to form a unified inferential reasoning construct that is positioned as a type of cognitive processing disposition. We validated our construct using a researcher-made IIR instrument, which was administered to 72 primary-grade students. Participants answered open-ended inference questions about various aspects of visual narratives presented in comic-strip format. We categorized participants' responses as exemplifying one of the levels of IIR: text-implicit, script-implicit, or a combination of both. We used item response models to validate the ordinal nature of IIR and its structure. Specifically, we fit Masters' (1982) partial credit model and obtained a Wright map, mean location data, and reliability estimates. Results confirmed that the IIR construct behaves ordinally. Additionally, age was found to be a reliable predictor of IIR, and item types (each modeled as a separate dimension) were found to have reasonable latent correlations between the dimensions. This research provides theoretical and practical insights for the pedagogy and assessment of narrative comprehension.
Keywords Cognitive processing disposition · Comics ·
Inferential thinking · Item response models · Narrative
comprehension · Scale validation
* Alexander Mario Blum
[email protected]

1 Graduate School of Education, University of California, Berkeley, Berkeley, USA
2 Department of Special Education, San Francisco State University, 1600 Holloway Avenue, Burk Hall 156, San Francisco, CA 94132, USA
3 IMEC Research Group at KU Leuven, KU Leuven and ITEC, Kortrijk, Belgium
Introduction
Different theories illuminate the meaning-making process of narrative comprehension, particularly when readers are faced with a question of some sort. This process has been discussed in terms of its question-answer relations (QARs; Pearson & Johnson, 1978), the type of inference probe (Warren, Nicholas, & Trabasso, 1979), the relationship between QAR and inference probe (Chikalanga, 1992), the type of inferential response and associated coherence (Graesser, Singer, & Trabasso, 1994), and the notion of integration that readers create between the text-base they construct from a text and the knowledge-base they bring to their reading (Kintsch, 1988, 1998; Kintsch & van Dijk, 1978). This gives rise to the following questions in literacy research: (a) can these theories be united to form a single construct; (b) how many levels are there to this construct; (c) is there a hierarchical relationship among the levels as they relate to narrative comprehension; and (d) does this construct develop in an orderly way across age levels? We describe the construction and validation of an instrument, through the use of appropriate measurement methodologies, to attempt to answer these questions.
A brief history of the research in inferential
reasoning
In 1978, Pearson and Johnson published a chapter (Pearson & Johnson, 1978) exploring the nature of questions and diving into their relationship with the text and the reader. Questions warrant our attention not only because of their widespread use, but also because of their central role in any classroom discussion activity and in measuring the construct of reading comprehension (Pearson, 1982; Pearson & Johnson, 1978). Questions on their own cannot determine the type of comprehension readers enact. Questions serve as a kind of “invitation” to engage at a particular level of cognitive challenge, but readers may decide to respond at a different level (Pearson & Johnson, 1978). Thus it is only after examining the responses that a particular student offers, in concert with the invitation offered by the question, that we can actually determine the true nature of the act of comprehension that occurred. It is the question-answer relationship that matters. To determine that relationship, we must consult the information provided in the text-base that a reader created from what the author provided, and the information provided from the reader's world knowledge. The Pearson-Johnson scheme of QARs includes three levels of reasoning:

Text-explicit (where both the question and the answer are in the text, and the QAR is rendered explicit by the logic of the grammatical relationship between the text and the question). For example:

1. John rode the horse to improve his riding skills. He saw a championship in his future.
2. Why did John ride the horse?
3. To improve his riding skills.
Text-implicit (both the question and answer are derivable from the text, but there is no logical or grammatical cue explicitly linking the question to the answer; the reader must establish the connection between question and answer to complete the QAR). Point 4 represents a text-implicit QAR (with regard to points 1 and 2) because the reader had to infer the logical link between the proximal goal of improving skills and the distal goal of a championship.

4. To become a champion.

Script-implicit (the question is motivated by the text but the answer comes from the world-knowledge script that the reader brings to the task). Often the logical link between the answer and the text is reasonable, even transparent, but the reader has to supply the link. Point 5 provides such an example because the reader has generalized from improving skills (or perhaps the championship phrase) to the “always do your best” script that guides human behavior.

5. Because he was always driven to do his best.

Thus, these categories cannot be classified as question types in their own right, considering that the question in point 2 could be answered using different sources of information, resulting in different QARs. The logic of QARs continues to this day in terms of instructional routines based on the work of Raphael and Au (2005).
Warren et al. (1979) argued for another set of inferences derived from a different set of relationships, specifically logical and evaluative relations. Logical relations are relationships between events that involve the causes, motivations, and conditions that allow events to happen, and they are explanatory in nature. These are the things that drive actions and events in stories. One particular type of logical relation they argue for is the motivational inference, which involves inferring why an agent engaged in an intentional action and tends to privilege why and how questions. Evaluative relations apply notions of significance, morality, themes, and lessons within the story, and are highly sensitive to, and dependent on, the readers' world knowledge.
Kintsch (1988) describes the notion of a situation model, in which the reader integrates his/her world knowledge with the text-base. The process begins with context: the reader is first situated by the discourse context (e.g., the motive for reading the story). Then the reader engages in the construction process, in which they construct a text-base. To accomplish this, the learner forms concepts directly corresponding to the linguistic input from the narrative. Some of these ideas are based in their world knowledge; others are based on the text-base, such as when the reader is hearing or reading the climax of the narrative (e.g., a macro-proposition). Then the reader goes through the integration process, in which they use their world knowledge to constrain attention to only the relevant ideas, as they pertain to the linguistic input of the narrative. This process continues as the reader comes across new information in the narrative, constantly updating their situation model.
In 1992, Chikalanga proposed his own theoretical levels of comprehension (Chikalanga, 1992), combining both Pearson and Johnson's (1978) and Warren et al.'s (1979) inferential thinking models. Chikalanga argued that, depending on the relationship between the information provided by the text-base and the reader's world knowledge, logical and informational inferences could be either text-based or script-based (see Chikalanga, 1992, p. 706, for an example).
Graesser et al. (1994) suggested a set of knowledge-based inferences that reflect the influence of Kintsch's (1988) Construction-Integration model of comprehension. Recall that Kintsch posited three models (more like levels) of comprehension that entail the text and the reader's knowledge base as primary resources available to assist in model building. The surface code consists of a cursory account of the words and sentence syntax, with very little processing by the reader. The text base requires the reader to make low-level inferential connections required for establishing cohesion among the propositions in the text (e.g., resolving anaphora and establishing explanatory links from one proposition to the next). Graesser et al. posit 13 classes of knowledge-based inferences in the context of narrative comprehension (see Graesser et al., 1994 for a full review). A subset of these inferences can theoretically be mapped onto motivational inferences (Warren et al., 1979), specifically causal antecedent, superordinate goal, thematic, character emotional reaction, causal consequence, and state inferences (Graesser et al., 1994). For example, when asking a motivational inference question, such as “why did he steal the car?”, one could answer based on the character's superordinate goals (“he wanted to get to the store”—the desire to go to the store motivated the action), character emotional reaction (“he was feeling angry”—his emotion was the motivation for his action), causal antecedent (“someone dared him to do it right before”—the event prior was the motivation for the action), state (“he believed he could succeed”—his belief was the motivation for the action), causal consequence (“his friends would celebrate him after”—the forecasted consequential outcome of the event in question was the motivation for his action), or even a thematic inference (“because desperate people do desperate things”—the moral, lesson, or principle was the motivation for the agent's action). Evaluative inferences could be answered based on thematic inferences (Graesser et al., 1994).
Basaraba, Yovanoff, Alonzo, and Tindal (2013) also investigated the notion of levels of comprehension. They mapped item types representing different levels of comprehension onto Pearson and Johnson's (1978) QARs: literal (text-explicit QAR), inferential (text-implicit QAR), and evaluative (script-implicit QAR). Essentially, they mapped items onto different ordinal levels. By placing these levels on an ordinal continuum, they investigated the location of mean item difficulty along a logit scale representing a unidimensional construct. They found that items representing literal comprehension were easier than items representing inferential comprehension, which in turn were easier than items representing evaluative comprehension, supporting the notion of QARs being placed on an ordinal scale. This consistent increase in mean item location provides preliminary evidence for the existence of these categories, although their analyses did not use explicit parameter linking between samples, and the evidence was not conclusive in all cases.
Arguing for a taxonomy of meaning making, Perfetti and Stafura
(2015) described categories of reasoning, akin to QARs, and placed
them into an ordinal structure. Their first level represented
explicit meaning (i.e., text-explicit QARs), the second level
represented implicit meanings bound to text language (i.e.,
text-implicit QARs), and the third level represented inferences not
bound by text language and thus reflecting world knowledge (i.e.,
script-implicit QARs).
Ways of investigating inferential thinking
Over the last several decades, there have been three recurring themes in investigating inferential thinking: in some studies, researchers have made inferences about participants' inferencing using their response times; some treated inferencing as something to be tested using a classical psychometric approach; and some treated inferencing as something to be measured using a modern psychometric approach—some studies used multiple approaches, since each illuminates a different facet of inferencing.
Chronometrics: response times
One theme involves the use of various measures of response time (e.g., explicit reading times, eye tracking, and matching; see Briner, Virtue, & Kurby, 2012; Cozijn, Commandeur, Vonk, & Noordman, 2011; Gernsbacher, Robertson, Palladino, & Werner, 2004; Long & Chong, 2001; Noordman, Vonk, & Kempff, 1992; Singer, 1980). A particularly interesting example of this theme is Ramachandran, Mitchell, and Ropar (2009), who investigated the response time (RT) of participants with and without autism engaging in trait inferences. Participants read trait-implying sentences and were then presented with a pair of words; their task was to choose the word that best related to the sentence they had just read. Malle and Holbrook (2012) investigated whether there is a hierarchical relationship among goal, belief, personality, and intentionality inferences, using both RT and the proportion of correct responses. Participants read different sentences on a screen that were designed to elicit a particular inference, followed by a probe word that represented a type of inference possibly being elicited. Participants then clicked ‘yes' or ‘no' to indicate whether the probe word aligned with the prior sentence. Van Overwalle, Van Duynslaeger, Coomans, and Timmermans (2012) explored the relationship between trait inferences and goal inferences, in particular whether one type of inference results in shorter RT. On a computer screen, participants were shown a series of sentences designed to elicit particular types of inference. After reading each passage, participants were told that a series of probe words would appear; for each, they would click ‘yes' or ‘no' to indicate whether the probe word represented the inference they had made.
Classical psychometrics: score‑based tests
Many approaches to investigating inferential thinking align with a Classical Test Theory approach, in which inferences are tested using a series of items with numeric scores
(e.g., 0/1 for incorrect/correct, 0–3 for a long-form response scored with a rubric, or 0–4 for a 5-point Likert scale). Participants' total (or average) scores are treated as an observed variable representing their inferential thinking, suitable for the application of common statistical models (Barnes, Dennis, & Haefele-Kalvaitis, 1996; Cain, Oakhill, Barnes, & Bryant, 2001; Hagá, Garcia-Marques, & Olson, 2014; Wagner & Rohwer, 1981). Additional exemplars include White, Hill, Happé, and Frith (2009), who utilized five different types of social stories to compare the comprehension performance of individuals with autism to a control group of neurotypically developing counterparts. Narrative comprehension outcomes were scored 0–3 for each item and used in a regression analysis. Nuske and Bavin (2011) sought to measure narrative comprehension skills in individuals with high-functioning autism relative to their typically developing peers. Tasks included literal comprehension items, propositional inference items, and script inference items. Items for main idea and details were scored 0–3, and inference items were scored 0–6. Differences in mean scores between groups were tested using ANCOVA. As mentioned above, Ramachandran et al. (2009) also used the proportion of correct responses in their analysis. Hagá et al. (2014) set out to investigate how learners incorporate added contextual information when generating an inference. They had five groups of participants (kindergartners, 2nd, 6th, and 9th graders, and undergraduates) who responded to Likert-type rating items. They used ANOVA to analyze differences in mean ratings.
Modern psychometrics: latent variable modeling
Some studies used approaches like Rasch models, item response theory, or structural equation models, in which participants' inferential thinking is measured by statistically modeling this type of cognition as a latent variable that contributes to their response patterns (Alonzo, Basaraba, Tindal, & Carriveau, 2009; Baghaei & Ravand, 2015; Embretson & Wetzel, 1987; Pitts & Thompson, 1984). As noted above, Basaraba et al. (2013) investigated levels of comprehension akin to Pearson and Johnson's (1978) QARs. Using Rasch modeling, they found evidence for distinct levels of comprehension based on mean item difficulty. Language and Reading Research Consortium (LARRC) and Muijselaar (2018) investigated the dimensionality of local and global reading comprehension tasks, with item types akin to Pearson and Johnson's (1978) text-explicit and text-implicit QARs, using structural equation modeling. Santos et al. (2016) used Rasch modeling to compare the mean difficulties among item types, where the items were mapped onto an ordinal series of reading comprehension levels (literal comprehension, inferential comprehension, critical comprehension, and reorganization).
Towards a contemporary approach to measuring
inferential thinking
The parable of the measurer
Imagine that a research team was constructing a measure of
inferential thinking. As is typically done (DeVellis, 2006), they
began with a series of items which they and
other experts believed to represent the skill set of inferential thinking. They piloted the measure many times and kept only highly correlated items (i.e., items that have a high proportion of common variance, rather than unique variance). They also kept a broad range of easy items (a high proportion of endorsement) and relatively more difficult items (a low proportion of endorsement). They decided to make some items worth more points than others based on which items they considered more “difficult.”

Specifically, they made two measures, one that measured propositional inferential thinking and another that measured script-implicit inferential thinking (Nuske & Bavin, 2011). They gave both measures to two populations, autistic and neurotypical. They then computed the mean sum-score within each population and performed t-tests for differences in means. They did not find a significant difference between the two groups on the propositional inferential thinking task, but they did find a significant difference on the script-implicit inference task. As a result, they concluded that autistic children have a deficit in script-implicit inferential thinking.
Is this true, particularly in the deterministic manner that this hypothetical finding would suggest (Hambleton & Jones, 1993)? Consider what would happen if the research team had selected script-implicit inference items that were much easier—easy enough that both groups would perform well. In this case no difference would be found: does this mean that autistic children don't have a deficit? Now consider the scenario where the script-implicit inference items were all very difficult. Again, no difference would be found: does this mean that since both populations struggle in this area, it no longer constitutes a deficit? Note how, in this example, the answer to a scientific question about the nature of autism depends on the apparently unrelated issue of the difficulties of the items included in the measure.
Questioning the interpretability of scores
without a construct
How can empirical studies determine the degree to which a respondent has a relative deficit in a latent variable, beyond the raw score on the instrument (e.g., an assessment) used to measure it (Borsboom, 2005a, b, 2008)? Consider two respondents, one with a score of 80% and another with a score of 70%. How can “80%” be an exact measure of a child's inference ability—what does 80% even mean in this context? Similarly, how can the second respondent's “deficit” of 10% be interpreted? Since a respondent's score is based on the difficulty of the test, a respondent could have received a higher score with an easier test (Borsboom, 2005a, b, 2008; Hambleton & Jones, 1993). Furthermore, if a respondent's score is based on the range of item difficulties on the test, how can item “difficulty” be established? If difficulty is operationalized as the proportion of correct answers in a given sample (DeVellis, 2006), then difficulty is specific to that sample (Borsboom, 2005b; Hambleton & Jones, 1993). Finally, how can researchers be certain of someone's ability based only on a score, where the connection between the measurement result (the score) and the measured property (the “ability”) is not apparent in the way height would be when measured using a ruler (Borsboom, 2005a, 2008)?
Measurement for science and measurement
as science
Item response theory, a contemporary approach to measurement, makes it possible to locate both respondents and items on a common scale (Borsboom, 2008; Wilson, 2005). This scale can be established in a way that is not bound to a specific sample (sample independent), nor is it dependent on a specific set of items (test independent). In order to locate a respondent's ability on a given construct (e.g., inferential thinking), rather than on a given test, a measurement model (such as a latent-variable model) is required in order to quantify the construct, in addition to the theoretical model embodied in the construct map. Measurement models include testable assumptions, which allow the researcher to assess the quality of measurement obtained. Additionally, the construct map, as a theoretical model, contains hypotheses that can be empirically tested. In this way, the iterative refinement of theories through empirical research can be applied both to the substantive research area and to the clarification and measurement of its constructs.
The BEAR Assessment System (BAS; Wilson, 2005, 2009; Wilson & Carstensen, 2007) is a principled approach to assessment design that exemplifies measurement as science, where the measurement process is itself a scientific endeavor, rather than merely serving as a necessary step in a scientific investigation. Using BAS, a measurer proceeds through four building blocks, doing work that is very similar to the application of the scientific method in the physical sciences, including building a theory based on extant literature, making hypotheses based on that theory, empirically testing these hypotheses, and refining the hypotheses and the theory as needed based on the results of analysis. The first building block consists of the development of a clear theory and formulation of the variable to be measured. This theory is summarized in a visual representation called a Construct Map, which guides subsequent assessment development. The second and third building blocks, Items Design and Outcome Space, operationalize this theory: Items Design consists of designing stimuli, tasks, or questions to elicit responses on the variable of interest, and the Outcome Space is a set of rules for placing responses to items into levels on the Construct Map. Both of these steps involve the formulation of hypotheses about the variable: items represent hypotheses about contexts that might elicit relevant responses, and the Outcome Space represents local hypotheses about the response processes involved in responding to specific items in particular ways. At this point, data can be collected by administering the assessment to respondents, collecting their responses, and scoring the responses using the outcome space. The fourth building block consists of fitting an appropriate measurement model to the scored responses and analyzing the results. In this stage, special emphasis is placed on the Wright Map, a visual representation of the results from the measurement model that is analogous to the Construct Map: the Wright Map is essentially the empirical version of the Construct Map, and comparing the two maps gives an initial assessment of which hypotheses were confirmed by the data and which were not.
We chose this assessment framework because it places central emphasis on developing the Construct Map, a theoretical representation of a unidimensional construct. This map serves as a metaphor and provides grounding for the other assessment development activities. It is the work that goes into situating the construct being
modeled, through an extensive literature review and consultation with the field, that holds together the items design, outcome space, and statistical model. Other principled assessment frameworks exist (for a review, see Ferrara, Lai, Reilly, & Nichols, 2016), but we selected BAS primarily because of our focus on theory building and iterative refinement. Evidence Centered Design (ECD; Mislevy, Almond, & Lukas, 2003; Riconscente, Mislevy, & Corrigan, 2015) would also have been an acceptable choice—components of ECD include the Student Model, Task Model, Evidence Model, and Measurement Model, which parallel the Construct Map, Items Design, Outcome Space, and Measurement Model in BAS, respectively—but ECD is focused more on the construction of evidentiary arguments for validity than on theory building. Additionally, ECD is more suited to larger assessment projects and includes additional components and processes for addressing the complex test assembly and delivery issues that arise in such contexts.
Towards a revised model of inferential reasoning
In order to investigate the nature of inferential reasoning, we needed to measure it. We used BAS, with multiple iterations of the four building blocks, to (a) formulate a theory of inferential reasoning, (b) develop an instrument based on our theory, (c) collect empirical evidence about our theory by administering this instrument, (d) use the empirical data to assess the quality of measurement obtained using our instrument and to determine areas of refinement, and (e) use the data both to test hypotheses about our theory and to refine the theory as needed.

To formulate our theory of inferential reasoning, we argue that Kintsch's (1988) framework allows for degrees of low integration and high integration of world knowledge along an ordinal continuum. This continuum, which we call Integrative Inferential Reasoning (IIR), is anchored by Pearson and Johnson's (1978) QARs, specifically text-implicit and script-implicit QARs.
Perfetti and Stafura (2015) have argued that taxonomies are less helpful if they do not incorporate some kind of hierarchical structure that holds the framework together. Kintsch's (1988) construction-integration framework provides such a structure, in which degrees of integration provide directionality (i.e., more or less integration). This hierarchy is therefore ordinal and compatible with the concept of a unidimensional variable.
We then integrate this continuum with the work of Chikalanga (1992) and of Graesser et al. (1994). Just as Chikalanga (1992) reported that motivational inferences could be text-implicit or script-implicit, depending on the information available in the text and the reader's knowledge base, so too can those from Graesser et al. (1994), depending on the information provided in the text-base. Chikalanga does not suggest that evaluative inferences can derive from either text or script, suggesting instead that they are all script-implicit. We argue that thematic, or rather evaluative, inferences can also be text-implicit. For example, imagine that a group of young children heard a narrative about kids kicking a three-legged dog and then getting in trouble for doing so. When, at the end of the story, the children were asked what the lesson was, they all said, “don't kick three-legged dogs.” Although
this would be a representation of evaluative thinking, it is also very literal, resembling the explicit nature of the narrative itself. Instead, the lesson could have been more altruistic, such as “it's wrong to bully animals,” which would represent a script-implicit QAR, since it includes information about cultural scripts, such as notions of right, wrong, and what is considered bullying.
As noted above, Pearson and Johnson's (1978) text-implicit and script-implicit QARs form the first two levels of IIR. Text-implicit QARs, which don't require explicit evidence of world knowledge being integrated, and which base their inference solely on establishing links between text-base propositions, are positioned as low integration. Script-implicit QARs, which require the integration of world knowledge, are positioned higher on this continuum. We extend this work by adding a third level even higher on the continuum (i.e., with more explicit integration). This new third level requires the combination of two sources of reasoning: a text-implicit QAR and a script-implicit QAR. Figure 1 demonstrates this in the context of a motivational inference question: “Why did Alex give his toy away?”
As stated earlier, the question does not constrain the level of the response; indeed, this question could be answered at all three proposed levels. One could say, “because the boy was sad and Alex wanted to be kind.” In this case, the first part of the answer represents a text-implicit causal-antecedent inference, since the text-base says the boy is sad, and this was the event prior to the question. The second part of the answer represents a script-implicit superordinate-goal inference, since the goal of the character being kind was not mentioned in the comic, and the notion of kindness is culturally specific (someone could easily see this act of giving the toy away as suspicious, based on their culture and prior experience), but something about this behavior resembles the script of wanting to be kind. The response as a whole represents a combination of script- and text-implicit inferential thinking, and is even more explicitly integrative between the text-base and the respondent's world knowledge.
Fig. 1 Sample narrative vignette

Although we have described the levels of IIR in terms of QARs (i.e., responses), IIR is a theory of reasoning, and therefore it must be possible to place the reasoners (i.e., respondents) on the IIR continuum. We view a respondent's location on IIR as a cognitive processing disposition, not as a level of achievement. That is, a respondent's location does not describe the maximum level of which they are capable, but rather what they do spontaneously and consistently when faced with different types
of items. This also positions IIR as situated by context: in certain contexts, one might respond at high levels of IIR, whereas in other contexts one might respond at lower levels; this is still dispositional, but the disposition may be affected by context. We formulate IIR in this way as a counterpoint to the tendency to study narrative comprehension in terms of achievement. Achievement models align with deficit-oriented perspectives, positioning poor performance as a deficit requiring remediation, and ultimately internalizing the fundamental error within the learner. A situated view, by contrast, externalizes the challenge as environmental and context-driven: essentially, change the environment, not the person (e.g., change the reading modality to tap into the same construct).
This also positions comprehension as a moving target: depending on context, students using IIR to varying degrees may also apply different forms of meaning making and engage differently in QARs. By positioning IIR as a disposition, we hope to shed some light on dispositions in narrative comprehension overall, and on how they contribute to one's mode of experience when engaging with narratives. This may be particularly useful in a formative context. If one narrative is shown to elicit higher degrees of integration compared to other narratives, then the teacher can be informed about what materials are most helpful in tapping into their students' disposition and bringing out their best during individualized instruction.
IIR thus far has been discussed in terms of motivational and evaluative inferences. We propose another category of questions that tap into IIR: meta-reasoning questions, for example, “what made you think of that answer?” While this may not directly involve an inference, it is certainly part of the inferential reasoning process. We argue that the proposed IIR scale can also be applied to this type of question. In Fig. 2, we present a Construct Map (see Wilson, 2005) representing the combination of the theories of levels of comprehension discussed above (Chikalanga, 1992; Graesser et al., 1994; Pearson & Johnson, 1978; Warren et al., 1979).
Perfetti and Stafura (2015), whose hierarchical structure, as described above, is similar to both Pearson and Johnson's (1978) QARs and IIR, point out that notions of local and global coherence were not represented in their taxonomy. The nature of IIR, being anchored both in QARs and in a construction-integration framework, allows notions of local and global coherence (Graesser et al., 1994) to be incorporated as follows. Inferential thinking at the first level of IIR is characterized by low integration and by text-implicit QARs, which are essential in maintaining local coherence. Global coherence, on the other hand, requires the integration of world knowledge, and thus requires the type of inferential reasoning described by the second level of IIR.
Construct map for IIR
Our research questions center around the validation of IIR, hypothesized above as a unidimensional continuum of how explicitly the text-base is integrated with the learner's knowledge base, and also around the development of a method for measuring IIR in a way that may be useful in understanding inferencing. As seen in
Fig. 2, we posit four qualitatively distinct ordinal levels for IIR. Starting with the least integrative level of the Construct Map, at the bottom of the figure, a response at Level 0 makes no use of inferences. At Level 1, respondents demonstrate a text-implicit QAR, where an inference has been made but it comprises propositions from the text-base only. At Level 2, respondents demonstrate a script-implicit QAR, where the response comprises information derived from their world knowledge and implies that it was built upon propositions in the narrative. At the most explicitly integrative level, Level 3, respondents demonstrate a combination of script-implicit and text-implicit QARs, as two sources of reasoning for their
response, demonstrating an explicit integration of the text-base and the respondents' world knowledge.

Fig. 2 Construct map for IIR
Methods
Participants
We recruited 72 students in general education (ages 8–12, see Table 1; gender: 30 male, 34 female, 8 declined to state) from a San Francisco Bay Area K–6 elementary school.¹ Participants took the instrument described below on a computer at their school site, without time constraints.

Table 1 Number of participants by age

Age   N
8     6
9     11
10    21
11    20
12    14

¹ This study was approved by the U.C. Berkeley Committee for the Protection of Human Subjects, and students participated with the informed consent of their parents/guardians.
Materials and procedure
Visual narrative design
In comics, graphics are placed in individual panels, which can be thought of as attention units and are considered the basic unit of a visual narrative (Cohn, 2007). Depending on how they are drawn, panels can also demonstrate characters' points of view, different perspectives on events, and focus points (Pantaleo, 2013). The narrative structure takes these panels and orders them in such a way that a particular pace is set and, based on the sequenced panels, the reader derives meaning from both the graphics inside the panels and the events they engage in (Cohn, 2013). In order for the comic to be best understood, there needs to be some type of structure to the sequenced panels, known as a narrative grammar (Cohn, 2013; Cohn, Paczynski, Jackendoff, Holcomb, & Kuperberg, 2012). A narrative grammar (which illuminates a narrative structure) includes five primary categories: establisher, initial, prolongation, peak, and release. Establishers are the first impressions of the scene: they set up the characters and lay the foundations for new information; initials set the action or event in motion, and typically display the action that leads to a peak (e.g., someone getting ready to run, but not
actually running yet); a prolongation can function as a pause, as a demonstration of the trajectory of an event, as a tool for suspense, as a cliffhanger, or as anything else that delays the peak; peaks represent the culmination of everything built up in the graphic narrative and may include a change of state, an action carried out, or the interruption of an event; the release is the result of the peak and serves to release the tension of the peak, as a wrap-up, or as a conflict resolution (Cohn, 2013).
Fig. 3 Sample comic with questions

Three visual narratives, such as the one shown in Fig. 3, were developed in comic-strip format by the first author and a contracted artist. Each narrative consisted of three panels (antecedent, behavior, and the consequence/result of that behavior). These narratives also align with Cohn's (2013) narrative grammar theory described above: the structure contains an expository panel, followed by a peak, followed by a resolution. By using this narrative arc, we can maintain visual-narrative referential cohesion, at least at the macro-proposition level, similar to the referential cohesion discussed by Kintsch and van Dijk (1978) within the text-base. The comics were each
followed by four items: (a) a motivational inference item (why the character engaged in an intentional action), (b) a meta-reasoning item (what made the participant think of that answer), (c) an evaluative inference item (what the lesson of the story was), and (d) another meta-reasoning item. As mentioned above, regardless of the item type, students can respond to any of the items at any level of IIR (i.e., all levels exist within every item). Students responded to the entire instrument, consisting of all three narratives, along with demographic items (age and gender, but not name), on a Google Forms web document.

Before engaging with the three narratives, participants read a sample comic and answered literal questions about it, in order to familiarize themselves with the format. Participants then read each comic in turn and answered the associated questions before advancing to the next comic.
These comics were visual narratives in a multimodal sequential-image format. They are multimodal because, in addition to a sequence of images, there is print built into the narrative as well, working closely with the images and, in essence, informing them and being informed by them. The modality was also chosen to support the notion that these socio-cognitive literacy theories lend themselves to narratives in any format. In fact, Magliano, Larson, Higgs, and Loschky (2016) investigated bridging inferences, which establish how two or more panels, or elements of a narrative, are semantically related. This is similar to text-implicit QARs and local coherence in visual narrative comprehension. They used a series of sequential images with a beginning-state, bridging-event, and end-state image. The researchers then removed the bridging-event image for participants and examined differences, finding longer viewing times for the end-state image when the middle panel was absent, supporting the claim that an online inference takes place. Just as in the case of reading a sentence, they argue for a shared-systems hypothesis in which both visuospatial and linguistic working memory systems support inferential thinking in visual narratives. This would also support the notion of inferential thinking transcending modality (Kendeou, 2015).
Measurement models
The responses to the items were scored ordinally, with point values corresponding to the level numbers in Fig. 2. Three points were assigned to responses that represented a combination of script-implicit and text-implicit QARs; two points were assigned to responses that represented a script-implicit QAR; one point was assigned to responses that demonstrated a text-implicit QAR; and zero points were assigned to responses that demonstrated no inferential reasoning. We fit three different measurement models to the scored data using ConQuest, version 3.0 (Adams, Wu, & Wilson, 2012).
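To make this outcome space concrete, the following is a minimal Python sketch of the scoring scheme just described; the category labels are our own illustrative shorthand for the Construct Map levels in Fig. 2, not identifiers taken from the actual rubric.

```python
# Minimal sketch of the ordinal outcome space (levels follow Fig. 2).
# Category names are illustrative shorthand, not the rubric's actual wording.
IIR_SCORES = {
    "no_inference": 0,      # Level 0: no inferential reasoning demonstrated
    "text_implicit": 1,     # Level 1: text-implicit QAR
    "script_implicit": 2,   # Level 2: script-implicit QAR
    "combination": 3,       # Level 3: text- and script-implicit combined
}

def score_response(category: str) -> int:
    """Map a coded response category to its ordinal IIR score."""
    return IIR_SCORES[category]
```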
Measurement model 1: partial credit model
We used the Partial Credit Model (PCM; Masters, 1982, 2016) to
estimate both item difficulty and person location parameters. This
is the model that would be most appropriate for using our IIR
instrument, as well as for addressing most of our
research questions (we fit two other models to address specific research questions). As a Rasch-family model, the PCM estimates the item parameters (the step difficulties $\delta_{ij}$ of item $i$ for the levels $j$ within the item) and the person locations $\theta_p$, based on the model-implied likelihood of the observed response patterns for each item and person. Equation 1 shows the likelihood, under the PCM, of a respondent at location $\theta_p$ responding to item $i$ at level $m$ (where the item has $M_i$ steps, with step difficulties $\delta_{i1}, \delta_{i2}, \ldots, \delta_{iM_i}$ for steps $1, 2, \ldots, M_i$):

$$P\bigl(X_{pi} = m \mid \theta_p, \delta_{ij}\bigr) = \frac{\exp \sum_{j=0}^{m} (\theta_p - \delta_{ij})}{\sum_{k=0}^{M_i} \exp \sum_{j=0}^{k} (\theta_p - \delta_{ij})} \qquad (1)$$

where $\theta_p \sim N(0, \sigma_\theta^2)$ and $\theta_p - \delta_{i0} = 0$. For model identification, we constrain the
mean person location to be zero.

The PCM places the respondents and the items on a common scale, often called the logit scale because of the functional form of Eq. 1. Conceptually, each unit on this scale represents an equal “distance,” as inches do on a ruler: one logit represents a difference of one in the log of the odds of a (generic) respondent scoring at a higher level, on a (generic) item, versus scoring at the next level down.
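As a brief worked illustration of this definition (with round numbers, not estimates from our data): for two adjacent levels, Eq. 1 implies

$$\ln \frac{P(X_{pi} = m)}{P(X_{pi} = m - 1)} = \theta_p - \delta_{im},$$

so a respondent located one logit above a step difficulty ($\theta_p - \delta_{im} = 1$) has odds $e^{1} \approx 2.72$ of the higher of the two adjacent levels, a probability of about .73 when only those two levels are in play. This is the same arithmetic behind the 27%/73% example in the Item fit section below.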
A person's location on the logit scale is often interpreted as a “proficiency” or “ability,” a term from academic testing, since in that case it represents their likelihood of responding correctly to more items. In the case of IIR, the person location can be interpreted dispositionally: on any given item, compared to a person lower on the scale, a person higher on the scale would be more likely to respond at a higher level, as defined in the IIR Construct Map (Fig. 2), which represents their disposition to express more explicit integration.

For items (or for levels within an item), being higher on the scale is interpreted as the item being more “difficult” (again, a term from testing). In the IIR case, it means that the item is relatively less conducive to explicitly integrated responses, and so respondents will have a lower likelihood of producing them.
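As a computational illustration of Eq. 1, the following minimal Python sketch evaluates the PCM category probabilities for a single item; the step difficulties used are hypothetical, not estimates from our data.

```python
import numpy as np

def pcm_probabilities(theta, deltas):
    """Category probabilities for one item under the PCM (Eq. 1).

    theta  : person location on the logit scale (theta_p)
    deltas : step difficulties (delta_i1, ..., delta_iMi);
             the j = 0 term of the sum is fixed at zero by convention
    Returns P(X = 0), ..., P(X = Mi), which sum to one.
    """
    steps = np.concatenate(([0.0], theta - np.asarray(deltas, dtype=float)))
    numerators = np.exp(np.cumsum(steps))  # exp of partial sums of (theta - delta)
    return numerators / numerators.sum()

# Hypothetical step difficulties for a four-level (0-3) IIR item:
print(pcm_probabilities(0.5, [-1.8, 0.1, 1.6]).round(3))
```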
Measurement model 2: latent regression PCM
A latent regression PCM is a person-explanatory extension of the PCM (Wilson & De Boeck, 2004). We fit this model to address Research Question (d), using participant age as the regressor. In a latent regression, the person location $\theta_p$ in Eq. 1 is replaced by a regression model in which each person has $J$ covariates. For Measurement Model 2, the regression is given by Eq. 2 ($J = 1$, and $Z_{p1}$ is the person's age):

$$\theta_p = \beta_0 + \sum_{j=1}^{J} \beta_j Z_{pj} + \varepsilon_p \qquad (2)$$

The full likelihood is obtained by replacing $\theta_p$ in Eq. 1 with the right-hand side of Eq. 2; the item parameters are constrained to have the values from Model 1, and $\varepsilon_p \sim N(0, \sigma_\varepsilon^2)$. Here, $Z_{pj}$ is person $p$'s value on covariate $j$, $\beta_j$ is the regression weight for covariate $j$, $\beta_0$ is the intercept, and $\varepsilon_p$ is the residual of the person's location after controlling for the effects of all $J$ covariates. The regression coefficient $\beta_j$ can be interpreted as the effect of (a one-unit change in) covariate $j$ on the construct (i.e.,
on the person's likelihood of responding at higher levels). In the present study, $\beta_1$ is interpreted as the estimated effect of a 1-year difference in age on the participant's IIR.
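To show how Eq. 2 feeds into Eq. 1, here is a small extension of the sketch above. Apart from the age coefficient of 0.282 reported later in the Results, every value (intercept, residual, step difficulties) is hypothetical.

```python
# Illustration of Eq. 2 feeding into Eq. 1 (reuses pcm_probabilities above).
# Only beta1 = 0.282 comes from our results; the other values are hypothetical.
beta0, beta1 = -2.5, 0.282          # intercept and age effect (logits per year)
age, residual = 10, 0.3             # one hypothetical respondent
theta_p = beta0 + beta1 * age + residual
print(pcm_probabilities(theta_p, [-1.8, 0.1, 1.6]).round(3))
```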
Measurement model 3: multi‑dimensional PCM
We fit a multidimensional PCM (Adams, Wilson, & Wang, 1997; Briggs & Wilson, 2003) in order to address part of Research Question (a), as well as to provide validity evidence for the structure of IIR. Since our items design is structured with the same item types (motivational inference, evaluative inference, and meta-reasoning) repeated for each narrative, we fit a between-item multidimensional PCM, in which the item types were coded as separate dimensions, as seen in Fig. 4. (As is common in Rasch modeling, the numeral “1” is omitted from all the factor loadings, as are the item-specific residuals, which are assumed to be uncorrelated.) This extension of the PCM estimates the latent correlations between the dimensions, allowing us to examine whether different item types measure IIR in the same way.

Fig. 4 Path diagram for multidimensional PCM
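The between-item structure can be summarized as a design (loading) matrix in which each item loads on exactly one dimension. The sketch below assumes one possible ordering of the 12 items (the four items per narrative, repeated across the three narratives), which may differ from the actual instrument.

```python
import numpy as np

# Between-item multidimensional design: each item loads on one dimension.
# The item ordering here is an assumption for illustration (4 items x 3 narratives).
DIMS = {"motivational": 0, "evaluative": 1, "meta-reasoning": 2}
item_types = ["motivational", "meta-reasoning", "evaluative", "meta-reasoning"] * 3

Q = np.zeros((len(item_types), len(DIMS)), dtype=int)
for i, item_type in enumerate(item_types):
    Q[i, DIMS[item_type]] = 1  # the loading itself is fixed at 1 (Rasch convention)
```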
Results
As noted above, the PCM is the primary model of interest for measurement using our IIR assessment. Therefore, we examined the performance of the instrument primarily with the results from Measurement Model 1.
Wright map (measurement model 1)
Figure 5 shows both the respondents' and the items' locations on the same scale, allowing their relative distributions to be examined. The left panel represents the distribution of respondents' estimated locations, with the “X” symbols proportional to the number of people located at each point along the scale. The right panel represents the locations of the levels within each item. For example, the
Threshold 2 rectangle above Q10 indicates the point on the logit
scale where a respondent would be equally likely to respond to this
item at Level 1 or below vs. at Level 2 or above (i.e., the second
Thurstonian threshold).
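Operationally, the k-th Thurstonian threshold of an item is the location on the logit scale at which the probability of responding at level k or above equals one half. A minimal sketch, reusing the hypothetical pcm_probabilities function from the Methods section:

```python
# k-th Thurstonian threshold: the theta where P(X >= k) = 0.5.
# Relies on pcm_probabilities (defined earlier); deltas are hypothetical.
def thurstonian_threshold(deltas, k, lo=-8.0, hi=8.0, tol=1e-6):
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pcm_probabilities(mid, deltas)[k:].sum() < 0.5:
            lo = mid  # P(X >= k) still below one half: threshold lies higher
        else:
            hi = mid
    return (lo + hi) / 2.0

print(thurstonian_threshold([-1.8, 0.1, 1.6], k=2))  # second threshold
```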
Note that, as a group, the first thresholds (where people would have equal probability of responding at Level 0 vs. at Level 1 or above) were generally below the second thresholds (where people would have equal probability of responding at Level 1 or below vs. at Level 2 or above), with only two exceptions. All the second thresholds, however, were below all of the third thresholds (where people would have equal probability of responding at Level 2 or below vs. at Level 3). This phenomenon, known as banding, is evidence in favor of the three levels being distinct, since it is the level, and not the item, which seems to have been the primary determinant of threshold locations. This supports the approach of mapping different levels of the construct within different items representing different types of inferential relations, in this case item types.
Fig. 5 Wright map (empirical representation of the construct map) for IIR
Mean location (measurement model 1)
Further evidence that the levels of IIR are distinct can be seen in Table 2, which examines the distribution of respondents, rather than of items, at each level. When levels are well ordered, items should display a mean location increase through successive levels, as this would mean that, within each item, respondents who respond at higher levels are estimated to have more of the construct (i.e., IIR). All 12 items displayed mean location increases across all levels, except Item 12. In this case, however, the level that is out of order also has only three respondents (and thus more data would be required to determine the true ordering of the levels in this item).

Table 2 Mean person locations within each level of each item

Item      Level                 Count   Mean θ (logits)
Item 1    0                     1       −2.94
          1: Text-Implicit      30      −0.32
          2: Script-Implicit    37      0.40
          3: Combination        4       0.54
Item 2    0                     6       −2.07
          1: Text-Implicit      36      −0.16
          2: Script-Implicit    28      0.75
          3: Combination        2       0.78
Item 3    0                     2       −2.30
          1: Text-Implicit      5       −1.34
          2: Script-Implicit    65      0.24
          3: Combination        0       NA
Item 4    0                     8       −2.02
          1: Text-Implicit      14      −0.70
          2: Script-Implicit    46      0.54
          3: Combination        4       1.38
Item 5    0                     3       −1.57
          1: Text-Implicit      36      −0.44
          2: Script-Implicit    25      0.67
          3: Combination        8       1.01
Item 6    0                     9       −1.56
          1: Text-Implicit      31      −0.34
          2: Script-Implicit    31      0.85
          3: Combination        1       2.40
Item 7    0                     5       −2.07
          1: Text-Implicit      31      −0.35
          2: Script-Implicit    34      0.64
          3: Combination        2       1.95
Item 8    0                     10      −1.89
          1: Text-Implicit      26      −0.22
          2: Script-Implicit    32      0.73
          3: Combination        4       1.39
Item 9    0                     2       −2.23
          1: Text-Implicit      21      −0.26
          2: Script-Implicit    43      0.16
          3: Combination        6       1.22
Item 10   0                     14      −1.21
          1: Text-Implicit      27      −0.33
          2: Script-Implicit    27      0.83
          3: Combination        4       1.98
Item 11   0                     6       −2.18
          1: Text-Implicit      2       −0.54
          2: Script-Implicit    63      0.27
          3: Combination        1       1.67
Item 12   0                     14      −1.60
          1: Text-Implicit      13      −0.20
          2: Script-Implicit    42      0.68
          3: Combination        3       0.34
Reliability (measurement model 1)
Coefficient α was .86. We also calculated person separation reliability, which is commonly used in Rasch measurement. This statistic ranges from 0 to 1 and is often
comparable to coefficient α. Person separation reliability represents the degree to which the instrument was able to separate people with varying proficiency levels. The PCM person separation reliability² was .85. These reliability estimates indicate that this instrument measures IIR in a highly consistent manner.
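For readers unfamiliar with this statistic, the following is a minimal sketch of one common way of computing person separation reliability, as the share of observed person-location variance that is not error variance; the inputs are hypothetical, and ConQuest's exact computation (WLE- vs. EAP-based) differs in detail.

```python
import numpy as np

def person_separation_reliability(theta_hat, se):
    """Person separation reliability: proportion of observed person-location
    variance that is not attributable to measurement-error variance."""
    theta_hat, se = np.asarray(theta_hat), np.asarray(se)
    observed_var = np.var(theta_hat, ddof=1)
    return (observed_var - np.mean(se ** 2)) / observed_var

# Hypothetical person estimates and standard errors:
print(person_separation_reliability([-1.2, -0.3, 0.4, 1.1],
                                    [0.4, 0.35, 0.35, 0.4]))
```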
Item fit (measurement model 1)
For a particular item, if that item's estimated “difficulty” parameters are treated as known, then the PCM predicts a specific distribution of response levels from the respondents (e.g., 27% of people located one logit below the item's first parameter would be expected to respond at Level 1, with 73% at Level 0; see Wilson, 2005). The degree to which the observed distribution of responses to the item being examined matches the model-implied distribution is called the weighted mean-square item fit (WMNSQ; Wu & Adams, 2013). The WMNSQ is scaled so that the mean WMNSQ over the whole instrument is one. A low WMNSQ implies that responses to the item were too predictable, separating respondents more sharply than the instrument as a whole. Of more concern, if an item has a large WMNSQ, this means that responses were less predictable than expected, suggesting that the item measures the construct badly, or perhaps measures something different.

In applied Rasch measurement, it is common to flag items with WMNSQ less than 0.75 or greater than 1.33. Items outside this range are examined for any issues that may be causing their responses to be either too predictable or too unpredictable, with the latter being of greater concern.
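A minimal sketch of the usual information-weighted (infit) mean square for one item follows; the inputs are hypothetical, and this is our illustration of the statistic rather than ConQuest's implementation.

```python
import numpy as np

def weighted_mean_square(observed, expected, variances):
    """Infit (weighted) mean square for one item: squared residuals summed
    over respondents, divided by the summed model-implied score variances.
    Values near 1 indicate fit; < 0.75 or > 1.33 are flagged here."""
    observed, expected = np.asarray(observed), np.asarray(expected)
    return np.sum((observed - expected) ** 2) / np.sum(np.asarray(variances))

# Hypothetical scored responses and model-implied moments for one item:
print(weighted_mean_square([3, 0, 3, 2], [1.8, 1.2, 2.6, 2.1],
                           [0.7, 0.6, 0.5, 0.7]))
```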
As seen in Table 3, item WMNSQ ranged from 0.75 to 1.34, with a single item (Item 9) slightly above the upper bound (i.e., “too unpredictable”). Since this item's WMNSQ was only barely above 1.33, and since no defects were found in an examination of the unscored responses, we elected to retain the item.
Table 3 Item fit statistics

Item   WMNSQ   Item   WMNSQ
1      1.24    7      0.88
2      1.02    8      0.75
3      1.13    9      1.34
4      0.82    10     0.87
5      1.00    11     1.01
6      0.89    12     1.03
² WLE person-separation reliability (0.85) applies to conclusions based on individual scores; EAP person-separation reliability (0.87) applies to population-level conclusions.
Latent regression PCM (measurement model 2)
We now examine results from the latent regression PCM, in which respondents' IIR was regressed on age in order to address Research Question (d). Table 4 shows the regression coefficient and its associated standard error. Respondents' IIR appears to increase with age (in this cross-sectional sample): a difference in age of 1 year was estimated to correspond to a mean IIR increase of 0.282 logits, which was significant at the 5% level.

Table 4 Latent regression

Regressor   Coefficient   SE
Age         0.282*        0.122

*p < .05
Multi‑dimensional PCM (measurement model 3)
We fit a multidimensional PCM in which each item type (motivational, evaluative, and meta-reasoning) was modeled as a separate dimension, in order to estimate inter-dimensional (latent) correlations, which are presented in Table 5. Motivational and evaluative inferences had a latent correlation of .767, motivational and meta-reasoning had a latent correlation of .760, and evaluative and meta-reasoning had a latent correlation of .834.

Table 5 Inter-dimension latent correlations

Dimension            1      2
1: Motivational
2: Evaluative        .767
3: Meta-reasoning    .760   .834
Validity
We base our validity argument on the five strands of validity evidence listed in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014).
Evidence based on the instrument content
This instrument was designed based on Wilson's (2005) four building blocks. The first step was operationalizing the construct, based on research by Pearson and Johnson (1978), Warren et al. (1979), Chikalanga (1992), and Graesser et al. (1994). The next step was items design: each comic had a different theme but the same structure and format, with motivational, evaluative, and meta-reasoning items.
Inferential reasoning was required, since there was no logical or grammatical cue explicitly linking the question to the answer, but participants could use any aspect of the narrative and their background knowledge to generate a response. The third step, determining the outcome space (or scoring rubric), was based on the literature (e.g., Pearson & Johnson, 1978). The inferential categories were ranked on an ordinal scale because of the integrative nature of the inference categories: we hypothesized that more integrative responses are more sophisticated than less integrative ones, and thus that the associated inferences also fall on that scale.
Evidence based on response processes
At the end of the IIR instrument, participants completed an exit survey in which they were asked to indicate which questions, if any, they didn't understand. Some students found the meta-reasoning items unfamiliar, but no other items were found to be problematic. Participants were also asked to rate the difficulty of understanding the material using a five-point Likert-type item. On average, students found the material to be fairly easy: the mean was 3.9 on a scale from 1 (most difficult) to 5 (easiest).
Evidence based on internal structure
We checked the trajectory of respondents' estimated mean locations across the levels within each item (Measurement Model 1). All items showed an increase in mean location across all levels, with the exception of one level within one item. We examined the Wright Map (Measurement Model 1) and found a relative banding of the thresholds (by level). This suggests that the levels of IIR are distinct, since the placement of the thresholds was dominated by their levels, rather than by the overall locations of their items. In a multidimensional analysis (Measurement Model 3), we found latent correlations between the item types to be .76 and above, which is consistent with the assumed unidimensional structure of IIR.
Evidence based on relations to other variables
Other studies have found inferential thinking to be positively related to age (e.g., Barnes et al., 1996; Hagá et al., 2014; Wagner & Rohwer, 1981). Therefore, we regressed IIR on age using a latent regression PCM (Measurement Model 2). Consistent with the existing literature, age was found to be a significant predictor of IIR at the 5% level: a 1-year increase in age corresponded to a mean estimated increase of 0.282 logits on the IIR scale.
Evidence based on consequences of using IIR
Since this instrument is not yet in wide-scale use, no data are currently available on the consequences of its use. However, as the developers of IIR, we feel it is incumbent on us to stress that the IIR instrument does not attempt to measure ability in an achievement sense. Rather, it is designed to measure a cognitive processing disposition. It would therefore be inappropriate to interpret IIR results in achievement contexts such as course selection, assignment of grades, or similar decisions affecting the educational trajectories of students.
Discussion
Results from our analyses provided evidence that we were successful in developing an assessment to measure IIR, and they suggest answers to the four questions the instrument was designed to address. Addressing Research Questions (b) and (c), both the number of categories and their ordering are supported by the combined findings of banding in the Wright Map and mean-location increases within almost all items. Additionally, results from the latent regression analysis address Research Question (d): we find respondents’ age to be a significant predictor of IIR, consistent with other studies.
Implications for the study of comprehension
Below, we address how our results support a unification of several meaning-making frameworks, how that unification allows for an important extension, and where the results sit within the larger argument in the field.
Does IIR represent a unification of the five
theories?
Addressing Research Question (a), results from our multidimensional analysis (combined with the evidence of reliability above) revealed that the item types (motivational, evaluative, and meta-reasoning), if considered as separate dimensions, had latent correlations of .76 and above. While the item types may not be measuring IIR in exactly the same way, these results are nevertheless consistent with a unified IIR construct.
Why the combination level is important
IIR extends Pearson and Johnson’s (1978) taxonomy by adding a third “combination” level representing even more explicit integration than either of their QARs. Unlike their text-implicit and script-implicit QARs (our first and second levels, respectively), our third level only makes sense in an integrative construct, and was only possible because we developed a Construct Map first. Indeed, the first two levels have appeared in other contexts (e.g., Basaraba et al., 2013). Additionally, Perfetti and Stafura (2015) argue that taxonomies should be held together by a hierarchical structure, and their taxonomy meets this requirement by being ordinal. IIR is likewise ordinal, but with a stronger sense of an underlying variable. We argue that our third level is not only unique to IIR but essential to IIR as a variable. A line can always be drawn through any two points; it takes a third point to test whether the points truly fall on a single line. In the same way, the first two levels (and the related QARs) have been used without hypothesizing a single underlying variable, but adding a third level requires a hypothesis about how the levels are connected (i.e., do they form a single variable, multiple variables, or merely nominal categories?). We find that the three levels (text-implicit, script-implicit, and combination) are distinct, but that together they form a single underlying variable; this gives evidence supporting the usefulness of IIR as a taxonomy that presents a new way of thinking about comprehension and inferencing.
Importantly, the combination category is the only one that requires the reader to coordinate thoughts that are based on information from two different processing skills (i.e., local and global). In IIR, Level 1 represents local details, Level 2 represents the inclusion of memories and world knowledge, and the coordination of both local and global processing is represented by Level 3. The three levels are ordinal in the sense that local processing supports global processing, and both local and global processing are prerequisite to their coordination.
Connections to related empirical studies
As noted in the introduction, Basaraba et al. (2013) used similar categories, and found preliminary results consistent with the present study. However, our results are more pronounced, and there are many potential reasons for this. First, we formulate IIR as a disposition, whereas Basaraba et al. (2013) used a measure of reading comprehension in an achievement context: these may, in fact, be different variables, with IIR being more stable across assessment contexts than the latter. Second, their approach was more exploratory (they were looking for their categories both as separate dimensions and as levels within a single dimension, and also checking for measurement invariance), whereas our approach was more confirmatory, starting with a clear theory of inferencing (as represented in our Construct Map) and seeking to confirm this theory empirically. This difference in approach also allowed us to use a simpler research design and required us to collect data using a purpose-built instrument; both of these may have contributed to the results being clearer. Third, in our analyses we estimate the parameters of interest directly, rather than using a two-step analysis (e.g., using a latent regression rather than regressing location estimates post-estimation): such direct estimation strategies tend to yield clearer results because they account for measurement error.
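A stylized simulation can illustrate this last point. In the sketch below (ours, under the simplifying assumption that point estimates shrink toward the mean by a factor equal to the measurement reliability, here set arbitrarily to 0.7), regressing shrunken location estimates on a covariate attenuates the slope, whereas regressing the true latent locations recovers it; a latent regression approximates the latter by modeling measurement error directly.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(8, 12, n)
beta = 0.282                                                  # value taken from Table 4
theta = beta * (age - age.mean()) + rng.normal(0.0, 1.0, n)   # true latent locations

reliability = 0.7                 # arbitrary illustrative value
theta_hat = reliability * theta   # stylized shrunken point estimates

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    x = x - x.mean()
    return float(x @ (y - y.mean())) / float(x @ x)

print("slope on true locations: ", round(ols_slope(age, theta), 3))      # ~0.282
print("slope on point estimates:", round(ols_slope(age, theta_hat), 3))  # ~0.197, attenuated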
LARRC and Muijselaar (2018) investigated the dimensionality of local and global inferences. Since their items align with text-explicit and text-implicit QARs, respectively, they correspond to Level 0 and Level 1 of IIR. (IIR Level 0 represents no inference: text-explicit QARs align with this level because no inference is necessary to answer with information explicitly provided in the text.) They found these dimensions to be indistinguishable, which is consistent with our finding that IIR behaves unidimensionally.
Educational implications
Teachers strive to help students connect their world knowledge with the text-base evoked by the text itself. This process is how deeper comprehension is achieved (Graesser et al., 1994) and how rich situation models are formed (Kintsch, 1988). As teachers question students, they hear answers representing different parts of the narrative and the students’ unique world-views. Taxonomies such as the one we constructed are powerful tools for teachers because they provide a roadmap for guiding and developing students’ inferential reasoning, for example by positioning students’ responses on the IIR continuum. IIR is an inferential reasoning taxonomy that allows an educator to examine how a student coordinates different sources of information at both the local and global levels, and also how the student coordinates different types of inferences across the levels of IIR. Furthermore, Kintsch (2012) describes the value for teachers of considering not only the psychometrics that describe levels, but also the dimensions relevant to education. These levels are granular enough to be distinguished, but large enough to capture a wide range of knowledge-based inferences.
Teachers may also benefit from the notion that student performance may reflect a disposition rather than only an achievement. When achievement is the dominating narrative, and educators think of learners in terms of what they can and cannot do, they will tend to adopt a deficit mindset, particularly in an era of high-stakes testing. But if educators view these cognitive abilities as dispositional, then the question is not about locating learners on an ability continuum, but rather about how to shape the context to facilitate the development of those dispositions. For example, should a learner not demonstrate a particular reading comprehension skill (e.g., an optimal level of IIR), this may be due not only to the learner’s ability, but also to the context. This is consistent with a universal design for learning perspective, in which one changes the context, not the learner.
Although respondents tended to demonstrate consistent levels of IIR across items, as supported by the banding of thresholds on the Wright Map (Fig. 5), it is still important to take a situated view when thinking about a learner’s best performance. For example, there is evidence for consistency of response patterns across items based on the banding of thresholds, yet the multidimensional analysis based on item types did not come out perfectly correlated. Instead, the latent correlations were moderately high, suggesting that the different item types might measure IIR in slightly different ways. This may also suggest that some students have more of an interest in the global thematic components of some stories, while other students are more interested in the causal relations, resulting in these students responding at higher levels of IIR on the item types that align with their interests. Even if we had not found banding, IIR levels would still be situated within items, giving a roadmap for interpreting student responses to a variety of inferential thinking questions.
Limitations and future research
Sampling limitations
We did not use a random sample in this study, but rather a geographically limited convenience sample: participants were recruited from schools in the San Francisco Bay Area, from grades 3–6 (ages 8–12), and were self-selected (everyone who volunteered and met the inclusion criteria was included). Of particular concern is the lack of identification of English language learners and students from low-income households. The nature of the sample, coupled with the relatively small sample size (n = 72), limits the generalizability of the findings, which should be treated as tentative pending replication in a larger and more diverse sample.
Assessment design limitations
In this study, we used vignettes, rather than fully developed
narratives—IIR may be manifested differently with longer texts.
Additionally, the assessment design may have induced local
dependence among items, both because of the common stimulus
material (four items attached to each of three stories) and also
because of the parallel nature of the questions (essentially the
same four questions are asked about each story). Future research
should examine the effect, if any, of this issue on our
findings.
IIR and autism
To further investigate how to promote reading comprehension, we need to find ways to foster more sophisticated integrative reasoning, particularly for populations that are known to have narrative comprehension challenges, such as those on the autism spectrum. Seeing IIR as a cognitive processing disposition sheds light on its relationship with individuals on the autism spectrum. The first levels of IIR are considered local, and would be attractive to those on the spectrum since they are known to have a local processing disposition (Frith, 2003; Frith & Happé, 1994; Happé & Booth, 2008; Happé & Frith, 2006; Van der Hallen, Evers, Brewaeys, Van den Noortgate, & Wagemans, 2015). This can impact how they make the inferences needed to gain a coherent understanding of narratives, leading to challenges and differences in comprehension and narrative generation (Capps, Losh, & Thurber, 2000; McIntyre et al., 2018; White et al., 2009). If this preference can be demonstrated in IIR, it would further validate IIR, especially in this population.
In addition, perhaps different modalities of narrative can promote IIR in a non-invasive manner, such as comics. Individuals on the spectrum are known to have a visual processing disposition (Gaffrey et al., 2007; Kamio & Toichi, 2000), and comics may be a good medium to promote IIR.
Integrative inferential reasoning and narratives
It would also be useful to investigate how IIR is situated relative to the architecture of narratives. Cohn (2013) describes the grammar of visual narratives, which includes a hierarchy of categories of panels as they relate to each other (i.e., establisher, initial, prolongation, peak, and release). These categories align with Kintsch and van Dijk’s (1978) macro-propositions, a sort of narrative grammar. Is IIR situated by the macro-propositions of the narrative? Are readers just as likely to engage in higher levels of IIR during the climax as during the resolution or the establisher? Knowing which facets of the narrative provide more affordances and opportunities for relatively higher levels of IIR would indicate where to focus when promoting IIR for a more coherent representation of the narrative.
Noordman, Vonk, Cozijn, and Frank (2015) argued that unfamiliar relations between clauses in a text, maintained by some causal conjunction (e.g., because), do not promote online inferences. Rather, the inference takes place later, when the respondent is asked to engage in some sort of verification phase. On the other hand, readers who were familiar with those relations made the causal inference during reading (i.e., online).
In a similar vein, familiarity with the modality used to present the narrative may also affect when inferences are made, and thus the level of IIR respondents demonstrate. Future studies should investigate the impact of modality, and of respondents’ familiarity with the modality used, on their location on the IIR scale.
Contributions to evolving theories of inferential
reasoning
Above all, we believe that we have provided an empirical demonstration of the value of a new twist in how we think about inferential reasoning, one that acknowledges the contributions of previous scholars, including Pearson and Johnson (1978), Warren et al. (1979), Graesser et al. (1994), Chikalanga (1992), and Kintsch (1998), but moves beyond their work to include the sort of meta-reasoning that promotes a more integrative view of inferential thinking. In a way, each inferential reasoning framework is like a flashlight that illuminates the phenomenon of meaning making in its own way, shedding light on different facets of this phenomenon. IIR brings those flashlights together and provides general illumination, giving an integrated view, from a different vantage point, on the meaning-making system. We hope our work represents an encouraging first step in that direction.
Acknowledgements We would like to thank Karen Draney, Mark
Wilson, and Pamela Wolfberg for their support throughout the
development of this project. We would also like to thank the
elementary school (name withheld for privacy reasons) for their
participation and support in bringing this project to fruition.
References
Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.
Adams, R. J., Wu, M. L., & Wilson, M. (2012). ACER ConQuest: Generalised item response modelling software (Version 3) [Computer software]. Camberwell: Australian Council for Educational Research.
Alonzo, J., Basaraba, D., Tindal, G., & Carriveau, R. S.
(2009). They read, but how well do they understand?: An empirical
look at the nuances of measuring reading comprehension. Assessment
for Effective Intervention: Official Journal of the Council for
Educational Diagnostic Services, 35(1), 34–44.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Baghaei, P., & Ravand, H. (2015). A cognitive processing
model of reading comprehension in English as a foreign language
using the linear logistic test model. Learning and Individual
Differences, 43, 100–105.
Barnes, M. A., Dennis, M., & Haefele-Kalvaitis, J. (1996).
The effects of knowledge availability and knowledge accessibility
on coherence and elaborative inferencing in children from 6 to 15
years of age. Journal of Experimental Child Psychology, 61(3),
216–241.
Basaraba, D., Yovanoff, P., Alonzo, J., & Tindal, G. (2013). Examining the structure of reading comprehension: Do literal, inferential, and evaluative comprehension truly exist? Reading and Writing, 26(3), 349–379.
Borsboom, D. (2005a). Latent variables. Measuring the mind:
Conceptual issues in contemporary psychometrics (pp. 49–84).
Cambridge: Cambridge University Press.
Borsboom, D. (2005b). True scores. Measuring the mind: Conceptual issues in contemporary psychometrics (pp. 11–47). Cambridge: Cambridge University Press.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research and Perspectives, 6(1–2), 25–53.
Briggs, D. C., & Wilson, M. (2003). An introduction to
multidimensional measurement using Rasch models. Journal of Applied
Measurement, 4(1), 87–100.
Briner, S. W., Virtue, S., & Kurby, C. A. (2012). Processing
causality in narrative events: Temporal order matters. Discourse
Processes, 49(1), 61–77.
Cain, K., Oakhill, J. V., Barnes, M. A., & Bryant, P. E. (2001). Comprehension skill, inference-making ability, and their relation to knowledge. Memory & Cognition, 29(6), 850–859.
Capps, L., Losh, M., & Thurber, C. (2000). The frog ate the
bug and made his mouth sad: Narrative competence in children with
autism. Journal of Abnormal Child Psychology, 28(2), 193–204.
Chikalanga, I. (1992). A suggested taxonomy of inferences for the reading teacher. Reading in a Foreign Language, 8, 697.
Cohn, N. (2007). A visual lexicon. Public Journal of Semiotics, 1(1), 35–56.
Cohn, N. (2013). Visual narrative structure. Cognitive Science, 37(3), 413–452.
Cohn, N., Paczynski, M., Jackendoff, R., Holcomb, P. J., & Kuperberg, G. R. (2012). (Pea)nuts and bolts of visual narrative: Structure and meaning in sequential image comprehension. Cognitive Psychology, 65(1), 1–38.
Cozijn, R., Commandeur, E., Vonk, W., & Noordman, L. G. M.
(2011). The time course of the use of implicit causality
information in the processing of pronouns: A visual world paradigm
study. Journal of Memory and Language, 64(4), 381–403.
DeVellis, R. F. (2006). Classical test theory. Medical Care, 44(11 Suppl 3), S50–S59.
Embretson, S. E., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11(2), 175–193.
Ferrara, S., Lai, E., Reilly, A., & Nichols, P. D. (2016). Principled approaches to assessment design, development, and implementation. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment (Vol. 4, pp. 41–74). Hoboken: Wiley.
Frith, U. (2003). Autism: Explaining the enigma (2nd ed.). Malden: Wiley-Blackwell.
Frith, U., & Happé, F. (1994). Autism: Beyond “theory of mind”. Cognition, 50(1), 115–132.
Gaffrey, M. S., Kleinhans, N. M., Haist, F., Akshoomoff, N.,
Campbell, A., Courchesne, E., et al. (2007). Atypical
[corrected] participation of visual cortex during word processing
in autism: an fMRI study of semantic decision. Neuropsychologia,
45(8), 1672–1684.
Gernsbacher, M. A., Robertson, R. R. W., Palladino, P., & Werner, N. K. (2004). Managing mental representations during narrative comprehension. Discourse Processes, 37(2), 145–164.
Graesser, A. C., Singer, M., & Trabasso, T. (1994). Constructing inferences during narrative text comprehension. Psychological Review, 101(3), 371–395.
Hagá, S., Garcia-Marques, L., & Olson, K. R. (2014). Too
young to correct: a developmental test of the three-stage model of
social inference. Journal of Personality and Social Psychology,
107(6), 994–1012.
Hambleton, R. K., & Jones, R. W. (1993). An NCME instructional module on: Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38–47.
Happé, F., & Booth, R. D. L. (2008). The power of the positive: Revisiting weak coherence in autism spectrum disorders. Quarterly Journal of Experimental Psychology, 61(1), 50–63.
Happé, F., & Frith, U. (2006). The weak coherence account: Detail-focused cognitive style in autism spectrum disorders. Journal of Autism and Developmental Disorders, 36(1), 5–25.
Kamio, Y., & Toichi, M. (2000). Dual access to semantics in
autism: is pictorial access superior to verbal access? Journal of
Child Psychology and Psychiatry and Allied Disciplines, 41(7),
859–867.
Kendeou, P. (2015). A general inference skill. In E. J. O’Brien,
A. E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during Reading
(pp. 160–181). Cambridge: Cambridge University Press.
Kintsch, W. (1988). The role of knowledge in discourse
comprehension: a construction-integration model. Psychological
Review, 95(2), 163–182.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge: Cambridge University Press.
Kintsch, W. (2012). Psychological models of reading comprehension and their implications for assessment. In J. Sabatini, E. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 21–38). Lanham: Rowman & Littlefield Education.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85(5), 363–394.
Language and Reading Research Consortium (LARRC), & Muijselaar, M. M. L. (2018). The dimensionality of inference making: Are local and global inferences distinguishable? Scientific Studies of Reading: The Official Journal of the Society for the Scientific Study of Reading, 22(2), 117–136.
Long, D. L., & Chong, J. L. (2001). Comprehension skill and global coherence: A paradoxical picture of poor comprehenders’ abilities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6), 1424–1429.
Magliano, J. P., Larson, A. M., Higgs, K., & Loschky, L. C.
(2016). The relative roles of visuospatial and linguistic working
memory systems in generating inferences during visual narrative
comprehension. Memory & Cognition, 44(2), 207–219.
Malle, B. F., & Holbrook, J. (2012). Is there a hierarchy of
social inferences? The likelihood and speed of inferring
intentionality, mind, and personality. Journal of Personality and
Social Psychology, 102(4), 661–684.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
Masters, G. N. (2016). Partial credit model. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 109–126). New York: Taylor and Francis.
McIntyre, N. S., Oswald, T. M., Solari, E. J., Zajic, M. C., Lerro, L. E., Hughes, C., et al. (2018). Social cognition and reading comprehension in children and adolescents with autism spectrum disorders or typical development. Research in Autism Spectrum Disorders, 54, 9–20.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A
brief introduction to evidence-centered design (ETS Research Report
Series RR-03-16). Princeton: Educational Testing Service.
Noordman, L. G. M., Vonk, W., Cozijn, R., & Frank, S.
(2015). Causal inferences and world knowledge. In E. J. O’Brien, A.
E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during Reading
(pp. 260–289). Cambridge: Cambridge University Press.
Noordman, L. G. M., Vonk, W., & Kempff, H. J. (1992). Causal inferences during the reading of expository texts. Journal of Memory and Language, 31(5), 573–590.
Nuske, H. J., & Bavin, E. L. (2011). Narrative comprehension in 4–7-year-old children with autism: Testing the weak central coherence account. International Journal of Language & Communication Disorders, 46(1), 108–119.
Pantaleo, S. (2013). Paneling “matters” in elementary students’
graphic narratives. Literacy Research and Instruction, 52(2),
150–171.
Pearson, P. D. (1982). Asking questions about stories (Writings in reading and language arts 15). Columbus: Ginn and Company.
Pearson, P. D., & Johnson, D. D. (1978). Questions. Teaching
reading comprehension (pp. 153–178). New York: Holt.
Perfetti, C. A., & Stafura, J. Z. (2015). Comprehending implicit meanings in text without making inferences. In E. J. O’Brien, A. E. Cook, & R. F. Lorch Jr. (Eds.), Inferences during reading (pp. 1–18). Cambridge: Cambridge University Press.
Pitts, M. M., & Thompson, B. (1984). Cognitive styles as mediating variables in inferential comprehension. Reading Research Quarterly, 19(4), 426–435.
Ramachandran, R., Mitchell, P., & Ropar, D. (2009). Do
individuals with autism spectrum disorders infer traits from
behavior? Journal of Child Psychology and Psychiatry and Allied
Disciplines, 50(7), 871–878.
Raphael, T. E., & Au, K. H. (2005). QAR: Enhancing
comprehension and test taking across grades and content areas. The
Reading Teacher, 59(3), 206–221.
Riconscente, M. M., Mislevy, R. J., & Corrigan, S. (2015).
Evidence-centered design. In S. Lane, M. R. Raymond, & T. M.
Haladyna (Eds.), Handbook of test development (2nd ed., pp. 40–63).
New York: Routledge.
Santos, S., Cadime, I., Viana, F. L., Prieto, G., Chaves-Sousa, S., Spinillo, A. G., et al. (2016). An application of the Rasch model to reading comprehension measurement. Psicologia: Reflexão e Crítica, 29(1), 38.
Singer, M. (1980). The role of case-filling inferences in the coherence of brief passages. Discourse Processes, 3(3), 185–201.
Van der Hallen, R., Evers, K., Brewaeys, K., Van den Noortgate,
W., & Wagemans, J. (2015). Global processing takes time: A
meta-analysis on local-global visual processing in ASD.
Psychological Bulletin, 141(3), 549–573.
Van Overwalle, F., Van Duynslaeger, M., Coomans, D., &
Timmermans, B. (2012). Spontaneous goal inferences are often
inferred faster than spontaneous trait inferences. Journal of
Experimental Social Psychology, 48(1), 13–18.
Wagner, M., & Rohwer, W. D., Jr. (1981). Age differences in
the elaboration of inferences from text. Journal of Educational
Psychology, 73(5), 728–735.
Warren, W. H., Nicholas, D. W., & Trabasso, T. (1979). Event
chains and inferences in understanding narratives. In R. O. Freedle
(Ed.), New directions in discourse processing (pp. 23–52). Norwood:
Ablex Publishing Corporation.
White, S., Hill, E., Happé, F., & Frith, U. (2009).
Revisiting the strange stories: Revealing mentalizing impairments
in autism. Child Development, 80(4), 1097–1117.
Wilson, M. (2005). Constructing measures: An item response
modeling approach. Mahwah: Lawrence Erlbaum.
Wilson, M. (2009). Measuring progressions: Assessment structures
underlying a learning progression. Journal of Research in Science
Teaching, 46(6), 716–730.
Wilson, M., & Carstensen, C. (2007). Assessment to improve learning in mathematics: The BEAR Assessment System. In A. H. Schoenfeld (Ed.), Assessing mathematical proficiency (Mathematical Sciences Research Institute Publications) (pp. 311–332). Cambridge: Cambridge University Press.
Wilson, M., & De Boeck, P. (2004). Descriptive and
explanatory item response models. In P. De Boeck & M. Wilson
(Eds.), Explanatory item response models: A generalized linear and
nonlinear approach (pp. 43–74). New York: Springer.
Wu, M. L., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339–355.
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional
affiliations.