The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments



Assessing Writing 17 (2012) 251–270


The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments

Emily Saxton a,*, Secret Belanger b, William Becker a

a Center for Science Education, Portland State University, 1136 SW Montgomery Street, Portland, OR 97201, United States
b Century High School, Hillsboro School District, 2000 SE Century Blvd., Hillsboro, OR 97123, United States

Article info

Article history: Received 28 November 2011; received in revised form 2 July 2012; accepted 8 July 2012

Keywords: Critical thinking; Higher-order cognitive skills; Performance-based assessment; Analytic rubric; Inter-rater reliability; Intra-rater reliability

Abstract

The purpose of this study was to investigate the intra-rater and inter-rater reliability of the Critical Thinking Analytic Rubric (CTAR). The CTAR is composed of 6 rubric categories: interpretation, analysis, evaluation, inference, explanation, and disposition. To investigate inter-rater reliability, two trained raters scored four sets of performance-based student work samples derived from a pilot study and subsequent larger study. The two raters also blindly scored a subset of student work samples a second time to investigate intra-rater reliability. Participants in this study were high school seniors enrolled in a college preparation course. Both raters showed acceptable levels of intra-rater reliability (α ≥ 0.70) in five of the six rubric categories. One rater showed poor consistency (α = 0.56) for the analysis category of the rubric, while the other rater showed excellent consistency (α = 0.91) for the same category, suggesting the need for further training of the former rater. The results of the inter-rater reliability investigation demonstrate acceptable levels of consistency (α ≥ 0.70) in all rubric categories. This investigation demonstrated that the CTAR can be used by raters to score student work samples in a consistent manner.

Published by Elsevier Ltd.

* Corresponding author. Tel.: +1 503 725 3190. E-mail addresses: [email protected] (E. Saxton), [email protected] (S. Belanger), [email protected] (W. Becker).

1075-2935/$ – see front matter. Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.asw.2012.07.002


1. Introduction

1.1. Critical thinking and college readiness

Critical thinking is an important goal of education at all levels. The importance of critical thinking in preparing students for college, the work force, and their role as responsible citizens has been well acknowledged (American Association of Colleges and University, 2005; Cline, Bissell, Hafner, & Katz, 2007; Facione, 1990). Special emphasis is placed on the cognitive skills associated with critical thinking in the preparation of students for college (Conley, 2007a, 2007b). Despite being a well acknowledged goal, little progress has been made in reaching college readiness outcomes in high school education (Blair, 2009). This lack of progress is a national issue in the United States. For example, Moore et al. (2010) report that only one third of high school graduates in the state of Texas were considered college ready in math and science. In a similar study of Oregon high school graduates, only 27% of students met ACT college readiness benchmarks in English, mathematics, reading, and science (American College Testing, 2010). This lack of progress is further evidenced in the high percentages of students requiring remedial courses in college and low college retention rates. In their analysis of the National Educational Longitudinal Study data set, Attewell, Lavin, Domina, and Levey (2006) attributed decreased college graduation rates to a lack of college readiness, specifically mentioning deficits in high school students' skill levels. The exact role of critical thinking in the college readiness problem, however, is largely unknown because current college readiness data are derived from multiple choice tests, like the ACT and SAT, which arguably do not measure the higher order skills associated with critical thinking.

The challenges associated with critical thinking education for high school students are multi-faceted. They include (1) lack of K-12 teacher preparation to instruct critical thinking, (2) absence of empirical evidence regarding what instructional methods promote critical thinking at the high school level, (3) existing curricula that fail to target higher-order thinking skills, and (4) lack of assessment instruments that validly and reliably measure critical thinking in high school students (Blair, 2009; Ennis, 2003). The lack of assessment instruments for measuring critical thinking acts as a barrier to addressing the other challenges listed above, particularly the investigation of effective instructional methodologies and curricula. The reliable and valid assessment of critical thinking is essential to determining progress at the student, instructional, and program level. Without such assessment instruments, progress towards the educational goal of critical thinking cannot be determined because measurement is essential in gauging failure and success (Halpern, 2003).

1.2. Defining the critical thinking construct

A well-defined construct is a fundamental requirement of reliable and valid measurement in education (Bond & Fox, 2007). Scholars debate the best definition of critical thinking in the literature. There is one definition, however, that has persisted over time and is associated with the rigor of multi-disciplinary expert consensus. In the late 1980s, the American Philosophical Association sponsored a multi-disciplinary panel of 46 experts from the fields of philosophy, education, social science, and physical science to define critical thinking (Facione, 1990). This collaboration resulted in a Delphi Report, which defined critical thinking as:

"[the] purposeful, self-regulatory judgment which results in interpretation, analysis, evaluation and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgment is based... The ideal critical thinker is habitually inquisitive, well-informed, trustful of reason, open-minded, flexible, fair minded in evaluation, honest in facing personal biases, prudent in making judgments, willing to reconsider, clear about issues, orderly in complex matters, diligent in seeking relevant information, reasonable in the selection of criteria, focused in inquiry, and persistent in seeking results which are as precise as the subject and the circumstances of inquiry permit." (Facione, 1990, p. 2).

Further reinforcing the Delphi definition, Jones et al. (1995) conducted a national survey which provided additional expert consensus around the definition (as cited in Quitadamo & Kurtz, 2007).


It is worth noting that the Delphi definition includes the cognitive skills of interpretation, analysis, evaluation, inference, and explanation, as well as the dispositional aspects relevant to critical thinking (Facione, 1990). The Delphi definition was adopted as the definition of the critical thinking construct that serves as the foundation of this research and its associated instrument development.

1.3. Performance-based assessment of critical thinking

Performance-based assessments represent a type of instrument that shows potential for the measurement of complex constructs such as critical thinking (Jackson, Draugalis, Slack, & Zachry, 2002; Lomask & Baron, 2003). Performance-based assessment is defined as a method that requires participants to produce an original artifact that shows evidence of knowledge or skills (Johnson, Penny, & Gordon, 2008). A comprehensive measurement of any of the critical thinking skills requires that the test taker generate their answer independently, in contrast to choosing the answer from a predetermined list (Ennis, 1993). Student generated answers are essential as evidence of thought process. Performance-based assessments, therefore, are well suited for the assessment of critical thinking because students are required to independently generate the evidence of these cognitive skills.

In addition to having exceptional suitability for assessing critical thinking, performance-based assessments are especially useful for assessing the success of teaching strategies and the fulfillment of programmatic goals. Performance-based assessments allow for the demonstration of what a student is able to accomplish with the skills imparted during instruction (Jackson et al., 2002). Thus these instruments hold potential for answering the questions surrounding critical thinking education as outlined above.

Performance-based assessment for critical thinking gains particular strength when the instrument contains a number of characteristics that are recognized in the literature as best practice in critical thinking assessment. First, the assessment of critical thinking should not target the correct answer or be constrained to the teacher's anticipated answer; the target of critical thinking assessment should be the thought process, and must therefore focus on the level of rational evaluation and explanation in the student's answer (Case, 2009; Moss & Koziol, 1991; Nosich, 2009). Second, critical thinking assessments should invoke topics that were not directly instructed in the classroom. Moss and Koziol (1991) advocate for the provision of sufficient material within the assessment instrument for students to interpret, analyze, evaluate, and cite within their originally constructed response. Third, assessment of critical thinking should measure cognitive skills and critical thinking dispositions (Ku, 2009; Mejia, 2009).

Ennis (2003), a critical thinking assessment developer himself, describes the availability of critical thinking instruments as falling short of current demand. The high demand for critical thinking assessment can be linked to the proliferation of critical thinking as a goal at all levels of education. In those instruments that are available, the need for instruments that reliably and validly measure critical thinking sub-skills, rather than critical thinking in general, remains unmet (Bernard et al., 2008; Ennis, 2003).

1.4. Analytic rubrics

The need for instruments that measure critical thinking sub-skills, rather than critical thinking in general, argues in favor of the use of analytic rubrics rather than holistic rubrics. Furthermore, the use of holistic rubrics to score critical thinking implies the expectation that students will likely perform similarly across the cognitive skills that compose critical thinking. This assumption, however, is faulty. For example, a student who is proficient in the cognitive skill of interpretation is not necessarily proficient in the skill of evaluation. Consequently, generalizing a student's critical thinking ability level to one holistic score represents a loss of diagnostic data. This diagnostic data can be important for teacher instructional decisions; therefore, analytic rubrics are a superior tool for the purposes of measuring critical thinking sub-skills.

Current critical thinking assessment instruments also frequently fail to measure both cognitive skills and the dispositional aspect of critical thinking (Ku, 2009; Mejia, 2009). The analytic rubric is an assessment tool that shows promise of meeting this need for the measurement of critical thinking sub-skills and disposition towards critical thinking.


The benefits of analytic rubrics include (1) a greater level of detail in the information derived from the assessment scores, (2) clear designation of what is being measured, (3) detailed guidance regarding what should be included in the interpretation of data, and (4) increased clarity regarding the relationship between what is being measured and the corresponding scores (Clauser, 2000; Ennis, 2003; Kane, Cooks, & Cohen, 1999). These benefits of analytic rubrics, over holistic rubrics, were the basis of the rubric design in this study.

While the scoring of performance-based assessments with analytic rubrics provides an array of benefits for critical thinking assessment, analytic rubrics are often associated with the challenge of obtaining acceptable reliability indices. Shavelson, Baxter, and Gao (1993) outline four sources of reliability measurement error: task, occasion, method, and rater. Rezaei and Lovorn (2010) further contend "analytic rubric-based assessment has not been adequately and experimentally studied." This investigation focuses on the rater as a source of potential measurement error that may impact the reliability of the Critical Thinking Analytic Rubric (CTAR). The purpose of this study was to investigate the inter-rater and intra-rater reliability of the CTAR.

2. Methods

2.1. Instrument development

2.1.1. The critical thinking construct

As previously outlined, the Facione (1990) Delphi Report's definition of critical thinking was adopted as the construct for this instrument development project, which leads to a focus on five cognitive skills (interpretation, analysis, evaluation, inference, and explanation) and the disposition towards thinking critically. The cognitive skill of interpretation was defined as the ability "to comprehend and express the meaning or significance of a wide variety of experiences, situations, data..." (Facione, 1990, p. 6). Interpretation relates to understanding and accurately conveying the meaning of information from various sources and points of view. The expert consensus further outlined three sub-skills associated with interpretation: categorization, decoding significance, and clarifying meaning (Facione, 1990).

Analysis is defined as the cognitive process of "identify[ing] the intended and actual inferential relationships among statements, questions, concepts... or other forms of representation intended to express beliefs, judgments... information, or opinions" (Facione, 1990, p. 7). Analysis essentially involves examining the relationship among and between different sources of information as well as comparing and contrasting different points of view. The Facione (1990) Delphi definition specifies three analysis sub-skills: examining ideas, detecting arguments, and analyzing arguments.

The cognitive skill of evaluation is defined as "assess[ing] the credibility of statements or other representations which are accounts or descriptions of a person's perception ... or opinion; and to assess the logical strength of the actual or intended inferential relationships among statements ... or other forms of representation" (Facione, 1990, p. 8). Evaluation requires identifying pertinent arguments and judging the credibility of sources and the logic of statements. The sub-skills associated with evaluation include assessing claims and assessing arguments (Facione, 1990).

Inference is the cognitive skill that involves "identify[ing] and secur[ing] elements needed to draw reasonable conclusions; to form conjectures and hypotheses; to consider relevant information and to educe the consequences flowing from data, statements, principles, evidence ... or other forms of representation" (Facione, 1990, p. 9). Inference involves making realistic predictions, drawing rational conclusions within a context, and deducing logical implications. The sub-skills associated with inference are querying evidence, conjecturing alternatives, and drawing conclusions (Facione, 1990).

Explanation is the process of "stat[ing] the results of one's reasoning; ... justify[ing] that reasoning in terms of the evidential, conceptual ... and contextual considerations upon which one's results were based; and to present one's reasoning in the form of cogent arguments" (Facione, 1990, p. 10). In essence, explanation is a skill whereby evidence and reasoning are used to support an argument or particular point of view. Explanation sub-skills include stating results, justifying procedures, and presenting arguments (Facione, 1990).


Table 1
Relationship between questions and prompts in form 1 of the PACT and aspects of the construct targeted with those questions.

Form 1 of PACT: Example questions | Targeted skills
Part 1, question 1: Identify and define in your own words each leader's preferred method for affecting change. What is the reasoning for their chosen method? | Interpretation
Part 1, question 2: Compare and contrast the two methods. | Analysis
Part 2, essay prompt: ... write an essay explaining what method for affecting change you would use if you found yourself in the following scenario... What method would you use in your new movement for change? | Interpretation, Analysis, Evaluation, Disposition
Part 2, essay prompt: Make sure you justify your chosen method... | Interpretation, Evaluation, Explanation, Inference, Disposition
Part 2, essay prompt: ... comment on the implications your choice would likely have on your life and the movement you are involved in. | Inference
Part 2, essay prompt: In your essay, support your position with specific points you noted during your analysis of the reading packet completed in Part 1. Please cite your sources by providing the source number in parentheses. | Evaluation, Explanation

Disposition towards critical thinking is an affective dimension of the critical thinking construct. In order to make use of critical thinking skills for solving problems or making decisions one must be motivated, or positively disposed, towards using those skills. The multi-disciplinary panel defined disposition as "... diligence in seeking relevant information, reasonableness in selecting and applying criteria, care in focusing attention on the concern at hand, persistence though difficulties are encountered, precision to the degree permitted by the subject and the circumstance" (Facione, 1990, p. 13). These detailed, clearly defined cognitive skills, sub-skills, and dispositional aspect of critical thinking formed the construct definition that was the foundation of the development of the instrument in this study.

2.1.2. Instrument description

Performance-based assessments by their very nature are more complex instruments than traditional assessments; these instruments are not simply made up of a set of questions and a predetermined answer set. Solano-Flores and Shavelson (1997) characterize performance-based instruments as 'the triple', indicating that the assessment is made up of a prompt, the student-created answer, and the scoring rubric. The prompt itself directs those being assessed and elicits a response. The student-created response is an original product and serves as evidence of the student's ability in the construct being measured. The scoring rubric is the mechanism for interpreting the student-created response in a consistent manner so that comparable performance-based data are generated. The critical thinking performance-based instrument in this study has these 3 components.

2.1.2.1. Performance Assessment for Critical Thinking: the prompt. The Performance Assessment for Critical Thinking (PACT) prompt consists of two parts: a short answer section and an essay section (Appendices A and B). Each part contains subsections with questions that were designed to present students with opportunities to engage in the cognitive skills represented in the critical thinking construct as defined above (Table 1). This structure of the PACT provided a template so that multiple forms of the PACT could be built around contextually comparable topics. The appropriateness of the topic was determined by (1) the theme of the course it was being administered in and (2) the time of year it was administered. These two details were important because the topic was intended to be complementary to the course, but not directly instructed in the course prior to the administration of the instrument (Moss & Koziol, 1991). Topics directly covered in the course could, therefore, be used only if administered prior to instruction.


Table 2
Summary of the current forms of the PACT including topic, perspectives presented in the reading packet, and the dilemma presented in the essay prompt.

Form | Topic | Perspectives | Dilemma in essay prompt
1 | What is the best method of affecting change? | Proponents and opponents of civil disobedience and guerilla warfare. Background information about influential leaders of two movements for change. | What method would you use in your new movement for change?
2 | How should physician-assisted suicide (PAS) be regulated? | Proponents and opponents of the Oregon Death with Dignity Act and the Dutch Euthanasia Act. Background information about each law. | What policy, regarding physician-assisted suicide, would you advocate for as a member of the ethics committee?
3 | How should the difference in health between social classes be addressed? | Proponents and opponents of universal healthcare/government regulation of health related behaviors. Background information about factors impacting population health. | What policy, regarding healthcare reform, would you advocate for as a member of the Health, Education and Labor committee?

In addition to the prompt, the PACT contained a reading packet that was built around the topic associated with each form of the instrument. The topic selected for each form was chosen for its controversial nature. The controversial topic allowed the packet to be constructed around sources from multiple perspectives. For example, the current forms of the PACT include the controversial topics of (1) the best method for affecting change, (2) regulation of physician assisted suicide, and (3) socioeconomic status and health (Table 2). Student responses to the PACT prompt resulted from their interpretation, analysis, and evaluation of the provided material. This format created an assessment environment where there were fewer restraints on the student's answer and where thought process was prioritized over any anticipated answer (Case, 2009; Moss & Koziol, 1991; Nosich, 2009). This format also enables student cognitive skills and the disposition to use those skills to be simultaneously measured, which represents best practice in critical thinking assessment (Ku, 2009; Mejia, 2009).

Part 1 of the PACT prompt instructs students to 'identify and define' and 'compare and contrast' the perspectives presented in the reading packet. These questions lead students to interpret and analyze the material in the packet and to access the specific sub-skills of interpretation and analysis as defined by the Facione (1990) Delphi definition. Part 2 of the PACT prompt contains a scenario that creates a dilemma (Table 2; Appendices A and B) and instructs students to take and justify a position on the topic as well as consider the implications of their chosen position. This portion of the prompt leads students to interpret, analyze, and evaluate the material in the reading packet and to access the specific sub-skills defined by the Facione (1990) Delphi definition. Part 2 also instructs students to explain their position and draw inferences about the implications of their position. Finally, the disposition toward critical thinking is evident in the student's interaction with the controversial topic and the material in the reading packet. The relationships between the questions/prompts and the cognitive skills and/or disposition the questions were designed to measure are shown in detail in Table 1.

2.1.2.2. Performance Assessment for Critical Thinking: student created responses. The student-created response to part 1 of the PACT prompt took multiple open-ended formats because the student was instructed to choose the format that best suited their preference. The formats included short answer paragraphs, Venn diagrams, concept maps, outlines, or some combination thereof. The student response to part 2 of the PACT prompt was in essay format. In addition to parts 1 and 2, the reading packet was designed to encourage students to take margin notes. These margin notes served as an additional source of evidence of student thought processes when constructing their response.


2.1.2.3. Critical Thinking Analytic Rubric: the scoring mechanism. The Critical Thinking Analytic Rubric (CTAR) was designed to assess the five cognitive skills outlined above from the Delphi definition of critical thinking and the dispositional aspect of critical thinking (Facione, 1990). The CTAR was, therefore, composed of 6 rubric categories: interpretation, analysis, evaluation, inference, explanation, and disposition (Appendix C). Within each rubric category, there was a 6-point scale that contained 2–3 bullets that describe the characteristics of each level of performance. The cognitive skill categories of the CTAR were primarily designed from the Facione (1990) Delphi report's description of the cognitive skills. More specifically, each rubric category's bulleted list was designed to measure the identified sub-skills of each cognitive process (previously outlined in Section 2.1.1). For example, the three sub-skills associated with interpretation (categorization, decoding significance, and clarifying meaning) were the specific skills targeted with the three bullets under the interpretation rubric category. The authors also consulted and drew from the holistic rubric of Facione and Facione (1994) during the development of the cognitive skill categories of the CTAR.

For the disposition category of the CTAR, the first bullet in the rubric was based on King and Kitchener's (1994, 2004) theory of reflective thinking and the Facione (1990) Delphi definition of the disposition towards thinking critically. The seven stages of reflective judgment represent an empirically based theory on the progression of an individual's view of knowledge and conception of justification (King & Kitchener, 1994, 2004). This model is specifically focused on individuals in adulthood (high school through graduate school) and contributed to the progression of disposition represented in the CTAR's 6-point scale (King & Kitchener, 1994).

The disposition category of the CTAR was also based on the progression of intellectual development work of Perry (1970). Perry's (1970) theory of intellectual development has three main stages: dualism, multiplicity, and relativism. Students in the dualism stage tend to see the world in terms of right/wrong and black/white (Perry, 1970). Dualists will sometimes acknowledge that other beliefs and perspectives exist, but if they are not in agreement with the Authorities, then those extraneous perspectives are wrong (Perry, 1970). Students in the multiplicity stage understand that different people are going to have different perspectives, but do not see a basis for evaluating these different points of view (Perry, 1970). These students may thus believe that everyone is entitled to an opinion, yet may be unaware that not all opinions are equally valid depending on the context of the situation. As students move into the relativism stage, they begin to understand the importance of context when making decisions (Perry, 1970). Relativistic students are aware of the validity of alternate perspectives, but they are able to adopt a particular point of view based on the context of the situation and can support their chosen viewpoint with appropriate justification (Perry, 1970). This model of intellectual development was highly influential in the design of the second bullet of the disposition rubric category and the detail of each score value on the 6-point scale.

Slight modifications and refinements were made to several categories of the CTAR between the 2008–2009 study and the 2009–2010 study to improve the clarity of the rubric. This parallel and iterative process of the development of the CTAR and the PACT is consistent with other descriptions of the development of scoring procedures for performance assessments (Clauser, 2000).

2.2. Study context

2.2.1. Study participants

A total of 306 participants took part in this study. All participants were high school seniors registered in a dual enrollment course for college preparation. The participant group (N = 114) for the 2008–2009 pilot of the PACT was administered form 1 as a pretest and form 2 as a posttest. The participants' ages ranged from 16 to 18 and 54% were female. All participants in the 2008–2009 pilot attended the same suburban high school and were enrolled in one of three sections of the college preparation course. In the suburban high school, 57% of the student body were White, 21% Asian/Pacific Islander, and 12% Hispanic. The school can be further characterized as having 23% of students qualifying for free or reduced lunch, 6% of students in an English as a second language program, and a 90% graduation rate (Table 3).

The participant group (N = 192) for the 2009–2010 study was enrolled in the college preparation course at one of four participating high schools (3 urban and 1 suburban).


Table 3
School level demographics including percentage of students eligible for free/reduced lunch, graduation rates, English as a second language, and ethnicity for the 2008–2009 pilot study and 2009–2010 study (Oregon Department of Education, 2012).

School | Eligible for free or reduced lunch | Graduation rate | English as second language | White | Black | Hispanic | Asian/Pacific Islander | American Indian/Alaskan Native | Multi-ethnic | Other
2008–2009 Pilot Study
1 | 23% | 90% | 6% | 57% | 3% | 12% | 21% | <1% | 6% | <1%
2009–2010 Study
1 | 26% | 89% | 5% | 55% | 4% | 14% | 21% | <1% | 4% | <1%
2 | 57% | 80% | 7% | 20% | 53% | 16% | 7% | <1% | 3% | 2%
3 | 77% | 70% | 17% | 44% | 8% | 15% | 24% | 3% | 5% | 2%
4 | 77% | 71% | 10% | 52% | 9% | 23% | 10% | 1% | <1% | 1%


All sections of the college preparation course were administered form 1 as a pretest. All sections, except one, were administered form 2 as a posttest. One section of the college preparation course was administered form 3 as a posttest because form 2 did not fit well into the theme of the course. The posttest data from form 3 were not included in the reliability study. In the 2009–2010 study, the participants' ages ranged from 16 to 21, and 53% were female. In the four high schools, the student bodies represented a wide range of ethnicities (Table 3). The four schools can be further characterized as having between 26% and 77% of students qualifying for free or reduced lunch, 5–17% of students in English as a second language programs, and graduation rates between 70% and 89% (Table 3).

2.2.2. Curriculum and critical thinking

The college preparation program offering these courses specified critical thinking as one of its goals. These courses were also designed parallel to the sponsoring university's undergraduate freshman general education course. The students in the college preparation course earned college credit that fulfilled the same requirements as the undergraduate freshman general education course, which speaks to the rigor of the college preparation course. One or two high school teachers and one university faculty member taught the courses with an interdisciplinary, team teaching instructional approach. During the 2008–2009 pilot study, all three authors served as the university appointed instructors of the college preparation course for at least one term.

Instructional strategies intended to increase students' critical thinking skills were implemented in the college preparatory course. These strategies include, but are not limited to, the use of leveled questions and Socratic seminars. Leveled questioning is a technique that involves students in applying the skills of interpretation, analysis, and inference to pose questions for class discussions. The Socratic seminar method is a format for class discussions that entails the use of explanation and evaluation skills and positive dispositions towards critical thinking. During the course, students also completed several assignments designed to expose them to college level course work and to develop and assess critical thinking skills. Examples of assessments used in the courses included research papers, evaluation of scientific knowledge claims papers, and literary analysis papers. These assessments frequently targeted one or more of the cognitive skills associated with critical thinking.

The PACT was administered at the beginning and end of the school year as an embedded assessment in the course. The equivalent of approximately 180 minutes of class time was allotted for the completion of the student response to the PACT. Students who were unable to finish during the allotted class time were allowed to continue to work after school or during study halls, but were never allowed to take the assessment home. Students generally spent the first approximately 90 minutes reading the packet, taking margin notes, and completing Part 1 of the PACT, with the remaining time devoted to Part 2 of the PACT.

2.3. CTAR scoring protocol

There were two raters in this study. Jonsson and Svingby (2007) report in their review of 75 empirical research studies of scoring rubrics, "two raters are, under restrained conditions, enough to produce acceptable levels of inter-rater agreement" (p. 136). This study met the "restrained conditions" referenced by Jonsson and Svingby (2007) by using raters with similar backgrounds and demographics, implementing a strict scoring protocol, conducting calibration training, and creating a rubric notes companion document to increase rater consistency. Both raters had approximately the same amount of teaching experience; they were both recent graduates of masters programs in their first year of teaching in the college readiness program during the 2008–2009 study year. Both raters were white females and neither grew up speaking another language at home.

In addition to similar demographics, the raters in this study followed a consistent scoring protocol. The scoring protocol included (1) looking at the material in the student work samples in a consistent order (part 1, margin notes, essay), (2) ignoring grammatical errors and refraining from judging on poor communication (Gustafson & Bochner, 2009), and (3) following weight of evidence when a work sample did not neatly fall into a certain score on a rubric category. This strict protocol provided an additional restraint to the conditions under which scoring took place.


In addition to the scoring protocol, the raters engaged in calibration training prior to the scoring of every data set in this investigation. The training of raters is well recognized in the literature as a method of increasing inter-rater reliability (Miller & Linn, 2000; Myerberg, 1996; Rezaei & Lovorn, 2010). During the calibration training, the raters in this study scored the same set of randomly selected student work and then immediately discussed the papers. The discussions were framed around two goals. First, the discussions focused the raters on building a common understanding of the CTAR by concentrating on building a consensus about the critical thinking construct, the rubric categories, and the score levels. Second, the raters' discussions concentrated on calibrating the interpretations of the student-created responses to the PACT assessment instrument. The calibration training followed an iterative process where the same sets of student work were independently scored and discussed by the raters, then a new set of student work was scored and discussed; this process was repeated until the raters felt they had reached sufficient agreement to score a set of student work for the purpose of establishing inter-rater reliability. Approximately 20–30 student work samples were included in each training session.

Finally, a companion document containing rubric notes was created and updated during the scoring calibration training and the scoring of data sets. This companion document laid out each rubric category on its own page with a space designated for rater notes. The notes were intended as supplemental material to aid the raters in maintaining consistency. It also allowed raters to make specific notes about the different forms of the PACT because the CTAR is written in a general manner to work for all topics. The CTAR's scoring system as outlined above creates the "restrained conditions" called for by Jonsson and Svingby (2007).

2.4. Data analysis

The inter-rater reliability of the CTAR was investigated with a consistency measure, Cronbach's alpha, for the 2008–2009 and 2009–2010 data sets, respectively. The inter-rater reliability measure, Cronbach's alpha, was calculated for each rubric category in SPSS.
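The paper reports these consistency estimates as Cronbach's alpha but does not reproduce the formula; for reference, the standard coefficient for the two-rater (or two-occasion) case used here, treating each rater as an "item", is

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right), \qquad k = 2, \quad X = \sum_{i=1}^{k} Y_i,
\]

where \(\sigma^{2}_{Y_i}\) is the variance of rater i's scores on a given rubric category across work samples and \(\sigma^{2}_{X}\) is the variance of the summed scores.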

Jonsson and Svingby (2007) report that, despite the importance of intra-rater reliability as a consistency measure, few studies in their literature review reported intra-rater reliability measures. In order to investigate the intra-rater reliability, 30 student work samples were randomly selected and assigned new identification numbers. These intra-rater reliability student work samples were randomly mixed into the existing data set so that each rater would blindly score 15 student work samples a second time. This intra-rater analysis was conducted for the data set from the 2009–2010 administration of form 1 of the PACT. Cronbach's alpha was calculated for each rater and rubric category in SPSS.
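As an illustration only (the authors computed these values in SPSS; the function, scores, and column layout below are hypothetical), the same per-category alpha can be computed from paired scores, where the pair is either the two raters (inter-rater) or one rater's first and second blind scorings of the same re-identified papers (intra-rater):

```python
import numpy as np

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for an (n_samples x k_scorings) array of rubric scores.

    For the inter-rater design the two columns hold rater 1's and rater 2's
    scores; for the intra-rater design they hold one rater's first and
    second (blind) scores of the same randomly re-identified papers.
    """
    scores = np.asarray(score_matrix, dtype=float)
    k = scores.shape[1]                              # number of scorings per sample
    item_variances = scores.var(axis=0, ddof=1)      # variance of each scoring column
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 6-point-scale scores for one rubric category (e.g., interpretation)
rater_1 = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]
rater_2 = [4, 3, 4, 2, 5, 3, 5, 2, 2, 4]
print(f"inter-rater alpha: {cronbach_alpha(np.column_stack([rater_1, rater_2])):.2f}")
```

In the intra-rater design described above, each of a rater's 15 blindly rescored papers would contribute one (first score, second score) pair per rubric category.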

3. Results

3.1. Intra-rater reliability

During the 2009–2010 study, a subsample of student work (N = 30) was randomly selected, assigned new ID numbers, and 15 papers were blindly re-scored by each rater for the intra-rater reliability investigation. Consistency measures of 0.70 or greater are commonly deemed acceptable in the literature (Jonsson & Svingby, 2007; Stemler, 2004). The intra-rater reliability of raters 1 and 2 ranged from 0.56 to 0.87 and from 0.76 to 0.96, respectively (Table 4). Both raters show acceptable levels of consistency with alphas at or above 0.70 in the rubric categories of interpretation, evaluation, inference, explanation, and disposition. The consistency for rater 1 in the analysis category (α = 0.56) fails to meet the 0.70 level, while rater 2 shows a high degree of consistency (α = 0.91).

3.2. Inter-rater reliability

During the pilot study, both raters scored 100% of the student work samples (N = 114). The inter-rater reliability indices ranged from 0.76 to 0.90 for both forms of the PACT. Therefore, the inter-rater reliability consistency estimates met or exceeded the 0.70 level for all rubric categories (Table 5).


Table 4
Cronbach's alpha coefficients for the intra-rater reliability investigation of the 2009–2010 study's data set.

Rubric categories | Scorer 1 (N = 15) | Scorer 2 (N = 15)
Interpretation | 0.70 | 0.81
Analysis | 0.56 | 0.91
Evaluation | 0.83 | 0.85
Inference | 0.86 | 0.91
Explanation | 0.87 | 0.96
Disposition | 0.86 | 0.76

Table 5
Cronbach's alpha coefficients for the 2008–2009 pilot study and the 2009–2010 study inter-rater reliability investigation.

Rubric categories | 2008–2009 pilot, Form 1 (N = 114) | 2008–2009 pilot, Form 2 (N = 114) | 2009–2010 study, Form 1 (N = 30) | 2009–2010 study, Form 2 (N = 30)
Interpretation | 0.82 | 0.83 | 0.72 | 0.79
Analysis | 0.77 | 0.90 | 0.85 | 0.90
Evaluation | 0.87 | 0.86 | 0.76 | 0.83
Inference | 0.86 | 0.76 | 0.82 | 0.82
Explanation | 0.85 | 0.83 | 0.92 | 0.78
Disposition | 0.85 | 0.81 | 0.81 | 0.81

The frequency of scores for the pilot study shows that student abilities were distributed across 4–5 ability levels across all rubric categories, raters, and forms, with one exception (Figs. 1–4). In the interpretation category of form 2, rater 2 scored only 3 ability levels; however, the scores were fairly well distributed across the 3 ability levels (Fig. 4).

During the 2009–2010 study, a randomly selected subsample of student work (N = 30) was scored by both raters for the inter-rater reliability investigation. This subsample represented 15.6% of the data set for form 1 and 28% of the data set for form 2. This difference in the percentage of the sample was due to two factors: (1) form 2 of the PACT was not administered for one section of the dual enrollment course for reasons described earlier and (2) attrition in the number of students completing form 2, which was administered as a posttest.

Fig. 1. Frequency of scores assigned by rater 1 for each CTAR rubric category during the 2008–2009 pilot study, PACT form 1.


Fig. 2. Frequency of scores assigned by rater 2 for each CTAR rubric category during the 2008–2009 pilot study, PACT form 1.

Fig. 3. Frequency of scores assigned by rater 1 for each CTAR rubric category during the 2008–2009 pilot study, PACT form 2.

Fig. 4. Frequency of scores assigned by rater 2 for each CTAR rubric category during the 2008–2009 pilot study, PACT form 2.


The inter-rater reliability indices for both forms of the PACT ranged from 0.72 to 0.92. The consistency estimates, therefore, met or exceeded the 0.70 level for all rubric categories in the 2009–2010 inter-rater reliability investigation (Table 5).

4. Findings and implications

4.1. Inter-rater reliability of the CTAR

In summary, the inter-rater reliability indices associated with all six CTAR categories met or exceeded acceptable levels for all forms and administrations of the PACT across the pilot study and 2009–2010 study. The range of scores for the pilot study demonstrates that the sample of student abilities was well distributed across 3–5 ability levels for all rubric categories, raters, and forms (Figs. 1–4). These distributions enable a fair test of the CTAR's inter-rater reliability because the diverse sample of student abilities provides ample opportunity to test consistency among raters. In the 2009–2010 study, the population of students was further diversified with the addition of three urban high schools (Table 3). The 2009–2010 inter-rater reliability data set was derived from a random sample of papers from the data set, creating a smaller sample size than the pilot study. This reduction of sample size arguably creates a more rigorous test of the CTAR's inter-rater reliability because data sets with smaller sample sizes increase the impact of each disagreement between the raters on inter-rater reliability indices. The consistent findings of acceptable inter-rater reliability across these studies lead to increased confidence in the conclusion that trained raters can use the CTAR reliably.

A common criticism of performance-based instruments is that their scoring mechanisms, typically rubrics, are associated with low reliability measures (Halpern, 2003). Both of the inter-rater reliability studies in this investigation, however, demonstrate that the CTAR can be used to reliably score the student-created responses to the PACT. In a team teaching context, like that of the college preparatory program this instrument was administered in, it is common practice for faculty to distribute the scoring of student assessments across each classroom's instructors. This practice of multiple raters across individual classrooms, and the program as a whole, makes inter-rater reliability especially important to assure consistent interpretations of students' critical thinking abilities, equitable grading practices, and precise assessment data for teachers to use in instructional decisions. This investigation demonstrated that the CTAR can be used by two raters to score student work samples in a consistent manner so that data can be aggregated.

As previously outlined, the cognitive skills and dispositions associated with critical thinking are important to college readiness (American Association of Colleges and University, 2005; Cline et al., 2007; Facione, 1990). Critical thinking should, therefore, be an important outcome of college preparatory courses. The CTAR shows potential for both summative and formative assessment purposes. Cognitive scientists assert the importance of assessment in the design of successful learning environments (Bransford, Brown, & Cocking, 2000). Jonsson and Svingby (2007) conclude "rubrics support learning and instruction by making expectations and criteria explicit, which also facilitates feedback and self-assessment" (p. 139). The CTAR, when used to analyze student work prior to or during instruction, can serve formative purposes by providing teachers and students with diagnostic assessment data that identifies which specific aspects of critical thinking need to be targeted for teaching and learning. The CTAR can also provide summative data when applied to student work derived from a posttest administration. Both of these assessment applications are important for helping high school students meet the critical thinking proficiency levels required for success in college.

4.2. Intra-rater reliability of the CTAR

The investigation into the intra-rater reliability revealed the raters in the study were able to apply the CTAR in a consistent manner for 5 of the 6 rubric categories. The intra-rater reliability investigation did reveal a lack of consistency in the application of the analysis category of the rubric for one of the raters. In the intra-rater reliability study, only a small number of student work samples (N = 15) were blindly rescored.


This small sample size makes this particular investigation of intra-rater reliability rigorous, essentially raising the stakes so that only a very small number of disagreements could occur without dropping below the acceptable level of reliability. One could, therefore, attribute the low intra-rater reliability to a sample size that was too small. However, the second rater showed a high level of consistency when applying the same analysis rubric category. These findings, taken together, suggest that it is likely that the particular rater needs further training to score student work samples in the analysis category in a consistent manner.

5. Limitations and future research

This investigation is part of a larger validation study of a newly developed performance-based instrument which is composed of the PACT and its scoring tool, the CTAR. This study was limited to the reliability of the CTAR. It remains to be determined if the PACT/CTAR can be considered a valid measure of high school students' critical thinking abilities.

Approximately seven hours per administration of the PACT were required to reach the inter-rater reliability levels reported in this study. Rezaei and Lovorn (2010) conclude in their study of an analytic rubric for writing that rubrics do not increase reliability without training on how to use the rubric. Scoring training is an important precursor to the consistent use of rubrics; however, in order for the use of the CTAR to be feasible for in-service teachers, the length of training time would need to be decreased. Use of anchor papers in the training of raters is frequently cited as an important tool for increasing the reliability of scores (Jonsson & Svingby, 2007; Rezaei & Lovorn, 2010). This study was a preliminary study of a newly developed rubric; therefore, anchor papers were not available to augment training of raters. Future study of the CTAR once anchor papers have been selected would be worthwhile. More specifically, it would be worthwhile to determine if the designation of anchor papers would reduce the amount of time needed for calibration training of raters.

The rubric notes document used in this study may be a tool to help raters maintain consistency within their own scores, between themselves and other raters, and between different scoring sessions. Therefore, the effect of the rubric notes document on both inter-rater reliability and intra-rater reliability deserves further study in this instrument context and others.

Finally, the method of estimating intra-rater reliability used in this study may be of use to other assessment developers or large scale assessment programs to monitor rater consistency when large numbers of work samples are being scored. The determination of how many samples each rater blindly rescores, however, is a matter worthy of careful consideration.

6. Conclusion

The reliable and valid measurement of critical thinking is key to answering questions that surround the preparation of high school students for both the higher order cognitive processes and dispositions needed for success in college. Evidence of reliability is a requirement in the validation of an assessment instrument, but reliability alone is not sufficient to establish a rigorous case for validity. Clauser (2000) casts rubric development as an important component of performance-based assessment development. This investigation is an important first step in the validation process for the PACT and its associated scoring mechanism, the CTAR.

Acknowledgments

We would like to gratefully acknowledge the high school students, for participating in this study; the instructors in the college preparatory course, for their enthusiasm and interest in the project; and the leadership and staff of Portland State University's Center for Science Education and the University Studies Program, for making this research possible. Finally, a special thanks to Dalton Miller-Jones at PSU for his encouragement, insight, and valuable feedback.


Appendix A. The Performance Assessment for Critical Thinking (PACT) – example prompt from form 1

What is the best method to affect change?

History provides a long list of political, social, and economic movements that have actively sought to bring about change. A few examples of such movements include efforts that focused on civil rights, anti-colonialism, women's rights, poverty and labor issues. Many different strategies and tactics have been used as a means to bring about the desired changes. Your summer reading assignment, Malcolm X, provided one well-known example of an influential figure who dedicated his life to change. The readings for this assignment will introduce you to two additional important figures: Ernesto "Che" Guevara and Mohandas Gandhi.

Please review the provided packet, which includes relevant background information and excerpts of the writings of both Che Guevara and Mohandas Gandhi.

While reviewing the packet, use one of the provided organizational tools (choose the tool that works best for you) to take notes and organize your thoughts.

Part 1: Based on the provided information, answer the following short answer questions regarding the preferred methods of Ernesto "Che" Guevara and Mohandas Gandhi. In each of your answers, please cite your sources by providing the source number and page from the packet:

1. Identify and define in your own words each leader's preferred method for affecting change. What is the reasoning for their chosen method?

2. Compare and contrast the two methods.

Part 2: Based on your analysis of the two positions and insight from the reading of Malcolm X, write an essay explaining what method for affecting change you would use if you found yourself in the following scenario:

In 2015, there is a devastating terrorist attack on Seattle, Washington. After a quick investigation, the identified terrorists are found to have one major defining characteristic. This characteristic (be it ethnicity, religion, or sexual preference) is a trait your best friend holds in common with the terrorists. Due to this common characteristic, your best friend finds himself/herself guilty by association. The US government enacts emergency legislation for National Security purposes that greatly impacts his/her daily life and freedom. For example, his/her phone calls and home are under constant surveillance, he/she is subject to random police searches, and his/her travel is restricted to the local geographic area. Observing this chain of events, you feel your friend is being unjustly persecuted and you become impassioned to start a movement that will bring about the change of these laws and the restoration of your friend's rights.

What method would you use in your new movement for change? Make sure you justify your chosen method and comment on the implications your choice would likely have on your life and the movement you are involved in.

In your essay, support your position with specific points you noted during your analysis of the reading packet completed in Part 1. Please cite your sources by providing the source number in parentheses.


Appendix B. The Performance Assessment for Critical Thinking (PACT) – example prompt from form 2

How should physician-assisted suicide (PAS) be regulated?

The debate over what constitutes a "good death" can be traced back in history to the time of Socrates, who said "A man should wait, and not take his own life until God summons him." Ironically, later in life, after being convicted of corrupting the youth of Athens and condemned to death by poison, Socrates took his own life by drinking a cup of hemlock. It is said he chose to commit suicide so that he could maintain his independence until the end. The concept of dying with one's independence intact is often referred to as a "Socratic death." In our contemporary language, a Socratic death is more often called "dying with dignity." Recently the right-to-die movement has revived the debate over what defines a "good death" and whether people suffering from terminal illness or insufferable pain have a right to end their own life with the assistance of a physician.

The readings for this assignment will introduce you to opposing viewpoints associated with this issue.

Please review the provided reading packet, which includes relevant background information, descriptions of two different policies designed to regulate physician-assisted suicide, and excerpts from both proponents and opponents of the two different policies.

Part 1: As you review the reading packet, use one of the provided organizational tools (choose the tool that works best for you) to take notes and organize your thoughts. Make sure you address the following questions in your use of the organizational tool:

1. Identify and define in your own words each law's basic premise. What details are important in defining each law?

2. Compare and contrast the two policies/laws.

Part 2: Based on your analysis of the two policies and insight from the ethics theme from semester 2, write an essay explaining what type of policy you would support if you found yourself in the following scenario.

In 2020, you are working as a Colorado State Senator. Physician-assisted suicide (PAS) is currently prohibited by law in the state of Colorado, but Colorado has a long history of politicians, courts, and citizens participating in the debate regarding the legalization of PAS. For example, in 2000, a petition to legalize PAS was rejected by the Colorado Court of Appeals. More recently, the debate about the regulation of PAS has been revived in Colorado. The controversial case that revived the debate involved a terminally ill patient, suffering from extreme pain, whose request for PAS was denied on legal grounds. In fact, citizens are so passionate on both sides of the issue that the Senate ethics committee has been asked to consider drafting a new law to allow PAS. As a member of the ethics committee, you have an obligation to participate in this process.

What policy would you advocate for as a member of the ethics committee? Make sure you justify your position and comment on the implications this would likely have on your life and the state of Colorado.

In your essay, support your position with specific points you noted during your analysis of the reading packet completed in Part 1. Please cite your sources by providing the source number in parentheses.


Appendix C. The Critical Thinking Analytic Rubric

Rubric categories: Interpretation and Analysis

Score 6
Interpretation:
• Clearly and accurately identifies all of the major viewpoints.
• Accurately interprets evidence, statements, graphics, questions, etc. with precision and detail.
• Demonstrates confident ability to work with the key concepts and terminology.
Analysis:
• Thoughtfully analyzes all points of view to present a thorough evaluation of similarities and differences.
• Accurately identifies important claims, arguments, patterns, and/or assumptions in the evidence.
• Consistently demonstrates clear, accurate, detailed and comprehensive ability to organize the information for further examination.

Score 5
Interpretation:
• Clearly identifies all of the major viewpoints.
• Accurately interprets evidence, statements, graphics, questions, etc.
• Demonstrates a strong ability to work with the key concepts and terminology.
Analysis:
• Analyzes all points of view to present a thorough evaluation of similarities and differences.
• Accurately identifies claims, arguments, patterns, and/or assumptions in the evidence.
• Consistently demonstrates an accurate and detailed ability to organize the information for further examination.

Score 4
Interpretation:
• Identifies not only the major viewpoints, but recognizes some of the nuances of those positions.
• Interprets evidence, statements, graphics, questions, etc.
• Demonstrates a clear ability to work with the key concepts.
Analysis:
• Analyzes all points of view to present an evaluation of similarities and differences.
• Identifies claims, arguments, patterns, and/or assumptions in the evidence.
• Demonstrates clear ability to organize the information for further examination.

Score 3
Interpretation:
• Identifies only the basics of each viewpoint, relying heavily on quotes and failing to articulate points in own words.
• Interprets some evidence, statements, graphics, questions, etc.
• Demonstrates an uneven or shaky ability to work with the key concepts.
Analysis:
• Analyzes all points of view to present an evaluation of obvious or oversimplified similarities and differences.
• Superficially identifies the basic claims, arguments, patterns, and/or assumptions in the evidence.
• Demonstrates an adequate ability to organize the information for further examination.

Score 2
Interpretation:
• Identifies few viewpoints or instead identifies only personal position or point of view.
• Offers incorrect or no interpretations of evidence, statements, graphics, questions, etc.
• Demonstrates an extremely limited ability to work with the key concepts.
Analysis:
• Presents a superficial analysis of similarities and differences between the various points of view.
• Incorrectly identifies claims, arguments, patterns, and/or assumptions in the evidence.
• Demonstrates an inadequate ability to organize the information for further examination.

Score 1
Interpretation:
• Does not identify the viewpoint, but offers a biased position based on previously held beliefs.
• Offers no or only biased interpretations of evidence, statements, graphics, questions, information, or the points of view of others.
• Demonstrates no ability to work with the key concepts.
Analysis:
• Presents little to no analysis of similarities and differences between the various points of view.
• Does not identify claims, arguments, patterns, and/or assumptions in the evidence.
• Demonstrates no ability to organize the information for further examination.


Rubric categories: Evaluation and Inference

Score 6
Evaluation:
• Identifies the salient arguments (reasons and claims) from multiple perspectives with a clear explanation of each perspective.
• Thoughtfully analyzes and evaluates all major alternative points of view.
Inference:
• Demonstrates confident ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates surprising/insightful ability to take concepts further into new territory with broader generalizations and implications.

Score 5
Evaluation:
• Identifies the salient arguments (reasons and claims) from multiple perspectives.
• Offers analyses and evaluations of most alternative points of view.
Inference:
• Demonstrates a clear ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates strong ability to take concepts further into new territory with broader generalizations and implications.

Score 4
Evaluation:
• Identifies relevant arguments (reasons and claims) from multiple perspectives.
• Offers analyses and evaluations of alternative points of view.
Inference:
• Demonstrates an adequate ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates an adequate ability to take concepts further into new territory with broader generalizations and implications.

Score 3
Evaluation:
• Superficially identifies some arguments (reasons and claims) from main perspectives.
• Superficially evaluates obvious alternative points of view.
Inference:
• Demonstrates a shaky ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates an uneven ability to take concepts further into new territory with broader generalizations and implications.

Score 2
Evaluation:
• Hastily dismisses relevant counter-arguments.
• Ignores obvious and important alternative points of view.
Inference:
• Demonstrates inadequate ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates a superficial ability to take concepts further into new territory with broader generalizations.

Score 1
Evaluation:
• Fails to identify relevant counter-arguments.
• Ignores all alternative points of view.
Inference:
• Demonstrates no ability to apply or extend key concepts to make predictions, draw inferences, and analyze implications.
• Demonstrates no ability to take concepts further into new territory with broader generalizations.


Rubric categories: Explanation and Disposition

Score 6
Explanation:
• Explicitly integrates key sources to support conclusions that address the question.
• Clearly justifies and explains assumptions and reasons with evidence.
• Demonstrates warranted, judicious, non-fallacious conclusions by using strong, persuasive support.
Disposition:
• Objectively follows where evidence leads by considering the provided context.
• Student demonstrates a relativist view of knowledge through the adoption of a consistent point of view with appropriate justification as well as awareness of alternative viewpoints.

Score 5
Explanation:
• Integrates multiple sources to support conclusions that address the question.
• Justifies and explains some assumptions and reasons with evidence.
• Demonstrates warranted, non-fallacious conclusions by using strong support.
Disposition:
• Fair-mindedly follows where evidence leads by considering the provided context.
• Student demonstrates a relativist view of knowledge through the adoption of a clear point of view with appropriate justification as well as awareness of alternative viewpoints.

Score 4
Explanation:
• Utilizes information from several sources, but excludes an important viewpoint.
• Justifies and explains reasons with evidence.
• Conclusions appear reasonable through use of support.
Disposition:
• Fair-mindedly follows where evidence leads by addressing the provided context.
• Student demonstrates an understanding of the existence of multiple perspectives.

Score 3
Explanation:
• Correctly references information from few sources, but excludes any sources that support a conflicting view.
• Justifies and explains some reasons with evidence.
• Conclusions are acceptable, but support is weak.
Disposition:
• Follows where evidence leads, but fails to consider the provided context.
• Student demonstrates an understanding of the existence of multiple perspectives, but struggles to evaluate these diverse perspectives.

Score 2
Explanation:
• Misuses information or makes only vague reference to information from the sources.
• Seldom justifies or explains reasons with evidence.
• Conclusions are limited because support is lacking.
Disposition:
• Defends only a single perspective and fails to discuss other possible perspectives, especially those salient to the provided context.
• Student demonstrates a dualist view of knowledge through a treatment of the issue in terms of right/wrong, black/white, and good/bad.

Score 1
Explanation:
• References information from none of the relevant material.
• Does not explain or explicitly state reasons with evidence.
• Argues using fallacious or irrelevant reasons, and unwarranted, unsupported claims.
Disposition:
• Maintains views based on preconceptions and exhibits close-mindedness or hostility to reason.
• Student demonstrates a dualist view of knowledge through a treatment of the issue in terms of right/wrong, black/white, and good/bad, and focuses on only one side of the issue.

References

American Association of Colleges and Universities. (2005). Liberal education outcomes: A preliminary report on student achievement in college. Washington, DC: AAC&U.
American College Testing (ACT). (2010). Oregon: The condition of college and career readiness: Class of 2010. Iowa City, IA: American College Testing.
Attewell, P., Lavin, D., Domina, T., & Levey, T. (2006). New evidence on college remediation. The Journal of Higher Education, 77(5), 886–924.
Bernard, R. M., Zhang, D., Abrami, P. C., Sicoly, F., Borokhovski, E., & Surkes, M. A. (2008). Exploring the structure of the Watson-Glaser Critical Thinking Appraisal: One scale or many subscales? Thinking Skills and Creativity, 3, 15–22.
Blair, J. A. (2009). Who teaches K-12 critical thinking? In: J. Sobocan & L. Groarke (Eds.), Critical thinking education and assessment: Can higher order thinking be tested? (pp. 267–279). Ontario, Canada: The Althouse Press.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch Model: Fundamental measurement in the human sciences. New York, NY: Taylor & Francis Group.
Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academies Press.
Case, R. (2009). Teaching and assessing the "Tools" for thinking. In: J. Sobocan & L. Groarke (Eds.), Critical thinking education and assessment: Can higher order thinking be tested? (pp. 197–214). Ontario, Canada: The Althouse Press.
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Applied Psychological Measurement, 24(4), 310–324.
Cline, Z., Bissell, J., Hafner, A., & Katz, M. (2007). Closing the college readiness gap. Leadership, 37, 30–33.
Conley, D. T. (2007). The challenge of college readiness. Educational Leadership, 64, 23–29.
Conley, D. T. (2007). Eugene, OR: Educational Policy Improvement Center.
Ennis, R. H. (1993). Critical thinking assessment. Theory into Practice, 32(3), 179–186.
Ennis, R. (2003). Critical thinking assessment. In: D. Fasko (Ed.), Critical thinking and reasoning: Current research, theory, and practice (pp. 293–313). Cresskill, NJ: Hampton Press.
Facione, P. A. (1990). Critical thinking: A statement of expert consensus for purposes of educational assessment and instruction. Millbrae, CA: California Academic Press. Retrieved from http://www.insightassessment.com/pdf files/DEXadobe.PDF
Facione, P. A., & Facione, N. C. (1994). Holistic critical thinking scoring rubric. Millbrae, CA: California Academic Press. Retrieved from http://www.insightassessment.com/Products/Rubrics-Rating-Forms-and-Other-Tools/Holistic-Critical-Thinking-Scoring-Rubric-HCTSR
Gustafson, M., & Bochner, J. (2009). Assessing critical thinking skills in students with limited English language proficiency. Assessment Update, 21(4), 8–10.
Halpern, D. F. (2003). The "How" and "Why" of critical thinking assessment. In: D. Fasko (Ed.), Critical thinking and reasoning: Current research, theory, and practice (pp. 355–366). Cresskill, NJ: Hampton Press.
Jackson, T. R., Draugalis, J. R., Slack, M. K., & Zachry, W. M. (2002). Validation of authentic performance assessment: A process suited for Rasch Modeling. American Journal of Pharmaceutical Education, 66, 233–242.
Johnson, R. L., Penny, J. A., & Gordon, B. (2008). Assessing performance: Designing, scoring, and validating performance tasks. New York, NY: The Guilford Press.
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity, and educational consequences. Educational Research Review, 2, 130–144.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
King, P. M., & Kitchener, K. S. (1994). Developing reflective judgment: Understanding and promoting intellectual growth and critical thinking in adolescents and adults. San Francisco, CA: Jossey-Bass.
King, P. M., & Kitchener, K. S. (2004). Reflective judgment: Theory and research on the development of epistemic assumptions through adulthood. Educational Psychologist, 39(1), 5–18.
Ku, K. Y. L. (2009). Assessing students' critical thinking performance: Urging for measurements using multi-response format. Thinking Skills and Creativity, 4, 70–76.
Lomask, M. S., & Baron, J. B. (2003). What can performance-based assessment tell us about students' reasoning? In: D. Fasko (Ed.), Critical thinking and reasoning: Current research, theory, and practice (pp. 331–354). Cresskill, NJ: Hampton Press.
Mejia, A. D. (2009). In just what sense should I be critical? An exploration into the notion of 'assumption' and some implications for assessment. Studies in Philosophy and Education, 28(4), 351–367.
Miller, M. D., & Linn, R. L. (2000). Validation of performance-based assessments. Applied Psychological Measurement, 24(4), 367–378.
Moore, G. W., Slate, J. R., Edmonson, S. L., Combs, J. P., Bustamante, R., & Onwuegbuzie, A. J. (2010). High school students and their lack of preparedness for college: A statewide study. Education and the Urban Society, 20(10), 1–22.
Moss, P. A., & Koziol, S. M. (1991). Investigating the validity of a locally developed critical thinking test. Educational Measurement: Issues and Practice, 10(3), 17–22.
Myerberg, J. N. (1996). Inter-rater reliability on various types of assessments scored by school district staff. Paper presented at the annual meeting of the National Association for Research in Science Teaching.
Nosich, G. (2009). Central reasoning assessment: Critical thinking in a discipline. In: J. Sobocan & L. Groarke (Eds.), Critical thinking education and assessment: Can higher order thinking be tested? (pp. 215–228). Ontario, Canada: The Althouse Press.
Oregon Department of Education. (2012). School report cards (data file). Available from http://www.ode.state.or.us/data/reportcard/reports.aspx
Perry, W. G. (1970). Forms of intellectual and ethical development in the college years: A scheme. New York, NY: Holt, Rinehart and Winston.
Quitadamo, I. J., & Kurtz, M. J. (2007). Learning to improve: Using writing to increase critical thinking performance in general education biology. Life Sciences Education, 6, 140–154.
Rezaei, A. R., & Lovorn, M. (2010). Reliability and validity of rubrics for assessment through writing. Assessing Writing, 15, 18–39.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215–232.
Solano-Flores, G., & Shavelson, R. J. (1997). Development of performance assessments in science: Conceptual, practical, and logistical issues. Educational Measurement: Issues and Practice, 16(3), 16–25.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9(4). Retrieved from http://PARAonline.net/getvn.asp?v9&n=4

Emily Saxton, MST, is a research associate with the Center for Science Education at Portland State University. Her Master's degree in Science Teaching is from Portland State University and her BS degree in Environmental Science is from the University of Florida. Her research interests include assessment development and validation, and the evaluation of professional development.

Secret Belanger, MEd, MST, is a science teacher at Century High School in Hillsboro, OR. Her Master's degrees are in Education and Science Teaching from Portland State University, while her undergraduate degree in Zoology is from Oregon State University. Her MST thesis research was centered on critical thinking disposition.

William Becker, PhD, is the Director of the Center for Science Education at Portland State University. His degrees in Chemistry (PhD, MS, and BS) are from Boston University and DePaul University. He is the co-Principal Investigator of a Math and Science Partnership grant and an NSF Noyce grant.