As the call for accountability in higher education sets the tone for campus-wide learning outcomes
assessment, questions about how to conduct meaningful, reliable, and valid assessments to help students
learn have gained increased prominence. The chapter by Yen and Hynes presents the unique
contribution of a heuristic rubrics cube for authentic assessment validation. Just as Bloom’s taxonomy
maps the dimensions of cognitive abilities, the Yen and Hynes cube cohesively assembles three
dimensions (cognitive-behavioral-affective taxonomies, stakes of an assessment, and reliability and
validity) in a cube for mapping and organizing efforts in authentic assessment. The chapter includes a
broad discussion of rubric development. As an explication of the heuristic rubrics cube, they examine six
key studies to show how different types of reliability and validity were estimated for low, medium, and
high stakes assessment decisions.
Authentic Assessment Validation: A Heuristic Rubrics Cube
Jion Liou Yen, Lewis University
Kevin Hynes, Midwestern University
Authentic assessment entails judging student learning by measuring performance
according to real-life-skills criteria. This chapter focuses on the validation of authentic
assessments because empirically-based, authentic assessment-validation studies are sparsely
reported in the higher education literature. At the same time, many of the concepts addressed in
this chapter are more broadly applicable to assessment tasks that may not be termed “authentic
assessment” and are valuable from this broader assessment perspective as well. Accordingly, the
chapter introduces a heuristic, rubrics cube which can serve as a tool for educators to
conceptualize or map their authentic and other assessment activities and decisions on the
following three dimensions: type and level of taxonomy, level of assessment decision, and types
of validation methods.
What is a Rubric?
The call for accountability in higher education has set the tone for campus-wide
assessment. As the concept of assessment gains prominence on campuses, so do questions about
how to conduct meaningful, reliable, and valid assessments. Authentic assessment has been
credited by many as a meaningful approach to student learning assessment (Aitken and Pungur,
2010; Banta et al., 2009; Eder, 2001; Goodman et al., 2008; Mueller, 2010; Spicuzza and
Cunningham, 2003). When applied to authentic assessment, a rubric guides the evaluation of
student work against specific criteria, from which a score is generated to quantify student
performance. According to Walvoord (2004), “A rubric articulates in writing the various
criteria and standards that a faculty member uses to evaluate student work” (p. 19).
There are two types of rubrics—holistic and analytic. Holistic rubrics assess the overall
quality of a performance or product and can vary considerably in complexity. For example,
Moskal (2000) describes a simple holistic rubric for judging the quality of student writing
that involves four categories ranging from “inadequate” to “needs improvement” to “adequate”
to “meets expectations for a first draft of a professional report”. Each category contains a
few additional phrases to describe it more fully. In contrast, Suskie (2009) presents a more
complex example: a four-category holistic rubric for judging a student’s ballet performance
that utilizes up to 15 explanatory phrases. So, even though a rubric may employ multiple
categories for judging student learning, it remains a holistic rubric if it provides only one
overall assessment of the quality of student learning.
The primary difference between a holistic and an analytic rubric is that the latter breaks
a performance or product into several individual components and judges each part separately
on a scale that includes descriptors. Thus, an analytic rubric resembles a matrix with
two axes—dimensions (usually referred to as criteria) and levels of performance (as specified
by rating scales and descriptors). The descriptors are more important than the values assigned
to them because scaling may vary across constituent groups; keeping descriptors constant
allows cross-group comparison (Hatfield, personal communication, 2010). Although it takes time
to develop clearly defined, unambiguous descriptors, they are essential for communicating
performance expectations in rubrics and for facilitating the scoring process (Suskie, 2009;
Walvoord, 2004).
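
To make the matrix structure concrete, the following is a minimal sketch of an analytic rubric represented as data. The criteria, descriptors, and scale are invented for illustration and are not drawn from any published rubric.

```python
# A minimal sketch of an analytic rubric as a two-axis matrix: criteria
# (rows) crossed with performance levels (columns of descriptors).
# All criterion names, descriptors, and scores are invented examples.
ANALYTIC_RUBRIC = {
    "organization": {
        1: "No discernible structure",
        2: "Some structure, but transitions are abrupt",
        3: "Clear structure with smooth transitions",
    },
    "evidence": {
        1: "Claims are unsupported",
        2: "Claims are supported by limited or weak sources",
        3: "Claims are consistently supported by credible sources",
    },
}

def score_work(ratings):
    """Sum per-criterion ratings into one analytic total.

    `ratings` maps each criterion to the level a rater selected. The
    descriptors, not the numbers, carry the meaning, so the numeric
    scale could be rescaled without changing the rubric itself.
    """
    for criterion, level in ratings.items():
        assert level in ANALYTIC_RUBRIC[criterion], "rating off the scale"
    return sum(ratings.values())

# One rater's judgment of one student paper (hypothetical).
print(score_work({"organization": 3, "evidence": 2}))  # -> 5
```

Note how the numeric levels are arbitrary anchors for the descriptors; as the Hatfield observation above suggests, it is the constant descriptors that make scores comparable across groups.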
To further elucidate the differences between holistic and analytic rubrics, consider how
figure skating performance might be judged using a holistic rubric versus an analytic rubric.
If one were to employ a holistic rubric to judge the 2010 Olympic men’s figure skating based
solely on technical ability, one might award the gold medal to Russian skater Evgeni Plushenko
because he performed a “quad” whereas American skater Evan Lysacek did not. On the other hand,
if one used an analytic rubric to judge each part of the performance separately, in a matrix
comprising a variety of specific jumps with bonus points awarded for jumps performed later in
the program, then one might award the gold medal to Lysacek rather than Plushenko. So, outcomes
may be judged differently depending upon the type of rubric employed.
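
To see how the two rubric types can rank the same performances differently, consider the following sketch with invented element scores; the element names, values, and late-program bonus rule are hypothetical stand-ins, not the actual Olympic judging system.

```python
# A hypothetical illustration of holistic vs. analytic judging reaching
# different outcomes. Element names, scores, and the late-program bonus
# rule are invented; they are not the actual Olympic judging system.
programs = {
    "Skater A": {"quad": 10.0, "triple_axel": 7.0, "spins": 5.5, "footwork": 5.5},
    "Skater B": {"quad": 0.0, "triple_axel": 9.5, "spins": 9.0, "footwork": 9.0},
}
late_bonus = {"Skater A": 0.0, "Skater B": 2.5}  # bonus for back-loaded jumps

# Holistic judging: one overall impression of technical ability, crudely
# proxied here by whether the hardest element (the quad) was landed.
holistic_winner = max(programs, key=lambda s: programs[s]["quad"])

# Analytic judging: each element scored separately, then summed with bonuses.
totals = {s: sum(parts.values()) + late_bonus[s] for s, parts in programs.items()}
analytic_winner = max(totals, key=totals.get)

print(holistic_winner)   # Skater A (landed the quad)
print(analytic_winner)   # Skater B (30.0 vs. 28.0 total)
```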
Rubrics are often designed by a group of teachers, faculty members, and/or assessment
representatives to measure underlying, unobservable concepts via observable traits. Thus, a
rubric is a scaled rating designed to quantify levels of learner performance, and it provides
scoring standards that focus and guide authentic assessment activities. Although rubrics are
described as objective and consistent scoring guides, they are also criticized for lacking
evidence of reliability and validity. One way to rectify this situation is to conceptualize
gathering evidence of rubric reliability and validity as part of an assessment loop.
Rubric Assessment Loop
Figure 1 depicts three steps involved in the iterative, continuous quality improvement
process of assessment as applied to rubrics. An assessment practitioner may formulate a rubric
to assess an authentic learning task, then conduct studies to validate learning outcomes, and then
make an assessment decision that either closes the assessment loop or leads to a subsequent
round of rubric revision, rubric validation, and assessment decision. The focus of the assessment
decision may range from the classroom level to the program level to the university level.
Figure 1. Rubric Assessment Loop.
While individuals involved in assessment have likely seen “feedback loop” figures reminiscent
of Figure 1, these figures need to be translated into a practitioner-friendly form that will take
educators to the next level conceptually. Accordingly, this chapter introduces a heuristic
rubrics cube to facilitate this task.
Rubrics Cube
Figure 2 illustrates the heuristic rubrics cube, in which height represents three levels
of assessment stakes (assessment decisions), width represents two methodological approaches
for developing an evidentiary basis for the validity of those assessment decisions, and depth
represents three learning taxonomies.
Figure 2. Rubrics cube.
The cell entries on the face of the cube in Figure 2 are the authors’ estimates of the likely
correspondence between the reliability and validity estimation methods minimally acceptable
for each corresponding assessment-stakes level. Ideally, the cell entries would capture a more
fluid interplay between the types of methods utilized to estimate reliability and validity and
the evidentiary argument supporting the assessment decision.
As Geisinger, Shaw, and McCormick (this volume) note, the concept of validity as
modeled by classical test theory, generalizability theory, and multi-faceted Rasch Measurement
(MFRM) is being re-conceptualized (Kane, 1992; Kane, 1994; Kane, 2006; Smith and
Kulikowich, 2004; Stemler, 2004) into a more unified approach to gathering and assembling an
evidentiary basis or argument that supports the validity of the assessment decision. The current
authors add that as part of this validity re-conceptualization, it is important to recognize the
resource limitations sometimes facing educators and to strive to create “educator-friendly
validation environments” that will aid educators in their task of validating assessment decisions.
Moving on to a discussion of the assessment-stakes dimension of Figure 2, the authors
note that compliance with the demands posed by such external agents as accreditors and
licensure/certification boards has created an assessment continuum. One end of this continuum
is characterized by low-stakes assessment decisions such as grade-related assignments.
Historically, many of the traditional learning assessment decisions are represented here. The
other end of the assessment-stakes continuum is characterized by high-stakes assessment
decisions. Licensure/certification-driven assessment decisions are represented here, as is the
use of portfolios in licensure/certification decisions. Exactly where accreditation falls on the
assessment-stakes continuum may vary by institution. Because academic institutions typically
need to be accredited in order to demonstrate the quality and value of their education, the authors
place accreditation on the high stakes end of the assessment-stakes dimension of Figure 2 for the
reasons described next.
While regional accreditors making assessment decisions may not currently demand the
reliability and validity evidence depicted in the high-stakes methods cells of Figure 2, the federal
emphasis on outcome measures, as advocated in the Spellings Commission’s report on the future
of U.S. higher education (U.S. Department of Education, 2006), suggests accrediting agencies
and academic institutions may be pressured increasingly to provide evidence of reliability and
validity to substantiate assessment decisions. Indeed, discipline-specific accrediting
organizations such as the Accreditation Council for Business Schools and Programs, the
Commission on Collegiate Nursing Education, the National Council for Accreditation of Teacher
Education, and many others have promoted learning-outcomes-based assessment for some time.
Because many institutions wishing to gather the reliability and validity evidence suggested by
Figure 2 may lack the resources needed to conduct high-stakes assessment activities, the authors
see the need to promote initiatives at many levels that lead to educator-friendly assessment
environments. For example, it is reasonable for academic institutions participating in
commercially-based learning outcomes testing programs to expect that the test developer provide
transparent evidence substantiating the reliability and validity of the commercial examination.
Freed from the task of estimating the reliability and validity of the commercial examination,
local educators can concentrate their valuable assessment resources and efforts on
triangulating/correlating the commercial examination scores with scores on other local measures
of learning progress. An atmosphere of open, collegial collaboration will likely be necessary to
create such educator-friendly assessment environments.
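
As an illustration of that triangulation step, the following is a minimal sketch that correlates two hypothetical sets of matched student scores; the data and variable names are invented.

```python
# A minimal sketch of triangulating commercial examination scores with
# a local measure of learning progress (e.g., a rubric-scored capstone).
# The scores are invented and assumed already matched by student.
from scipy.stats import pearsonr

commercial_exam = [62, 75, 81, 58, 90, 70, 66, 84]
local_rubric    = [10, 13, 15,  9, 16, 12, 11, 14]

r, p = pearsonr(commercial_exam, local_rubric)
print(f"r = {r:.2f}, p = {p:.3f}")  # a piece of concurrent-validity evidence
```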
Returning to Figure 2, some of the frustration with assessment on college campuses may
stem from the fact that methods acceptable for use in low-stakes assessment contexts are
different from those needed in high-stakes contexts. Whereas low-stakes assessment can often
satisfy constituents by demonstrating good-faith efforts to establish reliability and validity, high-
stakes assessment requires that evidence of reliability and validity be shown (Wilkerson and
Lang, 2003). Unfortunately, the situation can arise where programs engaged in low-stakes
assessment activities resist the more rigorous methods they encounter as they become a part of
high-stakes assessment. For example, the assessment of critical thinking may be considered a
low-stakes assessment situation for faculty members and students when conducted as part of a
course on writing in which the critical thinking score represents a small portion of the overall
grade. However, in cases where a university has made students’ attainment of critical thinking
skills part of its mission statement, then assessing students’ critical thinking skills becomes a
high-stakes assessment for the university as it gathers evidence to be used for accreditation and
university decision-making. Similarly, while students can be satisfied in creating low-stakes
portfolios to showcase their work, they may resist high-stakes portfolio assessment efforts
designed to provide an alternative to standardized testing.
Succinctly stated, high-stakes assessment involves real-life contexts in which the learner’s
behavior or performance has critical consequences (e.g., license to practice, academic
accreditation). In contrast, in low-stakes assessment contexts, the learner’s performance has
minimal consequences (e.g., a pass/fail quiz, a formative evaluation). Thus, from a stakeholder
perspective, the primary assessment function in a low-stakes context is to assess the learner’s
progress. On the other hand, the primary assessment function in a high-stakes context is to
establish that the mission-critical learning outcome warrants, for example, licensure to
practice (from the stakeholder perspective of the learner) or accreditation (from the
stakeholder perspective of an institution). Lastly, a moderate-stakes context falls between
these two end points: its primary assessment function is to assess that moderating outcomes
(e.g., work-study experiences, internship experiences, volunteering, graduation) are
experienced and successfully accomplished.
For the low-stakes assessment level, at a minimum, a content validity argument needs to
be established. In the authors’ judgment, content, construct, and concurrent validity would
typically need to be demonstrated for a validity argument at the moderate-stakes assessment
level, but predictive validity need not be. To demonstrate a validity argument at the
high-stakes assessment level, all types of validation methods, including content validity,
construct validity (i.e., convergent and discriminant), and criterion-related validity (i.e.,
concurrent and predictive validity), need to be demonstrated (see Thorndike and Hagen, 1977,
for definitional explanations).
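
Read as a lookup from stakes level to minimally required validity evidence, the authors’ correspondence might be summarized as in the following sketch; the labels paraphrase the categories above and are not a standard taxonomy or API.

```python
# A minimal sketch of the stakes-to-validity correspondence described
# above, expressed as a simple lookup. The labels paraphrase the
# chapter's categories and are not a standard taxonomy.
REQUIRED_VALIDITY_EVIDENCE = {
    "low": ["content"],
    "moderate": ["content", "construct", "concurrent"],
    "high": ["content", "construct (convergent, discriminant)",
             "concurrent", "predictive"],
}

def evidence_needed(stakes_level):
    """Return the validity evidence minimally expected at a stakes level."""
    return REQUIRED_VALIDITY_EVIDENCE[stakes_level]

print(evidence_needed("moderate"))  # ['content', 'construct', 'concurrent']
```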
With regard to inter-rater reliability estimation, for the low-stakes assessment level,
percentage of rater agreement (consensus) needs to be demonstrated. For the moderate- and
high-stakes assessment levels, both consensus and consistency need to be demonstrated.
Needless to say, there are a variety of non-statistical and statistical approaches to
estimating these types of reliability and validity. For those having the software and
resources to do so, the MFRM approach can be an efficient means of establishing an evidentiary
base for estimating the reliability and validity of rubrics. The important consideration in
validation is that it is the interpretation of rubric scores upon which validation arguments
are made.
Before examining the interplay of learning taxonomies with assessment decisions, a few
additional comments on methods may be helpful. First, the reliability methods portrayed in the
rubrics cube ignore such reliability estimates as internal consistency of items, test-retest,
parallel forms, and split-half. Instead, the chapter focuses on inter-rater reliability
estimates because it is critical to establish inter-rater reliability whenever human judgments
form the basis of a rubric score. In creating the methods dimension of the rubrics cube, the
authors adapted Stemler’s (2004) distinction between consensus, consistency, and measurement
approaches to estimating inter-rater reliability. Figure 2 indicates that at a
low-assessment-stakes level, consensus is an appropriate method for estimating inter-rater
reliability. Estimating the degree of consensus between judges can be as simple as calculating
the percent of agreement between pairs of judges or as intricate as collaborative quality
filtering, which assigns greater weights to more accurate judges (Traupman and Wilensky, 2004).
Figure 2 also shows that as the assessment-stakes level moves to the “moderate” and “high”
levels, both consistency and consensus are relevant methods for estimating inter-rater
reliability. Among the more popular approaches, consistency can be estimated via an intraclass
correlation derived using an analysis of variance in which judges/raters form the between
factor and the within-ratee variation is a function of both between-judge variation and
residual variation (Winer, 1971, pp. 283-289). Consistency can also be estimated using an item
response theory (IRT) approach (see Osterlind and Wang, this volume; Linacre, 2003). Because
reliability is a necessary, but not sufficient, condition for validity, it is listed first on
the methods dimension.
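
The following is a minimal sketch of the two inter-rater estimates just discussed: consensus as percent exact agreement, and consistency as an intraclass correlation computed from a two-way ANOVA decomposition. The ICC form shown is the consistency ICC(3,1) of Shrout and Fleiss, one common choice consistent with the ANOVA framing above; the rubric scores are invented.

```python
# Consensus (percent exact agreement) and consistency (intraclass
# correlation from a two-way ANOVA decomposition) for rubric scores.
import numpy as np

def percent_agreement(a, b):
    """Consensus estimate: proportion of ratees given identical scores."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def icc_consistency(scores):
    """Consistency estimate: ICC(3,1) from an n-ratee x k-rater matrix."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_ratees = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between-ratee
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between-rater
    ms_ratees = ss_ratees / (n - 1)
    ms_error = (ss_total - ss_ratees - ss_raters) / ((n - 1) * (k - 1))
    return (ms_ratees - ms_error) / (ms_ratees + (k - 1) * ms_error)

# Hypothetical rubric scores (1-4 scale) from two raters on eight essays.
rater_a = [3, 4, 2, 3, 1, 4, 2, 3]
rater_b = [3, 4, 2, 2, 1, 4, 3, 3]
print(f"Percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"ICC (consistency): "
      f"{icc_consistency(np.column_stack([rater_a, rater_b])):.2f}")
```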
Having discussed the contingency of the various types of validation methods on the
assessment-stakes-level dimension of the rubrics cube, a few general comments on validation
methods as they relate to rubrics are warranted here. Detailed definitions and discussions of the
types of validity are available in The Standards for Educational and Psychological Testing
(American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education, 1999). In order for others to benefit from and replicate the
validation of rubrics described in the assessment literature, it is essential that the rubric scaling
utilized be accurately described. It is also essential that researchers describe the way in which