As the call for accountability in higher education sets the tone for campus-wide learning outcomes
assessment, questions about how to conduct meaningful, reliable, and valid assessments to help students
learn have gained increased prominence. The chapter by Yen and Hynes presents the unique
contribution of a heuristic rubrics cube for authentic assessment validation. Just as Bloom’s taxonomy
maps the dimensions of cognitive abilities, the Yen and Hynes cube cohesively assembles three
dimensions (cognitive-behavioral-affective taxonomies, stakes of an assessment, and reliability and
validity) in a cube for mapping and organizing efforts in authentic assessment. The chapter includes a
broad discussion of rubric development. As an explication of the heuristic rubrics cube, they examine six
key studies to show how different types of reliability and validity were estimated for low, medium, and
high stakes assessment decisions.
Authentic Assessment Validation: A Heuristic Rubrics Cube
Jion Liou Yen, Lewis University
Kevin Hynes, Midwestern University
Authentic assessment entails judging student learning by measuring performance
according to real-life-skills criteria. This chapter focuses on the validation of authentic
assessments because empirically-based, authentic assessment-validation studies are sparsely
reported in the higher education literature. At the same time, many of the concepts addressed in
this chapter are more broadly applicable to assessment tasks that may not be termed “authentic
assessment” and are valuable from this broader assessment perspective as well. Accordingly, the
chapter introduces a heuristic, rubrics cube which can serve as a tool for educators to
conceptualize or map their authentic and other assessment activities and decisions on the
following three dimensions: type and level of taxonomy, level of assessment decision, and types
of validation methods.
What is a Rubric?
The call for accountability in higher education has set the tone for campus-wide
assessment. As the concept of assessment gains prominence on campuses, so do questions about
how to conduct meaningful, reliable, and valid assessments. Authentic assessment has been
credited by many as a meaningful approach to student learning assessment (Aitken and Pungur,
2010; Banta et al., 2009; Eder, 2001; Goodman et al., 2008; Mueller, 2010; Spicuzza and
Cunningham, 2003). When applied to authentic assessment, a rubric guides the evaluation of
student work against specific criteria, from which a score is generated to quantify student
performance. According to Walvoord (2004), “A rubric articulates in writing the various
criteria and standards that a faculty member uses to evaluate student work” (p. 19).
There are two types of rubrics—holistic and analytic. Holistic rubrics assess the overall
quality of a performance or product and can vary considerably in complexity. For example,
Moskal (2000) describes a simple holistic rubric for judging the quality of student writing
that involves four categories ranging from “inadequate” to “needs improvement” to “adequate”
to “meets expectations for a first draft of a professional report”. Each category contains a
few additional phrases to describe it more fully. In contrast, Suskie (2009) presents a more
complex example: a four-category holistic rubric for judging a student’s ballet performance
that utilizes up to 15 explanatory phrases. So, even though a rubric may employ multiple
categories for judging student learning, it remains a holistic rubric if it provides only one
overall assessment of the quality of student learning.
The primary difference between a holistic and an analytic rubric is that the latter breaks
a performance or product into several individual components and judges each part separately
on a scale that includes descriptors. Thus, an analytic rubric resembles a matrix with
two axes—dimensions (usually referred to as criteria) and levels of performance (as specified
by rating scales and descriptors). The descriptors are more important than the values assigned
to them because scaling may vary across constituent groups; keeping descriptors constant
allows cross-group comparison (Hatfield, personal communication, 2010). Although it takes time
to develop clearly defined, unambiguous descriptors, they are essential for communicating
performance expectations in rubrics and for facilitating the scoring process (Suskie, 2009;
Walvoord, 2004).
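
To make the matrix structure concrete, the following is a minimal sketch of an analytic rubric represented as data. The criteria, descriptors, and scale are invented for illustration and are not drawn from any published rubric.

```python
# A minimal sketch of an analytic rubric as a two-axis matrix: criteria
# (rows) crossed with performance levels (columns of descriptors).
# All criterion names, descriptors, and scores are invented examples.
ANALYTIC_RUBRIC = {
    "organization": {
        1: "No discernible structure",
        2: "Some structure, but transitions are abrupt",
        3: "Clear structure with smooth transitions",
    },
    "evidence": {
        1: "Claims are unsupported",
        2: "Claims are supported by limited or weak sources",
        3: "Claims are consistently supported by credible sources",
    },
}

def score_work(ratings):
    """Sum per-criterion ratings into one analytic total.

    `ratings` maps each criterion to the level a rater selected. The
    descriptors, not the numbers, carry the meaning, so the numeric
    scale could be rescaled without changing the rubric itself.
    """
    for criterion, level in ratings.items():
        assert level in ANALYTIC_RUBRIC[criterion], "rating off the scale"
    return sum(ratings.values())

# One rater's judgment of one student paper (hypothetical).
print(score_work({"organization": 3, "evidence": 2}))  # -> 5
```

Note how the numeric levels are arbitrary anchors for the descriptors; as the Hatfield observation above suggests, it is the constant descriptors that make scores comparable across groups.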
To further elucidate the differences between holistic and analytic rubrics, consider how
figure skating performance might be judged using a holistic rubric versus an analytic rubric.
If one were to employ a holistic rubric to judge the 2010 Olympic men’s figure skating based
solely on technical ability, one might award the gold medal to Russian skater Evgeni Plushenko
because he performed a “quad” whereas American skater Evan Lysacek did not. On the other hand,
if one used an analytic rubric to judge each part of the performance separately, in a matrix
comprising a variety of specific jumps with bonus points awarded for jumps performed later in
the program, then one might award the gold medal to Lysacek rather than Plushenko. So, outcomes
may be judged differently depending upon the type of rubric employed.
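
To see how the two rubric types can rank the same performances differently, consider the following sketch with invented element scores; the element names, values, and late-program bonus rule are hypothetical stand-ins, not the actual Olympic judging system.

```python
# A hypothetical illustration of holistic vs. analytic judging reaching
# different outcomes. Element names, scores, and the late-program bonus
# rule are invented; they are not the actual Olympic judging system.
programs = {
    "Skater A": {"quad": 10.0, "triple_axel": 7.0, "spins": 5.5, "footwork": 5.5},
    "Skater B": {"quad": 0.0, "triple_axel": 9.5, "spins": 9.0, "footwork": 9.0},
}
late_bonus = {"Skater A": 0.0, "Skater B": 2.5}  # bonus for back-loaded jumps

# Holistic judging: one overall impression of technical ability, crudely
# proxied here by whether the hardest element (the quad) was landed.
holistic_winner = max(programs, key=lambda s: programs[s]["quad"])

# Analytic judging: each element scored separately, then summed with bonuses.
totals = {s: sum(parts.values()) + late_bonus[s] for s, parts in programs.items()}
analytic_winner = max(totals, key=totals.get)

print(holistic_winner)   # Skater A (landed the quad)
print(analytic_winner)   # Skater B (30.0 vs. 28.0 total)
```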
Rubrics are often designed by a group of teachers, faculty members, and/or assessment
representatives to measure underlying, unobservable concepts via observable traits. Thus, a
rubric is a scaled rating designed to quantify levels of learner performance, and it provides
scoring standards that focus and guide authentic assessment activities. Although rubrics are
described as objective and consistent scoring guides, they are also criticized for lacking
evidence of reliability and validity. One way to rectify this situation is to conceptualize
gathering evidence of rubric reliability and validity as part of an assessment loop.
Rubric Assessment Loop
Figure 1 depicts three steps involved in the iterative, continuous quality improvement
process of assessment as applied to rubrics. An assessment practitioner may formulate a rubric
to assess an authentic learning task, then conduct studies to validate learning outcomes, and then
make an assessment decision that either closes the assessment loop or leads to a subsequent
round of rubric revision, rubric validation, and assessment decision. The focus of the assessment
decision may range from the classroom level to the program level to the university level.
Figure 1. Rubric Assessment Loop.
While individuals involved in assessment have likely seen “feedback loop” figures reminiscent
of Figure 1, these figures need to be translated into a practitioner-friendly form that will take
educators to the next level conceptually. Accordingly, this chapter introduces a heuristic
rubrics cube to facilitate this task.
Rubrics Cube
Figure 2 illustrates the heuristic rubrics cube, in which height represents three levels
of assessment stakes (assessment decisions), width represents two methodological approaches
for developing an evidentiary basis for the validity of those assessment decisions, and depth
represents three learning taxonomies.
Figure 2. Rubrics cube.
The cell entries on the face of the cube in Figure 2 are the authors’ estimates of the likely
correspondence between the reliability and validity estimation methods minimally acceptable
for each corresponding assessment-stakes level. Ideally, the cell entries would capture a more
fluid interplay between the types of methods utilized to estimate reliability and validity and
the evidentiary argument supporting the assessment decision.
As Geisinger, Shaw, and McCormick (this volume) note, the concept of validity as
modeled by classical test theory, generalizability theory, and multi-faceted Rasch Measurement
(MFRM) is being re-conceptualized (Kane, 1992; Kane, 1994; Kane, 2006; Smith and
Kulikowich, 2004; Stemler, 2004) into a more unified approach to gathering and assembling an
evidentiary basis or argument that supports the validity of the assessment decision. The current
authors add that as part of this validity re-conceptualization, it is important to recognize the
resource limitations sometimes facing educators and to strive to create “educator-friendly
validation environments” that will aid educators in their task of validating assessment decisions.
Moving on to a discussion of the assessment-stakes dimension of Figure 2, the authors
note that compliance with the demands posed by such external agents as accreditors and
licensure/certification boards has created an assessment continuum. One end of this continuum
is characterized by low-stakes assessment decisions such as grade-related assignments.
Historically, many of the traditional learning assessment decisions are represented here. The
other end of the assessment-stakes continuum is characterized by high-stakes assessment
decisions. Licensure/certification-driven assessment decisions are represented here, as is the
use of portfolios in licensure/certification decisions. Exactly where accreditation falls on the
assessment-stakes continuum may vary by institution. Because academic institutions typically
need to be accredited in order to demonstrate the quality and value of their education, the authors
place accreditation on the high stakes end of the assessment-stakes dimension of Figure 2 for the
reasons described next.
While regional accreditors making assessment decisions may not currently demand the
reliability and validity evidence depicted in the high-stakes methods cells of Figure 2, the federal
emphasis on outcome measures, as advocated in the Spellings Commission’s report on the future
of U.S. higher education (U.S. Department of Education, 2006), suggests accrediting agencies
and academic institutions may be pressured increasingly to provide evidence of reliability and
validity to substantiate assessment decisions. Indeed, discipline-specific accrediting
organizations such as the Accreditation Council for Business Schools and Programs, the
Commission on Collegiate Nursing Education, the National Council for Accreditation of Teacher
Education, and many others have promoted learning-outcomes-based assessment for some time.
Because many institutions wishing to gather the reliability and validity evidence suggested by
Figure 2 may lack the resources needed to conduct high-stakes assessment activities, the authors
see the need to promote initiatives at many levels that lead to educator-friendly assessment
environments. For example, it is reasonable for academic institutions participating in
commercially-based learning outcomes testing programs to expect that the test developer provide
transparent evidence substantiating the reliability and validity of the commercial examination.
Freed from the task of estimating the reliability and validity of the commercial examination,
local educators can concentrate their valuable assessment resources and efforts on
triangulating/correlating the commercial examination scores with scores on other local measures
of learning progress. An atmosphere of open, collegial collaboration will likely be necessary to
create such educator-friendly assessment environments.
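
As an illustration of that triangulation step, the following is a minimal sketch that correlates two hypothetical sets of matched student scores; the data and variable names are invented.

```python
# A minimal sketch of triangulating commercial examination scores with
# a local measure of learning progress (e.g., a rubric-scored capstone).
# The scores are invented and assumed already matched by student.
from scipy.stats import pearsonr

commercial_exam = [62, 75, 81, 58, 90, 70, 66, 84]
local_rubric    = [10, 13, 15,  9, 16, 12, 11, 14]

r, p = pearsonr(commercial_exam, local_rubric)
print(f"r = {r:.2f}, p = {p:.3f}")  # a piece of concurrent-validity evidence
```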
Returning to Figure 2, some of the frustration with assessment on college campuses may
stem from the fact that methods acceptable for use in low-stakes assessment contexts are
different from those needed in high-stakes contexts. Whereas low-stakes assessment can often
satisfy constituents by demonstrating good-faith efforts to establish reliability and validity, high-
stakes assessment requires that evidence of reliability and validity be shown (Wilkerson and
Lang, 2003). Unfortunately, the situation can arise where programs engaged in low-stakes
assessment activities resist the more rigorous methods they encounter as they become a part of
high-stakes assessment. For example, the assessment of critical thinking may be considered a
low-stakes assessment situation for faculty members and students when conducted as part of a
course on writing in which the critical thinking score represents a small portion of the overall
grade. However, in cases where a university has made students’ attainment of critical thinking
skills part of its mission statement, then assessing students’ critical thinking skills becomes a
high-stakes assessment for the university as it gathers evidence to be used for accreditation and
university decision-making. Similarly, while students can be satisfied in creating low-stakes
portfolios to showcase their work, they may resist high-stakes portfolio assessment efforts
designed to provide an alternative to standardized testing.
Succinctly stated, high-stakes assessment involves real-life contexts in which the learner’s
behavior or performance has critical consequences (e.g., license to practice, academic
accreditation). In contrast, in low-stakes assessment contexts, the learner’s performance has
minimal consequences (e.g., a pass/fail quiz, a formative evaluation). Thus, from a stakeholder
perspective, the primary assessment function in a low-stakes context is to assess the learner’s
progress. On the other hand, the primary assessment function in a high-stakes context is to
establish that the mission-critical learning outcome warrants, for example, licensure to
practice (from the stakeholder perspective of the learner) or accreditation (from the
stakeholder perspective of an institution). Lastly, a moderate-stakes context falls between
these two end points: its primary assessment function is to assess that moderating outcomes
(e.g., work-study experiences, internship experiences, volunteering, graduation) are
experienced and successfully accomplished.
For the low-stakes assessment level, at a minimum, a content validity argument needs to
be established. In the authors’ judgment, content, construct, and concurrent validity would
typically need to be demonstrated for a validity argument at the moderate-stakes assessment
level, but predictive validity need not be. To demonstrate a validity argument at the
high-stakes assessment level, all types of validation methods, including content validity,
construct validity (i.e., convergent and discriminant), and criterion-related validity (i.e.,
concurrent and predictive validity), need to be demonstrated (see Thorndike and Hagen, 1977,
for definitional explanations).
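
Read as a lookup from stakes level to minimally required validity evidence, the authors’ correspondence might be summarized as in the following sketch; the labels paraphrase the categories above and are not a standard taxonomy or API.

```python
# A minimal sketch of the stakes-to-validity correspondence described
# above, expressed as a simple lookup. The labels paraphrase the
# chapter's categories and are not a standard taxonomy.
REQUIRED_VALIDITY_EVIDENCE = {
    "low": ["content"],
    "moderate": ["content", "construct", "concurrent"],
    "high": ["content", "construct (convergent, discriminant)",
             "concurrent", "predictive"],
}

def evidence_needed(stakes_level):
    """Return the validity evidence minimally expected at a stakes level."""
    return REQUIRED_VALIDITY_EVIDENCE[stakes_level]

print(evidence_needed("moderate"))  # ['content', 'construct', 'concurrent']
```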
With regard to inter-rater reliability estimation, for the low-stakes assessment level,
percentage of rater agreement (consensus) needs to be demonstrated. For the moderate- and
high-stakes assessment levels, both consensus and consistency need to be demonstrated.
Needless to say, there are a variety of non-statistical and statistical approaches to
estimating these types of reliability and validity. For those having the software and
resources to do so, the MFRM approach can be an efficient means of establishing an evidentiary
base for estimating the reliability and validity of rubrics. The important consideration in
validation is that it is the interpretation of rubric scores upon which validation arguments
are made.
Before examining the interplay of learning taxonomies with assessment decisions, a few
additional comments on methods may be helpful. First, the reliability methods portrayed in the
rubrics cube ignore such reliability estimates as internal consistency of items, test-retest,
parallel forms, and split-half. Instead, the chapter focuses on inter-rater reliability
estimates because it is critical to establish inter-rater reliability whenever human judgments
form the basis of a rubric score. In creating the methods dimension of the rubrics cube, the
authors adapted Stemler’s (2004) distinction between consensus, consistency, and measurement
approaches to estimating inter-rater reliability. Figure 2 indicates that at a
low-assessment-stakes level, consensus is an appropriate method for estimating inter-rater
reliability. Estimating the degree of consensus between judges can be as simple as calculating
the percent of agreement between pairs of judges or as intricate as collaborative quality
filtering, which assigns greater weights to more accurate judges (Traupman and Wilensky, 2004).
Figure 2 also shows that as the assessment-stakes level moves to the “moderate” and “high”
levels, both consistency and consensus are relevant methods for estimating inter-rater
reliability. Among the more popular approaches, consistency can be estimated via an intraclass
correlation derived using an analysis of variance in which judges/raters form the between
factor and the within-ratee variation is a function of both between-judge variation and
residual variation (Winer, 1971, pp. 283-289). Consistency can also be estimated using an item
response theory (IRT) approach (see Osterlind and Wang, this volume; Linacre, 2003). Because
reliability is a necessary, but not sufficient, condition for validity, it is listed first on
the methods dimension.
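
The following is a minimal sketch of the two inter-rater estimates just discussed: consensus as percent exact agreement, and consistency as an intraclass correlation computed from a two-way ANOVA decomposition. The ICC form shown is the consistency ICC(3,1) of Shrout and Fleiss, one common choice consistent with the ANOVA framing above; the rubric scores are invented.

```python
# Consensus (percent exact agreement) and consistency (intraclass
# correlation from a two-way ANOVA decomposition) for rubric scores.
import numpy as np

def percent_agreement(a, b):
    """Consensus estimate: proportion of ratees given identical scores."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def icc_consistency(scores):
    """Consistency estimate: ICC(3,1) from an n-ratee x k-rater matrix."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_ratees = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between-ratee
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between-rater
    ms_ratees = ss_ratees / (n - 1)
    ms_error = (ss_total - ss_ratees - ss_raters) / ((n - 1) * (k - 1))
    return (ms_ratees - ms_error) / (ms_ratees + (k - 1) * ms_error)

# Hypothetical rubric scores (1-4 scale) from two raters on eight essays.
rater_a = [3, 4, 2, 3, 1, 4, 2, 3]
rater_b = [3, 4, 2, 2, 1, 4, 3, 3]
print(f"Percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"ICC (consistency): "
      f"{icc_consistency(np.column_stack([rater_a, rater_b])):.2f}")
```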
Having discussed the contingency of the various types of validation methods on the
assessment-stakes-level dimension of the rubrics cube, a few general comments on validation
methods as they relate to rubrics are warranted here. Detailed definitions and discussions of the
types of validity are available in The Standards for Educational and Psychological Testing
(American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education, 1999). In order for others to benefit from and replicate the
validation of rubrics described in the assessment literature, it is essential that the rubric scaling
utilized be accurately described. It is also essential that researchers describe the way in which