Validating Formative and Interim Assessments Under ESSA
Michael B. Bunch
Measurement Incorporated
Introduction
This paper focuses on an orderly integration of formative, interim, and summative
assessments to improve student achievement, as described in the Every Student Succeeds
Act (ESSA) of 2015. Within that framework, it addresses the validation of formative
assessments in terms of growth over time, rather than long-term prediction of
performance. The primary vehicle for this work is Measurement Incorporated’s Project
Essay Grade (PEG), an automated scoring engine we have been using for several years for
formative, interim, and summative assessments.
ESSA
ESSA provides hundreds of millions of dollars for states to develop (either individually or
collaboratively) balanced assessment systems comprising summative, interim, and
formative assessments and to assist local education agencies in implementing them.
Specifically:
“SEC. 1204. Innovative assessment and accountability demonstration authority.
“(a) Innovative assessment system defined.—The term ‘innovative assessment system’
means a system of assessments that may include—
“(1) competency-based assessments, instructionally embedded assessments, interim
assessments, cumulative year-end assessments, or performance-based assessments that
combine into an annual summative determination for a student, which may be
administered through computer adaptive assessments; and
“(2) assessments that validate when students are ready to demonstrate mastery or
proficiency and allow for differentiated student support based on individual learning needs.
Title II (Section 2103) authorizes funds to provide training to classroom teachers,
principals, or other school leaders for the purpose of implementing these assessments.
This training can include inservice on new assessment technologies and means of
integrating results of those assessments into classroom instruction.
Additional workshops, tailored to individual schools’ needs, were also provided. For
example, one session was specifically designed for a school’s art, music, library, and
technology teachers. They learned how to collaborate with general classroom teachers
within NC Write to support cross-curricular writing. Following that training, all specialist
teachers were added to existing regular education courses, allowing cross-curricular
collaboration and support.
In addition to on-site sessions, all DPS administrators and teachers had access to free NC
Write User Experience Webinars. These webinars were conducted by members of the NC
Write team who were also former educators. The webinars were conducted live so
participants could ask questions and were also recorded so they could be viewed on
demand at any time. During the 2015-16 school year, MI provided 20 webinars for DPS.
Validation
As noted previously, predictive validity frameworks will fail if the formative assessments
properly inform intervening instruction to the point that all or most students reach mastery,
regardless of their starting points. Growth or performance relative to expectation would be
more appropriate criteria. Nichols, Meyers, & Burling (2009) provided a framework for
evaluating formative assessments, in response to criticisms of “so-called” formative
assessments made by Wiliam & Black (1996, p. 543): “…in order to serve a formative
function, an assessment must yield evidence that, with appropriate construct-referenced
interpretations, indicates the existence of a gap between actual and desired levels of
performance, and suggests actions that are in fact successful in closing the gap.”
In short, formative assessments are valid to the extent that they permit or guide instruction
that leads to improved student performance, typically measured by scores on subsequent
formative, interim, or summative assessments. Thus, our focus should be on score increases
from one formative assessment to another (formative to formative), from a series of
formative assessments to an interim assessment (formative to interim), or from a series of
formative and interim assessments to the summative assessment (formative/interim to
summative). It might also be appropriate to give some attention to the relationship
between the formative assessments and the curriculum or instruction (alignment). In fact,
we should start there.
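The score-increase criteria just described can be sketched in code. In this illustration, growth is the mean gain from each formative assessment to the next, and from the last formative assessment to the interim; all record layouts and score values are hypothetical.

```python
# Sketch: growth-based validity evidence. Each (hypothetical) student record
# holds a sequence of formative scores plus an interim score.

from statistics import mean

def formative_gains(scores):
    """Score change from each formative assessment to the next."""
    return [b - a for a, b in zip(scores, scores[1:])]

def mean_gain(students):
    """Average formative-to-formative gain across a group of students."""
    return mean(g for s in students for g in formative_gains(s["formative"]))

students = [
    {"formative": [12, 14, 17], "interim": 19},
    {"formative": [10, 13, 15], "interim": 18},
]

print(mean_gain(students))  # average formative-to-formative gain
print(mean(s["interim"] - s["formative"][-1] for s in students))  # formative-to-interim gain
```

The same pattern extends to formative/interim-to-summative comparisons by adding a summative score to each record.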
Alignment. Nichols et al. (2009) note that formative assessment validation does not begin
with performance improvement; rather, it begins with evidence of a meaningful
relationship between the assessment and the relevant criterion: alignment. The literature
on alignment has focused almost exclusively on summative assessments (cf. Porter,
Smithson, Blank, & Zeidner, 2007; Webb, 2007). Porter introduced the Survey of Enacted Curriculum
(SEC; seconline.wceruw.org). Webb has given us the Webb Alignment Tool (WAT,
wat.wceruw.org). Both employ Webb’s depth of knowledge (DOK,
dese.mo.gov/divimprove/sia/msip/DOK_Chart.pdf) scale. With these tools, educators have
been able to plot curriculum, instruction, and assessment on a two-dimensional grid to create a
variety of useful visual displays.
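The two-dimensional grid these tools produce can be illustrated with a small sketch: items are tallied by content strand and Webb DOK level, and the cell counts show where the assessment concentrates. The strand names and DOK codings below are invented for illustration only.

```python
# Sketch of a content-strand x DOK alignment grid, with invented item codings.

from collections import Counter

# (content strand, DOK level) for each assessment item -- hypothetical
items = [
    ("Ideas", 2), ("Ideas", 3), ("Organization", 2),
    ("Conventions", 1), ("Conventions", 1), ("Style", 3),
]

grid = Counter(items)  # (strand, DOK) -> number of items in that cell

strands = sorted({s for s, _ in items})
print(f"{'Strand':<14}" + "".join(f"DOK{d:<3}" for d in (1, 2, 3, 4)))
for s in strands:
    print(f"{s:<14}" + "".join(f"{grid[(s, d)]:<6}" for d in (1, 2, 3, 4)))
```

Curriculum and instruction can be tallied on the same grid, so that gaps between what is taught and what is assessed show up as mismatched cells.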
Alignment of a single summative assessment to an enacted curriculum is a fairly time-
consuming process that would be prohibitive for a series of several formative assessments.
Alignment of and for formative assessment tends to be more ad hoc. Greenstein (2010) has
provided a template for classroom teachers to create formative assessments and align them
with instruction based on pretest data (see Figure 1 above). Her approach is to create the
formative assessments on the fly, make quick adjustments based on student performance,
and integrate assessment and instruction on an almost daily basis, allowing each to inform
the other. This is essentially the manner in which PEG aligns writing formative assessments
and instructional modules and the manner in which classroom teachers use them. Success
of the alignment process can then be measured in terms of progress on subsequent
formative, interim, and summative assessments.
The Greenstein (2010) book is written from the perspective of a classroom teacher and, like
those of Rick Stiggins (e.g., Stiggins, 2014; Stiggins & Chappuis, 2012; Stiggins & Conklin,
1992) is based more in practice than in theory. Nevertheless, her recommendations are very
much in line with those of Nichols et al. (2009).
Formative to formative. Wilson (2012) found that the PEG-based Connecticut Benchmark
Assessment System for Writing (CBAS-Writing) was effective not only in identifying
struggling writers but in helping them move from struggling to proficient. His sample
included over 9,000 students in grades 3-12 and a collection of over 40,000 PEG-scored
essays. Using cluster analysis, Wilson was able to differentiate struggling writers from
proficient writers with great reliability. Through repeated use of PEG, 2/3 of the struggling
writers were able to move to a higher cluster. Typically, five to six revisions of an essay were
sufficient to move a struggling writer to the next higher cluster. For the remaining struggling
writers, additional teacher intervention was necessary.
This last point actually highlights one of the features of PEG-based writing systems. They
allow for four kinds of interactions:
Student-system – PEG provides feedback in terms of immediate scores on six traits
as well as comments on spelling and grammar. In addition, these systems provide
links to trait-based skill builders.
Teacher-system – PEG provides opportunities for teachers to monitor students’
progress and time on task.
Teacher-student – PEG also allows teachers to post feedback and suggestions to
students.
Student-student – on PEG writing sites, students can read one another’s essays and
provide feedback and suggestions.
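The clustering idea behind Wilson's (2012) analysis can be sketched in miniature: separate "struggling" from "proficient" writers by their essay scores. The toy one-dimensional two-means routine below stands in for the full cluster analysis, which used richer score profiles; the scores are invented.

```python
# Toy 1-D two-means clustering: split essay scores into a lower ("struggling")
# and a higher ("proficient") cluster. The scores are invented.

def two_means(scores, iters=20):
    """Partition scores around a low and a high center (1-D k-means, k=2)."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        lo, hi = sum(low) / len(low), sum(high) / len(high)  # update centers
    return low, high

struggling, proficient = two_means([8, 9, 10, 11, 22, 23, 24, 25])
print(struggling)   # lower cluster
print(proficient)   # higher cluster
```

In the study's terms, movement "to a higher cluster" means a student's later essays would fall in the higher group when re-clustered.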
Wilson & Andrada (2016) used hierarchical linear modeling (HLM; Raudenbush & Bryk,
2002) to analyze effects of PEG writing feedback on subsequent PEG scores. The resultant
model showed that scores improve with practice and feedback, up to about five attempts.
In other words, students’ scores improved steadily on the first through fifth revision of a
PEG-scored essay (about half a score point per revision), but not appreciably with
subsequent attempts. However, the scores on the fifth revision were significantly better
than those on the first draft.
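The revision-growth pattern reported above can be summarized by averaging scores at each revision number and examining successive gains. The records below are simulated to mimic the reported pattern (roughly half a point per revision through the fifth attempt, flat afterward); they are not Wilson & Andrada's data, and the actual study used hierarchical linear modeling rather than simple means.

```python
# Summarize score growth across essay revisions: mean score at each revision
# number, and the gain from one revision to the next. Records are simulated.

from collections import defaultdict
from statistics import mean

# (student_id, revision_number, PEG total score) -- invented records
records = [
    (1, 1, 18.0), (1, 2, 18.6), (1, 3, 19.1), (1, 4, 19.5), (1, 5, 20.0), (1, 6, 20.0),
    (2, 1, 15.0), (2, 2, 15.4), (2, 3, 16.0), (2, 4, 16.5), (2, 5, 17.1), (2, 6, 17.2),
]

by_rev = defaultdict(list)
for _, rev, score in records:
    by_rev[rev].append(score)

means = {rev: mean(v) for rev, v in sorted(by_rev.items())}
gains = {rev: round(means[rev] - means[rev - 1], 2) for rev in means if rev > 1}
print(means)
print(gains)  # per-revision gains shrink to near zero after the fifth attempt
```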
Formative to interim. Returning to the notion of alignment, it must be assumed that the formative, interim, and summative assessments that are aligned to the curriculum and instruction are also aligned to one another. This is not as easy as it may sound. In writing assessment, for example, some assessments take a holistic approach to performance, while others are trait-based. PEG happens to be trait-based, yielding scores on six traits (Ideas and Development, Organization, Style, Sentence Structure, Word Choice, and Conventions) plus
a total score that is an unweighted sum of the six trait scores. To date, we are aware of no
reported studies of interim assessment performance related to PEG or any other formative
writing assessment program.
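The trait-based scoring just described reduces to a very small computation: the total is the unweighted sum of the six trait scores. The trait values below are invented for illustration; score ranges vary by program.

```python
# Minimal sketch of PEG's six-trait scoring: total = unweighted sum of traits.

TRAITS = ("Ideas and Development", "Organization", "Style",
          "Sentence Structure", "Word Choice", "Conventions")

def total_score(trait_scores):
    """Unweighted sum of the six trait scores."""
    return sum(trait_scores[t] for t in TRAITS)

essay = {"Ideas and Development": 4, "Organization": 3, "Style": 3,
         "Sentence Structure": 4, "Word Choice": 3, "Conventions": 5}
print(total_score(essay))  # 22
```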
Formative/interim to summative. PEG writing systems are also known to have effects on
outcomes on summative assessments in two states. In Utah, Wilson (2016) found that
participation in a year-long application of PEG-based writing feedback had positive effects
on Utah’s SAGE Writing test as well as on the overall SAGE ELA test. Using the same HLM
techniques as cited above, Wilson found positive effects for number of essays written,
number of drafts per essay, and number of lessons engaged in based on PEG feedback. The
following excerpt is taken from that report:
Findings from analyses of each research question were consistent: Utah Compose usage is
positively associated with increased performance on the SAGE ELA and Writing assessments.
This was true both for students and for schools, and true even after controlling for prior literacy
achievement. In sum, the more students and schools used Utah Compose the better their
individual and school performance on SAGE. Findings underscore the educational benefits of
Utah Compose.
In Delaware, Wilson (under review) used path analysis (Wright, 1934) to tease out an
indirect effect of PEG writing on the Smarter Balanced total ELA scale score via student self-
reports on writing efficacy. Wilson examined self-efficacy (belief in one’s own competence
in a particular endeavor) as an intervening variable because of its theorized effect on
persistence in that endeavor (cf. Bruning & Kauffman, 2016). Students (n=56) in grade 6
responded to prompts at three points during the school year (October, January, and
March). The October and March prompts required the students to read stimulus materials
and/or watch video. The October prompt was informational, while the March prompt was
persuasive. The January prompt had no associated stimuli.
Students wrote a total of 1,027 essays scored by PEG and spent a total of 708 minutes on
PEG-based lessons. A control group (n=58) wrote essays using GoogleDocs without feedback
or links to remedial lessons. PEG provided feedback on six traits as well as in-text comments
on spelling and grammar errors. At the end of the year, all students responded to a Smarter
Balanced writing prompt (stimulus based, informational). Path analysis revealed that
involvement in PEG writing over the course of the year had no more direct impact on scores
on the summative assessment than did involvement in GoogleDocs writing. However, PEG
writing did have a direct effect on writing self-efficacy, and writing self-efficacy had a direct
effect on scores on the summative assessment. GoogleDocs (the control) had no such
effect.
This finding is interesting in that the interim (PEG/GoogleDocs) writing assignments differed
in important ways from the summative writing assessment; namely, the interim
assignments focused on an assortment of stimulus-based and non-stimulus-based prompts
in two genres, while the summative assessment employed a stimulus-based prompt in a
single genre and combined the writing score with a reading score. A separate writing scale
score was not available. Students who had participated in the PEG program over the course
of the academic year saw their writing performance improve and therefore persisted in
their efforts to improve. This finding is similar to that of Perie, Marion, & Gong
(2009, p. 12) that “repeated testing, in and of itself, contributed to retention.”
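The indirect-effect logic of the path analysis described above can be sketched as two simple regressions: regress self-efficacy on treatment (path a), regress the summative score on self-efficacy (path b), and take the product a*b as the indirect effect. A real path analysis estimates all paths jointly with controls; the data below are invented, and the slopes are ordinary least-squares estimates.

```python
# Sketch of an indirect (mediated) effect: treatment -> self-efficacy -> score.
# All data are invented for illustration.

from statistics import mean

def slope(x, y):
    """Simple least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

treatment     = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = PEG group, 0 = control
self_efficacy = [4.0, 4.2, 3.8, 4.0, 3.0, 3.2, 2.8, 3.0]
summative     = [2500, 2520, 2480, 2500, 2400, 2420, 2380, 2400]

a = slope(treatment, self_efficacy)   # path a: treatment -> self-efficacy
b = slope(self_efficacy, summative)   # path b: self-efficacy -> summative score
print(a * b)                          # indirect effect of treatment on score
```

In the Delaware study the direct treatment-to-score path was negligible, so the indirect path through self-efficacy carried the effect.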
Conclusions and Recommendations
The rules are changing – again. While the basic definitions of validity have shifted with each
new edition of Educational Measurement (Cureton, 1951; Cronbach, 1971; Messick, 1989;
Kane, 2006), the integration of formative, interim, and summative assessments has
fundamentally changed the nature of what we seek to validate. Indeed, Mike Kane (2006)
speaks of validity as argument, after the manner of Stephen Toulmin (1958). Kane’s
framework is an excellent one for validation of formative and interim assessment.
These changes in how we view validity or even how we view assessment were not brought
about instantly by ESSA, or by any previous renewal/revision of the Elementary and
Secondary Education Act (ESEA) of 1965 (Public Law 89-10). They have been brought about
gradually by a recognized need to focus on assessment at the classroom and individual
student level. Here we find that our longstanding definitions of validity don’t work very
well. Theory works only slightly better. Pragmatism is the order of the day.
The framers of ESSA recognized the fact that assessment embedded in instruction works. It
even seems a bit presumptuous to refer to this kind of testing as “innovative” (Section 1204(a)),
given that we have known about it for more than 50 years. Greenstein (2010) and others
writing from the perspective of the classroom have helped us realize the potential of
embedded, formative assessments, and Page (1966) and others who followed have given us
tools to make formative assessment a real possibility for teachers.
Automated scoring of essays has been proven to be reliable (Morgan et al., undated). Now,
automated scoring, particularly when coupled with instantaneous feedback and links to
instructional modules, has been shown to be valid in that it helps students write better,
both in the short term and in the longer term. This particular definition of validity may seem
to be a throwback to the earlier definitions in which validity referred to an instrument
rather than to the interpretation of scores derived from that instrument for a particular
purpose, but it is not. It simply recognizes a different context for assessing validity.
Results with the PEG program so far have been very encouraging. While this paper has
focused on PEG’s ability to score essays and provide useful feedback, it should be noted
that PEG is also being used to score short-answer constructed-response items. So far, most
of these items have been embedded in summative assessments. Formative assessments
with similar item types are not far behind. We expect to see similar research on the utility of
PEG-based systems with these assessments very soon.
Neither the current Standards for Educational and Psychological Testing (AERA/APA/NCME,
2014) nor the Operational Best Practices for Statewide Large-Scale Assessment Programs
(CCSSO/ATP, 2013) has a great deal to say about formative assessments. Moving forward,
we should make sure that a workable definition of validation of such assessments makes its
way into the next edition of these guidebooks. As classroom teachers, principals, and other
school leaders improve their assessment literacy (a stated goal of Section 2103 of ESSA), the
original purpose of formative (embedded) assessments will be fulfilled.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA.
Bruning, R. H., & Kauffman, D. F. (2016). Self-efficacy beliefs and motivation in writing development. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd Ed.) (pp. 160-173). New York, NY: Guilford Press.
Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 611-626). Hershey, PA: IGI Global.
Council of Chief State School Officers & Association of Test Publishers (2013). Operational Best Practices for Statewide Large-Scale Assessment Programs. Washington, DC: CCSSO.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.). Educational Measurement (2nd Ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. San Francisco: Jossey Bass.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational Measurement. Washington, DC: American Council on Education.
Every Student Succeeds Act of 2015. Public Law 114-95, 129 Stat. 1802 (2015).
Greenstein, L. (2010). What Teachers Really Need to Know About Formative Assessment.
Washington, DC: Association for Supervision and Curriculum Development.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.). Educational Measurement (4th Ed.). Westport, CT: American Council on Education/Praeger.
Messick, S. (1989). Validity. In R. L. Linn (Ed.) Educational Measurement (3rd Edition). Washington, DC: American Council on Education/Macmillan
Morgan, J., Shermis, M. D., Van Deventer, L. & Vander Ark, T. (undated). Automated Student Assessment Prize: Phase 1 & Phase 2: A Case Study to Promote Focused Innovation in Student Writing Assessment. Retrieved 9/1/14 from http://gettingsmart.com/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
Nichols, P. D., Meyers, J. L., & Burling, K. S. (2009). A framework for evaluating and planning assessments intended to improve student achievement. Educational Measurement: Issues and Practice, 28 (3), 14-23.
Page, E. B. (1966). The imminence of…grading essays by computer. Phi Delta Kappan, 47 (2), 238-243.
Perie, M., Marion, S., & Gong, B. (2009). Moving toward a comprehensive assessment system: A framework for considering interim assessments. Educational Measurement: Issues and Practice, 28 (3), 5-13.
Porter, A.C., Smithson, J., Blank, R., & Zeidner, T. (2007). Alignment as a teacher variable. Applied Measurement in Education, 20(1), 27-51.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd Ed.). Thousand Oaks, CA: Sage.
Stiggins, R. J. (2014). Defensible Teacher Evaluation: Student Growth Through Classroom Assessment. Thousand Oaks, CA: Corwin Press.
Stiggins, R. J. & Chappuis, J. (2012). Introduction to Student-Involved Assessment for Learning (6th Ed.). Boston, MA: Pearson Education.
Stiggins, R. J. & Conklin, N. F. (1992). In Teachers’ Hands: Investigating the Practice of Classroom Assessment. Albany, NY: State University of New York Press.
Toulmin, S. E. (1958). The Uses of Argument. New York: Cambridge University Press.
Webb, N.L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7-25.
Wiliam, D. & Black, P. (1996). Meaning and consequence: A basis for distinguishing formative and summative functions of assessment. British Educational Research Journal, 22 (5), 537-548.
Wilson, J. (2012). Using CBAS-WRITE to Identify Struggling Writers and Improve Writing Skills. Paper presented at the Third Annual Connecticut Assessment Forum, Rocky Hill, CT.
Wilson, J. (2016). Executive Summary of Findings from Analyses of Utah Compose and SAGE Data for Academic Year (AY) 2014-15 and 2015-16.
Wilson, J. (under review). Effects of automated writing evaluation software on writing
quality, writing self-efficacy, and state test performance: A study of PEG Writing. Computers & Education.
Wilson, J. & Andrada, G. N. (2016). Using automated feedback to improve writing quality:
Opportunities and challenges. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Technology Tools for Real-World Skill Development (pp. 678-703). Hershey, PA: IGI Global.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5 (3), 161-215.