Analytic Versus Holistic Scoring of Science Performance Tasks


Applied Measurement in Education
Publication details, including instructions for authors and subscription information: http://www.informaworld.com/smpp/title~content=t775653631

Analytic Versus Holistic Scoring of Science Performance Tasks
Stephen P. Klein; Brian M. Stecher; Richard J. Shavelson; Daniel McCaffrey; Tor Ormseth; Robert M. Bell; Kathy Comfort; Abdul R. Othman

Online Publication Date: 01 April 1998

To cite this article: Klein, Stephen P., Stecher, Brian M., Shavelson, Richard J., McCaffrey, Daniel, Ormseth, Tor, Bell, Robert M., Comfort, Kathy, and Othman, Abdul R. (1998). 'Analytic Versus Holistic Scoring of Science Performance Tasks', Applied Measurement in Education, 11(2), 121-137.

To link to this Article: DOI: 10.1207/s15324818ame1102_1

URL: http://dx.doi.org/10.1207/s15324818ame1102_1



APPLIED MEASUREMENT IN EDUCATION, 11(2), 121-137
Copyright © 1998, Lawrence Erlbaum Associates, Inc.

Analytic Versus Holistic Scoring of Science Performance Tasks

Stephen P. Klein and Brian M. Stecher RAND

Santa Monica, California

Richard J. Shavelson School of Education Stanford University

Daniel McCaffrey RAND

Santa Monica, California

Tor Ormseth El Rancho Unified School District

Pico Rivera, California

Robert M. Bell RAND

Santa Monica, California

Kathy Comfort WestEd

San Francisco, California

Abdul R. Othman Mathematical Sciences Program Centre for Distance Education

Universiti Sains Malaysia

Requests for reprints should be sent to Stephen P. Klein, RAND, 1700 Main Street, Santa Monica, CA 90401.


We conducted 2 studies to investigate interreader consistency, score reliability, and reader time requirements of 3 hands-on science performance tasks. One study involved scoring the responses of students in Grades 5, 8, and 10 on 3 dimensions ("curriculum standards") of performance. The other study computed scores for each of the 3 parts of the Grade 5 and 8 tasks. Both studies used analytic and holistic scoring rubrics to grade responses but differed in the characteristics of these rubrics. Analytic scoring took much longer but led to higher interreader consistency. Nevertheless, when averaged over all the questions in a task, a student's holistic score was just as reliable as that student's analytic score. There was a very high correlation between analytic and holistic scores after they were disattenuated for inconsistencies among readers. Using 2 readers per answer does not appear to be a cost-effective means for increasing the reliability of task scores.

Many large-scale testing programs have begun using performance-based ("authentic") assessments of student achievement rather than relying solely on multiple-choice questions. Some states, such as California and New York, even use measures that require students to manipulate various kinds of materials (and are therefore called hands-on tasks). Whether the use of such measures will persist and be expanded to other testing programs will depend in part on the reliability of the scores obtained and the time and cost required for test development, administration, and scoring. In this context, score reliability is generally defined in terms of the degree to which students who receive high scores on one task tend to receive high scores on other tasks that are designed to measure essentially the same skills and abilities (Dunbar, Koretz, & Hoover, 1991). In short, reliability is a function of the correlation among tasks from the same domain. All other things being equal (such as task length), the higher the correlation among tasks, the shorter the test that is needed to obtain an acceptably reliable total score for a student and, therefore, the less testing time and expense required.
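To make the relation between task intercorrelation and required test length concrete, the sketch below applies the Spearman-Brown prophecy formula, which formalizes the trade-off described above. The correlations and the .80 target used here are illustrative assumptions, not values from the studies reported below.

```python
import math

def spearman_brown(r_task: float, n_tasks: int) -> float:
    """Projected reliability of a total score built from n_tasks parallel tasks
    whose average intercorrelation is r_task."""
    return n_tasks * r_task / (1 + (n_tasks - 1) * r_task)

def tasks_needed(r_task: float, target: float) -> int:
    """Smallest number of parallel tasks needed to reach a target total-score reliability."""
    return math.ceil(target * (1 - r_task) / (r_task * (1 - target)))

# Illustrative values only: tasks that intercorrelate .30 require about 10 tasks to
# reach a .80 total-score reliability, whereas tasks that intercorrelate .50 require 4.
print(tasks_needed(0.30, 0.80), round(spearman_brown(0.30, 10), 2))  # 10 0.81
print(tasks_needed(0.50, 0.80), round(spearman_brown(0.50, 4), 2))   # 4 0.8
```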

One factor that affects the reliability of performance test scores is that readers (graders, scorers, raters, observers, etc.) may differ in their assessment of the quality of a student's response. One reader may be more lenient than another (which is indicated by their having different means). Readers also may differ in their judgments about whether one answer is better than another (which would be indicated by a low correlation between readers in the scores they assign to a common set of answers). Such disagreements tend to depress the correlation among tasks and thereby lower score reliability.

Another factor that may affect the reliability of performance test scores is the method that is used to grade responses. In a holistic system, a reader makes a single, overall judgment about an answer's quality. This approach is usually most appropriate when the whole is greater than the sum of the parts, that is, when scores need to be sensitive to general features of answer quality, such as organization, style, and persuasiveness.


In an analytic system, the reader assigns scores to reflect how well the student responds to each of the several aspects of the question or task that should be addressed in a model answer. Analytic scoring presumably provides a more objective assessment of answer quality than holistic methods because, at least in theory, it is less susceptible to extraneous factors and biases, such as handwriting quality or a reader being overly affected by the last portion of the student's answer. Some high-stakes testing programs that include constructed-response measures, such as the Society of Actuaries exams, use analytic scoring because it facilitates calibrating readers who cannot readily meet with each other, and it provides a more concrete basis for defending challenges to the assigned scores. However, it usually takes much longer to grade a batch of answers with an analytic scale than with a holistic one. Thus, the choice of grading method may involve trade-offs among reader agreement, score reliability, costs, and other factors.

Most of the research on the utility of alternative scoring methods has been done with tests of writing ability. This literature suggests it is possible to achieve high levels of interreader consistency with different scoring methods. Bauer (1981) obtained interreader correlations of .95 with the analytic method and .93 with the holistic method in grading of National Assessment of Educational Progress (NAEP) essay answers. Others have reported similarly high levels of interreader consistency for both analytic and holistic scoring of student essays (Moss, Cole, & Khampalikit, 1982; Vacc, 1989).

Analytic and holistic methods do not always yield the same results in terms of students' relative standing. For example, Moss et al. (1982) found that the correlations between methods ranged from .12 to .47 across three grade levels. Vacc (1989) obtained correlations between analytic and holistic scores ranging from .56 to .81 across four raters. These data suggest that the choice of scoring method may influence conclusions regarding the relative quality of a student's responses on a performance task.

Analytic approaches appear to require more training and scoring time than holistic methods. Bauer (1981) reported that the scoring of student responses to NAEP writing prompts took twice as long to train for and four times as long to grade with the analytic method as with the holistic method. Differences of this magnitude have significant consequences for large-scale testing programs with limited resources. Wainer and Thissen (1993) displayed this trade-off graphically using a "ReliaBuck" graph that shows the relation between scoring costs and reliability for both multiple-choice and constructed-response measures.

Absolute (criterion referenced) and relative (normative) scales have been used with both holistic and analytic approaches. There does not appear to be any generally accepted or preferred method when it comes to grading responses, at least with respect to tests that measure content-related knowledge, skills, and abilities. For example, some state boards of bar examiners grade applicant answers to bar exam essay questions on an absolute holistic scale (e.g., scores are assigned on a scale from 0 to 100, where 65 is defined as a marginal fail, 70 as a marginal pass, 75 as a clear pass, etc.), some use a relative holistic scale (e.g., on a scale from 1 to 5, where 1 is far below average and 5 is far above average), and some use an absolute analytic scale (e.g., where scoring guides are tailored to the unique features and issues in each question, and the maximum number of raw score points an applicant can earn varies across questions). Some of the states that use the latter approach convert these raw scores to a relative analytic scale. They do this by transforming the raw scores on each essay question to a common scale (e.g., with a mean of 50 and standard deviation of 10).
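The raw-to-scaled conversion mentioned above can be sketched as a simple linear (z-score) transformation; the target mean of 50 and standard deviation of 10 follow the example in the text, and the raw scores below are hypothetical.

```python
import numpy as np

def to_relative_scale(raw_scores, target_mean=50.0, target_sd=10.0):
    """Linearly rescale raw scores on one essay question so the batch has the
    target mean and standard deviation."""
    raw = np.asarray(raw_scores, dtype=float)
    z = (raw - raw.mean()) / raw.std()
    return target_mean + target_sd * z

# Hypothetical raw scores on a question whose maximum differs from other questions.
print(to_relative_scale([12, 15, 9, 18, 14]).round(1))
```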

Analytic or holistic scales that are expressed in absolute terms may still have a substantial relative component. For example, a reader's judgment about whether a student's portfolio has a certain absolute property (such as "used mathematical terms appropriately") may be influenced by contextual factors, such as the quality and characteristics of the other portfolios the reader graded (Hambleton et al., 1995).

To our knowledge, there are no published studies that compare analytic and holistic grading methods in evaluating student performance with hands-on science tasks. This is not surprising because such measures are just beginning to be used in large-scale testing programs. However, the levels of interreader consistency that have been reported with these tasks appear to be as high as or higher than those obtained with writing prompts (Linn & Burton, 1994; Shavelson, Gao, & Baxter, 1993). For example, Baxter, Shavelson, Goldman, and Pine (1992) found mean interreader correlations with an analytic method of .96 for observers who graded students on how well they performed a science experiment and .85 among readers who scored student laboratory notebooks.

The two studies described here explore the trade-offs between two types of analytic and holistic methods for scoring responses to hands-on science tasks. Specifically, we use student responses from a statewide assessment program to examine the degree to which these methods differ in interreader consistency, score reliability, and reader time. We then discuss the implications of these findings for large-scale programs.

TASKS

Both studies used hands-on performance tasks that were developed by the California State Department of Education for students in Grades 5, 8, and 10. In the spring of 1992, these tasks were administered in several hundred schools throughout California as part of a field test of that state's assessment program in science. Although the content and cognitive demands of the tasks differed by grade level, they all had the same structure. Each task had three parts and each part took one classroom period. The three parts were linked by a story line that coordinated concepts from life, earth, and physical sciences. Students were guided through the parts by a notebook in which they wrote their answers. In Parts I and III, the students worked individually. In Part II, they worked in pairs but wrote their own answers to the questions for this part in their own notebooks.

For example, the Grade 5 task focused on recycling. Part I asked students to sort the objects in a bag of "trash" into groups and then explain why they formed the groups that they did. In Part II, they sorted the trash into three groups: liquids, solids that were living or once living, and solids that were nonliving. They then responded to three questions about their sorting (e.g., "From the living or once living group, which objects would you send to be recycled and tell how you decided"). The first item in Part III showed students how to construct an electromagnet and asked them to sketch what they had done. Items 2 and 3 in this part had students test their magnets; the students were then asked "Is the nail now a magnet?" and "How do you know?" Item 4 had students use their magnets to sort solid nonliving trash into metallic and nonmetallic groups and record their results in a table (for more complete descriptions of these measures, see Saner, McCaffrey, Stecher, Klein, & Bell, 1994).

The questions within each part were mapped onto three statewide standards: (a) conceptual understanding (student understands and communicates the main concepts of science and connections among them), (b) performance (student uses scientific processes, concepts, and tools to solve problems and increase conceptual understanding), and (c) application (student demonstrates, communicates, and applies conceptual understanding that reflects scientific attitudes, values, and social responsibility).

The Grade 5 task had 2 scorable items in Part I, 7 in Part II, and 3 in Part III. The corresponding counts were 5, 14, and 4, respectively, for the Grade 8 task and 3, 8, and 1, respectively, for the Grade 10 task. Most of the conceptual items and almost all the performance items were in Part II. Half of the application items were in Part III. Part I usually had a mix of items across standards. The readers in both of the studies described here recorded their grades on separate score sheets. They could not see the scores assigned by other readers or their own scores under a different grading method.

STUDY 1: ANALYSES BY STANDARD

Scoring Methods and Procedures

Descriptions of the scoring levels that corresponded to each question or combination of questions were compiled in a detailed analytic scoring guide for each task that included benchmark answers for each score level. Each response was graded on a 3-, 4-, 5-, or 6-point scale that was tailored to each question. The higher the score, the better the quality of the response. For example, the scale for the combination of Questions 2 and 3 in Part III of the Grade 5 task described earlier was as follows:

0 = No attempt at answering, left blank
1 = Vague or no specific metal mentioned (e.g., "because it picks up things")
2 = Gives specific instance (e.g., "staples stick to it")
3 = Has concept of a magnet (e.g., "attracts metal paper clip and small nail")

A separate analytic score was computed for a standard by summing the analytic scores on the questions that were aligned with that standard. All the readers participated in a 6-hr training and calibration session before they began working on their own. A reader graded one batch of 25 notebooks before grading another batch. Within a batch, a reader graded all the responses to one part before grading all the responses to another part. There was a separate analytic score recording form for each part.
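As a sketch of the score assembly just described: the item-to-standard alignment below is hypothetical, and the only point is that a student's analytic score on a standard is the sum of the analytic scores on the items aligned with that standard.

```python
# Hypothetical alignment of item identifiers (part.question) to the three standards.
ITEM_TO_STANDARD = {
    "I.1": "conceptual", "I.2": "application",
    "II.1": "performance", "II.2": "performance", "II.3": "conceptual",
    "III.2": "application",
}

def standard_scores(item_scores: dict) -> dict:
    """Sum a student's analytic item scores within each curriculum standard."""
    totals: dict = {}
    for item, score in item_scores.items():
        standard = ITEM_TO_STANDARD[item]
        totals[standard] = totals.get(standard, 0) + score
    return totals

print(standard_scores({"I.1": 2, "I.2": 1, "II.1": 3, "II.2": 2, "II.3": 4, "III.2": 0}))
```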

After the readers completed grading their notebooks using the analytic method, they received a 2-hr orientation to a relative holistic method. In this method, a reader assigned a single score to each student for each of the three standards based on the reader's judgment regarding the overall quality of the notebook. The readers used a 6-point scale for this purpose, ranging from 1 (well below average performance) to 6 (well above average performance).

The analyses for Study 1 are based on 168 Grade 5 students, 98 Grade 8 students, and 102 Grade 10 students who had their answers scored under standardized conditions. These students were sampled randomly from the statewide field test population. There were four to five pairs of readers per grade level. Both members of a pair graded the same set of answers under both methods. That is, each student's notebook was graded four times: twice under the analytic method and twice under the holistic method. All the readers were teachers at the grade level of the students whose answers they scored.

Results

A generalizability study (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991) was conducted at each grade level for each standard for both scoring methods. A Person × Rater × Item analysis was used with the analytic ratings, and a Person × Rater analysis was used with the holistic ratings (see the Appendix). These analyses found that all the readers had essentially the same mean within a scoring method. Consequently, disagreements between readers stemmed almost entirely from differences in their judgments regarding the relative quality of the students' answers.
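The Person × Rater analysis used with the holistic ratings can be sketched as a two-way ANOVA without replication. The expected-mean-squares estimator below is a standard textbook version rather than the authors' exact code, and the score matrix is invented for illustration; a near-zero rater component corresponds to readers sharing the same mean, with disagreement appearing in the Person × Rater term instead.

```python
import numpy as np

def person_by_rater_components(scores):
    """Variance components for a fully crossed Person x Rater design
    (rows = persons, columns = raters, one holistic score per cell)."""
    x = np.asarray(scores, dtype=float)
    n_p, n_r = x.shape
    grand = x.mean()
    ss_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_pr = ((x - grand) ** 2).sum() - ss_p - ss_r   # interaction confounded with error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    return {
        "person": max((ms_p - ms_pr) / n_r, 0.0),
        "rater": max((ms_r - ms_pr) / n_p, 0.0),
        "person x rater, error": ms_pr,
    }

# Invented 5-student, 2-rater holistic scores.
print(person_by_rater_components([[4, 5], [2, 3], [5, 4], [1, 2], [3, 3]]))
```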


TABLE 1
Interreader Correlations by Grade Level, Standard, and Scoring Method in Study 1

                 Conceptual             Performance            Application
Grade            Analytic   Holistic    Analytic   Holistic    Analytic   Holistic
Mean             .75        .55         .82        .38         .68        .52

Note. In Grade 5, there were 3 scorable conceptual items, 4 performance items, and 5 application items. In Grade 8, there were 8 scorable conceptual items, 11 performance items, and 4 application items. In Grade 10, there were 4 scorable conceptual items, 6 performance items, and 2 application items.

Interreader correlations with the analytic method were consistently higher than those with the holistic method (Table 1). Some of the possible explanations for this disparity are (a) it was simply too difficult to read a student's entire notebook and then arrive at a single score on each standard, (b) readers were not adequately trained in the use of the holistic rubric, (c) they were too fatigued after having done the analytic scoring, and (d) the definitions of the standards did not provide readers with enough guidance.

Although the analytic method had higher interreader correlations than the holistic method, it still failed to provide reasonably reliable student scores for each standard when there was only one reader per answer (Table 2). This occurred because the variability due to persons accounted for only a small part of the total variability (almost always less than 20%). In contrast, task variability (i.e., the Person × Item interaction) accounted for 36% to 54%. Raters as well as Raters × Items explained less than 1%. In short, the low reliability for a score on a standard was not due to one reader being consistently more lenient than another or one reader grading high on some questions but low on others. Instead, it stemmed mainly from inconsistencies in a student's level of performance across items.

The same was true for total scores. A G-study analysis with all the items in a task produced estimates of .52, .77, and .63 for the reliability of analytic total scores at Grades 5, 8, and 10, respectively (the distribution of the variability in these scores across components followed the pattern exhibited in Table 2). Again, these estimates assume each notebook is read once. However, because of the large task variability, using two (or more) readers per answer would make only a small improvement in score reliability (while greatly increasing costs and perhaps delaying the reporting of results). For example, on the average across the three tasks studied, increasing the number of readers per answer from one to two would increase the reliability of an analytic score on a standard from .45 to .52 and increase the reliability of a total score from .64 to .71.
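The one- versus two-reader comparison can be reproduced from the variance components with the generalizability coefficient given in the note to Table 2. The sketch below uses the mean percentages from Table 2 and assumes a five-item standard; the multiple-reader case is the usual decision-study extension of that formula, not something reported separately in the article.

```python
def g_coefficient(var_p, var_pr, var_pi, var_res, n_items, n_readers=1):
    """Generalizability coefficient for a Person x Rater x Item design.
    With n_readers = 1 this matches the formula in the note to Table 2; the
    n_readers > 1 case is the standard decision-study extension (an assumption here)."""
    error = var_pr / n_readers + var_pi / n_items + var_res / (n_items * n_readers)
    return var_p / (var_p + error)

# Mean variance percentages from Table 2 (person 15, person x rater 3,
# person x item 44, residual 23) and an assumed 5-item standard:
print(round(g_coefficient(15, 3, 44, 23, n_items=5), 2))               # ~0.48, one reader
print(round(g_coefficient(15, 3, 44, 23, n_items=5, n_readers=2), 2))  # ~0.54, two readers
```

With these rounded inputs the gain from a second reader is small, mirroring the .45 to .52 improvement reported above.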


The mean correlation among the three standards with the analytic method at Grades 5, 8, and 10 (.35, .49, and .57, respectively) was much lower than the corresponding mean with the holistic scale (.73, .84, and .76, respectively). This difference was probably due to halo effects with the holistic ratings. Consequently, we suspect the correlations among the three standards with the holistic method would be lower if readers graded all the students on one standard before grading them all on another standard (but this would have tripled reading time).

There was usually only a moderate observed correlation between the analytic and holistic scores on a standard (Table 3). However, these correlations were all close to 1.00 when they were corrected (disattenuated) for the less than perfect agreement among readers. The disattenuated correlation between an analytic score on one standard and a holistic score on another standard also tended to be very high (the mean values at Grades 5, 8, and 10 were .80, .81, and .86, respectively) but still less than the correlations between scoring methods on the same standard. Thus, it appears that the standards may assess somewhat (but not substantially) different aspects of student performance. However, these data must be interpreted cautiously because of the large amount of measurement error that was present.

TABLE 2
Percentage of Total Variability and Reliability Estimates From Generalizability Analyses With the Analytic Scoring Method by Grade Level and Standard in Study 1

                             Source of Variance (% of total variability)
Grade  Standard       Person  Rater  Item  Person × Rater  Person × Item  Rater × Item  Residual  Score Reliability(a)
5      Conceptual      14      0      11         3               46             0           26        .34
5      Performance      9      0      10         2               52             0           26        .38
5      Application     16      0      18         5               36             0           26        .49
8      Conceptual      14      0      16         1               41             1           26        .60
8      Performance     12      0      22         1               39             1           24        .64
8      Application     14      0       5         1               52             0           27        .41
10     Conceptual      13      1      28         5               38             0           16        .42
10     Performance     16      1       8         3               54             0           18        .51
10     Application     25      1       2         9               41             0           22        .38
Mean                   15      0      13         3               44             0           23        .46

(a) Reliability of all items in the scale with one reader. The reliability estimates were computed using the following formula with the estimated variance components:

Reliability = σ²(p) / {σ²(p) + σ²(p × r) + [σ²(p × i) + σ²(residual)] / n}

where p is person, r is rater, i is item, and n is the number of items.


TABLE 3
Observed and Disattenuated Correlations Between Analytic and Holistic Scores in Study 1

               Observed                        Disattenuated(a)
Standard       Grade 5   Grade 8   Grade 10    Grade 5   Grade 8   Grade 10
Conceptual     .59       .52       .69         .90       .93       .99
Performance    .62       .41       .56         .96       .95       .95
Application    .57       .57       .72         .97       >1.00     >1.00

(a) Coefficients are disattenuated for the lack of perfect agreement between readers. These correlations may be spurious when the correlation between readers is low.

STUDY 2: ANALYSES BY PART

Scoring Methods and Procedures

Study 2 used a somewhat different analytic scoring guide than Study 1. Specifically, in Study 2, a student's response to each item was rated on a scale ranging from 1 (nonresponsive) to 6 (outstanding), but again, there were benchmark answers in the scoring guide to indicate the types of responses that were associated with each score level. A score of 0 was assigned if the student left the question blank (but this happened only rarely). A student's overall analytic score on a part was the mean (as distinct from the total) of the separately scored responses in that part.

The holistic method for Study 2 involved assigning a single overall score to each part. A separate 5-point scale was developed for this purpose. The score points on this scale ranged from 1 (among the worst of the responses to a part) to 5 (among the best of the responses to a part). Readers were instructed to use the full 5-point scale. To ensure that they did this, we asked them to assign at least one answer to each score level on each part. However, beyond this restriction, they could assign as many answers as they wished to each score level. Thus, unlike Study 1, all the grading was done by parts rather than by standards.

A student's total score under each scoring method was the sum of the three part scores. Therefore, the maximum possible analytic total score was 18, and the maximum possible holistic total score was 15. Summing items to produce part scores and summing parts to produce total scores effectively weights each item and part by its variance.
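The arithmetic of the Study 2 scores is simple enough to sketch directly; the item ratings below are hypothetical and serve only to show how part and total scores were assembled under each method.

```python
def analytic_part_score(item_ratings):
    """Study 2 analytic part score: the mean of the 1-6 item ratings in that part."""
    return sum(item_ratings) / len(item_ratings)

# Hypothetical ratings for one student's three parts.
analytic_parts = [analytic_part_score(p) for p in ([4, 5], [3, 4, 5, 4, 3, 4, 5], [2, 3, 4])]
holistic_parts = [3, 4, 3]                # one 1-5 holistic score per part

analytic_total = sum(analytic_parts)      # maximum possible total: 18
holistic_total = sum(holistic_parts)      # maximum possible total: 15
print(round(analytic_total, 2), holistic_total)
```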

During the 2-day scoring session for this study, about half the readers used the analytic method to grade all of their assigned answers to Part I. Next, they used this scoring method to grade all of their assigned answers to Part II, and then they used this scoring method to grade all the answers to Part III. Finally, they repeated this whole process using the holistic method. The other readers did the same thing, but they started with the holistic method and finished with the analytic method (this counterbalancing was done to control for the possible sequence effects that may have crept into Study 1).

The teachers who served as graders for Study 2 were asked to bring to the scoring session 6 student notebooks from their own class. Teachers were told to select one student in the bottom third of their class, one student in the middle third, and one student in the top third. In addition, they were told to bring the notebooks for the students who were paired with these three students during Part II of the assessment. This netted a sample of 82 notebooks from Grade 5 students and 72 notebooks from Grade 8 students from the statewide field test that were legible enough to grade. There were no 10th graders in Study 2. Because of time and other constraints, 64% of the fifth graders and 94% of the eighth graders had their notebooks evaluated on all three parts by at least two readers under both scoring methods.

The 29 readers participating in this study were fifth- and eighth-grade teachers who had been trained to use the analytic rubric for their grade level as part of the statewide program. They all had previous experience grading answers with this rubric. They were trained in the use of the holistic method in conjunction with the scoring sessions for this study. The teachers at each grade level were split into two groups. The teachers in one group began by grading the notebooks that were submitted by the teachers in the other group. After a reader graded a batch of notebooks, they were shuffled and given to another reader to score. Several answers that were graded by one group of readers with the holistic method were also graded with this method by the other group (but teachers did not score their own students' work, and time constraints precluded switching batches between reader groups with the analytic method). At both grade levels, a teacher scored 10 to 18 notebooks with the holistic method and 9 to 12 with the analytic method.

Results

Readers agreed somewhat more with each other regarding the score that should be assigned to a student on a part when they used the analytic method than when they used the holistic method. Specifically, mean interreader correlation coefficients were higher with the analytic method than with the holistic method (Table 4). The generally higher interreader coefficients in Part II for both scoring methods probably stemmed from this part having more than twice as many questions as either of the other two parts. In terms of total scores summed across the three parts, the analytic method led to higher interreader consistency than the holistic method at Grade 5 but not at Grade 8.

There was no meaningful difference between the analytic and holistic methods in their assessment of whether one student's answers were better or worse than some other student's answers. The correlation between the total scores obtained with each method (i.e., across the three parts) was .71 at Grade 5 and .80 at Grade 8 (see Table 5 and the Appendix).


TABLE 4
Interreader Correlations by Scoring Method in Study 2

            Grade 5                  Grade 8
Score       Analytic   Holistic      Analytic   Holistic
Part I      .85        .54           .79        .68
Part II     .92        .71           .85        .67
Part III    .80        .58           .79        .73
Total       .89        .70           .84        .83

Note. The Grade 5 task had 2 scorable items in Part I, 7 in Part II, and 3 in Part III. The Grade 8 task had 5 scorable items in Part I, 14 in Part II, and 4 in Part III. The Grade 10 task had 3 scorable items in Part I, 8 in Part II, and 1 in Part III.

TABLE 5
Observed and Disattenuated Correlations Between Analytic and Holistic Scores in Study 2

            Observed                 Disattenuated
Score       Grade 5   Grade 8        Grade 5   Grade 8
Part I      .61       .68            .90       .92
Part II     .70       .68            .86       .90
Part III    .60       .74            .87       .98
Total       .71       .80            .90       .96

However, when these observed correlations were disattenuated for the less than perfect agreement among readers within a scoring method, the correlation between methods increased to .90 at Grade 5 and .96 at Grade 8. These data suggest that almost all of the observed difference between methods was due to the readers within a method disagreeing with one another regarding the score that should be assigned to an answer. Thus, as in Study 1, the scoring method itself appeared to have almost no unique influence on the readers' assessment of the relative quality of a student's answers.

Finally, we used detailed records from Study 2 to examine how long it took to grade a student's responses with each scoring method. This analysis found that, at Grade 5, the analytic method took an average of 17.5 min per student to grade all three parts, whereas the holistic method took only 6.4 min (Table 6). The corresponding times at Grade 8 were 14.6 and 3.1 min, respectively. Thus, the analytic method took nearly three times as much time as the holistic approach to score a fifth grader's answers and nearly five times as much to score an eighth grader's answers. The mean of these times translates into roughly $6.70 per student for analytic scoring versus $2.00 for holistic scoring.


TABLE 6
Score Reliability, Scoring Times, and Costs With Analytic and Holistic Methods in Study 2

                       Grade 5                   Grade 8
                       Analytic   Holistic       Analytic   Holistic
Score reliability      .68        .66            .77        .81
Minutes per student    17.5       6.4            14.6       3.1
Cost per student       $7.29      $2.67          $6.08      $1.29

Note. The reliabilities of total analytic scores in Grades 5, 8, and 10 in Study 1 were .52, .77, and .63, respectively. Reliability and cost estimates assume one reading per notebook. The estimates of score reliability in both studies may be inflated because of the nesting of items within tasks. This nesting may create a context effect such that the usual assumption of zero covariance of errors among items within a test may be untenable.

These estimates exclude the time required for training and assume a conservative $25/hr cost per reader, which includes supervisory staff and reader time, meals, facilities, travel, and other expenses. In short, if readers are reimbursed for their time and expenses, then the cost of using the types of analytic methods California employed in 1992 will be substantially greater than the cost of a far less time-consuming holistic approach.
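A worked version of the cost arithmetic behind Table 6 and the preceding paragraph, assuming the article's $25/hr all-inclusive reader rate; the computation is just scoring minutes converted to dollars.

```python
READER_COST_PER_HOUR = 25.0  # the article's assumed all-inclusive rate

def scoring_cost(minutes_per_student, hourly_rate=READER_COST_PER_HOUR):
    """Per-student scoring cost implied by a scoring time in minutes."""
    return minutes_per_student / 60.0 * hourly_rate

# Grade 5: 17.5 min analytic and 6.4 min holistic; Grade 8: 14.6 and 3.1 min.
for minutes in (17.5, 6.4, 14.6, 3.1):
    print(f"{minutes:>5} min -> ${scoring_cost(minutes):.2f}")
# Averaging the two grades gives roughly $6.70 (analytic) versus $2.00 (holistic).
```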

DISCUSSION AND CONCLUSIONS

The foregoing results suggest there are trade-offs between analytic and holistic scoring. In both studies, there was almost no difference in mean scores between readers within a scoring method, but the analytic approach produced higher correlations between readers than did the holistic method. However, Study 2 showed that the reliability of total scores across the three parts was about the same with the two methods, and the holistic approach required far less time to grade a set of answers than did the analytic method. In addition, the very high disattenuated correlations between the two scoring methods in both studies suggest there was no meaningful underlying difference between them in their assessment of the relative quality of student responses. Either graders respond to the same features of student responses regardless of scoring method or the features that drive each method are highly correlated with each other.

The disattenuated values are particularly relevant if the focus of attention is switched from individual students to schools. School means are generally much more reliable than individual student scores. Thus, the correlation between methods probably would be closer to the disattenuated values if the school rather than the individual student is used as the unit of analysis. In short, the choice of scoring method would probably have little or no effect on a school's relative standing in a district or a statewide assessment program. In addition, both scoring methods resulted in comparable estimates of the internal consistency reliability of the overall task in a grade level. These estimates may have been inflated by the nesting of questions within tasks (Wainer & Kiely, 1987; Yen, 1993) and deflated by the students working in pairs on Part II but individually on Parts I and III. For example, Saner et al. (1994) found that scores on Parts I and III correlated higher with each other than they did with the Part II scores.

From the standpoint of cost effectiveness, using two readers per answer does not appear to be a viable strategy because it produces only a small increase in score reliability for twice the cost of having a single reader. Moreover, the slight advantage the analytic method enjoyed in interreader consistency did not pay off in overall score reliability. Thus, on balance, the only factor that would seem to favor one scoring method over the other is that readers could grade far more answers per hour with the holistic approach.

Nevertheless, there are potential disadvantages to the holistic approach that must be considered. For example, if a purely relative holistic scale is used and if the readers in one part of a state see a somewhat better batch of responses than the readers in another part, then this difference in mean response quality may not be reflected in their scores (i.e., unless the different grading teams are truly calibrated to a common standard through extensive training and supervision). Similarly, if a state repeated some tasks over time to assess change in student achievement, then it would need some way to ensure that the holistic judgments that were made one year employed the same grading standards as those used in subsequent years. Narrative descriptions of scoring levels probably are not sufficient for this purpose. We suspect that the type of analytic scale used in Study 1 would be far less sensitive to contextual biases than would any of the other grading systems we studied because it assigns scores on the basis of how closely the student's responses match specific benchmark exemplars in the scoring guide.

Another potential concern with holistic judgments is that they may be harder to explain and defend. With a holistic scale, one has to justify why (when taken together) the student's responses to all of the separate components and aspects of an answer warrant a given overall grade. In an analytic system, each component is graded separately. Thus, one has only to show that the student's response to a component did or did not conform to the scoring guide for that component. For example, the analytic score on Item 4 in Part III of the Grade 5 task is based on a simple count of the number of pieces of trash the student properly placed in the metallic and nonmetallic categories. In contrast, the holistic score for Part III is based on the reader's impression of the overall quality of the student's responses to this part as a whole.

All the standards in Study 1 had fairly low reliabilities. Thus, they should not be used for reporting scores for individual students. Moreover, the high disattenuated correlations among the standards mean that, if their reliabilities had been sufficient for making decisions about individual students, then the standards would be so highly correlated with each other that they would not differ in how they classified students. Thus, they would not provide differential diagnostic data about a student's performance.

On the positive side, the teachers who served as readers often reported that the scoring activities helped them develop a better appreciation of different aspects of student learning and performance as well as gave them guidance in how to construct and score hands-on tasks for their own students. All of the scoring methods employed in this research seemed to be equally useful for this purpose. However, whether this is an actual benefit would have to be determined by more systematic studies.

Finally, it is important to remember that these findings are based on the specific tasks, student responses, scoring rubrics, and readers used in our research. Results with other tasks, rubrics, and readers could be different. Moreover, results may be sensitive to how holistic and analytic rubrics are constructed and how readers are trained to use them. Nevertheless, our findings were consistent across grade levels, with different sets of students and readers, and with studies in other fields. This leads us to conclude that, in many situations (and especially those in which the unit of analysis is the school or larger aggregation of students), differences in overall response quality probably will be evident regardless of whether an analytic or holistic method is used. Thus, the choice of scoring method will most likely depend on other factors, such as costs and ease in communicating results to others.

REFERENCES

Bauer, B. A. (1981). A study of the reliabilities and cost-efficiencies of three methods of assessment for writing ability. (ERIC Document Reproduction Service No. ED 216 357)

Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based scoring for hands-on science assessment. Journal of Educational Measurement, 29, 1-17.

Cronbach, L., Gleser, G., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.

Hambleton, R., Jaeger, R., Koretz, D., Linn, R., Millman, J., & Phillips, S. (1995). Review of the measurement quality of the Kentucky Instructional Results Information System, 1991-1994 (Report prepared for the Office of Educational Accountability, Kentucky General Assembly). Frankfort, KY: Office of Educational Accountability, Kentucky General Assembly.

Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13, 5-15.

Moss, P. A., Cole, N. S., & Khampalikit, C. (1982). A comparison of procedures to assess written language skills at grades 4, 7, and 10. Journal of Educational Measurement, 19, 37-47.

Saner, H., McCaffrey, D., Stecher, B., Klein, S., & Bell, R. (1994). The effects of working in pairs in science performance assessments. Educational Assessment, 2, 325-338.


Shavelson, R. J., Gao, X., & Baxter, G. P. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215-232.

Shavelson, R. J., Mayberry, P. W., Li, W., & Webb, N. M. (1990). Generalizability of job performance measurements: Marine Corps rifleman. Military Psychology, 2, 129-144.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Vacc, N. N. (1989). Writing evaluation: Examining four teachers' holistic and analytic scores. The Elementary School Journal, 90, 87-95.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case of testlets. Journal of Educational Measurement, 24, 185-201.

Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103-118.

Weerakkody, G. J., & Givaruangsawat, S. (1995). Estimating the correlation coefficient in the presence of correlated observations from a bivariate normal population. Communications in Statistics, Theory and Methods, 24, 1M5-1720.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-214.

APPENDIX
Statistical Procedures

The actual design for Study 1 replicated the Person × Rater × Item G-study for four to five blocks of student-rater pairs. One way to analyze such data is to estimate the Person × Rater × Item design for each block and then pool across the blocks to estimate variance components. However, this was not necessary. Past research indicates that, with negligible rater effects (as were present in this study), the crossed design we used produces almost exactly the same variance component estimates as does a design that specifically accounts for blocking (Shavelson, Mayberry, Li, & Webb, 1990). The interreader correlations for Study 1 were calculated as follows: For each student, (a) average the item scores assigned by one rater within a standard, (b) average the item scores assigned by the other rater within that standard, and (c) correlate the scores assigned by one reader with the scores assigned by the other reader. The reliability estimates in Table 2 were computed using the following formula with the estimated variance components:

Reliability = σ²(p) / {σ²(p) + σ²(p × r) + [σ²(p × i) + σ²(residual)] / n}

where p is person, r is rater, i is item, and n is the number of items.
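A sketch of the interreader correlation computation in steps (a) through (c) above, assuming each rater's item scores are stored as a student-by-item array and that the caller supplies the column indices of the items aligned with a standard; the example scores are invented.

```python
import numpy as np

def interreader_correlation(rater1_items, rater2_items, standard_columns):
    """Study 1 interreader correlation for one standard: average each rater's item
    scores within the standard for every student, then correlate the two vectors."""
    m1 = np.asarray(rater1_items, dtype=float)[:, standard_columns].mean(axis=1)
    m2 = np.asarray(rater2_items, dtype=float)[:, standard_columns].mean(axis=1)
    return np.corrcoef(m1, m2)[0, 1]

# Invented scores for 4 students on 3 items aligned with one standard.
r1 = [[2, 3, 1], [4, 4, 3], [1, 2, 2], [3, 4, 4]]
r2 = [[2, 2, 1], [4, 3, 3], [2, 2, 1], [3, 4, 3]]
print(round(interreader_correlation(r1, r2, [0, 1, 2]), 2))
```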

Tables A1 and A2 show the correlations among standards with the same and different scoring methods, respectively. The disattenuated values in Table A2 are corrected for rater reliability only. All the correlations in these tables assume one rater per answer.

In Study 2, each notebook was graded by two or three readers under each scoring method.


TABLE A1
Observed Correlations Between Standards Within Scoring Methods by Grade Level

Standards Being Correlated            Grade 5   Grade 8   Grade 10
Within analytic method
  Conceptual and performance          .68       .53
  Conceptual and application          .22       .42
  Performance and application         .16       .51
Within holistic method
  Conceptual and performance          .78       .85
  Conceptual and application          .70       .83
  Performance and application         .70       .84

TABLE A2
Observed and Disattenuated Correlations Between Standards and Scoring Methods by Grade Level

                                                    Observed                Disattenuated(a)
Standards Being Correlated                          5      8      10        5       8       10
Analytic conceptual and holistic conceptual         .59    .52    .69       .90     .93     .99
Analytic conceptual and holistic performance        .52    .54    .48       >1.00   >1.00   .86
Analytic conceptual and holistic application        .56    .48    .68       .86     .86     .99
Analytic performance and holistic conceptual        .56    .30    .52       .52     .52     .71
Analytic performance and holistic performance       .62    .41    .56       .95     .95     .95
Analytic performance and holistic application       .53    .29    .50       .50     .50     .70
Analytic application and holistic conceptual        .37    .56    .70       >1.00   >1.00   >1.00
Analytic application and holistic performance       .45    .59    .47       >1.00   >1.00   .87
Analytic application and holistic application       .57    .57    .72       >1.00   >1.00   >1.00

(a) With low reliabilities, disattenuated correlations may be spurious.

Thus, there was no way to calculate a unique unbiased Pearson correlation coefficient of interreader agreement within a method and still use all the data that were collected with that method. Consequently, we used an intraclass correlation coefficient for this purpose. This estimate of interreader agreement is defined as follows:

Intraclass Correlation = 1 − (variance among the scores assigned to a single student) / (total variance among all scores)
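A minimal sketch of the coefficient defined above, under the assumption that the "variance among scores from a student" is pooled by simple averaging over notebooks; the score lists are hypothetical.

```python
import numpy as np

def intraclass_correlation(scores_by_notebook):
    """Interreader agreement as defined above: one minus the (pooled) variance of the
    two or three scores a notebook received, divided by the total variance of all scores."""
    groups = [np.asarray(s, dtype=float) for s in scores_by_notebook]
    within = np.mean([g.var() for g in groups])   # pooling by averaging is an assumption
    total = np.concatenate(groups).var()
    return 1.0 - within / total

# Hypothetical part scores assigned to five notebooks by two or three readers each.
print(round(intraclass_correlation([[3, 4], [5, 5, 4], [2, 2], [4, 3, 4], [1, 2]]), 2))
```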

Similarly, because a given paper received two or three holistic scores and two or three analytic scores from different combinations of readers, we could not use all of these data to calculate a unique unbiased Pearson correlation coefficient between scoring methods. Hence, we relied on a maximum likelihood approach for calculating the values in Table 5. The joint distribution of all scores from a single paper depends on three unknown parameters: the correlation between analytic scores, the correlation between holistic scores, and the correlation between analytic and holistic scores. We reduced the problem to a single-parameter maximum likelihood function by using the mean interreader correlations within methods to estimate the correlation between scores from the same method. Weerakkody and Givaruangsawat (1995) found that this approximate maximum likelihood estimator behaved very much like the full maximum likelihood estimator, and it outperformed other techniques, such as the Pearson correlation coefficient or ignoring the clustering of scores from the same paper.

The disattenuated correlations between analytic and holistic scoring methods in Table 5 were obtained by dividing the estimated observed correlations between methods by the square root of the product of the mean interreader analytic and holistic correlation coefficients. All reliability estimates are for a single reader per answer.
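The disattenuation step just described reduces to a single division; plugging in the Grade 5 and Grade 8 totals from Tables 4 and 5 reproduces the reported values.

```python
def disattenuate(r_between_methods, r_readers_analytic, r_readers_holistic):
    """Correct the analytic-holistic correlation for interreader disagreement."""
    return r_between_methods / (r_readers_analytic * r_readers_holistic) ** 0.5

# Grade 5 totals: observed .71, interreader correlations .89 (analytic) and .70 (holistic).
print(round(disattenuate(0.71, 0.89, 0.70), 2))  # ~0.90
# Grade 8 totals: observed .80, interreader correlations .84 and .83.
print(round(disattenuate(0.80, 0.84, 0.83), 2))  # ~0.96
```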

We did not run a generalizability analysis with Study 2 to estimate the correlation between scoring methods because the data for this study did not satisfy the assumptions of independent errors and equal variances across parts. The part scores within a task were correlated with each other, and the parts had somewhat different variances. Exploratory analyses revealed that this situation would lead to uninterpretable results with a G-study. Moreover, the simpler approaches we adopted for Tables 4 and 5 make much less stringent assumptions than are needed for a G-study.
