
DOCUMENT RESUME

ED 350 348                                                TM 019 125

AUTHOR       McMorris, Robert F.; Boothroyd, Roger A.
TITLE        Tests that Teachers Build: An Analysis of Classroom Tests in Science and Mathematics.
PUB DATE     Apr 92
NOTE         21p.; Paper presented at the Annual Meeting of the National Council on Measurement in Education (San Francisco, CA, April 21-23, 1992).
PUB TYPE     Reports - Research/Technical (143) -- Speeches/Conference Papers (150)
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  Classroom Techniques; Competence; Grade 7; Grade 8; Interviews; Junior High Schools; Mathematics Teachers; *Mathematics Tests; Multiple Choice Tests; Questionnaires; Science Teachers; *Science Tests; *Secondary School Teachers; *Teacher Made Tests; *Test Construction; Test Content

ABSTRACT

Classroom tests developed by seventh- and eighth-grade science teachers (n=23) and mathematics teachers (n=18) were analyzed by panels of content and measurement experts. The 41 participating teachers, each of whom contributed 2 tests, completed a questionnaire, an interview, and 2 measures of competence in testing. Teachers used all major item formats in their classroom tests. Science teachers favored multiple-choice items and mathematics teachers favored computation items. Faults were found in 35 percent of completion items and 20 percent of multiple-choice items on teachers' tests. Average test quality on 6 dimensions was rated 5.0 to 5.7 on 7-point semantic differential scales. Test quality was best predicted by scores on a multiple-choice measurement competency test. The sample of classroom tests is described, evaluated, and then related to teachers' training and experience, knowledge of testing, and context of test use to learn more about this pervasive, crucial, and understudied type of testing. Three tables and one figure illustrate study findings. (SLD)

***********************************************************************
Reproductions supplied by EDRS are the best that can be made from the original document.
***********************************************************************


Tests that Teachers Build: An Analysis of Classroom Tests in Science and Mathematics

Robert F. McMorris

Roger A. Boothroyd

State University of New York at Albany

A paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, April, 1992.


Tests that Teachers Build: An Analysis of Classroom Tests in Science and Mathematics

Robert F. McMorris & Roger A. Boothroyd¹

State University of New York at Albany

The typical student has probably taken more teacher-made tests than he or she has eaten fast-food hamburgers, yet we may know even less about the tests than about the hamburgers. Development of such tests is hardly a franchise operation. Few teachers are given directions or prescriptions; no organized quality control is practiced. Nevertheless, the tests remain the primary basis for a multitude of educational decisions, including grading. What are some of the characteristics of actual classroom tests, and how good are these tests judged to be? Do teachers have sufficient professional skill in test development to turn content knowledge into more than hamburger? (Food for thought?)

With the cooperation of a sample of science and mathematics teachers, we examined actual classroom tests developed by individual teachers. In addition, teachers completed two measures of competence in testing plus a questionnaire and an interview.

Research questions

What types of tests do teachers construct? (e.g., what types of items are used?)

For what purposes do teachers test?

To what extent do teachers apply sound principles of classroom testing to their own test development and usage?

Do many items contain violations of item-writing principles?

What is the judged quality of these tests?

Is test quality related to teacher characteristics? More specifically, does test quality relate to

.. teachers' measurement competence, and their ability to detect faulted items?

.. experience, number of measurement courses, measurement knowledge, or adequacy of measurement training?

¹We appreciate the contributions of the 41 teachers for providing us time and tests, six raters for analyzing those tests, many graduate students for helping us implement the instruments, three reviewers for challenging comments and "continue" ratings, and Drs. Robert M. Pruzek and Vicky L. Kouba for their substantial and constructive contributions to the dissertation on which this paper was based.


METHOD

Parts of this section also appear in Boothroyd, McMorris, and Pruzek (1992) and are reproduced here for the convenience of the reader.

Sample

Seventh- and eighth-grade science and mathematics teachers were selected for the study. Judging from prior research, classroom testing occurs with the greatest frequency for those grades and subjects, and such restrictions provided some degree of homogeneity.

Strong efforts were undertaken to obtain a sample that met prespecified criteria (e.g., developed their own classroom tests) yet varied in terms of the independent variables of this study (e.g., content area, experience, and type of school). Names of potential participants were obtained from a variety of sources including graduate courses at local colleges and universities, school districts, directors of teacher centers, teachers, and friends. Teachers were screened by telephone to ensure that they were either provisionally or permanently state-certified in 7th- and 8th-grade science and/or mathematics, were teaching within their certification, had primary responsibility for constructing their own classroom tests, and did not depend on an item manual accompanying the textbook. Only one teacher was excluded because of not constructing his/her own classroom test items.

The 41 participating teachers represented 25 public and private school districts from many geographic regions in the state. No more than two teachers were selected from any one district, with one exception in which four teachers were included. The districts were quite varied and included public (88%) and private (12%) schools in urban, suburban, and rural settings.

Twenty-three teachers (56%) taught 7th- and 8th-grade science while 18 taught mathematics at this level (44%). Approximately two-thirds (68%) were permanently state certified in their discipline while 13 (32%) had provisional certification. Female teachers outnumbered males by nearly a two-to-one margin (63% to 37%, respectively). Teaching experience was somewhat evenly distributed, averaging 12 years but quite variable (SD = 7.2 years).

Instrumentation

Each teacher supplied the researchers with two classroom tests which he/she had developed. For each test, three judges used a rating form in responding to questions of test characteristics and quality. In addition, each teacher devoted approximately three-and-a-half hours to answering a multiple-choice test of measurement competence, identifying items containing rule violations, responding to a questionnaire, and interacting in an interview.

Test Rating Form. The rating scale was designed to describe and assess classroom tests on six dimensions that many authors of measurement textbooks suggest are important to overall test quality (e.g., Hopkins & Antes, 1985; Nitko, 1983). A preliminary version of the rating form was pilot tested with seven participants in a doctoral-level measurement course who each rated two classroom tests. The resulting form, a semantic differential, contained 39 adjective pairs.

Given that quality ratings were desired on each of the six dimensions and that some of the adjective pairs were more descriptive in nature as compared to evaluative, seven judges were asked to classify each adjective pair as either evaluative (i.e., a characteristic clearly good or bad) or descriptive. Nine items were classified as descriptive and therefore analyzed separately. The six test dimensions and the number of evaluative items per dimension were: presentation/appearance (6), directions (4), length (2), content sampling (7), item construction (6), and overall quality (5).

The scale was used by two panels of three raters each, with one panel for science tests, the other for mathematics tests. Each panel consisted of a measurement specialist, a subject-matter specialist, and a person with both measurement and subject-matter expertise.

Mean ratings over items and raters were computed for each dimension and each test. Internal consistency reliabilities ranged from .60 for length to .98 for overall quality.
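As a concrete illustration of this scoring step, the following sketch (ours, not the authors' code) averages item ratings into a per-test dimension score and estimates internal consistency with coefficient alpha; the simulated data and the cronbach_alpha helper are illustrative assumptions only.

    import numpy as np

    def cronbach_alpha(scores):
        """Coefficient alpha for a tests-by-items matrix of ratings."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                      # number of items in the dimension
        item_vars = scores.var(axis=0, ddof=1)   # each item's variance across tests
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulated data: 82 tests x 2 items for the "length" dimension, each entry
    # a 7-point rating already averaged over the three raters on a panel.
    rng = np.random.default_rng(0)
    length_ratings = rng.uniform(1, 7, size=(82, 2))

    dimension_scores = length_ratings.mean(axis=1)   # one quality score per test
    print(dimension_scores.mean(), cronbach_alpha(length_ratings))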

Measurement Competency Test (MCT). A 65-item, four-option, multiple-choice test was developed to assess teachers' knowledge of various measurement concepts specific to classroom testing. The test included items on test planning, types of items, item writing, reliability, and validity.

For the 41 teachers' responses to the final 65-item test, the item difficulties were somewhat evenly distributed. Twenty items (31%) were relatively easy (p > .7), 23 items (35%) had moderate difficulty (.4 to .7), and 22 items (34%) proved difficult (p < .4). All but two items had positive item discrimination values, with 51% (33 items) having discrimination indices above .33. A more complete description of the items and the development procedures may be found in Boothroyd et al. (1992).
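Readers wishing to run a comparable item analysis can do so with a few lines of code. The sketch below computes item difficulty as the proportion correct and uses a corrected point-biserial correlation for discrimination; the paper does not specify which discrimination index was used, so the point-biserial is our assumption, and the response data are simulated.

    import numpy as np

    def item_analysis(responses):
        """Difficulty (p) and corrected point-biserial discrimination for a
        0/1 examinees-by-items matrix of scored responses."""
        r = np.asarray(responses, dtype=float)
        p = r.mean(axis=0)                        # proportion correct per item
        total = r.sum(axis=1)
        disc = np.empty(r.shape[1])
        for j in range(r.shape[1]):
            rest = total - r[:, j]                # total score with item j removed
            disc[j] = np.corrcoef(r[:, j], rest)[0, 1]
        return p, disc

    # Simulated scored responses: 41 examinees x 65 items.
    rng = np.random.default_rng(1)
    scored = (rng.random((41, 65)) < 0.55).astype(int)
    p, disc = item_analysis(scored)
    print((p > .7).sum(), ((p >= .4) & (p <= .7)).sum(), (p < .4).sum())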

Item Judgment Task (IJT). Teachers reviewed 32 multiple-choice and completion items related to junior high school science and mathematics, identifying items they considered "good" and items they perceived as "poor." Violations of recommended item writing principles (flaws) were introduced into some of the items. The 32 items were equally divided between mathematics and science, and further faceted to include an equal number of multiple-choice and completion items. Within each of the four resulting cells, 3/4 of the items (12 of 16) contained a "flaw" in item construction.

Six types of flaws were included, three in multiple-choice items and three others in completion items. Multiple-choice flaws included: (1) a cue repeated in both stem and answer, (2) the longest, most detailed option as the keyed response, and (3) options lacking homogeneity and plausibility. Flaws incorporated in completion items included: (1) blanks in either the beginning or middle of the statement, (2) nonspecific responses as possible correct answers, and (3) omission of a nonessential word, such as a verb.

Analysis of teachers' responses to these items revealed that the greatest proportion of items (14 items/44%) were easy (p > .7), five items (16%) had moderate difficulty (.4 to .7), and 41% (13 items) were difficult (p < .4). Two items had negative discrimination values, 12 items (38%) had discrimination indices less than .1, and 12 items (38%) had discrimination levels greater than .33. A more extensive description of the IJT items, including development procedures and illustrative items, is presented in Boothroyd et al. (1992).

Interview Protocol. A 36-question interview protocol was developed as a means of providing some structure to the interviews and thus helping to ensure that consistent data were acquired for each teacher. The questions were designed to collect information on five topics: (1) the teacher's classroom testing practices and test development procedures [11 items], (2) his/her measurement training [5 items], (3) school/district policies and/or regulations specific to testing [4 items], (4) criteria the teacher used when making good/bad item judgments [3 items], and (5) the classroom tests submitted for review [13 items]. Given the exploratory nature of the study, some questions were added to explore issues that arose during the initial teacher interviews.

RESULTS

Results are reported according to research questions.

What types of tests do teachers construct?

Information on teachers' tests was obtained by examining classroom tests they had developed. Of the 82 tests submitted for review (two tests per teacher), 64 (78%) were unit or chapter tests, 17 were midterm/final examinations (21%), and one (1%) was a quiz. The number of days of content the tests were designed to cover ranged from two days to 200 days. The average number of items on a unit test was 40 (SD = 32.6) while this figure was 91 items for midterms and finals (SD = 45.5). The teachers indicated that the unit/chapter tests tend not to be cumulative (i.e., do not contain material from previous tests) while midterms and finals typically cover all previously presented material. Both unit/chapter tests and midterms/finals are typically administered to multiple classes, as indicated by an average of 67 students per unit or chapter test and 86 students per midterm or final.

In Table 1 the tests are described by item type. According to both the teachers' self-report estimates and the second author's independent analysis of their tests, computation items were the most popular for mathematics teachers and multiple-choice items for science teachers. Further, many formats were used by each set of teachers; indeed, with a more liberal definition of essays to include extended computational items, all these major types of items were used by each set of teachers.


For what purposes do teachers test?

An analysis of the teachers' responses to the interview question "Why do you test?" revealed four primary reasons in addition to a number of secondary considerations. Most frequently cited, by a majority of the teachers (69%), was the response: "to assess students' mastery and understanding of the content taught in class."

The remaining three primary reasons were cited much less frequently, albeit with similar frequency to each other. Instructional reasons were cited by 33 percent of the teachers, who reported that students' performance on classroom tests provides them with an indication of which lessons were most effective and which need to be retaught or remediated.

Grading was mentioned by 31% of the teachers. Many of these teachers did not place grading in the larger context of assessing students' strengths and weaknesses but rather indicated that they had to assign course grades, and classroom tests were a means to that end.

Motivation was the fourth basic reason teachers offered for testing, and was cited by 28% of the teachers. These teachers believed that students would not do the assigned readings or seriously study the course material if tests were not given. Many of the teachers stated that their classroom tests are similar, in many respects, to other types of activities they do during class but are treated in a more formal manner by both students and teachers. As such, students perceive the tests as more important than other classroom activities, take them more seriously, and prepare for them to a greater extent.

To what extent do teachers apply sound principles of classroom testing to their own test development and usage?

Over half of the teachers (54%) indicated they generally develop some form of test plan prior to constructing a test. Although these plans are typically not formally written blueprints, at a minimum they involve a review, and frequently a listing, of the topics to be covered on the test. Most of the teachers indicated their planning process involves reviewing lesson plans, the textbook chapters, and other class material. Some teachers also reported reviewing old tests.

Once the topics are identified, most teachers begin to develop their own items or select items from other sources to assess each of the topics. Slightly over one-third of the teachers (34%) indicated that they weight topics by varying the number of items per topic. The procedures teachers described for deciding on weights for topics involved either taking into account the amount of time that was spent in class on specific topics or assessing the importance of specific material. In either case, these teachers indicated that they include more items on the test for topics which they deemed more important or to which they devoted a greater amount of class time.

Many of the teachers reported during the interviews that they use different item formats for different types of content. These teachers indicated that the item format they used was most generally related to the cognitive level of the item. In science, for example, a number of teachers reported using alternate response (i.e., true/false, yes/no) and matching items for lower cognitive-level items, such as concept definitions or identification, while essay items were used to assess higher cognitive levels such as synthesis. Some of the teachers also distinguished between item formats requiring recognition (e.g., matching, alternate response, multiple choice) and those necessitating recall (e.g., completion, short answer). Few teachers, however, indicated how they "balance" their classroom tests with respect to the issue of cognitive level.

Do many items contain violations of item-writing principles?

A sample of approximately 350 multiple-choice and completion items submitted by the teachers was examined, with flaws detected in 35% of the completion and 20% of the multiple-choice items. The most frequently observed problems in the completion items were blanks in the beginning or middle of a statement (25% of all completion items) and the request for a nonspecific response (14%). Nonhomogeneity of response options was present in seven percent of all multiple-choice items reviewed, with the same percentage having the longest option as the key. Cues were discovered in five percent of the items. Other flaws (in 6% of the items) included window dressing, no question presented in the stem, and spelling errors.

What was the judged quality of these tests?

Each test was rated by a three-judge panel using semantic-differential items. Panelists assigned above-average ratings on all six dimensions, judging appearance the highest (mean = 5.8 on a 7-point scale) and test length the lowest (mean = 5.0) (see Table 2). Overall quality was rated 5.4 on the average. Raters perceived the greatest variation among the tests in terms of appearance and adequacy of directions (SDs = 1.3), and the least variability in item construction and adequacy of content sampling (SDs = .8).

Is test quality related to teacher characteristics?

An analysis of the ratings first revealed differences in ratings between mathematics and science tests on several dimensions, most importantly, overall quality. Given the two panels, this difference was confounded with different raters across subject areas, so the ratings were regressed on subject area and the residual used as the dependent variable for a regression analysis. The best predictor of the resulting quality-of-test variable was the score on the Measurement Competency Test (r = .37).
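A minimal sketch of this two-step procedure follows, under the assumption that regressing the ratings on subject area means an ordinary least-squares fit (equivalent here to removing each subject's mean rating); the variable names and data are hypothetical.

    import numpy as np

    # Hypothetical per-teacher vectors: subject (0 = mathematics, 1 = science),
    # MCT score, and the overall quality rating of that teacher's tests.
    rng = np.random.default_rng(2)
    subject = np.array([0] * 18 + [1] * 23)
    mct = rng.normal(40, 8, size=41)
    quality = 5.0 + 0.7 * subject + 0.02 * mct + rng.normal(0, 0.5, size=41)

    # Step 1: regress quality on subject area and keep the residual,
    # a quality score with the subject/panel confound removed.
    X = np.column_stack([np.ones(41), subject])
    beta, *_ = np.linalg.lstsq(X, quality, rcond=None)
    residual = quality - X @ beta

    # Step 2: correlate the residual with candidate predictors such as the MCT.
    print(np.corrcoef(residual, mct)[0, 1])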

Even with the confounding of raters and subject matter, overall classroom test quality is related to indices of measurement competency. One approach to these relationships is to dichotomize the group on the ratings, on the Measurement Competency Test (MCT), and on the Item Judgment Task (IJT). For the 21 teachers in the bottom half on rated test quality, 11 were in the bottom half of the group on both the MCT and IJT; only 4 were in the top half on both predictors. For the 20 teachers in the top half on rated test quality, 10 were in the top half on both predictors; 2 were in the bottom half on both. These relationships are detailed in Figure 1.
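The dichotomization itself is straightforward to reproduce: split each measure at its own median and cross-tabulate the halves, as in the sketch below (simulated data; treating scores at the median as "high" is our assumption).

    import numpy as np
    import pandas as pd

    # Hypothetical per-teacher scores on the three measures.
    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "quality": rng.normal(5.4, 0.8, 41),
        "mct": rng.normal(40, 8, 41),
        "ijt": rng.normal(18, 4, 41),
    })
    for col in ["quality", "mct", "ijt"]:
        df[col + "_half"] = np.where(df[col] >= df[col].median(), "high", "low")

    # Counts of teachers in each IJT-by-MCT cell within each quality half,
    # mirroring the breakdown shown in Figure 1.
    print(pd.crosstab([df["quality_half"], df["ijt_half"]], df["mct_half"]))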

These two extreme groups differ in teaching experience and measurement background, as noted in Table 3. The high group is the somewhat more experienced group. For each of the three measurement variables, the high-group mean exceeds the low-group mean by more than half a standard deviation.

DISCUSSION

Guttman (1970) expressed in a classic cartoon the imbalance of research emphasis on test design vs. test analysis. Similarly, study of classroom tests and their developers has lagged behind study of standardized, published measures. Classroom testing is the basis for such a variety of decisions involving instruction, grading, and other uses, yet as professionals we know little about the qualities and characteristics of such tests. We have done little to describe, let alone evaluate, these evaluative devices.

Every day, the number of tests taken in schools, and the number and type of decisions based on information from those tests, could perhaps best be described graphically by an astronomy professor from Cornell. And if we include the other types of assessment information used by teachers and students (see, e.g., Airasian, 1991; Stiggins, Conklin, & Bridgeford, 1986), the amount of information, the number of decisions, and the impact of those decisions become virtually incomprehensible. Especially given that teachers' training in formal testing is so limited, and their training in informal assessment even more limited, we are concerned about 1) the quality of the measures, 2) the ability of the teaching professionals to provide professional interpretations of information and appropriate decisions using that information, and 3) our own ability and resolve to formulate and respond to educationally important questions.

Item types used by the science teachers in our study agree with item types found in junior high science tests by Fleming and Chambers (1983, p. 33). The rank order of occurrence is the same across studies: multiple choice was most popular, followed by matching, short answer/completion, true-false, and essay. For teachers more generally, however, Fleming and Chambers found the short answer/completion format most popular and matching a distant second.

For our sample, 20% of the multiple-choice items contained faults. Similarly, in the Oescher and Kirby (1990) study, "Of the 18 tests containing multiple choice items, 17 were judged to have flaws in more than 20% of these items" (p. 13). Carter (1986) also found faults in teacher-made tests. Of the tests Carter reviewed, 78% strongly favored the key in position C, 86% had at least one item with a longer correct answer, 47% contained at least one stem cue, and 58% contained at least one grammatical clue.

But what are the impacts of item faults on teacher-made tests? Certainly items may be made easier by faults (Dunn & Goldstein, 1959; McMorris, Brown, Snyder, & Pruzek, 1972; Haladyna & Downing, 1989a, 1989b). Tests containing item faults are inconsistent with Nitko's (1983) principle that "test items should elicit only the behaviors which the test developer desires to observe" (p. 141). We would expect faulted items to introduce extraneous variance; such variance would, in turn, somewhat reduce the validity of descriptions and decisions based on the test.

Other, more subtle impacts are also possible. Students judge tests and their developers. Do you expect them to respect a bogus test or an incompetent test developer? How many times did your attitude about a teacher or professor change as a result of taking your first test in a course? To illustrate, how do you feel about an author who makes grammatical errors? And on how many other dimensions would you as a student have been able to describe and discuss a teacher's test? Would you have considered easiness, content balance, and understanding or application vs. pedestrian knowledge? The teacher communicates so much with a test. Student attitude toward the course, the instructor, and the subject must be affected by that test and its interpretation.

Classroom evaluation affects students in many ways. For instance, it guides their judgment of what is important to learn, affects their motivation and timing of personal study (e.g., spaced practice), consolidates learning, and affects the development of enduring learning strategies and skills. It appears to be one of the most potent forces influencing education. (Crooks, 1988, p. 467)

The impacts of a test's characteristics and quality, then, are not just in producing appropriate or extraneous variance on the measure itself. The impacts also include student attitudes and perceptions which affect what they bring to the next encounter of an evaluation kind.

One disheartening, anecdotal index of teacher frustration and student achievement levels came from the teacher interviews in this study. Some teachers admitted they intentionally included clues in items so some weaker students could answer some items correctly. Admittedly, if done with a sense of humor on an informal "test" that is essentially intended for review, there may easily be some positive benefit. If done when a less contaminated measure is desired, the extraneous variance may be expensive. At a minimum, intentional use of clues can be investigated in further studies.

Additional samples of teachers would provide appropriate replication. We would recommend including outcome measures assessing characteristics/quality of teacher-made tests and independent measures for measurement competency, measurement training, experience, etc. Extensions to our instruments could better specify knowledge of teachers' ability and practice in grading, reporting/communicating, sizing up, instructional pacing, and performance testing. Understanding how item characteristics and score distributions should follow from type of objective could also be tested (see Terwilliger, 1989).

An outcome of our profession's lack of emphasis on classroom assessment may be to allow standardized testing to win by default. As noted by Stiggins et al. (1986), laypersons and policymakers maintain that schooling outcomes are measured best and most fairly by standardized paper-and-pencil tests, which severely restricts the variety of outcomes used for accountability. Similarly, research on teaching has also depended excessively on standardized tests to represent school achievement. Such tests are not constructed to be maximally sensitive to instruction (Hanson, McMorris, & Bailey, 1986; Mehrens & Phillips, 1987). Issues concerning, and techniques for assessing, fit between test and curriculum are reviewed by Crocker, Miller, and Franks (1989).

Relationships of published achievement tests with instruction are being examined in more sophisticated ways, and additional research is needed. Such investigations will likely have applicability to local districts and enhance the assessment of student achievement. Teachers, however, develop virtually all the achievement measures on which instructional decisions are based. The current emphasis on studying teachers' testing and assessing is reassuring.


REFERENCES

Airasian, P. W. (1991). Perspectives on measurement instruction. Educational Measurement: Issues and Practice, 10(1), 13-16, 26.

Boothroyd, R. A. (1990). Variables related to the characteristics and quality of classroom tests: An exploratory study with seventh and eighth grade science and mathematics teachers (Doctoral dissertation, The University at Albany, 1990). Dissertation Abstracts International, 51/07A, 2355.

Boothroyd, R. A., McMorris, R. F., & Pruzek, R. M. (1992, April). What do teachers know about measurement and how did they find out? Paper presented at the annual conference of the National Council on Measurement in Education, San Francisco, CA.

Carter, K. (1986). Test-wiseness for teachers and students. Educational Measurement: Issues and Practice, 5(4), 20-23.

Crocker, L. M., Miller, M. D., & Franks, E. A. (1989). Quantitative methods for assessing fit between test and curriculum. Applied Measurement in Education, 2, 179-194.

Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58, 438-481.

Dunn, T. F., & Goldstein, L. G. (1959). Test difficulty, validity, and reliability as a function of selected multiple-choice item construction principles. Educational and Psychological Measurement, 19, 171-179.

Fleming, M., & Chambers, B. (1983). Teacher-made tests: Windows on the classroom. In W. E. Hathaway (Ed.), Testing in the schools (pp. 29-38). New Directions for Testing and Measurement, No. 19. San Francisco: Jossey-Bass.

Guttman, L. (1970). Integration of test design and analysis. In Proceedings of the 1969 invitational conference on testing problems (pp. 53-65). Princeton, NJ: Educational Testing Service.

Haladyna, T. M., & Downing, S. M. (1989a). A taxonomy of multiple-choice item writing rules. Applied Measurement in Education, 2, 37-50.

Haladyna, T. M., & Downing, S. M. (1989b). Validity of a taxonomy of multiple-choice item writing rules. Applied Measurement in Education, 2, 51-78.

Hanson, R. A., McMorris, R. F., & Bailey, J. D. (1986). Differences in instructional sensitivity between item formats and between achievement test items. Journal of Educational Measurement, 23, 1-12.


Hopkins, C. D., & Antes, R. L. (1985). Classroom measurement and evaluation (2nd ed.). Itasca, IL: F. E. Peacock Publishers.

McMorris, R. F., Brown, J. A., Snyder, G. W., & Pruzek, R. M. (1972). Effects of violating item construction principles. Journal of Educational Measurement, 9, 287-295.

Mehrens, W. A., & Phillips, S. E. (1987). Sensitivity of item difficulties to curricular validity. Journal of Educational Measurement, 24, 357-370.

Nitko, A. J. (1983). Educational tests and measurement: An introduction. New York: Harcourt Brace Jovanovich.

Oescher, J., & Kirby, P. C. (1990, April). Assessing teacher-made tests in secondary math and science classrooms. Paper presented at the annual conference of the National Council on Measurement in Education, Boston, MA. (ERIC Document Reproduction Service No. ED 322 169)

Stiggins, R. J., Conklin, N. F., & Bridgeford, N. J. (1986). Classroom assessment: A key to effective education. Educational Measurement: Issues and Practice, 5(2), 5-17.

Terwilliger, J. S. (1989). Classroom standard setting and grading practices. Educational Measurement: Issues and Practice, 8(2), 15-19.

Table 1

Distribution of Classroom Test Items by Subject and Item Format*

                           Percentage from                Percentage from Examination
                           Teacher Self-reports           of Classroom Tests

Item Type                  Mathematics  Science  Total    Mathematics  Science  Total
                           (n=18)       (n=23)   (n=41)   (n=36)       (n=46)   (n=82)

1. Multiple Choice             10          45      30         14          47      33
2. Alternate Response           4          12       8          3          12       8
3. Matching                     3          18      11          3          18      11
4. Short Answer                 9           8       8         13          12      13
5. Completion                  10           5       7         14           5       9
6. Essay                        0           7       4          0           3       2
7. Computation                 60           2      27         41           1      19
8. Other                        5           3       4         10           3       6

*Each teacher submitted two classroom tests for review.

Table 2

Reviewers' Ratings by Subject and Test Domain

                           Number     Mathematics       Science           Total
                           of         (n=36)            (n=46)            (n=82)
Dimension                  items      Mean*    SD       Mean*    SD       Mean*    SD

1. Appearance                 6       5.62     1.22     5.87     1.36     5.76     1.30
2. Directions                 4       5.15     1.20     5.33     1.43     5.25     1.33
3. Length                     2       4.53      .99     5.38     1.17     5.01     1.17
4. Content Sampling           7       5.22      .80     5.31      .89     5.27      .85
5. Item Construction          7       5.61      .44     5.11      .95     5.33      .80
6. Overall Quality            5       5.04     1.00     5.71      .89     5.41      .99
Composite                    31       5.32      .63     5.41      .79     5.37      .72

*Ratings were obtained using a 7-point semantic differential format and were averaged for three-judge panels.

Table 3

Teacher Profiles Based on Their Performance on the MCT, IJT, and Ratings of Classroom Test Quality

                                           Scores on the MCT, IJT, and Ratings of Classroom Test Quality

                                           Below the Medians (n=11)    Above the Medians (n=10)
Variable                                   Mean         SD             Mean         SD

Years of Teaching Experience               10.82        8.99           13.00        9.14
Number of Measurement Courses                .27         .47            1.30        1.34
Self-report Measurement Knowledge(a)        1.45         .82            2.00         .94
Adequacy of Measurement Training(b)         2.09         .94            3.10        1.10

Note. High and low categorizations were based on independent median splits performed on the MCT, IJT, and Ratings of Classroom Test Quality.
(a) Scale: 1 = Poor; 2 = Fair; 3 = Good; 4 = Very Good; 5 = Excellent.
(b) Scale: 1 = Strongly Disagree; 2 = Disagree; 3 = Uncertain; 4 = Agree; 5 = Strongly Agree.

Figure 1

Rated Test Quality Related to IJT and MCT

Overall Classroom Test Quality Ratings
|
|-- Low (n=21)
|     |-- Item Judgment Task Low (n=13)
|     |     |-- Measurement Competency Test Low (n=11)
|     |     |-- Measurement Competency Test High (n=2)
|     |-- Item Judgment Task High (n=8)
|           |-- Measurement Competency Test Low (n=4)
|           |-- Measurement Competency Test High (n=4)
|
|-- High (n=20)
      |-- Item Judgment Task Low (n=7)
      |     |-- Measurement Competency Test Low (n=2)
      |     |-- Measurement Competency Test High (n=5)
      |-- Item Judgment Task High (n=13)
            |-- Measurement Competency Test Low (n=3)
            |-- Measurement Competency Test High (n=10)