OVERVIEW OF CRITERIA FOR EVALUATING EDUCATIONAL ASSESSMENT


The BIG 3:

Reliability (Chapter 3)

Validity (Chapter 4)

Absence-of-Bias (Chapter 5)

Reliability of Assessment

Chapter 3 (pp. 61-82), W. James Popham

RELIABILITY of assessment:

Standardized tests are evaluated in part by their reliability. However, an individual’s performances and responses vary from one occasion to another, even under controlled conditions.

“An individual’s score and the average score of a group will always reflect at least a small amount of measurement error” (qtd. in American Educational Research Association, 1999).

RELIABILITY = CONSISTENCY

1. STABILITY RELIABILITY: consistency of test results over time.

Test students on one occasion, wait a week or two, and test them again using the same assessment.

Compare the scores from the two testing occasions to determine the test’s stability. (See Table 3.2, p. 65.) (However, teachers don’t normally determine the reliability of their own classroom tests unless they’re doing research, evaluating an end-of-term test, etc.)
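In practice, the stability check boils down to correlating the two sets of scores. A minimal sketch, assuming the hypothetical score lists below (the deck itself gives no numbers):

```python
# Test-retest (stability) reliability: the Pearson correlation between
# two administrations of the same test, a week or two apart.
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for the same ten students on the two occasions
first_administration = [72, 85, 90, 64, 78, 88, 95, 70, 82, 76]
second_administration = [70, 88, 87, 66, 80, 85, 93, 73, 79, 78]

stability = correlation(first_administration, second_administration)
print(f"Stability coefficient: {stability:.2f}")
```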

RELIABILITY = CONSISTENCY

“In general, if you construct your own classroom tests WITH CARE, those tests will be sufficiently reliable for the decisions you will base on the tests’ results” (Popham 75).

“In short, you need to be at least knowledgeable about the fundamental meaning of reliability, but I do not suggest you make your own classroom tests pass any sort of reliability muster” (inspection) (Popham 75).

RELIABILITY = CONSISTENCY

2. ALTERNATE-FORM RELIABILITY: Using two different test forms (Form A, Form B) that are allegedly equivalent.

To determine the alternate-form consistency, give both test forms to the same individuals.

Compare the students’ scores from the two test forms, and decide on a level of performance that you would consider to be passing.

Commercially published or state-developed tests should have evidence to support that their alternate test forms are indeed equivalent (67).
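Alternate-form evidence can be checked in much the same way: correlate the same students’ scores on the two forms and compare the forms’ averages. A minimal sketch with hypothetical Form A and Form B scores (none appear in the deck):

```python
# Alternate-form reliability: correlate the same students' scores on
# Form A and Form B, and compare the form means as a rough check that
# the two forms are equivalent in difficulty.
from statistics import correlation, mean  # correlation needs Python 3.10+

form_a = [78, 85, 62, 90, 74, 88, 69, 81]  # hypothetical Form A scores
form_b = [75, 88, 65, 87, 76, 85, 71, 80]  # same students on Form B

print(f"Alternate-form coefficient: {correlation(form_a, form_b):.2f}")
print(f"Difference in form means: {mean(form_a) - mean(form_b):.1f} points")
```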

RELIABILITY = CONSISTENCY

3. INTERNAL CONSISTENCY RELIABILITY: Are the items in a test doing their measurement job in a consistent manner?

The internal items on a test should be homogeneous, or designed to measure a single variable such as “reading achievement” or “mathematical problem solving” (69).

“The more items on an educational assessment, the more reliable it will tend to be” (69). (Example: a 100-item mathematics test will give you a more reliable fix on a student’s ability in math than a 20-item test.)
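One standard way to quantify internal consistency is Cronbach’s alpha; the deck doesn’t name a specific coefficient, so treat this as an illustrative sketch with made-up item scores:

```python
# Cronbach's alpha, a common internal-consistency index: compares the
# summed variances of the individual items to the variance of total scores.
from statistics import pvariance

# Hypothetical right/wrong (1/0) item scores: five students, four items
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

k = len(scores[0])  # number of items
item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
total_variance = pvariance([sum(row) for row in scores])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")
```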

STANDARD ERROR OF MEASUREMENT (SEM)

An estimate of how consistent an individual’s scores would be if he or she took the same assessment again and again.

The higher the reliability of the test, the smaller the SEM will be (72).

(See the formula for SEM on p. 73).
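The formula itself (Popham, p. 73) is the standard one: the test’s score standard deviation scaled by the square root of one minus its reliability. A minimal sketch with hypothetical numbers:

```python
# SEM = s_x * sqrt(1 - r_xx), where s_x is the standard deviation of the
# observed scores and r_xx is the test's reliability coefficient. Note how
# higher reliability shrinks the SEM, as the slide above says.
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1 - reliability)

# Hypothetical values: SD of 10 score points at two reliability levels
print(f"{standard_error_of_measurement(10, 0.91):.1f}")  # r = .91 -> SEM 3.0
print(f"{standard_error_of_measurement(10, 0.99):.1f}")  # r = .99 -> SEM 1.0
```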

STANDARD ERROR OF MEASUREMENT (SEM)

The SEM is linked to the way students’ performances on state accountability tests are reported:

(Exemplary, Exceeds Standards, Meets Standards, Approaching Standards, Academic Warning)

(Advanced, Proficient, Basic, Below Basic) (74).

What do teachers really need to know about RELIABILITY?

Reliability is a central concept in the measurement of an assessment’s consistency (75).

•Teachers may be called on to explain to parents the meaning of a student’s important test scores, and the reliability of the test is at stake.

•Teachers should be knowledgeable about the three types of reliability evidence. Test manuals should supply that reliability evidence.

•Popham doesn’t think teachers need to devote time to calculating the reliability of their own tests; however, teachers DO need to have a general knowledge about reliability and why it’s important (77).

Validity

Chapter 4 (pp. 83-110), W. James Popham

“VALIDITY is the most significant concept in assessment”

However, according to Popham, “There is no such thing as a ‘valid test’” (85). Rather, validity centers on the accuracy of the inferences that teachers make about their students through evidence gathered formally or informally (87).

Popham suggests that teachers focus on score-based and test-based inferences and make sure they are accurate / valid (87).

VALIDITY EVIDENCE

1. CONTENT RELATED: The content standards, or skills and knowledge that need to be mastered, are represented on the assessment.

Teachers should focus their instructional efforts on content standards or curriculum aims.

VALIDITY EVIDENCE

2. CRITERION RELATED: An assessment that is used to predict how well a student will perform at some later point. An example is the relationship between students’ scores on (1) an aptitude test such as the SAT or the ACT taken in high school and (2) their later performance in college, which those scores are meant to predict (97).

However, Popham says that “these tests are far from perfect.” In fact, he suggests that “25% of students’ college grades can be explained by their scores on aptitude tests. Other factors such as motivation and study habits account for the other 75%” (98).
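An arithmetic note: “explaining 25% of grades” is the usual variance-explained reading, which corresponds to a predictor-criterion correlation of about 0.50, since the variance explained is the square of the correlation:

```python
# Variance-explained arithmetic: a predictor-criterion correlation of
# r = 0.50 explains r**2 = 0.25, i.e., 25% of the variance in grades.
r = 0.50  # hypothetical correlation between aptitude scores and grades
print(f"Variance explained: {r ** 2:.0%}")  # -> 25%
```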

VALIDITY EVIDENCE

3. CONSTRUCT RELATED: The test items are accurately measuring what they are supposed to be measuring; the construct, such as a student’s ability to write an essay, is accurately assessed.

What do teachers really need to know about VALIDITY?

Remember that it is not the ‘test’ that is valid; validity lies in the score-based inferences. “And because score-based inferences reflect judgments made by people, some of those judgments will be in error” (Popham 106).

•Again, Popham doesn’t think teachers should worry about gathering validity evidence; however, they DO need to have a reasonable understanding of the three types of validity evidence. They must be especially concerned about content-related evidence.

Absence-of-Bias

Chapter 5 (pp. 111-137), W. James Popham

“Assessment bias” is:

Any element in an assessment procedure that offends or unfairly penalizes students because of personal characteristics;

These characteristics include students’ gender, race, ethnicity, socioeconomic status, religion, etc. (111).

Examples of “offensive” content on test items:

Only males are shown in high-paying and prestigious positions (attorneys, doctors), while women are portrayed in low-paying, less prestigious positions (housewives, clerks).

Word problems are based on competitive sports, using mostly boys’ names, suggesting that girls are less skilled.

Examples of “unfair penalization” on test items:

Problem-solving items that deal with attending operas and symphonies may advantage children from affluent families over lower-socioeconomic students who may lack these experiences (113).

CHILDREN WITH DISABILITIES AND FEDERAL LAW

1975: The Education for All Handicapped Children Act (Public Law 94-142) introduced the “Individualized Education Program” (IEP).

1997: The Individuals with Disabilities Education Act (IDEA).

2002: Special education was significantly altered by No Child Left Behind (NCLB). NCLB required adequate yearly progress (AYP) not only for students overall, but for subgroups including children with disabilities (123).

ENGLISH LANGUAGE LEARNERS (ELL)

Students whose first language is not English and who know very little English, if any;

Students who are beginning to learn English;

Students who are proficient in English but need additional assistance.

Popham states, “To provide equivalent tests in English and all other first languages spoken by students is an essentially impossible task” (130).

ASSESSMENT MODIFICATIONS?

Today’s federal laws oblige teachers to test ELL students and students with disabilities on the same curricular goals as all other children.

Assessment accommodations or modifications are neither completely satisfying nor completely accurate.

“Absence-of-bias is especially difficult to attain in cases of ELL students and students with disabilities” (Popham 133).

