Page 1

Interpreting Assessment Results using Benchmarks

Program Information & Improvement Service Mohawk Regional Information Center

Madison-Oneida BOCES

Page 2

Why Use Benchmarks?

Benchmarks are useful for comparing the results of individual students or schools to a larger population of students who took the same assessment at the same time.

Page 3

What are Benchmarks?

Each benchmark represents a sample group of students who performed similarly on a given assessment.

Page 4

Benchmarks, cont.

Benchmarks for any given assessment are set at selected points along the overall performance scale: specifically, at the Low, Average, and High performance levels.

Page 5

Benchmarks, cont.

A “Low” performance is equated with Level 2;

“Average” performance with Level 3;

“High” performance with Level 4.

Page 6

Benchmarks, cont.

An analysis is done to determine how students who performed at these key points achieved on the standards, i.e.:

Standard (SPI) 1, 2, & 3 for ELA;

Key Idea (KPI) 1, 2, 3, 4, 5, 6, & 7 for Math.
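For reference, the standards and key ideas listed above can be written down as a simple data structure. A minimal sketch in Python (field names are illustrative; the groupings mirror this slide):

# Standards and key ideas analyzed for each subject, as listed on this slide.
analyzed_measures = {
    "ELA":  {"measure": "Standard (SPI)", "numbers": [1, 2, 3]},
    "Math": {"measure": "Key Idea (KPI)", "numbers": [1, 2, 3, 4, 5, 6, 7]},
}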

Page 7

Benchmarks, cont.

The best way to determine these points is to select the lowest scale score associated with Low (Level 2), Average (Level 3) and High (Level 4) performance levels.
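As a concrete illustration (a minimal sketch in Python, using hypothetical cut scores; the actual cut points come from NYSED), each benchmark point is simply the cut score itself, since the cut score is the lowest scale score that earns the level:

# Hypothetical cut scores for one assessment (level -> lowest scale score).
cut_scores = {2: 650, 3: 670, 4: 702}

# Each benchmark point is the lowest scale score in its level, i.e., the cut score.
benchmark_points = {
    "Low (Level 2)": cut_scores[2],
    "Average (Level 3)": cut_scores[3],
    "High (Level 4)": cut_scores[4],
}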

Page 8

Benchmarks, cont.

Finding enough students who achieve these exact scale scores enables a “benchmark profile” to be constructed.

To do this, MORIC uses regional data from all of the students in the 52 school districts it serves.
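A minimal sketch of this step, assuming a hypothetical list of anonymous student records with scale scores and per-standard scores (the record layout is illustrative, not MORIC's actual data format):

# Hypothetical regional records: one dict per student.
regional_students = [
    {"scale_score": 670, "spi": {1: 0.82, 2: 0.74, 3: 0.69}},
    {"scale_score": 702, "spi": {1: 0.95, 2: 0.91, 3: 0.88}},
    {"scale_score": 670, "spi": {1: 0.78, 2: 0.80, 3: 0.71}},
    # ...in practice, roughly 6,000-6,500 students take any given assessment
]

def benchmark_group(students, benchmark_score):
    """Students whose overall scale score exactly equals the benchmark score."""
    return [s for s in students if s["scale_score"] == benchmark_score]

level3_group = benchmark_group(regional_students, 670)  # typically 50-100 students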

Page 9

Benchmarks, cont.

Typically, 50-100 students achieve the exact scale scores representing each of the benchmarks.

Page 10

Benchmarks, cont.

These students come from among the 6,000 to 6,500 students within the districts served by the Mohawk RIC who take any given assessment at the same time.

Page 11

Benchmarks, cont.

The students from the benchmark groups are anonymously selected and their SPI or KPI scores are analyzed. That is, the scores for all students within a given benchmark group are averaged.
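The averaging step might look like the following sketch, continuing the hypothetical record layout from the earlier example: the per-standard (SPI/KPI) scores of everyone in a benchmark group are averaged to form that level's profile.

from statistics import mean

def benchmark_profile(group):
    """Average each SPI/KPI score across all students in a (non-empty) benchmark group."""
    standards = group[0]["spi"].keys()
    return {std: mean(s["spi"][std] for s in group) for std in standards}

# e.g., benchmark_profile(level3_group) -> {1: 0.80, 2: 0.77, 3: 0.70}  (illustrative numbers)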

Page 12

Benchmarks, cont.

Because any given assessment is built from items of unequal difficulty, it turns out that students who receive identical scale scores tend to answer the questions in nearly the same way.

This is how “benchmark profiles” for each proficiency level are determined.

Page 13

Benchmarks, cont.

Once it is known how the benchmark groups performed on each learning standard or key idea, there is a relevant context for comparing individual or school scores.
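In practice, the comparison can be as simple as lining a school's average standard scores up against each benchmark profile. A minimal sketch, with hypothetical profile and school values:

# Hypothetical benchmark profiles (per SPI) and one school's averages.
profiles = {
    "Level 2": {1: 0.55, 2: 0.50, 3: 0.48},
    "Level 3": {1: 0.72, 2: 0.68, 3: 0.65},
    "Level 4": {1: 0.90, 2: 0.88, 3: 0.86},
}
school = {1: 0.75, 2: 0.62, 3: 0.67}

for spi, score in school.items():
    context = ", ".join(f"{level}: {prof[spi]:.2f}" for level, prof in profiles.items())
    print(f"SPI {spi}: school {score:.2f} vs. benchmarks ({context})")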

Page 14

FAQs About Benchmarks

Page 15

FAQ 1: Where do benchmark groups come from?

Within the 52-district region comprising MORIC, there are about 6,000 to 6,400 students who take any given state assessment at grades 4 and 8. Benchmark groups come from this large group of students.

Page 16

FAQ 2: What exactly is a benchmark?

Each year the benchmarking procedure identifies groups of children who score EXACTLY at a scale score cut-point.

Page 17

FAQ 2, cont.

For example, the lowest scale score to earn a Level 4 is designated "Benchmark Level 4", and this represents a group of children who have achieved an advanced level of proficiency. More importantly, these children represent those who scored at the exact "cut off" for Level 4.

Page 18

FAQ 2, cont.

For each assessment, the New York State Education Department establishes the scale score cut-offs (also called “cut scores” or “cut points”).

Page 19

FAQ 3: How many benchmark groups are there?

Benchmark Level 2 group (low, not proficient)

Benchmark Level 3 group (average proficiency)

Benchmark Level 4 group (advanced proficiency)

Page 20

FAQ 4: Why use benchmarks?

Children within a given benchmark group tend to answer items on the assessment in the same way. Thus, comparing a test item result for a particular school against the benchmark group provides a relevant context for interpreting results.
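For an item-level comparison, a sketch along these lines (with made-up percent-correct values) puts the school's result for each item next to the benchmark group's result:

# Hypothetical per-item percent correct for one school and the Level 3 benchmark group.
school_items = {"item_1": 0.81, "item_2": 0.44, "item_3": 0.66}
benchmark_items = {"item_1": 0.78, "item_2": 0.63, "item_3": 0.60}

for item, school_pct in school_items.items():
    diff = school_pct - benchmark_items[item]
    flag = "below benchmark" if diff < 0 else "at or above benchmark"
    print(f"{item}: school {school_pct:.0%}, benchmark {benchmark_items[item]:.0%} ({flag})")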

Page 21

FAQ 4, cont.

Since not all test items are of the same difficulty level, this can help to discriminate between test item results within a given sub-skill. It then aids in identifying where instruction could be improved.

Page 22

FAQ 5: How big are benchmark groups?

There are generally 50-100 students who comprise a given benchmark group within the MORIC region. These children's item scores are anonymously pulled together to form the benchmark group.

Page 23

FAQ 5, cont.

We don't know which schools these children come from, but it is not important where they come from. What is important is that these students have scored the same exact overall scale score.

Page 24

FAQ 6: Why analyze students who got the same scale score?

The tests measure overall achievement of learning standards and key ideas. The cut points for scale scores demarcate distinct levels of performance. Therefore, students who scored at the same exact scale score achieved the same level of performance. They also tend to answer questions on the test in the same way.

Page 25

FAQ 7: How reliable are regional benchmarks?

MORIC staff work with other statewide educational data analysts through the New York State School Analyst Group (DATAG) to see how MORIC’s benchmarks compare to those from other regions.

Page 26

FAQ 7, cont.

In five years of state data, there have been no statistically significant differences between MORIC’s benchmarks and those from other regional groups.

Because of the way the state assessments are designed, statistically significant differences are not anticipated, either.
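The slides do not say which statistical test is used; one simple way to make such a comparison, sketched below with hypothetical profile values, is a paired t-test across the standards of two regions' benchmark profiles (scipy is assumed to be available):

from scipy import stats

# Hypothetical Level 3 benchmark profiles (average SPI scores) from two regions.
moric_profile = [0.72, 0.68, 0.65]
other_region_profile = [0.70, 0.69, 0.66]

t_stat, p_value = stats.ttest_rel(moric_profile, other_region_profile)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # a large p-value suggests no significant difference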

Page 27

FAQ 7, cont.

When SED releases the benchmark values for statewide results (usually around one year after the assessments are administered), these are also found not to be significantly different from those of the MORIC region.

Page 28

FAQ 7, cont.

Therefore, as long as the benchmark groups remain reasonably large (greater than 25 students), basing the benchmark groups upon regional results from all 52 MORIC districts is a defensible procedure.

Page 29

Data Do’s and Don’ts

Page 30

Do:

Consider these results in the context of other related data (e.g., classroom work and other assessments, such as the Terra Novas, Milestones, or TONYSS).

Page 31

Do:

Use the findings as “conversation starters,” not “conversation enders.”

Good analysis of data provides questions to be discussed jointly by administrative and instructional teams.

Page 32

Do:

Make lists of questions generated by the data for the data analyst, staff developers, & the students.

Page 33

Do:

Remember that the tests are a “snapshot” of achievement at a given point in time, not the total view.

Page 34

Don’t:

Make major programmatic decisions on the basis of any one data analysis finding. It is statistically unsound to do so.

Page 35

Don’t:

Read too much into the result of a single test question. Place more trust in the “broader measure” (i.e., the sub-skill results and the SPI/KPI) than in the “smaller, narrower measure.” It is more statistically sound to rely upon the bigger, broader measure.

Page 36

Tips for Interpreting Assessment Results

Page 37

Tip #1: Ask Questions

What should be done instructionally?

What should/should not be done with the curriculum?

Are there non-instructional factors (such as school culture, attendance, etc.) affecting student achievement?

Page 38

Tip #2: Validate

Use multiple measures for making programmatic or instructional changes. The state assessment is one measure of student achievement in a given subject area; use other sources of evidence about student performance as well.

Page 39

Tip #3: Examine

The best way to improve overall performance is to examine all of the curriculum content related to a given sub-skill or standard/key idea.

Page 40

Tip #4: Focus

Focus program improvements around the full breadth of content within that sub skill area, standard, or key idea.

Page 41

Tip #5: But, don’t limit!

Do not overemphasize any one sub-skill in a single year. State assessments contain questions assessing a student’s knowledge of a number of sub-skills. A sub-skill measured one year may not be assessed the following year.

Page 42

Contact:

Maria Fallacaro

Educational Data Analyst

[email protected]

(315) 361-5552

www.moric.org

Click on “Data Analysis”