Examining Rounding Rules in Angoff-Type Standard-Setting Methods
Adam E. Wyse, Mark D. Reckase
Dec 22, 2015
Mark D. Reckase
Current Projects
• Multidimensional Item Response Theory
– Development of methodology for fine-grained analysis of item response data in high-dimensional spaces. Application of the methodology to gain understanding of the constructs assessed by tests.
• Test Design and Construction
– Design of content and statistical specifications for tests using the philosophy of item response theory. Use of computerized test assembly procedures to match test specifications.
• Portfolio Assessment
– Design of portfolio assessment systems, including formal objective scoring of portfolios.
• Procedures for Setting Standards
– Development and evaluation of procedures for setting standards on educational and psychological tests. Includes extensive work on setting standards on the National Assessment of Educational Progress.
• Computerized Adaptive Testing
– Developing procedures for selecting and administering test items to individuals using computer technology. In particular, designing systems to match item selection to the specific requirements for test use.
Angoff Method
• Panelists estimate the probability that the minimally competent examinee (MCE) would respond correctly to each item.
Modified Angoff Method (1)
• Round each rating to a whole number of score points (Yes/No method)
[Figure panels: Polytomous, Dichotomous]
Modified Angoff Method (2)
• Rate the MCE's score on each cluster of items.
– Round to 1 decimal place
– Round to an integer
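As a rough illustration of the rounding options on this slide, the sketch below compares item-level integer rounding with a single cluster-level rating rounded to 1 decimal place or to an integer. The five probabilities are hypothetical values chosen for illustration, not data from the study.

```python
# Hypothetical MCE probabilities for one cluster of five dichotomous items
# (illustrative values only, not taken from the study).
probs = [0.62, 0.48, 0.71, 0.35, 0.55]

# Item-level integer rounding (Yes/No style): each rating becomes 0 or 1.
item_level = sum(round(p) for p in probs)

# Cluster-level rating: the expected cluster score is judged as a whole
# and rounded only once.
expected = sum(probs)
cluster_1dp = round(expected, 1)
cluster_int = round(expected)

print(item_level, cluster_1dp, cluster_int)
```

In this example the expected cluster score is about 2.71, so rounding once at the cluster level to 1 decimal place stays much closer to it than summing five separately rounded item ratings.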
Modified Angoff Method (3)
• How to aggregate the raters' judgments
– Mean or median (the median limits the effect of outliers)

Mean      Median
18.1667   18.4
20.8833   21
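The effect of an outlier on the two aggregates can be sketched with Python's standard library. The six cut scores below are hypothetical and are not the values behind the table above.

```python
from statistics import mean, median

# Hypothetical cut scores from six panelists; the last one is an outlier.
cut_scores = [18.0, 18.5, 19.0, 18.2, 18.8, 25.0]

group_mean = mean(cut_scores)      # pulled toward the outlier
group_median = median(cut_scores)  # middle of the ordered values

print(group_mean, group_median)
```

With these values the mean is pulled above every non-outlying rating, while the median stays inside the main group of ratings.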
Theoretical Framework
• Reckase (2006)
– Figure panels: round to integer; round to 0.05
– The panelist is assumed to perfectly understand the relation between item difficulty and the cut θ
Theoretical Framework
• Reckase (2006)
– Figure panels: round to 1 decimal place; round to 2 decimal places
Theoretical Framework
• Bias
– Individual panelists' cut scores
– Group-level cut scores: mean or median
• Other evidence for evaluating standard setting
– Correlation between the item ratings provided by panelists and the item p-values
• Cannot detect the panelists' severity
• Errors can be incorporated into the Reckase evaluation approach.
Theoretical Framework
• Assumptions
– Only a single round is considered (no training effect)
– Ratings do not include error (an ideal setting)
• Investigate the impact of the Angoff modifications and rounding rules in the ideal situation.
Data and Method
• NAEP data
– 20 raters, last round
– Each panelist's θ cut score in NAEP was taken as that panelist's intended cut score.
• 2PL
• 3PL
• GPCM: E(X|θ) = 1·P1(θ) + 2·P2(θ) + 3·P3(θ) + 4·P4(θ)
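The expected score above can be sketched for a single polytomous item using the standard form of the generalized partial credit model. The function names and item parameters below are illustrative assumptions. The slides do not give the actual NAEP item parameters, and any scaling constant (such as D = 1.7) is omitted here.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities P0..Pm under the generalized partial credit model.

    `steps` holds the m step parameters b_1..b_m; category 0 has exponent 0.
    """
    exponents = [0.0]
    for b in steps:
        # Cumulative sum of a * (theta - b_v) up to the current category.
        exponents.append(exponents[-1] + a * (theta - b))
    weights = [math.exp(z) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, a, steps):
    """E(X|theta) = sum over k of k * P_k(theta)."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, steps)))

# Hypothetical 5-category item (scores 0 to 4); parameters are illustrative only.
steps = [-1.0, -0.3, 0.4, 1.2]
probs = gpcm_probs(0.0, a=1.0, steps=steps)
score = expected_score(0.0, a=1.0, steps=steps)
print(probs, score)
```

The category-0 term contributes nothing to the expectation, which is why the formula on the slide starts at 1·P1(θ).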
Simulated conditions
• Rounding rules
– Integer: 1.2345 → 1
– Nearest 0.05: 1.2345 → 1.25
– Nearest 2 decimal places: 1.2345 → 1.23
• Item pools
– 180, 107, 109, and 53 items
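The three rounding rules can be sketched as small Python helpers. The function names are hypothetical. Python's built-in `round` resolves exact ties to the nearest even value, and the slides do not say which tie rule the study used, so treat that detail as an assumption.

```python
def round_integer(x):
    """Round a rating to the nearest whole score point."""
    return round(x)

def round_nearest_005(x):
    """Round a rating to the nearest 0.05 by working in units of 0.05."""
    return round(x * 20) / 20

def round_two_decimals(x):
    """Round a rating to 2 decimal places."""
    return round(x, 2)

rating = 1.2345
print(round_integer(rating), round_nearest_005(rating), round_two_decimals(rating))
```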
Simulated conditions
• Individual items vs. clusters of items
• Cut scores
– Basic, Proficient, and Advanced
• Aggregating value
– Mean vs. median
Evaluation Criteria
• Bias
• Average absolute bias
• Bias for the group's intended cut score
– Mean
– Median
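A minimal sketch of how these criteria might be computed, assuming that bias is the estimated θ cut score minus the intended θ cut score (consistent with the worked example in the Discussion, where an estimated θ of -0.438 against an intended θ of 0 gives a bias of -0.438). The group-level definitions and the panelist values below are assumptions for illustration, not the study's formulas or data.

```python
from statistics import mean, median

def bias(estimated_theta, intended_theta):
    """Assumed definition: estimated cut score minus intended cut score."""
    return estimated_theta - intended_theta

# Hypothetical (estimated, intended) theta cut scores for four panelists.
panelists = [(0.12, 0.10), (-0.35, -0.30), (0.58, 0.50), (0.01, 0.05)]
estimated = [est for est, _ in panelists]
intended = [itd for _, itd in panelists]

biases = [bias(est, itd) for est, itd in panelists]
mean_bias = mean(biases)
avg_abs_bias = mean([abs(b) for b in biases])

# Group-level bias: aggregate the cut scores first, then compare.
group_mean_bias = mean(estimated) - mean(intended)
group_median_bias = median(estimated) - median(intended)

print(mean_bias, avg_abs_bias, group_mean_bias, group_median_bias)
```

Positive and negative individual biases partly cancel in the mean bias but not in the average absolute bias, which is why the two are reported separately.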
Results – individual panelists
• Bias by rounding rule: integer > nearest 0.05 > 2 decimal places
Results – individual panelists
• Bias by cut-score location: Advanced > Basic > Proficient
Results – individual panelists
• Bias: individual items > cluster level (cluster-level ratings involve fewer rounding errors)
Results – individual panelists
• Item pool: the 53-item pool has greater bias than the other pools
Results – individual panelists
• Item pool: for the Proficient cut score with integer rounding, bias for the 53-item pool < bias for the 180-item pool.
• What matters is the location of the cut score relative to the distribution of the items.
Results – group of panelists
• In some cases the mean is better; in other cases the median is better.
Results – group of panelists
• Bias was negative for Basic and positive for Proficient and Advanced.
• At the cluster level, bias for Proficient was negative.
Results – group of panelists
• The Advanced cut score produced greater bias than the other two levels.
• The bias did not cancel out across a group of panelists.
Results – group of panelists
• Both the mean and the median bias were below 0.01 when rounding to the nearest 0.05 and to 2 decimal places.
• Again, more test items did not necessarily produce less bias.
Results – group of panelists
• Cluster-level ratings performed better than individual-item ratings.
Impact on Percent Above Cut-score (PAC)
• Find the PAC for the closest value on the NAEP in the pilot study.
• Impact = PAC for the estimated θ - PAC for the intended θ
• Rounding to the nearest 0.05 or the nearest 0.01 produced no change or only minimal impact.
Impact on Percent Above Cut-score (PAC)
• Basic: 5.610 ~ 13.010
• Proficient: -3.823 ~ -4.387
• Advanced: -1.156 ~ -1.262
Impact on Percent Above Cut-score (PAC)
• Basic: 4.490 ~ 14.190
• Proficient: -4.387 ~ -5.346
• Advanced: -1.156 ~ -1.343
Impact on Percent Above Cut-score (PAC)
• Bias: Advanced > Basic and Proficient
• PAC: Advanced < Basic and Proficient
• There are more students near the Basic and Proficient cut scores.
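Why a cut score in the tail of the distribution can show a smaller PAC change can be sketched by assuming a standard normal θ distribution. That distribution is an assumption made only for this illustration (the study used the NAEP distribution), and the cut locations and the bias of +0.10 are hypothetical.

```python
from statistics import NormalDist

THETA_DIST = NormalDist(0.0, 1.0)  # assumed theta distribution, not NAEP's

def pac(theta_cut):
    """Percent of examinees above the cut score under THETA_DIST."""
    return 100.0 * (1.0 - THETA_DIST.cdf(theta_cut))

def pac_impact(estimated_theta, intended_theta):
    """PAC for the estimated theta minus PAC for the intended theta."""
    return pac(estimated_theta) - pac(intended_theta)

# The same hypothetical bias (+0.10) applied at a central and an extreme cut.
central = pac_impact(0.10, 0.0)
extreme = pac_impact(2.10, 2.0)
print(central, extreme)
```

Because far more examinees sit near the central cut, the same shift in θ moves a larger share of them across the cut there than in the tail.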
Impact on Percent Above Cut-score (PAC)
• Rounding to the integer does not present a viable alternative in the Angoff method.
Discussion
• Rounding to an integer can affect the cut scores.
– Rating at the cluster level can mitigate the bias, but some bias remains.
• Using more test items will not necessarily produce less bias.
– What matters is the location of the items relative to the intended cut score.
Discussion
• 10 items, located in [-2, +2]
• Intended cut score: θ = 0
– 5 items rounded to score 1
– 5 items rounded to score 0
• Cut total score = 5 → θ = 0
• Bias = 0
Discussion
• 20 items, located in [-1, +3]
• Intended cut score: θ = 0
– 5 items rounded to score 1
– 15 items rounded to score 0
• Cut total score = 5 → θ = -0.438
• Bias = -0.438
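The two examples above can be sketched in code under stated assumptions: a Rasch model (discrimination fixed at 1) and item difficulties evenly spaced over each interval. The slides do not give the actual item parameters, so the sketch shows the direction of the bias and will not necessarily reproduce the value -0.438.

```python
import math

def p_correct(theta, b, a=1.0):
    """2PL probability of a correct response (a = 1 gives the Rasch model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, difficulties):
    """Test characteristic curve: expected total score at theta."""
    return sum(p_correct(theta, b) for b in difficulties)

def theta_for_score(score, difficulties, lo=-6.0, hi=6.0, tol=1e-10):
    """Invert the TCC by bisection (the TCC increases with theta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tcc(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def integer_rounding_bias(difficulties, intended_theta):
    """Bias in theta when each item probability is rounded to 0 or 1."""
    ratings = [round(p_correct(intended_theta, b)) for b in difficulties]
    cut_score = sum(ratings)
    return theta_for_score(cut_score, difficulties) - intended_theta

def evenly_spaced(lo, hi, n):
    """Hypothetical item difficulties, evenly spaced from lo to hi."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

bias_symmetric = integer_rounding_bias(evenly_spaced(-2.0, 2.0, 10), 0.0)
bias_shifted = integer_rounding_bias(evenly_spaced(-1.0, 3.0, 20), 0.0)
print(bias_symmetric, bias_shifted)
```

When the items are centered on the intended cut score, the rounded-up and rounded-down ratings balance and the bias is essentially zero. When most items lie above the cut, more ratings are rounded down, and the recovered θ falls below the intended value even though the pool is larger.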
Discussion
• Use the OIB from the Bookmark method to roughly arrange for half of the items to lie above the cut score.
– It is impossible to know the location of the cut score.
– The intended cut scores differ across panelists, so some panelists must have bias.
• With multiple cut scores, at least one of the cut scores would produce bias.
• Rounding to an integer presents many potential problems.
Discussion
• Challenge: in real situations, panelists are not completely consistent in their judgments.
– Feedback is helpful for reducing rater inconsistency in NAEP.
• Further development
– Examine the bias at the group level
Thank you for your attention