Mfrm to Adjust for Rater Severity Leniency


1/13

Sultan Qaboos University

Language Centre

MFRM TO ADJUST FOR RATER SEVERITY/LENIENCY

Presentation for the LC Conference

by

Farah Bahrouni

April 20, 2011



2/13

Plan

Briefing about MFRM

Run the analysis for 5 facets: candidate, rater, background, experience & category

Adjusting scores as per FACETS estimates

Conclusion



3/13

Student 1

Maximum marks: TA 25, CC 25, LR 25, GR 25, Total 100

         TA         CC         LR         GR         Total
Mean     19.62132   19.38971   18.20956   16.45588
Max      25         24         23         22         94
Min      14         13         14         10         51
Range    11         11         9          12         43
Count    68         68         68         68

Student 2

         TA         CC         LR         GR         Total
Max      25         25         25         24         99
Min      14         13         12         11         50
Range    11         12         13         13         49

Student 3

         TA         CC         LR         GR         Total
Max      25         23         20         24         92
Min      10         10         8          11         39
Range    15         13         12         13         53




4/13

Assessment of language proficiency: Speaking/Writing subjectivity

A number of distinct factors directly or indirectly impinge upon the assessment/measurement outcomes.

These factors are referred to as facets.



5/13

A facet has been defined as

Any factor, variable, or component [e.g. examinees, tasks, raters, interviewers, etc.] of the measurement situation that is assumed to affect test scores in a systematic way.

(Bachman, 2004; Linacre, 2002; Wolfe & Dobria, 2008, cited in Eckes, 2009: 2)



6/13

The error-prone nature of most measurement facets raises serious concerns about both the reliability and validity of the obtained scores.



7/13

The usual approaches to dealing with rater variability include:

rater training

using 2 or more raters in the scoring of performance assessment

calling for an adjudicator (a 3rd or 4th rater, usually more experienced/senior/expert)

developing rubrics that spell out the proficiency levels

identifying anchor papers to provide concrete examples of each proficiency level

(for details see Johnson et al., 2005, 2003, 2001, 2000)



8/13

Nevertheless, research has found that none of these methods is effective enough to guarantee reliable, objective scores.

The methods are diverse enough to raise questions about the quality of the resolved scores.

Underlying these resolution models is the common assumption that the discrepant scores might lack the requisite levels of reliability and validity, and that adjudication might improve this deficit to some extent (Johnson et al., 2005: 123).



9/13

As for rater training, it has been found that even with proper training, substantial differences between raters persist.

(Linacre, 1990; Hamp-Lyons, 1991; Weigle, 1994, 1998, 2002; Lumley & McNamara, 1995; McNamara, 1996; Lumley, 2005)

Rater differences are reduced by training, but they do persist. (McNamara, 1996: 118)

Reason:

Some see severity much as a personality trait that is inherently brought to any rating situation.

(Myford et al., 2003)



10/13

The Multi-facet Rasch Model (MFRM) provides a rich set of highly flexible tools to account for, and compensate for, measurement error, especially rater-dependent measurement error.

It is an extension of the basic Rasch model that incorporates more facets than the 2 usually included in dichotomous item tests, i.e. candidates and items.
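The extension can be written out for a three-facet case (candidate, item, rater). The form below is the commonly cited rating scale formulation. The symbols are assumptions chosen for illustration and do not come from the slides.

```latex
\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
\]
% P_{nijk}     : probability that candidate n receives category k from rater j on item i
% P_{nij(k-1)} : probability of the adjacent lower category k-1
% \theta_n     : ability of candidate n (logits)
% \delta_i     : difficulty of item i
% \alpha_j     : severity of rater j
% \tau_k       : difficulty of the step from category k-1 to category k
```

Dropping the rater term and using a single step gives back the basic dichotomous Rasch model with its 2 facets. Further facets would enter as additional subtracted terms on the right-hand side.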



11/13

Multifaceted Rasch measurement is a stochastic model; the analysis is run using FACETS, a computer program developed by Linacre (1989).

Candidate ability is estimated from all ratings given by all raters on all items (Lunz & Wright, 1997; McNamara, 1996: 132).

Item difficulty (TA, CC, LR & GR) is estimated from all responses across all candidates to that item (ibid).

Rater severity is estimated from all ratings given across all candidates and items (ibid).
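Once ability, difficulty and severity are on the same logit scale, their effect on a rating can be illustrated with a minimal Python sketch. It assumes a rating scale formulation with one shared set of category thresholds. The function names and all numeric values are hypothetical, and this is not the FACETS estimation procedure itself. The "adjusted" value assumes rater severities are centred on 0 logits, so a severity of 0 stands for an average rater.

```python
import math

def category_probabilities(ability, difficulty, severity, thresholds):
    """P(score = k), k = 0..len(thresholds), for one candidate/item/rater.

    All arguments are in logits; thresholds[k - 1] is the step from k - 1 to k.
    """
    # Log-numerators: running sums of (ability - difficulty - severity - threshold).
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + ability - difficulty - severity - tau)
    top = max(log_num)  # subtract the max before exponentiating, for stability
    weights = [math.exp(v - top) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(ability, difficulty, severity, thresholds):
    """Model-expected rating for the given parameter values."""
    probs = category_probabilities(ability, difficulty, severity, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical values, for illustration only (not estimates from the slides).
thresholds = [-1.0, 0.0, 1.0]   # a 4-category scale scored 0..3
ability, difficulty = 0.5, 0.0

severe = expected_score(ability, difficulty, 1.0, thresholds)     # severe rater
lenient = expected_score(ability, difficulty, -1.0, thresholds)   # lenient rater
adjusted = expected_score(ability, difficulty, 0.0, thresholds)   # average rater
print(f"severe: {severe:.2f}, lenient: {lenient:.2f}, adjusted: {adjusted:.2f}")
```

The same candidate is expected to score lower with the severe rater and higher with the lenient one. Evaluating the model at average severity gives a score that no longer depends on which rater the candidate happened to meet.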



12/13


In addition, MFRM has 2 more very informative functions:

Bias analysis

Fit analysis

These 2 functions enable researchers to look at

how individual raters, ratees, or traits included in the analysis are performing (fit analysis: z-score values between +2 and -2 are usually accepted in contexts similar to ours)

how the individual elements within the facets interact, i.e. the individual-level effects of the various elements (bias analysis: z-score values between +2 and -2)

Thus, the source(s) of variation in the scores are efficiently determined. (Myford et al., 2003; Lunz & Wright, 1997)
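The +2/-2 band can be illustrated with the standardized residual of a single rating, which is the building block that fit statistics aggregate over many ratings. The Python sketch below is a simplified illustration under a rating scale formulation, not the statistic FACETS itself reports. The function names and numeric values are hypothetical.

```python
import math

def category_probabilities(measure, thresholds):
    """P(score = k) for k = 0..len(thresholds).

    `measure` is ability - difficulty - severity, in logits.
    """
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + measure - tau)
    top = max(log_num)  # subtract the max before exponentiating, for stability
    weights = [math.exp(v - top) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def standardized_residual(observed, measure, thresholds):
    """(observed - expected) / model standard deviation for one rating."""
    probs = category_probabilities(measure, thresholds)
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    return (observed - expected) / math.sqrt(variance)

# Hypothetical values: an able candidate (measure = 2 logits) on a 0..3 scale.
thresholds = [-1.0, 0.0, 1.0]
z_expected = standardized_residual(3, 2.0, thresholds)    # top score, as predicted
z_surprise = standardized_residual(0, 2.0, thresholds)    # bottom score, unexpected

for label, z in [("score 3", z_expected), ("score 0", z_surprise)]:
    flag = "within" if abs(z) <= 2 else "outside"
    print(f"{label}: z = {z:.2f} ({flag} the +2/-2 band)")
```

A rating close to what the model predicts stays inside the band. A rating far from the prediction falls outside it and points to a rater, ratee, or trait worth a closer look.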


13/13

Conclusion

Owing to the above features, MFRM has been found to be a model with great potential to improve our capacity to produce objective measures of the ability of test takers in performance assessment contexts. It is practical and can be used in our context along with pair rating.

(Linacre et al., 1990; Engelhard, 1991, 1992, 1994, 1996; Engelhard & Myford, 2003; Hamp-Lyons, 1991; Lunz, 1996, 1997a, 1997b; Lunz & Wright, 1997; Weigle, 1994, 1998, 2002; Schaefer, 2003, 2008; Kondo-Brown, 2002; Lumley & McNamara, 1995; Lumley, 2005; McNamara, 1991, 1996, 1997, 2000, 2002, 2008; McNamara & Roever, 2006; Myford et al., 2003, 2004; Shaw & Weir, 2007; Wigglesworth, 1993, 1994).

