7/28/2019 3.3.interrater
1/16
Funded through the ESRC's Researcher Development Initiative
Prof. Herb Marsh
Ms. Alison O'Mara
Dr. Lars-Erik Malmberg
Department of Education, University of Oxford
Session 3.3: Inter-rater reliability
Establish research question
Define relevant studies
Develop code materials
Locate and collate studies
Pilot coding; coding
Data entry and effect size calculation
Main analyses
Supplementary analyses
Interrater reliability
Aim of the co-judge procedure, to discern:
Consistency within coder
Consistency between coders
Take care when making inferences based on little information
Phenomena impossible to code become missing values
Interrater reliability
Percent agreement: common but not recommended
Cohen's kappa coefficient: kappa is the proportion of the optimum improvement over chance attained by the coders; 1 = perfect agreement, 0 = agreement no better than that expected by chance, -1 = perfect disagreement. Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this guideline)
Correlation between different raters
Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)
Interrater reliability of categorical IV (1)
Percent exact agreement = (number of observations agreed on) / (total number of observations)
Categorical IV with 3 discrete scale-steps
9 ratings the same
% exact agreement = 9/12 = .75
Study Rater 1 Rater 2
1 0 0
2 1 1
3 2 1
4 1 1
5 1 1
6 2 2
7 1 1
8 1 1
9 0 0
10 2 1
11 1 0
12 1 1
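As a sketch (in Python rather than the SPSS/SAS used elsewhere in these slides), percent exact agreement for the table above can be computed as:

```python
# Ratings of the 12 studies by the two raters, from the table above.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Count the studies on which both raters assigned the same category.
agreed = sum(a == b for a, b in zip(rater1, rater2))
pct_exact_agreement = agreed / len(rater1)
print(agreed, pct_exact_agreement)  # 9 0.75
```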
Interrater reliability of categorical IV (2): unweighted Kappa

K = (P_O - P_E) / (1 - P_E)

P_O = (2 + 6 + 1) / 12 = .75
P_E = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
K = (.750 - .451) / (1 - .451) = .544

Agreement matrix:
                 Rater 1
              0    1    2    Sum
Rater 2  0    2    0    0    2
         1    1    6    0    7
         2    0    2    1    3
Sum           3    8    1    12

Kappa:
Positive values indicate how much the raters agree over and above chance alone
Negative values indicate disagreement
If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading
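A minimal Python sketch of the same calculation, using the two raters' codings from the earlier table:

```python
from collections import Counter

# Same ratings as in the agreement matrix above.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]
n = len(rater1)

# Observed agreement: proportion of identical ratings.
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance-expected agreement: products of the raters' marginal counts.
m1, m2 = Counter(rater1), Counter(rater2)
p_e = sum(m1[c] * m2[c] for c in set(m1) | set(m2)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.544
```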
Interrater reliability of categorical IV (3): unweighted Kappa in SPSS

Symmetric Measures
                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa   .544    .220                   2.719          .007
N of Valid Cases              12
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT= AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS= COUNT
  /COUNT ROUND CELL .
Interrater reliability of categorical IV (4): Kappas in irregular matrices
If rater 2 is systematically above rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to fill up with zeros.

                 Rater 1
              1    2    3    Sum
Rater 2  2    4    1    0    5
         3    3    6    1    10
         4    0    3    7    10
Sum           7    10   8    25

K = .51 (misleading)

                 Rater 1
              1    2    3    4    Sum
Rater 2  1    0    0    0    0    0
         2    4    1    0    0    5
         3    3    6    1    0    10
         4    0    3    7    0    10
Sum           7    10   8    0    25

K = -.16
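The effect of padding can be sketched in Python. `kappa_from_table` is a hypothetical helper (not from the slides) that computes unweighted kappa from a square agreement table:

```python
# Hypothetical helper: Cohen's kappa from a square agreement table
# (rows = rater 2, columns = rater 1, categories in the same order on both axes).
def kappa_from_table(table):
    n = sum(map(sum, table))
    p_o = sum(table[i][i] for i in range(len(table))) / n
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    p_e = sum(r * c for r, c in zip(rows, cols)) / n**2
    return (p_o - p_e) / (1 - p_e)

# The slide's misaligned table, padded with zeros to the full 1-4 category set.
padded = [
    [0, 0, 0, 0],  # rater 2 = 1
    [4, 1, 0, 0],  # rater 2 = 2
    [3, 6, 1, 0],  # rater 2 = 3
    [0, 3, 7, 0],  # rater 2 = 4
]
print(round(kappa_from_table(padded), 2))  # -0.16
```

With categories properly aligned, the near-total misalignment of the two raters shows up as a negative kappa rather than a spuriously moderate one.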
Interrater reliability of categorical IV (5): Kappas in irregular matrices
If there are no observations in some row or column, Kappa will not be calculated; it is possible to fill up with zeros.

                 Rater 1
              1    3    4    Sum
Rater 2  1    4    0    0    4
         2    2    1    0    3
         3    1    3    2    6
         4    0    1    4    5
Sum           7    5    6    18

K not possible to estimate

                 Rater 1
              1    2    3    4    Sum
Rater 2  1    4    0    0    0    4
         2    2    0    1    0    3
         3    1    0    3    2    6
         4    0    0    1    4    5
Sum           7    0    5    6    18

K = .47
Interrater reliability of categorical IV (6): weighted Kappa using SAS macro

PROC FREQ DATA = int.interrater1 ;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

K_W = (P_O(W) - P_E(W)) / (1 - P_E(W)),
where P_O(W) = Σ_ij w_ij o_ij / N and P_E(W) = Σ_ij w_ij e_ij / N

Papers and macros available for estimating Kappa when rows and columns are unequal or misaligned, or with multiple raters
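A Python sketch of weighted kappa, using linear agreement weights (an assumption; one common choice, and the SAS macro's weighting scheme may differ). Applied to the earlier 0/1/2 agreement matrix, off-diagonal disagreements one step apart receive half credit:

```python
def weighted_kappa(table):
    """Weighted kappa with linear agreement weights: full credit on the
    diagonal, partial credit for near-misses (a sketch, not the SAS macro)."""
    k = len(table)
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    # Linear weights: w_ij = 1 - |i - j| / (k - 1).
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    p_ow = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_ew = sum(w[i][j] * rows[i] * cols[j]
               for i in range(k) for j in range(k)) / n**2
    return (p_ow - p_ew) / (1 - p_ew)

# Agreement matrix from the earlier 0/1/2 example (rows = rater 2).
table = [[2, 0, 0],
         [1, 6, 0],
         [0, 2, 1]]
print(round(weighted_kappa(table), 3))  # 0.6
```

Because every disagreement in this matrix is only one scale-step, the weighted kappa (.60) is higher than the unweighted kappa (.544).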
Interrater reliability of continuous IV (1)
Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!

Correlations (Pearson, 2-tailed, N = 12)
          rater1   rater2   rater3
rater1    1        .873**   .879**
rater2    .873**   1        .866**
rater3    .879**   .866**   1
**. Correlation is significant at the 0.01 level (2-tailed).

Study   Rater 1   Rater 2   Rater 3
1       5         6         5
2       2         1         2
3       3         4         4
4       4         4         4
5       5         5         5
6       3         3         4
7       4         4         4
8       4         3         3
9       3         3         2
10      2         2         1
11      1         2         1
12      3         3         3
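The pairwise correlations and their average can be reproduced in a few lines of Python:

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# The three raters' codings of the 12 studies, from the table above.
r1 = [5, 2, 3, 4, 5, 3, 4, 4, 3, 2, 1, 3]
r2 = [6, 1, 4, 4, 5, 3, 4, 3, 3, 2, 2, 3]
r3 = [5, 2, 4, 4, 5, 4, 4, 3, 2, 1, 1, 3]

pairwise = [pearson(a, b) for a, b in combinations([r1, r2, r3], 2)]
print([round(r, 3) for r in pairwise])  # [0.873, 0.879, 0.866]
print(round(sum(pairwise) / 3, 3))      # 0.873
```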
Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters(a)
Parameter                                Estimate
Residual                                 .222222
Intercept [subject = study] Variance     1.544613
a. Dependent Variable: rating.

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
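The ICC calculation from the variance components, sketched in Python:

```python
# Variance components from the mixed-model output above.
var_between = 1.544613  # intercept variance [subject = study]
var_within = 0.222222   # residual variance

# ICC: share of total variance attributable to differences between studies.
icc = var_between / (var_between + var_within)
print(round(icc, 3))  # 0.874
```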
Interrater reliability of continuous IV (3)
Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies
Comparison of methods (from Orwin, p. 153, in Cooper & Hedges, 1994)
Low Kappa but good agreement rate (AR) when there is little variability across items and coders agree
Interrater reliability in meta-analysis and primary study
Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
How many co-judges?
How many objects to co-judge? (sub-sample of studies versus sub-sample of codings)
Use of a gold standard (i.e., one master coder)
Coder drift (cf. observer drift): are coders consistent over time?
Your qualitative analysis is only as good as the quality of your categorisation of qualitative data