7/28/2019 3.3.interrater
1/16
Funded through the ESRC's Researcher Development Initiative
Prof. Herb Marsh
Ms. Alison O'Mara
Dr. Lars-Erik Malmberg
Department of Education, University of Oxford
Session 3.3: Inter-rater reliability
Establish research question
Define relevant studies
Develop code materials
Locate and collate studies
Pilot coding; coding
Data entry and effect size calculation
Main analyses
Supplementary analyses
Interrater reliability
Aim of the co-judge procedure, to discern:
Consistency within coder
Consistency between coders
Take care when making inferences based on little information
Phenomena impossible to code become missing values
Interrater reliability
Percent agreement: common but not recommended
Cohen's kappa coefficient: kappa is the proportion of the optimum improvement over chance attained by the coders; 1 = perfect agreement, 0 = agreement no better than that expected by chance, -1 = perfect disagreement. Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this guideline)
Correlation between different raters
Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)
Interrater reliability of categorical IV (1)
Percent exact agreement = (number of observations agreed on) / (total number of observations)
Categorical IV with 3 discrete scale-steps
9 ratings the same
% exact agreement = 9/12 = .75
Study Rater 1 Rater 2
1 0 0
2 1 1
3 2 1
4 1 1
5 1 1
6 2 2
7 1 1
8 1 1
9 0 0
10 2 1
11 1 0
12 1 1
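As a sketch (in Python rather than the SPSS/SAS used elsewhere in these slides), percent exact agreement for the table above can be computed as:

```python
# Ratings of the 12 studies by the two raters, from the table above.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Count the studies on which both raters assigned the same category.
agreed = sum(a == b for a, b in zip(rater1, rater2))
pct_exact_agreement = agreed / len(rater1)
print(agreed, pct_exact_agreement)  # 9 0.75
```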
Interrater reliability of categorical IV (2): unweighted Kappa

K = (P_O - P_E) / (1 - P_E)

P_O = (2 + 6 + 1) / 12 = .75
P_E = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
K = (.750 - .451) / (1 - .451) = .544

Agreement matrix:
                 Rater 1
              0    1    2    Sum
Rater 2  0    2    0    0    2
         1    1    6    0    7
         2    0    2    1    3
Sum           3    8    1    12

Kappa:
Positive values indicate how much the raters agree over and above chance alone
Negative values indicate disagreement
If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading
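A minimal Python sketch of the same calculation, using the two raters' codings from the earlier table:

```python
from collections import Counter

# Same ratings as in the agreement matrix above.
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]
n = len(rater1)

# Observed agreement: proportion of identical ratings.
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance-expected agreement: products of the raters' marginal counts.
m1, m2 = Counter(rater1), Counter(rater2)
p_e = sum(m1[c] * m2[c] for c in set(m1) | set(m2)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.544
```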
Interrater reliability of categorical IV (3): unweighted Kappa in SPSS

Symmetric Measures
                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa   .544    .220                   2.719          .007
N of Valid Cases              12
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT= AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS= COUNT
  /COUNT ROUND CELL .
Interrater reliability of categorical IV (4): Kappas in irregular matrices
If rater 2 is systematically above rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to fill up with zeros.

                 Rater 1
              1    2    3    Sum
Rater 2  2    4    1    0    5
         3    3    6    1    10
         4    0    3    7    10
Sum           7    10   8    25

K = .51 (misleading)

                 Rater 1
              1    2    3    4    Sum
Rater 2  1    0    0    0    0    0
         2    4    1    0    0    5
         3    3    6    1    0    10
         4    0    3    7    0    10
Sum           7    10   8    0    25

K = -.16
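The effect of padding can be sketched in Python. `kappa_from_table` is a hypothetical helper (not from the slides) that computes unweighted kappa from a square agreement table:

```python
# Hypothetical helper: Cohen's kappa from a square agreement table
# (rows = rater 2, columns = rater 1, categories in the same order on both axes).
def kappa_from_table(table):
    n = sum(map(sum, table))
    p_o = sum(table[i][i] for i in range(len(table))) / n
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    p_e = sum(r * c for r, c in zip(rows, cols)) / n**2
    return (p_o - p_e) / (1 - p_e)

# The slide's misaligned table, padded with zeros to the full 1-4 category set.
padded = [
    [0, 0, 0, 0],  # rater 2 = 1
    [4, 1, 0, 0],  # rater 2 = 2
    [3, 6, 1, 0],  # rater 2 = 3
    [0, 3, 7, 0],  # rater 2 = 4
]
print(round(kappa_from_table(padded), 2))  # -0.16
```

With categories properly aligned, the near-total misalignment of the two raters shows up as a negative kappa rather than a spuriously moderate one.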
Interrater reliability of categorical IV (5): Kappas in irregular matrices
If there are no observations in some row or column, Kappa will not be calculated; it is possible to fill up with zeros.

                 Rater 1
              1    3    4    Sum
Rater 2  1    4    0    0    4
         2    2    1    0    3
         3    1    3    2    6
         4    0    1    4    5
Sum           7    5    6    18

K not possible to estimate

                 Rater 1
              1    2    3    4    Sum
Rater 2  1    4    0    0    0    4
         2    2    0    1    0    3
         3    1    0    3    2    6
         4    0    0    1    4    5
Sum           7    0    5    6    18

K = .47
Interrater reliability of categorical IV (6): weighted Kappa using SAS macro

PROC FREQ DATA = int.interrater1 ;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

K_W = (P_O(W) - P_E(W)) / (1 - P_E(W)),
where P_O(W) = Σ_ij w_ij o_ij / N and P_E(W) = Σ_ij w_ij e_ij / N

Papers and macros available for estimating Kappa when rows and columns are unequal or misaligned, or with multiple raters
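A Python sketch of weighted kappa, using linear agreement weights (an assumption; one common choice, and the SAS macro's weighting scheme may differ). Applied to the earlier 0/1/2 agreement matrix, off-diagonal disagreements one step apart receive half credit:

```python
def weighted_kappa(table):
    """Weighted kappa with linear agreement weights: full credit on the
    diagonal, partial credit for near-misses (a sketch, not the SAS macro)."""
    k = len(table)
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    # Linear weights: w_ij = 1 - |i - j| / (k - 1).
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    p_ow = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_ew = sum(w[i][j] * rows[i] * cols[j]
               for i in range(k) for j in range(k)) / n**2
    return (p_ow - p_ew) / (1 - p_ew)

# Agreement matrix from the earlier 0/1/2 example (rows = rater 2).
table = [[2, 0, 0],
         [1, 6, 0],
         [0, 2, 1]]
print(round(weighted_kappa(table), 3))  # 0.6
```

Because every disagreement in this matrix is only one scale-step, the weighted kappa (.60) is higher than the unweighted kappa (.544).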
Interrater reliability of continuous IV (1)
Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!

Correlations (Pearson, 2-tailed, N = 12)
          rater1   rater2   rater3
rater1    1        .873**   .879**
rater2    .873**   1        .866**
rater3    .879**   .866**   1
**. Correlation is significant at the 0.01 level (2-tailed).

Study   Rater 1   Rater 2   Rater 3
1       5         6         5
2       2         1         2
3       3         4         4
4       4         4         4
5       5         5         5
6       3         3         4
7       4         4         4
8       4         3         3
9       3         3         2
10      2         2         1
11      1         2         1
12      3         3         3
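The pairwise correlations and their average can be reproduced in a few lines of Python:

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# The three raters' codings of the 12 studies, from the table above.
r1 = [5, 2, 3, 4, 5, 3, 4, 4, 3, 2, 1, 3]
r2 = [6, 1, 4, 4, 5, 3, 4, 3, 3, 2, 2, 3]
r3 = [5, 2, 4, 4, 5, 4, 4, 3, 2, 1, 1, 3]

pairwise = [pearson(a, b) for a, b in combinations([r1, r2, r3], 2)]
print([round(r, 3) for r in pairwise])  # [0.873, 0.879, 0.866]
print(round(sum(pairwise) / 3, 3))      # 0.873
```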
Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters(a)
Parameter                                Estimate
Residual                                 .222222
Intercept [subject = study] Variance     1.544613
a. Dependent Variable: rating.

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
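The ICC calculation from the variance components, sketched in Python:

```python
# Variance components from the mixed-model output above.
var_between = 1.544613  # intercept variance [subject = study]
var_within = 0.222222   # residual variance

# ICC: share of total variance attributable to differences between studies.
icc = var_between / (var_between + var_within)
print(round(icc, 3))  # 0.874
```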
Interrater reliability of continuous IV (3)
Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies
Comparison of methods (from Orwin, p. 153, in Cooper & Hedges, 1994)
Low Kappa but good agreement rate (AR) when there is little variability across items and coders agree
Interrater reliability in meta-analysis and primary study
Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
How many co-judges?
How many objects to co-judge? (sub-sample of studies versus sub-sample of codings)
Use of a gold standard (i.e., one master coder)
Coder drift (cf. observer drift): are coders consistent over time?
Your qualitative analysis is only as good as the quality of your categorisation of qualitative data