Inter-Rater Reliability
March 1, 2013
Emily Phillips Galloway and William Johnston

Transcript
Page 1:

Inter-Rater Reliability

March 1, 2013

Emily Phillips Galloway
William Johnston

Page 2:

Accessing Workshop Materials

Go to:

isites.harvard.edu/research_technologies

– Click on the Workshops tab (on the left) and then the Inter-Rater Reliability folder (near the bottom)

– Save all of the files to the desktop (right-click and 'Save Link As')

Page 3:

Agenda

I. Introducing IRR
II. What is Kappa?
   I. 'By-hand' Example
III. Limitations and Complications of Kappa
IV. Working Through a Complex Example
   I. Data Setup
   II. Estimation
   III. Interpretation
V. Reporting Results

Page 4:

What is Inter-Rater Reliability?

IRR can be defined as the degree of agreement among raters. Numerous statistics can be calculated to provide a score of how much consensus exists between raters.

Why does it matter in educational research?

In IRR we trust: the quality of a coding scheme and the ability to replicate results are connected with the overall 'believability' of the results. To publish results, we must demonstrate that our coding scheme is reliable.

Page 5:

What is Inter-Rater Reliability?

IRR can be defined as the degree of agreement among raters. Numerous statistics can be calculated to provide a score of how much consensus exists between raters.

Why does it matter in language & literacy research?

Language data can be challenging to code and can fall prey to subjectivity, given that interlocutors will not always say or write all that may be inferred by scorers.

Page 6:

What is Inter-Rater Reliability?

IRR can be defined as the degree of agreement among raters. Numerous statistics can be calculated to provide a score of how much consensus exists between raters.

Why does it matter in language & literacy research?

Language data can be challenging to code and can fall prey to subjectivity, given that interlocutors will not always say or write all that may be inferred by scorers.

Our Task: To design coding schemes that are not subjective.

Page 7:

IRR: A Beginning and an End

What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?

If your coding scheme is being developed, calculating IRR can tell you if your codes are functioning in the same way across raters.

If you are using an existing coding scheme, calculating IRR can tell you if your raters may need additional training.

Page 8:

IRR: A Beginning

What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?

If your coding scheme is being developed, calculating IRR on 15%-20% of your data can tell you if your codes are functioning in the same way across raters.

If there are numerous disagreements, this signals a need to revise your coding scheme (and recode the data), to locate clear examples to help raters understand codes better, or (when all else fails) to abandon codes that do not function well.

Disagreements are OK at this point!

Page 9:

IRR: A Beginning

What does formative IRR/inter-rater agreement tell us during the beginning/design phase of a study?

If you are using an existing coding scheme, calculating IRR on 15%-20% of your data can tell you if your codes are functioning in the same way across raters.

If there are numerous disagreements, this signals a need to retrain your raters (and recode the data) or to evaluate if the coding scheme you are working with may need to be revised for your data.

Disagreements are OK at this point!

Page 10:

IRR (Agreement) by Hand!

While kappa statistics give us insight into rater disagreements in the aggregate, a 'confusion matrix' can help us identify the specific codes on which raters disagree.

(Bakeman & Gottman, 1991)
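A minimal sketch of a confusion matrix in Stata, assuming each rater's codes are stored in a separate variable (the names rater1_code and rater2_code are hypothetical): a simple cross tabulation puts agreements on the diagonal, and the off-diagonal cells show exactly which codes the raters confuse.

* Hypothetical variable names; agreements fall on the diagonal
tab rater1_code rater2_code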

Page 11:

Let's try it

Scenario: Marie and Janet are coding students' definitions for the presence of nominalized words. For each definition, 0 = no nominalizations and 1 = nominalizations. They are at the beginning of scoring the data with a new coding scheme that Janet has developed.

How does it seem to be working?
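One quick way to check in Stata is to compute simple percent agreement between the two coders; the variable names below are hypothetical stand-ins for Janet's and Marie's 0/1 codes.

* Hypothetical variable names for the two coders' binary codes
count if janet_code == marie_code
display "Percent agreement = " 100 * r(N) / _N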

Page 12:

Are Janet & Marie still friends?

Page 13:

IRR: The Middle?

What does IRR tell us during the middle of a study?

In the middle of a study, especially if we are coding data over a long period, we may again conduct IRR analysis (a 'reliability check') to be sure we are still coding the data reliably. This involves selecting 20% of the data at random to assess for agreement.
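A hedged Stata sketch of one way to flag a random 20% of cases for such a reliability check (the seed and variable names are illustrative, not from the workshop materials):

* Flag a random 20% of cases to be double coded
set seed 12345
generate double u = runiform()
generate byte reliability_check = (u < 0.20)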

Page 14:

IRR: An End

Why does summative IRR matter during the analysis phase of a study?

At the end of a study, we calculate IRR to demonstrate to the academic community that our coding scheme functioned reliably.

Generally, if we have been diligent in developing our coding scheme and training our raters, there are few surprises.

Page 15:

What is Kappa?

Cohen's Kappa (Cohen, 1960) is a numeric summary of agreement that accounts for agreement occurring simply by chance:

Kappa = (po - pc) / (1 - pc)

po = proportion of agreement that is actually observed
pc = proportion of agreement expected by chance

See pages 63-64 in Bakeman & Gottman (1991) for an excellent example of how po, pc, and Cohen's Kappa are calculated.
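To make the formula concrete, here is a small worked example in Stata based on a hypothetical 2x2 agreement table (the counts are illustrative only, not from the workshop data; run in a clean session so the scalar names do not clash with variables):

* Hypothetical counts: two raters code 50 items with a 0/1 scheme
*                rater2 = 0   rater2 = 1   row total
* rater1 = 0         20            5           25
* rater1 = 1          5           20           25
scalar n  = 50
scalar po = (20 + 20) / n                    // observed agreement = .80
scalar pc = (25/n)*(25/n) + (25/n)*(25/n)    // chance agreement  = .50
scalar k  = (po - pc) / (1 - pc)             // kappa = .60
display "po = " po "   pc = " pc "   kappa = " k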

Page 16:

Limitations of Kappa

What if there are more than two possible ratings and the size of the discrepancy between raters matters?
– Use weighted Kappa

What if there are more than two raters?
– Use Fleiss' Kappa

What if different participants have different numbers of raters?
– Use Krippendorff's alpha
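As a practical note, the two-rater weighted case is handled in Stata by kap with the wgt() option (shown later in this workshop). For more than two raters, kap also accepts one variable per rater, although weights are not available in that case; the variable names below are hypothetical.

* Hypothetical variables, one per rater; reports a multi-rater kappa
kap rater1 rater2 rater3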

Page 17:

A Detailed Example

The sample: 37 students
The data structure: 8 key variables
• 2 different word explanation tasks
  • "bicycle" and "debate"
• 2 coders
  • coder1 and coder2
• 2 rating subscales
  • Superordinate scale (0-5 points)
  • Syntax scale (0-6 points)

Which statistic should we be using? Weighted Kappa! (p. 66 in B & G)

Page 18:

Estimating IRR in Stata

Start with a simple cross tabulation:

coder1_bicycle_ |       coder2_bicycle_superordinate
 superordinate  |      0      2      3      4      5 |  Total
----------------+------------------------------------+-------
              0 |     13      0      0      1      0 |     14
              2 |      0      5      0      0      0 |      5
              3 |      0      0      3      0      0 |      3
              4 |      0      0      0      5      0 |      5
              5 |      0      0      0      0     10 |     10
----------------+------------------------------------+-------
          Total |     13      5      3      6     10 |     37

What do you notice?
• Very strong agreement! Why might this be?
• Neither reviewer ranked anybody a "1". Why might this be?
• This has implications for how we do this in Stata…

Page 19:

Estimating IRR in Stata

• Because of the nature of the ratings, we have to make some changes to the data in order for things to run.
• Our weight matrix implies there are 6 possible ratings, but only 5 are used by the raters, so we must use the "absolute" option in Stata.
• BUT… in order for this to work, we need to change the scale from 0,1,2…5 to 1,2,3…6.

* Shift both coders' ratings from 0,1,2…5 to 1,2,3…6 (required for the absolute option)
replace c1_bicycle_so = c1_bicycle_so + 1
replace c2_bicycle_so = c2_bicycle_so + 1

* Check the recoded cross tabulation
tab c1_bicycle_so c2_bicycle_so

Page 20:

Estimating IRR in Stata

The command:

kap c1_bicycle_so c2_bicycle_so, wgt(s_o_wgt) absolute

The output:

             Expected
Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
-----------------------------------------------------------------
  97.84%      53.81%     0.9532     0.1284       7.43      0.0000
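Note that wgt(s_o_wgt) refers to a user-defined weight matrix. The slides do not show how s_o_wgt was created; one plausible definition, assuming linear weights over the six (shifted) rating categories, uses Stata's kapwgt command:

* Assumed linear weights for 6 categories: weight = 1 - |i - j| / 5
kapwgt s_o_wgt 1 \ .8 1 \ .6 .8 1 \ .4 .6 .8 1 \ .2 .4 .6 .8 1 \ 0 .2 .4 .6 .8 1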

Page 21:

Apply Your Knowledge

For each of the three remaining subscales, repeat the steps:

• Inspect a cross tabulation to get an idea of the data distributions
• Create a weighting matrix
  • "w" or "w2" are built-in weight matrices that you can use (w is linear and w2 is quadratic)
• Estimate the Kappa and interpret the results! (See the sketch below for one of the remaining subscales.)
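A hedged example for one of the remaining subscales, assuming the syntax-subscale variables are named c1_bicycle_syn and c2_bicycle_syn (the actual names in the workshop file may differ), using the built-in quadratic weights:

* Variable names are assumptions; wgt(w2) requests built-in quadratic weights
tab c1_bicycle_syn c2_bicycle_syn
kap c1_bicycle_syn c2_bicycle_syn, wgt(w2)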

Page 22:

Estimating IRR with web resources

If you do not have access to Stata, there are some great web resources you can use as well:

http://www.stattools.net/CohenKappa_Exp.php

http://www.agreestat.com/

Page 23:

Reporting Results

"Inter-rater reliability, calculated based on double coding of 20% of the tasks, was very high (Agreement = 98%; Cohen's Kappa = .96)." (Kieffer & Lesaux, 2010)

“To calculate inter-rater reliability for the coding scheme developed in the first phase, a research coordinator and a graduate research assistant randomly selected 20 writing samples for each of four tasks from different years, cohorts, and writing ability levels: narratives by pen, the sentence integrity task by pen, essays by keyboard, and the sentence integrity task by keyboard. One rater served as the anchor for computing percent agreement between coders. The inter-rater reliability was generally very good for each coded category. Except for two categories, initial percent agreement ranged from 84.6% to 100%. For the only two categories of low interrater reliability, subordinate and adverbial clauses, additional training and reliability checks improved inter-rater reliability for these to acceptable levels of over 0.80.” (Berninger, Nagy, & Scott, 2011)