Examining Rubric Design and Inter-rater Reliability: A Fun Grading Project
Presented at the Third Annual Association for the Assessment of Learning in Higher Education (AALHE) Conference, Lexington, Kentucky, June 3, 2013
Dr. Yan Zhang Cooksey, University of Maryland University College
Outline of Today's Presentation
• Background and purposes of the full-day grading project
• Procedural methods of the project
• Results and decisions informed by the assessment findings
• Lessons learned through the process
Purposes of the Full-day Grading Project
• To simplify the current assessment process
• To validate the newly developed common rubric measuring four core student learning areas (written communication, critical thinking, technology fluency, and information literacy)
UMUC Graduate School Previous Assessment Model: 3-3-3 Model
Previous Assessment Model: 3-3-3 Model (Cont.)
Strengths:
• Tested rubrics
• Reasonable collection points
• Larger samples, more data for analysis

Weaknesses:
• Added faculty workload
• Lack of consistency in assignments
• Variability in applying scoring rubrics
C2 Model: Common activity & Combined rubric
Compare 3-3-3 Model to (new) C2 Model

Current 3-3-3 Model:
• Multiple rubrics: one for each of the 4 SLEs
• Multiple assignments across the graduate school
• One to multiple courses per 4 SLEs
• Multiple raters for the same assignment/course
• Untrained raters

Combined Activity/Rubric (C2) Model:
• Single rubric for all 4 SLEs
• Single assignment across the graduate school
• Single course for all 4 SLEs
• Same raters per assignment/course
• Trained raters
Procedural Methods of the Grading Project
• Data source
• Rubric
• Experimental design for data collection
• Inter-rater reliability
Procedural Methods of the Grading Project (Cont.)
• Data source: student papers (redacted)

Course name   # of Papers
BTMN9040      27
BTMN9041      29
BTMN9080      7
DETC630       9
MSAF670       20
MSAS670       13
TMAN680       16
Total         121
Procedural Methods of the Grading Project (Cont.)
• Common assignment
• Rubric (rubric design and refinement)
• 18 raters (faculty members)
Procedural Methods of the Grading Project (Cont.)
• Experimental design for data collection
  - Randomized trial (Groups A and B)
  - Raters' norming and training
  - Grading instruction
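The slides do not show how the randomized trial was set up. A minimal sketch of one plausible way to split the 121 papers into two groups at random (the seed and variable names are assumptions, not part of the project):

```python
import random

# Hypothetical IDs for the 121 redacted student papers
paper_ids = list(range(1, 122))

# Fixed seed so the split is reproducible across runs
rng = random.Random(2013)
rng.shuffle(paper_ids)

# Assign the first half to Group A (experiment), the rest to Group B (control)
group_a, group_b = paper_ids[:61], paper_ids[61:]
print(len(group_a), len(group_b))  # 61 60
```

Any comparable shuffling scheme works; the key design point on the slide is that assignment to groups was randomized before rater norming and training.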
Procedural Methods of the Grading Project (Cont.)
• Inter-rater reliability (literature): Stemler (2004) notes that in any situation involving judges (raters), the degree of inter-rater reliability is worth investigating, as it has significant implications for the validity of the subsequent study results.
• Intraclass correlation coefficients (ICC) were used in this study.
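The deck does not show the ICC computation itself. A minimal from-scratch sketch of ICC(2,1), the two-way random-effects, absolute-agreement, single-rater coefficient described by McGraw & Wong (1996), applied to hypothetical rubric scores (the function name and sample data are illustrative assumptions):

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n subjects x k raters) array of scores."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-paper means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Sums of squares for subjects (rows), raters (columns), and error
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    # Mean squares
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical rubric scores: 6 papers, 2 raters, 1-4 scale
scores = [[3, 3], [4, 4], [2, 3], [3, 3], [1, 2], [4, 4]]
print(round(icc2_1(scores), 3))  # 0.833
```

Values near 1 indicate strong absolute agreement between raters; values near 0 suggest the rubric is being applied inconsistently.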
Results and Findings
• Two-sample t-test

Group Statistics (Differ_Rater1and2)
Group                        N     Mean   Std. Deviation   Std. Error Mean
Group A (Experiment Group)   483   .249   1.0860           .0494
Group B (Control Group)      540   .024   1.2463           .0536
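The Independent Samples Test table itself did not survive extraction. As a rough check, a t statistic can be recomputed from the group statistics above using the unequal-variances (Welch) form; whether the slide reported this variant or the pooled one is an assumption:

```python
import math

# Group statistics from the slide: mean, standard error, and N
# of the rater-1 vs. rater-2 score differences in each group
mean_a, se_a, n_a = 0.249, 0.0494, 483   # Group A: experiment
mean_b, se_b, n_b = 0.024, 0.0536, 540   # Group B: control

# Welch's t for two independent samples, from summary statistics
t = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
print(round(t, 2))  # 3.09
```

A t statistic of roughly 3.1 with samples this large corresponds to a small p-value, consistent with a real difference in rater agreement between the trained (experiment) and untrained (control) conditions.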
Results and Findings (Cont.)
• Independent Samples Test
• Strategies to improve inter-rater agreement:
  - More training
  - Clear rubric criteria
  - Map assignment instructions to rubric criteria
• Decisions made based on the assessment results:
  - Further refined the rubric and common assessment activity
Resources
• McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46 (Correction, 1(1), 390).
• Nunnally, J. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
• Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved from http://pareonline.net/getvn.asp?v=9&n=4
Dr. Yan Zhang Cooksey
Director for Outcomes Assessment
The Graduate School, University of Maryland University College
Email: [email protected]
http://assessment-matters.weebly.com