Psychometric Issues in the Measurement of Non-Cognitive Attributes
Yoon Soo Park
University of Illinois – College of Medicine at Chicago
October 6, 2014
Correspondence concerning this manuscript should be addressed to Yoon Soo Park, Department of Medical Education, University of Illinois – College of Medicine at Chicago, 808 S. Wood Street, 986 CMET (MC 591) Chicago, IL 60612-7309. Email: [email protected].
Abstract
Recent research has demonstrated the impact that non-cognitive attributes have on long-term life outcomes, with supporting evidence continuing to emerge across various disciplines. Non-cognitive attributes refer to character skills such as conscientiousness, motivation, and agreeableness, in contrast with cognitive attributes, which traditionally measure general knowledge or intelligence. Although investing in non-cognitive attributes has shown great promise, psychometric issues pertaining to their measurement deserve greater attention and discussion. Non-cognitive attributes pose greater measurement challenges because sampling behaviors requires sufficient cases, items, and raters, which complicates obtaining reliable and precise estimates. This paper uses psychometric rater models to refine measurements of non-cognitive attributes. Empirical analysis using teacher observation data from classroom settings demonstrates the benefits of this technique. In particular, when estimates from the psychometric rater models were analyzed with value-added scores, non-cognitive attributes had greater effect sizes relative to traditional methods. This paper also proposes a new method for measuring non-cognitive attributes that accounts for modes of observation. Real-world data from police promotion exercises demonstrate its use, and Monte Carlo simulations show stability in the recovery of parameter estimates.
Component                          1       2       3       4

Classroom Environment (Domain 2)
Response and Rapport             1.70   19.49   62.02   16.79
Culture of Learning              2.65   30.73   53.87   12.74
Managing Procedures              3.36   26.27   60.30   10.08
Managing Behavior                3.35   26.91   59.23   10.51

Instruction (Domain 3)
Communication                    3.75   31.57   54.88    9.80
Questioning and Discussion       8.37   51.40   35.02    5.21
Engaging in Learning             6.95   44.28   42.18    6.60
Assessment in Instruction        9.83   48.77   37.18    4.21
Flexibility and Responsiveness   8.46   46.99   39.01    5.53

Note: Values represent row percentages across the four rating categories. A total of 1,000 observations were scored by principals and IES raters using a 4-point rating scale (CPS Framework for Teaching; see http://cps.edu/sitecollectiondocuments/cpsframeworkteaching.pdf for a full description of the rubric, accessed October 1, 2014).
Model parameter estimates. Figure 2 shows the LC-SDT model parameters, restricted
to IES raters (complete table of results for all 115 raters can be obtained from the author). In the
left figure, rater precision estimates and their respective 95% confidence intervals are plotted.
The X-axis represents the 19 IES raters and the Y-axis represents the rater precision estimates (dj
from Equation [1]). Results show a wide variability in rater precision even among IES raters. IES
raters are highly-trained observers who visit schools to work with principals to improve their
teacher evaluation skills. The wide variability in rater precision estimates indicates the need to
adjust for rater-specific differences, which are often ignored in practice. In the figure to the right,
plots of the rater criteria (ckj) are presented. Since the CPS Framework for Teaching is based on four ordinal performance categories, there are three criteria locations. A criteria estimate that is higher relative to other raters indicates greater severity; lower estimates indicate leniency.
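For reference, the latent class signal detection rater model underlying these estimates takes the following general form (a sketch in the spirit of DeCarlo, Kim, and Johnson, 2011; the exact parameterization of Equation [1] may differ):

\[
\Pr(Y_j \le k \mid \eta = t) = F\!\left(c_{kj} - d_j\, t\right), \qquad k = 1, \dots, K - 1,
\]

where F is the logistic distribution function, η = t denotes the latent performance class, dj is the precision of rater j, and ckj are the criteria of rater j. With K = 4 rating categories, each rater has three criteria.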
Table 2. Measures of agreement between principals and IES raters

Communication                    69.23%   .50 (.03)   .55 (.03)   .63 (.04)
Questioning and Discussion       70.53%   .53 (.03)   .57 (.03)   .61 (.04)
Engaging in Learning             64.11%   .44 (.03)   .51 (.03)   .60 (.04)
Assessment in Instruction        65.31%   .46 (.03)   .53 (.03)   .62 (.04)
Flexibility and Responsiveness   58.95%   .36 (.03)   .43 (.03)   .51 (.04)

Note: IES raters are “Instructional Effectiveness Specialists” who provide guidance on rater training to principals. Values in parentheses are standard errors.
Note:
1. Figure on the left shows estimates of rater precision (dj estimates) across the 19 IES raters. A greater rater precision estimate reflects a greater ability of the rater to discriminate differences between behaviors.
2. Figure on the right shows relative criteria estimates for the 19 IES raters. Since there are 4 categories in the rubric, there are 3 criteria locations (cut points) in the distribution. Criteria estimates were standardized to the same scale to allow comparisons between raters. A higher criteria estimate indicates severity, while a lower estimate indicates leniency.
Figure 2. Parameter estimates from LC-SDT: Rater precision and relative criteria
Comparing model-based scores with original rater scores. Based on the model parameter estimates, model-based scores were generated. Value-added scores (combined subjects, mathematics, and reading) were regressed simultaneously on the estimated latent classes. For comparison, traditional linear regression was used to examine the regression coefficients. Table 3 presents the results comparing the two methods.
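As an illustration of the mechanism at work (not the latent class regression itself, which was estimated in Latent Gold), the following sketch simulates how rater noise in a score attenuates an OLS slope, so that scores adjusted for rater effects recover more of the underlying relationship; all variable names and values here are hypothetical.

```python
# Minimal sketch (hypothetical values): rater noise in a score attenuates
# its regression relationship with an outcome; scores adjusted for rater
# effects (stand-ins for model-based scores) recover more of the slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
latent = rng.normal(size=n)                        # true teacher attribute
outcome = 0.15 * latent + rng.normal(size=n)       # value-added-like outcome
raw = latent + rng.normal(scale=1.0, size=n)       # original ratings (noisy)
adjusted = latent + rng.normal(scale=0.3, size=n)  # model-based scores (less rater noise)

for label, score in [("original ratings", raw), ("model-based scores", adjusted)]:
    fit = sm.OLS(outcome, sm.add_constant(score)).fit()
    print(f"{label}: slope = {fit.params[1]:.3f} (SE = {fit.bse[1]:.3f})")
```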
Results show that when a psychometric rater model is used, the coefficients of the value-added scores have greater effect sizes. For example, for the combined value-added scores, the regression coefficient is .15 for the latent class regression, compared with .09 for linear regression. While this difference is modest, with similar standard error estimates, the difference in effect sizes indicates some value in using psychometric rater models to refine the measurement precision of non-cognitive attributes.
Table 3. Comparison of coefficient effects: Latent class regression and linear regression

Value-added score   Latent class regression coefficients   Linear regression coefficients
                    using model-based scores               using original ratings
Combined            .154 (.048)**                          .093 (.039)*
Mathematics         .259 (.050)***                         .166 (.036)
Reading             –.009 (.043)                           –.005 (.035)

Note: Value-added scores were standardized to a –3 to 3 scale (see Value-Added Research Center, 2014). *p<.05; **p<.01; ***p<.001. Values in parentheses represent standard errors.
5. Columbus Police and Firefighter Promotion Data
5.1 Methods
Data. In this section, data were analyzed from a real-world administration of live and video-recorded observation scores, in which candidates complete two exercises (items) and six different raters score each exercise, for a total of 12 raters. For each exercise, three raters score the candidate through live observation, with possible interactions between the examinee and the raters; the remaining three raters score a video recording of the performance at a subsequent
time. In other words, raters 1, 2, and 3 score exercise 1 through live observation; raters 4, 5, and
6 score exercise 1 through videotaped recording. Similarly, raters 7, 8, and 9 score exercise 2
through live observation; raters 10, 11, and 12 score exercise 2 through videotaped recordings.
All raters were trained to score using a holistic 3-point rating scale, which measures the
following skills: oral communication, interpersonal relations, information analysis, and problem
sensing and resolution ability. The data contain 440 global ratings from each rater for each
exercise.
Analysis. Data were used to fit both HRM-SDT and HRM-MO. Model fit indices,
parameters, latent class size, and classification indices were compared. Estimation was
conducted using Latent Gold 4.5 (Vermunt and Magidson, 2005).
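For reference, the information criteria used for model comparison follow the standard definitions:

\[
\mathrm{AIC} = -2\ln L + 2p, \qquad \mathrm{BIC} = -2\ln L + p \ln N,
\]

where L is the maximized likelihood, p is the number of free parameters, and N is the number of examinees; lower values indicate better fit after penalizing model complexity.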
5.2 Results
Descriptive statistics. Table 4 shows the descriptive statistics of the ratings as well as the
rater agreement statistics for each mode of observation.
Table 4. Distribution of scores assigned and rater agreement
Note: “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.

Results of the model comparison for the real-world data indicate a better fit for the HRM-MO model (lower AIC and BIC). Table 6 shows the classification indices, Pc and λ.
Classification indices indicate the quality of classification based on posterior probabilities of the
model. The Pc measures classification accuracy, and the λ statistic accounts for classification that
can occur by chance (Clogg and Manning, 1996). For both exercises, classification was lower for
the video-based observation (η12 and η22) when compared to live observation (η11 and η21). In
addition, classification was lower for the combined latent categorical variables (Φ1 and Φ2).
Latent class sizes for the latent categorical variables are also presented in Table 6.
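Although the exact estimators are not reproduced here, these indices are commonly computed from the posterior class probabilities along the following lines (cf. Clogg, 1995):

\[
P_c = \frac{1}{N} \sum_{i=1}^{N} \max_t \hat{P}(\eta = t \mid \mathbf{y}_i), \qquad \lambda = \frac{P_c - \max_t \hat{\pi}_t}{1 - \max_t \hat{\pi}_t},
\]

where the π̂t are the estimated latent class sizes. Pc is the expected proportion correctly classified under modal assignment, and λ expresses the improvement of that assignment over placing every case in the largest class.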
Table 6. Classification indices and latent class sizes by model

                           Classification       Latent class sizes
Model   Latent variable    Pc        λ          Class 1   Class 2   Class 3

Note: Proportion correctly classified (Pc) and λ are both measures of classification based on posterior probability (Clogg, 1995). The λ statistic accounts for classification that can occur by chance. Values in parentheses represent standard errors.
Level 1: Rater model parameters. Table 7 shows level 1 rater parameters by HRM-
SDT and HRM-MO models. Results indicate that rater discrimination (d parameter), which
indicates how well a rater is able to discriminate between different qualities of performance
(rater precision), was generally greater for live (onsite) scoring than video-based scoring for both
exercises, where the difference was slightly greater for exercise 1 than exercise 2. However, the
average rater discrimination was comparable between the two exercises. The distribution of rater discrimination indices identifies which raters were better able to detect differences between the categories. Although the estimates differed, the overall trends between the HRM-SDT and HRM-MO were similar.
Figure 3 (left: relative criteria; right: discrimination) was created to visually illustrate the parameters of the HRM-MO model.
Table 7. Rater parameters: Level 1 (Signal Detection Theory Rater Model) by model

Exercise   Mode   Rater   Parameter   HRM-SDT   HRM-MO

Note: Values in parentheses represent standard errors. Parameter d represents rater discrimination and c represents rater criteria. “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.
Note:
1. In the left figure, the X-axis indicates the rater IDs; the Y-axis indicates relative criteria estimates. Raters 1 to 3 and 7 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively. Horizontal lines were added at the criteria locations where the likelihood ratios are maximized, as reference points.
2. In the right figure, the X-axis indicates rater IDs; the Y-axis indicates rater discrimination estimates. Raters 1 to 3 and 7 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively.
Figure 3. Plots of relative criteria and rater discrimination by rater
In Figure 3 (left), the relative criteria for the 12 raters are presented. Relative criteria are standardized estimates of rater effects that allow comparisons between raters (direct comparison of c parameters between raters in Table 7 is not accurate, due to differences in d parameters, which need to be standardized out). The X-axis indicates rater IDs and the Y-axis presents the relative criteria locations, standardized by accounting for differences in rater discrimination. Since there are three categories, there are two criteria locations per rater. Horizontal lines were added as reference points at the criteria locations where the likelihood ratios are maximized; a higher rater criteria location indicates severity, and a lower location indicates leniency. In general, all raters were lenient in their use of the lowest scoring category, as indicated by the relative criteria estimates below the horizontal line.
Figure 3 (right) shows the rater discrimination estimates by rater. The X-axis represents
the rater IDs, and the Y-axis represents the rater discrimination estimates. As indicated in Table
7, rater discrimination was generally higher for live observations. Moreover, rater 12 had the
lowest rater discrimination, indicating lower ability to discriminate differences between the
qualities of performance demonstrated by the examinees.
Level 2: Mode of observation parameters. Table 8 shows the level 2 parameters,
pertaining to the quality of observation mode. Similar to level 1, the LC-SDT model was used to
estimate differences in the quality of latent categorical scores between modes of observation. The
f parameter indicates the mode effect, analogous to the c parameter for rater effects. The h parameter, analogous to the d parameter, indicates how well a mode of observation discriminates differences between latent qualities of examinee performance.
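By analogy with the level 1 rater model, the level 2 structure can be sketched as follows (the exact parameterization may differ):

\[
\Pr(\eta_m \le k \mid \xi = t) = F\!\left(f_{km} - h_m\, t\right),
\]

where ξ is the item-level latent score, ηm is the latent score observed under mode m, hm is the mode discrimination, and fkm are the mode criteria.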
Results indicate that for exercise 1, the h parameter was greater for video-based
recordings than for live observations. For exercise 2, the live observation had slightly greater
discrimination than video-based recordings. These results may indicate that video-based
observations were better at discriminating different qualities of examinee performance than live
observations for exercise 1. Relative criteria based on the f parameter were similar between the
different modes of observations.
Table 8. Mode of observation parameters: Level 2 (Signal Detection Model)

Mode of observation   Parameter   Exercise 1   Exercise 2

Note: Values in parentheses represent standard errors. Parameter h represents discrimination and f represents criteria.

Combining results from levels 1 and 2, the estimates seem to indicate that raters assigned to
score live observations were more precise (higher rater discrimination) than raters assigned to
score video-based recordings. However, between the two modes of observations, video-based
recordings allowed greater discrimination of differences in quality than live observations for
exercise 1.
Level 3: Item parameters. Table 9 presents the item parameters for the two exercises by model.

Note: Values in parentheses represent standard errors. Parameter a represents item discrimination and b represents the category step parameter based on the generalized partial credit model (Muraki, 1992). “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.
Between the two HRMs, the HRM-MO had greater item discrimination (a) estimates. Moreover, the category step (b) parameters were spaced further apart. However, the general trends in the parameters were similar, with slightly greater estimates of item discrimination for exercise 2.
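For reference, the generalized partial credit model (Muraki, 1992) specifies the probability of a latent item score in category k as

\[
\Pr(\xi = k \mid \theta) = \frac{\exp\left( \sum_{v=1}^{k} a(\theta - b_v) \right)}{\sum_{c=0}^{m} \exp\left( \sum_{v=1}^{c} a(\theta - b_v) \right)},
\]

with the empty sum for c = 0 set to zero. A larger a sharpens the distinction between adjacent categories, and the bv locate the category steps on the latent scale.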
6. Monte Carlo Simulation Study
6.1 Methods
Monte Carlo simulations were conducted to examine the sensitivity of the HRM-MO model under varying sample sizes of 100, 400, and 1,000 for two exercises scored under two modes of observation (i.e., live observation and videotaped observation) with three raters each, following the same data structure as the Columbus examination. The sample sizes were designed to reflect realistic numbers of examinees who take the promotion exam in the real-world data. Although possible, it would be extremely rare for more than 1,000 examinees to be tested simultaneously in a national setting for the particular exam analyzed in Study 1.
Three conditions were used to generate data. Population values (generating values)
associated with these conditions are presented in Table 10. In condition 1, all raters are assumed
to have the same rater parameters, and item parameters are also the same; only the level 2
parameters (mode of observation level) differ. In condition 2, item parameters in level 3 are
different, in addition to different level 2 parameters. In condition 3, item, mode of observation,
and raters have different parameter estimates. The motivation for these different conditions is to examine parameter recovery at each level. Results from the real-world analysis in Study 1 indicated that parameters at all three levels could vary. Given the three parameter conditions presented in Table 10 and the three sample sizes, a total of 9 conditions were examined in the simulation study (3 parameter conditions in Table 10 × 3 sample size conditions).
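As a sketch of the data-generating structure (estimation itself was conducted in Latent Gold, not in code like this), the following simulates ratings under the HRM-MO hierarchy, assuming logistic SDT links at levels 1 and 2 and a generalized partial credit model at level 3. All parameter values below are illustrative placeholders, not the generating values in Table 10.

```python
# Hedged sketch: generating ratings under the HRM-MO structure
# (2 exercises x 2 modes x 3 raters per mode, 3-point scale).
import numpy as np

rng = np.random.default_rng(2014)

def gpcm_probs(theta, a, b):
    """Category probabilities under a generalized partial credit model."""
    num = np.concatenate([[0.0], np.cumsum(a * (theta - np.asarray(b)))])
    num = np.exp(num - num.max())   # stabilized softmax over categories
    return num / num.sum()

def sdt_draw(t, d, c):
    """Draw an ordinal response from a logistic SDT model:
    Pr(Y <= k | class t) = logistic(c_k - d * t)."""
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(c) - d * t)))
    p = np.diff(np.concatenate([[0.0], cum, [1.0]]))
    return rng.choice(len(p), p=p)

n = 400                                        # examinees (conditions: 100, 400, 1000)
a, b = 1.0, [-0.5, 0.5]                        # level 3 GPCM parameters (illustrative)
h = {"live": 2.0, "video": 1.5}                # level 2 mode discrimination
f = {"live": [1.0, 3.0], "video": [0.8, 2.6]}  # level 2 mode criteria
d, c = 1.8, [0.9, 2.7]                         # level 1 rater parameters (shared here)

ratings = np.zeros((n, 12), dtype=int)         # raters 1-3 live / 4-6 video (ex. 1), etc.
for i in range(n):
    theta = rng.normal()                                     # examinee latent trait
    for ex in range(2):
        xi = rng.choice(3, p=gpcm_probs(theta, a, b))        # level 3: item score
        for m, mode in enumerate(["live", "video"]):
            eta = sdt_draw(xi, h[mode], f[mode])             # level 2: mode score
            for r in range(3):                               # level 1: three raters
                ratings[i, ex * 6 + m * 3 + r] = sdt_draw(eta, d, c)
```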
Table 10. Conditions for simulation: Generating values

Level   Exercise   Type   Parameter   Condition 1   Condition 2   Condition 3
Level 3: CR item model (generalized partial credit model)
7. Conclusion
This paper reviews psychometric rater models used in the measurement literature to refine measures of non-cognitive attributes. While non-cognitive attributes provide new approaches to target interventions that can impact human capital, measurement issues have yet to be resolved. This paper contributes to the literature in this regard by proposing a solution to generate model-based scores that can provide more refined estimates. To demonstrate this application, psychometric rater models used in the educational measurement and mathematical psychology literature are presented. In addition, a new model, extending the existing foundation of the LC-SDT, is also proposed.
The analyses conducted in this paper show the utility of applying these techniques. First, the CPS teacher evaluation data were fit using the LC-SDT model. Results showed that model-based scores accounting for rater effects generated larger effect sizes with value-added scores. Although modest, the difference compared with traditional linear regression techniques can be consequential when viewed in the context of other value-added results in the literature (Bill and Melinda Gates Foundation, 2012). Moreover, latent class regression that incorporates a psychometric rater model may yield more refined results than traditional value-added models; further investigation is needed.
This study also contributes by proposing a new method that accounts for mode of observation. Many non-cognitive attributes can be observed directly or measured through post-hoc mechanisms such as video playback. Findings from the real-world data analysis show the utility of this approach. The Monte Carlo simulation results also show promise for the continued development of these techniques as more refined methods to capture learners' non-cognitive attributes.
Recently, there has been an increase in observation-based methods to assess candidates, as scoring can be based on live or video-based observations; such testing is administered frequently in medical education and in other professions. In the K-12 education literature, measuring effective teaching has been conducted onsite by observers or offsite using video recordings. Given the increased use of observations to measure performance, a measurement model that accounts for modes of observation is necessary.
The HRM-MO model proposed in this study provides a framework for extending the HRM, which previously accounted only for raters at level 1 and items at level 2. The HRM-MO adds a separate level between the rater and item levels that models the effect of observation mode. This can be a useful approach for researchers, as multiple modes of observation can be applied in high-stakes testing, and the quality of an observation mode can inform the planning of the scoring design. In addition, this study contributes to the growing literature on developments of the HRM, which can lead to improved measurement of examinee performance.
The HRM-MO used in this study can be a useful model for studying modes of observation. It provided a fuller account of differences between modes of observation than simple rater agreement statistics or the traditional HRM-SDT. The model fit indices based on the HRM-MO also showed improved fit, a promising indication for further development of this model. Simulation results also showed interesting patterns for the higher-level parameters at level 3. Conditions that support better estimation of item parameters should be examined in future research.
As greater emphasis is placed on investing in the non-cognitive attributes of learners at various stages of training, additional care should be applied to their measurement. Although much work in the professions education and educational measurement literature has contributed to this effort, translating these techniques across disciplines to further reduce gaps is still needed. While the measurement sciences focus on improving precision around constructs, a method to align these methodological trends with long-term outcomes would support better estimation and a deeper understanding of how non-cognitive attributes influence human development and potential.
References
Abramson, David, Yoon Soo Park, Tasha Stehling-Ariza, and Irwin Redlener. 2010. “Children as
bellwethers of recovery: Dysfunctional systems and the effects of parents, households,
and neighborhoods on serious emotional disturbance in children after Hurricane Katrina.”
Disaster Medicine and Public Health Preparedness 4:S17–S27.
Agresti, Alan. 2002. Categorical data analysis. Hoboken: Wiley.
Almlund, Mathilde, Angela Duckworth, James Heckman, and Tim Kautz. 2011. “Personality
psychology and economics.” In Handbook of the Economics of Education, edited by Eric
Hanushek, Stephen Machin, and Ludger Wößmann. Amsterdam: Elsevier.
Beard, J. D., B. C. Jolly, D. I. Newble, W. E. Thomas, J. Donnelly, and L. J. Southgate. 2005.
“Assessing the technical skills of surgical trainees.” British Journal of Surgery 92 (6):
778–82.
Bill and Melinda Gates Foundation, Measures of Effective Teaching (MET). 2012. Gathering
feedback for teaching: Combining high-quality observations with student surveys and
achievement gains. Seattle: Bill and Melinda Gates Foundation.
Cardy, Robert L., and Gregory Dobbins. 1986. “Affect and appraisal accuracy: Liking as an
integral dimension in evaluating performance.” Journal of Applied Psychology 71:672–
678.
City of Columbus Civil Service Commission. 2012. 2012 police lieutenant and commander
promotional examination: Test guide. Columbus: City of Columbus Civil Service
Commission.
Clogg, Clifford, and Wendy D. Manning. 1996. “Assessing reliability of categorical
measurements using latent class models.” In Categorical variables in developmental
research, edited by Alexander von Eye and Clifford C. Clogg. New York: Academic
Press.
Cohen, Jacob. 1960. “A coefficient of agreement for nominal scales.” Educational and
Psychological Measurement 20:37–46.
Cohen, Jacob. 1968. “Weighted kappa: Nominal scale agreement with provision for scaled
disagreement or partial credit.” Psychological Bulletin 70:213–220.
Danielson, Charlotte. 2007. Enhancing professional practice: A framework for teaching.
Alexandria, VA: Association for Supervision and Curriculum Development.