Psychometric Issues in the Measurement of Non-Cognitive ... · non-cognitive attributes. In particular, when estimates from the psychometric rater models were analyzed with value-added

1

Psychometric Issues in the Measurement of Non-Cognitive Attributes

Yoon Soo Park

University of Illinois – College of Medicine at Chicago

October 6, 2014

Correspondence concerning this manuscript should be addressed to Yoon Soo Park, Department of Medical Education, University of Illinois – College of Medicine at Chicago, 808 S. Wood Street, 986 CMET (MC 591) Chicago, IL 60612-7309. Email: [email protected].

2

Abstract

Recent research has demonstrated the impact that non-cognitive attributes have on long-term life outcomes, with studies supporting such evidence continuing to emerge across various disciplines. Non-cognitive attributes refer to character skills such as conscientiousness, motivation, and agreeableness that contrast from cognitive attributes which traditionally measure general knowledge or intelligence. Although investing in non-cognitive attributes has shown great promise, psychometric issues pertaining to its measurement characteristics deserve greater attention and discussion. Non-cognitive attributes face greater challenges in its measurement due to sampling of behaviors that require the use of sufficient cases, items, and raters, which complicates reliable and precise estimates. This paper uses psychometric rater models to refine measurements of non-cognitive attributes. Empirical analysis using teacher observation data from classroom settings demonstrate benefits of using this technique to refine measurements of non-cognitive attributes. In particular, when estimates from the psychometric rater models were analyzed with value-added scores, non-cognitive attributes had greater effect sizes, relative to traditional methods. This paper also proposes a new method for measuring non-cognitive attributes that account for modes of observations. Real-world data from police promotion exercises are used to demonstrate its use. Monte carlo simulations show stability in recovery of parameter estimates.

Key words: Non-cognitive attributes, hierarchical rater model, psychometrics, value-added models

3

1. Introduction

During the past decade, studies in disciplines ranging from economics, education,

medicine, and public health have underscored the value of non-cognitive attributes that have

long-term effects in predicting meaningful outcomes (e.g., Heckman and Kautz, 2013; Gutman

and Schoon, 2013; Park, Riddle, and Tekian, 2014; Abramson, Park, Stehling-Ariza, and

Redlener, 2010). Non-cognitive attributes refer to character skills, such as conscientiousness,

motivation, agreeableness, sociability, or perseverance; they contrast from cognitive attributes

that measure general knowledge or intelligence. Studies have shown that investing in non-

cognitive attributes during early childhood and elementary school programs significantly

contribute to a person’s future potential and life outcomes (Heckman, Pinto, and Savelyev, 2013).

Evidence supporting the effects of non-cognitive attributes has been noted in various contexts,

including the labor market, schools, community, and hospitals, where these qualities affect the

success of workplace performance and productivity (Almlund, Duckworth, Heckman, and Kautz,

2011). As such, proper measurement of non-cognitive attributes that minimize measurement

error and maximize precision has become an important subject of discussion and a focus of this

paper.

In the field of professions education, where highly-specialized workforce is trained (e.g.,

physicians, teachers, lawyers), the value of non-cognitive attributes has received careful attention.

For example, in the training of medical doctors, competence in communication and interpersonal

skills (CIS) is considered an essential element of good clinical practice – CIS have been linked

with patient and physician satisfaction, better medical decision, patient safety, and adherence to

treatment plans, which are known to be important patient outcomes, quality of healthcare, cost-

effective practices, and decreases in malpractice (Makoul and Curry, 2007; Vincent, Young, and

4

Philips, 1994). Therefore, valid and reliable measurement of non-cognitive attributes for health

professional trainees is at the forefront of medical licensure and maintenance of quality in the

United States and Canada. For teachers, research has shown that effective teaching requires

competence in non-cognitive attributes that measure proper management of classroom

environment and demonstration of professional responsibilities (Danielson, 2007); these

attributes surpass the impact that professional credentials and graduate degrees provide. When

combined with student achievement gains, non-cognitive attributes of teachers have shown to be

an effective mechanism for identifying qualified teachers (MET, 2012). For high-ranking police

offers and firefighters, the ability to be professional and communicate effectively have long been

noted as important competencies in their work and are included in their promotion assessments

(City of Columbus Civil Service Commission, 2012). As demonstrated in these examples,

emerging research has continuously emphasized the value of non-cognitive attributes as a

significant construct at the highest levels of specialized education and training across a variety of

fields.

While the impact of non-cognitive attributes on long-term outcomes has been recognized,

they have been largely neglected due to insufficient evidence supporting their reliability and

non-ignorable errors associated with their precision; these issues are known as reference bias or

measurement error (Heckman and Rubinstein, 2001; Groot, 2000). Instruments used to measure

non-cognitive attributes have been noted to be “subjective” (Jacob and Lefgren, 2008; Rockoff

and Speroni, 2010) and in many cases rely on self-reported surveys, which can affect their

response process and consequently their validity. When non-cognitive attributes are measured by

human raters, they face issues of rater severity or bias that can significantly affect the quality of

5

data (Park, Holtzman, and Chen, 2014). These issues contrast from instruments used to measure

cognitive attributes, such as achievement tests, which have fixed and unequivocal responses.

To rank order trainees by non-cognitive attributes for selection or for promotional

purposes, challenges posed by these measurement issues need to be examined. In this regard,

assessment researchers and testing organizations in the professions education have relied on

observing samples of behavior to measure non-cognitive attributes, while closely monitoring

their psychometric characteristics by studying their properties. This paper focuses on

measurement of non-cognitive attributes that use samples of trainee behavior evaluated by

human raters and psychometric techniques to overcome this problem.

Assessing how learners react to certain scenarios and observing their behaviors in real-

life circumstances have shown promise as a reliable and valid mechanism to measure non-

cognitive attributes (Jackson, 2013; Pratt and Cullen, 2000). Measurement precision, particularly

for non-cognitive attributes, is therefore a sampling issue – observations rely on sufficient

sampling of cases or items and the use of trained raters to score and evaluate trainees. For

example, in measuring teaching effectiveness, principals observe a teacher’s instructional ability

and gather evidence to generate a numeric score of their performance. In the case of physicians

training to practice independently, they are observed and evaluated by senior medical personnel

until they are certified. Although a scoring rubric is often provided with ample opportunities for

training, the use of raters makes an assumption that they are invariant to extraneous

circumstances. Non-cognitive attributes are generally measured using performance-based

assessments, requiring human judgment to evaluate their quality. This presents a challenge, as

the consistency and accuracy of rater judgments can become a critical issue, especially when

high-stakes decisions and consequences are confounded with rater bias. Therefore, improving the

6

measurement of non-cognitive attributes remains an important and meaningful discussion.

However, to date, the literature lacks methodological guidance and techniques that can refine

measures of non-cognitive attributes.

The purpose of this paper is to incorporate psychometric rater models to refine

measurements of non-cognitive attributes. Psychometric rater models that rely on latent variables

are used as the empirical strategy in this context. The application of psychometric rater models

will demonstrate the need to apply model-based approaches that account for item- and rater-

specific effects to estimate the non-cognitive attributes. The latent class signal detection theory

(LC-SDT) model used in educational measurement and mathematical psychology will form the

theoretical basis for this approach. Extensions of the LC-SDT to hierarchical structures are

proposed with empirical application and monte carlo simulations to demonstrate its utility and

stability in estimates. Teacher observations of effective teaching qualities taken from Chicago

Public Schools (CPS) Recognizing Educators Advancing Chicago Students (REACH) project

and the Columbus Police and Firefighter data are used to demonstrate the method.

This paper is organized as follows. First, an overview of the LC-SDT model and its

psychometric background are provided as rationale for incorporating this method to measure

non-cognitive attributes. Then, theoretical background for using hierarchical extensions of the

LC-SDT model that account for items, raters, and observational context are presented. The

ensuing sections use real-world data to present applications of the methods proposed. The CPS

teacher observation data is fit using the LC-SDT model to derive model-based non-cognitive

scores. Value-added scores are used to examine the utility of using psychometric rater models.

Finally, the Columbus Police and Firefighter data are used to fit hierarchical extensions of the

LC-SDT model with simulation studies that demonstrate the model’s identification, stability, and

7

estimation properties. Implications for future use and developments for measuring non-cognitive

attributes are provided, with potential policy implications.

2. Latent Class Signal Detection Theory (LC-SDT)

In the educational measurement and mathematical psychology literature, various models

have been developed to measure constructs that require human judgment. This paper uses this

technique for refining the measurement precision of non-cognitive attributes. The latent class

signal detection theory (LC-SDT) model (DeCarlo, 2002) is used as the empirical strategy for

analysis. A brief overview of the psychological background and parameterization of the model is

presented.

In the LC-SDT model, rating is conceptualized as a psychological process, where a

rater’s role is viewed as attempting to discriminate between latent classes of behaviors; the latent

classes are defined as ordinal performance categories from the scoring rubric. That is, for a non-

cognitive behavior with four performance categories, a rater’s task is to classify a specific

behavior from the trainee into one of the four latent scores. In fact, the role of a rater is to

discriminate between scores defined in the rubric, which is analogous to discriminating between

latent classes.

The latent class SDT model has two parameters that explain the response of a rater: (1)

discrimination (d) and (2) response criteria (ck). Rater discrimination (d) refers to the ability of a

rater to discriminate between latent classes of behaviors, and the response criteria (ck) represents

the internal criteria to which the rater uses to compare and judge the behaviors. Figure 1 presents

a graphical representation of the SDT, where four probability distributions of perceptions in

behaviors are illustrated. There are three response criteria locations in the figure. These locations

8

represent a rater’s criteria for judging a particular score. For example, if a behavior is thought to

be between c1 and c2, then the rater gives the behavior a “2.” However, if a rater perceives the

quality as over c2, but below c3, then the score now becomes “3.” As such, the response criteria

represent a decisional aspect of the rater. Furthermore, it can be inferred from this diagram that

by shifting c3 up, the rater becomes stricter, because this decreases the likelihood of getting a “4.”

Likewise, by shifting c1 down, the rater becomes more lenient, because this increases the chance

for a rater to assign a higher score. As noted, these shifts in raters’ criteria locations represent

rater effects, because they allow a rater to be lenient or strict. Furthermore, it can also accounts

for the shrinkage effect in that if the criteria location for c1 is shifted to the far left, then a rater’s

chance of assign a score of “1” becomes very low.

Figure 1. A Representation of SDT for Scoring Categories 1 to 4

The discrimination parameter (d) represents the distance between the probability

distributions and reflects a perceptual aspect of the rater. Rater discrimination presents how well

a rater discriminates between latent classes of behaviors. When the distance between

distributions is larger, the rater has greater discrimination between the latent classes, because this

means that the perceptions of each scoring category are more distinct. In other words, when d is

larger, there is less overlap between the distributions and less error in terms of a rater’s attempt

0 d 2d 3dc 1 c 2 c 3

"1" "2" "3" "4"

9

to classify a behavior. If the distance between distributions is small, the ability of a rater to

differentiate between two latent classes of behaviors becomes less clear.

More formally, for N items, J raters, and K discrete scores (such that 1≤ k≤K), the latent

class SDT model is expressed in Equation (1):

)()|Pr( cjjkcj dcFkY (1)

Here, Yj is rater j’s observed response, and F is the logistic cumulative distribution function. The

c represents the categorical latent classes, which are the discrete ordered scores of examinee

ability defined by the scoring rubric. One of the aims of latent class analysis is to make model-

based classifications into a latent class using the observed response patterns (Dayton, 1998;

Clogg, 1995). The posterior probability of the latent variable c can be used to measure the

quality of this classification. Two measures for classification accuracy are presented for these

purposes. These measures are used in this paper to reflect the accuracy of classification derived

from the latent class SDT model. First, the expected proportion of cases correctly classified (Pc)

is calculated as follows:

s

Jcsc NYYYnP /)],...,,|Pr(max[ 21 (2)

)Pr(max1

)Pr(max

c

ccP

(3)

Here, the s in Equation (2) indicates the unique response patterns, and sn corresponds to the

frequency of each pattern. Furthermore, ),...,,|Pr(max 21 Jc YYY is the maximum posterior

probability across the latent classes for a given response pattern, and N is the total number of

cases. In addition to the proportion correctly classified statistic ( cP ), the lambda statistic (λ) is

considered, which accounts for classification that can occur by chance. This statistic can be

10

important when there is a latent class with a large size. The calculation of the lambda statistic is

presented in Equation (3). Both proportion correctly classified (Pc) and the lambda statistic (λ)

are used in this study to study classification accuracy.

3. Hierarchical Rater Models, with Extensions for Modes of Observation

3.1 Modes of Observation

Scoring of performance assessment tasks can be based on different modes of observation,

which can include live (onsite) observations and post-exam videotaped observations. For

example, medical students are observed and scored based on their encounter with patients (van

der Vleuten and Swanson, 1990); the interactions with patients are often also videotaped for

post-exam evaluation. Surgeons performing technical operative skills are also assessed using live

observations and videotaped recordings (Vassiliou et al., 2007; Beard et al., 2005). In

professional and personnel testing, the use of different modes of observation has also become

more prevalent. Promotion of firefighters and police officers are based on a combination of

direct observations and post-exam review, where judges score candidates through live interaction

onsite and subsequently through video-based recordings (City of Columbus Civil Service, 2012).

More recently, in the K-12 assessment arena, evaluations of teaching effectiveness through

observations of teachers have raised interest in the use of videotaped classrooms to standardize

scoring (Bill and Melinda Gates Foundation, 2012). As these examples demonstrate,

assessments based on multiple modes of observations have gained increased use and interest in

the testing field.

Although live observations of examinee performance have been the traditional format of

testing performance assessment tasks, technological advances have provided the opportunity to

11

use videotaped observations that provide educators and testing agencies the promise of quality

control and also offer practical solutions to limitations in real-time assessments (Vivekananda-

Schmidt et al., 2007). Contrary to the rapid increase in multiple modes of observations to assess

examinees, there is limited research on the extent to which video-based observations can produce

psychometrically comparable scores when compared to direct and live observations. Furthermore,

it is unclear whether video-based observations may limit viewing and possibly distort

interpretation of affective qualities of the persons or situations. There is also research that

support relationship building between the examinee and the rater that may influence how

observers perceive encounters differently between live-interaction and videotaped observation

(Ryanet al., 1995). The effect of scoring based on different modes of observations has been an

understudied area in the educational measurement field. Yet, many non-cognitive attributes rely

on post-encounter video reviews for measurement.

3.2 Hierarchical Rater Models

Various measurement models have been proposed to examine performance assessments

that require the use of raters. A popular measurement model for performance assessments is

item response theory (IRT) model. However, using IRT to estimate examinee ability ignores

rater effects – as Tate (1999) noted, IRT model can confound rater effects with item effects. In

other words, IRT models cannot distinguish scores between severe or lenient raters, which can

affect item characteristics that are used to estimate examinee performance. When the scoring

process involves the use of both live observations and videotaped recordings, an additional layer

of consideration – mode of observation – may influence raters’ judgment process; that is,

confounding can also occur through the mode of observation, which can further complicate

measuring examinee ability.

12

These issues raise the need to “disentangle” effects associated with raters, mode of

observation, and items. A natural candidate for resolving the confounding effect between raters

and items has been the hierarchical rater model (HRM; Patz, Junker, Johnson, and Mariano, 2002;

Mariano, 2002). An HRM based signal detection rater model (HRM-SDT; DeCarlo, Kim, and

Johnson, 2011) has also been proposed to refine the original HRM. The idea behind these

hierarchical rater models is that scores assigned by raters become a direct indicator of

performance quality, which in turn, becomes an indicator of examinee ability.

The hierarchical structure of these models prevents confounding of items and rater effects.

In the HRM-SDT, a latent class signal detection theory (LC-SDT; DeCarlo, 2002) model is

specified as the rater model in level 1; the LC-SDT model in level 1 allows estimation of rater

characteristics such as severity and precision of raters. In level 2, the generalized partial credit

(GPC; Muraki, 1992) model is specified for the constructed response (CR) model to estimate

item characteristics such as item discrimination and item step parameters. In this study, an

additional level is specified to take into account the mode of observation, as an intermediary

level between the rater and item models.

3.3 Hierarchical rater model with LC-SDT

To examine differences in modes of observation, a hierarchical rater model (HRM) is

used. The HRM uses rater scores as indicators of performance quality, which thereby becomes

an indicator of examinee ability (DeCarlo, Kim, and Johnson, 2011). In the HRM-SDT, a signal

detection rater model is specified in level 1, which provides measures of rater precision and rater

effects. The LC-SDT model provides a measure of a rater’s precision in terms of how well they

discriminate between the latent classes (d). It also estimates their use of response criteria (ck),

13

which reflects rater effects such as how lenient or strict they score, as well as shrinkage and other

effects. In level 2, a polytomous IRT model is applied to estimate item parameters.

Equation (4) shows level 1 LC-SDT model for J raters and K discrete scores, such that

1≤k≤K:

)()|( ljljklljl dcFkYp (4)

Here, Yj is rater j’s observed response for item l, and F is the logistic cumulative distribution

function. The ηl represents the categorical latent classes, which are the discrete ordered scores of

examinee ability defined by the scoring rubric. In level 2, the generalized partial credit (GPC;

Muraki, 1992) model is specified as shown in equation (5):

]exp/[)]([exp)|(1

00

lg0

M ba

mlmll

gl

bap

(5)

Together, equations (4) and (5) can be combined to form equation (6), which becomes the HRM-

SDT.

dppYpYp )()|(),|()( (6)

3.4 Hierarchical Rater Model for Modes of Observation (HRM-MO)

This study proposes an extension to the HRM-SDT, which includes a model for

examining the quality of observation mode by applying the LC-SDT model. In level 1, rater

scores (Y) are indicators for latent categorical performance quality (η), which in turn, becomes

an indicator for the latent categorical quality of observation mode (Φ) in level 2; the quality of

observation mode becomes an indicator for examinee ability (θ) in level 3.

In the HRM-MO, the LC-SDT model is used for levels 1 (rater model) and 2 (mode of

observation model). The same equation (1) is used for level 1. In level 2, the following equation

14

(7) is used to model the mode of observation, where the parameter f indicates the criteria

associated with the mode of observation and the parameter h indicates how well the quality of

performances are discriminated within each latent class of observation modes (o).

)()|( ololoklol hfFp (7)

In level 3, the following equation, based on the GPC model is used to estimate item parameters,

where the parameter a indicates item discrimination, and the parameter b indicates the category

steps.

]exp/[)]([exp)|(1

00

lg0

M ba

m

lmllg

l

bap (8)

Together, equation (9) presents the HRM-MO, which will be examined in this study.

dpppYpYp )()|()|,(),,|()(,

(9)

Assumptions of conditional independence are made at each level, to simplify equations at each

level:

)|()|(),,|( YpYpYp (10)

)|(),(),|( ppp (11)

)|()|( pp (12)

The complete HRM-MO model can be obtained by substituting equations (10), (11), and (12)

into equation (9) and using probabilities from equations (4), (7), and (8).

15

4. Chicago Public Schools (CPS) Teacher Observation Data

4.1 Methods

Data. Empirical data for the Chicago Public Schools (CPS) teacher observation was

collected during the 2012-2013 academic year, as part of the teacher evaluation system recently

implemented in its legislative policies (Title 23: Education and Cultural Resources, Part 50

Evaluation of Certified Employees under Article 24A and 34 of the School Code). A total of

1,000 teacher evaluations were collected (total original sample was 1,060, of which 1,000 was

used for this study due to missing data and other sampling issues), where a principal observed a

teacher, jointly with an Instructional Effectiveness Specialist (IES). Evaluations measured two

non-cognitive attributes: Classroom Environment and Instruction. Classroom Environment

corresponds to Domain 2 of the CPS Framework for Teaching, which measures the teacher’s

ability to “create an environment of respect and rapport,” “establishing a culture for learning,”

“managing classroom procedures,” and “managing student behavior.” Instruction corresponds to

Domain 3, which measures “communicating with students,” “using questions and discussion

techniques”, “engaging students in learning”, “using assessment in instruction”, and

“demonstrating flexibility and responsiveness.” Details of the observation instrument used to

measure teacher’s non-cognitive attributes can be found in the Chicago Public School’s Teacher

Evaluation Plan and Handbook of Procedures (Chicago Public Schools, 2012). Teachers were

observed by principal-IES pairs. A total of 96 principals and 19 IES raters participated in the

study data.

Analysis. The CPS REACH data were fit using the LC-SDT model, as specified in

Equation (1). To avoid common boundary estimation problems among latent class models in

numerical computation, a partly Bayesian approach using posterior mode estimation was used

16

(McLachlan and Krishnan, 2008). In this approach, a prior specification of Bayes constants (α1)

is used to smooth boundaries with estimation issues (DeCarlo, Kim, and Johnson, 2012). Bayes

constants can be interpreted as adding α1 observations to the data. If α1 are set equal to zero, log

p(θ) = 0, which will obtain maximum likelihood estimates: 00 |

0

1| log][log

uu zxzx UKp

. Here,

K denotes the number of latent classes (behaviors). The influence of the prior is equivalent to

adding α1 / K cases to each latent class (Park and Lee, 2014).

To examine rater (observer) characteristics, parameter estimates (ckj and dj) from the

model were plotted. Post-hoc comparisons between raters, particularly between principals and

IES raters, were made using discrimination parameters that represent rater precision. Estimated

model parameters were used to derive model-based scores. Model-based scores that account for

rater characteristics were compared with original ratings provided by principals and IES raters

using measures of agreement (% exact agreement, kappa, quadratically-weighted kappa). Finally,

value-added scores (standardized to –3 and 3 scale) for mathematics, reading, and combined

subjects were used as covariates to regress their effect on the latent classes, logit

ZhbZp kk )|( , where Z represents the vector of covariates. Estimates of hk were compared

with traditional linear regression coefficients using original principal and IES ratings.

4.2 Results

Descriptive statistics. Over 85% of the rating distribution was concentrated in the

middle categories of “3” and “4”. Table 1 shows the distribution of ratings by component and

domain. Since each teacher observed was double scored by a principal-IES pair, their agreement

was examined. Non-cognitive attributes in Domain 2 (Classroom Environment) had about 75%

exact agreement, while attributes in Domain 3 (Instruction) had about 66% exact agreement.

17

Table 2 shows measures of agreement using kappa and weighted kappa, which accounts for

chance agreement and penalizes for larger discrepancies between the pairs. The table is also

stratified by tenured teacher and non-tenured teachers, as often examined by content experts.

Table 1. Descriptive statistics

Domain Component Rating Performance Category (%) “1” “2” “3” “4”

Classroom Environment (Domain 2)

Response and Rapport 1.70 19.49 62.02 16.79 Culture of Learning 2.65 30.73 53.87 12.74 Managing Procedures 3.36 26.27 60.30 10.08 Managing Behavior 3.35 26.91 59.23 10.51

Instruction (Domain 3)

Communication 3.75 31.57 54.88 9.80 Questioning and Discussion 8.37 51.40 35.02 5.21 Engaging in Learning 6.95 44.28 42.18 6.60 Assessment in Instruction 9.83 48.77 37.18 4.21 Flexibility and Responsiveness 8.46 46.99 39.01 5.53

Note: Values represent row percentages. A total of 1,000 observations were scored by principals and IES raters using a 4-point rating scale (CPS Framework for Teaching, see http://cps.edu/sitecollectiondocuments/cpsframeworkteaching.pdf for a full description of the rubric. Document accessed on October 1, 2014).

Model parameter estimates. Figure 2 shows the LC-SDT model parameters, restricted

to IES raters (complete table of results for all 115 raters can be obtained from the author). In the

left figure, rater precision estimates and their respective 95% confidence intervals are plotted.

The X-axis represents the 19 IES raters and the Y-axis represents the rater precision estimates (dj

from Equation [1]). Results show a wide variability in rater precision even among IES raters. IES

raters are highly-trained observers who visit schools to work with principals to improve their

teacher evaluation skills. The wide variability in rater precision estimates indicates the need to

adjust for rater-specific differences, which are often ignored in practice. In the figure to the right,

plots of the rater criteria (ckj) are presented. Since the CPS Framework for Teaching is based on 4

ordinal performance categories, there are three criteria locations. A criteria estimate that is higher,

relative to other raters indicates greater severity; lower estimates indicate leniency.

18

Table 2. Measures of agreement between principals and IES raters

Data Domain Component % Exact

Agreement Kappa

(Unweighted)Kappa

(Linear) Kappa

(Quadratic)

All teachers (n=1,000)

Classroom Environment

Response and Rapport 75.42% .55 (.02) .60 (.02) .67 (.03) Culture of Learning 75.69% .59 (.02) .63 (.02) .68 (.03) Managing Procedures 76.35% .58 (.02) .62 (.02) .69 (.03) Managing Behavior 73.11% .53 (.02) .59 (.02) .66 (.03)

Instruction

Communication 71.54% .52 (.02) .57 (.02) .64 (.03) Questioning and Discussion 68.47% .48 (.02) .53 (.02) .58 (.03) Engaging in Learning 66.11% .46 (.02) .51 (.02) .58 (.03) Assessment in Instruction 66.80% .47 (.02) .52 (.02) .59 (.03) Flexibility and Responsiveness 60.16% .37 (.02) .44 (.02) .51 (.03)

Non-tenured Teacher observations(n=503)



Instruction


Tenured teacher observations(n=497)



Instruction


Note: IES raters are “Instructional Effectiveness Specialists” who provide guidance on rater training to principals. Values in parentheses are standard errors.

19

Note:

1. Figure on left shows estimates of rater precision (dj estimates) across the 19 IES raters. A greater rater precision estimate reflects greater ability for the rater to discriminate differences between behaviors.

2. Figure on the right shows relative criteria estimates for the 19 IES raters. Since there are 4 categories in the rubric, there are 3 criteria locations (cut points) in the distribution. Criteria estimates were standardized to the same scale to make comparisons between raters. Higher criteria estimate indicates severity, while a lower estimate indicates leniency.

Figure 2. Parameter estimates from LC-SDT: Rater precision and relative criteria

01

23

45

67

8R

ater

Pre

cisi

on

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19Rater

Rater Precision

-.4-.2

0.2

.4.6

.81

1.2

1.4

Rel

ativ

e C

riter

ia

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19Rater

Relative Criteria

20

Comparing model-based scores with original rater scores. Based on model estimated

parameters, model-based scores were generated. Value-added scores (combined subjects,

mathematics, and reading) were regressed simultaneously to the estimated latent classes. In

comparison, traditional linear regression was used to examine the regression coefficient effects.

Table 3 presents the results comparing the two methods.

Results show that when a psychometric rater model is used, the coefficients of the value-

added scores have greater effect sizes. For example, in the combined value-added scores, the

regression coefficient for the latent class regression is .15, while it is .09 for linear regression.

While this difference is modest, with similar standard error estimates, the difference in effect

sizes indicate some value in using psychometric rater models to refine measurement precision of

non-cognitive attributes.

Table 3. Comparison of coefficient effects: Latent class regression and linear regression

Value-added score Latent class regression coefficients using

model-based Scores Linear regression coefficients

using original ratings Combined .154 (0.048)** .093 (.039)* Mathematics .259 (0.050)*** .166 (.036) Reading –.009 (0.043) –.005 (.035)

Note: Value-added scores standardized to –3 and 3 scale (see Value-Added Research Center, 2014). *p<.05; **p<.01; ***p<.001.Values in parentheses represent standard errors.

5. Columbus Police and Firefighter Promotion Data

5.1 Methods

Data. In this section, data were analyzed from a real-world administration of live and

video-recorded observation scores, where two exercises (items) are given to candidates and 6

different raters to score each exercise, comprising a total of 12 raters. For each exercise, 3 raters

score the candidate through live observation, with possible interactions between the examinee

and the raters; the remaining 3 raters score a video recording of the performance at a subsequent

21

time. In other words, raters 1, 2, and 3 score exercise 1 through live observation; raters 4, 5, and

6 score exercise 1 through videotaped recording. Similarly, raters 7, 8, and 9 score exercise 2

through live observation; raters 10, 11, and 12 score exercise 2 through videotaped recordings.

All raters were trained to score using a holistic 3-point rating scale, which measures the

following skills: oral communication, interpersonal relations, information analysis, and problem

sensing and resolution ability. The data contain 440 global ratings from each rater for each

exercise.

Analysis. Data were used to fit both HRM-SDT and HRM-MO. Model fit indices,

parameters, latent class size, and classification indices were compared. Estimation was

conducted using Latent Gold 4.5 (Vermunt and Magidson, 2005).

5.2 Results

Descriptive statistics. Table 4 shows the descriptive statistics of the ratings as well as the

rater agreement statistics for each mode of observation.

Table 4. Distribution of scores assigned and rater agreement

Exercise Mode Rater Score Assigned (%) Rater Agreement

“1” “2” “3” Kappa Weighted Kappa

1

(n=440)

Live (onsite) 1 16.82 35.00 48.18

.51 .66 2 14.09 44.09 41.82 3 11.82 36.82 51.36

Video 4 14.77 41.14 44.09

.38 .56 5 15.00 42.95 42.05 6 15.45 49.55 35.00

Combined (overall) 14.66 41.59 43.75 .36 .52

2

(n=440)

Live (onsite) 7 7.50 41.59 50.91

.50 .62 8 3.86 36.14 60.00 9 6.36 37.27 56.36

Video 10 16.36 43.18 40.45

.37 .57 11 9.09 47.95 42.95 12 16.59 45.00 38.41

Combined (overall) 9.96 41.86 48.18 .31 .49 Note: Each rater assigned scores for 440 observations on a 3-point scale. “Weighted kappa” used quadratic weights. Scores assigned (%) indicates row percentages.

22

Results indicate that less than 15% of the scores received “1” for exercise 1, while less than 10%

of scores received “1” for exercise 2. Similar proportions of scores were assigned for “2” and “3”

across both exercises.

Rater agreement was greater for live observations than for video-based observations.

Kappa and quadratically weighted kappa were used to measure rater agreement. Kappa takes into

account agreement that can occur by chance (Cohen, 1960; Cohen, 1968). The weighted kappa

penalizes larger discrepancies between raters more than smaller discrepancies (Shaeffer, Briel,

and Fowles, 2001). For exercise 1, the kappa and weighted kappa were .51 and .66 for live

observations and .38 and .56 for video-based observations, respectively. A similar trend was

found for exercise 2, where the kappa and weighted kappa were .50 and .62 for live observations

and .37 and .57 for video-based observations, respectively. The combined rater agreement

measures across modes of observations decreased.

Model fit, latent class sizes, and classification. Table 5 shows the model fit comparison

between the HRM-SDT and HRM-MO.

Table 5. Model comparison using information criteria

Fit Index HRM-SDT HRM-MO # of parameters 42 54

–2LL 8,101.12 7,866.50 AIC 8,185.12 7,974.50 BIC 8,356.77 8,195.18

Note: “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation. Results of the model comparison for the real-world data indicate a better model fit for the HRM-

MO model (lower AIC and BIC). Table 6 shows the classification indices, Pc and λ.

Classification indices indicate the quality of classification based on posterior probabilities of the

model. The Pc measures classification accuracy, and the λ statistic accounts for classification that

23

can occur by chance (Clogg and Manning, 1996). For both exercises, classification was lower for

the video-based observation (η12 and η22) when compared to live observation (η11 and η21). In

addition, classification was lower for the combined latent categorical variables (Φ1 and Φ2).

Latent class sizes for the latent categorical variables are also presented in Table 6.

Table 6. Classification indices and latent class sizes by model

Model Latent

variable Classification Latent class sizes Pc λ Class 1 Class 2 Class 3

HRM-SDT η1 .93 .87 .16 (.02) .47 (.03) .37 (.03) η2 .92 .82 .16 (.02) .32 (.03) .52 (.04)

HRM-MO

η11 .92 .87 .20 (.03) .38 (.03) .42 (.03) η12 .89 .79 .15 (.02) .49 (.05) .35 (.05) η21 .90 .82 .17 (.03) .33 (.03) .50 (.04) η22 .88 .75 .14 (.02) .40 (.04) .46 (.04) Φ1 .91 .82 .14 (.03) .54 (.05) .32 (.05) Φ2 .86 .74 .19 (.03) .45 (.05) .36 (.05)

Note: Proportion correctly classified (Pc) and λ are both measures of classification based on posterior probability (Clogg, 1995). The λ statistic accounts for classification that can occur by chance. Values in parenthesis represent standard errors.

Level 1: Rater model parameters. Table 7 shows level 1 rater parameters by HRM-

SDT and HRM-MO models. Results indicate that rater discrimination (d parameter), which

indicates how well a rater is able to discriminate between different qualities of performance

(rater precision), was generally greater for live (onsite) scoring than video-based scoring for both

exercises, where the difference was slightly greater for exercise 1 than exercise 2. However, the

average rater discrimination between the two exercises was comparable. The distribution of rater

discrimination indices shows raters that are able to better detect differences between the

categories. Although estimates were different, the overall trends between the HRM-SDT and

HRM-MO were similar.

Figures 3 (right: discrimination, left: relative criteria) was created to visually illustrate the

parameters for the HRM-MO model.

24

Table 7. Rater parameters: Level 1 (Signal Detection Theory Rater Model) by model Exercise Mode Rater Parameter HRM-SDT HRM-MO

1

Live (Onsite)

1 c11 1.20 (.31) 1.30 (.37) c12 4.43 (.38) 6.05 (.62) d1 3.55 (.30) 4.65 (.48)

2 c21 .66 (.27) .54 (.28) c22 4.62 (.38) 5.49 (.49) d2 3.26 (.26) 3.66 (.30)

3 c31 .41 (.28) .05 (.25) c32 4.04 (.38) 3.75 (.38) d3 3.42 (.31) 3.04 (.29)

Video

4 c41 .22 (.24) .60 (.28) c42 3.14 (.30) 3.85 (.38) d4 2.25 (.21) 2.83 (.26)

5 c51 .37 (.24) .78 (.28) c52 3.53 (.32) 4.34 (.39) d5 2.47 (.23) 3.12 (.31)

6 c61 .43 (.25) .97 (.32) c62 4.02 (.35) 5.18 (.52) d6 2.47 (.22) 3.29 (.30)

2

Live (Onsite)

7 c71 .40 (.26) .41 (.29) c72 4.38 (.53) 4.90 (.57) d7 3.03 (.29) 3.50 (.33)

8 c81 1.30 (.29) 1.36 (.30) c82 3.59 (.44) 3.66 (.48) d8 3.02 (.31) 3.20 (.39)

9 c91 .55 (.26) .55 (.30) c92 4.64 (.55) 6.31 (.96) d9 3.55 (.35) 5.10 (.93)

Video

10 c101 .70 (.26) 1.36 (.39) c102 4.09 (.41) 5.41 (.65) d10 2.41 (.23) 3.34 (.34)

11 c111 .15 (.25) .26 (.31) c112 4.59 (.46) 6.05 (.65) d11 2.79 (.26) 3.86 (.38)

12 c121 .32 (.22) .71 (.27) c122 3.24 (.31) 3.98 (.38) d12 1.82 (.18) 2.36 (.22)

Note: Values in parenthesis represent standard errors. Parameter d represents rater discrimination and c represents rater criteria. “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.

25

Note:

1. In the left figure, the X-axis indicates the rater IDs; the Y-axis indicates relative criteria estimates. Raters 1 to 3 and 6 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively. Horizontal lines were added on criteria location where the likelihood ratios are maximized as reference points.

2. In the right figure, the X-axis indicates rater IDs; the Y-axis indicates rater discrimination estimates. Raters 1 to 3 and 6 to 9 scored onsite (live scoring) for exercises 1 and 2, respectively. Raters 4 to 6 and 10 to 12 scored using a video for exercises 1 and 2, respectively.

Figure 3. Plots of relative criteria by rater characteristics

26

In Figure 3 (left), the relative criteria for the 12 raters are presented. Relative criteria are

standardized estimates of rater effects that allows comparison between raters (direct comparisons

of c parameters between raters in Table 7 is not accurate, due to differences in d parameters,

which needs to be standardized). The X-axis indicates rater IDs and the Y-axis presents the

relative criteria locations that have been standardized by accounting for differences in rater

discrimination. Since there are three categories, there are two criteria locations per rater.

Horizontal lines were added to provide reference points in the criteria where the likelihood ratios

are maximized, meaning higher rater criteria location indicates leniency and lower location

indicates severity. In general, all raters were severe in their use of the lower scoring category, as

indicated by the relative criteria estimates below the horizontal line.

Figure 3 (right) shows the rater discrimination estimates by rater. The X-axis represents

the rater IDs, and the Y-axis represents the rater discrimination estimates. As indicated in Table

7, rater discrimination was generally higher for live observations. Moreover, rater 12 had the

lowest rater discrimination, indicating lower ability to discriminate differences between the

qualities of performance demonstrated by the examinees.

Level 2: Mode of observation parameters. Table 8 shows the level 2 parameters,

pertaining to the quality of observation mode. Similar to level 1, the LC-SDT model was used to

estimate differences in the quality of latent categorical scores between modes of observation. The

f parameter indicates mode effect, similar to the c parameter that indicated rater effects. The h

parameter, similar to the d parameter, indicates how well the mode of observation was used to

discriminate differences between latent qualities of examinee performance.

Results indicate that for exercise 1, the h parameter was greater for video-based

recordings than for live observations. For exercise 2, the live observation had slightly greater

27

discrimination than video-based recordings. These results may indicate that video-based

observations were better at discriminating different qualities of examinee performance than live

observations for exercise 1. Relative criteria based on the f parameter were similar between the

different modes of observations.

Table 8. Mode of observation parameters: Level 2 (Signal Detection Model) Mode of observation Parameter Exercise 1 Exercise 2

Live observation f11 2.19 ( .77) 1.97 (1.04) f12 4.83 ( .89) 6.68 (1.79) h1 3.77 ( .76) 5.86 (1.67)

Video-based observation f21 2.68 (1.26) .86 ( .46) f22 8.20 (1.53) 5.30 (1.29) h2 5.75 (1.17) 4.24 (1.24)

Note: Values in parenthesis represent standard errors. Parameter h represents discrimination and f represents criteria. Combining results from level 1 and level 2, the estimates seem to indicate that raters assigned to

score live observations were more precise (higher rater discrimination) than raters assigned to

score video-based recordings. However, between the two modes of observations, video-based

recordings allowed greater discrimination of differences in quality than live observations for

exercise 1.

Level 3: Item parameters. Table 9 presents the item parameters for the two exercises by

HRM-SDT and HRM-MO models.

Table 9. Item parameters: Level 3 (Generalized partial credit model)

Parameter Exercise 1 Exercise 2

HRM-SDT HRM-MO HRM-SDT HRM-MO a 1.67 (.64) 2.53 (.84) 2.04 (1.03) 2.63 (.97) b1 –2.16 (.63) –3.29 (.92) –2.34 (1.04) –2.70 (.86) b2 .68 (.28) 1.38 (.57) –.32 ( .26) 1.04 (.58)

Note: Values in parenthesis represent standard errors. Parameter a represents item discrimination and b represents category step parameter based on the generalized partial credit model (Muraki, 1992). “HRM-SDT” is the hierarchical rater model with the latent class signal detection theory model as the rater model (DeCarlo, Kim, and Johnson, 2011). “HRM-MO” is an extension of the HRM-SDT with an additional level for mode of observation.

28

Between the two HRMs, the HRM-MO had greater item discrimination (a) estimates. Moreover,

the step category (b) parameters were spaced further apart. However, the general trends in the

parameters were similar, with slightly greater estimates of item discrimination for exercise 2.

6. Monte Carlo Simulation Study

6.1 Methods

Monte carlo simulations were conducted to examine the sensitivity of the HRM-MO

model under varying sample sizes of 100, 400, and 1,000 for two exercises scored on two modes

of observation (i.e., live observation and videotaped observation) with three raters each,

following the same data structure in the Columbus examination. The sample sizes were designed

to account for realistic numbers of examinees who take the promotion exam in the real-world

data. Although possible, it would be extremely rare that over 1,000 examinees will be tested

simultaneously in a national setting for the particular exam analyzed in Study 1.

Three conditions were used to generate data. Population values (generating values)

associated with these conditions are presented in Table 10. In condition 1, all raters are assumed

to have the same rater parameters, and item parameters are also the same; only the level 2

parameters (mode of observation level) differ. In condition 2, item parameters in level 3 are

different, in addition to different level 2 parameters. In condition 3, item, mode of observation,

and raters have different parameter estimates. The motivation for these different conditions is to

examine the effect of parameter recovery at each level. Results from the real-world analysis in

Study 1 indicated that parameters from all three levels could vary. Given the three parameter

conditions presented in Table 10 and the three sample size sets, there were 9 total conditions

examined in the simulation study (9 total conditions = 3 parameter conditions in Table 7 x 3

sample size conditions).

29

Table 10. Conditions for simulation: Generating values

Level Exercise Type Parameter Condition 1 Condition 2 Condition 3

Level 3: CR item model Generalized partial credit model

1 Exercise 1 b11 –1.5 –1.5 –1.5 b12 1.5 1.5 1.5 a1 2.0 2.0 2.0

2 Exercise 2 b21 –1.5 –2.0 –2.0 b22 1.5 2.0 2.0 a2 2.0 3.0 3.0

Level 2: Mode of observation model Latent class signal detection theory model

1

Live (onsite) f111 1.5 1.5 1.5 f112 4.5 4.5 4.5 h11 3.0 3.0 3.0

Video f121 2.5 2.5 2.5 f122 7.5 7.5 7.5 h12 5.0 5.0 5.0

2

Live (onsite) f211 2.5 2.5 2.5 f212 7.5 7.5 7.5 h21 5.0 5.0 5.0

Video f221 2.5 2.5 2.5 f222 7.5 7.5 7.5 h22 5.0 5.0 5.0

Level 1: Rater model Latent class signal detection theory model

1

Rater 1: Live observation

c11 2.0 2.0 1.5 c12 6.0 6.0 4.5 d1 4.0 4.0 3.0


c21 2.0 2.0 2.0 c22 6.0 6.0 6.0 d2 4.0 4.0 4.0


c31 2.0 2.0 2.5 c32 6.0 6.0 7.5 d3 4.0 4.0 5.0

Rater 4: Video

c41 2.0 2.0 1.5 c42 6.0 6.0 4.5 d4 4.0 4.0 3.0

Rater 5: Video

c51 2.0 2.0 2.0 c52 6.0 6.0 6.0 d5 4.0 4.0 4.0

Rater 6: Video

c61 2.0 2.0 2.5 c62 6.0 6.0 7.5 d6 4.0 4.0 5.0

2


c71 2.0 2.0 1.5 c72 6.0 6.0 4.5 d7 4.0 4.0 3.0


c81 2.0 2.0 2.0 c82 6.0 6.0 6.0 d8 4.0 4.0 4.0


c91 2.0 2.0 2.5 c92 6.0 6.0 7.5 d9 4.0 4.0 5.0

Rater 10: Video

c101 2.0 2.0 1.5 c102 6.0 6.0 4.5 d10 4.0 4.0 3.0

Rater 11: Video

c111 2.0 2.0 2.0 c112 6.0 6.0 6.0 d11 4.0 4.0 4.0

Rater 12: Video

c121 2.0 2.0 2.5 c122 6.0 6.0 7.5 d12 4.0 4.0 5.0

Note: Samples sizes of 100, 400, and 1,000 were used across the three conditions.

30

Data were generated using Stata 12 and fit using Latent Gold 4.5 using posterior mode

estimation and Bayes’ constants. 100 replications of data were generated. Summary of parameter

results were examined using bias and mean squared error (MSE).

6.2 Results

Table 11 presents the parameter recovery by condition and sample size. Results were

averaged across parameters at each level and presented in terms of bias, % bias, and MSE. For

condition 1, bias was greatest for level 2 parameters, which differed for modes of observation for

exercise 1. However, for sample size of 400, % bias was less than 5.3% for level 2 parameters.

In condition 2, data were generated to have different item parameters between the exercises; the

same mode of observation parameters was preserved from condition 1. Bias decreased to less

than 5% for a sample size of 400 for level 1 and 2 parameters. However, level 3 parameters still

had % bias over 20%, even with sample size of 1,000. Finally, in condition 3, which allowed all

parameters to vary, including level 1 rater parameters, the % bias results were similar to

condition 2. Even with a sample size of 1,000, the level 3 item parameters had % bias over 20%.

MSE estimates decreased with larger sample sizes.

Tables 12 and 13 present latent class sizes by sample size and condition and classification

indices, respectively. Results from these simulation studies indicate that the recovery of latent

class sizes and classification of the HRM-MO are within range for real-world data applications

for measuring non-cognitive attributes. Overall, results from the simulation study indicate that

the greatest bias occurred from item parameters in level 3. Rater parameters (level 1) and modes

of observation parameters (level 2) were not largely biased even with differences in population

values.

31

Table 11. Parameter recovery by condition and sample size

Condition Level Parametersn=100 n=400 n=1,000

Bias % Bias MSE Bias % Bias MSE Bias % Bias MSE

1

3: Item b –.005 11.4% .394 .009 3.8% .131 –.008 1.8% .060 a –.127 6.8% .312 .017 .9% .070 .012 .6% .029

2: Mode f .057 11.8% 1.426 –.162 5.1% .903 –.039 3.0% .473 h –.033 9.0% .919 .117 5.3% .550 .069 3.1% .317

1: Rater c –.122 3.2% .469 –.021 .9% .102 –.011 .7% .041 d .112 2.9% .464 .027 .9% .098 .009 .5% .037

2

3: Item b .066 27.0% .529 .020 17.2% .276 .028 18.7% .171 a –.354 22.9% .639 –.118 21.6% .411 –.134 21.8% .339

2: Mode f .017 10.8% 1.415 –.127 4.1% .836 –.120 4.2% .458 h –.071 8.5% .820 .113 4.1% .472 .094 3.5% .264

1: Rater c –.150 3.6% .515 –.033 1.1% .107 –.015 .7% .039 d .134 3.2% .528 .027 .8% .099 .011 .5% .036

3

3: Item b .002 22.5% .613 –.009 17.4% .242 –.019 17.5% .153 a –.360 23.7% .644 –.157 22.0% .402 –.139 21.5% .317

2: Mode f .016 10.1% 1.299 –.150 3.4% .734 –.097 2.2% .405 h –.031 7.9% .815 .155 3.6% .426 .118 2.7% .262

1: Rater c –.139 3.3% .549 –.034 1.0% .129 –.010 .6% .048 d .117 2.8% .538 .036 1.0% .121 .012 .5% .047

Note: Bias, % Bias, and mean squared error (MSE) represent mean estimates for the parameters at each level. 100 replications were

used in the simulation. Bias was defined as follows:

N

n

N

nnn xexe

Nxexe

NxBias

1 1)()(

1)]()([

1)( . MSE was defined as

follows:

N

nn xexe

NxMSE

1

2)]()([1

)( .

32

Table 12. Latent class sizes by condition and sample size

Condition Latent

Variablen=100 n=400 n=1,000

Class 1 Class 2 Class 3 Class 1 Class 2 Class 3 Class 1 Class 2 Class 3

1

η11 .29 (.05) .41 (.06) .29 (.05) .30 (.03) .42 (.03) .29 (.02) .29 (.02) .42 (.02) .29 (.02) η12 .28 (.05) .43 (.06) .28 (.05) .28 (.03) .44 (.03) .28 (.02) .28 (.02) .44 (.02) .28 (.02) η21 .28 (.05) .45 (.06) .28 (.05) .29 (.02) .43 (.03) .28 (.02) .28 (.02) .44 (.02) .28 (.01) η22 .28 (.05) .44 (.06) .28 (.05) .28 (.02) .44 (.03) .28 (.03) .28 (.01) .44 (.02) .28 (.01) Φ1 .28 (.07) .44 (.09) .28 (.07) .27 (.04) .46 (.05) .27 (.04) .27 (.03) .47 (.03) .27 (.03) Φ2 .28 (.06) .45 (.06) .27 (.05) .28 (.03) .44 (.04) .28 (.03) .27 (.02) .46 (.03) .27 (.02)

2

η11 .29 (.05) .41 (.06) .30 (.05) .29 (.03) .41 (.03) .29 (.03) .29 (.01) .42 (.02) .29 (.02) η12 .28 (.05) .43 (.05) .29 (.05) .28 (.02) .44 (.02) .28 (.02) .28 (.01) .44 (.02) .28 (.02) η21 .28 (.05) .43 (.05) .29 (.05) .28 (.02) .43 (.03) .29 (.03) .29 (.02) .43 (.02) .29 (.01) η22 .28 (.04) .43 (.06) .29 (.05) .28 (.03) .43 (.03) .29 (.02) .28 (.02) .43 (.02) .29 (.01) Φ1 .28 (.06) .42 (.07) .30 (.07) .27 (.04) .45 (.05) .28 (.04) .26 (.03) .47 (.03) .27 (.02) Φ2 .28 (.04) .42 (.06) .30 (.06) .28 (.03) .44 (.04) .28 (.03) .27 (.02) .45 (.03) .28 (.02)

3

η11 .29 (.05) .42 (.06) .29 (.05) .29 (.03) .42 (.03) .29 (.03) .29 (.02) .42 (.02) .29 (.01) η12 .28 (.05) .44 (.06) .28 (.05) .28 (.03) .44 (.03) .28 (.03) .28 (.01) .44 (.02) .28 (.02) η21 .29 (.05) .43 (.05) .28 (.05) .29 (.02) .43 (.03) .29 (.03) .29 (.02) .43 (.02) .29 (.02) η22 .29 (.05) .43 (.06) .28 (.05) .29 (.02) .43 (.03) .28 (.02) .29 (.01) .43 (.02) .28 (.01) Φ1 .28 (.06) .43 (.09) .29 (.06) .27 (.04) .45 (.05) .27 (.04) .27 (.03) .46 (.03) .27 (.03) Φ2 .29 (.05) .43 (.06) .28 (.06) .28 (.03) .44 (.04) .28 (.03) .28 (.02) .45 (.02) .27 (.02)

Note: Values in parenthesis represent standard errors

33

Table 13. Classification indices by conditions and sample size

Condition Sample

size Classification

Rater Mode of observation Exercise 1 Exercise 2

Direct Video Direct Video Direct Video

1

100 Pc .956 .954 .961 .963 .878 .918 λ .923 .918 .928 .932 .774 .848

400 Pc .950 .952 .954 .954 .877 .916 λ .914 .913 .920 .918 .772 .848

1000 Pc .948 .949 .952 .952 .877 .911 λ .909 .910 .915 .915 .770 .837

2

100 Pc .954 .957 .963 .961 .876 .919 λ .921 .924 .935 .931 .778 .857

400 Pc .949 .952 .956 .956 .884 .917 λ .913 .914 .922 .922 .787 .850

1000 Pc .948 .949 .953 .953 .881 .916 λ .911 .909 .918 .918 .776 .847

3

100 Pc .956 .958 .966 .963 .880 .921 λ .923 .923 .939 .935 .782 .859

400 Pc .952 .954 .959 .960 .887 .920 λ .917 .917 .929 .930 .792 .856

1000 Pc .951 .952 .958 .959 .887 .916 λ .915 .914 .927 .927 .789 .849

Note: Proportion correctly classified (Pc) and λ are both measures of classification based on posterior probability (Clogg, 1995). The λ statistic accounts for classification that can occur by chance. Values in parenthesis represent standard errors.

7. Conclusion

This paper reviews psychometric rater models used in the measurement literature to

refine measures of non-cognitive attributes. While the use of non-cognitive attributes provide

new approaches to target interventions that can impact human capital, measurement issues have

yet to be resolved. This paper contributes to the literature in this regard, by proposing a solution

to generate model-based scores that can provide more refined estimates. To demonstrate this

application, psychometric rater models used in the educational measurement and mathematical

psychology literature are presented. In addition, a new model, extending the existing foundation

of LC-SDT is also proposed.

34

The analysis conducted in this paper show the utility of applying these techniques. First,

the CPS teacher evaluation data were fit using the LC-SDT model. Results showed that model-

based scores that account for rater effects generated larger effect sizes with value-added scores.

Although modest, the difference when compared to traditional techniques that use linear

regression can be quite large when taken into context with other value-added results shown in the

literature (Bill and Melinda Gates Foundation, 2012). Moreover, using latent class regression

that incorporates a psychometric rater model may yield more refined results when compared to

traditional value-added models; further investigation is needed.

This study also contributes by proposing a new method that accounts for mode of

observation. Many non-cognitive attributes can be directly observed or measured through post-

hoc mechanisms such as video playback. Findings from the real-world data analysis show utility

in this approach. The monte carlo simulation results also show promise in the continued

development of these techniques as more refined methods to capture learner’s non-cognitive

attributes.

Recently, there has been an increase in observation–based methods to assess candidates,

as scoring can be based on live or video–based observations – such testing practice is

administered frequently in medical education and in other professions. In the K–12 education

literature, measuring effective teaching has been conducted onsite by observers or offsite using

video recordings. Given the increase in observations to measure performance, a measurement

model that accounts for modes of observation is necessary.

The HRM–MO model proposed in this study provides a framework for extending the

HRM, which previously only accounted for raters at level 1 and the items at level 2. The HRM–

MO accounts for a separate level between the rater and item levels that models the effect of

35

observation mode. This can be a useful approach for researchers as multiple modes of

observation can be applied in high–stakes testing. Quality of observation mode can provide

information for planning the scoring design. In addition, this study contributes to the growing

literature on developments of HRM, which can lead to improved measurements of examinee

performance.

The HRM–MO used in this study can be a useful model for studying modes of

observations. It provided greater explanation on differences between modes of observation than

the simple rater agreement statistics or the traditional HRM–SDT. The model fit indices based on

HRM–MO also showed improved fit, which could be a promising indication for further

development of this model. Simulation results also showed interesting patterns regarding the

higher–level parameters in level 3. Conditions that reflect better estimation for item parameters

should be examined as part of future research.

As greater emphasis is placed on investing in non-cognitive attributes of learners at

various stages of training, additional care should be applied in its measurement. Although much

work in the professions education and educational measurement literature has contributed to this

effort, translation of these techniques to further reduce gaps between disciplines may be of need.

While measurement sciences focus on improving the precision around constructs, a method to

align these methodological trends with long-term outcomes would support better estimation and

deeper understanding of how non-cognitive attributes influence human development and

potential.

36

References

Abramson, David, Yoon Soo Park, Tasha Stehling-Ariza, and Irwin Redlener. 2010. “Children as

bellwethers of recovery: Dysfunctional systems and the effects of parents, households,

and neighborhoods on serious emotional disturbance in children after Hurricane Katrina”

Disaster Medicine and Public Health Preparedness 4:S17–S27.

Agresti, Alan. 2002. Categorical data analysis. Hoboken: Wiley.

Almlund, Mathilde, Angela Duckworth, James Heckman, and Tim Kautz. 2011. “Personality

psychology and economics.” In Handbook of the Economics of Education, edited by Eric

Hanushek, Stephen Machin, and Ludger Wöβmann. Amsterdam: Elsevier.

Beard, J. D., B. C. Jolly, D. I. Newble, W. E. Thomas, J. Donnelly, and L. J. Southgate. 2005.

“Assessing the technical skills of surgical trainees.” British Journal of Surgery 92 (6):

778–82.

Bill and Melinda Gates Foundation, Measures of Effective Teaching (MET). 2012. Gathering

feedback for teaching: Combining high–quality observations with student surveys and

achievement gains. Seattle: Bill and Melinda Gates Foundation.

Cardy, Robert L., and Gregory Dobbins. 1986. “Affect and appraisal accuracy: Liking as an

integral dimension in evaluating performance.” Journal of Applied Psychology 71:672–

678.

City of Columbus Civil Service Commission. 2012. 2012 police lieutenant and commander

promotional examination: Test guide. Columbus: City of Columbus Civil Service

Commission.

Clogg, Clifford, and Wendy D. Manning. 1996. “Assessing reliability of categorical

measurements using latent class models.” In Categorical variables in developmental

37

research, edited by Alexander von Eye and Clifford C. Clogg. New York: Academic

Press.

Cohen, Jacob. 1960. “Coefficient of agreement for nominal scales.” Educational and

Psychological Measurement 20:37–46.

Cohen, Jacob. 1968. “Weighted kappa: Nominal scale agreement with provision for scaled

disagreement or partial credit.” Psychological Bulletin 70: 213–220.

Danielson, Charlotte. 2007. Enhancing professional practice: A framework for teaching.

Alexandria, VA: Association for Supervision and Curriculum Development.

Dayton, Chauncey Mitchell. 1998. Latent class scaling models. Thousand Oaks: Sage.

DeCarlo, Lawrence T. 2002. “A latent class extension of signal detection theory, with

applications.” Multivariate Behavioral Research 36:423–451.

DeCarlo, Lawrence T., and Youngkoung Kim, and Matthew S. Johnson. 2011. “A hierarchical

rater model for constructed responses, with a signal detection rater model.” Journal of

Educational Measurement 48: 333–356.

Ezra, Daniel, Raj Aggarwal, Michel Michaelides, Narciss Okhravi, Seema Verma, Larry

Benjamin, Philip Blom, Ara Darzi, and Paul Sullivan. 2009. “Skills acquisition and

assessment after a microsurgical skills course for ophthalmology residents.”

Ophthalmology 116:257–262.

Groot, Wim. 2000. “Adaptation and scale of reference bias in self-assessments of quality of life.”

Journal of Health Economics 19 (3):403–420.

Gutman, Leslie, and Ingid Schoon. 2013. The impact of non-cognitive skills on outcomes for

young people. London: Institute of Education.

38

Heckman, James, and Tim Kautz. 2013. “Fostering and measuring skills: Interventions that

improve character and cognition.” Working Paper no. 2013-019, HCEO, Chicago, IL.

Heckman, James, Rodrigo Pinto, and Peter Savelyev. 2013. Understanding the mechanisms

through which an influential early childhood program boost adult outcomes. American

Economic Review 103 (6):1–35.

Heckman, James, and Yona Rubinstein. 2001. “The importance of noncognitive skills: Lessons

from the GED Testing Program.” American Economic Review. 91 (2):145–9.

Hely, M. A., T. Chey, A. Wilson, P. M. Williamson, D. J. O’Sullivan, D. Rail, J. G. Morris.1993.

“Reliability of the Columbia scale for assessing signs of parkinson’s disease.” Movement

Disorders 8:466–472.

Jackson, C. Kirabo. 2013. “Non-cognitive ability, test scores, and teacher quality: Evidence from

9th grade teachers in North Carolina.” Working Paper no. 18624, NBER, Cambridge, MA.

Jacob, Brian A., and Lars Lefgren. 2008. “Can principals identify effective teachers? Evidence

on subjective performance evidence in education.” Journal of Labor Economics 26 (11)

101–36.

Makoul, Gregory, and Raymond Curry. 2007. “The value of assessing and addressing

communication skills.” Journal of American Medical Association 298:1057–9.

Mariano, Louis. T. 2002. “Information accumulation, model selection and rater behavior in

constructed response student assessments.” Manuscript, Carnegie Mellon University.

McLachlan, Geoffrey, and Thriyambakam Krishnan. 2008. The EM algorithm and extensions.

San Francisco: Wiley.

Muraki, Eiji. 1992. “A generalized partial credit model: Application of an EM algorithm.”

Applied Psychological Measurement 16:159–176.

39

Olshfski, Dorothy, and Robert Cunningham. 1985. “Improving management effectiveness by

training – The use of video techniques to assist self–appraisal.” Technovation 3:235–242.

Park, Yoon Soo, Steven Holtzman, and Jing Chen. 2014. “Evaluating efforts to minimize rater

bias in scoring classroom observations.” In Designing Teacher Evaluation Systems: New

Guidance from the Measures of Effective Teaching Project, edited by Thomas Kane,

Kerri Kerr, and Robert Pianta. San Francisco: Wiley.

Park, Yoon Soo, and Young-Sun Lee. 2014. “An extension of the DINA model using covariates:

Examining factors affecting response probability and latent classification.” Applied

Psychological Measurement 38 (5): 376–90.

Park, Yoon Soo, Janet Riddle, and Ara Tekian, 2014. “Validity Evidence of resident competency

ratings and the identification of problem residents.” Medical Education 48 (6):614–22.

Patz, R. J. 1996. “Markov chain Monte Carlo methods for item response theory models with

applications for the National Assessment of Educational Progress.” Manuscript, Carnegie

Mellon University.

Patz, Richard, Brian Junker, Matthew Johnson, and Louis Mariano. 2002. “The hierarchical rater

model for rated test items and its application to large–scale educational assessment data.”

Journal of Educational and Behavioral Statistics 27:341–384.

Pratt, Travis C., and Francis T. Cullen. 2000. “The empirical status of Gottfredson and Hirschi’s

general theory of crime: A meta-analysis.” Criminology 38 (3): 931–64.

Rockoff, Jonah E., and Cecilia Speroni. 2010. “Subjective and objective evaluations of teacher

effectiveness.” American Economic Review 100:261–6.

Shaeffer, Gary, Jackqueline Briel, and Mary Fowles. 2001. “Psychometric evaluation of the new

GRE writing assessment.” ETS Research Report No. RR–01–18, Princeton, NJ.

40

Tate, Richard. 1999. “A cautionary note on IRT–based linking of tests with polytomous items.”

Journal of Educational Measurement 36:336–46.

Taylor, Shelley, and Susan Fiske. 1978. “Salience, attention, and attributions: Top of the head

phenomena.” In Advances in experimental social psychology, edited by L. Berkowitz.

New York: Academic Press.

van der Vleuten, C. P., and David Swanson. 1990. “Assessment of clinical skills with

standardized patients: State of the art.” Teaching and Learning in Medicine 2:58–76.

Vassiliou, Melina, Liane Feldman, Shannon Fraser, Patrick Charlebois, Prosanto Chaudhury,

Donna Stanbridge, and Gerald Fried. 2007. “Evaluating intraoperative laparoscopic skill:

Direct observation versus blinded videotaped performances.” Surgical Innovation 14

(3):211–6.

Vermunt, Jeroen and Jay Magidson. 2005. Technical guide for Latent Gold 4.0: Basic and

advanced. Belmont: Statistical Innovations, Inc.

Vincent, Charles, Magi Young, and Angela Phillips. 1994. “Why do people sue doctors? A study

of patients and relatives taking legal action.” Lancet 343:1609–13.

Vivekananda–Schmidt, Pirashanthie, Martyn Lewis, David Coady, Catherine Morley, Lesley

Kay, David Walker, and Andrew Hassell. 2007. “Exploring the use of videotaped

objective structured clinical examination in the assessment of joint examination skills of

medical students.” Arthritis & Rheumatism 57:869–76.

Psychometric Issues in the Measurement of Non-Cognitive ... · non-cognitive attributes. In particular, when estimates from the psychometric rater models were analyzed with value-added

Documents