Scoring Situational Judgment Tests: Once You Get the Data, Your Troubles Begin

Mindy E. Bergman*, Department of Psychology, Texas A&M University
Fritz Drasgow, Department of Psychology, University of Illinois at Urbana-Champaign
Michelle A. Donovan, Intel Corporation
Jaime B. Henning, Department of Psychology, Texas A&M University
Suzanne E. Juraska, Personnel Decisions Research Institute

*Address for correspondence: Mindy Bergman, Department of Psychology (MC 4235), College Station, TX 77843-4235. E-mail: [email protected]

Although situational judgment tests (SJTs) have been in use for decades, consensus has not been reached on the best way to score these assessments or others (e.g., biodata) whose items do not have a single demonstrably correct answer. The purpose of this paper is to review and to demonstrate the scoring strategies that have been described in the literature. Implementation and relative merits of these strategies are described. Then, several of these methods are applied to create 11 different keys for a video-based SJT in order to demonstrate how to evaluate the quality of keys. Implications of scoring SJTs for theory and practice are discussed.

Situational judgment tests (SJTs) have been in use since the 1920s, but have recently enjoyed a resurgence in attention in the research literature (e.g., Chan & Schmitt, 1997, 2005; Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; Dalessio, 1994; McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; Olson-Buchanan, Drasgow, Moberg, Mead, Keenan, & Donovan, 1998; Smith & McDaniel, 1998; Weekley & Jones, 1997, 1999). McDaniel et al. (2001) recently showed the validity of SJTs in predicting job performance. SJTs have also been found to provide incremental validity beyond more typically used assessments, such as cognitive ability tests, and appear to have less adverse impact (Chan & Schmitt, 2002; Clevenger et al., 2001; Motowidlo, Dunnette, & Carter, 1990; Olson-Buchanan et al., 1998; Weekley & Jones, 1997, 1999).

However, important questions still persist. One critical issue is the selection of scoring methods (Ashworth & Joyce, 1994; Desmarais, Masi, Olson, Barbara, & Dyer, 1994; McHenry & Schmitt, 1994). Unlike cognitive ability tests, SJT items often do not have objectively correct answers; many of the response options are plausible. It is a question of which answer is "best" rather than which is "right." However, there are many ways to determine the best answer and consensus has not yet been reached as to which method is superior. This paper delineates scoring strategies, discusses their merits, and demonstrates them for one SJT. We hope that this paper stimulates and serves as an example for further scoring research.

What are SJTs?

SJTs are a "measurement method that can be used to assess a variety of constructs" (McDaniel et al., 2001, p. 732; McDaniel & Nguyen, 2001), although some constructs might be particularly amenable to SJT measurement (Chan & Schmitt, 2005). Most SJTs measure a constellation of job-related skills and abilities, often based on job analyses (Weekley & Jones, 1997). SJT formats vary, with some using paper-and-pencil tests with written descriptions of situations (Chan & Schmitt, 2002) and others using computerized multimedia scenarios (McHenry & Schmitt, 1994; Olson-Buchanan et al., 1998; Weekley & Jones, 1997). SJT response options also vary. Some SJTs propose solutions to problems, to which respondents rate their agreement (Chan & Schmitt, 2002). Others offer multiple solutions from which respondents choose the best and/or worst option (Motowidlo et al., 1990; Olson-Buchanan et al., 1998).

This study examines scoring issues within the context of one SJT, the Leadership Skills Assessment (LSA). The development of this video-based, computer-delivered SJT is described in the "Methods." Briefly, the LSA contains 21 items depicting situations in a leader-led group. Respondents indicate which of four multiple-choice options they would choose if they were the leader. Response options vary in degree of participation in decision-making. For the LSA, "item" refers to a video scenario (i.e., item stem) and its four response options. Similar to Motowidlo et al.'s (1990) procedure, correct options are scored as +1, incorrect options as −1, and all other options as 0. A "key" is the set of +1, −1, and 0 values assigned to the options of every item on a test; multiple keys can be created for a single test through different scoring approaches. A sample item from the LSA, with a summary of the video item stem, appears in Table 1, along with the scoring of that item for each key described in the following section.
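As a concrete illustration of this scoring scheme, the short Python sketch below converts one respondent's choices into a total score under a trichotomous key; the key values and responses shown are hypothetical, not the actual LSA key.

# Minimal sketch (not the actual LSA key): scoring one respondent with a
# trichotomous +1 / -1 / 0 key. key[item][option] holds the keyed value;
# responses[item] holds the option the respondent chose.
key = {
    1: {"A": 0, "B": 1, "C": 0, "D": -1},
    2: {"A": 1, "B": 0, "C": -1, "D": 0},
    3: {"A": 0, "B": 0, "C": 0, "D": 0},   # an item left entirely unscored
}
responses = {1: "B", 2: "C", 3: "A"}

total_score = sum(key[item][choice] for item, choice in responses.items())
print(total_score)  # 1 + (-1) + 0 = 0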

Scoring Strategies

Some SJT scoring issues parallel those in the biodata literature, where consensus on optimal scoring methods also has not yet been reached. Similar to SJTs, biodata measures – which ask individuals to report past behaviors and experiences – often do not have demonstrably or objectively correct answers (Mael, 1991; Mumford & Owens, 1987; Mumford & Stokes, 1992; Nickels, 1994). However, differences between SJTs and biodata limit the extent to which scoring research can generalize. The prevailing issue in scoring SJTs – which typically contain a handful of items – is how to choose the best response from among the multiple-choice options for each item. In contrast, because biodata measures are often lengthy (sometimes containing several hundred items; Schoenfeldt & Mendoza, 1994), much of the biodata scoring research examines which items should be included in composites and which should be eliminated from the assessment. Thus, although there are similarities among the scoring issues in SJTs and biodata, there are also important differences between these methods and their scoring needs.

Table 1. Example item from the leadership skills assessment and its scoring across 11 keys

Summary of item stem (video scenario): At the request of the team leader, Brad has reviewed several quarterly reports. Through discussion, Brad and the team leader come to the conclusion that Steve, another team member, has been making errors in the last several reports and is currently working on the next one. As the leader in this situation, what would you do?

A. At the next team meeting discuss the report and inform the team that there are a few errors in it. Ask the team members how they want to revise the report
B. Tell Steve about the errors, then work with him to fix the errors
C. Tell Steve about the errors and have him go over the report and correct the errors
D. Tell Steve that you asked Brad to look over this report and that Brad found some errors in it. Ask Steve to work with Brad to correct the report

                              Score for option
Key                            A    B    C    D
Empirical                      0    1    0    0
Initiating structure          −1    1    0    0
Participation                  0   −1    0    1
Empowerment                    1   −1    0    0
Hybrid initiating structure   −1    1    0    0
Hybrid participation           0    0    0    1
Hybrid empowerment             1    0    0    0
Vroom time-based               0    0    1    1
Vroom developmental            0    1    0    0
Subject matter experts         0    1    0   −1
Novice vs. experts             0    0    1    0

Note: Entries in the table reflect that options are correct (1), incorrect (−1), or unscored (0) for that particular key.

Three categories of scoring methods have been described in the SJT literature: (1) empirical, (2) theoretical, and (3) expert-based. Four scoring methods appear in the biodata literature: (1) empirical, (2) rational, (3) factorial, and (4) subgrouping. To these lists we add another method, which we call hybrid keying. Empirical keying is common across SJTs and biodata. Elements of the biodata literature's rational keying appear in the SJT literature's theoretical keying and expert-based keying. Thus, there are six distinct strategies for scoring.

Empirical Scoring

In empirical approaches, items or options are scored according to their relationships with a criterion measure (Hogan, 1994). Although many specific methods exist (Devlin, Abrahams, & Edwards, 1992; Legree, Psotka, Tremble, & Bourne, 2005; Mumford & Owens, 1987), all share several processes, including choosing a criterion, developing decision rules, weighting items, and cross-validating results. Empirical keys often have high validity coefficients compared with other methods (Hogan, 1994; Mumford & Owens, 1987), but also have a number of challenges, including dependency on criterion quality (Campbell, 1990a, b), questionable stability and generalizability (Mumford & Owens, 1987), and capitalization on chance (Cureton, 1950).

In effect, empirical scoring relies on item responses and criterion scores of a sample. For assessments with non-continuous multiple-choice options, empirical keys are constructed by first creating dummy variables for each option of each item, with 1 indicating that the option was chosen. Positive (negative) dummy variable–criterion correlations occur when options are chosen frequently by individuals with high (low) criterion scores; zero correlations indicate an option is unrelated to the criterion. The result is a trichotomous scoring key. It should be noted that some empirical procedures weight response options in other ways (Devlin et al., 1992; England, 1961).

Although mathematically straightforward, decision rules must be created in order to implement empirical keying based on option–criterion correlations. First, a minimum base rate of option endorsement is required because spurious option–criterion correlations can occur if only a handful of individuals endorse an option. Second, a minimum correlation (in absolute value) must be set to score an option as correct (positive correlation) or incorrect (negative correlation). Options are scored when both scoring standards are met. For some items, all options are scored as 0. This may be because of low base rates for some options, near-zero correlations between options and the criterion, or both. Such items are particularly vexing because it is not clear whether the underlying construct is unrelated to the criterion, the item was poorly written, or there were too few test-takers who had the ability to discern the correct answer (i.e., the correct option did not meet the minimum endorsement criterion).
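A minimal sketch of this kind of empirical keying is given below (Python with NumPy); the endorsement and correlation thresholds are exposed as parameters, and the array layout is an assumption of the sketch rather than the study's actual implementation.

import numpy as np

def empirical_key(choices, criterion, min_endorse=25, min_r=0.10, n_options=None):
    # choices: (n_people, n_items) array of chosen option indices (0-based).
    # criterion: (n_people,) array of criterion scores.
    # An option is keyed +1 (-1) when at least `min_endorse` people chose it
    # and its dummy-variable correlation with the criterion is >= min_r
    # (<= -min_r); otherwise it is keyed 0.
    n_people, n_items = choices.shape
    if n_options is None:
        n_options = int(choices.max()) + 1
    key = np.zeros((n_items, n_options), dtype=int)
    for i in range(n_items):
        for o in range(n_options):
            dummy = (choices[:, i] == o).astype(float)  # 1 = chose option o
            if dummy.sum() < min_endorse or dummy.std() == 0:
                continue  # base rate too low (or no variance) to score
            r = np.corrcoef(dummy, criterion)[0, 1]
            if r >= min_r:
                key[i, o] = 1
            elif r <= -min_r:
                key[i, o] = -1
    return key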

For the LSA, our criteria were: (a) an option endorsement rate of at least 25 respondents (20.3% of our eligible sample) and (b) an option–criterion correlation of at least .10 in absolute value. The entire sample was used to derive the empirical key, inflating its validity coefficient due to capitalization on chance. We used N-fold cross-validation to address this issue (Breiman, Friedman, Olshen, & Stone, 1984; Mead, 2000). N-fold cross-validation holds out the responses of person j and computes an empirical key based on the remaining N − 1 persons, which is used to score person j. Person j is then returned to the sample, person j + 1 is removed, and an empirical key is created on the N − 1 remaining individuals. This procedure is repeated N times so that every respondent has a score free from capitalization on chance. Correlating these holdout scores with the criterion measure provides a validity estimate that does not capitalize on chance. N-fold cross-validation is superior to traditional half-sample approaches because it keys in subsamples of N − 1 (rather than N/2) and uses all N individuals for cross-validation (rather than the other N/2). Note that although N samples of N − 1 people are used to estimate validity, "empirical key" refers to the key obtained from all N individuals.
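The following sketch illustrates the N-fold (leave-one-out) logic, reusing the empirical_key function from the previous sketch; the data layout is again an assumption of the example, not the authors' code.

import numpy as np

def score_with_key(choices, key):
    # Sum, over items, of the keyed value of each person's chosen option.
    items = np.arange(choices.shape[1])
    return np.array([key[items, person_choices].sum() for person_choices in choices])

def n_fold_validity(choices, criterion, **key_kwargs):
    # Leave-one-out ("N-fold") estimate: person j is scored with a key built
    # from the other N - 1 people, then the held-out scores are correlated
    # with the criterion.
    n_people = choices.shape[0]
    n_options = int(choices.max()) + 1
    holdout = np.empty(n_people)
    for j in range(n_people):
        keep = np.arange(n_people) != j
        key_j = empirical_key(choices[keep], criterion[keep],
                              n_options=n_options, **key_kwargs)
        holdout[j] = score_with_key(choices[j:j + 1], key_j)[0]
    return np.corrcoef(holdout, criterion)[0, 1]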

Theoretical Scoring

Similar to biodata's rational method (Hough & Paullin, 1994), SJT keys can reflect theory. Items and options can be constructed to reflect theory, or theory can be used to identify the best and worst options in a completed test. Options reflecting (contradicting) the theory are scored as correct (incorrect); items or options that are irrelevant or unrelated to the theory are scored as zero. Theoretical methods address major criticisms of empirical keying, such as its atheoretical nature, and theoretical keys may be more likely to generalize. However, theoretical keys might be more susceptible to faking due to their greater transparency (Hough & Paullin, 1994). Further, the theory might be flawed or fundamentally incorrect.

The LSA's options reflect three graduated levels of delegation of decision-making to the team. Keys for each of these leadership styles were developed, as were keys based on Vroom's (Vroom & Jago, 1978; Vroom & Yetton, 1973) contingency model.

Theoretical Keys Based on Levels of Delegation

Three scoring keys were created by treating one particular delegation style as the most effective across situations. For empowerment, the group makes decisions; for participative decision making, the leader delegates the decision, or a part thereof, to the group (e.g., the leader allows the group to decide, or the leader facilitates group discussion but still holds veto power); and for initiating structure, the leader retains the decision-making authority. In this last style, the leader might gather information from subordinates, but the subordinates have no voice in the decision that is made (Liden & Arad, 1996). The empowerment key coded empowering responses as correct and initiating structure options as incorrect. This was reversed for the initiating structure key.¹ For both of these keys, all other options were scored as zero. For the participative key, participative responses were scored +1 and initiating structure options −1; other options were scored as zero, including empowerment options, because empowerment allows for participation, but not the best kind if participative management is the best leadership style.
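A sketch of how such delegation-based keys could be generated from a per-option classification of styles appears below; the style labels assigned to options in the example are hypothetical.

# Sketch of the delegation-style keys described above. The per-option style
# labels are hypothetical; in the LSA each item has one option per style plus
# a distractor.
def delegation_key(option_styles, correct_style, incorrect_style):
    # option_styles[item][option] -> "empowerment", "participative",
    # "initiating", or "distractor". Returns a +1/-1/0 value per option.
    key = {}
    for item, styles in option_styles.items():
        key[item] = {opt: (1 if s == correct_style else
                           -1 if s == incorrect_style else 0)
                     for opt, s in styles.items()}
    return key

option_styles = {1: {"A": "participative", "B": "initiating",
                     "C": "distractor", "D": "empowerment"}}   # hypothetical
empowerment_key = delegation_key(option_styles, "empowerment", "initiating")
participative_key = delegation_key(option_styles, "participative", "initiating")
initiating_key = delegation_key(option_styles, "initiating", "empowerment")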

Vroom’s Decision Model Keys

The model proposed by Vroom and colleagues (Vroom, 2000; Vroom & Jago, 1978; Vroom & Yetton, 1973) incorporates the critical notion that leader behavior should be situation-dependent. The model describes several factors (e.g., decision significance, team competence) that should influence leaders' choices among five decision strategies ranging from AI (autocratic decision by the leader) to GII (group-based decision-making). Vroom (2000) delineated two forms of the model. In the first, time-related factors are emphasized; in the second, group member development is emphasized. Although there is considerable overlap, the two models sometimes make different recommendations. Thus, both a time-based key and a development-based key were created.

Each LSA scenario was rated on the factors in Vroom's (2000) model. Raters were 15 students (five males, 10 females; nine Caucasians, six Asian/Pacific Islanders; mean age 26.4 years) at a large Midwestern university with substantial backgrounds in psychology (mean of 20.67 psychology classes). Raters averaged 5.6 years in the workforce; 40% had managerial experience (M = 2.5 years). Mean ratings on the factors were used to navigate the Vroom (2000) decision trees to identify the best decision strategy(-ies) for each situation. Independently, the third author classified each of the LSA's response options into Vroom's five decision strategy categories. Options that matched Vroom's recommended strategy were scored as correct; other options were scored as zero.

Hybridized Scoring

Combining different scoring methods could potentially increase predictive power (Mumford, 1999; Olson-Buchanan et al., 1998). Hybrid scoring combines two independently generated keys. Two keys could, for example, be added at the option level, allowing a positive score on one key to cancel out a negative score on the other. Another hybridization approach is substitution for zeroes: one key is designated as the primary key and the other as secondary; the hybrid key is initially assigned the keying of the primary key, and the secondary key is used to replace only zero scores in the primary key. Keys can also be differentially weighted, such that one key is used with the full scores and the other is fractionally weighted. Hybrid keying can be implemented in a straightforward way in any standard statistical software package.
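The sketch below illustrates these three ways of combining two keys, with each key represented as an array of option values; the fractional weight shown is illustrative only, not a value used in the study.

import numpy as np

def additive_hybrid(key_a, key_b):
    # Option-level sum: a +1 on one key can cancel a -1 on the other.
    return key_a + key_b

def substitution_hybrid(primary, secondary):
    # Keep the primary key and use the secondary key only where the primary
    # key scores an option as zero.
    return np.where(primary == 0, secondary, primary)

def weighted_hybrid(key_a, key_b, weight_b=0.5):
    # Full weight on one key, a fractional weight on the other
    # (weight_b = 0.5 is an illustrative choice).
    return key_a + weight_b * key_b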

Hybridizing an empirical key and a theoretical key resolves some concerns about each because it both recognizes theory and relies less on pure empiricism. It is partially based on data, so there is an opportunity to remedy flaws in theory. Empirical–theoretical hybrids can also shed light on whether an option is "unscored" or "scored as zero." However, problems attendant in the original keys' approaches transfer to hybrid keys. Further, the choice of which keys to hybridize is not simple, as there are theoretical (e.g., is combining these keys theoretically justified?) and practical (e.g., is it possible to implement this key without cross-validation?) issues to consider.

Here, three additive hybrid keys (initiating structure, participation, and empowerment hybridized with empirical) were created; N-fold cross-validation was used to estimate validity.

Expert-Based Scoring

Expert scoring creates keys based on the responses of individuals with substantial knowledge about the topic. Decision rules must be implemented in order to identify consensus around the appropriate answer(s). Two expert-based keys were created for the LSA.

Subject Matter Experts

The most common way to develop expert-based keys is to ask subject matter experts (SMEs) to make judgments about the items. In this sense, SME keying is similar to rational keying in biodata, in that experts make judgments about options' relevance to the criterion. SMEs examine each item and its options to identify the best and worst choices, which are scored as correct or incorrect, respectively. All other options receive a score of zero.

For the LSA, 15 SMEs were recruited from a graduate-level I/O psychology course at a large Midwestern university. Respondents were well versed in leadership theory through their coursework and reported a mean level of 20.67 courses in psychology throughout their academic careers. SMEs individually watched the LSA and identified which options they believed were best or worst. We required that at least one-third of the SMEs select an option as best or as worst and that there also be a five-response (i.e., one-third of the sample) differential between the number of endorsements as best vs. worst before an option could be scored. This allowed for multiple correct or incorrect answers per item; however, some items had all options scored as zero.
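The sketch below expresses these consensus rules for a single item; the endorsement counts in the example are hypothetical, and the thresholds are exposed as parameters.

def sme_key_for_item(best_counts, worst_counts, n_smes=15, min_frac=1/3, min_diff=5):
    # Score one item's options from SME "best"/"worst" endorsement counts.
    # An option is +1 when at least min_frac of SMEs chose it as best and the
    # best-minus-worst differential is at least min_diff; -1 is the mirror
    # rule for worst; everything else is 0.
    key = {}
    for opt in best_counts:
        b, w = best_counts[opt], worst_counts.get(opt, 0)
        if b >= min_frac * n_smes and (b - w) >= min_diff:
            key[opt] = 1
        elif w >= min_frac * n_smes and (w - b) >= min_diff:
            key[opt] = -1
        else:
            key[opt] = 0
    return key

# Hypothetical counts from 15 SMEs for one item:
print(sme_key_for_item({"A": 2, "B": 9, "C": 1, "D": 3},
                       {"A": 3, "B": 0, "C": 2, "D": 10}))
# -> {'A': 0, 'B': 1, 'C': 0, 'D': -1}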

Comparison of Novices vs. Experts

Another expert scoring approach contrasts experts' and novices' responses. Instead of asking for judgments about the best and worst options, experts and novices simply complete the assessment, choosing the best option. Options chosen frequently by experts are scored as correct regardless of novices' endorsement rate; options that non-experts choose frequently – but experts do not – are scored as incorrect. Following this procedure, 128 introductory psychology students and 20 masters students in labor and industrial relations at a large Midwestern university completed the LSA to fulfill a course requirement. By item, if an option was picked by at least one-third of the masters group, it was scored as correct. Where the introductory psychology students' most frequently endorsed option differed from the masters students' choice and was endorsed by at least one-third of them, the option was coded as incorrect. All other options were scored as zero.
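A comparable sketch for the novice-versus-expert rule is shown below; the one-third thresholds follow the description above, and the count data it expects are hypothetical.

def novice_expert_key_for_item(expert_counts, novice_counts, min_frac=1/3):
    # expert_counts / novice_counts: dicts mapping option -> number of people
    # in each group who chose that option as best.
    n_exp = sum(expert_counts.values())
    n_nov = sum(novice_counts.values())
    key = {opt: 0 for opt in expert_counts}

    # Options chosen by at least one-third of the experts are correct.
    expert_picks = [o for o, c in expert_counts.items() if c >= min_frac * n_exp]
    for opt in expert_picks:
        key[opt] = 1

    # The novices' modal option is incorrect if it differs from the experts'
    # choice(s) and at least one-third of the novices endorsed it.
    novice_modal = max(novice_counts, key=novice_counts.get)
    if novice_modal not in expert_picks and novice_counts[novice_modal] >= min_frac * n_nov:
        key[novice_modal] = -1
    return key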

Factorial Scoring

The factorial approach forms construct-laden scales based on factor analysis and item correlations (Hough & Paullin, 1994). Factorial approaches are used when specific, construct-based scales are not specified a priori and items are not assumed to measure particular constructs. This procedure is useful when theory does not define the relevant constructs, yet the item pool could produce meaningful dimensions. It can also be used to winnow item pools, as items that do not belong to a readily identifiable factor can be dropped. We did not implement this method.

Subgrouping

Infrequently used, subgrouping attempts to identify naturally occurring groups by sorting individuals based on similar patterns of responding to biodata items (Devlin et al., 1992; Hein & Wesley, 1994). Subgrouping might capture individuals' self-models, which guide their behavior in a range of situations (Mumford & Whetzel, 1997). We did not use this method for the LSA.

Evaluating Scoring Keys’ Relative Utility

In biodata research, although results have been mixed, empirical keys seem to have the greatest predictive validity, whereas rational keys allow for greater understanding of theory (Karas & West, 1999; Mitchell & Klimoski, 1982; Schoenfeldt, 1999; Such & Hemingway, 2003; Such & Schmidt, 2004; see Stokes & Searcy, 1999, for an exception). Empirical keys display greater shrinkage in cross-validation than other methods, such as rational scoring, but the cross-validity coefficients of empirical keys are still often higher than rational keys' (Mitchell & Klimoski, 1982; Schoenfeldt, 1999). Few studies comparing scoring methods for one SJT have been conducted. Paullin and Hanson (2001) compared one rational and several empirical keys for a leadership SJT in two U.S. military samples. There were few differences in the predictive validity of the various keys for supervisory ratings or promotion rate; cross-validities for the keys did not differ. Other studies have found similar results (Krukos, Meade, Cantwell, Pond, & Wilson, 2004; MacLane, Barton, Holloway-Lundy, & Nickels, 2001; Such & Schmidt, 2004).

In sum, few studies have examined the relative effectiveness of various scoring methods. Further, the criteria for "effectiveness" of a key have not been clearly established; some studies focus on understanding the constructs underlying a test, others emphasize prediction, and yet others stress incremental validity. We believe that, like any test development process, an SJT key's usefulness is determined with several kinds of information. First, different keys yield different scores and, therefore, different correlations with the criterion. Obviously, high criterion-related validity is one desideratum for a key. Keys will also be differentially related to other predictors such as cognitive ability or personality; therefore, keys' incremental validities can vary. Thus, high incremental validity is a second standard for a key. It is also important to consider adverse impact. Keys from one test might vary in the amount of subgroup differences; minimizing these differences is the third standard. The fourth standard for a key is construct validity. The key should produce scores that correlate with other measures that the SJT should correlate with (i.e., convergent validity), but do not correlate with measures that the SJT should not correlate with (i.e., discriminant validity; Campbell & Fiske, 1959). Ultimately, the key should produce scores that indeed measure the construct that the test developer intended. Note that in the past, developers of SJTs may have paid less attention to the construct validity of their measures than the developers of other psychological instruments. Nonetheless, we believe it is important for SJT researchers to begin to articulate and investigate theoretical frameworks for their measures.

In this study, we first evaluated the validities of the LSA's keys. Next, we examined the keys' incremental validities in predicting leadership and overall job performance above and beyond the effects of personality and cognitive ability, which are related to leadership performance (Borman, White, Pulakos, & Oppler, 1991; Campbell, Dunnette, Lawler & Weick, 1970; Judge, Bono, Ilies, & Gerhart, 2002). Given the ubiquity of these types of assessments in selection, as well as their ready availability, it seems important to demonstrate SJTs' added value beyond these more common assessments. Third, we examined subgroup differences across sex (sample sizes for ethnic minority groups were too small for meaningful analysis). Finally, we investigated one aspect of construct validity by examining the keys' correlations with measures of cognitive ability and personality. LSA keys, as measures of leadership skill, should not be redundant with cognitive ability or personality; correlations with these traits should be modest.

Method

Participants and Procedure

Participants were 181 non-academic supervisors from a large Midwestern university who managed other individuals, not technical functions (e.g., computer system administrators); their departments included building services, housing, grants and contracts, and student affairs. All completed the LSA and other measures, except the criterion measures, which were completed by the participants' supervisors. Because not all criterion ratings forms were returned, the eligible sample size for this study was 123. Of the eligible sample, 67.5% were female. The majority were Caucasian (91.9%), with African Americans (2.4%), Asians (1.6%), and Hispanics (1.6%) completing the sample (2.4% chose "other" or did not respond). The median age category was 45–49 years; most participants (80.5%) were older than 40. The sample was highly educated, with 53.7% holding graduate or professional degrees; 21.1% held a college degree, with or without some graduate training; 15.4% had completed some college; 4.1% had a high school diploma with additional technical training; and 5.7% had a high school diploma or GED only.

The modal tenure category of the sample was greater than 16 years (41.5%), followed by 7–10 (26.8%) and 11–15 years (23.6%). Few reported three or less (1.6%) or 4–6 (6.5%) years of tenure. Most reported supervising small numbers of employees, with 1–3 (37.4%), 4–6 (24.4%), or 7–10 (8.9%) subordinates. However, larger numbers of supervisees were not uncommon, with 11–15 (7.3%), 16–20 (7.3%), or 21 or more (14.6%) subordinates. Participants missing data did not differ from the effective sample on any demographic measures.

Measures

Leadership Skills Assessment. The LSA is a 21-item computerized multimedia assessment of leadership skills. Items are set in either an office (N = 10) or a manufacturing (N = 11) environment. Each environment has its own four-person team, including a leader. Leaders appear in all scenes; other team members do not. Scenes were filmed with the leader's back to the respondent to emphasize the leader's point of view. Respondents indicated which of the four response options best describes what they would do if they were the leader in that situation.

Test development followed a method similar to Olson-Buchanan et al. (1998) and occurred with employees at a building products distribution company. These employees were in leader-led teams that were encouraged to participate in decisions and were empowered to make some decisions. Using critical incidents techniques (Flanagan, 1954), we conducted individual and group interviews with employees and leaders (N = 43). Common and critical leadership situations were identified from these interviews and then summarized in descriptive paragraphs.

To generate realistic options, 53 supervisors were presented with the descriptions and asked how they would respond. Situations producing little response variability were discarded. For situations that yielded variability, responses were clustered into styles based on Liden and Arad's (1996) modified version of Vroom's (2000; Vroom & Jago, 1978; Vroom & Yetton, 1973) model, which includes three graduated levels of delegation: the decision is (a) made entirely by the group (empowerment); (b) delegated from the leader to the group (participative); or (c) controlled entirely by the leader (initiating structure). Short descriptions of each response cluster were written and became candidate multiple-choice options for the scenarios.

The situations and their candidate multiple-choice options were presented to employees (N = 36) who were asked to identify the option that they thought: (1) was the best way of responding; and (2) others would most likely select as the best way of responding (i.e., the socially desirable response). Situations with little variability for either question across candidate multiple-choice options were discarded. For each scene, one option was selected for each level of delegation; the fourth option was a distractor (e.g., the leader ignores the problem).

Using details from the interview sessions that began this test development process, scripts were written to expand on the paragraph summaries of the item stems. After revision for clarity and conversational tone, scripts were filmed at a large Midwestern university using local actors.

The Wonderlic Personnel Test (WPT). Cognitive ability was assessed by the computerized version of the WPT (for validity evidence, see Dodrill, 1983; Dodrill & Warner, 1988; Wonderlic Personnel Test Inc., 1992). Test–retest reliabilities for the WPT range from .82 to .94 (Dodrill, 1983; Wonderlic Personnel Test Inc., 1992).

Sixteen Personality Factor Questionnaire (16PF). Personality was assessed with the computerized 16PF, Fifth Edition (Cattell, Cattell, & Cattell, 1993), which contains 185 items that can be mapped onto 16 primary factors and/or five global factors. We used the global factors: extraversion, independence (likened to the Big 5 agreeableness dimension), anxiety (neuroticism), self-control (conscientiousness), and tough-mindedness (openness). Reliabilities are: extraversion (0.90), anxiety (0.90), self-control (0.88), independence (0.81), and tough-mindedness (0.84) (S. Bedwell, personal communication, January 17, 2005).

Criterion Measures. A six-item scale, Empowering Leadership, was based on Arnold, Arad, Rhoades, and Drasgow's (2000) research on the six dimensions of empowerment. It assessed the main criterion variable: leadership performance. Supervisors of the participants were contacted and asked to indicate how often the participants engaged in empowering leadership behaviors (1 = never; 7 = always). Coefficient alpha was .95. This measure was used to create the empirical key.

Supervisors were asked to rate the overall performance of participants (1 = Very poor; 7 = Very good) using three items, yielding the Overall Job Performance scale. Coefficient alpha was .96.

Results

Descriptive statistics and correlations for all keys and other measures appear in Table 2.

Validity

The keys provided a wide range of validity coefficients (−.03 to .32; see Table 2) and generally had stronger relationships with Empowering Leadership than with the Overall Job Performance scale. The empirical, SME, and hybrid participation keys all showed significant if moderate correlations with leadership ratings. No other keys were significantly related to the criteria.

Incremental Validity Analysis

The contribution of each of the keys to the prediction of leadership performance was tested through a series of planned hierarchical regression analyses. Each key had its own set of regressions. For every key, the hierarchical regressions proceeded by first regressing leadership ratings onto the WPT. Next, the five personality scales were entered as a block. Then, each key was entered at the last step in each hierarchical regression. Steps 1 and 2 and their results are the same for each set of regressions, as the keys were entered only in the last step of each set.
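The sketch below shows one way to compute the change in R² and its F test for this kind of hierarchical regression; the variable names in the usage comment are placeholders, not the study's data files.

import numpy as np

def r_squared(y, X):
    # R^2 from an OLS regression of y on X (an intercept column is added).
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def delta_r2_test(y, X_reduced, X_full):
    # F test for the increment in R^2 when predictors are added to a model.
    r2_r, r2_f = r_squared(y, X_reduced), r_squared(y, X_full)
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1] - 1
    f_stat = ((r2_f - r2_r) / df_num) / ((1.0 - r2_f) / df_den)
    return r2_f - r2_r, f_stat

# Usage pattern for one key (placeholder variable names):
#   step1 = wpt.reshape(-1, 1)
#   step2 = np.column_stack([step1, personality])   # + five 16PF globals
#   step3 = np.column_stack([step2, key_scores])    # + the SJT key
#   dr2, f = delta_r2_test(leadership, step2, step3)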

The results (Table 3) show that the WPT significantly predicted leadership ratings, accounting for approximately 10% of the variance in leadership ratings. Personality factors were not significantly related to the leadership criterion. Incremental validity of each of the 11 different keys entered in the third steps in the 11 sets of hierarchical regressions was assessed by comparing the change in R² between Step 2, which included both the WPT and the 16PF, and Step 3, the final equation that also included an SJT key. The results indicate that four keys (empirical, 2.3%; hybrid initiating structure, 3.0%; hybrid participation, 1.7%; and SME, 4.9%) accounted for significant additional variance in leadership ratings.

Subgroup Differences by Sex

Table 4 contains the results of sex differences analyses. Job performance ratings were nearly identical across sex. Only two predictors (extraversion, tough-mindedness) exhibited significant mean differences across sex and produced medium effect sizes. Although some keys had larger cross-sex effect sizes than others, examination of the effect sizes indicates that the differences across sex were generally small (.3 or less) for all keys. Building on the results described in the previous section, additional regression analyses were conducted that included the main effect for sex and the sex-by-key interactions. The inclusion of these variables assesses the across-sex equality of intercepts and slopes, respectively (Cleary, 1968). None of these parameters were significant or produced even a moderate effect size. Thus, none of the keys seem likely to produce practically significant levels of adverse impact vis-à-vis sex.
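A reduced sketch of such a slope-and-intercept (Cleary, 1968) check for a single key is given below; it omits the cognitive ability and personality covariates that the study's full regressions included.

import numpy as np

def cleary_check(criterion, key_score, sex):
    # sex: 0/1 indicator. Returns OLS weights for intercept, key score, sex,
    # and the key-by-sex product. A nonzero sex weight suggests intercept
    # differences across groups; a nonzero product weight suggests slope
    # differences.
    X = np.column_stack([np.ones(len(criterion)), key_score, sex, key_score * sex])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    return dict(zip(["intercept", "key", "sex", "key_x_sex"], beta))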

Discriminant Validity

Table 2 contains correlations of the various keys with the WPT and the personality measures. No correlation exceeds .32 in absolute value, suggesting that the LSA is not overly redundant with cognitive ability or personality.

Best Keys for the LSA

Based on test validation standards, we can make recommendations regarding the best keys for the LSA in this sample. The empirical, SME, and hybrid participation keys all predicted the leadership criterion. These three keys, as well as the hybrid initiating structure key, provided significant incremental validity over cognitive ability and personality measures. None of the keys showed subgroup differences by sex and all of the keys showed discriminant validity. Therefore, for users of the LSA, we would recommend the empirical, SME, and hybrid participation keys. However, it is important to note that these results must be replicated in other samples before it can be determined whether these particular keys are the best across samples, settings, and uses.

Discussion

Our analyses demonstrated various approaches to creating and evaluating SJT keys. The wide variability² in validity coefficients across the 11 keys emphasizes the importance of the choice of scoring method. Although some keys had statistically significant validity coefficients, many more did not. Obviously, with only one SJT and one sample, we cannot reach abiding conclusions about the goodness of scoring approaches for all SJTs, or the boundary conditions under which some approaches might be better than others. What we can conclude is that the validity of an SJT depends in part on its scoring, and that poor choices could lead to the conclusion that the SJT's content is not valid when it may only be the scoring key that is not valid.

Table 2. Means, standard deviations, and correlations of variables

Variable                          Mean (SD)      0 items(a)    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
1. Wonderlic                      27.31 (5.77)       –
2. Extraversion                    5.13 (1.89)       –        −02
3. Anxiety                         4.55 (1.85)       –        −04  −30
4. Tough-mindedness                5.39 (2.05)       –        −16  −33   12
5. Independence                    5.20 (1.69)       –        −04   34  −14  −26
6. Self-control                    5.81 (1.47)       –        −21  −28  −04   46  −22
7. Empirical                       3.30 (2.18)      12         32  −05  −06  −10   09  −18
8. Initiating structure           −1.09 (4.67)       0        −15  −12   06   13  −12   03  −33
9. Participation                   −.38 (2.97)       0         17   25  −16  −24   13  −11   32  −68
10. Empowerment                    1.09 (4.67)       0         15   12  −06  −13   12  −03   33 −100   68
11. Hybrid initiating structure     .64 (3.98)       1         00  −13   06   09  −11  −06   11   89  −55  −89
12. Hybrid participation           1.21 (3.26)       0         23   21  −14  −28   15  −23   64  −64   89   64  −35
13. Hybrid empowerment             3.72 (4.98)       0         23   09  −06  −17   13  −11   59  −93   69   93  −69   78
14. Vroom time-based               8.89 (1.66)       4         21  −15   08   10  −01   02   09   37  −29  −37   45  −23  −31
15. Vroom developmental            7.45 (1.48)       6         22  −03   03  −03   05  −02   39  −40   21   40  −21   33   50   27
16. Subject matter experts         8.67 (2.30)       3         32   10  −11  −14  −04  −12   51  −32   15   32  −10   31   43  −20   14
17. Novice vs. experts             9.20 (2.54)       0         09  −03  −09  −06   01  −02   04   21  −33  −21   23  −22  −19   22   39   12
18. Leadership rating             34.85 (5.01)       –         32   07  −01  −08   00  −15   25  −03   04   03   17   22   17   07   13   32   12
19. Overall performance rating    18.59 (2.94)       –         26  −11   12   01  −05  −01   15  −03  −01   03   11   10   11   11   11   22   04   72

Notes: Decimal points omitted in correlations. Entries with an absolute value of .19 or greater are significant at p < .05. (a) "0 items" refers to the number of items in the leadership skills assessment (LSA) key that are unscored. A dash indicates that this column is not relevant to the particular scale.

Our recommendation to carefully follow standard validation procedures is not surprising. However, doing so may be especially important for SJTs. Although a key may be criterion-related, it might add little value once cognitive ability and personality measures (which are widely available and relatively inexpensive) are accounted for. Because some SJTs – especially multimedia, computerized SJTs – are costly to construct and, importantly, to administer, it is not enough to know whether an SJT predicts a criterion; it must also provide incremental value. Further, because of the difficulty in determining "correct" answers, organizations facing legal challenges to their SJT use will need to be able to explain not only why the SJT's content is job relevant but also why a particular scoring strategy was used. Careful validation should both minimize legal challenges and help organizations survive those that do arise.

The challenging aspects of scoring are likely to increase exponentially as the breadth of the SJT increases. Although empirical keying could proceed in the same general fashion regardless of the breadth of the SJT (assuming that the criterion was not deficient), other scoring strategies would likely become more complicated. For every content subset in an SJT, there will be different sets of theories to apply for theoretical keying and different SMEs to query for expert keying. The various keys for the subtests could be combined in more or less optimal ways, such that the best key for one subtest depends in part on the key for another subtest. Hybrid scoring systems could then be applied to the various keys, carrying over these same concerns. To complicate matters, non-linear scoring (e.g., Breiman et al.'s (1984) classification and regression tree analysis) might lead to the highest validity. In short, as the breadth increases for an SJT, scoring can become more complex.

These issues speak to the importance of test development. A well-constructed SJT is one that would reflect clear content domains, rather than contain a hodgepodge of items. As difficult as scoring SJTs becomes as the breadth of the test increases, it would be even more complicated if specific content areas cannot be identified. Without clear content domains, there is little guidance as to where test constructors should look for theories to determine the scoring key or how SMEs should think about the meanings of the items. Thus, although broader SJTs are likely to have more scoring difficulties than narrower ones, some of these problems can be ameliorated if the SJT is carefully constructed to reflect rational – if not theoretical – content domains.

Table 3. Hierarchical regressions

Step  Variables entered              b       t      R²     F(a)    ΔR²   F for ΔR²(b)
1.    Wonderlic                     .32   13.09*   .101   13.61*    –        –
2.    Extraversion                  .08     .76    .112    2.44*   .011     .29
      Anxiety                       .01     .14
      Toughmindedness               .01     .13
      Independence                 −.03    −.26
      Self-control                 −.07    −.68
3a.   Empirical                     .17    1.77    .136    2.58*   .023    3.06*
3b.   Initiating structure          .03     .33    .113    2.09*   .001     .13
3c.   Participation                −.03    −.35    .113    2.09*   .001     .13
3d.   Empowerment                  −.03    −.33    .113    2.09*   .001     .13
3e.   Hybrid initiating structure   .18    2.01*   .142    2.72*   .030    4.02*
3f.   Hybrid participation          .14    1.51    .129    2.44*   .017    2.24*
3g.   Hybrid empowerment            .10    1.14    .122    2.28*   .010    1.31
3h.   Time-based Vroom              .02     .25    .113    2.08    .001     .13
3i.   Development-based Vroom       .07     .75    .116    2.17    .004     .52
3j.   SMEs                          .24    2.60*   .161    3.16*   .049    6.72*
3k.   Novice vs. experts            .09    1.05    .121    2.25*   .009    1.18

Notes: Only the incremental additions to the hierarchical regressions are shown. Steps 1 and 2 were the same across all sets of regressions. (a) Degrees of freedom for F tests were: step 1 (1, 121), step 2 (6, 116), step 3 (7, 115). (b) Degrees of freedom for F tests of the change in R² were: step 2 (5, 116), step 3 (1, 115). *p < .05.

Further, test administration can affect responses. For example, different instruction sets lead to different responses that, at the key level, are differentially related to criteria (McDaniel & Nguyen, 2001; Ployhart & Ehrhart, 2003). Even instructions that differ in seemingly minor ways, such as "identify the best option" and "identify what you would do" (what Ployhart & Ehrhart, 2003, referred to as "should do" and "would do" instructions), appear to lead to different responses. This suggests that keys and the constructs they represent vary not only due to chosen scoring strategies but also because of the ways that the respondents approach the assessment.

Different keys can lead to an SJT assessing different constructs even when it was designed to measure performance in a specific domain, such as our LSA. For the LSA, we keyed three different theoretical constructs: initiating structure, participation, and empowerment. On one hand, these keys all reflect the domain of leadership skills. On the other hand, these keys refer to different components of the leadership skills domain and it could be argued that they represent different constructs. For empirical keys, as well as contingency theory keys such as the Vroom (2000) keys, different domains could be the best answer for different questions.

Because of the potentially multi-dimensional nature of SJTs as well as the many constructs that this method can be applied to, it may be difficult to conduct meta-analyses on some of the questions raised here (Hesketh, 1999). To minimize this potential problem, researchers should include in their reports, in addition to standard validity coefficients and effect sizes, descriptions of: (a) the domain(s) that the items of the SJT measure; (b) the scoring methods used; and (c) the instruction set for the SJT. Without this information, it will be impossible for future meta-analytic efforts to reach any meaningful conclusions.

Limitations

As with any study, this one has limitations. First, the small sample size makes strong conclusions difficult. Larger samples would allow for greater confidence in the results. A larger sample might permit additional analyses, such as subgroup differences across ethnic groups. Further, the low power afforded by the small sample size makes it difficult to interpret differences in the validities of the keys. However, because our predictors and criteria were collected from different sources (managers and their supervisors, respectively), some problems common to concurrent validation – such as common method bias – are not at issue here.

Further, we must acknowledge that a different empirical key could emerge with a different or larger sample. The minimum endorsement criterion is, in part, dependent on the sample size (i.e., one should not require a minimum endorsement rate that is unachievable in a particular sample). Additionally, a sample from a different population might lead to different results. Large samples from diverse populations should improve the stability and generalizability of keys.

Table 4. Subgroup differences across sex

Variable                       Mean (SD), male   Mean (SD), female      t       d
Wonderlic Personnel Test        27.73 (5.91)      27.11 (5.73)         .55     .11
Extraversion                     4.50 (1.77)       5.43 (1.88)       −2.63*   −.51
Anxiety                          4.49 (2.02)       4.59 (1.77)        −.27    −.05
Tough-mindedness                 6.30 (2.05)       4.96 (1.91)        3.59*    .69
Independence                     5.20 (1.48)       5.20 (1.79)        −.02    −.003
Self-control                     6.11 (1.47)       5.67 (1.46)        1.55     .30
Empirical                        3.70 (2.16)       3.11 (2.18)        1.42     .27
Initiating structure            −1.33 (4.68)       −.98 (4.70)        −.39    −.07
Participation                    −.40 (3.02)       −.37 (2.96)        −.05    −.01
Empowerment                      1.33 (4.68)        .98 (4.70)         .39     .07
Hybrid initiating structure       .78 (4.53)        .58 (3.71)         .26     .05
Hybrid participation             1.10 (3.25)       1.27 (3.28)        −.26    −.05
Hybrid empowerment               4.03 (4.56)       3.58 (5.20)         .46     .09
Time-based Vroom                 9.18 (1.68)       8.76 (1.64)        1.31     .25
Development-based Vroom          7.53 (1.40)       7.41 (1.52)         .40     .08
SMEs                             8.30 (1.52)       8.86 (2.58)       −1.26    −.24
Novice vs. expert                9.08 (2.69)       9.27 (2.48)        −.39    −.07
Leadership ratings              34.73 (5.01)      34.90 (5.04)         .58     .11
Overall performance ratings     18.60 (2.45)      18.59 (3.17)         .10     .02

Note: N = 40 male, 83 female; degrees of freedom = 121. *p < .05.

Most notably, we cannot draw a firm conclusion – based on this single sample in this single organization at this single time using a single SJT – about which scoring method is best. There are likely to be boundary conditions on the best scoring method, based on test content, test instructions, response options, and the like, that will aid in the determination of the best scoring method for this SJT and other assessments used in the future. However, this paper provides a useful guide to both (a) the methods that are currently available to create SJT keys and (b) the ways to evaluate the relative effectiveness of keys.

Practical Issues in SJT Scoring

One important issue in all scoring is that there are other keys that could be constructed. Although there is not an infinite number of keys for an assessment, the possible permutations of the pattern of scoring as correct, incorrect, and zero across the number of options and items is of a very large magnitude for a test of any reasonable length; a non-trivial number of these mathematically possible keys are likely to make some rational or theoretical sense. Further, different criteria, approaches, and minimum scoring requirements could lead to a multitude of other empirical keys than the ones constructed. In short, there are many ways to create a key within each general scoring strategy. How one chooses keying systems should depend on the test's intended use, theory (not just for theoretical keys, but also to determine which scoring strategies are most useful), and practical considerations.

Potential Effects of Studying Keys on Broader Theory

One potential application of studying keys could be providing support for theories about the content domain of assessments. Support for a theory would be found when empirical and theoretical keys overlap greatly in their best option identification. For example, the LSA could be used to provide support for a theory of leadership, such as Vroom's (2000; Vroom & Jago, 1978; Vroom & Yetton, 1973). The extent to which the empirical key – which, by design, is related to leadership performance – and the theoretical key identify the same best and worst options would indicate support for the theory.

The utility of this approach hinges on three issues. First, the criterion measure must be reasonably construct valid so that an effective and appropriate empirical key is created. Second, the theoretical key must be developed carefully so that it accurately reflects the theory. Finally, "unscored" items on the empirical key must be minimized through the use of a large sample. This is necessary so that there are enough opportunities to evaluate the congruence of the empirical and theoretical keys. Additionally, the unscored options on the empirical key must be examined so that it is clear whether they are unscored due to low correlations or low endorsement rates. Options that have low correlations and meet the minimum endorsement rate on the empirical key are more informative about the option and its possible relation to theory than options that are not scored because they have not met the minimum endorsement criterion. It may be useful to think of the first case as a "score of zero" and the second as "unscored," because in the second case it is unclear what the score would be if the minimum endorsement requirement were met. As noted in the introduction, there are many reasons why an option is scored as zero; some of these reasons are mitigated when the minimum endorsement criterion is met.
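One simple way to quantify such overlap is sketched below; the agreement index and the example keys are illustrative, not a statistic reported in this paper.

def key_agreement(key_a, key_b):
    # Proportion of options scored (nonzero) by BOTH keys on which the two
    # keys give the same sign -- one rough empirical-theoretical congruence
    # index over the jointly scored options.
    both = [(a, b) for a, b in zip(key_a, key_b) if a != 0 and b != 0]
    if not both:
        return None   # no jointly scored options to compare
    return sum(a == b for a, b in both) / len(both)

# Hypothetical flattened keys (one entry per option across all items):
print(key_agreement([1, -1, 0, 0, 1, 0], [1, 0, 0, -1, -1, 0]))   # 0.5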

Conclusion

We described and illustrated a process for determining the best key(s) from among many possible keys. Keys should be assessed for validity, incremental validity, adverse impact, and construct validity as described in this paper. From these analyses, the best key(s) can be identified. Although this validation strategy seems basic, studies in the SJT literature have rarely addressed the potential differential validity of the multiple keys available for a given test. As demonstrated here, it is essential that researchers critically evaluate their SJT keying choices.

As we noted at the start of this paper, the major purpose of this paper is to stimulate research on the topic of keying in SJTs. We have reviewed the six general approaches to scoring that have been examined or discussed in the SJT or biodata scoring literatures to date, and we have demonstrated four of them. Other scoring strategies might be developed in the future, which will expand the possible repertoire of scoring methodologies. Our goal is to encourage SJT developers and researchers to investigate and implement multiple scoring methods in their research and to publish the various results of these keys. Ideally, in 10 years' time, we would be able to revisit this topic to conduct a meta-analysis on scoring strategies in order to assess which approach is best.

Notes

1. Although the empowerment and initiating structure keys are perfectly negatively correlated, both are described because they were used in hybrid scoring; the hybrid keys were not perfectly negatively correlated. For ease of comparison, both the empowerment and the initiating structure keys are presented here and included in the analyses.

2. We must acknowledge that, because of our small sample size, sampling variability and error could also contribute to the variability of validity coefficients across the keys.


References

Arnold, J.A., Arad, S., Rhoades, J.A. and Drasgow, F. (2000) The empowering leadership questionnaire: The construction of a new scale for measuring leader behaviors. Journal of Organizational Behavior, 21, 249–269.
Ashworth, S.D. and Joyce, T.M. (1994) Developing score protocols for a computerized multimedia in-basket exercise. Paper presented at the Ninth Annual Conference of the Society for Industrial and Organizational Psychology, Nashville, TN, April.
Borman, W.C., White, L.A., Pulakos, E.D. and Oppler, S.H. (1991) Models of supervisory job performance ratings. Journal of Applied Psychology, 76, 863–872.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and regression trees. Belmont, CA: Wadsworth.
Campbell, D.T. and Fiske, D.W. (1959) Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Campbell, J.P. (1990a) Modeling the performance prediction problem in industrial and organizational psychology. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, Vol. 1 (pp. 687–732). Palo Alto: Consulting Psychologists Press.
Campbell, J.P. (1990b) The role of theory in industrial and organizational psychology. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, Vol. 1 (pp. 39–73). Palo Alto: Consulting Psychologists Press.
Campbell, J.P., Dunnette, M.D., Lawler, E.E. III and Weick, K.E. (1970) Managerial behavior, performance, and effectiveness. New York: McGraw-Hill.
Cattell, R.B., Cattell, A.K. and Cattell, H.E. (1993) Sixteen personality factor questionnaire (5th Edn). Champaign, IL: Institute for Personality and Ability Testing Inc.
Chan, D. and Schmitt, N. (1997) Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159.
Chan, D. and Schmitt, N. (2002) Situational judgment and job performance. Human Performance, 15, 233–254.
Chan, D. and Schmitt, N. (2005) Situational judgment tests. In A. Evers, N. Anderson and O. Voskuijil (Eds), Handbook of personnel selection (pp. 219–246). Oxford: Blackwell.
Cleary, T.A. (1968) Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Clevenger, J., Pereira, G.M., Wiechmann, D., Schmitt, N. and Harvey, V.S. (2001) Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410–417.
Cureton, E.E. (1950) Validity, reliability, and baloney. Educational and Psychological Measurement, 10, 94–96.
Dalessio, A.T. (1994) Predicting insurance agent turnover using a video-based situational judgment test. Journal of Business and Psychology, 9, 23–32.
Desmarais, L.B., Masi, D.L., Olson, M.J., Barbara, K.M. and Dyer, P.J. (1994) Scoring a multimedia situational judgment test: IBM's experience. Paper presented at the Ninth Annual Conference of the Society for Industrial and Organizational Psychology, Nashville, TN, April.
Devlin, S.E., Abrahams, N.M. and Edwards, J.E. (1992) Empirical keying of biographical data: Cross-validity as a function of scaling procedure and sample size. Military Psychology, 4, 119–136.
Dodrill, C.B. (1983) Long term reliability of the Wonderlic Personnel Test. Journal of Consulting and Clinical Psychology, 51, 316–317.
Dodrill, C.B. and Warner, M.H. (1988) Further studies of the Wonderlic Personnel Test as a brief measure of intelligence. Journal of Consulting and Clinical Psychology, 56, 145–147.
England, G.W. (1961) Development and use of weighted application blanks. Dubuque: Brown.
Flanagan, J.C. (1954) The critical incident technique. Psychological Bulletin, 51, 327–358.
Hein, M. and Wesley, S. (1994) Scaling biodata through subgrouping. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 171–196). Palo Alto: Consulting Psychologists Press.
Hesketh, B. (1999) Introduction to the International Journal of Selection and Assessment special issue on biodata. International Journal of Selection and Assessment, 7, 55–56.
Hogan, J.B. (1994) Empirical keying of background data measures. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 69–107). Palo Alto: Consulting Psychologists Press.
Hough, L. and Paullin, C. (1994) Construct-oriented scale construction: The rational approach. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 109–145). Palo Alto: Consulting Psychologists Press.
Judge, T.A., Bono, J.E., Ilies, R. and Gerhart, M.W. (2002) Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87, 765–780.
Karas, M. and West, J. (1999) Construct-oriented biodata development for selection to a differentiated performance domain. International Journal of Selection and Assessment, 7, 86–96.
Krukos, K., Meade, A.W., Cantwell, A., Pond, S.B. and Wilson, M.A. (2004) Empirical keying of situational judgment tests: Rationale and some examples. Paper presented at the 19th Annual Meeting of the Society for Industrial/Organizational Psychology, Chicago, IL.
Legree, P.J., Psotka, J., Tremble, T. and Bourne, D.R. (2005) Using consensus based measurement to assess emotional intelligence. In R. Schulze and R.D. Roberts (Eds), Emotional intelligence: An international handbook (pp. 155–180). Cambridge, MA: Hogrefe and Huber.
Liden, R.C. and Arad, S. (1996) A power perspective of empowerment and work groups: Implications for human resources management research. In G.R. Ferris (Ed.), Research in personnel and human resources management (pp. 205–252). Greenwich, CT: JAI Press.
MacLane, C.N., Barton, M.G., Holloway-Lundy, A.E. and Nickels, B.J. (2001) Keeping score: Expert weights on situational judgment responses. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Mael, F.A. (1991) A conceptual rationale for the domain and attributes of biodata items. Personnel Psychology, 44, 763–792.
McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A. and Braverman, E.P. (2001) Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 60–79.
McDaniel, M.A. and Nguyen, N.T. (2001) Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103–113.
McHenry, J.J. and Schmitt, N. (1994) Multimedia testing. In M.J. Rumsey, C.D. Walker and J. Harris (Eds), Personnel selection and classification research (pp. 193–232). Mahwah, NJ: Lawrence Erlbaum Publishers.


Mead, A.D. (2000) Properties of a resampling validation technique for empirically scoring psychological assessments. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
Mitchell, T.W. and Klimoski, R.J. (1982) Is it rational to be empirical? A test of methods for scoring biographical data. Journal of Applied Psychology, 67, 411–418.
Motowidlo, S.J., Dunnette, M.D. and Carter, G.W. (1990) An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640–647.
Mumford, M.D. (1999) Construct validity and background data: Issues, abuses, and future directions. Human Resource Management Review, 9, 117–145.
Mumford, M.D. and Owens, W.A. (1987) Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11, 1–31.
Mumford, M.D. and Stokes, G.S. (1992) Developmental determinants of individual action: Theory and practice in applying background measures. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, 2nd Edn (pp. 61–138). Palo Alto: Consulting Psychologists Press.
Mumford, M.D. and Whetzel, D.L. (1997) Background data. In D. Whetzel and G. Wheaton (Eds), Applied measurement methods in industrial psychology (pp. 207–239). Palo Alto: Davies-Black Publishing.
Nickels, B.J. (1994) The nature of biodata. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 1–16). Palo Alto: Consulting Psychologists Press.
Olson-Buchanan, J.B., Drasgow, F., Moberg, P.J., Mead, A.D., Keenan, P.A. and Donovan, M.A. (1998) An interactive video assessment of conflict resolution skills. Personnel Psychology, 51, 1–24.
Paullin, C. and Hanson, M.A. (2001) Comparing the validity of rationally-derived and empirically-derived scoring keys for a situational judgment inventory. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Ployhart, R.E. and Ehrhart, M.G. (2003) Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1–16.
Schoenfeldt, L.F. (1999) From dust bowl empiricism to rational constructs in biographical data. Human Resource Management Review, 9, 147–167.
Schoenfeldt, L.F. and Mendoza, J.L. (1994) Developing and using factorially derived biographical scales. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 147–169). Palo Alto: Consulting Psychologists Press.
Smith, K.C. and McDaniel, M.A. (1998) Criterion and construct validity evidence for a situational judgment measure. Poster presented at the 13th Annual Meeting of the Society for Industrial and Organizational Psychology, Dallas, TX, April.
Stokes, G.S. and Searcy, C.A. (1999) Specification of scales in biodata form development: Rational vs. empirical and global vs. specific. International Journal of Selection and Assessment, 7, 72–85.
Such, M.J. and Hemingway, M.A. (2003) Examining the usefulness of empirical keying in the cross-cultural implementation of a biodata inventory. Paper presented in F. Drasgow (Chair), Resampling and Other Advances in Empirical Keying. Symposium conducted at the 18th Annual Conference of the Society for Industrial and Organizational Psychology.
Such, M.J. and Schmidt, D.B. (2004) Examining the effectiveness of empirical keying: A cross-cultural perspective. Paper presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Vroom, V.H. (2000) Leadership and the decision-making process. Organizational Dynamics, 28, 82–94.
Vroom, V.H. and Jago, A.G. (1978) On the validity of the Vroom–Yetton model. Journal of Applied Psychology, 63, 151–162.
Vroom, V.H. and Yetton, P.W. (1973) Leadership and decision making. Pittsburgh: University of Pittsburgh Press.
Weekley, J.A. and Jones, C. (1997) Video-based situational testing. Personnel Psychology, 50, 25–49.
Weekley, J.A. and Jones, C. (1999) Further studies of situational tests. Personnel Psychology, 52, 679–700.
Wonderlic Personnel Test Inc. (1992) User's manual for the Wonderlic Personnel Test and the Scholastic Level Exam. Libertyville, IL: Wonderlic Personnel Test Inc.
