Scoring Situational Judgment Tests: Once You Get the Data, Your Troubles Begin

Mindy E. Bergman*, Department of Psychology, Texas A&M University
Fritz Drasgow, Department of Psychology, University of Illinois at Urbana-Champaign
Michelle A. Donovan, Intel Corporation
Jaime B. Henning, Department of Psychology, Texas A&M University
Suzanne E. Juraska, Personnel Decisions Research Institute

*Address for correspondence: Mindy Bergman, Department of Psychology (MC 4235), College Station, TX 77843-4235. E-mail: [email protected]

Although situational judgment tests (SJTs) have been in use for decades, consensus has not been reached on the best way to score these assessments or others (e.g., biodata) whose items do not have a single demonstrably correct answer. The purpose of this paper is to review and to demonstrate the scoring strategies that have been described in the literature. Implementation and relative merits of these strategies are described. Then, several of these methods are applied to create 11 different keys for a video-based SJT in order to demonstrate how to evaluate the quality of keys. Implications of scoring SJTs for theory and practice are discussed.

Situational judgment tests (SJTs) have been in use since the 1920s, but have recently enjoyed a resurgence of attention in the research literature (e.g., Chan & Schmitt, 1997, 2005; Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; Dalessio, 1994; McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; Olson-Buchanan, Drasgow, Moberg, Mead, Keenan, & Donovan, 1998; Smith & McDaniel, 1998; Weekley & Jones, 1997, 1999). McDaniel et al. (2001) recently demonstrated the validity of SJTs for predicting job performance. SJTs have also been found to provide incremental validity beyond more typically used assessments, such as cognitive ability tests, and appear to have less adverse impact (Chan & Schmitt, 2002; Clevenger et al., 2001; Motowidlo, Dunnette, & Carter, 1990; Olson-Buchanan et al., 1998; Weekley & Jones, 1997, 1999).

However, important questions persist. One critical issue is the selection of scoring methods (Ashworth & Joyce, 1994; Desmarais, Masi, Olson, Barbara, & Dyer, 1994; McHenry & Schmitt, 1994). Unlike cognitive ability tests, SJT items often do not have objectively correct answers; many of the response options are plausible. The question is which answer is "best" rather than which is "right." However, there are many ways to determine the best answer, and consensus has not yet been reached as to which method is superior. This paper delineates scoring strategies, discusses their merits, and demonstrates them for one SJT. We hope that this paper stimulates, and serves as an example for, further scoring research.

What are SJTs?

SJTs are a "measurement method that can be used to assess a variety of constructs" (McDaniel et al., 2001, p. 732; McDaniel & Nguyen, 2001), although some constructs might be particularly amenable to SJT measurement (Chan & Schmitt, 2005). Most SJTs measure a constellation of job-related skills and abilities, often based on job analyses (Weekley & Jones, 1997). SJT formats vary, with some using paper-and-pencil tests with written descriptions of situations (Chan & Schmitt, 2002) and others using computerized multimedia scenarios (McHenry & Schmitt, 1994).
However, differences between SJTs and biodata limit the extent to which scoring research can generalize. The prevailing issue in scoring SJTs – which typically contain a handful of items – is how to choose the best response from among the multiple-choice options for each item. In contrast, because biodata measures are often lengthy (sometimes containing several hundred items; Schoenfeldt & Mendoza, 1994), much of the biodata scoring research examines which items should be included in composites and which should be eliminated from the assessment. Thus,
Table 1. Example item from the leadership skills assessment and its scoring across 11 keys

Summary of item stem (video scenario) and response options

At the request of the team leader, Brad has reviewed several quarterly reports. Through discussion, Brad and the team leader come to the conclusion that Steve, another team member, has been making errors in the last several reports and is currently working on the next one. As the leader in this situation, what would you do?

A. At the next team meeting, discuss the report and inform the team that there are a few errors in it. Ask the team members how they want to revise the report.
B. Tell Steve about the errors, then work with him to fix the errors.
C. Tell Steve about the errors and have him go over the report and correct the errors.
D. Tell Steve that you asked Brad to look over this report and that Brad found some errors in it. Ask Steve to work with
[Table fragment; the caption and remaining rows were lost in extraction. Surviving row: 3k. Novice vs. experts: .09, 1.05, .121, 2.25*, .009, 1.18. Notes: Only the incremental additions to the hierarchical regressions are shown. Steps 1 and 2 were the same across all sets of regressions. (a) Degrees of freedom for F tests were: step 1 (1, 121), step 2 (6, 116), step 3 (7, 115). (b) Degrees of freedom for F tests of the change in R² were: step 2 (5, 116), step 3 (1, 115). *p < .05.]
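To make the scoring question concrete, the following minimal sketch (in Python; it is our illustration, not material from the original article) shows two common ways an item like the one in Table 1 can be scored: a dichotomous key that credits only the single "best" option, and an effectiveness-weighted key that assigns graded credit to every option. The keyed option and the weights are invented for illustration; the article's actual 11 keys are not reproduced here.

```python
# Illustrative only: the keyed option and weights below are assumptions,
# not the keys developed in the article.

BEST_OPTION = "B"                                  # assumed "best" response
OPTION_WEIGHTS = {"A": 1, "B": 3, "C": 2, "D": 0}  # assumed effectiveness weights

def score_dichotomous(response: str) -> int:
    """Credit only an exact match with the keyed 'best' option."""
    return int(response == BEST_OPTION)

def score_weighted(response: str) -> int:
    """Give graded credit according to the option's keyed effectiveness."""
    return OPTION_WEIGHTS.get(response, 0)

print(score_dichotomous("C"), score_weighted("C"))  # -> 0 2
```

Under the dichotomous rule, choosing option C earns nothing; under the weighted rule, it earns partial credit. Differences of this kind are one reason different keys built from the same items can yield different validities.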
Further, test administration can affect responses. For example, different instruction sets lead to different responses that, at the key level, are differentially related to criteria (McDaniel & Nguyen, 2001; Ployhart & Ehrhart, 2003). Even instructions that differ in seemingly minor ways, such as "identify the best option" and "identify what you would do" (what Ployhart & Ehrhart, 2003, referred to as "should do" and "would do" instructions), appear to lead to different responses. This suggests that keys and the constructs they represent vary not only due to chosen scoring strategies but also because of the ways that respondents approach the assessment.
Different keys can lead to an SJT assessing different constructs even when it was designed to measure performance in a specific domain, such as our LSA. For the LSA, we keyed three different theoretical constructs: initiating structure, participation, and empowerment. On one hand, these keys all reflect the domain of leadership skills. On the other hand, they refer to different components of the leadership skills domain, and it could be argued that they represent different constructs. For empirical keys, as well as contingency theory keys such as the Vroom (2000) keys, the domain represented by the best answer can differ from question to question.
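To illustrate this point, the sketch below (a hypothetical Python illustration; the item numbers and keyed options are ours, not the LSA's actual keys) scores a single response vector under three construct keys, yielding one score per construct from the same answers.

```python
# Hypothetical keys for a three-item SJT: one keyed option per item per construct.
THEORY_KEYS = {
    "initiating_structure": {1: "C", 2: "A", 3: "D"},
    "participation":        {1: "A", 2: "B", 3: "B"},
    "empowerment":          {1: "B", 2: "D", 3: "A"},
}

def score_under_key(responses: dict, key: dict) -> int:
    """Unit weighting: one point per item where the response matches the key."""
    return sum(int(responses.get(item) == keyed) for item, keyed in key.items())

responses = {1: "B", 2: "A", 3: "B"}
print({name: score_under_key(responses, key) for name, key in THEORY_KEYS.items()})
# The same answers earn a different score under each construct's key.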
Because of the potentially multi-dimensional nature of SJTs, as well as the many constructs that this method can be applied to, it may be difficult to conduct meta-analyses on some of the questions raised here (Hesketh, 1999). To minimize this potential problem, researchers should include in their reports, in addition to standard validity coefficients and effect sizes, descriptions of: (a) the domain(s) that the items of the SJT measure; (b) the scoring methods used; and (c) the instruction set for the SJT. Without this information, it will be impossible for future meta-analytic efforts to reach meaningful conclusions.
Limitations

As with any study, this one has limitations. First, the small sample size makes strong conclusions difficult. Larger samples would allow for greater confidence in the results and might permit additional analyses, such as tests of subgroup differences across ethnic groups. Further, the low power afforded by the small sample size makes it difficult to interpret differences in the validities of the keys. However, because our predictors and criteria were collected from different sources (managers and their supervisors, respectively), some problems common to concurrent validation – such as common method bias – are not at issue here.
Further, we must acknowledge that a different empirical key could emerge with a different or larger sample. The minimum endorsement criterion is, in part, dependent on sample size (i.e., one should not require a minimum endorsement rate that is unachievable in a particular sample). Additionally, a sample from a different population might lead to different results. Large samples from diverse populations should improve the stability and generalizability of empirical keys.
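To show how a minimum endorsement criterion enters empirical keying, here is a minimal Python sketch of one common variant, written under our own assumptions rather than as the article's exact procedure: an option is eligible for keying only if at least min_endorsement respondents chose it, and among eligible options the one whose endorsers have the highest mean criterion score is keyed.

```python
from collections import defaultdict
from statistics import mean

def empirical_key(responses, criterion, min_endorsement=10):
    """Derive an empirical key from data (illustrative variant).

    responses: list of dicts mapping item -> chosen option, one per respondent.
    criterion: parallel list of criterion scores (e.g., supervisor ratings).
    min_endorsement: smallest endorser count for an option to be keyable;
        as noted above, this should be set with the sample size in mind.
    """
    by_option = defaultdict(list)  # (item, option) -> criterion scores of endorsers
    for resp, crit in zip(responses, criterion):
        for item, option in resp.items():
            by_option[(item, option)].append(crit)

    key = {}
    for item in {item for item, _ in by_option}:
        eligible = {opt: scores for (it, opt), scores in by_option.items()
                    if it == item and len(scores) >= min_endorsement}
        if eligible:  # items with no sufficiently endorsed option stay unkeyed
            key[item] = max(eligible, key=lambda opt: mean(eligible[opt]))
    return key
```

Rederiving the key on holdout or resampled subsamples (cf. Mead, 2000) is one way to check its stability before operational use.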
References

Judge, T.A., Bono, J.E., Ilies, R. and Gerhardt, M.W. (2002) Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87, 765–780.
Karas, M. and West, J. (1999) Construct-oriented biodata development for selection to a differentiated performance domain. International Journal of Selection and Assessment, 7, 86–96.
Krukos, K., Meade, A.W., Cantwell, A., Pond, S.B. and Wilson, M.A. (2004) Empirical keying of situational judgment tests: Rationale and some examples. Paper presented at the 19th annual meeting of the Society for Industrial/Organizational Psychology, Chicago, IL.
Legree, P.J., Psotka, J., Tremble, T. and Bourne, D.R. (2005) Using consensus based measurement to assess emotional intelligence. In R. Schulze and R.D. Roberts (Eds), Emotional intelligence: An international handbook (pp. 155–180). Cambridge, MA: Hogrefe and Huber.
Liden, R.C. and Arad, S. (1996) A power perspective of empowerment and work groups: Implications for human resources management research. In G.R. Ferris (Ed.), Research in personnel and human resources management (pp. 205–252). Greenwich, CT: JAI Press.
MacLane, C.N., Barton, M.G., Holloway-Lundy, A.E. and Nickels, B.J. (2001) Keeping score: Expert weights on situational judgment responses. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Mael, F.A. (1991) A conceptual rationale for the domain and attributes of biodata items. Personnel Psychology, 44, 763–792.
McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A. and Braverman, E.P. (2001) Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730–740.
McDaniel, M.A. and Nguyen, N.T. (2001) Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103–113.
McHenry, J.J. and Schmitt, N. (1994) Multimedia testing. In M.J. Rumsey, C.D. Walker and J. Harris (Eds), Personnel selection and classification research (pp. 193–232). Mahwah, NJ: Lawrence Erlbaum Publishers.
Mead, A.D. (2000) Properties of a resampling validation technique for empirically scoring psychological assessments. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
Mitchell, T.W. and Klimoski, R.J. (1982) Is it rational to be empirical? A test of methods for scoring biographical data. Journal of Applied Psychology, 67, 411–418.
Motowidlo, S.J., Dunnette, M.D. and Carter, G.W. (1990) An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640–647.
Mumford, M.D. (1999) Construct validity and background data: Issues, abuses, and future directions. Human Resource Management Review, 9, 117–145.
Mumford, M.D. and Owens, W.A. (1987) Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11, 1–31.
Mumford, M.D. and Stokes, G.S. (1992) Developmental determinants of individual action: Theory and practice in applying background measures. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, 2nd Edn (pp. 61–138). Palo Alto: Consulting Psychologists Press.
Mumford, M.D. and Whetzel, D.L. (1997) Background data. In D. Whetzel and G. Wheaton (Eds), Applied measurement methods in industrial psychology (pp. 207–239). Palo Alto: Davies-Black Publishing.
Nickels, B.J. (1994) The nature of biodata. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 1–16). Palo Alto: Consulting Psychologists Press.
Olson-Buchanan, J.B., Drasgow, F., Moberg, P.J., Mead, A.D., Keenan, P.A. and Donovan, M.A. (1998) An interactive video assessment of conflict resolution skills. Personnel Psychology, 51, 1–24.
Paullin, C. and Hanson, M.A. (2001) Comparing the validity of rationally-derived and empirically-derived scoring keys for a situational judgment inventory. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Ployhart, R.E. and Ehrhart, M.G. (2003) Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1–16.
Schoenfeldt, L.F. (1999) From dust bowl empiricism to rational constructs in biographical data. Human Resource Management Review, 9, 147–167.
Schoenfeldt, L.F. and Mendoza, J.L. (1994) Developing and using factorially derived biographical scales. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 147–169). Palo Alto: Consulting Psychologists Press.
Smith, K.C. and McDaniel, M.A. (1998) Criterion and construct validity evidence for a situational judgment measure. Poster presented at the 13th Annual Meeting of the Society for Industrial and Organizational Psychology, Dallas, TX, April.
Stokes, G.S. and Searcy, C.A. (1999) Specification of scales in biodata form development: Rational vs. empirical and global vs. specific. International Journal of Selection and Assessment, 7, 72–85.
Such, M.J. and Hemingway, M.A. (2003) Examining the usefulness of empirical keying in the cross-cultural implementation of a biodata inventory. Paper presented in F. Drasgow (Chair), Resampling and other advances in empirical keying. Symposium conducted at the 18th annual conference of the Society for Industrial and Organizational Psychology.
Such, M.J. and Schmidt, D.B. (2004) Examining the effectiveness of empirical keying: A cross-cultural perspective. Paper presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Vroom, V.H. (2000) Leadership and the decision-making process. Organizational Dynamics, 28, 82–94.
Vroom, V.H. and Jago, A.G. (1978) On the validity of the Vroom–Yetton model. Journal of Applied Psychology, 63, 151–162.
Vroom, V.H. and Yetton, P.W. (1973) Leadership and decision making. Pittsburgh: University of Pittsburgh Press.
Weekley, J.A. and Jones, C. (1997) Video-based situational testing. Personnel Psychology, 50, 25–49.
Weekley, J.A. and Jones, C. (1999) Further studies of situational tests. Personnel Psychology, 52, 679–700.
Wonderlic Personnel Test Inc. (1992) User's manual for the Wonderlic Personnel Test and the Scholastic Level Exam. Libertyville, IL: Wonderlic Personnel Test Inc.