Choosing Reliable Items: An objective multi-model approach to applied psychometrics Warren Lambert Peabody College & Vanderbilt Kennedy Center Measurement 2.0 Conference, Salt Lake City, February 2009


Choosing Reliable Items: An objective multi-model approach to applied psychometrics

Warren Lambert, Peabody College & Vanderbilt Kennedy Center

Measurement 2.0 Conference, Salt Lake City, February 2009


Len Bickman’s Peabody Treatment Progress Battery

To give away a suite of tools to evaluate client progress in counseling, we had to develop 17 “new” tests. This required a formal systematic approach.

http://peabody.vanderbilt.edu/Microsites/Center/Center_for_Evaluation_and_Program_Improvement_(CEPI)/The_Peabody_Treatment_Progress_Battery_(PTPB).xml


Statistical Approach: Imperfect Complementary Models

NETFLIX Winners, “We found it was important to utilize a variety of models that complement the shortcomings of each other. . . .

Lessons Learned. . . the best predictive performance came from combining complementary models.”

CHASING $1,000,000: HOW WE WON THE NETFLIX PROGRESS PRIZE

Robert Bell, Yehuda Koren, and Chris Volinsky, AT&T Labs – Research

VOLUME 18, NO 2, DEC. 2007


How to Identify Reliable Items

Classical test theory (enough for one-shot ad hoc indices)

• Floors or ceilings restrict variance
• Look at a PCA
• To increase Cronbach’s alpha, avoid low item-total correlations
• Guesstimate test length with Spearman-Brown formula

Factor analysis (confirmatory if at all possible)

• See how well a 1-factor confirmatory model fits
• Factorial “validity”: does the factor structure fit theory?

Rasch (IRT) modeling

• Pick items that fit a carefully considered measurement model
• Consider item difficulties more deeply
• Pick items suited to the intended task

The methods above run from informal (top) to formal (bottom).


Classical Test Theory (CTT)

Tools of CTT

Basic description of items & their correlations

Cronbach’s alpha, internal-consistency reliability

Corrected item-total correlations

Principal components (PCA)

Spearman-Brown test length estimation

CTT is good to do routinely with index scores

OK for informal test development, e.g., a one-shot ad hoc index

Insufficient for tests that will be published for wide use
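
The Spearman-Brown test length estimation listed above can be sketched in a few lines. This is a minimal illustration of the standard prophecy formula, not code from the talk; the function names and the example reliabilities (0.70 and 0.90) are assumptions chosen for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """Length multiplier needed to move from the current reliability to a target."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


# Hypothetical example: a short index with reliability 0.70.
doubled = spearman_brown(0.70, 2.0)
factor = length_factor_needed(0.70, 0.90)
print(f"doubling the test: predicted reliability {doubled:.3f}")
print(f"to reach 0.90: lengthen the test by a factor of {factor:.2f}")
```

The formula assumes the added items are comparable in quality to the existing ones, which is why the slide calls the result a guesstimate.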


Note Floors or Ceilings

The “Too Short” IQ Test (TS-IQ)

Low mean, SD, variance all indicate floors or ceilings, but outrageous kurtosis is easy to see.


Acorn 10 item scale and 3 item index


Retain Flagged Estimates of Item Quality

“Too Short IQ Test” (TS-IQ)

Variable   Mean   Kurtosis
Item01     0.06      11.16
Item02     0.22      -0.11
Item03     0.35      -1.61
Item04     0.39      -1.82
Item05     0.45      -1.96
Item06     0.49      -2.01
Item07     0.54      -1.99
Item08     0.58      -1.91
Item09     0.77      -0.29
Item10     0.86       2.60
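
Why kurtosis flags floors and ceilings can be seen from the theoretical value for a right/wrong item. The sketch below assumes the TS-IQ items are scored 0/1 (the means between 0 and 1 suggest this, but the slides do not say so) and uses the population formula for a Bernoulli variable. The values in the table are sample estimates, so they will not match exactly; the flag threshold of 2.0 is an arbitrary illustration, not a rule from the talk.

```python
def bernoulli_excess_kurtosis(p: float) -> float:
    """Population excess kurtosis of a 0/1 item whose proportion correct is p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    variance = p * (1.0 - p)
    return (1.0 - 6.0 * variance) / variance


# Item means copied from the TS-IQ table above.
item_means = {"Item01": 0.06, "Item05": 0.45, "Item06": 0.49, "Item10": 0.86}
for name, mean in item_means.items():
    k = bernoulli_excess_kurtosis(mean)
    flag = "possible floor/ceiling" if k > 2.0 else "ok"
    print(f"{name}: mean={mean:.2f}, theoretical excess kurtosis={k:.2f} ({flag})")
```

Items answered correctly by about half the sample sit near the minimum of -2, while items near 0% or 100% correct have large positive kurtosis.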


Raw Scree Plots for Comparison

Compare Result to Random Shadow

• Simple principal components (Pearson, 1901)
• 10,000 PCAs on random numbers
• Same size data set
• Half page R code
• Visually distinguish chance effects
• Falls short of a confirmatory factor analysis
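
The comparison of observed eigenvalues with a random shadow can be sketched as follows. This is a Python illustration, not the R code mentioned on the slide; the function names, the 95th-percentile cutoff, the 500 simulations, and the simulated one-factor data set are all assumptions made for the example.

```python
import numpy as np


def observed_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]


def random_shadow(n_rows, n_items, n_sims=1000, quantile=0.95, seed=0):
    """Per-component quantile of eigenvalues from PCAs of same-size random data."""
    rng = np.random.default_rng(seed)
    eigs = np.empty((n_sims, n_items))
    for s in range(n_sims):
        noise = rng.standard_normal((n_rows, n_items))
        eigs[s] = observed_eigenvalues(noise)
    return np.quantile(eigs, quantile, axis=0)


# Hypothetical data: 500 respondents, 10 items sharing a single factor.
rng = np.random.default_rng(1)
factor = rng.standard_normal((500, 1))
data = 0.7 * factor + 0.7 * rng.standard_normal((500, 10))

observed = observed_eigenvalues(data)
shadow = random_shadow(n_rows=500, n_items=10, n_sims=500)
for k, (obs, rnd) in enumerate(zip(observed, shadow), start=1):
    marker = "above chance" if obs > rnd else "within chance"
    print(f"component {k}: observed {obs:.2f}, random shadow {rnd:.2f} ({marker})")
```

Plotting `observed` and `shadow` on the same scree plot gives the visual comparison the slide describes. Raising `n_sims` to 10,000 matches the slide's count at the cost of run time.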

Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.


“Too Short IQ” Spreadsheet

• Items with something in common contribute to a reliable total score
• Cronbach’s alpha internal consistency reliability
• Reliability increases with high item-total correlations
• Reliability increases with test length

Item     Mean   Kurtosis   r(Item-Total)
Item01   0.06      11.16            0.30
Item02   0.22      -0.11            0.53
Item03   0.35      -1.61            0.53
Item04   0.39      -1.82            0.43
Item05   0.45      -1.96            0.53
Item06   0.49      -2.01            0.60
Item07   0.54      -1.99            0.59
Item08   0.58      -1.91            0.36
Item09   0.77      -0.29            0.49
Item10   0.86       2.60            0.39
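
Cronbach’s alpha and corrected item-total correlations like those in the table can be computed directly from an item response matrix. The sketch below uses the standard formulas; the function names and the simulated 0/1 data are assumptions for illustration and do not reproduce the TS-IQ data.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


def corrected_item_total(items):
    """Correlation of each item with the total of the remaining items."""
    x = np.asarray(items, dtype=float)
    total = x.sum(axis=1)
    return np.array(
        [np.corrcoef(x[:, j], total - x[:, j])[0, 1] for j in range(x.shape[1])]
    )


# Hypothetical data: 400 respondents, 10 right/wrong items of varying difficulty.
rng = np.random.default_rng(0)
ability = rng.standard_normal((400, 1))
difficulty = np.linspace(-2.0, 2.0, 10)
prob_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
items = (rng.random((400, 10)) < prob_correct).astype(int)

print("alpha:", round(cronbach_alpha(items), 2))
print("corrected item-total r:", np.round(corrected_item_total(items), 2))
```

The correlation is "corrected" because each item is removed from the total before correlating, so an item does not inflate its own r.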


How to Identify Reliable Items 2

Classical test theory (enough for one-shot ad hoc indices)

• Floors or ceilings restrict variance
• Look at a PCA
• To increase Cronbach’s alpha, avoid low item-total correlations
• Guesstimate test length with Spearman-Brown formula

Factor analysis (confirmatory if at all possible)

• See how well a 1-factor confirmatory model fits
• Factorial “validity”: does the factor structure fit theory?

Rasch (IRT) modeling

• Pick items that fit a carefully considered measurement model
• Consider item difficulties more deeply
• Pick items suited to the intended task

The methods above run from informal (top) to formal (bottom).


Confirmatory Factor Analysis (CFA)

See how well (ha! how badly!) the data fit a theory-driven model (factorial “validity”)

Theory: TS-IQ measures g, a single dimension of intelligence.

Evaluate the fit of a single factor measurement model

CFA is popular in psychology but seldom done in non-psychiatric medicine (exception: quality-of-life indices, which have extensive psychometric analysis using all current methods)


“Too Short IQ” SAS CFA of single-factor measurement model

RMSEA < .05, CFI > 0.95 or 0.96 (high standards of model fit)

So far, most VU tests early in development fail to meet the high standards for measurement model fit.

SAS PROC CALIS, old-fashioned but (more or less) usable
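
For readers checking the fit standards quoted above, RMSEA and CFI can be computed from the chi-square statistics that CFA software reports. This is a sketch using commonly cited formulas, not output from PROC CALIS. Some programs use N rather than N - 1 in the RMSEA denominator, and every number in the example is hypothetical.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 version)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model: float, df_model: int, chi2_baseline: float, df_baseline: int) -> float:
    """Comparative fit index relative to the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model, 0.0)
    return 1.0 if d_baseline == 0.0 else 1.0 - d_model / d_baseline


# Hypothetical one-factor model for 10 items: 55 moments - 20 parameters = 35 df.
# The baseline model for 10 items has 45 df. N = 501 respondents.
fit_rmsea = rmsea(chi2=70.0, df=35, n=501)
fit_cfi = cfi(chi2_model=70.0, df_model=35, chi2_baseline=1045.0, df_baseline=45)
print(f"RMSEA = {fit_rmsea:.3f} (standard: < .05)")
print(f"CFI   = {fit_cfi:.3f} (standard: > 0.95 or 0.96)")
```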


Rasch or IRT Model

IRT, Item Response Theory

Rasch: One parameter logistic IRT model

Good for practical test development (converges)

Multi-parameter Item Response Theory (IRT)

2-3 parameter models (discrimination, guessing)

For measurement research

Software, e.g. R, MPLUS, Parscale, Bilog-MG, user-written procs

P = probability of getting item i “right”

Theta = person’s ability

b = item’s difficulty on the same scale as theta
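
In the symbols defined above, the one-parameter (Rasch) model and the multi-parameter extension are usually written as follows. The subscript i for items and the symbols a_i (discrimination) and c_i (guessing) are standard notation assumed here, not taken from the slide.

```latex
% Rasch / one-parameter logistic model (1PLM)
P(\text{item } i \text{ right} \mid \theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}

% Three-parameter logistic model (3PLM); setting c_i = 0 gives the 2PLM,
% and additionally fixing a_i = 1 recovers the Rasch model
P(\text{item } i \text{ right} \mid \theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
```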


Rasch Model

• “Measure score” for person and item in same units

• If your measure = item’s measure, p(right) = 50%

• If you’re better than the item, p(right) > 50%

• 1-parameter logistic model (1PLM)

As (Person – Item) increases, prob (correct) increases in logistic model.


Rasch (1960/1980) model

Simple 1PLM, can use conventional total score or table lookup

Parallel logistic curves for items

Good for practical test construction (WINSTEPS)

Software in development > 20 years

IRT 2PLM, 3PLM may be better for certain kinds of measurement research

Rasch, G. (1960/1980). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press.

Statistician, Danish student of RA Fisher

Rasch Model: TS-IQ Items Cover a Range of Difficulties


“Too Short IQ” Items Information Spread Across Whole Range

Easy items, like #10, are most informative about low scoring individuals

Hard items, like #1, are most informative about high scoring individuals.

This test’s items are spread out to describe the whole range of IQs
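
Why easy items inform about low scorers and hard items about high scorers follows from the Rasch item information function, p(1 - p), which peaks where the person measure equals the item difficulty. The sketch below illustrates this. The function names and the item difficulties are hypothetical and are not estimates from the TS-IQ data.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a right answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta: float, b: float) -> float:
    """Rasch item information: p * (1 - p), largest (0.25) when theta equals b."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)


def total_information(theta: float, difficulties) -> float:
    """Test information is the sum of the item informations."""
    return sum(item_information(theta, b) for b in difficulties)


# Hypothetical difficulties in logits: one easy item and one hard item.
easy_b, hard_b = -2.0, 2.5
for theta in (-2.0, 0.0, 2.5):
    print(
        f"theta={theta:+.1f}: easy item info={item_information(theta, easy_b):.3f}, "
        f"hard item info={item_information(theta, hard_b):.3f}, "
        f"total={total_information(theta, [easy_b, hard_b]):.3f}"
    )
```

A test whose difficulties are spread out keeps `total_information` reasonably high across the whole range, while a clinical screen can concentrate its difficulties near the cutpoint.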


IRT: Compare Items with People

Clinically Targeted Test (VUMC Greco)

• Items gray, people black
• School sample
• High is bad (sicker)
• Clinical screens focus on sick people
• Classify: treat yes-no
• Job is to be maximally informative at the cutpoint
• This test invests its items in severe range

Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.


• Left, distribution of children (each # = 3)
• Right, distribution of items
• Centerline, measure score, theta for people and “difficulty” for items
• Self-harm item, a severe outlier
• 9 items concentrated in low-average range
• Are they concentrated near the clinical-normal threshold?

Acorn 10 item scale


Putting It All Together

Too Short-IQ’s Items and Total


Putting it all together (Walker’s CSI)

Multiple criteria converge => firm conclusion without definitive cutoffs or perfect models

Items scored 0-4

Items 1-35

Walker, Lynn S., Beck, Joy E., Garber, Judy, & Lambert, Warren. The Children’s Somatization Inventory: Psychometric Properties of the Revised Form (CSI-24) and Evidence for a Continuum of Symptom Reporting in Youth. In press, J. Pediatric Psychology.


Bold items, some concern

Self-harm, having a low mean, shows some roughness (no fatal flaws).

Infit/outfit flags are borderline. Good is now 0.7-1.3, used to be 0.5-1.5

A&D items are near the floor, but still seem to work.
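
The infit and outfit flags mentioned above are mean-square statistics built from Rasch residuals. The sketch below shows the usual definitions for right/wrong items, treating person and item measures as known. In practice the measures are estimated by Rasch software, and the scale discussed on this slide may use rating-scale items, which need the polytomous version. The function names and the simulated data are assumptions; only the 0.7-1.3 range comes from the slide.

```python
import numpy as np


def item_fit_mean_squares(responses, person_measures, item_measures):
    """Infit and outfit mean squares per item for 0/1 responses (persons x items)."""
    x = np.asarray(responses, dtype=float)
    theta = np.asarray(person_measures, dtype=float)[:, None]
    b = np.asarray(item_measures, dtype=float)[None, :]
    expected = 1.0 / (1.0 + np.exp(-(theta - b)))   # model probability of "right"
    variance = expected * (1.0 - expected)          # model variance of each response
    squared_residual = (x - expected) ** 2
    outfit = (squared_residual / variance).mean(axis=0)         # unweighted mean square
    infit = squared_residual.sum(axis=0) / variance.sum(axis=0)  # information-weighted
    return infit, outfit


def flag_misfit(infit, outfit, low=0.7, high=1.3):
    """True for items whose infit or outfit falls outside [low, high]."""
    infit, outfit = np.asarray(infit), np.asarray(outfit)
    outside = (infit < low) | (infit > high) | (outfit < low) | (outfit > high)
    return outside.tolist()


# Hypothetical data generated from the Rasch model itself, so fit should be near 1.
rng = np.random.default_rng(0)
theta = rng.standard_normal(5000)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
responses = (rng.random(p.shape) < p).astype(int)

infit, outfit = item_fit_mean_squares(responses, theta, b)
print("infit: ", np.round(infit, 2))
print("outfit:", np.round(outfit, 2))
print("flagged:", flag_misfit(infit, outfit))
```

Outfit weights every response equally, so it reacts strongly to surprising answers from people far from the item's difficulty. Infit weights by information and is driven by responses near the item's difficulty.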

Acorn 10 item scale and 3 item index


• 10 item scale has excellent overall stats
• Even fits a one-factor model with fit indices good enough for Psych Assessment purists.
• 3 item scale has some problems as a reliable psychological test
• May be too short to act as a scale with a reliable sum score
• A set of 3 warning flags?

Acorn 10 item scale and 3 item index
