
Applied Psych Test Design: Part C - Use of Rasch scaling technology

May 12, 2015


Kevin McGrew

The Art and Science of Applied Test Development. This is the third in a series of PPT modules explicating the development of psychological tests in the domain of cognitive ability using contemporary methods (e.g., theory-driven test specification; IRT-Rasch scaling; etc.). The presentations are intended to be conceptual and not statistical in nature. Feedback is appreciated.
Transcript
Page 1: Applied Psych Test Design: Part C - Use of Rasch scaling technology

The Art and Science of Test Development—Part C

Test and item development: Use of Rasch scaling technology

The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock

Kevin S. McGrew, PhD.

Educational Psychologist

Research Director, Woodcock-Muñoz Foundation

Page 2: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Part A: Planning, development frameworks & domain/test specification blueprints

Part B: Test and Item Development

Part C: Use of Rasch Technology

Part D: Develop norm (standardization) plan

Part E: Calculate norms and derived scores

Part F: Psychometric/technical and statistical analysis: Internal

Part G: Psychometric/technical and statistical analysis: External

The Art and Science of Test Development

The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.

The current module is designated by red bold font lettering

Page 3: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Important note: In the online public versions of this PPT module, certain items, information, etc. are obscured for test security or proprietary reasons…sorry

Page 4: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Use Rasch (IRT) scaling to evaluate the complete pool of items and to develop the Norming and Publication tests

Page 5: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Structural (Internal) Stage of Test Development

Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)

Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?

Method and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)

Characteristics of strong test validity program

• Moderate item internal consistency

• Items/measures are representative of the empirical domain

• Items fit the theoretical structure

Page 6: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Theoretical Domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities – Gv domain & 3 selected narrow Gv abilities

Gv

Item Scale Development via Rasch technology

Measurement or empirical domain

Rasch scale and evaluate the complete pool of items to develop Norming and Publication tests

Low ability/easy items

High ability/difficult items

Page 7: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Recall that Block Rotation items have 2 possible correct answers. Therefore there is a scoring question:

• Should items be scaled as 0/1 (need both correct to receive 1)?

• Should items be scaled as 0/1/2?

Item data can be Rasch-scaled under both scoring systems, and the one that provides the best reliability, etc. can then be selected.

We decided to go with the 0/1/2 scoring system

Page 8: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Important understanding regarding 0/1 and multiple point (0/1/2) scoring systems when using Rasch/IRT

Dichotomous (0/1) item scoring: 0 → 1 (1 “step”)

Multiple point (0/1/2) item scoring: 0 → 1 → 2 (1 “step” + 1 “step”)

Therefore, think of 2-step items as two 0/1 items
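The “step” idea can be made concrete with a small sketch. The slides do not say which polytomous Rasch model was used, so the code below assumes the Rasch partial credit model purely for illustration; the function names and the step difficulties are hypothetical. With a single step, the partial credit model reduces to the ordinary dichotomous Rasch model, which is one way to see why a 2-step item behaves like two linked 0/1 items.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of scoring 1 on a 0/1 item (dichotomous Rasch model, logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def partial_credit_probabilities(ability, step_difficulties):
    """Probabilities of scoring 0..m on an item with m steps.

    Assumes the Rasch partial credit model (an assumption, not stated in the slides).
    """
    # Running sum of (ability - step difficulty); score 0 has a sum of 0.
    cumulative = [0.0]
    for step in step_difficulties:
        cumulative.append(cumulative[-1] + (ability - step))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: a person at 0 logits, one 0/1 item, one 0/1/2 item.
p_single = rasch_probability(0.0, 0.0)
p_two_step = partial_credit_probabilities(0.0, [-0.5, 0.5])
```

Here `p_two_step` holds three probabilities (for scores 0, 1, and 2) that sum to 1.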

Page 9: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Think of the items as now having been placed in their proper position on an equal interval ruler or yardstick. Each item is a “tick” mark along the latent trait scale

Rasch IRT “norms” (calibrates) the scale!

Page 10: Applied Psych Test Design: Part C - Use of Rasch scaling technology

A major advantage/feature of a large Rasch IRT-scaled item pool

Once you have a large Rasch IRT-scaled item pool, you can develop different and customized scales that place people on the same underlying scale

• CAT (computer adaptive testing)

• Different and unique forms of the test
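As a rough illustration of how a calibrated pool supports CAT: under the Rasch model an item is most informative when its difficulty is close to the person's current ability estimate, so a simple selection rule is to pick the unused item whose difficulty is nearest that estimate. The sketch below assumes that rule; the function name, item labels, and W-difficulties are hypothetical, and operational CAT systems typically add stopping rules and exposure controls that are not shown.

```python
def next_item(ability_estimate, item_difficulties, administered):
    """Return the unused item whose difficulty is closest to the ability estimate.

    item_difficulties: dict mapping item label -> difficulty on the common scale.
    administered: set of item labels already given to this person.
    """
    candidates = {
        item: difficulty
        for item, difficulty in item_difficulties.items()
        if item not in administered
    }
    if not candidates:
        return None  # the pool is exhausted
    return min(candidates, key=lambda item: abs(candidates[item] - ability_estimate))

# Hypothetical calibrated pool (W-difficulties) and one item already used.
pool = {"item_a": 470.0, "item_b": 495.0, "item_c": 510.0, "item_d": 530.0}
choice = next_item(500.0, pool, administered={"item_b"})
```

Because every item sits on the same scale, any subset chosen this way still yields an ability estimate on that scale.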

Page 11: Applied Psych Test Design: Part C - Use of Rasch scaling technology

A major advantage/feature of a large IRT-scaled item pool

Norming test; Publication test; possible special Research Edition tests

All three tests have items on the same scale (W-scale)

Although each test has a different number of items, the obtained person ability W-scores are equivalent but differ in degree of precision (reliability)

The average “gap” between adjacent items on each scale is called “item density”

The W-scale is an equal interval metric (running from Easy to Hard)
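Item density as described above can be computed directly from a list of item difficulties. This is a minimal sketch that takes the definition literally (the mean gap between adjacent items once they are sorted by difficulty); the function name and the W-difficulty values are hypothetical.

```python
def item_density(w_difficulties):
    """Average gap between adjacent items after sorting by difficulty."""
    ordered = sorted(w_difficulties)
    if len(ordered) < 2:
        raise ValueError("need at least two items to compute a gap")
    gaps = [upper - lower for lower, upper in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical W-difficulties for a longer and a shorter form built from one pool.
longer_form = [440, 450, 460, 470, 480, 490, 500, 510, 520]
shorter_form = [440, 460, 480, 500, 520]

density_longer = item_density(longer_form)    # 10.0 W units between items
density_shorter = item_density(shorter_form)  # 20.0 W units between items
```

Both hypothetical forms span the same range of the scale; the shorter one simply has wider gaps, which is why it measures with less precision.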

Page 12: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Items are assigned W-difficulties

People are assigned W-ability scores

2 Major Rasch results

Rasch puts person ability and item difficulty on the same scale (W scale)
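The slides do not state how Rasch logits are converted to W units. A minimal sketch, assuming the commonly published form of the W-scale transformation (W = 9.1024 × logit + 500, where 9.1024 is 20 / ln 9), shows why having persons and items on one scale is useful: the difference between a person's W-ability and an item's W-difficulty translates directly into a probability of success. The function names are hypothetical.

```python
import math

# Assumed W-scale constants (not given in the slides): 20 / ln(9) is about 9.1024.
W_UNITS_PER_LOGIT = 20.0 / math.log(9.0)
W_CENTER = 500.0

def logit_to_w(logit):
    """Convert a Rasch logit measure to the assumed W metric."""
    return W_UNITS_PER_LOGIT * logit + W_CENTER

def success_probability(w_ability, w_difficulty):
    """Rasch probability of success expressed in W units."""
    return 1.0 / (1.0 + 9.0 ** ((w_difficulty - w_ability) / 20.0))

p_matched = success_probability(500.0, 500.0)  # person and item at the same point
p_above = success_probability(520.0, 500.0)    # person 20 W units above the item
p_below = success_probability(480.0, 500.0)    # person 20 W units below the item
```

Under this assumed scaling, a person whose ability equals the item difficulty succeeds half the time, and a 20-point advantage or disadvantage corresponds to 90% or 10% success.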

Page 13: Applied Psych Test Design: Part C - Use of Rasch scaling technology

2 Major Rasch results: Person W-ability scores and Item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Select and order items for Publication test based on inspection of Rasch results

Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)

Page 14: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Measure order and fit statistics table

Used to select items with specified item density

Page 15: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Distribution of Block Rotation W-ability scores in norm sample

Complete range (including extremes) of Block Rotation W-scores is 432-546

Majority of Block Rotation norm sample obtained W-scores from 480-520

Page 16: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Recall that the Block Rotation scoring system is 0/1/2: items have “steps”

Multiple point (0/1/2) item scoring: 0 → 1 → 2 (1 “step” + 1 “step”)

Page 17: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Item map with “steps” displayed for items

Blue area represents the Block Rotation W-scores of the majority of norm sample subjects

Item 1 (0/1/2) step structure: 1 “step” + 1 “step”

Page 18: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Item map with “steps” displayed for items

Blue area represents the Block Rotation W-scores of the majority of norm sample subjects

Very good test scale coverage for majority of population

Excellent “bottom” or “floor” for test scale

Adequate “top” or “ceiling” for test scale

Page 19: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Item map with “steps” displayed for items

Red area represents the complete range (including extremes) of sample Block Rotation W-scores

Good test scale coverage for complete range of population

Page 20: Applied Psych Test Design: Part C - Use of Rasch scaling technology

BLKROT: Floor (rs=1) & ceiling (rs=max) plot

[Plot: Ref W +/- 3 SD's (vertical axis, W-scale from 430 to 550) by camos (horizontal axis, 0 to 300)]

Block Rotation Rasch floor/ceiling results confirmed by formal ±3 SD floor/ceiling analysis (24-300 months of age)

Page 21: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation Rasch floor/ceiling results confirmed by formal ±3 SD floor/ceiling analysis (300-1200 months of age)

BLKROT: Floor (rs=1) & ceiling (rs=max) plot

[Plot: Ref W +/- 3 SD's (vertical axis, W-scale from 430 to 550) by camos (horizontal axis, 300 to 1200)]

Page 22: Applied Psych Test Design: Part C - Use of Rasch scaling technology

2 Major Rasch results: Person W-ability scores and Item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)

Program generates final RS to W-ability scoring table

Page 23: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation: Final Rasch with norming test

n = 37 norming items

n = 4,722 norm subjects

Raw score to W-score “scoring table”

Note: Total raw score points is 74 for 37 items. These are 2-step items.

37 items x 2 steps = 74 total possible points
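The slides do not show how the program builds the raw score to W-ability table. One standard way to do it in the Rasch family, sketched below under the assumption of a partial credit model with known step difficulties, is to find for each raw score the ability at which the expected total score equals that raw score. Results are left in logits; all names and step values are hypothetical, and extreme scores (0 and the maximum) are excluded because they have no finite estimate without an additional adjustment.

```python
import math

def expected_item_score(ability, steps):
    """Expected score on one item under the (assumed) partial credit model."""
    cumulative = [0.0]
    for step in steps:
        cumulative.append(cumulative[-1] + (ability - step))
    peak = max(cumulative)  # subtract the peak for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return sum(score * w for score, w in enumerate(weights)) / total

def expected_raw_score(ability, items):
    """Expected total raw score across all items."""
    return sum(expected_item_score(ability, steps) for steps in items)

def ability_for_raw_score(raw_score, items, low=-15.0, high=15.0, tol=1e-8):
    """Bisection search for the ability whose expected raw score equals raw_score."""
    max_score = sum(len(steps) for steps in items)
    if not 0 < raw_score < max_score:
        raise ValueError("extreme scores need a separate adjustment")
    while high - low > tol:
        mid = (low + high) / 2.0
        if expected_raw_score(mid, items) < raw_score:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Hypothetical 3-item test with 2 steps per item (raw scores 0 to 6).
items = [[-1.0, 0.0], [-0.5, 0.5], [0.0, 1.0]]
scoring_table = {raw: ability_for_raw_score(raw, items) for raw in range(1, 6)}
```

The same logic scales to 37 two-step items with raw scores from 0 to 74; only the list of step difficulties changes.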

Page 24: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation Norming Test

n = 44 items

44 items x 2 steps = raw scores from 0 to 88 on the Rasch-based scoring table (the equal interval Visualization-Vz measurement “ruler” or “yardstick”)

Block Rotation Norming test (n = 44 items)

Raw Score    W-score
88           545.7
87           539.0
.            .
.            .
.            .
1            437.8
0            431.6

Page 25: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation Norming test (n = 44 items)

Raw Score    W-score
88           545.7
87           539.0
.            .
.            .
.            .
1            437.8
0            431.6

Block Rotation Publication test (n = 37 items)

Raw Score    W-score
74           545.7
73           539.0
.            .
.            .
.            .
1            437.8
0            431.6

Block Rotation Norming and Publication tests, although having different numbers of items (and total Raw Scores), are on the same underlying measurement scale (ruler)

Page 26: Applied Psych Test Design: Part C - Use of Rasch scaling technology

2 Major Rasch results: Person W-ability scores and Item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items)

Program generates final RS to W-ability scoring table

Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on equal interval W-score

Page 27: Applied Psych Test Design: Part C - Use of Rasch scaling technology

2 Major Rasch results: Person W-ability scores and Item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items)

Program generates final RS to W-ability scoring table

Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on equal interval W-score

These Block Rotation W-scores are then used for developing test “norms” and completing technical manual analysis and validity research

Page 28: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Block Rotation Summary: Final Rasch for Publication test – graphic item map

n = 37 norming items (0-74 RS points)

n = 4,722 norm subjects

Pub. Test W-score scale (432 to 546)

Graphic display of distribution of Block Rotation person abilities

These Block Rotation W-scores are then used for developing test “norms” and validity research

Page 29: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Recall the early warning to expect the unexpected and the non-linear “art and science” of test development

Last minute question raised (prior to formal production) about the Block Rotation test:

Should the blocks be shaded/colored instead of being black and white?

Would adding shading/color change the nature of the task?

What to do?

Answer: Do a study. Gather some empirical data to help make the decision. The question should be answered empirically; you should not assume that colorizing items will make no difference

Page 30: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Special Block Rotation no-color vs color group administration study completed

Page 31: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Special Block Rotation no-color vs color group administration study completed

Sample size plan: approx. 300+ subjects

3 groups spanning the complete range of Block Rotation ability

• 2nd – 4th graders – approx. 100+

• 7th – 11th graders – approx. 100+

• College students – approx. 100+

• Final total sample was 380 subjects

Group administration version of test

Two forms of the test constructed from the complete set of ordered (scaled) items

• White version – even items

• Colored version – odd items

Analyses: Rasch analysis and comparison of respective item difficulties, and mean score comparison between versions

Conclusion: adding color did NOT change the psychometric characteristics of the items/test; therefore print the final test with colored items

Page 32: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Two sample items

Final Block Rotation Publication Test Constructed: n = 37 (0/1/2) items; Raw Scores from 0-74

Page 33: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Rasch (IRT) is a magnificent tool for evaluating and constructing tests with flexibility during the entire process. Embrace IRT methods in applied test development (vs CTT methods)

Important to remember (you are calibrating the scale and not norming the test during this phase). Samples with rectangular distributions of ability are critical.

Carefully inspect the Rasch results (esp., measure order table) and determine if you have enough easy and difficult items or need more items at certain places along the scale. Then use “linking/anchor” technology to add in new items.

Item fit is a relative matter involving “reasonably acceptable approximate fit”. Don’t blindly follow black and white item fit rules from textbooks and articles. The “real world” of test development is not an ivory tower exercise. Follow the 3 basic Rasch assumptions (unidimensionality; equal discrimination; local independence) “within reason” (Woodcock).

Many tests claim to use the Rasch model (Rasch “name dropping”), but only use it for item analyses and do not harness the advantages of the underlying Rasch ability scale (e.g., W-scale) for improved test construction and score interpretation procedures (e.g., RPIs).
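The “linking/anchor” step mentioned above is not spelled out in the slides. One common Rasch approach, sketched here as an assumption rather than as the author's procedure, is mean-shift linking: calibrate the new items together with a few anchor items from the master pool, measure how far the anchors moved, and shift the new items by that amount so they land on the pool's scale. All names and values are hypothetical.

```python
def link_new_items(pool_anchor, new_anchor, new_items):
    """Shift newly calibrated item difficulties onto the master pool scale.

    pool_anchor: anchor item difficulties as stored in the master pool.
    new_anchor:  the same anchor items as estimated in the new calibration.
    new_items:   new item difficulties from the new calibration.
    """
    common = sorted(set(pool_anchor) & set(new_anchor))
    if not common:
        raise ValueError("no anchor items in common")
    # Average displacement of the anchors between the two calibrations.
    shift = sum(pool_anchor[item] - new_anchor[item] for item in common) / len(common)
    return {item: difficulty + shift for item, difficulty in new_items.items()}

# Hypothetical W-difficulties: the new calibration runs about 5 W units low.
pool_anchor = {"a1": 480.0, "a2": 500.0, "a3": 520.0}
new_anchor = {"a1": 476.0, "a2": 495.0, "a3": 514.0}
new_items = {"n1": 530.0, "n2": 540.0}

linked = link_new_items(pool_anchor, new_anchor, new_items)
```

In practice the anchors would also be checked for stability before the shift is trusted; that check is omitted here.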

Page 34: Applied Psych Test Design: Part C - Use of Rasch scaling technology

Maintaining a master item pool

Norming-calibration tests

Linking/equating (alternate forms) tests

Adding new items to master item pool (use of anchor items from master item pool)

Checking for possible item bias (DIF – differential item functioning)

Creating and using shortened special purpose versions of tests (norming tests; research edition tests; tests for special populations)

Flagging potentially poor examiners via empirical “person fit” statistics report

Computer adaptive testing (CAT)
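As one illustration of the DIF check listed above, a simple Rasch-style screen calibrates an item separately in two groups and compares the two difficulty estimates relative to their combined standard error. This is a minimal sketch of that general idea, not the procedure used for the test described here; the function name and numbers are hypothetical, and no cutoff is applied because how large a contrast matters is a judgment call.

```python
import math

def dif_contrast(ref_difficulty, ref_se, focal_difficulty, focal_se):
    """Return (contrast, z) comparing one item's difficulty across two groups.

    contrast: focal-group difficulty minus reference-group difficulty.
    z:        contrast divided by the joint standard error of the two estimates.
    """
    contrast = focal_difficulty - ref_difficulty
    joint_se = math.sqrt(ref_se ** 2 + focal_se ** 2)
    return contrast, contrast / joint_se

# Hypothetical estimates (logits) for one item in two groups.
contrast, z = dif_contrast(
    ref_difficulty=0.20, ref_se=0.10,
    focal_difficulty=0.50, focal_se=0.10,
)
```

A large contrast relative to its standard error suggests the item is worth a closer look; it does not by itself establish bias.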

Page 35: Applied Psych Test Design: Part C - Use of Rasch scaling technology

End of Part C

Additional steps in the test development process will be presented in subsequent modules as they are developed