Applied Psych Test Design: Part C - Use of Rasch scaling technology

Post on 12-May-2015


DESCRIPTION

The Art and Science of Applied Test Development. This is the third in a series of PPT modules explicating the development of psychological tests in the domain of cognitive ability using contemporary methods (e.g., theory-driven test specification; IRT-Rasch scaling; etc.). The presentations are intended to be conceptual and not statistical in nature. Feedback is appreciated.

Transcript

The Art and Science of Test Development—Part C

Test and item development: Use of Rasch scaling technology

The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock.

Kevin S. McGrew, PhD.

Educational Psychologist

Research Director, Woodcock-Muñoz Foundation

Part A: Planning, development frameworks & domain/test specification blueprints

Part B: Test and Item Development

Part C: Use of Rasch Technology

Part D: Develop norm (standardization) plan

Part E: Calculate norms and derived scores

Part F: Psychometric/technical and statistical analysis: Internal

Part G: Psychometric/technical and statistical analysis: External

The Art and Science of Test Development

The above titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.

The current module is designated by red bold font lettering.

Important note: For the on-line public versions of this PPT module, certain items, information, etc. are obscured for test security or proprietary reasons…sorry

Use Rasch (IRT) scaling to evaluate the complete pool of items and to develop the Norming and Publication tests

Structural (Internal) Stage of Test Development

Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)

Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?

Methods and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)

Characteristics of a strong test validity program:

• Moderate item internal consistency
• Items/measures are representative of the empirical domain
• Items fit the theoretical structure

Theoretical domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities: the Gv domain and 3 selected narrow Gv abilities

Measurement or empirical domain: item scale development via Rasch technology

Rasch scale and evaluate the complete pool of items to develop Norming and Publication tests

Low ability/easy items

High ability/difficult items

Recall that Block Rotation items have 2 possible correct answers. Therefore there is a scoring question:

• Should items be scaled as 0/1 (both answers must be correct to receive 1)?
• Should items be scaled as 0/1/2?

Item data can be Rasch-scaled under both scoring systems, and the system that provides the best reliability, etc., can then be selected.

We decided to go with the 0/1/2 scoring system.

Important understanding regarding 0/1 and multiple-point (0/1/2) scoring systems when using Rasch/IRT:

Dichotomous (0/1) item scoring: one “step” (0 → 1).

Multiple-point (0/1/2) item scoring: two “steps” (0 → 1; 1 → 2).

Therefore, think of 2-step items as two 0/1 items.

Think of the items as now having been placed in their proper positions on an equal-interval ruler or yardstick: each item is a “tick” mark along the latent trait scale.

Rasch IRT “norms” (calibrates) the scale!
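For readers who want the machinery behind this calibration step: the dichotomous Rasch model, and the linear logit-to-W transform, can be sketched in a few lines of Python. The model is standard; the specific W-scale constants used below (9.1024 W units per logit, centered at 500) are the commonly published values, assumed here rather than taken from this presentation.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model,
    where theta is person ability and b is item difficulty (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def logit_to_w(logit):
    """Linear transform from the logit scale to the W-scale.
    The 9.1024 multiplier and 500 center are assumed constants."""
    return 500.0 + 9.1024 * logit

# When ability equals difficulty, the probability of success is .50,
# which is why person ability and item difficulty live on the same ruler.
assert abs(rasch_p(1.5, 1.5) - 0.5) < 1e-9
```

Because the transform is linear, the equal-interval property of the logit scale carries over to the W-scale.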

A major advantage/feature of a large Rasch IRT-scaled item pool: once you have such a pool, you can develop different and customized scales that place people on the same underlying scale.

• CAT (computer adaptive testing)

• Different and unique forms of the test
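As a sketch of how CAT exploits a calibrated pool: under the Rasch model, item information peaks where the success probability is .50, so a simple adaptive rule administers the unused item whose difficulty is closest to the current ability estimate. The item pool below is hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success given ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta_hat, pool, administered):
    """Pick the unadministered item whose difficulty is closest to the
    current ability estimate, i.e., maximum Rasch item information,
    which peaks where p = .50 (theta == b)."""
    candidates = [b for b in pool if b not in administered]
    return min(candidates, key=lambda b: abs(b - theta_hat))

pool = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical difficulties (logits)
print(next_item(0.3, pool, administered={0.5}))
```

After each response the ability estimate is updated and the rule is applied again, which is why all examinees end up on the same underlying W-type scale despite taking different items.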

A major advantage/feature of a large IRT-scaled item pool: the Norming test, the Publication test, and possible special Research Edition tests all have items on the same scale (the W-scale).

Although each test contains a different number of items, the obtained person-ability W-scores are equivalent; they differ only in degree of precision (reliability).
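The precision point can be made concrete: under the Rasch model the standard error of an ability estimate is the reciprocal square root of the test information, so a form with more items yields an equivalent ability estimate with a smaller standard error. A minimal sketch with hypothetical item difficulties:

```python
import math

def item_info(theta, b):
    """Rasch item information: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def sem(theta, difficulties):
    """Standard error of measurement = 1 / sqrt(test information)."""
    info = sum(item_info(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical forms: the longer form yields a smaller SEM at the same ability
long_form = [i * 0.2 - 2.0 for i in range(21)]  # 21 items spanning -2 to +2 logits
short_form = long_form[::2]                     # 11 items (every other item)
assert sem(0.0, long_form) < sem(0.0, short_form)
```

Both forms place the person at the same point on the scale; the longer form simply measures that point more precisely.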

The average difference in the “gaps” between items on the respective scales is called “item density.”

The W-scale is an equal-interval metric, running from easy to hard. Items are assigned W-difficulties; people are assigned W-ability scores.

2 major Rasch results: Rasch puts person ability and item difficulty on the same scale (the W scale):

1. Person W-ability scores
2. Item W-difficulties

Select and order items for the Publication test based on inspection of the Rasch results.

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

The measure order and fit statistics table is used to select items with the specified item density.

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

Distribution of Block Rotation W-ability scores in the norm sample:

• The complete range (including extremes) of Block Rotation W-scores is 432-546.
• The majority of the Block Rotation norm sample obtained W-scores from 480-520.

Recall that the Block Rotation scoring system is 0/1/2: items have “steps” (0 → 1; 1 → 2), i.e., multiple-point (0/1/2) item scoring.

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

Item map with “steps” displayed for the items. The blue area represents the majority of norm sample subjects’ Block Rotation W-scores. Item 1’s (0/1/2) step structure is shown: one “step” from 0 to 1, and one “step” from 1 to 2.

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

Item map with “steps” displayed for the items. The blue area represents the majority of norm sample subjects’ Block Rotation W-scores.

• Very good test scale coverage for the majority of the population
• Excellent “bottom” or “floor” for the test scale
• Adequate “top” or “ceiling” for the test scale

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

Item map with “steps” displayed for the items. The red area represents the complete range (including extremes) of sample Block Rotation W-scores.

Good test scale coverage for the complete range of the population.

[Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot; x-axis = chronological age in months (0-300); y-axis = W-score (430-550), with reference W ± 3 SD bands]

Block Rotation Rasch floor/ceiling results confirmed by formal ± 3 SD floor/ceiling analysis (24-300 months of age)

Block Rotation Rasch floor/ceiling results confirmed by formal ± 3 SD floor/ceiling analysis (300-1200 months of age)

[Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot; x-axis = chronological age in months (300-1200); y-axis = W-score (430-550), with reference W ± 3 SD bands]

2 major Rasch results: person W-ability scores and item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects)

The program generates the final raw score (RS) to W-ability scoring table.

Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects)

Raw score to W-score “scoring table.”

Note: The total raw score is 74 points for 37 items. These are 2-step items: 37 items × 2 steps = 74 total possible points.

Block Rotation Norming test (n = 44 items): 44 items × 2 steps = raw scores from 0 to 88 on the Rasch-based scoring table (the equal-interval Visualization-Vz measurement “ruler” or “yardstick”).

Block Rotation Norming test (n = 44 items) raw score to W-score “scoring table” (excerpt):

Raw Score   W-score
88          545.7
87          539.0
…           …
1           437.8
0           431.6


Block Rotation Publication test (n = 37 items) raw score to W-score “scoring table” (excerpt):

Raw Score   W-score
74          545.7
73          539.0
…           …
1           437.8
0           431.6

The Block Rotation Norming and Publication tests, although they have different numbers of items (and total raw scores), are on the same underlying measurement scale (ruler).
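A program that generates such an RS-to-W scoring table is essentially inverting the test characteristic curve: for each raw score it finds the ability whose expected score equals that raw score, then applies the logit-to-W transform. The sketch below treats each 0/1/2 item as two 0/1 “steps,” as the presentation suggests; the step difficulties and W-scale constants are assumptions for illustration.

```python
import math

def expected_score(theta, step_difficulties):
    """Expected total raw score at ability theta, treating each 0/1/2 item
    as two dichotomous 'steps'."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d))) for d in step_difficulties)

def theta_for_raw(raw, steps, lo=-8.0, hi=8.0):
    """Invert the test characteristic curve by bisection: find the ability
    (in logits) whose expected raw score equals the observed raw score.
    Extreme scores (0 and the maximum) have no finite estimate and are
    handled separately in practice."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, steps) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def logit_to_w(logit):
    # Assumed W-scale constants (9.1024 W units per logit, centered at 500)
    return 500.0 + 9.1024 * logit

# Hypothetical step difficulties for a short 5-item (0/1/2) test -> 10 steps
steps = [-1.5, -1.0, -0.6, -0.2, 0.0, 0.2, 0.6, 1.0, 1.4, 1.8]
table = {raw: round(logit_to_w(theta_for_raw(raw, steps)), 1)
         for raw in range(1, len(steps))}  # non-extreme raw scores 1..9
print(table)
```

Because the expected-score curve is strictly increasing in ability, each raw score maps to exactly one W-score, which is what makes a simple lookup table possible.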

2 major Rasch results: person W-ability scores and item W-difficulties

Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects)

Block Rotation Publication test (n = 37 items)

The program generates the final RS to W-ability scoring table.

Result: All norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal-interval W-score scale. These Block Rotation W-scores are then used for developing test “norms” and for completing technical manual analyses and validity research.

[Figure: graphic display of the distribution of Block Rotation person abilities (W-scores 432 to 546)]

Block Rotation summary: Final Rasch for the Publication test, graphic item map (n = 37 norming items; 0-74 RS points; n = 4,722 norm subjects), shown on the Publication test W-score scale.

Recall the early warning to expect the unexpected and the non-linear “art and science” of test development.

A last-minute question was raised (prior to formal production) about the Block Rotation test:

Should the blocks be shaded/colored instead of being black and white?

Would adding shading/color change the nature of the task?

What to do?

Answer: Do a study; gather some empirical data to help make the decision. The question should be answered empirically: you should not assume that colorizing items will make no difference.

Special Block Rotation no-color vs color group administration study completed


Sample size plan: approximately 300+ subjects, in 3 groups spanning the complete range of Block Rotation ability:

• 2nd-4th graders: approx. 100+
• 7th-11th graders: approx. 100+
• College students: approx. 100+

The final total sample was 380 subjects.

A group administration version of the test was used. Two forms of the test were constructed from the complete set of ordered (scaled) items:

• White (no-color) version: even items
• Colored version: odd items

Analyses: Rasch analysis and comparison of the respective item difficulties, plus a mean score comparison between versions.

Conclusion: Adding color did NOT change the psychometric characteristics of the items/test; therefore, print the final test with colored items.
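The kind of comparison used here can be sketched as a generic Rasch invariance check: given paired difficulty estimates for items under the no-color and color conditions (hypothetical numbers below), invariance is supported when the mean shift is near zero and the paired difficulties correlate near 1.0. This is an illustration of the idea, not the study's actual analysis code.

```python
import math

def compare_calibrations(b_nocolor, b_color):
    """Compare paired Rasch item difficulties from two administrations:
    returns (mean shift, Pearson correlation)."""
    n = len(b_nocolor)
    shift = sum(c - w for w, c in zip(b_nocolor, b_color)) / n
    mw = sum(b_nocolor) / n
    mc = sum(b_color) / n
    cov = sum((w - mw) * (c - mc) for w, c in zip(b_nocolor, b_color))
    sw = math.sqrt(sum((w - mw) ** 2 for w in b_nocolor))
    sc = math.sqrt(sum((c - mc) ** 2 for c in b_color))
    return shift, cov / (sw * sc)

# Hypothetical difficulties (logits): near-identical calibrations
b_white = [-1.2, -0.5, 0.0, 0.4, 1.1]
b_color = [-1.15, -0.55, 0.05, 0.4, 1.05]
shift, r = compare_calibrations(b_white, b_color)
print(shift, r)
```

A small mean shift and a correlation near 1.0 are what the "color did not change the psychometrics" conclusion looks like numerically.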

Two sample items

Final Block Rotation Publication Test constructed: n = 37 (0/1/2) items; raw scores from 0 to 74.

Rasch (IRT) is a magnificent tool for evaluating and constructing tests with flexibility during the entire process. Embrace IRT methods in applied test development (vs. CTT methods).

It is important to remember that you are calibrating the scale, not norming the test, during this phase. Samples with rectangular distributions of ability are critical.

Carefully inspect the Rasch results (esp. the measure order table) and determine whether you have enough easy and difficult items or need more items at certain places along the scale. Then use “linking/anchor” technology to add in new items.
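The “linking/anchor” step can be sketched as mean-shift Rasch linking: anchor items already calibrated in the master pool are embedded in the new form, and the average difference between their master and new difficulties gives a linking constant that places the newly written items on the master scale. All values below are hypothetical.

```python
def linking_constant(anchor_master, anchor_new):
    """Mean-shift linking: average difference between the anchor items'
    master-pool difficulties and their difficulties in the new calibration."""
    return sum(m - n for m, n in zip(anchor_master, anchor_new)) / len(anchor_master)

def link(new_items, constant):
    """Place newly calibrated item difficulties on the master pool scale."""
    return [b + constant for b in new_items]

# Hypothetical logits: the anchors came out ~0.30 higher in the new run,
# so the new calibration must be shifted down by ~0.30 to match the master scale
master_anchors = [-1.0, 0.0, 1.0]
new_run_anchors = [-0.7, 0.3, 1.3]
c = linking_constant(master_anchors, new_run_anchors)
print(link([0.5, 2.1], c))  # new items expressed on the master scale
```

Operational linking designs are usually more elaborate (and check anchor stability first), but the core idea is this single additive constant.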

Item fit is a relative matter involving “reasonably acceptable approximate fit.” Don't blindly follow black-and-white item fit rules from textbooks and articles. The “real world” of test development is not an ivory-tower exercise. Follow the three basic Rasch assumptions (unidimensionality; equal discrimination; local independence) “within reason” (Woodcock).

Many tests claim to use the Rasch model (Rasch “name dropping”) but use it only for item analyses and do not harness the advantages of the underlying Rasch ability scale (e.g., the W-scale) for improved test construction and score interpretation procedures (e.g., RPIs).

Maintaining a master item pool:

• Norming-calibration tests
• Linking/equating (alternate forms) tests
• Adding new items to the master item pool (use of anchor items from the master item pool)
• Checking for possible item bias (DIF: differential item functioning)
• Creating and using shortened special-purpose versions of tests (norming tests; research edition tests; tests for special populations)
• Flagging potentially poor examiners via an empirical “person fit” statistics report
• Computer adaptive testing (CAT)
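The person-fit idea above can be sketched with the unweighted (outfit) mean-square statistic: the average squared standardized residual across a person's responses. Grossly aberrant patterns, such as failing very easy items while passing hard ones (as might happen with poor test administration), inflate the statistic well above 1.0. The difficulties and response strings below are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit_mnsq(theta, difficulties, responses):
    """Unweighted (outfit) mean-square person-fit statistic:
    mean of squared standardized residuals; values far above 1.0
    flag unexpected response patterns."""
    z2 = []
    for b, x in zip(difficulties, responses):
        p = rasch_p(theta, b)
        z2.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(z2) / len(z2)

# Hypothetical 5-item strip, difficulties in logits, ability theta = 1.0
bs = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = [1, 1, 1, 1, 0]   # expected pattern for this ability
aberrant = [0, 1, 1, 1, 1]     # missed the easiest item, passed the hardest
print(outfit_mnsq(1.0, bs, consistent), outfit_mnsq(1.0, bs, aberrant))
```

Examiners whose examinees repeatedly produce high-outfit records are worth a closer look, which is exactly the flagging use described in the slide.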

End of Part C

Additional steps in the test development process will be presented in subsequent modules as they are developed.
