Reliability in Scales

• Reliability is a question of consistency: do we get the same numbers on repeated measurements?

• Low reliability: reaction time

• High reliability: measuring weight

• Psychological tests fall in between

Reliability & Measurement error

• Measures are not perfectly reliable because they have error

• The “built in accuracy” of the scale

• Pokemon wristwatch vs. USN atomic clock

• We can express this as:

• X = T + e

• X = your measurement

• T = the “True score”

• e = the error involved in measuring it (+ or -)

Example: the effect of e

• Imagine we have someone with a “true” intelligence score of 100.

• If your intelligence scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130)

• If your scale has a small e, your measurements will vary only a little (say from 90 to 110)
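
The effect of e can be tried out in a quick simulation. This is only a minimal sketch, assuming e is normally distributed around zero; the error sizes (standard deviations of 5 and 20), the sample size and the name `simulate_measurements` are made up for illustration and do not come from any real intelligence scale.

```python
import random
import statistics

def simulate_measurements(true_score, error_sd, n, seed=0):
    """Return n simulated measurements X = T + e, with e ~ Normal(0, error_sd)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]

# Hypothetical scales: same true score, different amounts of error
small_e = simulate_measurements(true_score=100, error_sd=5, n=1000)
large_e = simulate_measurements(true_score=100, error_sd=20, n=1000)

for label, scores in [("small e", small_e), ("large e", large_e)]:
    print(label,
          "mean:", round(statistics.mean(scores), 1),
          "spread (SD):", round(statistics.pstdev(scores), 1))
```

Both sets of scores centre on the true score, but the large-e scale spreads its measurements much more widely around it.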

Measurements as distributions

• Think of e as the variance in a distribution of measurements (X), with the true score T as the mean

[Figure: two histograms of repeated measurements, one for each case below]

Small e - scores clustered close to true score

Large e - scores all over the place! (hard to say what the true score is)

More on the error

• Measures with a large e are dodgy (the error hides the true score)

• We can reduce the size of e, but not eliminate it completely

• Measuring reliability is measuring the impact of e

Different forms of reliability

• Reliability (“effect of e”) can be very hard to conceptualise

• To help, we break it up into 2 subclasses

• Temporal stability: if I measure you today and tomorrow, do I get the same result?

• Internal consistency: are all the questions in the test measuring the same thing?

Temporal stability

• The big idea: If I test you now, and then I test you tomorrow, I should get the same result

• Why have it?

• Can’t measure changes otherwise!

• Tells us that we can trust results (small time error)

• Tells us that there is no learning effect

Measuring temporal stability

• How can we measure if a test is temporally stable?

• The problem: we have 2 sets of scores. We need to see if they are the same

• Solution: Use a correlation. If the two sets are strongly related, then they are basically the same

Example: Correlations & Stability

• Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday):

• M: 5 6 5 3 4 8 6 4 8 7

• W: 4 7 4 3 4 8 6 3 9 5

• Are these scores the same? (r = 0.897)

Example: correlations & stability

• Now imagine a crappy scale:

• M: 5 6 5 3 4 8 6 4 8 7

• W: 5 3 2 2 8 5 8 1 4 4

• Are these scores basically the same? (r = 0.211)
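
The two r values quoted in these examples can be reproduced with a few lines of code. The sketch below uses the ordinary Pearson correlation formula; the function name `pearson_r` and the variable names are just illustrative choices.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

monday = [5, 6, 5, 3, 4, 8, 6, 4, 8, 7]
wednesday_good = [4, 7, 4, 3, 4, 8, 6, 3, 9, 5]  # first example
wednesday_bad = [5, 3, 2, 2, 8, 5, 8, 1, 4, 4]   # second example

print(round(pearson_r(monday, wednesday_good), 3))  # 0.897
print(round(pearson_r(monday, wednesday_bad), 3))   # 0.211
```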

Different approaches to stability

• There are two main ways of testing temporal stability

• Test-retest method: give the same test to the same people

• Alternate forms: give a highly similar test to the same people

Test-retest method

• Method:

• 1. Select a group of people

• 2. Give them your test

• 3. Get them to come back later

• 4. Give them the test again

• 5. Correlate the results to see whether the two sets of scores agree

Things to note

• It must be the same people

• We want to know that if client X returns, we can measure that person again

• The amount of time between tests depends on your requirements

• The correlation value must be very high - above 0.85
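
The test-retest check could be sketched as follows. This is only an illustration: the six people and their total scores are invented, and `pearson_r` is a hand-written helper, not a library call. The only value taken from these notes is the 0.85 cut-off.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical total scores: the same six people, in the same order, at both sessions
first_session = [12, 15, 9, 20, 17, 11]
second_session = [13, 14, 10, 19, 18, 10]

THRESHOLD = 0.85  # cut-off quoted in these notes
r = pearson_r(first_session, second_session)
print("test-retest r =", round(r, 2))
print("temporally stable" if r > THRESHOLD else "not stable enough")
```

Keeping the two lists in the same person order matters: the correlation pairs each person's first score with that same person's second score.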

Why it works

• We get 2 results from each person to compare

• This means we can rely on the test to work for the same people

• We use a lot of people in our assessment

• This means that we can rely on the test, regardless of who our client is

• The correlation tells us the degree to which the 2 tests agree (r² is the proportion of variance they share)
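
As a worked instance of squaring the correlation, taking the r = 0.897 from the Monday/Wednesday example:

```latex
r = 0.897 \quad\Rightarrow\quad r^2 = 0.897^2 \approx 0.805
```

So roughly 80% of the variance in one set of scores is shared with the other set.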

The learning effect

• What if you have a test where learning/practice can affect your score?

• E.g. a class test

• The test-retest method can then yield misleadingly poor correlations

• People tend to score higher marks the second time around, and not all by the same amount

• This will make it look as if temporal stability is poor!

Alternate forms reliability

• Answer: do test-retest, but don’t use the same test twice

• Use a highly similar test

• In order for this to work, both forms must be equally difficult

• The more similar, the better

Making alternate forms of a test

• Simple to ensure both forms are equally difficult

• Make twice as many questions as you will want in the test

• Randomly divide them up into two halves

• Each half is a test!

• The random division means that, on average, both forms are equally difficult
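
The random division can be sketched in a few lines. This is a minimal illustration, assuming a pool of 20 question labels for a 10-question test; the labels, the seed and the function name `make_alternate_forms` are all hypothetical.

```python
import random

def make_alternate_forms(item_pool, seed=None):
    """Randomly divide an item pool into two forms of equal length."""
    if len(item_pool) % 2 != 0:
        raise ValueError("need an even number of items")
    shuffled = list(item_pool)             # copy, so the pool itself is untouched
    random.Random(seed).shuffle(shuffled)  # seeded only to make the example repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pool: twice as many questions as the final test needs
pool = [f"Q{i}" for i in range(1, 21)]
form_a, form_b = make_alternate_forms(pool, seed=42)
print("Form A:", form_a)
print("Form B:", form_b)
```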

The procedure: alternate forms

• Once you have your 2 forms:

• Collect a sample of people

• Give them the first form of the test

• Wait a while

• Give them the second form of the test

• Correlate the results

• If the correlation is high (> .85), you have stability

Which to use: alternate forms or test-retest?

• If you are measuring something which can be learned/perfected by practice - alternate forms

• If not, you could choose

• Test-retest is preferable: it removes the confound about difficulty

• In many cases, you don’t really know if learning is an issue

• Alternate forms is “safer”, but poorer statistically

What if you don’t have temporal stability?

• Temporal stability is not required for all tests

• Most important for tests which work longitudinally

• Very important if you want to track changes over time

• Excludes all “once off” tests (e.g. aptitude tests)

Internal consistency

• A different type of reliability

• The big idea: Are all the questions in my test tapping into the same thing? (Or are some questions irrelevant?)

• All tests require this property

Why it’s important

• Imagine we have an arithmetic ability test, with 4 questions:

• 1. What is 5 x 3?

• 2. What is 12 + 2 - 5?

• 3. What is the capital of Ukraine?

• 4. What is 5 x 2 + 3?

Why it’s important

• Item 3 does not contribute to measuring arithmetic

• Someone who is a maths whiz (should get 4/4) might only get 3/4

• A complete maths idiot (should get 0/4) could get 1/4

• It does not belong in this test!

• If we include it in our total, it will confuse us

• Items such as this become “third variables”

How do we know if an item belongs?

• We need to figure out if a particular item is testing the same thing as the others

• We can correlate the item’s scores with the scores of some other item we do know belongs

• High correlation (above 0.85) - it tests the same thing

• Low correlation (below 0.85) - it measures something else

Our example again

• Some people who know maths will also know geography

• But not everyone!

• Correlate Q1 to Q3 - it will be weak

• Those who know arithmetic will know how to do the other items

• Correlate Q1 to Q2 or Q4 - all will give a high correlation
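
To make the item-by-item comparison concrete, here is a small sketch with invented answers from 16 people to the four questions (1 = correct, 0 = wrong). The data are made up purely so that the geography item behaves differently from the arithmetic items; `pearson_r` is a hand-written helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical answers from 16 people, listed in the same person order for every item
q1 = [1] * 8 + [0] * 8        # 5 x 3
q2 = [1] * 8 + [0] * 7 + [1]  # 12 + 2 - 5 (one person differs from Q1)
q3 = [1, 0] * 8               # capital of Ukraine (unrelated to arithmetic skill)
q4 = [1] * 7 + [0] * 9        # 5 x 2 + 3 (one person differs from Q1)

for name, item in [("Q2", q2), ("Q3", q3), ("Q4", q4)]:
    print("Q1 vs", name, "r =", round(pearson_r(q1, item), 2))
```

In this toy data the two arithmetic items correlate strongly with Q1, while the geography item does not correlate with it at all.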

Doing it for real

• Problem: how do we know which items are suspect?

• Any item could be at fault

• Not always obvious

• Solution - check them all

• Split half method

• Cronbach’s Alpha

Split half approach

• Basic idea: check one half of the test against the other half

• If the first half correlates well with the other half, then they are tapping into the same thing

• Problem to overcome: each half of the test must be the same difficulty

Split half - procedure

• Give a bunch of people your test

• Decide on how to split the test in half

• Correlate the halves

• If the correlation is high (above 0.85), the test is reliable
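
The procedure above can be sketched with an odd/even split. Everything in the example is hypothetical: six people, six right/wrong items, and a hand-written `pearson_r` helper. The odd/even split is just one of the possible ways to divide the test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one row per person, one column per item (1 = correct)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd/even split: each person's score on items 1, 3, 5 versus items 2, 4, 6
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

r = pearson_r(odd_half, even_half)
print("split-half r =", round(r, 3))
```

With these made-up responses r comes out at about 0.756, which is below the 0.85 cut-off, so this toy test would not count as reliable.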

Where to split?

• Problem: how do we split the test?

• First 10 Q vs last 10?

• Odd numbered Q vs Even numbered Q?

• Any method is acceptable, as long as the halves are of equivalent difficulty

• How do you show that?

• Not by correlation - paradox! (A low r could be due to difficulty or to reliability!)

Cronbach’s coefficient

• A major problem with the split-half approach

• How do you know that inside a half there aren’t a few bad items?

• Catches most, but not all

• Solution: Split the test at a different place

• But: if you have the same number of bad items in each half, they balance out - hidden!

The splitting headache

• Imagine you have a few bad items, evenly spread in the test:

[Figure: the test items, with black bars marking the bad items]

If you use a first 3 / last 3 split, you end up with one bad item in each half, so they are balanced out (hidden)

If you use an even/odd split, they are balanced out as well (hidden)

How do you split?

A solution to splitting

• Remember: we don’t know which the bad ones are

• Can’t make bizarre splits to work around them

• Solution: brute force!

• Work out the correlations between every possible split, and average them out!

Cronbach’s α

• Not to be confused with the α (probability of a Type I error) from significance tests!

• Works out the correlation between the two halves for every possible split, and averages them out

• Impossible for bad items to “hide” by balancing out
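
Enumerating every possible split quickly becomes impractical as a test grows, so the coefficient is normally computed from variances instead, using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below applies that formula to a small invented response table (rows are people, columns are items); the data and the function name `cronbach_alpha` are illustrative assumptions, and the code does not literally loop over splits.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per person, one column per item."""
    k = len(responses[0])          # number of items
    items = list(zip(*responses))  # transpose: one tuple of scores per item
    item_variances = [statistics.pvariance(col) for col in items]
    total_variance = statistics.pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: six people, six right/wrong items (1 = correct)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print("alpha =", round(alpha, 2))  # 0.85
```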

Interpreting Cronbach’s α

• Normally gives numbers between 0 and 1

• Needs to be very high (above 0.9)

• It is a measure of homogeneity of the test

• If your test is designed to measure more than one thing, the score will be low

Other forms of reliability

• Kuder-Richardson formula 20 (KR20)

• Like Cronbach’s alpha, but specialized for correct/incorrect type answers

• Inter-scorer reliability

• For judgement tests

• To what degree do several judges agree on the answer?

• Expressed as a correlation
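
For the KR20 point above, a minimal sketch of the standard formula, KR20 = k/(k-1) x (1 - sum of p x q / variance of total scores), where p is the proportion of people answering an item correctly and q = 1 - p. The response table is invented and the function name `kr20` is only illustrative.

```python
import statistics

def kr20(responses):
    """KR-20 for right/wrong items (rows = people, columns = items, 1 = correct)."""
    n_people = len(responses)
    k = len(responses[0])  # number of items
    # p = proportion answering each item correctly; q = 1 - p
    p = [sum(col) / n_people for col in zip(*responses)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_variance = statistics.pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Hypothetical data: six people, six right/wrong items
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print("KR20 =", round(kr20(responses), 2))  # 0.85
```

For a right/wrong item, p x q is simply that item's variance, which is why KR20 can be seen as the correct/incorrect special case of Cronbach's alpha.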
