Jan 12, 2016
1
Reliability in Scales
• Reliability is a question of consistency
• Do we get the same numbers on repeated measurements?
• Low reliability: reaction time
• High reliability: measuring weight
• Psychological tests fall in between
2
Reliability & Measurement error
• Measures are not perfectly reliable because they have error
• The “built in accuracy” of the scale
• Pokemon wristwatch vs. USN atomic clock
• We can express this as:
• X = T + e
• X = your measurement
• T = the “True score”
• e = the error involved in measuring it (+ or -)
3
Example: the effect of e
• Imagine we have someone with a “true” intelligence score of 100.
• If your intelligence scale has a large e, then your measurements will vary a lot (say from 60 all the way to 130)
• If your scale has a small e, your measurements will vary only a little (say from 90 to 110)
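The effect of e can be illustrated with a small simulation. This is only a sketch: it assumes the error is normally distributed around zero, and the error sizes (standard deviations of 5 and 15) and the function name are hypothetical choices, not values from the slides.

```python
import random
import statistics


def simulate_measurements(true_score, error_sd, n=1000, seed=0):
    """Return n simulated measurements X = T + e.

    Assumption: e is drawn from a normal distribution with mean 0
    and standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]


# Same true score (T = 100), two hypothetical scales with different error sizes
small_e = simulate_measurements(100, error_sd=5)
large_e = simulate_measurements(100, error_sd=15)

for label, scores in (("small e", small_e), ("large e", large_e)):
    print(label,
          "mean:", round(statistics.mean(scores), 1),
          "spread (SD):", round(statistics.stdev(scores), 1))
```

Both sets of measurements centre near 100, but the scale with the larger e scatters them much more widely around that value.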
4
Measurements as distributions
• Think of e as the variance of the distribution of your measurements X, centred on the true score T
[Figure: two histograms of measured scores]
Small e - scores clustered close to true score
Large e - scores all over the place! (hard to say what the true score is)
5
More on the error
• Measures with a large e are dodgy (hides the true score)
• We can reduce the size of e, but not eliminate it completely
• Measuring reliability is measuring the impact of e
6
Different forms of reliability
• Reliability (“effect of e”) can be very hard to conceptualise
• To help, we break it up into 2 subclasses
• Temporal stability: if I measure you today and tomorrow, do I get the same result?
• Internal consistency: are all the questions in the test measuring the same thing?
7
Temporal stability
• The big idea: If I test you now, and then I test you tomorrow, I should get the same result
• Why have it?
• Can’t measure changes otherwise!
• Tells us that we can trust results (small time error)
• Tells us that there is no learning effect
8
Measuring temporal stability
• How can we measure if a test is temporally stable?
• The problem: we have 2 sets of scores. We need to see if they are the same
• Solution: Use a correlation. If the two sets are strongly related, then they are basically the same
9
Example: Correlations & Stability
• Imagine a test with ten questions, and a person does it twice (on Monday and Wednesday):
• M: 5 6 5 3 4 8 6 4 8 7
• W: 4 7 4 3 4 8 6 3 9 5
• Are these scores the same? (r = 0.897)
10
Example: correlations & stability
• Now imagine a crappy scale:
• M: 5 6 5 3 4 8 6 4 8 7
• W: 5 3 2 2 8 5 8 1 4 4
• Are these scores basically the same? (r = 0.211)
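The two r values above can be reproduced with a plain Pearson correlation. The sketch below uses the Monday/Wednesday scores from the two examples; the function and variable names are ours.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


monday = [5, 6, 5, 3, 4, 8, 6, 4, 8, 7]
wednesday_stable = [4, 7, 4, 3, 4, 8, 6, 3, 9, 5]    # first example
wednesday_unstable = [5, 3, 2, 2, 8, 5, 8, 1, 4, 4]  # second example

print(round(pearson_r(monday, wednesday_stable), 3))    # 0.897
print(round(pearson_r(monday, wednesday_unstable), 3))  # 0.211
```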
11
Different approaches to stability
• There are two main ways of testing temporal stability
• Test-retest method: give the same test to the same people
• Alternate forms: give a highly similar test to the same people
12
Test-retest method
• Method:
• 1. Select a group of people
• 2. Give them your test
• 3. Get them to come back later
• 4. Give them the test again
• 5. Correlate the two sets of results to see whether they agree
13
Things to note
• It must be the same people: we want to know that if client X returns, we can measure that person again
• The amount of time between tests depends on your requirements
• The correlation value must be very high - above 0.85
14
Why it works
• We get 2 results from each person to compare
• This means we can rely on the test to work for the same people
• We use a lot of people in our assessment
• This means that we can rely on the test, regardless of who our client is
• The correlation tells us the degree to which the 2 tests agree (r² is the proportion of variance they share)
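As a worked illustration, take the r = 0.897 from the earlier Monday/Wednesday example (the rounding is ours):

```latex
r^2 = 0.897^2 \approx 0.805
```

So roughly 80% of the variance in one set of scores is shared with the other.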
15
The learning effect
• What if you have a test where learning/practice can affect your score? E.g. a class test
• The test-retest method will tend to yield poor correlations: people tend to score higher marks the second time around, and not all by the same amount
• This will make it look as if temporal stability is poor!
16
Alternate forms reliability
• Answer: do test-retest, but don’t use the same test twice
• Use a highly similar test
• In order for this to work, both forms must be equally difficult
• The more similar, the better
17
Making alternate forms of a test
• Simple to ensure both forms are equally difficult
• Make twice as many questions as you will want in the test
• Randomly divide them up into two halves
• Each half is a test!
• The random division ensures both forms are equally difficult
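The random division can be sketched in a few lines. The item pool, the question labels and the function name below are hypothetical; the point is only that a shuffle followed by a cut gives two forms of equal length with no item in both.

```python
import random


def make_alternate_forms(item_pool, seed=None):
    """Randomly divide an item pool into two equal-sized forms.

    If the pool has an odd number of items, the leftover item is dropped.
    """
    rng = random.Random(seed)
    shuffled = list(item_pool)  # copy, so the original pool is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]


# Hypothetical pool: twice as many questions as one form needs
pool = [f"Q{i}" for i in range(1, 21)]
form_a, form_b = make_alternate_forms(pool, seed=42)

print("Form A:", sorted(form_a))
print("Form B:", sorted(form_b))
```

Random assignment balances difficulty on average; with a small pool the two forms can still differ by chance.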
18
The procedure: alternate forms
• Once you have your 2 forms:
• Collect a sample of people
• Give them the first form of the test
• wait a while
• Give them the second form of the test
• Correlate the results
• If the correlation is high (> .85), you have stability
19
Which to use: alternate forms or test-retest?
• If you are measuring something which can be learned/perfected by practice - alternate forms
• If not, you could choose: test-retest is preferable, as it removes the confound about difficulty
• In many cases, you don’t really know if learning is an issue
• Alternate forms is “safer”, but poorer statistically
20
What if you don’t have temporal stability?
• Temporal stability is not required for all tests
• Most important for tests which work longitudinally
• Very important if you want to track changes over time
• Excludes all “once off” tests (e.g. aptitude tests)
21
Internal consistency
• A different type of reliability
• The big idea: Are all the questions in my test tapping into the same thing? (or, are some questions irrelevant?)
• All tests require this property
22
Why it’s important
• Imagine we have an arithmetic ability test, with 4 questions:
• 1. What is 5 x 3?
• 2. What is 12 + 2 - 5?
• 3. What is the capital of Ukraine?
• 4. What is 5 x 2 + 3?
23
Why it’s important
• Item 3 does not contribute to measuring arithmetic
• Someone who is a maths whiz (should get 4/4) might only get 3/4
• A complete maths idiot (should get 0/4) could get 1/4
• It does not belong in this test!
• If we include it in our total, it will confuse us
• Items such as this become “third variables”
24
How do we know if an item belongs?
• We need to figure out if a particular item is testing the same thing as the others
• We can correlate the item’s scores with the scores of some other item we do know belongs
• High correlation (above 0.85) - it tests the same thing
• Low correlation (below 0.85) - it measures something else
25
Our example again
• Some people who know maths will also know geography
• But not everyone!
• Correlate Q1 to Q3 - it will be weak
• Those who know arithmetic will know how to do the other items
• Correlate Q1 to Q2 or Q4, all will give a high correlation
26
Doing it for real
• Problem: how do we know which items are suspect?
• Any item could be at fault
• Not always obvious
• Solution - check them all
• Split half method
• Cronbach’s Alpha
27
Split half approach
• Basic idea: check one half of the test against the other half
• If the first half correlates well with the other half, then they are tapping into the same thing
• Problem to overcome: each half of the test must be the same difficulty
28
Split half - procedure
• Give a bunch of people your test
• Decide on how to split the test in half
• Correlate the halves
• If the correlation is high (above 0.85), the test is reliable
29
Where to split?
• Problem: how do we split the test?
• First 10 Q vs last 10?
• Odd numbered Q vs Even numbered Q?
• Any method is acceptable, as long as the halves are of equivalent difficulty
• How do you show that?
• Not by correlation - paradox! (low r could be difficulty or reliability!)
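The split-half procedure with an odd/even split can be sketched as follows. The response matrix is made up for illustration (rows are people, columns are items, 1 = correct), and the function names are ours; this is not data from a real test.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def odd_even_split_half(responses):
    """Correlate each person's total on odd-numbered items (Q1, Q3, ...)
    with their total on even-numbered items (Q2, Q4, ...)."""
    odd_totals = [sum(row[0::2]) for row in responses]   # Q1, Q3, Q5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # Q2, Q4, Q6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical data: 6 people x 6 items
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_halves = odd_even_split_half(responses)
print("split-half r:", round(r_halves, 3))
```

Swapping the slicing for `row[:3]` and `row[3:]` would give the first-half/last-half split instead, which is an easy way to see that different splits can give different answers.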
30
Cronbach’s coefficient
• A major problem with split-half approach
• How do you know that inside a half there aren’t a few bad items?
• Catches most, but not all
• Solution: Select another half to split at
• But: if you have the same number of bad items in each half, they balance out - hidden!
31
The splitting headache
• Imagine you have a few bad items, evenly spread in the test:
[Figure: the test items shown as bars; black bars are bad items]
• If you use a first 3/last 3 split, you end up with one bad item in each half, so they are balanced out (hidden)
• If you use an even/odd split, they are balanced out as well (hidden)
• How do you split?
32
A solution to splitting
• Remember: we don’t know which the bad ones are
• Can’t make bizarre splits to work around them
• Solution: brute force!
• Work out the correlations between every possible split, and average them out!
33
Cronbach’s α
• Not to be confused with α (prob of Type I error) from significance tests!
• In effect, works out the correlation between the two halves for every possible split, and averages them out
• Impossible for bad items to “hide” by balancing out
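In practice α is usually computed from the item variances and the variance of the total scores, rather than by enumerating every split. The sketch below assumes that standard variance formula; the response matrix is hypothetical (rows are people, columns are items) and the function name is ours.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha from a people x items matrix of scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    where k is the number of items.
    """
    k = len(responses[0])
    # Variance of each item (column) across people
    item_vars = [statistics.pvariance([row[j] for row in responses])
                 for j in range(k)]
    # Variance of each person's total score
    total_var = statistics.pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 6 people x 6 items, 1 = correct
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print("Cronbach's alpha:", round(alpha, 2))
```

Because every item's variance enters the sum, a bad item cannot be cancelled out by another bad item landing in the opposite half.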
34
Interpreting Cronbach’s α
• Usually gives numbers between 0 and 1 (negative values are possible when items are negatively related, and signal a problem)
• Needs to be very high (above 0.9)
• It is a measure of homogeneity of the test
• If your test is designed to measure more than one thing, the score will be low
35
Other forms of reliability
• Kuder-Richardson formula 20 (KR20)
• Like Cronbach’s alpha, but specialized for correct/incorrect type answers
• Inter-scorer reliability
• For judgement tests
• To what degree do several judges agree on the answer?
• Expressed as a correlation
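For reference, KR20 as listed above is usually written as below. The notation is ours, not the slides': k is the number of items, p_j is the proportion of people answering item j correctly, and σ²_X is the variance of the total scores.

```latex
\mathrm{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j\,(1 - p_j)}{\sigma_X^2}\right)
```

For a correct/incorrect item, p_j(1 - p_j) is that item's variance, which is why KR20 can be seen as the correct/incorrect version of Cronbach’s alpha.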