Effect Sizes and Power Review
Apr 01, 2015
Statistical Power
Statistical power refers to the probability of detecting an effect of a particular size when it actually exists
Specifically, it is 1 − β, where β is the type II error rate
– The probability of rejecting the null hypothesis if it is false
It is a function of type I error rate, sample size, and effect size
Its utility lies in helping us determine the sample size needed to find an effect size of a certain magnitude
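As a rough illustration of how power depends on the type I error rate, sample size, and effect size, here is a minimal sketch for a two-sided one-sample test. It assumes a normal approximation rather than the exact noncentral t distribution, and the function name approx_power is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value set by the type I error rate
    shift = d * sqrt(n)                 # where the test statistic is centered if H0 is false
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power rises with n for a fixed effect size and alpha
for n in (30, 72, 119, 162):
    print(n, round(approx_power(0.33, n), 3))
```

When d is 0 the function returns alpha itself, which is a handy sanity check: with no true effect, the only rejections are type I errors.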
Two kinds of power analysis
A priori
– Used when planning your study
– What sample size is needed to obtain a certain level of power?
Post hoc
– Used when evaluating a study
– What chance did you have of significant results?
– Not really useful
If you do the power analysis and conduct your analysis accordingly then you did what you could. To say after, “I would have found a difference but didn’t have enough power” isn’t going to impress anyone.
A priori power
Can use the relationship of n, d, and δ (the noncentrality parameter, i.e. what the sampling distribution is centered on if H0 is false) plus our specified α to calculate how many subjects we need to run
Decide on your α level
Decide on an acceptable level of power/type II error rate
Figure out the effect size you are looking for
Calculate n
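The four steps can be sketched in a few lines. This assumes a two-sided one-sample test and a normal approximation, so that the noncentrality value is the sum of two z quantiles and n = (δ / d)². The name required_n is hypothetical, and results can differ by a few subjects from values read off a printed power table.

```python
from math import ceil
from statistics import NormalDist


def required_n(d, power=0.80, alpha=0.05):
    """Approximate n for a two-sided one-sample test (normal approximation)."""
    z = NormalDist()
    # Steps 1 and 2: alpha and the desired power fix the noncentrality value
    delta = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Steps 3 and 4: the effect size converts that value into a sample size
    return ceil((delta / d) ** 2)


for target in (0.80, 0.95, 0.99):
    print(target, required_n(0.33, power=target))
```

Rounding up with ceil is a deliberate choice here: a fractional subject cannot be run, and rounding down would leave power slightly below the target.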
A priori Effect Size?
Figure out an effect size before I run my experiment?
Several ways to do this:
– Base it on substantive knowledge: what you know about the situation and scale of measurement
– Base it on previous research
– Use conventions
An acceptable level of power?
Why not set power at .99?
Practicalities
– Howell shows that for a one-sample t test and an effect size d of 0.33:
Power = .80, then n = 72
Power = .95, then n = 119
Power = .99, then n = 162
Cost of increasing power (usually done through increasing n) can be high
Howell’s general rule
Look for big effects
or
Use big samples
You may now start to understand how little power many of the studies in psych have considering they are often looking for small effects
Many seem to think that if they use the central limit theorem rule of thumb (n=30), which doesn’t even hold that often, that power is solved too
This is clearly not the case
Post hoc power: the power of the actual study
If you fail to reject the null hypothesis, you might want to know what chance you had of finding a significant result – defending the failure
As many point out, this is a little dubious
One thing we can understand regarding the power of a particular study at hand is that it can be affected by a number of issues, such as
– Reliability of measurement: an increase in reliability can actually result in power increasing or decreasing, as we will see later, though here I stress the decrease due to unreliable measures
– Outliers
– Skewness
– Unequal N for group comparisons
– The analysis chosen
Something to consider
Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate for our needs
– Mike’s sample size calculation for all studies: the sample size needed is the largest N you can obtain based on practical considerations (e.g. time, money)
Also, even the useful form of power analysis (for sample size calculation) involves statistical significance as its focus
While it gives you something to shoot for, our real interest regards the effect size itself and how comfortable we are with its estimation
Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem
Always a relationship
Commonly define the null hypothesis as ‘no difference’ or ‘no relationship’
There is always a non-zero relationship (to some decimal place) seen in sample data
As such obtaining statistical significance can be seen as just a matter of sample size
Furthermore, the importance and magnitude of an effect are not reflected in the p value (because of the role sample size plays in the probability value attained)
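To see how significance can become a matter of sample size, the sketch below holds a trivially small correlation fixed at r = .02 and only changes n. It assumes the usual test statistic t = r·sqrt(n − 2) / sqrt(1 − r²) and, for simplicity, a normal approximation to the p value; approx_p_value is a name invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_value(r, n):
    """Two-sided p value for a correlation r at sample size n (normal approximation)."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Same tiny effect, increasingly large samples
for n in (100, 1_000, 10_000, 100_000):
    print(n, approx_p_value(0.02, n))
```

The effect never changes, yet the p value drops below conventional cutoffs once n is large enough, which is why the p value alone says little about magnitude or importance.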
What should we be doing?
Want to make sure we have looked hard enough for the difference – power analysis
Figure out how big the thing we are looking for is – effect size
Effect Size
There are different ways to speak about the relationship between variables, but in general effect size refers to the practical, rather than statistical, significance
– This is what we are really interested in
No one cares about the statistical particulars if the effect is real and will change the way we think about things and how we act
However, the effect size, like our other measures, varies from sample to sample
– I.e. if we did a study 5 times, we would get 5 different effect sizes
So while we are primarily interested in effect size, we will need to be cautious in our interpretation there too, and use other available evidence to come to our final conclusions
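A small simulation makes the sample-to-sample variability concrete. It is only a sketch under made-up conditions: normally distributed scores, a true standardized difference of 0.5, 30 cases per group, and a fixed seed so the run is repeatable. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def standardized_difference(x, y):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def simulate_studies(true_d=0.5, n=30, studies=5, seed=1):
    """Run the same two-group study several times and collect the effect sizes."""
    rng = random.Random(seed)
    results = []
    for _ in range(studies):
        treatment = [rng.gauss(true_d, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        results.append(standardized_difference(treatment, control))
    return results


print([round(d, 2) for d in simulate_studies()])
```

Every simulated study draws from the same population, so any spread in the printed values is sampling error alone.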
Calculating effect size
Different statistical tests have different effect sizes developed for them
However, the general principle is the same
Effect size refers to the magnitude of the impact of the independent variable (factor) on the outcome variable
Thinking about effect size again
d family: Focused on standardized mean differences
– Allows comparison across samples and variables with differing variance
– Equivalent to z scores
– Note sometimes no need to standardize (units of the scale have inherent meaning)
r family: Variance-accounted-for
– Amount of variance explained versus the total
d family and r family
Example: Cohen’s d – Differences Between Means
Used with independent samples t test
Cohen initially suggested one could use either sample's standard deviation, since the two should be equal in the population according to our assumptions. In practice people now use the pooled standard deviation (the square root of the pooled variance).
Variations of this are for control group settings, dependent samples, more than two groups… but the notion of standardized mean difference is the same
$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}$$
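A worked example may help fix the idea of dividing the mean difference by the pooled standard deviation. All numbers here are invented for illustration: two groups of 20 with means 105 and 100 and standard deviations 8 and 12.

```latex
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
    = \sqrt{\frac{19(8^2) + 19(12^2)}{38}}
    = \sqrt{104} \approx 10.20

d = \frac{\bar{X}_1 - \bar{X}_2}{s_p} = \frac{105 - 100}{10.20} \approx 0.49
```

With equal group sizes the pooled variance is just the average of the two variances, so neither group's spread dominates the denominator.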
Cohen’s d – Differences Between Means
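Relationship to t
$$d = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Relationship to rpb
$$d = r_{pb}\sqrt{\left(\frac{n_1 + n_2 - 2}{1 - r_{pb}^2}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$$r_{pb} = \frac{d}{\sqrt{d^2 + (1/pq)}}$$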
p and q are the proportions of the total sample that each group makes up. With equal groups, p = .5 and q = .5.
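These conversions are easy to script. The sketch below assumes the standard forms d = t·sqrt(1/n1 + 1/n2), d = r_pb·sqrt((df / (1 − r_pb²))(1/n1 + 1/n2)) with df = n1 + n2 − 2, and r_pb = d / sqrt(d² + 1/(pq)). The function names are made up for this example.

```python
from math import sqrt


def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic."""
    return t * sqrt(1 / n1 + 1 / n2)


def d_from_rpb(r_pb, n1, n2):
    """Cohen's d from a point-biserial correlation."""
    df = n1 + n2 - 2
    return r_pb * sqrt(df / (1 - r_pb ** 2) * (1 / n1 + 1 / n2))


def rpb_from_d(d, p=0.5):
    """Point-biserial correlation from d; p is the proportion of cases in one group."""
    q = 1 - p
    return d / sqrt(d ** 2 + 1 / (p * q))


print(d_from_t(2.0, 8, 8))        # t = 2 with 8 cases per group
print(rpb_from_d(1.0))            # equal groups, so 1 / (pq) = 4
print(d_from_rpb(0.5, 10, 10))
```

Note that the last formula works from group proportions rather than degrees of freedom, so converting d to r_pb and back will not reproduce the starting value exactly in small samples.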
Characterizing effect size
Cohen emphasized that the interpretation of effects requires the researcher to consider things narrowly in terms of the specific area of inquiry
Evaluation of effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects
Even though rules of thumb exist, use them only as a last resort and be wary of “mindlessly invoking” these criteria
Association
A measure of association describes the amount of the covariation between the independent and dependent variables
It is expressed in an unsquared metric or a squared metric: the former is usually a correlation, the latter a variance-accounted-for effect size
We can apply the measure to continuous data (r and R²), categorical predictors with a continuous DV (η²), and strictly categorical settings (e.g. phi)
Again the notion is the same: a measure of linear association which, if squared, provides a measure of the variance in the DV that can be accounted for by the predictor
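For the categorical-predictor case, variance accounted for can be computed directly as the between-groups sum of squares over the total sum of squares. This is a minimal sketch with invented scores; eta_squared is a name chosen for the example.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance in the DV accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


# Made-up scores for three groups
print(eta_squared([[4, 5, 6], [6, 7, 8], [9, 9, 10]]))
```

With exactly two groups this quantity equals the squared point-biserial correlation, which is the sense in which the squared metric is the unsquared one squared.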
Case-level effect sizes for group differences
Indexes such as Cohen’s d and η² estimate effect size at the group or variable level only
However, it is often of interest to estimate differences at the case level
Case-level indexes of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point
– Examples: Cohen’s U indexes, common language effect size, tail ratios
Reference points can be relative (e.g., a certain number of standard deviations above or below the mean in the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test)
Note that all three effect size types apply to the group difference setting and can be converted from one to another; it is just a matter of preference as to which one we use for communication
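Two of these case-level indexes can be obtained from d directly if we are willing to assume normal distributions with equal variances, which real data may not satisfy. Under that assumption Cohen's U3 is Φ(d) and the common language effect size is Φ(d / √2). The function names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def cohens_u3(d):
    """Proportion of the lower group scoring below the mean of the higher group."""
    return NormalDist().cdf(d)


def common_language_es(d):
    """Probability that a random case from the higher group outscores one from the lower group."""
    return NormalDist().cdf(d / sqrt(2))


for d in (0.2, 0.5, 0.8):
    print(d, round(cohens_u3(d), 3), round(common_language_es(d), 3))
```

Both indexes equal .50 when d is 0, which gives a natural baseline for communicating how distinct two groups are.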
Confidence Intervals for Effect Size
Effect size statistics such as Cohen’s d and η² have complex distributions
General form is the same as any CI
$$\text{statistic} \pm (\text{critical value} \times \text{std. error})$$
Traditional methods of interval estimation rely on approximate standard errors assuming large sample sizes
We need a computer program to help us find the correct noncentrality parameters to use in calculating exact confidence intervals for effect sizes
Both standalone programs (Steiger) and statistical packages (R) can do this for us, and thus provide a measure of effect while noting the uncertainty with that estimate
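For contrast with the exact approach, here is a sketch of the traditional approximate interval for Cohen's d. It assumes the common large-sample standard error sqrt((n1 + n2)/(n1·n2) + d²/(2(n1 + n2))) and a normal critical value, so it is not the noncentrality-based interval that dedicated software computes. The name approx_ci_for_d is invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def approx_ci_for_d(d, n1, n2, level=0.95):
    """Large-sample approximate confidence interval for Cohen's d."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # statistic plus or minus (critical value times standard error)
    return d - z_crit * se, d + z_crit * se


lower, upper = approx_ci_for_d(0.5, 50, 50)
print(round(lower, 2), round(upper, 2))
```

Even with 50 cases per group the interval is wide, a reminder that an observed "moderate" d is compatible with a range of population effects.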
Limitations of effect size measures
Variability across samples
– No more a limitation than other statistics, but one needs to be fully aware of this
– Just because you found a moderate effect doesn’t mean that there is one
Standardized mean differences:
– Heterogeneity of within-conditions variances across studies can limit their usefulness; the unstandardized contrast may be better in this case
Measures of association:
– Correlations can be affected by sample variances and whether the samples are independent or not, the design is balanced or not, or the factors are fixed or not
– Also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections)
– Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance
How to fool yourself with effect size estimation:
1. Measure effect size only at the group level
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area
3. Believe that an effect size judged as “large” according to generic definitions must be an important result and that a “small” effect is unimportant
4. Ignore the question of how theoretical or practical significance should be gauged in your research area
5. Estimate effect size only for statistically significant results
6. Believe that finding large effects somehow lessens the need for replication
7. Forget that effect sizes are subject to sampling error
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study
9. Forget that standardized effect sizes encapsulate other quantities such as the unstandardized effect size, error variance, and experimental design
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published
Recommendations
Report effect sizes along with statistical significance
Report confidence intervals
Use graphics
Use common sense combined with theoretical considerations
Do not rely on any one result to support your conclusions