
Behav. Res. Ther. Vol. 34, No. 5/6, pp. 489-499, 1996
Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0005-7967(95)00082-8
0005-7967/96 $15.00 + 0.00

INVITED ESSAY

STATISTICAL POWER: CONCEPTS, PROCEDURES, AND APPLICATIONS

MARK HALLAHAN* and ROBERT ROSENTHAL
Department of Psychology, Harvard University, 33 Kirkland St., Cambridge, MA 02138, U.S.A.

(Received 11 September 1995; in revised form 23 November 1995)

Summary--This paper discusses the concept of statistical power and its application to psychological research. Power, the probability that a significance test will produce a significant result when the null hypothesis is false, often is neglected with potentially serious consequences. The concept of power should be considered as part of planning and interpreting research. This article provides explication of the concept of power and suggestions for researchers to increase the power of their investigations. Copyright © 1996 Elsevier Science Ltd.

INTRODUCTION

Failing to consider the concept of statistical power when planning and interpreting empirical studies often results in our drawing erroneous conclusions from the data, specifically, by overlooking interesting and important effects and by leading one to prematurely give up on promising avenues of investigation. This article explains the concept of power and the potentially serious consequences of ignoring power when planning and interpreting research. Additionally, a number of ways to increase power are described and suggestions are offered for integrating the concept of power into our overall strategy of data analysis.

THE CONCEPT OF POWER

Researchers collect data from samples in order to make generalizations about a larger population. One way to think about this is that researchers observe the extent to which an effect exists in a sample in order to estimate the magnitude of that effect in a larger population. The key word here is estimate. Even highly accurate estimates are not without some error. Sometimes the population effect size† will be overestimated and sometimes it will be underestimated by a sample. The degree of possible error depends on the nature of the sample. Small samples and samples with large amounts of variability between observations will provide less precise estimates of population effects than large samples and samples with little variability.

Significance testing takes a somewhat different approach. Significance tests are concerned with how likely it is that an obtained effect size would have occurred if the null hypothesis (H0) were true. In most cases the null hypothesis is that there is no effect in the larger population (the population effect size equals zero), but non-zero nulls also are possible.‡ By convention, when it is sufficiently unlikely that sample data would have been observed if the null were true, researchers assume that the null is probably false. However, this probabilistic approach necessarily involves some error. With P < 0.05 as the critical value (α), researchers are willing to believe that the null

*Author for correspondence.
†Examples of effect sizes include the difference in proportion of patients improved in a treatment versus control condition, the difference between group means (usually standardized by the standard deviation), and the correlation between two levels of a treatment variable and an outcome variable.
‡For example, if it were well-known that treatment A is better than a placebo treatment by an amount d = 0.30, we might set the null value to d = 0.30 in our comparison of a new treatment (B) with a placebo treatment. In this case we would "reject" the null if our obtained effect of treatment B were significantly greater than d = 0.30. Conceptually, this is very much like a null value of 0.00 but with the definition of the effect size (es) changing to es = d_A - d_B = 0.00.


hypothesis is not true if they observe an effect size that would be likely to occur less than 5% of the time if the null were true. This means that for 95% of the samples drawn from a population in which the null hypothesis were true, the observed significance level would not be below the critical P ≤ 0.05 level and it would be inferred correctly that the null hypothesis is true. However, 5% of the samples from this population will be below P = 0.05 and the researchers will erroneously believe the null is false when in fact it is not. This type of error, believing that the null hypothesis is false when it actually is true, is called type I error. The probability of making a type I error is equal to the critical level of significance (α) that is used.

Many people, including Cohen (1990), have pointed out that in reality the null hypothesis is almost never true. A null hypothesis that two groups do not differ could be expressed as d = 0.00, where d is the difference between two means divided by the common σ. With a sufficiently large sample, it would be a statistically rare event to observe even exceedingly small effects (e.g. d = 0.01, d = 0.0001, d = 0.000001) in a sample from a population in which the true value of d = 0.00. In addition to almost never being true, the null hypothesis is rarely very interesting. Researchers typically test a null hypothesis with the hope of being able to reject it so that they may infer that their data better fit another, more interesting hypothesis.

Given that the null hypothesis is usually neither true nor scientifically interesting, perhaps it may be more fruitful to consider the types of errors that are made in significance testing when the null hypothesis is not true. If the null hypothesis is false and the significance test for the sample has a P > 0.05 then (as the conventional wisdom goes) researchers believe that they do not have sufficient evidence to reject the null hypothesis. This is an example of type II error (β), or failing to reject the null hypothesis when it is false and should be rejected. Table 1 illustrates type I and type II errors. Power is the probability of not making a type II error (1 - β). In other words, statistical power is the probability of detecting an existing non-zero effect size. By 'detecting' we mean that a significance test for an effect will yield a P-value that is at or below a stipulated critical level (α) when the effect is non-zero in the larger population.

The power of a particular test is determined by three factors: the number of observations, the size of the effect in the larger population, and the α level that is set. It may be useful to keep in mind the relationship between these parameters and power. Power increases with effect size--it is easier to detect a large effect than it is to detect a small effect. Power increases with sample size--it is easier to detect an effect with more observations than it is to detect an effect with fewer observations. Power increases with α leniency--it is easier to detect an effect with a lenient criterion (e.g. P = 0.20) for detection than it is with a stringent criterion (e.g. P = 0.001) for detection.
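These relationships can be made concrete with a small calculation. The sketch below is illustrative only (the function name, example values, and the use of SciPy are not from the text): it computes the power of a two-sided, two-sample t test with equal group sizes from the noncentral t distribution, and shows power rising with effect size, with sample size, and with a more lenient α.

```python
import numpy as np
from scipy import stats

def two_sample_t_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t test with n_per_group Ss in each of two groups."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2.0)      # noncentrality: expected location of t when d holds
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # critical value for the chosen alpha
    # Probability that |t| exceeds the critical value when the true effect is d
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Power increases with effect size ...
print(round(two_sample_t_power(0.2, 30), 2), round(two_sample_t_power(0.8, 30), 2))
# ... with sample size ...
print(round(two_sample_t_power(0.5, 20), 2), round(two_sample_t_power(0.5, 100), 2))
# ... and with a more lenient alpha
print(round(two_sample_t_power(0.5, 30, alpha=0.001), 2),
      round(two_sample_t_power(0.5, 30, alpha=0.20), 2))
```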

THE NEGLECT OF POWER

Many psychology researchers do not seem to consider power. For example, Cohen's (1962) survey of the statistical power of 70 studies that were published in the Journal of Abnormal and Social Psychology in 1960 found that the median power of these studies to detect medium sized effects (e.g. d = 0.50) was only 0.46 (with α = 0.05, two-tailed). If the effects that were investigated by the authors of these studies did exist and were of medium size, more than half of these studies had less than a fifty-fifty chance of observing a significant result. Cohen's (1962) study pointed out that his contemporary psychology researchers did not seem to be considering power when designing and carrying out their research. A study with a power of 0.46 has a 54% chance of making a type II error if the null were false, which is nearly 11 times the 5% chance of making a type I error (with α = 0.05) if the null were true.

Table 1. Illustration of type I and type II error

                                          In Population
In Sample (with α = 0.05)       Null is true (e.g. e.s. = 0.00)           Null is false (e.s. ≠ 0.00)
P > 0.05 (do not reject null)   No error (rightly do not reject null)     Type II error (β) (wrongly do not reject null)
P < 0.05 (reject null)          Type I error (α) (wrongly reject null)    No error (rightly reject null)


Is it important to pay attention to power? Apparently many major funding agencies think so. Power analyses are starting to be requested routinely as part of many grant applications, perhaps in response to the problems that the neglect of power can have for research. The consequences of neglecting power are twofold. First, by not considering power while planning research, researchers may design studies that have little chance of detecting an effect that exists, where detection is defined by the α set by the investigator. In so doing, they risk devoting valuable time and resources to research that is not likely to reject the null hypothesis at the specified level of significance. Second, by not considering power when interpreting results, researchers may give up prematurely on promising avenues of investigation. This is a consequence of mistakenly interpreting a non-significant result to mean that the null is true, regardless of a sample's ability to detect an existing non-null effect.

A hypothetical example illustrates these problems. A researcher thought that a newly developed treatment might improve the cognitive functioning of people who have suffered strokes. To test this question he randomly assigned 20 patients to receive the new treatment and 20 other patients to a control group that received a standard program of post-stroke rehabilitation. After a period of time, the cognitive functioning of patients in both groups was measured. As predicted, the treatment group did have a higher average level of cognitive functioning than the control. The size of the difference between the two groups (0.4 standard deviations) was not small. However, the researcher was disappointed by the results of the requisite significance test. The P value for the t-test that was performed (P = 0.225, two tailed) was not below the critical 0.05 criterion. From this the researcher inferred that "the difference between treatment and control was not more than would be expected by chance if the groups were identical", and was disappointed to have spent so much time and resources testing a treatment that "provides no additional benefit".

It appears that this researcher did not realize how much the experiment he designed was dependent on luck. The statistical power of a t test with α = 0.05, two-tailed, and 20 Ss in each condition to detect a difference of 0.4 standard deviations is only 0.23, with a corresponding probability of committing a type II error equal to 1.00 - 0.23 = 0.77. In other words, if the treatment actually did produce a 0.4 standard deviation increase in cognitive functioning, fewer than one-in-four samples of this size would produce a result that was significant at the 0.05 level, two tailed. The odds were stacked against this researcher from the start. By planning a study with so few observations, he had little chance of observing a significant result. The researcher's conclusion that the new treatment provided no additional benefit was his second mistake. He was giving up, convinced by the non-significant result that the treatment did not work, when he should have been going back to the hospital to test more patients.
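The numbers in this example are easy to reproduce with a standard power routine. The following sketch assumes the statsmodels package (its TTestIndPower class handles the independent-samples t test); the "roughly 100 per condition" figure is our own calculation, offered only to illustrate what a planning analysis would have shown.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of the stroke study as described: d = 0.4, 20 Ss per condition, alpha = 0.05, two-tailed
power = analysis.solve_power(effect_size=0.4, nobs1=20, alpha=0.05,
                             ratio=1.0, alternative='two-sided')
print(round(power, 2))       # roughly 0.23, as stated in the text

# How many Ss per condition would have been needed to reach power = 0.80?
n_needed = analysis.solve_power(effect_size=0.4, power=0.80, alpha=0.05,
                                ratio=1.0, alternative='two-sided')
print(math.ceil(n_needed))   # on the order of 100 per condition by this calculation
```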

THE BACKGROUND OF POWER

The history of how significance testing came to be adopted by the field of psychology may provide some insight into why psychologists seem to pay less attention to power and type II error relative to the null hypothesis and type I error. Gigerenzer and Murray (1987) reported that the field of psychology first began using significance tests widely during the 'inference revolution' (p. 20) that occurred between 1940 and 1955. Around that time, strong dissension existed among statisticians about the kind of inferences that could be made from significance tests. The nature of these disagreements is described in detail by Gigerenzer and Murray (1987). Sir Ronald Fisher's dispute with Jerzy Neyman and Egon Pearson about significance testing and the null hypothesis was most central to the concept of power. In short, Fisher's approach to significance testing (e.g. 1966, 1973) focused on testing a null hypothesis whereas the Neyman-Pearson approach (1933) specified both null and alternative hypotheses.

The concepts of power and type II error are central to Neyman-Pearson but not to Fisher. However, Fisher's views received wider circulation among psychologists through Snedecor's widely read Statistical Methods (1937). Psychology ignored the substantial incompatibilities of the Fisher and the Neyman-Pearson approaches and, instead, assimilated some of Neyman and Pearson's ideas with Fisher's to create a seemingly coherent, seemingly uncontroversial " . . . single, hybrid theory of which neither Fisher nor, certainly, Neyman and Pearson would have approved" (Gigerenzer & Murray, 1987, p. 21). Although the statistical texts that psychologists used at that


time may have mentioned some Neyman-Pearson concepts, like type II error, they did not attribute these concepts to their founders, nor mention the controversies surrounding them. For example, Guilford (1956, p. 217) mentioned power, but did not attempt to discuss it because it was thought to be too complex. Thus, the way in which psychologists have been taught to analyze data may have given prominence to null hypothesis testing and the avoidance of type I error at the expense of power analysis and the avoidance of type II error. This imbalance seems to be reflected in the asymmetry between the probability of type I and type II errors in psychological research.

COHEN'S CONTRIBUTIONS

Jacob Cohen has done more than anyone to educate the field of psychology about the importance of power and to dispel the confusion and misunderstanding that surrounds the concept. His writings explain why power is important and demonstrate that researchers have paid little attention to power. Further, his power tables make it easy to determine the level of power for a given study. These contributions (and many others) of Cohen to psychological methodology are enormous, but sadly his wisdom seems to have gone unheeded by many. Sedlmeier and Gigerenzer's (1989) recent analysis of the studies that were published in the Journal of Abnormal Psychology in 1984 found the power level of these studies to be nearly identical to when Cohen (1962) examined its predecessor journal over two decades earlier (in 1984, median power = 0.44). If anything, the typical power of current research is somewhat worse than when Cohen first examined this question because of the use of α-adjustment procedures, which were little used in 1960. When Sedlmeier and Gigerenzer (1989) adjusted their power computations for the effect of these procedures, the median level of power dropped to 0.37. Additionally, a recent survey of psychologists' statistical knowledge (Zuckerman, Hodgins, Zuckerman & Rosenthal, 1993) found that most respondents answered questions involving power or type II error incorrectly (39% and 47% correct for two questions). Cohen himself (1990) philosophically acknowledges that change can come slowly in terms of psychology adopting methodological advances. He gives the example of the t test, one of the most widely used statistics in psychological research, that was first published in 1908 (Student, 1908) but did not appear in psychological statistical texts until after World War II.

POWER ANALYSIS

Cohen's Statistical Power Analysis for the Behavioral Sciences (1977, 1988) is the definitive source on power and an invaluable resource for anyone interested in doing power analysis. This book provides detailed tables that make it easy to find: (a) the power of a study, and (b) the number of observations required to achieve a given level of power. These tables allow users to answer power questions for a wide range of significance tests and their utility is increased further by the fact that they can be adapted easily for specific cases such as non-independent samples and single sample tests. For more technical readers, Cohen provides an appendix with the formulae that were used to produce his tables. However, Cohen's classic text (1977, 1988) is by no means the only available resource on power. Cohen (1992) also has a brief introductory article on power that introduces the concept, includes a table to answer basic power questions, and provides clear, simple instructions for performing power analysis. Other recent texts (e.g. Kraemer & Thiemann, 1987; Lipsey, 1990) discuss power and provide tables to perform power analysis. Additionally, a (non-random) sampling of current statistics/research methods texts for psychology reveals that many have both chapters devoted to power and tables that researchers can use to determine power (e.g. Aron & Aron, 1994; Howell, 1995; Rosenthal & Rosnow, 1991; Welkowitz, Ewen & Cohen, 1991).

Consulting power tables can be useful for planning research and interpreting results. When planning a study, power analysis helps researchers to plan studies that are adequately sensitive to detect predicted effects. When analyzing results, power analysis also can be informative, especially when the significance tests are not below the critical level of α required to reject the null hypothesis. In this context, power analysis can answer two useful questions. First, one would want to know


the power of the significance test that was performed, or given the obtained effect size, what was the chance of rejecting the null hypothesis. This information guides the interpretation of the results. With especially low power a nonsignificant result means very little,* but with especially high power a nonsignificant result means that it is likely the effect being investigated is quite small. Second, one would want to know what number of Ss would be required to reach a given power level for the obtained effect size. This information can help in the planning of future studies investigating the research question.

Practical questions

Power analysis involves estimating one of four parameters--(a) significance level (α), (b) power, (c) effect size, and (d) number of Ss--from the other three. Determining the power of a study that already has been conducted is a fairly straightforward task. One then knows the number of Ss, the obtained effect size, and the significance level that was used. However, determining in advance the number of Ss required for a given level of power requires researchers to specify the significance level, the desired level of power, and the expected effect size.

Cohen (1977, 1988, 1992) has provided much good advice for how to specify these parameters. The convention to use α = 0.05 is quite strong, and Cohen (1965, 1977, 1988, 1992) suggests power = 0.80 as a sensible goal for research. Both are reasonable standards as long as they maintain their conventional status and are not applied in a slavish, absolute, or uncritical manner. A researcher who is mindful of the relative costs and benefits of type I and type II errors for a specific research question may decide that different levels of power or α are more appropriate in that context.
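With these two conventions fixed, the remaining planning question is the sample size needed for a given expected effect. A minimal sketch, again assuming statsmodels and using a hypothetical expected effect of d = 0.50 (our choice, for illustration only):

```python
import math
from statsmodels.stats.power import TTestIndPower

# Conventional alpha = 0.05 (two-tailed) and target power = 0.80;
# the expected effect size d = 0.50 is a hypothetical value the researcher has specified.
n_per_group = TTestIndPower().solve_power(effect_size=0.50, power=0.80,
                                          alpha=0.05, alternative='two-sided')
print(math.ceil(n_per_group))   # about 64 Ss per condition by this calculation
```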

Effect size expectation

For many researchers, the most uncertain part of power analysis involves specifying the expected effect size prior to conducting a study. This article highlights a number of ways to arrive at a reasonable estimate of the expected effect for a planned study.

1. Consult existing research. Studies that have addressed similar questions or have used a similar paradigm may provide a reasonable estimate of the size of the effect that would be expected in a planned study. Simple meta-analytic procedures (Rosenthal, 1991b) could be used to find the average effect that was observed in the existing relevant studies. The expected effect size for the planned study could be based on the average effect that has been found in similar research.

2. Rely on preliminary data. Many researchers conduct preliminary or pilot research prior to conducting an extensive research project. In addition to providing an opportunity to test and to fine-tune experimental procedures, pilot research generates data that can be used to estimate the size of the effect that would be observed in a larger study. The effect size from pilot data might provide a reasonable estimate of what would be observed in a larger study.

3. Subjective Estimation. In situations where a researcher has no pilot data and absolutely nothing in the existing research literature relates to the planned study, it may not be unwise to make an educated guess about the size of the expected effect. Presumably, the researcher has some intuition as to what the results of the research might be (or else why would the study have been planned?). Of course, the value of guessing is questionable. A speculative, subjective estimate may not be accurate and, without data to support it, who would believe it anyway? However, in some ways, it is an enviable situation to be planning a study for which there is absolutely no prior information to estimate the size of the expected effect. The data from the planned study may be quite valuable, as they are the first information available about the effect size of a potentially interesting phenomenon.

*It is crucially important that power be considered when interpreting the results of research in which the null is a research hypothesis. With low power, one would be unlikely to reject the null even if it were false, but this failure to reject does not mean that the null is true. Disturbingly, Sedlmeier and Gigerenzer (1989) found that studies with the null as research hypothesis had very low power. In 1984, 7 of the 56 articles in the Journal of Abnormal Psychology had the null hypothesis as at least one of their research hypotheses. The median power of these studies to reject the null was an incredibly low 0.25. It is a serious error to infer that the null is true on the basis of a test that has little chance of rejecting a genuinely false null.


4. Cohen's advice. Again, Cohen (1977, 1988, 1992) provides some very reasonable advice that could be used to help estimate the expected effect size for a planned study in the absence of more specific information. For each type of effect size for which Cohen has created power tables, he also suggests the size of effects that could be considered small, medium, and large. For example, Cohen suggests that d = 0.20 would be a small effect, d = 0.50 would be a medium effect, and d = 0.80 would be a large effect. These benchmarks may be useful in estimating the expected effect for a planned study. Cohen makes it clear that his suggestions are meant to guide researchers' own judgment about their data, not replace it with hard-and-fast rules that can be applied without thought. These conventions for small, medium, and large effect sizes are " . . . recommended for use only when no better basis for estimating the effect size index is available" (Cohen, 1977, p. 25). In cases where there is no previous information upon which to base an estimate, Cohen points out that it might be reasonable to expect a small effect because, without any previous work having been done in an area, the phenomenon of interest probably is not under good experimental control, nor are the available measuring instruments likely to be especially precise.

5. Cost-benefit analysis. In some cases, especially in applied research, it may be appropriate to select an expected effect size based on cost-benefit analysis. An 'implementation threshold effect size', or the degree of effectiveness at which an intervention's anticipated benefits would justify its implementation cost, could be determined. Using this effect size for power analyses would ensure that a planned study had sufficient power to detect the minimum effect size considered important.

6. Avoid backing into an estimate. We suggest avoiding the following approach for estimating expected effect sizes. Imagine a researcher with access to 50 Ss who wanted his or her planned study to have a 'socially desirable' level of power. Perhaps he or she has heard something like "granting agencies only fund studies with high levels of power" or "Cohen says power = 0.80 is desirable". This researcher could consult Cohen's tables (1977, 1988) to learn that a two-tailed t test with 50 Ss has power ≈ 0.80 to detect an effect size d = 0.80. Using the tables in this way, the researcher could 'estimate' the expected effect size to be d = 0.80 and then claim to have planned a study with high power. However, using the power tables to 'back into' an expected effect size can be problematic because of the potential for self-delusion. For example, if existing research or pilot data would have suggested an expected effect size d = 0.30, the actual power of the study would be much worse. In fact, a t test with 25 Ss per condition has less than a 20% chance of detecting an effect size d = 0.30; approximately 175 Ss per condition (7 times larger than the planned study) would be required to have power = 0.80.
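The self-delusion in this example is easy to check numerically. The sketch below (assuming statsmodels; the specific calls are ours, not the authors') contrasts the power of the 50-Ss study under the 'backed into' d = 0.80 with its power under the more realistic d = 0.30, and then asks how many Ss would actually be needed.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power with 25 Ss per condition (50 total) if the effect really were d = 0.80
p_large = analysis.solve_power(effect_size=0.80, nobs1=25, alpha=0.05, alternative='two-sided')

# Power with the same 25 Ss per condition if the realistic effect is d = 0.30
p_small = analysis.solve_power(effect_size=0.30, nobs1=25, alpha=0.05, alternative='two-sided')

# Ss per condition actually needed for power = 0.80 when d = 0.30
n_needed = analysis.solve_power(effect_size=0.30, power=0.80, alpha=0.05, alternative='two-sided')

print(round(p_large, 2))     # close to 0.80
print(round(p_small, 2))     # below 0.20
print(math.ceil(n_needed))   # roughly 175 per condition, as in the text
```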

INCREASING POWER

Not only is it easy to determine the power of a study, there are also a number of steps that can be taken to increase power. Working with the power tables likely suggests one obvious way to increase power: increasing the number of observations in a study. Although it certainly is desirable to have large samples, increasing sample size is just one of many ways to increase power. In some cases it may not be possible to increase sample size because Ss are rare, difficult to recruit, or expensive. In such cases, researchers are constrained to work with a small number of Ss, but can achieve reasonable levels of power through other means. Even when Ss are readily available, it is important to be aware of the full range of ways to increase power. Not only is it important to have sufficient power to detect predicted effects, but it is also important to achieve power efficiently.

Efficiency is the ability to maximize power against various cost constraints. This should be considered in the course of planning a study in order to understand where there is leverage to improve the overall quality of a planned study. For example, in cases where Ss are particularly difficult to obtain a researcher might do well to invest attention and resources to minimize experimental error. However, in cases where Ss are plentiful, it may be relatively better, in terms of increasing power, to devote attention and resources to increasing sample sizes. Each research question has its own unique set of scientific, logistic, and cost constraints that influence how a researcher best may go about maximizing power.

As Table 2 illustrates, power can be increased in many ways and at many points in the research process, including design, data analysis, and the use of meta-analytic procedures.


Table 2. Ten procedures for increasing power

Design
1. Increase sample sizes
2. Administer stronger treatments
3. Avoid restriction of range for dependent variables
4. Standardize experimental procedures
5. Use more reliable measuring instruments
6. Use more homogenous subject populations
7. Use blocking variables
8. Use repeated measures designs (the ultimate blocking variable)

Analysis
9. Use focused contrasts rather than omnibus tests

Cumulation
10. Combine results of individual studies

Design

Considering the parameters that determine power--sample size, α, and effect size--provides a good framework to think about how to increase power. Any test of significance is determined by the size of the observed effect and the size of the sample:

significance test = size of effect × size of study.

Thus, the power of a study will be affected by any action that has implications for any of these three parameters. Increasing the number of observations in a study would increase power, as would setting α to a less stringent level, although that may not be realistic practical advice in a world that holds P = 0.05 in such high regard. Also, any steps that would increase the observed effect size would increase power.
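One concrete version of this decomposition, offered as our own illustration with hypothetical numbers: for two equal groups of n Ss each, with d standardized by the pooled standard deviation, t = d × sqrt(n/2), so the same t (and roughly the same P value) can come from a large effect in a small study or a small effect in a large study.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers illustrating significance test = size of effect x size of study:
# for two equal groups of n Ss each, t = d * sqrt(n / 2) when d uses the pooled SD.
for d, n in [(0.8, 20), (0.4, 80), (0.2, 320)]:
    t = d * np.sqrt(n / 2)
    p = 2 * stats.t.sf(t, df=2 * n - 2)
    print(f"d = {d:.1f}, n per group = {n:3d}: t = {t:.2f}, two-tailed P = {p:.3f}")
```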

Various things could be done to increase effect size and, thus, increase power. These steps can be organized in terms of the factors that determine effect size: (a) the extent to which observations differ as a function of an experimental variable, also known as 'signal'; and (b) the amount of error variance against which the effect is compared, also known as 'noise'. The effect size d provides a good illustration of this relationship. The numerator, m1 - m2, represents the variability between experimental conditions, and the denominator, σ, represents the variability among observations within experimental conditions:

effect size = variability between experimental conditions / within-condition variability, e.g. d = (m1 - m2)/σ.

Anything that increases between-condition variability will increase effect size and thereby increase power. For example, increasing the strength of a treatment should increase the difference between conditions. Thus, if one were studying the relationship between amount of therapeutic contact and improvement, the difference in improvement should be larger between a 40-min interview and a 5-min interview than between a 20-min interview and a 15-min interview. Also, we want to avoid restriction of range. The size of a relationship between two variables will be larger in a sample that fully represents the range of scores for the dependent variable than in a sample with a narrowly restricted range for that variable. For example, the size of the correlation between exercise and heart rate would likely be smaller in a sample of elite marathon runners than in the general population.

Anything that reduces within-condition variability will increase effect size and, thereby increase power. Within-condition or error variance can be reduced in many ways. Efforts to standardize experimental procedures reduce variance due to differences in the conditions under which Ss performed an experimental task. Also, using more reliable measuring instruments reduces variance due to measurement error. The term measuring instrument refers broadly to anything used to obtain a measurement on a variable of interest, ranging from reaction time, to heart rate, to a construct that is measured with a paper-and-pencil scale. Regardless of what is being measured, low reliability reduces the size of observed effects which reduces power. Subject-to-subject differences are another source of within-condition variability. One strategy to reduce subject variance is to use a relatively homogenous subject population. Another would be to use blocking variables to reduce error variance. These are known variables other than the primary independent variables that also are related to the dependent variable. The use of blocking variables increases


effect size because variance in the dependent variable that is due to the blocking variables is effectively removed from the within condition variance. Repeated-measures designs are especially powerful because they employ 'the ultimate blocking variable'--the individual S.

"Student" (193 l) provided an early example of the potential to improve power through research design. According to his analysis, an experiment comparing the height and weight of children who received raw or pasteurized milk, with about 5000 children in each condition, could have achieved the same level of power with only 50 sets of identical twins, with one twin being assigned to each condition. This dramatic increase in power would have resulted because the amount of variance in the height and weight of two identical twins is so much less than between 2 randomly chosen children.

A more recent example of the potential to improve power through research design has special application to the study of dyads. For a given number of Ss, the round robin block design is more powerful than alternative designs. In this design, Ss are assigned to blocks. Within each block, Ss are paired in dyads with every other person in their block. The power advantage of this design is twofold. First, it generates large numbers of dyads with relatively few Ss. For example, a 4-person round robin block generates 6 dyads and a 6-person round robin block generates 15 dyads. This compares to a non round robin design where only 2 dyads are formed from 4 people and only 3 dyads are formed from 6. Second, as in repeated-measures, there is less variability among dyads within round robin blocks than among non round robin dyads because every dyad in a round robin block is created from the same set of people. Recently developed procedures make it easier to analyze data from the round robin block design (e.g. Kenny, 1994; Kenny & La Voie, 1984; Li, Hallahan & Rosenthal, 1995), which should increase the use of this power efficient design in research on dyadic interaction.

Analysis

For comparisons of multiple groups, the use of focused contrasts is preferable to the use of unfocused or omnibus significance tests. Analysis of variance F tests with df > 1 in the numerator or chi-square tests with df > 1 are examples of omnibus tests. In a multiple group comparison, such as, for example, a comparison of scores on a test of cognitive performance for children of 5 different ages, a contrast testing a focused hypothesis, such as performance increases with age, would be more likely to produce a significant result than the omnibus F test for the analysis of variance comparing the scores for the 5 age groups, assuming that the data corresponded reasonably to the predicted trend.

The power advantage of contrasts comes from asking a specific question over a diffuse one. In effect, contrasts can concentrate the between-group variance in a single, focused prediction in a way that a diffuse test cannot. Contrasts consider the pattern of group means in addition to their overall variance. For example, it would not matter for an omnibus test whether the scores for 5 age groups increased from youngest to oldest in a clear, meaningful pattern or if they differed in a seemingly random pattern, but it would matter very much for a contrast.

However, it should be noted that the most important feature of contrasts is not their power, but rather that they can address scientifically meaningful questions in a way that unfocused tests generally cannot. Contrasts test specific questions that correspond precisely to scientifically meaningful relationships (e.g. performance increases with age), whereas unfocused tests tell only whether groups differ in some unspecified manner (e.g. level of performance is not identical across 5 levels of age). See Koutstaal and Rosenthal (1994) or Rosenthal and Rosnow (1985) for a more detailed discussion of contrast analysis.
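A simulated illustration of this point (our construction, not the authors' data): five 'age groups' whose means rise steadily are compared with an omnibus one-way F test and with a focused linear-trend contrast. With data that follow the predicted trend, the contrast will typically yield the smaller P value.

```python
import numpy as np
from scipy import stats

# Simulated scores for 5 age groups (n = 8 per group) whose means rise steadily with age
rng = np.random.default_rng(0)
true_means = [10, 11, 12, 13, 14]
groups = [rng.normal(m, 4, size=8) for m in true_means]

# Omnibus question: "do the 5 groups differ in some unspecified way?"
F, p_omnibus = stats.f_oneway(*groups)

# Focused question: "does performance increase with age?" (linear contrast weights)
weights = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
group_means = np.array([g.mean() for g in groups])
n_per_group = 8
df_within = sum(len(g) - 1 for g in groups)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_within
contrast = weights @ group_means
t_contrast = contrast / np.sqrt(ms_within * (weights ** 2).sum() / n_per_group)
p_contrast = stats.t.sf(t_contrast, df_within)   # one-tailed, in the predicted direction

print("omnibus F p =", round(p_omnibus, 3), "  linear contrast p =", round(p_contrast, 3))
```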

Cumulation

The use of meta-analytic procedures to cumulate the results of individual studies increases the probability that effects that exist in nature will not be overlooked because individual studies were unable to reject the null hypothesis at a given level of statistical significance. Cumulating research results in this manner effectively increases the number of observations that can be brought to bear in testing an hypothesis.

Consider this example. A student (Student A) proposed an experiment with an intriguing hypothesis, a clever design, and a straightforward test of the experimental hypothesis--a t test


Table 3. Individual and cumulative results for 3 studies

              n(a)    d       Power(b)   P(c)    Z
Student A     13      0.48    0.32       0.13    1.15
Student B     11      0.57    0.37       0.11    1.23
Student C     12      0.52    0.35       0.12    1.19
Cumulative    36      0.52    0.71       0.02    2.06(d)

(a) Number of Ss in each condition. (b) With α = 0.05, one-tailed. (c) One-tailed. (d) Z = 2.06 based on the Stouffer method. The method of testing the mean Z would yield an even stronger result (Z = 3.56, P = 0.00019). See Rosenthal (1991b, Chapter 5) for more detail on these two and other methods of combining probabilities.

comparing the means of a treatment and control group. The student's advisor allowed her to recruit Ss from a class that typically has between 20 and 30 students. The experiment was performed with 13 Ss in each condition. Although the difference between the treatment and control groups' means was in the predicted direction, this difference was not significant t(24) = 1.18, P = 0.13, one-tailed. The following year another student (Student B) expressed interest in the same research question. The advisor agreed to let him conduct the research with some reservations because of the 'disappointing' results of Student A's experiment. As in the previous year, the difference between the treatment (n = 11) and control (n = 11) groups' means was in the predicted direction, but this difference was not significant t(20) = 1.27, P = 0.11, one-tailed. The following year a third student (Student C) expressed interest in the experiment that Students A and B had performed. Although the advisor was reluctant to sponsor an experiment in which the finding of "no difference between treatment and control" already had been "replicated in two studies", he ultimately was persuaded to let Student C perform the experiment. Again, although the difference between the treatment (n = 12) and control (n = 12) groups' means was in the predicted direction, this difference was not significant t(22) = 1.22, P = 0.12, one-tailed.

Although, taken individually, none of these studies produced an effect large enough to reject the null hypothesis, when taken together, the cumulative effect of these studies was statistically significant Z = 2.06, P = 0.02, one tailed, average d = 0.52. As Table 3 illustrates, the power of these experiments to detect a half standard deviation difference between the treatment and control means was quite low (median power = 0.35), and even lower (median power = 0.22) for two-tailed tests. Thus, none of these experiments had a good chance of obtaining a statistically significant result even if the null were false, with d = 0.52. However, when simple meta-analytic procedures are used to cumulate the results of these three studies, the treatment condition was significantly different from the control.

One could obtain a rough idea of how much a meta-analysis increases power by finding the power associated with the total number of Ss and the average effect size for a group of studies. In this case, power = 0.71 for a t test comparing two means with 72 Ss (36 per condition) and an average effect size of d = 0.52, about twice the power of the individual studies.
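The Stouffer combination reported in Table 3 is easy to reproduce. The sketch below (assuming SciPy) sums the three one-tailed Zs and divides by the square root of the number of studies; the function calls are our own illustration of the procedure described in the table note.

```python
import numpy as np
from scipy.stats import norm

# One-tailed Z values for Students A, B, and C (Table 3)
z_values = np.array([1.15, 1.23, 1.19])

# Stouffer method: sum the Zs and divide by the square root of the number of studies
z_combined = z_values.sum() / np.sqrt(len(z_values))
p_combined = norm.sf(z_combined)   # one-tailed P for the combined result

print(round(z_combined, 2), round(p_combined, 2))   # about 2.06 and 0.02, as in Table 3
```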

Researchers, mindful of the concept of power, generally want to conduct studies that have sufficient power to reject the null. However, this is not always possible. For example, in a recent issue of Science, Jon Cohen (1993) reported the results of two pilot studies comparing the health of monkeys who were vaccinated with SIV (the simian analog to HIV) to controls. In these two pilot studies, the experimental conditions included 3 and 5 monkeys respectively, and the controls included 3 and 6 respectively. With meta-analysis, it makes sense to run small, low power studies, especially in cases where Ss are rare or hard to recruit. Results of such studies can be very informative, contributing to a larger database although unlikely to lead to a significant result on their own (Rosenthal, 1995).

CONCLUDING REMARKS

Attention to power

We suggest that it is good practice to consult Cohen's (1977, 1988) power tables frequently in the course of conducting research. Knowledge of an experiment's likelihood of producing a significant result for a genuinely false null should help researchers plan studies with sufficient power


to detect predicted effects and to interpret nonsignificant results properly. Additionally, researchers should use the full complement of available means to increase the power of their research as efficiently as possible, whether it be through increased sample size, stronger effects, reduced experimental error, more precise data analysis, cumulating multiple studies with meta-analytic procedures, or, most likely, a combination of these things.

Thinking about significance tests

Many observers have pointed out that null hypothesis testing often is used in a problematic manner in social science research (Cohen, 1990, 1994; Gigerenzer, 1993; Jones, 1955; Loftus, 1991, 1993, 1994; Rosenthal, 1991a, 1995). Certainly, an overemphasis on significance testing at the expense of useful information about the size of effects can lead to two common inference errors: (a) interpreting failure to reject the null hypothesis to mean the null is true or that there is no effect; and (b) not distinguishing the statistical significance of a result from its scientific importance. These errors can be avoided by using procedures when analyzing and reporting research results that contain more information than the likelihood that the sample effect size could have been obtained if a null hypothesis were true. It is good practice to compute and report effect size estimates for any effect that is tested, and to provide standard errors or confidence intervals for effects, as e.g. Loftus (1991, 1993, 1994) and Rosenthal and Rubin (1978) suggest. Rosenthal and Rubin (1994) propose the counternull statistic as a way to avoid the inference errors associated with an overemphasis on null hypothesis testing.

For a given obtained effect size, the counternull value of an effect size is the non-null magnitude of the effect size that is supported by just the same amount of evidence as supports the null value of the effect size. For example, if a sample effect size was d = 0.30, with P = 0.20, that would mean that only 1 time in 5 would a sample have an effect size as large as d = 0.30 if it were drawn from a larger population where d = 0.00. With P so far from being significant many researchers would conclude that the null was true. However, the counternull specifies the equally likely alternative: a sample effect size as small as d = 0.30 would be observed only 1 time in 5 from a population where d = 0.60. In other words the counternull illustrates that populations with d = 0.00 and d = 0.60 are equally likely to produce a sample effect size d = 0.30. The counternull is easy to compute. For symmetrically distributed effect size statistics (e.g. d), the counternull is simply twice the observed effect size minus the null effect size:

es(counternull) = 2 es(obtained) - es(null).

For non-symmetric effect sizes, like the Pearson r, this formula can be applied after the effect size has been transformed to a symmetric scale, as Fisher's z transformation does for the Pearson r.
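A minimal sketch of the counternull calculation (the function names are ours): for d the formula applies directly, and for Pearson r it is applied on Fisher's z scale and transformed back.

```python
import numpy as np

def counternull_d(d_obtained, d_null=0.0):
    """Counternull for a symmetrically distributed effect size such as d."""
    return 2 * d_obtained - d_null

def counternull_r(r_obtained, r_null=0.0):
    """Counternull for Pearson r: work on Fisher's z scale, then transform back."""
    z_counternull = 2 * np.arctanh(r_obtained) - np.arctanh(r_null)
    return np.tanh(z_counternull)

print(counternull_d(0.30))             # 0.60, matching the example above
print(round(counternull_r(0.30), 2))   # counternull of r = 0.30 against a null of r = 0.00
```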

Use of the counternull avoids the errors of inferring that a nonsignificant effect means no effect and assuming a significant effect to be scientifically important. The first error is avoided because the counternull illustrates that it is equally likely that the true population effect size is larger than the observed effect size as that it is zero, and it avoids the second error because if even the value of the counternull is too small to be scientifically important we will be less tempted to conclude that a result is important merely because it is significant.

The paradox of power

Of course, in a world where rejecting the null hypothesis is held in high regard, it is important practically to consider power to ensure that we are able to reject those nulls that deserve rejection. However, we can envision a world in which the goal of scientific understanding takes precedence over null rejection. In this world researchers will ask the questions "How large was the effect?", "How well was it estimated?", and "Is this effect large enough to be scientifically important?" rather than "How likely is it that this effect could have come from a population in which no real effect exists?". However, it is not necessarily inconsistent to advocate on one hand that researchers be aware of the concept of statistical power and take steps to increase it in their research, and, on the other hand, to argue that null hypothesis testing is emphasized too much in social science research. Even with a reduced emphasis on null hypothesis testing, all of the procedures that increase statistical power are still sound research practice. These procedures, collectively, will lead


to more accurate estimates of effect sizes, to larger effect sizes, and to conceptually more interpretable effect sizes.

Acknowledgements--Preparation of this paper was supported in part by an award from the Tozier Fund and in part by a sabbatical award from the James McKeen Cattell Foundation.

REFERENCES

Aron, A. & Aron, E. N. (1994). Statistics for Psychology. Englewood Cliffs, NJ: Prentice Hall.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: a review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1965). Some statistical issues in psychological research. In Wolman, B. B. (Ed.), Handbook of Clinical Psychology (pp. 95-121). New York: McGraw-Hill.
Cohen, J. (1977). Statistical Power Analysis for the Behavioral Sciences (rev. edn). New York: Academic Press.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn). Hillsdale, NJ: LEA.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J. (1993). A new goal: preventing disease, not infection. Science, 262, 1820-1821.
Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.
Fisher, R. A. (1966). The Design of Experiments (8th edn). New York: Hafner. (Reprinted in Bennett, J. H. (Ed.), Statistical Methods, Experimental Design, and Scientific Inference, Oxford Univ. Press, 1993.)
Fisher, R. A. (1973). Statistical Methods and Scientific Inference (3rd edn). New York: Hafner. (Reprinted in Bennett, J. H. (Ed.), Statistical Methods, Experimental Design, and Scientific Inference, Oxford Univ. Press, 1993.)
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In Keren, G. & Lewis, C. (Eds), A Handbook for Data Analysis in the Behavioral Sciences: Vol. 1, Methodological Issues (pp. 311-339). Hillsdale, NJ: LEA.
Gigerenzer, G. & Murray, D. J. (1987). Cognition as Intuitive Statistics. Hillsdale, NJ: LEA.
Guilford, J. P. (1956). Fundamental Statistics in Psychology and Education (3rd edn). New York: McGraw Hill.
Howell, D. C. (1995). Fundamental Statistics for the Behavioral Sciences (3rd edn). Belmont, CA: Duxbury Press.
Jones, L. V. (1955). Statistical theory and research design. Annual Review of Psychology, 6, 405-430.
Kenny, D. A. (1994). Interpersonal Perception. New York: Guilford Press.
Kenny, D. A. & La Voie, L. (1984). The social relations model. In Berkowitz, L. (Ed.), Advances in Experimental Social Psychology, 18 (pp. 141-182). Orlando: Academic Press.
Koutstaal, W. & Rosenthal, R. (1994). Contrast analysis in behavioral research. Poznan Studies in the Philosophy of the Sciences and the Humanities, 39, 135-175.
Kraemer, H. C. & Thiemann, S. (1987). How Many Subjects: Statistical Power Analysis in Research. Beverly Hills, CA: Sage.
Li, H., Hallahan, M. & Rosenthal, R. (1995). Comparing dyads: the analysis of variance of round robin block designs. Manuscript submitted for publication.
Lipsey, M. W. (1990). Design Sensitivity: Statistical Power for Experimental Research. Beverly Hills, CA: Sage.
Loftus, G. R. (1991). On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology, 36, 102-105.
Loftus, G. R. (1993). One picture is worth a thousand p values: on the irrelevance of hypothesis testing in the microcomputer age. Behavior Research Methods, Instruments, and Computers, 25, 250-256.
Loftus, G. R. (1994). Why psychology will never be a real science until we change the way we analyze data. Paper presented at the meeting of the American Psychological Association, Los Angeles, CA.
Neyman, J. & Pearson, E. S. (1933). The testing of statistical hypotheses in relation to probabilities a priori. Proc. of the Cambridge Philosophical Society, 29, 492-510.
Rosenthal, R. (1991a). Cumulating psychology: an appreciation of Donald T. Campbell. Psychological Science, 2, 213-221.
Rosenthal, R. (1991b). Meta-analytic Procedures for Social Research (rev. edn). Beverly Hills, CA: Sage.
Rosenthal, R. (1995). Progress in clinical psychology: is there any? Clinical Psychology: Science and Practice, 2, 133-150.
Rosenthal, R. & Rosnow, R. L. (1985). Contrast Analysis: Focused Comparisons in the Analysis of Variance. New York: Cambridge Univ. Press.
Rosenthal, R. & Rosnow, R. L. (1991). Essentials of Behavioral Research: Methods and Data Analysis. New York: McGraw Hill.
Rosenthal, R. & Rubin, D. B. (1978). Interpersonal expectancy effects: the first 345 studies. The Behavioral and Brain Sciences, 3, 377-386.
Rosenthal, R. & Rubin, D. B. (1994). The counternull value of an effect size: a new statistic. Psychological Science, 5, 329-334.
Sedlmeier, P. & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Snedecor, G. W. (1937). Statistical Methods (1st edn). Ames, IA: Collegiate Press.
"Student" (1908). The probable error of a mean. Biometrika, 6, 1-25.
"Student" (1931). The Lanarkshire milk experiment. Biometrika, 23, 398-406.
Welkowitz, J., Ewen, R. B. & Cohen, J. (1991). Introductory Statistics for the Behavioral Sciences (4th edn). Orlando: Harcourt Brace Jovanovich.
Zuckerman, M., Hodgins, H. S., Zuckerman, A. & Rosenthal, R. (1993). Contemporary issues in the analysis of data: a survey of 551 psychologists. Psychological Science, 4, 49-53.