Luc Wouters November 2011 STATISTICAL THINKING AND REASONING IN EXPERIMENTAL RESEARCH
Luc Wouters November 2011
STATISTICAL THINKING AND REASONING IN
EXPERIMENTAL RESEARCH
Statistical thinking will one day be as necessary for efficient citizenship as the
ability to read and write!
S. Wilks (1951) after H.G. Wells (1903; 1938).
TABLE OF CONTENTS
1 Introduction .................................................................................................................................... 1
2 The Architecture of Experimental Research ................................................................................... 1
2.1 The Controlled Experiment ..................................................................................................... 1
2.2 Some Terminology .................................................................................................................. 2
2.3 Scientific Research as a Phased Process ................................................................................. 2
2.4 Scientific Research as an Iterative, Dynamic Process ............................................................. 3
3 Smart Research Design by Statistical Thinking ............................................................................... 3
3.1 Research Styles – The Smart Researcher ................................................................................ 3
3.2 Principles of Statistical Thinking ............................................................................................. 4
4 Planning the experiment ................................................................................................................. 5
4.1 Types of Experiments .............................................................................................................. 5
4.2 The Objective of the Study and the Research Hypothesis ...................................................... 6
4.3 The Pilot Experiment ............................................................................................................... 7
5 Principles of Statistical Design ........................................................................................................ 7
5.1 The Structure of the Response Variable ................................................................................. 7
5.2 Defining the experimental unit ............................................................................................... 8
5.3 Variation is Omnipresent ........................................................................................................ 9
5.4 Balancing Internal and External Validity ............................................................................... 10
5.5 Bias and Variability ................................................................................................................ 10
5.6 Requirements for a Good Experiment .................................................................................. 11
5.7 Strategies for Minimizing Bias and Maximizing Signal‐to‐noise Ratio .................................. 12
5.7.1 Strategies for Minimizing Bias – Good Experimental Practice ...................................... 12
5.7.2 Strategies for Controlling Variability – Good Experimental Design .............................. 16
5.8 Simplicity of Design ............................................................................................................... 19
5.9 The Calculation of Uncertainty ............................................................................................. 19
6 Common Designs in Biological Experimentation .......................................................................... 19
6.1 The Completely Randomized Design .................................................................................... 20
6.2 The Factorial Design .............................................................................................................. 23
6.3 The Randomized Complete Block Design ............................................................................. 26
6.4 The Latin Square Design ........................................................................................................ 28
6.5 Split Plot Designs ................................................................................................................... 29
6.6 Repeated Measures designs ................................................................................................. 30
7 The Required Number of Replicates – Sample Size ...................................................................... 30
7.1 Determining Sample Size is a Risk – Cost Assessment .......................................................... 30
7.2 The Context of Biomedical Experiments ............................................................................... 31
7.3 The Hypothesis Testing Context – The Population Model .................................................... 31
7.4 Sample Size Calculations ....................................................................................................... 32
7.4.1 Power Analysis Computations ...................................................................................... 32
7.4.2 Mead’s Resource Requirement Equation ..................................................................... 33
7.4.3 Multiplicity and Sample Size ......................................................................................... 34
8 The Statistical Analysis .................................................................................................................. 34
8.1 The Statistical Triangle .......................................................................................................... 34
8.2 The Statistical Model Revisited ............................................................................................. 35
8.3 Types of data ......................................................................................................................... 35
8.4 Verifying the Statistical Assumptions ................................................................................... 36
8.5 Multiplicity ............................................................................................................................ 36
9 The Study Protocol ........................................................................................................................ 37
10 Interpretation and Reporting .................................................................................................... 38
10.1 The Methods Section ............................................................................................................ 38
10.1.1 Experimental Design ..................................................................................................... 38
10.1.2 Statistical Methods ....................................................................................................... 38
10.2 The Results Section ............................................................................................................... 39
10.2.1 Summarizing the Data ................................................................................................... 39
10.2.2 Graphical Displays ......................................................................................................... 40
10.2.3 Interpreting and Reporting Significance tests .............................................................. 40
11 Concluding Remarks .................................................................................................................. 42
12 Recommended Reading ............................................................................................................ 42
References and Selected Bibliography.................................................................................................. 45
Appendix Tools for Randomization in MS Excel and R ......................................................................... 49
Completely Randomized Design ....................................................................................................... 49
Randomized Complete Block Design ................................................................................................ 50
1
1 INTRODUCTION
Many biomedical studies are conducted with little or no thought about the statistical aspects of the
study design. As a consequence, these studies are often seriously flawed and are not capable of
meeting their intended purpose. In some cases the studies are designed too small to enable an
answer to the research question. Conversely, some studies use too many experimental units so that
valuable resources are wasted. This lack of quality was also demonstrated in a recent review carried
out by Kilkenny et al. (2009) who surveyed 271 papers reporting laboratory animal experiments.
They found that most of the studies had flaws in their design and almost one third of the papers that
used statistical methods did not describe them or present their results adequately. There is also a
genuine concern about the replicability (lack of confirmation) of research findings and it has been
argued that most research findings could be false (Ioannidis, 2005).
A way to circumvent this lack of quality is by changing the way scientists look at the research
process. This can be accomplished by introducing statistical thinking as an informed skill that
enhances the quality of the research data (Vandenbroeck et al., 2006).
Statistical thinking and reasoning are two powerful skills based on the fundamentals of statistics.
While the science of statistics is mostly involved with the complexities and techniques of statistical
analysis, statistical thinking and reasoning are generalist skills that focus on the application of
nontechnical concepts and principles. There are no clear, generally accepted definitions of statistical
thinking and reasoning. In our conceptualization we consider statistical thinking as a skill that helps
to better understand how statistical methods can contribute to finding answers to specific research
problems and what the implications are in terms of data collection, experimental setup, data
analysis, and reporting. Statistical thinking will provide us with a generic methodology to design
insightful experiments. On the other hand, we will consider statistical reasoning as being more
involved with the interpretation of statistical analyses. Of course, as is apparent from the above
there is a large overlap between the concepts of statistical thinking and reasoning.
2 THE ARCHITECTURE OF EXPERIMENTAL RESEARCH
2.1 THE CONTROLLED EXPERIMENT
There are two basic approaches to implement a scientific research project. One approach is to
conduct an observational (also called correlational) study in which we investigate the effect of
naturally occurring variation and the assignment of treatment groups is outside the control of the
2
investigator. Although there are often good and valid reasons for conducting an observational study,
their main drawback is that the presence of confounding variables can never be excluded, thus
weakening the conclusions. An alternative to an observational study is an experimental or
manipulative study in which the investigator manipulates the experimental system and measures
the effect of his manipulations on the experimental material. Since the manipulation of the
experimental system is under control of the experimenter, one also speaks of controlled
experiments. A well‐designed experimental study eliminates the bias caused by confounding
variables. The great power of a controlled experiment, provided it was well conceived, lies in the fact
that it allows us to demonstrate causal relationships. In the following we will focus on controlled
experiments and how statistical thinking and reasoning can be of use to optimize its design and
interpretation.
2.2 SOME TERMINOLOGY
We refer to a factor as the condition or set of conditions that we manipulate in the experiment, e.g.
the concentration of a drug. A factor level is the particular value of the factor, e.g. 10‐6M, 10‐5M. A
treatment consists of the combination of factor levels. In single‐factor studies a treatment
corresponds to a factor level. The experimental unit is defined as the smallest physical entity to
which a treatment is independently applied. The characteristic that is measured and on which the
effect of the different treatments is investigated and analyzed is referred to as the response or
dependent variable. The observational unit is the unit on which the response is measured. Often
the observational unit is identical to the experimental unit, but this is not necessarily always the
case.
2.3 SCIENTIFIC RESEARCH AS A PHASED PROCESS
From a systems analysis point of view the scientific research process can be divided into five distinct
stages: definition of the research question, design of the experiment, conduct of the experiment
and data collection, data analysis, and reporting. Each of these phases results in a specific
deliverable. The definition of the research question will usually result in a research or grant proposal
stating the hypothesis related to the research and the predictions that follow from this hypothesis.
The design of the experiment needed for testing the research hypothesis is formalized in a written
protocol. After the experiment has been carried out, the data will be collected yielding the dataset.
Statistical analysis of the data will yield conclusions that answer the research question by accepting
or rejecting the formalized hypothesis. Finally, a well carried out research project will result in a
report, thesis, or journal article.
3
2.4 SCIENTIFIC RESEARCH AS AN ITERATIVE, DYNAMIC PROCESS
As depicted in Figure 1, scientific research is not a simple static activity, but an iterative and highly
dynamic process. A research project is carried out within some organizational or management
context which can be rather authoritative; this context can be academic, government, or business. In
this context, the management objectives of the research project are put forward. The aim of our
research project itself is to fill an existing information gap. To fill this gap the research question is
defined, the experiment designed and carried out and the data analyzed. The results of this analysis
allow informed decisions to be made and provide a way of feedback to adjust the definition of the
research question. On the other hand, the experimental results will trigger research management to
reconsider their objectives and eventually request for more information.
Figure 1. Scientific research as an iterative process
3 SMART RESEARCH DESIGN BY STATISTICAL THINKING
3.1 RESEARCH STYLES – THE SMART RESEARCHER
The five phases that make up the research process modulate between the concrete and the abstract
world (Figure 2). Definition and reporting are conceptual and complex tasks requiring a great deal of
abstract reasoning. Conversely, experimental work and data collection is a very concrete,
measurable task handling with the practical details and complications of the specific research
domain.
Scientists exhibit different styles in their research depending on the relative fraction of the available
resources that they are willing to spend at each phase of the research process. This allows us to
recognize different archetypes of researchers: the “novelist” who needs to spend a lot of time
distilling a report from an ill-conceived experiment; the “data salvager” who believes that no matter
how you collect the data or set up the experiment, there is always a statistical fix-up at analysis time;
4
the “lab freak” who strongly believes that if enough data are collected something interesting will
always emerge. Finally, there is the “smart researcher” who is aware of the architecture of the
experiment as a sequence of steps and allocates a major part of his time budget to the first two
steps: definition and design. The “smart researcher” is convinced that time spent planning and
designing an experiment at the outset will save time and money in the long run. He opposes the
“lab freak” by trying to reduce the number of measurements to be taken, thus effectively reducing
the time spent in the lab. In contrast to the “data salvager”, the “smart researcher” recognizes that
the design of the experiment will govern how the data will be analyzed thereby reducing time spent
at the data analysis stage to a minimum. By carefully preparing and formalizing the definition and
design phase, the “smart researcher” can look ahead to the reporting phase with peace of mind,
which is in contrast to the “novelist”.
Figure 2. Modulating between the concrete and the abstract
3.2 PRINCIPLES OF STATISTICAL THINKING
The “smart researcher” recognizes the value of “statistical thinking” for his application area and he
himself is skilled in “statistical thinking” or he collaborates with a professional who masters this skill.
As noted before, “statistical thinking” is related to, but distinct from statistical science. While
statistics is a specialized technical skill based on mathematical statistics as a science on its own,
statistical thinking is a generalist skill based on informed practice and focused on the applications of
nontechnical concepts and principles. The “statistical thinker” attempts to understand how
statistical methods can contribute to finding answers to specific research problems in terms of data
collection, experimental setup, data analysis, and reporting. He or she is able to postulate which
statistical expertise is required to enhance the research project’s success. In this capacity the
“statistical thinker” acts as a “diagnoser”. In contrast to statistics which operates in a closed and
secluded mathematical context, statistical thinking is a practice that is fully integrated with the
researcher’s scientific field, not merely an autonomous science. Hence the “statistical thinker”
operates in a more ambiguous setting where he is deeply involved in applied research, with a good
5
working knowledge of the substantive science. In this role the “statistical thinker” acts as an
intermediary between scientists and statistician and goes into dialogue with them. He attempts to
integrate the several potentially competing priorities that make up the success of a research project:
resource economy, statistical power, and scientific relevance, into a coherent and statistically
underpinned research strategy. While the impact of the statistician on the research process is
limited to discrete interventions, the “statistical thinker” truly permeates the research process. His
combined skills lead to increased efficiency, which is important to increase the speed with which
research data, analyses, and conclusions become available. Moreover, these skills allow to enhance
the quality and to reduce the associated cost. Statistical thinking then helps the scientist to build a
case and negotiate it on fair and objective grounds with those in the organization seeking to
contribute to more business‐oriented measures of performance. In that sense, the successful
“statistical thinker” is a persuasive communicator.
Smart research design is based on seven basic principles of statistical thinking: 1) time spent
thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of
an experiment reflects the contributions from different sources of variability; 3) the design of an
experiment balances between its internal validity (proper control of noise) and external validity (the
experiment’s generalizability); 4) good experimental practice provides the clue to bias minimization;
5) good experimental design is the clue to the control of variability; 6) experimental design
integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar
of an effective experiment.
4 PLANNING THE EXPERIMENT
4.1 TYPES OF EXPERIMENTS
Experimental set‐ups can be classified into four broad categories depending on the type of objective
of the study in question. A comparative experiment is one in which two or more techniques,
treatments, or levels of an explanatory variable are to be compared with one another. There are
many examples of comparative experiments in biomedical areas. For example in nutrition studies
different diets can be compared to one another in laboratory animals. In clinical studies, the efficacy
of an experimental drug is assessed in a trial by comparing it to treatment with placebo.
Comparative experiments are the primary focus of this tutorial.
A second type of experiment is the optimization experiment which has the objective of finding
conditions that give rise to a maximum or minimum response. Optimization experiments are often
6
used in product development, such as finding the optimum combination of concentration,
temperature, and pressure that give rise to the maximum yield in a chemical production plant. Dose
finding trials in clinical development are another example of optimization experiments.
The third type of experiment is the prediction experiment in which the objective is to provide some
statistical/mathematical model to predict new responses. Examples are dose response experiments
in pharmacology and immuno‐assay experiments.
The final experimental type is the variation experiment. This experiment has as objective to study
the size and structure of random variation. Variation experiments are implemented as uniformity
trials, i.e. studies without different treatment conditions. For example, the assessment of sources of
variation in microtiter plate experiments. These sources of variation can be plate effects, row
effects, column effects and their combination. Another example is a study to investigate the effect of
cage location in an animal experiment where animals are kept in a rack of 24 cages.
4.2 THE OBJECTIVE OF THE STUDY AND THE RESEARCH HYPOTHESIS
Prior to the experimental design, it is of importance that the scientist realizes what the actual goal is
of the study. Sometimes a pilot study, making preliminary observations on the study material is
useful to generate clear questions that will be answered in the actual experiment. It is wise to limit
the objectives of a study to a maximum of, say three (Selwyn, 1996). Having a lot of research
objectives compromises the integrity of the study and as a result, often none of the study objectives
will be satisfied. After having formulated the research objectives, the scientist will then try to
transfer them into scientific hypotheses that might answer the question and he will make
predictions of what to expect if these hypotheses are true. The next step in the planning process is
to decide which data are required to confirm or refute these predictions. Generating sensible
predictions is one of the key factors of good experimental design. Good predictions will follow
logically from the hypotheses that we wish to test, and not from other rival hypotheses. Good
predictions will also lead to insightful experiments that allow the predictions to be tested.
Throughout the sequence of question, hypothesis, prediction it is essential to assess each step
critically with enough skepticism and even ask a colleague to play the “devil’s advocate”. It is much
better that problems are identified at this early stage of the research process than after the
experiment started. At the end of the experiment the scientist should be able to determine whether
the objectives have been met, I.e. whether the research questions were answered to satisfaction.
7
4.3 THE PILOT EXPERIMENT
As researchers are often under considerable time pressure there is the temptation to start as soon
as possible with the experiment. However, a critical step in a new research project, that is often
missed, is to spend a bit of time and resources on the beginning of the study collecting some pilot
data. Such a preliminary or pilot experiment on a limited scale is of use for assessing the feasibility of
the actual experiment. During the pilot stage the researcher is allowed to make variations in
experimental conditions such as measurement method, experimental set‐up, etc.
The pilot experiment can be of help to make sure that a sensible research question was asked. For
instance, if our research question was about whether there is a difference in concentration of a
certain protein between diseased and non‐diseased tissue, it is of importance that this protein is
present in a measurable amount. Carrying out a pilot experiment in this case can save considerable
time, resources and eventual embarrassment.
A second crucial role for the pilot study is for the researcher to practice, validate and standardize the
experimental techniques that will be used in the full study. When appropriate, trial runs of different
types of assays allow fine‐tuning them so that they will give optimal results.
Finally, the pilot study provides basic data to debug and fine‐tune the experimental design. Provided
the experimental techniques work well, carrying out a small‐scale version of the actual experiment
will yield some preliminary experimental data. These pilot data can be very valuable and allow to
calculate or adjust the required sample size of the experiment and to set up the data analysis
environment.
The pilot experiment still belongs to the exploratory phase of the research project and is not part of
the actual, final experiment. In order to preserve the quality of the data and the validity of the
statistical analysis, the pilot data cannot be included in the final dataset.
5 PRINCIPLES OF STATISTICAL DESIGN
5.1 THE STRUCTURE OF THE RESPONSE VARIABLE
We assume that the response obtained for a particular experimental unit can be described by a
simple additive model consisting of the effect of the specific treatment, the effect of the
experimental design and an error component that describes the deviation of this particular
experimental unit from the mean value of its treatment group. There are some strong assumptions
associated with this simple model: 1) the treatment terms add rather than, for example multiply; 2)
treatment
the other
5.2 DE
The exper
treatment
that the e
applied to
high or lo
In many s
levels of s
their exp
analysis o
in biomed
Figure
Temme et
measured
t effects are c
units. These
FINING THE
rimental unit
t can be app
experimental
o one unit can
w result in on
studies the c
sampling it o
erimental ma
of their study
dical research
4. Morphome
Me
t al. (2001) co
d the diamete
constant; 3) t
assumptions
Figure 3
EXPERIMENT
t corresponds
lied, such tha
units respon
nnot affect th
ne unit has no
hoice of the
often happen
aterial. This
y. The followi
h.
etric analysis o
ans±SEM from
ompared two
ers of bile ca
the response
are particula
. The response
TAL UNIT
s to the smal
at any two u
nd independe
he response o
o effect on th
experimenta
s that invest
can lead to
ng examples
f the diameter
m three livers.
o genetic stra
naliculi in the
8
in one unit is
arly importan
e variable as a
lest division o
nits can rece
ently of one
obtained in an
he result of an
al unit is obv
igators have
serious erro
represent ty
r of bile canali
*P<0.005 (afte
ins of mice, w
e livers of thr
s unaffected b
t in the statis
dditive model
of the experi
ive different
another, in t
nother unit a
nother unit.
ious. Howeve
difficulties re
ors in both t
ypical situatio
culi in wild‐typ
er Temme et a
wild‐type and
ree wild‐type
by the treatm
stical analysis
mental mate
treatments.
the sense tha
nd that the o
er, in studies
ecognizing th
he design an
ons commonl
pe and C02‐de
al., 2001)
connexin32‐
e and of three
ment applied t
.
erial to which
It is importan
at a treatmen
occurrence of
s with multip
he basic unit
nd subseque
y encountere
eficient liver.
‐deficient. The
e C02‐deficien
to
a
nt
nt
f a
ple
in
nt
ed
ey
nt
9
animals, making several observations on each liver. Their results are shown in Figure 4. There is a
fairly obvious problem with the definition of the experimental units which also act as units of
analysis here. They have two groups and in each group there are only three experimental units, not
groups of 280 and 162 units. Hence their statistical analysis is completely wrong. Similar mistakes are
abundant whenever microscopy is concerned and the individual cell is used as experimental unit.
One could wonder whether these are mistakes made out of ignorance or out of convenience. The
concern is even greater when such studies get published in peer reviewed scientific journals.
Another example of a wrong definition of experimental unit concerns a study in laboratory animals
about the toxicity of N‐nitrosamines (Rivenson et al., 1988). The investigators provided the following
description of their experimental set‐up:
“The rats were housed in groups of 3 in solid‐bottomed polycarbonate cages with hardwood bedding
under standard conditions diet and tap water with or without N‐nitrosamines were given ad libitum.”
Since the treatment was supplied in the drinking water, it is impossible to provide different
treatments to any two individual rats. Furthermore, the responses obtained within the different
animals within a cage can be considered to be dependent upon one another in the sense that the
occurrence of extreme values in one unit can affect the result of another unit. Therefore, the
experimental unit here is not the single rat but the cage.
Still another example shows that a single individual can relate to several experimental units. In a
clinical trial the skin irritation potential of five dermatological compounds is evaluated by
administering all five compounds simultaneously to five separate test areas on the backs of normal
volunteers. Each test area was randomly assigned to one of the five different treatments conditions.
Therefore, the test area rather than the volunteer was the experimental unit.
5.3 VARIATION IS OMNIPRESENT
Variation is everywhere in the natural world and is often substantial in the biomedical area. Despite
accurate execution of the experiment, the measurements obtained in identically treated objects will
yield different results. For example, cells grown in test tubes will vary in their growth rates and in
animal research no two animals will behave exactly the same. In general, the more complex the
system that we study, the more factors will interact with each other and the greater will be the
variation between the experimental units. Experiments in whole animals will undoubtedly show
more variation than in vitro studies on isolated organs. When the variation cannot be controlled or
its source cannot be measured, we will refer to it as noise, random variation or error. This
10
uncontrollable variation masks the effects under investigation and is the reason why statistical
methods are required to extract the necessary information. This is in contrast to other scientific
areas such as physics, chemistry and engineering where the studied effects are much larger than the
natural variation.
5.4 BALANCING INTERNAL AND EXTERNAL VALIDITY
Internal validity refers to the fact that in a well‐conceived experiment the effect of the treatment is
unequivocally attributed to the treatment. However, as we saw earlier the effect of the treatment is
masked by the presence of the uncontrolled variation of the experimental material. An experiment
with a high level of internal validity should have a great chance to detect the effect of the treatment.
If we consider the treatment effect as a signal and the inherent variation of our experimental
material as noise, then a good experimental design will maximize the signal/noise ratio. Increasing
the signal can be accomplished by choosing experimental material that is more sensitive to the
treatment. Identification of factors that increase the sensitivity of the experimental material could
be carried out in preliminary experiments. Reducing the noise is another way to increase the signal‐
to‐noise ratio. This can be accomplished by repeating the experiment in a number of animals, but
this is not a very efficient way of reducing the noise. An alternative for noise reduction is by using
very uniform experimental material. The use of cells harvested from a single animal is an example of
noise reduction by uniformity of experimental material.
External validity is related to the extent that our conclusions can be generalized to the target
population. The choice of the target population, how a sample is selected from this population and
the experimental procedures used in the study are all determinants of its external validity. Clearly,
the experimental material should mimic the target population as close as possible. In animal
experiments specifying species and strain of the animal, the age and weight range and other
characteristics determine the target population and make the study as realistic and informative as
possible. External validity can be very low if we work in a highly controlled environment using very
uniform material. Thus there is a trade‐off between internal and external validity, as one goes up the
other comes down. Fortunately, as we will see, there are statistical strategies for designing a study
such that the noise is reduced while the external validity is maintained.
5.5 BIAS AND VARIABILITY
Bias and variability (Figure 5) are two important concepts when dealing with the design of
experiments. A good experiment will minimize or, at best, try to eliminate bias and will control for
variability. By bias, we mean a systematic deviation in observed measurements from the true value.
For exam
investigat
the contro
of the exp
clear that
confound
study is th
By variabi
also relat
means th
precise. G
most imp
thereby je
much var
increasing
conclusio
5.6 RE
Cox (1958
1. t
2. t
3. t
4. t
5. U
mple, an anim
ted with resp
ol treatment
periment the
t this treatm
ed with the d
he way exper
ility, we mean
ed to the con
hat our meas
Good experim
portant. Failur
eopardizes th
riability, this
g the sample
ns.
Figure
QUIREMENT
8) enunciated
reatment com
he compariso
he conclusion
the experime
Uncertainty i
mal study is p
pect to a con
and all femal
investigator
ent effect is
difference be
imental units
n a random f
ncepts of acc
surement is a
ments are as
re to minimiz
he internal va
can sometim
e size, or oth
5. Bias and va
TS FOR A GOO
d the followin
mparisons sho
ons should als
ns a wide ran
ental arrangem
n the conclus
performed in
trol treatme
les to the exp
finds a stron
a biased res
tween both s
s are allocated
luctuation ab
curacy and pr
accurate, wh
free as possi
ze the bias o
alidity. Conve
mes be remed
er technique
riability illustr
OD EXPERIM
g requiremen
ould as far as
so be made s
nge of validity
ment should
sions should b
11
n which the
nt. Suppose
perimental tre
g difference b
sult since the
sexes. One of
d to treatmen
bout a centra
recision of a
ile little varia
ble from bias
f an experim
ersely, if the
diated by ref
es. In this cas
rated by a mar
MENT
nts for a “goo
s possible free
ufficiently pr
y (external va
be as simple
be assessable
effect of an
the experime
eatment. Furt
between the
e difference b
f the most im
nt groups.
l value. The t
measuremen
ability means
s and variabi
ent leads to
outcome of
finement of t
se the study
rksman shot at
od experiment
e of systemat
recise (signal‐
alidity);
as possible;
e
experimenta
enter allocate
ther assume t
two treatme
between the
portant sour
erms bias and
nt process. A
s that the m
lity. Of the tw
erroneous co
the experim
the experime
may still rea
t a bull’s eye
t”:
tic error (bias
‐to‐noise);
al treatment
es all males t
that at the en
ent groups. It
two groups
ces of bias in
d precision a
Absence of bia
measurement
wo, bias is th
onclusions an
ent shows to
ental method
ch the corre
s);
is
to
nd
is
is
a
re
as
is
he
nd
oo
ds,
ct
12
These five requirements will determine the basic elements of the design. We have discussed already
the first three requirements in the preceding sections. The following section provides some basic
strategies to fulfill these requirements.
5.7 STRATEGIES FOR MINIMIZING BIAS AND MAXIMIZING SIGNAL‐TO‐NOISE RATIO
The basic principle of experimental design is about maximizing the internal validity by minimizing
the bias and maximizing the signal‐to‐noise ratio and, at the same time, maximizing the external
validity of the experiment. To accomplish this objective, the researcher should maximize the signal
by the proper choice of the measuring device and experimental domain, and minimize bias and
variability. Strategies for minimizing the bias are based on good experimental practice, such as: the
use of controls, blinding, the presence of a protocol, calibration, randomization, random sampling,
and standardization. Variability can be minimized by elements of experimental design: replication,
blocking, covariate measurement, and sub‐sampling. In addition random sampling can be added to
enhance the external validity. We will now consider each of these strategies in more detail.
5.7.1 STRATEGIES FOR MINIMIZING BIAS – GOOD EXPERIMENTAL PRACTICE
5.7.1.1 THE USE OF CONTROLS
In biomedical studies, a control or reference standard is a standard treatment condition against
which all others may be compared. The control can either be a negative control or a positive
control. The term active control is also used for the latter. In some studies, both negative and
positive controls are present. In this case, the purpose of the positive control is mostly to provide an
internal validation of the experiment. In some special type of experiment active controls are used to
show equivalence of treatments. When negative controls are used, subjects can act as their own
control (self‐control). In this case the subject is first evaluated under standard conditions (i.e.
untreated). Subsequently, the treatment is applied and the subject is re‐evaluated. This design has
the property that all comparisons are made within the same subject. Since most of the time1
variability within a subject is smaller than between subjects, this is a more efficient design than
comparing control and treatment in two separate groups. However, there are some drawbacks with
this design. The effect of treatment is completely confounded with the effect of time, which
introduces a potential source of bias. Furthermore, blinding which is another method to minimize
bias is impossible in this design.
1 For a quantitative outcome this is the case when the Pearson product‐moment correlation coefficient between the control measurement and the measurement following treatment is at least 0.5
13
Like in the previous case of self‐control, untreated controls are not blinded. Furthermore, applying
the treatment (e.g. a drug) often requires extra manipulation of the subjects (e.g. injection). The
effect of the treatment is then confounded with that of the manipulation and consequently
potentially biased.
Vehicle control (laboratory experiments) or placebo control (clinical trials) are terms that refer to a
control group that receives a matching treatment condition without the active ingredient. Another
term in the context of experimental surgery is sham control. In the sham control group subjects or
animals undergo a faked operative intervention that omits the step thought to be therapeutically
necessary. This type of control is the most desirable and truly minimizes bias. In clinical research the
placebo controlled trial has become the gold standard. However, in the same context of clinical
research ethical consideration may sometimes preclude its application.
5.7.1.2 BLINDING
Blinding is a very useful strategy for minimizing bias, especially when the response variable is
subjectively evaluated. Knowledge of which treatment was assigned to a particular experimental
unit can subconsciously lead to a biased assessment by the observer. In such experiments evaluation
must be done by a person who is blinded to treatment group. In other types of experiments handling
of the experimental material, especially animals, by the investigator can also influence the results.
Here too, proper blinding of investigators will enhance the credibility and reliability of the results.
In single blinding the investigators are uninformed regarding the treatment condition of the
experimental subjects. Single blinding neutralizes investigator bias. The term double blinding in
laboratory experiments means that both the experimenter and the observer are uninformed about
the treatment condition of the experimental units. In clinical trials double blinding means that both
investigators and subjects are unaware of the treatment condition.
Two methods of blinding have found their way to the laboratory: group blinding and individual
blinding. Group blinding involves identical codes, say A, B, C, etc., for entire treatment groups. This
approach is less appropriate, since when results accumulate and the investigator is able to break the
code, blindness is completely destroyed. A much better approach is to assign a code (e.g. sequence
number) to each experimental unit individually and to maintain a list that indicates which code
corresponds to which particular treatment. The sequence of the treatments in the list should be
randomized. In practice, this procedure often involves an independent person that maintains the list
and prepares the treatment conditions (e.g. drugs).
14
5.7.1.3 THE PRESENCE OF A PROTOCOL
The presence of a written technical protocol containing in full detail the specific definitions of
measurement and scoring methods is imperative to minimize potential bias. The technical protocol
describes practical actions and gives guidelines for lab technicians on how to manipulate the
experimental units (animals, etc.), the materials involved in the experiment, the required logistics. It
also gives details on data collection and processing. Last but not least the protocol lays down the
personal responsibilities of the technical staff. The importance and contents of the study protocol
will be discussed further in Section 9.
5.7.1.4 CALIBRATION
Calibration is an operation that compares the output of a measurement device to standards of
known value, leading to correction of the values indicated by the measurement device. Calibration
neutralizes instrument bias, i.e. the bias in the investigator’s measurement system.
5.7.1.5 RANDOMIZATION
Randomization, in our context, is the process of allocating experimental units to treatment groups or
conditions according to a well‐defined stochastic law1. It is an objective and scientifically accepted
method for the allocation of experimental units to treatment groups. Randomization ensures that
the effect of uncontrolled source of variability has equal probability in all treatment groups. In the
long run, randomization balances treatment groups on unimportant or unobservable variables, of
which we are often unaware. Any differences that exist in these variables are to be attributed by the
“play of chance”. The random allocation of experimental units to treatment conditions is also a
necessary condition for a rigorous statistical analysis, in the sense that it provides an unbiased
estimate of the standard error of the treatment effects and justifies the use of exact significance
tests (Fisher, 1935; Cox, 1958; Lehmann, 1975). Moreover, randomization is also of use as a device
for blinding the experiment. It is essential that the randomization procedure covers all important
sources of variation connected with the experimental units. Experimental units receiving the same
treatment should be dealt with separately and independently at all stages. If this is not the case
additional randomization procedures should be introduced. Methods of randomization using Excel
and the R system are contained in the Appendix.
Some investigators are convinced that a systematic arrangement is the preferred way to eliminate
the influence of uncontrolled variables. For example when two treatments A and B have to be
compared, one possibility is to set up pairs of experimental units and always assign treatment A to
1 By the term “stochastic” is meant that it involves some elements of chance, such as picking numbers out of a hat, or preferably, using a computer program to assign experimental units to treatment groups
15
the first member of the pair and B to the remaining unit. However, if there is a systematic effect
such that the first member of each pair consistently yields a higher or lower result than the second
member, the estimated treatment effect will be biased. Other arrangements are more commonly
used in the laboratory, e.g. the alternating sequence AB, BA, AB, BA, …. Here too, it cannot be
excluded that a certain pattern in the uncontrolled variability coincides with this arrangement. For
instance, if 8 experimental units are tested in one day, the first unit on a given day will always
receive treatment A. Furthermore, when a systematic arrangement has been applied, the statistical
analysis is based on the false assumption of randomness and can be totally misleading.
Researchers are sometimes tempted to improve on the random allocation of animals by re‐
arranging individuals so that the mean weights are exactly identical. However, by reducing the
variability between the treatment groups, the variability within the groups is automatically
increased. This reduces the precision of the experiment and invalidates the subsequent statistical
analysis. Later, we will see that the randomized block design instead of systematic arrangement is
the correct way of handling these last two cases.
Proper randomization must be distinguished from haphazard allocation to treatment groups. For
example, an investigator wishes to compare the effect of two treatments (A, B) on the body weight
of rats. All twelve animals are delivered in a single cage to the laboratory. The laboratory technician
takes six animals out of the cage and assigns them to treatment A, while the remaining animals will
receive treatment B. Although, many scientists would consider this as a random assignment, it is not.
Indeed, one could imagine the following scenario. Heavy animals react slower and are easier to
catch than the smaller animals. Consequently, the first six animals will on average weigh more than
the remaining six.
An important issue in the design of an experiment is the moment of randomization. For example, in
an experiment brain cells were taken from animals and placed in Petri dishes, such that one Petri
dish corresponded to one particular animal. The Petri dishes were then randomly divided into two
groups and placed in an incubator. After 72 hrs incubation one group of Petri dishes was treated
with the experimental drug, while the other group received solvent.
Although the investigators made a serious effort to introduce randomization in their experiment,
they overlooked the fact the placement of the Petri dishes in the incubator introduces a systematic
error. Instead of randomly dividing the Petri dishes in two groups at the start of the experiment, they
should have made random treatment allocation after the incubation period. It is important that the
randomization covers all substantial sources of variation connected with the experimental units. As
16
a rule, randomization should be performed immediately before treatment application. Furthermore,
after the randomization process has been carried out the randomized sequence of the experimental
units must be maintained, otherwise a new randomization procedure is required.
5.7.1.6 RANDOM SAMPLING
Using a random sample in our study increases its external validity and allows us to make a broad
inference, i.e. a population model of inference (Lehmann, 1975). However, in practice it is often
difficult and/or impractical to conduct studies with true random sampling. Clinical trials are usually
conducted using eligible patients from a small number of study sites. Animal experiments are based
on the available animals. This certainly limits the external validity of these studies and is one of the
reasons that the results are not always replicable.
In some cases, maximizing the external validity of the study is of great importance. This is especially
the case in studies that attempt to make a broad inference towards the target population
(population model), like gene expression experiments that try to relate a specific pathology to the
differential expression of certain genes probes (Nahon and Shoemaker, 2002). For such an
experiment the bias in the results is minimized if it is based on a random sample from the target
population.
5.7.1.7 STANDARDIZATION
Standardization of the experimental conditions can be of use to minimize the bias but also to reduce
the intrinsic variability in the results. Examples of standardization of the experimental conditions are
the use of genetically uniform animals, use of phenotypically uniform animals, environmental
control, nutritional control, acclimatization, and standardization of the measurement system. As
discussed before, too much standardization of the experimental conditions can jeopardize the
external validity of the results.
5.7.2 STRATEGIES FOR CONTROLLING VARIABILITY – GOOD EXPERIMENTAL DESIGN
5.7.2.1 REPLICATION
Ronald Fisher1 (1935) noted in his pioneering book The Design of Experiments that replication serves
two purposes. The first is to increase the precision of estimation and the second is to supply an
estimate of error by which the significance of the comparisons is to be judged.
1 Sir Ronald Aylmer Fisher (Londen,1890 – Adelaide 1962) is considered a genius who almost single‐handedly created the foundations of modern statistical science
17
In a comparative experiment with two treatments and equal number of experimental units per
treatment group, the precision of the experiment is quantified by the standard error of the
difference between the two treatments, i.e. the quantity: 2 . . ⁄
The precision of the experiment depends on the standard deviation SD, which is composed of the
intrinsic variability of the experimental material and the precision of the experimental work, and
inversely on the number of experimental units. Reduction of the standard deviation is only possible
to a limited extend. However, one can by increasing the number of experiments units effectively
enhance the experiment’s precision. Unfortunately, due to the inverse square root dependency of
the standard error on the sample size, this is not an efficient way to control the precision. Indeed,
the standard error is halved by a fourfold increase in the number of experimental units, but a
hundredfold increase in the number of units is required to divide the standard error by ten. In other
words, replication is an effective but expensive strategy to control variability.
5.7.2.2 SUBSAMPLING
As mentioned above, reduction of the standard deviation is only possible to a very limited extend.
This can be accomplished by standardization of the experimental conditions but also this method is
limited and jeopardizes the external validity of the experiment. In some experiments it is possible
manipulate the physical size of the experimental units. In this case one could choose units of a larger
size to obtain more precise estimates. In still other experiments, there are multiple levels of
sampling. The process of taking samples below the primary level of the experimental unit is known
as subsampling. The experiment reported by Temme et al. (2001) where the diameter of many liver
cells was measured in 3 animals/experimental condition, is an example of subsampling with animals
at the primary level and cells at the subsample level. When the variability of the measurement at the
sublevel (i.e. within‐subject variability) is substantial, e.g. large standard deviation between cell
diameters of the same animal, as compared to the intrinsic variability between the animals, it makes
sense to increase the number of subsamples. However, subsample replication is not identical and
not as effective as replication on the level of the true experimental unit.
Apart from replication and reduction of intrinsic variability there is a third effective and efficient
method to increase the precision of an experiment. This is accomplished by choosing an appropriate
experimental design which takes into account the different sources of variability that can be
identified.
18
5.7.2.3 BLOCKING
If one or more factors other than the treatment conditions can be identified as potentially
influencing the outcome of the experiment, it may be appropriate to group the experimental units
on these factors. Such groupings are referred to as blocks or strata. Units within a block are then
(randomly) assigned to the treatments. Examples of blocking factors are plates (in microtiter plate
experiments), greenhouses, animals, litters, date of experiment, or groupings based on groupings of
continuous characteristics such as body weight, baseline measurement, etc.
What we effectively do by blocking is dividing the variation between the individuals into variation
between blocks and variation within blocks. If the blocking factor has an important effect on the
response, then the between‐block variation is much greater than the within block variation. We will
take this into account in the analysis of the data (ANOVA with blocks as additional factor).
Comparisons of treatments are then carried out within blocks, where the variation is much smaller.
5.7.2.4 COVARIATES
Blocking on a baseline characteristic such as body weight is one possible strategy to eliminate the
variability induced by the heterogeneity in weight of the animals or patients. Instead of grouping in
blocks, or in addition to, one can also make use of the actual value of the measurement. Such a
concomitant measurement is referred to as a covariate. It is an uncontrollable but measurable
attribute of the experimental units (or their environment) that is unaffected by the treatments but
may have an influence on the measured response. Examples of covariates are body weight, age,
ambient temperature, measurement of the response variable before treatment, etc. The covariate
filters out the effect of a particular source of variability. Rather than blocking it represents a
quantifiable attribute of the system studied. The statistical model underlying the design of an
experiment with covariate adjustment is conceptualized in Figure 6. The model implies that there is
a linear relationship between the covariate and the response and that this relationship is the same in
all treatment groups. In other words, there is a series of parallel curves, one per treatment group,
relating the response to the covariate.
Figure 6. Additive model with linear covariate adjustment
19
5.8 SIMPLICITY OF DESIGN
In addition to external validity, bias and precision, Cox (1958) also stated that the design of our
experiment should be as simple as possible. When the design of the experiment is too complex, it
may be difficult to ensure adherence to a complicate schedule of alteration, especially if these are to
be carried out by relatively unskilled people. An efficient and simple experimental design has the
additional advantage that the statistical analysis will be simple without unreasonable assumptions.
5.9 THE CALCULATION OF UNCERTAINTY
This is the last of Cox’s requirements for a “good experiment” and it is the only true statistical
requirement. Fisher (1935) lamented that: “It is possible, and indeed it is all too frequent, for an
experiment to be so conducted that no valid estimate of error is available”. Without the ability to
estimate error, there is no basis for statistical inference. Therefore, in a well conceived experiment,
we should always be able to calculate the uncertainty in the estimates of the treatment
comparisons. This usually means estimating the standard error of the difference between the
treatment means. To make this calculation in a rigorous manner, the set of experimental units must
respond independently to a specific treatment and may only differ in a random way from the set of
experimental units assigned to the other treatments. This requirement again stresses the
importance of independence of experimental units and randomness in treatment allocation.
In experiments with a very small number of experimental units, it is sometimes not possible to
obtain an effective estimate of the standard deviation from the observations themselves. In this
case, one can make use of the results of previous experiments to estimate the standard deviation.
However, we then make the strong assumption that random variation is the same in the new
experiment.
6 COMMON DESIGNS IN BIOLOGICAL EXPERIMENTATION
There is a multitude of designs that can be considered when planning an experiment and some have
been employed more commonly in the area of biological research. Unfortunately, the literature
about experimental designs is littered with technical jargon, which makes its understanding quite a
challenge. To name a few, there are: completely randomized designs, randomized complete block
designs, factorial designs, split plot designs, Latin square designs, Greco‐Latin squares, Youden
square designs, lattice designs, Placket‐Burman designs, simplex designs, Box‐Behrken designs, etc..
It helps to find our way through this jungle of designs by keeping in mind that the major principle of
experimental design is to provide a synthetic approach to minimize bias and control variability.
20
Furthermore, we can consider each of the specialized experimental designs as integrating three
different aspects of the design: the treatment design, the error control design, and the sampling
design.
The treatment aspect is concerned about which treatments are to be included in the experiment
and closely linked to the goals and aims of study. Should a negative or positive control be
incorporated in the experiment or should both be present? How many doses or concentrations
should be tested and at which level? Is the interaction of two treatment factors of interest or not?
The error control aspect of the study design implements the strategies that we learned in Section
5.7.1 to filter out different sources of variability. The sampling and observation aspect of the design
of our experiment is concerned how experimental units are sampled from the population, how and
how many subsamples should be drawn, etc.
The complexity and required resources of a study are determined by these three aspects of
experimental design. The required resources, i.e. the number of experimental units, of a study are
governed by the number of treatments, the number of blocks and the standard error. The more
treatments, or the more blocks, the more experimental units are needed. The complexity of the
experiment is determined by the underlying statistical model of Figure 3. In particular the error
design aspect of the study controls its complexity. The randomization process is a major part of the
error‐control aspect. A justified and rigorous estimation of the standard error is only possible in a
randomized experiment. In addition to the above requirement for the calculation of the standard
error, randomization has the advantage that it distributes the effects of uncontrolled variability
randomly over the treatment groups.
Replication of experimental units should be sufficient, such that an adequate number of degrees of
freedom are available for estimating the experiment’s precision (standard error). This parameter is
related to the sampling & observation aspect of the design.
The three aspects of experimental design provide a framework for classifying and comparing the
different types of experimental design that are used in biomedical studies. As we will see, each of
these designs has its advantages and disadvantages.
6.1 THE COMPLETELY RANDOMIZED DESIGN
This is the most common and simplest possible experimental design. Each experimental unit is
randomly assigned to exactly one treatment condition. This is often the default design used by
investigators who do not really think about design problems. In our classification the completely
21
randomized design is a one‐way treatment design with a completely randomized error‐control
design. When the treatment condition represents a single factor (e.g., different concentrations of
the same drug including a zero level), then this design is also called a one‐way layout. On the other
hand, when the treatment condition consists of a combination of more factors, each factor at two or
more levels, it is called a factorial design. We will discuss factorial designs in more detail later.
The advantage of the completely randomized design is that it is simple to implement as
experimental units are simply randomized to the various treatments. The obvious disadvantage is
the lack of precision in the comparisons among the treatments, which is based on the variation
between the experimental units.
The following real‐life laboratory experiment is an example of a completely randomized design in
which the investigators used randomization, blinding, and individual housing of animals to guarantee
the absence of systematic error and independence of experimental units. To investigate the
interaction of chronic treatment with drugs on the proliferation of gastric epithelial cells in rats, an
experiment was set up in which two experimental drugs were compared with their respective
solvent. A total of 40 rats were randomly divided into four groups of each ten animals, using the MS
Excel randomization procedure described in the Appendix. To guarantee independence of the
experimental units, the animals were kept in separate cages. Cages were distributed over the racks
according to their sequence number. Blinding was accomplished by letting the sequence number of
each animal correspond to a given treatment. One laboratory worker was familiar with the codes
and prepared the daily drug solutions. Treatment codes were concealed from the laboratory staff
that was responsible for the daily treatment administration and final histological evaluation.
Figure 7. After Levasseur et al. (1996) presence of bias in 96‐well plates
22
Another example of a completely randomized design is about eliminating the bias present in
experiments using 96‐well mictrotiter plates. Burrows (1984) already described the presence of plate
location effects in ELISA assays carried out in 96‐well microtiter plates. Similar findings were
reported by Levasseur et al. (1995) and Faessel et al. (1999) who described the presence of
parabolic patterns (Figure 7) in cell growth experiments carried out in 96‐well microtiter plates. They
were not able to show conclusively the underlying causes of this systematic error, which, as shown
in Figure 7, could be of considerable magnitude. Therefore, they concluded that only by random
allocation of the treatments to the wells these systematic errors could be avoided. An ingenious
method (Figure 8) for randomizing treatments in 96‐well microtiter plates was developed. Drugs
were serially diluted into tubes in a 96‐tube rack. Next, a randomization map was generated using an
Excel macro. The randomization map was taped to the top of an empty tube rack. The original tubes
were then transferred to the new rack by pushing the numbered tubes through the corresponding
numbered rack position. Using a multipipette system drug‐containing medium was then transferred
from the tubes to the wells of a 96‐well microtiter plate. At the end of the assay, the random data
file generated by the plate reader was imported into the Excel spreadsheet and automatically
resorted. Other research groups make use of robotized systems to implement randomization in their
96‐well plate experiments. This example is also a paradigm of how randomization can be used to
eliminate systematic errors by transforming them into random noise.
Figure 8. Scheme for the addition and dilution of drug‐containing medium to tubes in a 96 tube rack,
randomization of the tubes, and then addition of drug containing medium to cell‐containg wells of a 96‐well
plate. (after Faessel et al. (1999))
23
6.2 THE FACTORIAL DESIGN
In some types of experimental work it can be of interest to assess the joint effect of two or more
factors. In this case, the investigator can make use of factorial designs. Factorial designs are
identified as designs with a factorial treatment design and a completely randomized error‐control
design. The factorial treatment design allows estimating the main effects of the treatments and also
their interaction, i.e. the deviation from additivity of their joint effect.
Table 1. Treatment combinations in 2 x 2 factorial study about joint effect of radiotherapy and
chemotherapy
Control: no radiation, no chemotherapy Chemotherapy alone
Radiotherapy alone Radiotherapy and chemotherapy
An example of a simple 2 x 2 factorial design, is an experiment in which the joint effect of
radiotherapy and chemotherapy was assessed in an animal model of tumor growth. It was shown
that radiotherapy as well as treatment with an experimental drug effectively reduced tumor size.
However, for the drug to be of real therapeutic value the joint effect of both radiotherapy and
chemotherapy should be more than the sum of their individual effects. A factorial experiment was
set up in which animals were randomly allocated to 4 treatment groups as in Table 1: a control
group that was not exposed to radiation and was administered the drug’s vehicle, a group with
radiotherapy and the drug’s vehicle, a group without radiation and treated with the experimental
drug, and finally a group receiving both radiotherapy and chemotherapy.
Table 2. Hypothetical results (expected values) from a 2 x 2 factorial experiment
Factor B (Drug)
Factor A(Radiotherapy)
Absent(no drug)
Present(drug)
Absent (no radiation)
μ μ + β
Present (radiotherapy)
μ + α μ + α + β + γ
To further explore the concept of interaction, consider the hypothetical outcome of the 2 x 2
experiment in Table 2. Let μ stand for the result obtained in the control group that received only the
solvent of the drug. When an additive model applies, the result obtained in the group that received
24
the vehicle and was exposed to radiotherapy can be written as μ + α, where α stands for the main
effect of radiotherapy. Analogously, the result obtained in the group that was treated with the active
drug but left without radiotherapy can be written as μ + β. Still using an additive model let γ stand
for the additional effect in the combination, then the result obtained in the group treated with both
drug and radiotherapy is μ + α + β + γ. Note that γ1, called the interaction effect, can have the
same sign as α or β which means that the effect of the corresponding factor is enhanced, or can
have the opposite sign indicating a counteraction of the effect of the respective factor. Table 2
summarizes these hypothetical results. The interaction effect γ, can be estimated by subtracting the
sum of the off‐diagonal elements from the sum of the diagonal elements, i.e. [(μ ) + (μ + α + β + γ)] –
[(μ + α) + (μ + β)] = γ.
Figure 9. Results from a hypothetical experiment with complete additivity of the two factors
The results from a hypothetical experiment with complete additivity are shown in Figure 9. The lines
which connect the treatment means for the two levels of factor A, are parallel to one another. The
presence of interaction is demonstrated by the lack of this parallelism as is illustrated in Figure 10
where the effect of radiotherapy is enhanced in the presence of the drug. Of course, the opposite is
also possible when the effect of one factor is counteracted by the presence of another factor.
Statistical interaction is sometimes confused with the concept of synergism in pharmacology.
1 In statistical texts, the interaction effect γ is normally denoted as (αβ)ij
25
However, the requirements for two drugs to be synergistic with each other are much more stringent
than just the superadditivity which is demonstrated by the presence of interaction (Tallarida,2001).
Factorial designs are highly efficient, but unfortunately their usage is not widespread. Although our
discussion here is restricted to the 2 × 2 factorial, more factors and more levels can be used.
However, the number of experimental units that are involved may become rather large. For
instance, an experiment with three factors at three levels involves 33 = 27 different treatment
combinations. When there are at least two experimental units in each treatment group, then 54
units will be required. Such an experiment may soon be too large to be practical.
Figure 10. Results from a hypothetical experiment in which the two factors interact
Table 3. Results from 2 x 2 factorial experiment without interaction
No drug Drug
No Radiation 18 15
Radiotherapy 13 10
When, from both a statistical and scientific point of view, there is no important interaction
between the two factors, it is possible to use the factorial experiment to increase the generality of
the results without increasing the size of the experiment. For instance, the mean results of the
experiment in Table 3 indicate complete additivity of the two factors. Suppose that for each
treatment condition n animals were tested. We can estimate the effect of the drug in animals
26
without radiation (15 – 18) and in those with radiotherapy as (10 – 13). We can combine these two
estimates and calculate the standard error on this new estimate. The latter would be based on 4 x
(n ‐1) degrees of freedom since all data from Table 3. The results obtained in this way would be
more general, in the sense that we can say that irrespective of radiotherapy or not, the drug will
induce a decrease in tumor size of 3. Furthermore, using the same data, the effect of radiotherapy
can be estimated in drug treated and solvent treated animals.
6.3 THE RANDOMIZED COMPLETE BLOCK DESIGN
In the randomized complete block design the effect of a single factor is investigated in the presence
of a single isolated extraneous source of variability (block) closely related to the response. We
already discussed the use of blocking as a design tool for enhancement of the signal‐to‐noise ratio in
Section 5.7.2.3. In a randomized complete block design, all treatments are applied within each block
and treatments are compared within the blocks. The randomization procedure randomizes
treatments separately within each block. When a study is designed such that the number of
experimental units within each block and treatment is equal, the design is called a balanced
randomized complete block design. The randomized complete block design is identified as a one‐
way treatment design and a one‐way block error‐control design.
There are two main reasons for choosing a randomized complete block design above a completely
randomized design. Suppose there is an extraneous factor that is strongly related to the outcome of
the experiment. It would be most unfortunate if our randomization procedure yielded a design in
which there was a great imbalance on this factor. If this were the case, the comparisons between
treatment groups would be confounded with differences in this nuisance factor and be biased. The
second main reason for a randomized complete block design is its possibility to considerably reduce
the error variation in our experiment, thereby making the comparisons more precise. The main
objection to a randomized complete block design is that it makes the strong assumption that there is
no interaction between the treatment variable and the blocking characteristics, i.e. that the effect of
the treatments is the same among all blocks.
Table 4. Number of viable cardiomyocytes in a paired experiment
Rat No. Solvent Drug Difference
1 44 68 24
2 64 88 24
3 62 81 19
4 46 54 8
5 76 92 16
27
When only two treatments are compared, the blocking can be simplified to a paired design. An
example of a paired experiment using animal as blocking factor is the following real‐life experiment.
Isolated cardiomyocytes provide an easy tool to assess the effect of drugs on calcium‐overload.
Cardiomyocytes of a single animal are isolated and seeded in plastic Petri dishes. The Petri dishes are
treated with the experimental drug or with its solvent. After a stabilization period the cells are
exposed to a stimulating substance (e.g. veratridine) and the percentage viable, i.e. rod‐shaped,
cardiomyocytes in a dish is counted. Although comparison of the treatment with the solvent control
within a single animal provides the best precision, it lacks external validity. Therefore, a paired
experiment with myocytes from different animals and with animal as blocking factor was carried
out. For each animal two Petri dishes containing exactly 100 cardiomyocytes were prepared. From
the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while
the remaining member received the vehicle. After stabilization and exposure to the stimulus, the
number of viable cardiomyocytes in each Petri dish was counted. The resulting data are contained in
Table 4 and displayed in Figure 11.
Figure 11. The gain in efficiency induced by blocking illustrated in a paired design. In the left panel the
myocyte experiment is considered as a completely randomized design and the two samples largely overlap
each other. In the right panel the lines connect the data of the same animal and show a marked effect of
treatment.
There are 10 experimental units, since Petri dishes can be independently assigned to vehicle or drug.
However, the statistical analysis should take the particular structure of the experiment into account.
More specifically, the pairing has imposed restrictions on the randomization such that data obtained
28
from one animal cannot be freely interchanged with that from another animal. Therefore, for each
pair the difference between the solvent‐treated and drug‐treated Petri dish is evaluated as is
illustrated in the right panel of Figure 11. For each pair the drug‐treated Petri dish yields consistently
a higher result than its vehicle control counterpart. Since the different pairs (animals) are
independent from one another, the mean difference and its standard error can be calculated and a
Student t‐test can be carried out. The mean difference is 18.2 with a standard error of 2.97.
Suppose the experimenter would not have used blocking, i.e. if he had used myocytes originating
from 10 completely different animals. The 10 Petri dishes would then be randomly distributed over
the two treatment groups. Suppose also that the results of this new experiment are identical to
those obtained in the paired experiment. As is illustrated in the left panel of Figure 11, the two
groups largely overlap one another. Since all experimental units are now independent of one
another, the effect of the drug is evaluated by calculating the difference between the two mean
values and its standard error1. Obviously, the mean difference is the same as in the paired
experiment. However, the standard error on the mean difference has risen considerably to a value
9.18, i.e. use of blocking induced a threefold increase in the precision of the experiment.
This example demonstrates that carrying out a paired experiment has the possibility to enhance the
precision of the experiment considerably, while the conclusions have the same validity as in a
completely randomized experiment. However, the forming of blocks of experimental units is only
successful when the criterion upon which the pairing is based, is related to the outcome of the
experiment. Using a characteristic that does not have an important effect on the response variables
as blocking factor is worse than useless since the statistical analysis will lose a bit of power by taking
the blocking into account. This can be of particular importance for small sized experiments.
6.4 THE LATIN SQUARE DESIGN
The Latin square design is an extension of the randomized complete block design, but now blocking
is done on two characteristics that affect the response variable. An obvious example from laboratory
practice is the simultaneous control of the row and column effects in microtiter plates, without loss
of precision. Another example concerns experiments on neuronal protection, where a pair of
animals was tested each day and the investigator expected a systematic difference not only between
the pairs but also between the animal tested in the morning and the one tested in the afternoon.
1 The standard error on the difference between two means is equal to
2 . . ⁄ , where SD is the square root of the mean of the two individual variances of the treated and control group. The variance is defined as the square of SD
29
In a Latin square design the k treatments are arranged in a k x k square such as in Table 5. Each of
the four treatments A, B, C, or D occurs exactly once in each row and exactly once in each column.
The Latin square design is categorized as a one‐way treatment design and two‐way block error
control design.
The main advantage of the Latin square design is that it simultaneous balances out two sources of
error. The disadvantage is the strong assumption that there are no interactions between the
blocking variables or between the treatment variable and blocking variables. In addition Latin square
designs are limited by the fact that the number of treatments, number of row, and number of
columns must all be equal. Fortunately, there are arrangements that do not have this limitation
(Cox, 1958).
Table 5. Arrangement for a 4 x 4 Latin square controlling for column and row effects. The letters A‐D indicate
the four treatments
Column 1 Column 2 Column 3 Column 4
Row 1 B A D C
Row 2 C B A D
Row 3 A D C B
Row 4 D C B A
In a k x k Latin square, only k experimental units are assigned to each treatment group. However, It
may happen that more experimental units are required to obtain an adequate precision. The Latin
square can then be replicated and several squares can be used to obtain the necessary sample size.
In doing this, there are two possibilities to consider. Either one stacks the squares on top of each
other and keeps them as separate independent squares, or one completely randomizes the order of
the rows (or columns) of the design. In general, keeping the squares separate is not a good idea and
leads to less precise estimation and loss of degrees of freedom, especially in the case of a 2 x 2
square. It is only when there is a reason to believe that the column (or row) effects are different
between the squares that keeping them separate makes sense.
6.5 SPLIT PLOT DESIGNS
This type of design incorporates subsampling and allows one to make comparisons among different
treatments at two or more sampling levels. The split plot design allows assessment of the effect of
two independent factors using different experimental units. The term split plot originates from
30
agricultural research where fields are randomly assigned to different levels of a primary factor and
smaller areas within the fields are randomly assigned to one level of another secondary factor.
An example of a split plot design is the following experiment on diets and vitamins. Cages each
containing two mice each were assigned at random to a number of dietary treatments (i.e. cage was
the experimental unit for comparing diets), and the color‐marked mice within the cage were
randomly selected to receive one of two vitamin treatments by injection (i.e. mice were the
experimental units for the vitamin effect).
The split‐plot design is a two‐way crossed (factorial) treatment design and a split plot error design.
Use of the split plot design as described above is less common in the field of biomedical research.
However, when the second factor is time its use is more frequent. This special type of split‐plot
design is then called a repeated measures design.
6.6 REPEATED MEASURES DESIGNS
The repeated measures design is a special case of the split plot design. In a repeated measures
design, we typically take multiple measurements on a subject over time. If any treatment is applied
to the subjects, they immediately become the whole plots, and “Time” is the subplot factor. The
major disadvantage of repeated measures designs is the presence of carry‐over effects by which the
results obtained for a treatment are influenced by the previous treatment. In addition, any
confounding of the treatment effect with time, as was the case in the use of self‐controls in Section
5.7.1.1, must be avoided. Therefore, a parallel control group must always be included in these
designs
7 THE REQUIRED NUMBER OF REPLICATES – SAMPLE SIZE
7.1 DETERMINING SAMPLE SIZE IS A RISK – COST ASSESSMENT
Replication is the basis of all experimental design and a natural question that arises in each study is
how many replicates are required. The more replicates, the more confidence we have in our
conclusions. Therefore, we would prefer to carry out our experiment on a sample that is as large as
possible. However, increasing the number of replicates incurs a rise in cost. Thus, the answer to how
large an experiment should be is that it should be just big enough to give confidence that any
biologically meaningful effect that exists can be detected.
31
7.2 THE CONTEXT OF BIOMEDICAL EXPERIMENTS
The estimation of the appropriate size of the experiment is straightforward and depends on the
statistical context, the assumptions made, and the study specifications. Context and specifications
on their turn depend on study objectives and design of the experiment.
In practice, the most frequently encountered contexts in statistical inference are point estimation,
interval estimation and hypothesis testing, of which hypothesis testing is definitely the most
important in biomedical studies.
7.3 THE HYPOTHESIS TESTING CONTEXT – THE POPULATION MODEL
In the hypothesis testing context one defines a null hypothesis and, for the purpose of sample size
estimation, an alternative hypothesis of interest. The null hypothesis will often be that the response
variable does not really depend on the treatment condition. For example, one may state as a null
hypothesis that the population means of a particular measurement are equal under two or more
different treatment conditions and that any differences found can be attributed to chance.
At the end of the study when the data are analyzed, we will either accept or reject the null
hypothesis in favor of the alternative hypothesis. As is indicated Table 6, there are four possible
outcomes at the end of the experiment. When the null hypothesis is true and we failed to reject it
then we have made the correct decision. This is also the case when the null hypothesis is false and
we did reject it. However, there are two decisions that are erroneous. If the null hypothesis is true
and we incorrectly rejected it, then we made a false positive decision. Conversely, if the alternative
hypothesis is true (i.e. the null hypothesis is false) and we failed to reject the null hypothesis we
have made a false negative decision.
Table 6. The decision process in hypothesis testing
State of Nature
Decision made
Null hypothesis true
Alternative hypothesis true
Do not reject null hypothesis Correct decision (1 – α)
False negative β
Reject null hypothesis False positive α
Correct decision (1 – β)
The basis of sample size calculation is formed by specifying an allowable rate of false positives and
an allowable rate of false negatives for a particular alternative hypothesis and then to estimate a
32
sample size just large enough so that these low error rates can be achieved. The allowable rate for
false positives is called the level of significance or alpha level and is usually set at values of 0.01,
0.05, or 0.10. The false negative rate depends on the postulated alternative hypothesis and is usually
described by its complement, i.e. the probability of rejecting the null hypothesis when the
alternative hypothesis holds. This is called the power of the statistical test of significance. Power
levels are usually 80% or 90%.
Significance level and power are already two of the four major determinants of the sample size for
hypothesis testing. The remaining two are the inherent variability in the study parameter of interest
and the size of the difference to be detected in the postulated alternative hypothesis. Other key
factors that determine the sample size are the number of treatments and the number of blocks used
in the experimental design.
When the significance level decreases or the power increases, the required sample size will become
larger. Similarly, when the variability is larger or the difference to be detected smaller, the required
sample size will also become larger. Conversely, when the difference to be detected is large or
variability low, the required sample size will be small.
It is convenient to express the difference in means as effect size by dividing it by the standard
deviation. The effect size then takes both the difference and inherent variability into account. Cohen
(1988) argues that effect sizes of 0.2, 0.5 and 0.8 can be regarded respectively as small, medium and
large.
7.4 SAMPLE SIZE CALCULATIONS
7.4.1 POWER ANALYSIS COMPUTATIONS
Now that we are familiar with the concepts of hypothesis testing and the determinants of sample
size, we can proceed with the actual calculations. There is a significant amount of free software
available to make elementary sample size calculations. In particular, there is the R‐package pwr
(Champely, 2009).
Assume we wish to plan an experiment comparing the mean values of two treatment groups. We
want to reject the null hypothesis of no difference at a level of significance of 0.05, whatever the
direction of the difference between the two samples (i.e. a two‐tailed test). We want to detect a
sufficient large effect between the two groups (d =0.8) with a power of 80%. The calculations carried
out in R show that 26 experimental units are required in each of the two treatment groups.
33
> pwr.t.test(d=0.8,power=0.8,sig.level=0.05,type="two.sample",
+ alternative="two.sided")
Two-sample t test power calculation
n = 25.52457
d = 0.8
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
A quick and dirty method for sample size calculation in 2 group comparison with a power of 0.8 and
a two‐sided critical value of 0.05 is provided by Lehr’s equation (Lehr, 1992; Van Belle, 2008):
16∆
where Δ represents the effect size and n stands for the required sample size in each treatment
group. For the above example, the equation results in: 16/0.64 = 25 animals per treatment group.
The numerator of Lehr’s equation depends on the desired power. Alternative values for the
numerator are 8 and 21 for a power of 50% and 90%, respectively.
7.4.2 MEAD’S RESOURCE REQUIREMENT EQUATION
There are occasions when it is difficult to use a power analysis, because there is no information on
the inherent variability (i.e. standard deviation) and/or because the effect size of interest is difficult
to specify. An alternative, quick and dirty method for approximate sample size determination was
proposed by Mead (1988). The method is appropriate for comparative experiments which can be
analyzed using analysis of variance (Grafen and Hails, 2002; Kutner et al., 2004), such as:
• Exploratory experiments
• Complex biological experiments with several factors and treatments
• Any experiment where the power analysis method is not possible or practicable
The method depends on the law of diminishing returns. Adding one experimental unit to a small
experiment gives good returns, while adding it to a large experiment does not do so. It has been
used by statisticians for decades, but has been explicitly justified by Mead (1988). An appropriate
sample size can be roughly determined by the number of degrees of freedom for the error term in
the analysis of variance or t test given by the formula:
N - 1 = E + T + B
34
where E, T and B are the error, treatment and block degrees of freedom (number of occurrences or
levels minus 1) in the ANOVA, and N stands for the total sample size. In order to obtain a good
estimate of error it is necessary to have at least 10 degrees of freedom, and many statisticians would
take 12 or 15 degrees of freedom as their preferred lower limit. On the other hand, if E is allowed to
be large, say greater than 20, then the experimenter is wasting resources. In a non‐blocked design
the total number of animals minus the number of treatments should be between ten and twenty.
Suppose an experiment is planned with four treatments, with eight animals per group (32 rats total).
In this case N=32, B=0 (no blocking), T=3, hence E=28. This experiment is a bit too large, and six
animals per group might be more appropriate (23 – 3 = 20).
There is one problem with this simple equation. It appears as though blocking is "bad" because it
reduces the error degrees of freedom. If the above example the experiment was going to be done in
eight blocks, then N=32, B=7, T=3 and E= 32‐1‐7‐3 =21 instead of 28. However, blocking nearly
always reduces the inherent variation which more than compensates for the decrease in the error
degrees of freedom unless the experiment is very small and the blocking criterion was not well
related to the response. Provided the error degrees of freedom is not less than about 6 in a blocked
design, then the experiment is probably of an adequate size.
7.4.3 MULTIPLICITY AND SAMPLE SIZE
As we will see in Section 8.5, when more statistical tests are to be carried out on the data the overall
rate of false positive findings is higher than the false positive rate for each one separately. To
circumvent this inflation of the false positive error rate, the critical value of each individual test is
usually set at a more stringent level (e.g. Bonferroni’s adjustment). However, as we already noted
above, when the significance level is set at a lower value, the required sample size will necessarily
increase.
It can be shown, that for a power of 80% or 90% carrying out two independent statistical tests
instead of one, involves a 20% larger sample size to maintain the overall error rate at its level of
0.05. Similarly, when 3 or 4 independent tests are involved the required sample size increases with
30% or 40%, respectively.
8 THE STATISTICAL ANALYSIS
8.1 THE STATISTICAL TRIANGLE
There is a one‐to‐one correspondence between the study objectives, the study design and the
analysis. The objectives of the study will indicate which of the designs may be considered. Once a
35
study design is selected, it will on its turn determine which type of analysis is appropriate. This
principle that the statistical analysis is determined by the way the experiment is conducted was
enunciated by Fisher in 1935:
"All that we need to emphasize immediately is that, if an experiment does allow us to calculate a
valid estimate of error, its structure must completely determine the statistical procedure by which
this estimate is to be calculated. If this were not so, no interpretation of the data could ever be
unambiguous; for we could never be sure that some other equally valid method of interpretation
would not lead to a different result."
In other words, choice of the statistical methods follows directly from the objectives and design of
the study. With this in mind, many of the complexities of the statistical analysis have now almost
become trivial
8.2 THE STATISTICAL MODEL REVISITED
We already stated that every experimental design is underpinned by a statistical model and that the
experimental results should be considered as being generated by this experimental model. This
conceptual framework greatly simplifies the statistical analysis to just fitting the statistical model to
the data and comparing the model component related to the treatment effect with the error
component (Grafen and Hails, 2002; Kutner et al., 2004). Hence, the choice of the appropriate
statistical analysis is straightforward. However, some important issue statistical issues remain, such
as the type data and the assumptions we make about the distribution of the data.
8.3 TYPES OF DATA
It is important to distinguish between continuous and discrete data. A continuous variable can, in
theory, take on an infinite number of values, but in practice, only a finite number of digits are kept. It
is the number and frequency of values that really make the distinction between continuous and
discrete.
A variable that takes on only a small number of values, say four or less, is called a discrete or
categorical variable and should be analyzed accordingly (Agresti, 1990). A variable whose values
encompass a wide range and for which many different values are observed is usually considered
continuous and on its turn requires specific methods for analysis of continuous data.
Within these broad categories of discrete and continuous data, the techniques for analysis differ. For
discrete data one distinguishes further interval, ordinal, nominal and binary or dichotomous data.
36
For a continuous variable, it can be assumed that the distribution of the error component of the
model is of a particular form. The most commonly used distributional form in statistics is that of the
normal or Gaussian distribution, which has been found to represent diverse sets of data. However, it
can be of interest to consider alternative distributions such as the log‐normal, exponential, gamma,
etc. Alternatively, one could apply methods that do not depend on an underlying distribution
(Lehmann, 1975). Both approaches have their advantages and disadvantages.
The final result of the statistical analysis is in many cases a p‐value. Unfortunately the concept of p‐
value is often misunderstood. It is related to but not the same as the level of significance α in the
hypothesis testing framework. Once the experiment, which was designed to test a specific null
hypothesis, is carried out the experimental data are used to calculate a quantity which we will call a
test statistic. The corresponding p‐value is then calculated as the probability of obtaining a test
statistic that is as extreme or more extreme than the one observed, provided the null hypothesis is
true. When this p‐value is less than the level of significance α, the result is declared significant.
8.4 VERIFYING THE STATISTICAL ASSUMPTIONS
When the inferential results are sensitive to the distributional and other assumptions of the
statistical analysis, it is essential that these assumptions are also verified. The aptness of the
statistical model is preferably assessed by informal methods such as diagnostic plotting (Grafen and
Hails, 2002; Kutner et al., 2004). When planning the experiment, historical data, or the results of the
pilot experiment can already be used for a preliminary verification of the model assumptions.
Another option is to use statistical methods that are robust against departures from the assumptions
(Lehmann, 1975).
8.5 MULTIPLICITY
In Section 4.2 we already mentioned that it is wise to limit the number of objectives in a study. The
same problem of multiplicity arises when a study includes a large number of variables, or when
measurements are made at a large number of time points. Only in studies of the most exploratory
nature the statistical analysis of every possible variable is acceptable. In this case, the exploratory
nature of the study should be stressed and the results interpreted with great care. The reason is that
in comparative experiments the overall rate of false positive findings across all variables is much
higher than the false positive rate for each one separately.
This is illustrated by the following example. Suppose a drug is tested at 20 different doses on a
variable. Suppose we reject the null hypothesis of no treatment effect for each dose separately
37
when the probability of falsely rejecting the null hypothesis (the significance level) is less than or
equal to 0.05. Then the overall probability of falsely declaring the existence of a treatment effect
when all underlying null hypotheses are in fact true is 0.641. This means that we are more likely to
get one significant result than not. The same multiplicity problem arises when a single dose of the
drug is tested on 20 independent variables. The problem of multiplicity is of particular importance
and magnitude in gene expression microarray experiments (Bretz, 2005). For example a microarray
experiment examines the differential expression of 30,000 genes in wildtype and in a mutant.
Assume that for each gene an appropriate two‐sided two‐sample test is performed at the 5%
significance level. Then we expect to obtain roughly 1,500 false‐positives. Strategies for dealing with
this curse of multiplicity in microarrays are provided by Amaratunga (2004) and Bretz ( 2005).
The multiplicity problem must at least be recognized at the planning stage. Ways to deal with it
(Curran‐Everett, 2000; Bretz et al., 2010) should be investigated and specified in the protocol.
9 THE STUDY PROTOCOL
The writing of the protocol finalizes the end of the research design phase. Every study should have a
written protocol before it is started. The complete study protocol consists of a more conceptual
research protocol and the technical protocol, which we already discussed in Section 5.7.1.3. The
research protocol states the rational for performing the study, the study’s objectives, the related
hypotheses and working hypotheses that are tested and their consequential predictions. It should
contain a section on experimental design, how treatments will be assigned to experimental units,
information and justification of planned sample sizes, and a description of the statistical analysis that
is to be performed. Defining the statistical methods in the protocol is of importance since it allows
preparation of the data analytic procedures beforehand and ensures against the misleading practice
of data dredging or data snooping. Writing down the statistical analysis plan beforehand prevents
also from trying several methods of analysis and report only those results that suit the investigator.
Such a practice is of course inappropriate, unscientific, and unethical.
Many investigators consider writing a detailed protocol a waste of time. However, the “smart
researcher” understands that by writing a good protocol he is actually preparing his final study
report. A well‐written protocol is even more essential when the design is complex or the study is
collaborative. Once the protocol has been formalized, it is important that it is followed as good as
possible and every deviation of it should be documented.
1 This probability is in fact 1 1 0.05
38
10 INTERPRETATION AND REPORTING While the previous sections focused on the planning phase of the study with the protocol as final
deliverable, this section deals with some points to consider when interpreting and reporting the
results of the statistical analysis. As a general rule in writing reports containing statistical
methodology, it is recommended not to use technical statistical terms such as “random”, “normal”,
“significant’, “correlation”, and “sample” in their everyday meaning.
10.1 THE METHODS SECTION
The Methods section should contain details of the experimental design such as the size and number
of experimental groups, how experiment units were assigned to treatment groups, how
experimental outcomes were assessed, and what statistical and analytical methods were used.
10.1.1 EXPERIMENTAL DESIGN
Readers should be told about weaknesses and strengths in study design, e.g. when randomization
and/or blinding was used it adds to the reliability of the data. A detailed description of the
randomization and blinding procedures, and how and when these were applied will allow the reader
to judge the quality of the study. Reasons for blocking and the blocking factors should be given and
how blocking was dealt with in the statistical analysis. When there is ambiguity about the
experimental unit, the unit used in the statistical analysis should be specified and a justification for
its choice should be provided.
10.1.2 STATISTICAL METHODS
Statistical methods should be described with enough detail to enable a knowledgeable reader with
access to the original data to verify the reported results. The authors should report and justify which
methods they used. A term like “tests of significance” is too vague, and should be more detailed.
The level of significance and, when applicable, direction of statistical tests should be specified, e.g.
“two‐sided p‐values less than or equal to 0.05 were considered to indicate statistical significance”.
Some procedures, e.g. analysis of variance, chi‐square tests, etc. are by definition two‐sided. Issues
about multiplicity (Section 8.5) and a justification of the strategy that deals with them should also be
addressed here.
The software used in the statistical analysis and its version should also be specified. When the R
system is used (see Dalgaard 2002), both R and the packages that were used should be referenced1
1 The function citation(‘pkgname’) will yield the required reference
39
10.2 THE RESULTS SECTION
10.2.1 SUMMARIZING THE DATA
The number of experimental units used in the analysis should always clearly be specified. When
possible findings should be quantified and presented with appropriate indicators of measurement
error or uncertainty. As measures of spread and precision, standard deviations (SD) and standard
errors (SEM) should not be confused. Standard deviations are a measure of spread and as such a
descriptive statistic, while standard errors are a measure of precision of the mean. Normally
distributed data should preferably be summarized as mean (SD), not as mean ± SD. For non‐normally
distributed data, medians and inter‐quartile ranges are the most appropriate summary statistics. The
practice of reporting mean ± SEM should preferably be replaced by the reporting of confidence
intervals which are more informative. Extremely small datasets should not be summarized at all
but should preferably be reported or displayed as raw data.
When reporting SD (or SEM) one should realize that for positive variables such as concentrations,
durations, and counts, the mean minus 2 x SD (or minus 2 x SEM x √ ) which indicates a lower 2.5%
of the distribution, can lead to a ridiculous negative value. In this case, an appropriate 95%
confidence interval based on the lognormal distribution, or alternatively a distribution‐free interval,
will avoid such a pitfall.
Figure 12. Scatter diagram with indication of median values and 95 % distribution‐free confidence intervals
Spurious precision detracts from a paper’s readability and credibility. Therefore, unnecessary
precision, particularly in tables, should be avoided. When presenting means and standard deviations,
M bM aF a F b
0
5
10
15
20
Occ
uren
ces
40
it is important to bear in mind the precision of the original data. Means should be given one
additional decimal place more than the raw data. Standard deviations and standard errors usually
require one more extra decimal place. Percentages should not be expressed to more than one
decimal place and with samples less than 100 the use of decimal places should be avoided.
Percentages should not be used at all for small samples. Note that the remarks about rounding only
apply to the presentation of results, rounding should not be done at all before or during the
statistical analysis.
10.2.2 GRAPHICAL DISPLAYS
Graphical displays complement tabular presentations of descriptive statistics. Generally, graphs are
better suited than tables for identifying patterns in the data, whereas tables are better for providing
large amounts of data with a high degree of numerical detail. When possible, one should always
attempt to graph individual data points. This especially the case when treatment groups are small.
Graphs such as Figure 12 and Figure 13 are much more informative than the usual mean ± SEM
plots. These graphs are easily constructed using the R language. Specifically the R package
“beeswarm” (Eklund, 2010) can be of great help.
Figure 13. Graphical display of longitudinal data showing individual patient profiles
10.2.3 INTERPRETING AND REPORTING SIGNIFICANCE TESTS
When data are summarized in the Results section, the statistical methods that were used to analyze
them should be specified. It is to the reader of little help to have in the Methods section a statement
such as “statistical methods included analysis of variance, regression analysis, as well as tests of
significance” without any reference to which specific procedure is reported in the Results part.
41
Tests of statistical significance should be two‐sided (two‐tailed). When comparing two means or
two proportions, there is a choice between a two‐sided or a one‐sided test. In a one‐sided test the
investigator alternative hypothesis specifies the direction of the difference, e.g. experimental
treatment greater than control. In a two‐sided test, no such direction is specified. A one‐sided test is
rarely appropriate and when one‐sided tests are used, their use should be justified (Bland and
Altman, 1994). For all two group comparisons, the report should clearly state whether one‐sided or
two‐sided p‐values are reported.
Exact p‐values, rather than statements such as “P<0.05” or even worse “NS” (not significant), should
be reported where possible. Readers can then compare the reported p‐value with their own choice
of critical values. One should avoid reporting a p‐value as p = 0.000 since a value with zero
probability of occurrence is, by definition, an impossible value. No observed event can ever a
probability of zero. Therefore, such an extreme small p‐value must be reported as p < 0.001. In
rounding a p‐value, it happens that a value that is technically larger than the critical value of 0.05,
say 0.051, is rounded down to p = 0.05. This is inaccurate and, to avoid this error, p‐values should be
reported to the third decimal. If a one sided test is used and the result is in the wrong direction, then
the report must state that p>0.05 (Levine and Atkin, 2004), or even better report the complement of
the p‐value, i.e. 1 – p.
An effect that is statistically significant is not necessarily of biomedical importance. Therefore, one
should avoid sole reliance on statistical hypothesis testing and preferably supplement ones findings
with confidence intervals which are more informative Confidence intervals on a difference of means
or proportions provide information about the size of an effect and its uncertainty and are of
particular value when the results of the test fail to reject the null hypothesis.
There is a common misconception among scientists that a non‐significant result implies that the null
hypothesis can be accepted. Consequently, they conclude that there is no effect of the treatment or
that there is no difference between the treatment groups. However, from a philosophical point of
view, one can never prove the non‐existence of something. As Fisher (1935) clearly pointed out “it
should be noted that the null hypothesis is never proved or established, but is possibly disproved,
in the course of experimentation”. To state it otherwise: Lack of evidence is no evidence for lack
of effect. However, in the case of a non‐significant result a confidence interval will provide us with a
region of plausible values for the magnitude of the treatment effect.
42
11 CONCLUDING REMARKS
We have looked at the complexities of the research process from the vantage point of a generalist.
Statistical thinking was introduced as a non‐specialist generalist skill that permeates the entire
research process. The seven principles of statistical thinking were formulated as: 1) time spent
thinking on the conceptualization and design of an experiment is time wisely spent; 2) the design of
an experiment reflects the contributions from different sources of variability; 3) the design of an
experiment balances between its internal validity (proper control of noise) and external validity (the
experiment’s generalizability); 4) good experimental practice provides the clue to bias minimization;
5) good experimental design is the clue to the control of variability; 6) experimental design
integrates various disciplines; 7) a priori consideration of statistical power is an indispensable pillar
of an effective experiment.
We elaborated on each of these and finally discussed some points to consider in the interpretation
and reporting. What we didn’t touch was the role of the statistician in the research project. The
statistician is a professional particularly skilled in solving research problems. She should be
considered as a team member and often even as collaborator or partner in the research process in
which she can have a critical role. Whenever possible, the statistician should be consulted when
there is doubt with regard to design, sample size, or statistical analysis. A statistician working closely
together with a scientist can greatly improve the project’s likelihood of success. Many applied
statisticians become involved into the subject area and, by virtue of their statistical training, take on
the role of statistical thinker, thereby permeating the research process. In a great many instances
this key role of the statistician is recognized and granted with a co‐authorship.
The most effective way to work with a consulting statistician is to include her or him from the very
beginning of the project. What should be avoided is contacting the statistical support group after the
experiment has reached its completion. Finally, to quote once more R.A. Fisher: “To consult the
statistician after an experiment is finished is often merely to ask him to conduct a post mortem
examination. He can perhaps say what the experiment died of.” (Presidential Address to the First
Indian Statistical Congress, 1938; Fisher, 1938).
12 RECOMMENDED READING
The book by David Salsburg “The Lady Tasting Tea” is a lucidly written account of the history of
statistics and how statistical thinking revolutionized 20th Century science. A clear and comprehensive
work on experimental design is the book by Murray Selwyn “Principles of Experimental Design for
the Life Sciences”. On a more introductory level is the book by Ruxton and Colegrave “Experimental
43
Design for the Life Sciences”. A gentle introduction to statistics in general and hypothesis testing,
confidence intervals and analysis of variance in particular, can be found in the highly recommended
book of the two Wonnacott brothers “Introductory Statistics”. For those who want to carry out their
analyses in the freely available R‐language is the book by Peter Dalgaard “Introductory Statistics with
R” a good starter. Guidances and tips for efficient data visualizations can be found in “The Visual
Display of Quantitative Information” by Edward Turfte and in the two books by William Cleveland:
“Visualizing Data” and “The Elements of Graphing Data”. In “Speaking of Graphics”, Paul Lewi takes
the reader on a fascinating journey through the history of statistical graphics. This e‐book is freely
available on http://www. datascope.be).
45
REFERENCES AND SELECTED BIBLIOGRAPHY
Agresti A (1990). The Analysis of Categorical Data. J. Wiley, New Tork, NY.
Amaratunga D, Cabrera J (2004). Exploration and Analysis of DNA Microarray and Protein Array
Data. J. Wiley, New Yokr, NY.
Bailar III JC, Mostelller F (1988). Guidelines for statistical reporting in articles for medical journals.
Ann Int Medicine 108, 226‐273.
Bland M, Altman DG (1994). One and two sided tests of significance. British Medical Journal 309,
248.
Bretz F, Landgrebe J, Brunner E (2005).Multiplicity issues in microarray experiments. Methods Inf
Med 44, 431‐437.
Bretz F, Hothorn T, Westfall P (2010). Multiple Comparisons Using R. CRC Press, Boca Raton
Burrows PM, Scott SW, Barnett OW, McLaughlin MR (1984). Use of experimental designs with
quantitative ELISA. J Virol Methods 8(3), 207‐216.
Champely S (2009). pwr: Basic functions for power analysis. R package version 1.1.1. http://CRAN.R‐
project.org/package=pwr
Cleveland WS (1993). Visualizing Data. Hobart Press, Summit, NJ
Cleveland WS (1994). The Elements of Graphing Data. Hobart Press, Summit, NJ
Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences.2nd Ed. Lawrence Erlbaum
Associates, Hillsdale, NJ.
Cox DR (1958). Planning of Experiments. J. Wiley, New York, NY.
Curran‐Everett D(2000). Multiple comparisons: philosophies and illustrations. Am J Physiol
Regulatory Integrative Comp Physiol 279, R1‐R8.
Dalgaard P(2002). Introductory Statistics wit R. Springer, New York, NY.
Eklund AC (2010). beeswarm: The bee swarm plot, an alternative to stripchart.. R package version
0.0.7. http://CRAN.R‐project.org/package=beeswarm.
46
Faessel H, Levasseur L, Slocum H, Greco W (1999) Parabolic growth patterns in 96‐well plate cell
growth experiments. In Vitro Cell.Dev.Biol.‐Animal 35: 270‐278.
Fisher RA (1935). The Design of Experiments , (8th Ed. 1966), Hafner Press, New York, NY (First
published 1935, Oliver‐Boyd, London UK).
Fisher RA (1938). Presidential address: The first session of the Indian Statistical Conference, Calcutta,
Sankhya, 4, 14—17.
Giesbrecht FG, Gumpertz, ML (2004). Planning, Construction, and Statistical Analysis of Comparative
Experiments. J. Wiley, New York, NY.
Grafen A, Hails R. (2002). Modern Statistics for the Life Sciences. Oxford University Press, Oxford, UK.
Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124.
doi:10.1371/journal.pmed.0020124
Kilkenny C, Parsons N, Kadyszewski E, Festing MFW, Cuthill, IC, Fry D, Hutton J, Altman DG. (2009)
Survey of the quality of experimental design, statistical analysis and reporting of research
using animals. PLoS ONE 4 (11), e7824.
Kutner MH, Nachtsheim C, Neter J, Li W (2004). Applied Linear Statistical Models, 5th ed.,Chicago:
McGraw‐Hill/Irwin
Lehmann EL (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden‐Day:. San
Francisco, CA.
Lehr R (1992). Sixteen s squared over d squared: a relation for crude sample size estimates. Statistics
in Medicine 11, 1099–1102.
Levasseur L, Faessel H, Slocum H, Greco W . (1995) Precision and pattern in 96‐well plate cell growth
experiments. Proc Am Stat Assoc Biopharm Sect, 227‐232..
Levine TR., Atkin C (2004). The accurate reporting of software‐generated p‐values: A cautionary
note. Communication Research Reports, 21, 324‐327.
Lewi PJ (2006). Speaking of Graphics. http://www.datascope.be
Mead R (1988). The design of experiments. Cambridge University Pres,s Cambridge, NY.
Nadon R, Shoemaker J (2002). Statistical issues with microarrays: processing and analysis. Trends in
Genetics 15 (5), 265‐271.
47
R Development Core Team (2010). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3‐900051‐07‐0,
URL http://www.R‐project.org/.
Rivenson A, Hoffmann D, Prokopczyk B, Amin S, Hecht SS (1988). Induction of lung and exocrine
pancreas tumors in F344 rats by tobacco‐specific and Areca‐derived N‐nitrosamines. Cancer
Research, 48, 6912‐6917.
Ruxton GD, Colegrave N (2003). Experimental Design for the Life Science. Oxford University Press,
Oxford, UK.
Salsburg D (2001).The Lady Tasting Tea. Freeman, New York, NY.
Selwyn MR (1996). Principles of Experimental Design for the Life Sciences. CRC Pres, Boca Raton, FL.
Tallarida, RJ (2001). Drug synergism: its detection and applications. J Pharmacol & Exp Ther 298 (3),
865‐872.
Temme A, Sümpel F, Rieber GSEP, Willecke KJK, Ott T. (2001) Dilated bile canaliculi and attenuated
decrease of nerve‐dependent bile secretion in connexin32‐deficient mouse liver. Eur J Physiol
442, 961‐966.
Tufte ER (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
Van Belle G (2008). Statistical Rules of Thumb. J. Wiley, New York, NY.
Vandenbroeck P, Wouters L, Molenberghs G, Van Gestel J, Bijnens L (2006). Teaching statistical
thinking to life scientists. A case‐based approach. J Biopharm Stat 16, 61‐75.
Wells HG (1903). Mankind In The Making. 2007 Edition. 1st World Library, Fairfield, IA. p. 149
Wells HG (1938). World Brain. Cambridge University Press, London, UK, p. 88.
(http://ebooks.adelaide.edu.au/w/wells/hg/world_brain/).
Wilks SS (1951). Undergraduate statistical education. J Amer Statist Assoc 46, 1‐18.
Wonnacott TH, Wonnacott RJ (1990). Introductory Statistics. 5th Ed., J. Wiley, New York, NY.
49
APPENDIX TOOLS FOR RANDOMIZATION IN MS EXCEL AND R
COMPLETELY RANDOMIZED DESIGN
Suppose 21 experimental units have to be randomly assigned to three treatment groups, such that
each group contains exactly seven animals
A. MS EXCEL
A randomization list is easily constructed using a spreadsheet program like Microsoft Excel. This is
illustrated in Figure 14. We enter in the first column of the spreadsheet the code for the treatment
(1, 2, 3). Using the RAND() function, we fill the second column with pseudo‐random numbers
between 0 and 1. Subsequently the two columns are selected and the Sort command from the Data‐
menu is executed. In the Sort window that appears now, we select column B as column to be sorted
by. The treatment codes in column A are now in random order, i.e. the first animal will receive
treatment 2, the second treatment 3, etc.
Figure 14. Generating a completely randomized design in MS Excel
B. R‐LAN
In the ope
> set.se> x<-c(r> x [1] "A""C" "A" > rx<-sa> rx [1] "B" "C" "B"
RANDOM
Suppose 2
treatment
A. MS EX
In MS Exc
spreadshe
indication
numbers
from the
NGUAGE
en source stat
eed(1324)rep(c("A",
" "B" "C" "B" "C" ample(x) #
"B" "A" ""A"
MIZED COMP
20 experimen
t groups A, B,
XCEL
Figure
cel follow the
eet the code
n of the block
between 0 a
Data‐menu i
tistical langua
,”B”,”C”),7
"A" "B" "C
# randomize
"C" "A" "C"
PLETE BLOCK
ntal units, or
, C, D, such th
e 15. Generati
e procedure t
for the treat
(1:5). Using t
and 1. Subse
s executed. I
age R, the sam
7) # vecto
C" "A" "B"
ed treatme
" "B" "B"
K DESIGN
rganized in 5
hat each treat
ing a randomi
that is depict
ment (A, C, B
the RAND() fu
quently the
n the Sort wi
50
me result is o
or with tre
" "C" "A" "
ent assignm
"A" "A" "B
blocks of siz
tment occurs
ized complete
ted in Figure
B, D). The sec
unction, we fi
three column
indow that ap
obtained by:
eatment gr
"B" "C" "A
ment
B" "C" "C"
ze 4 have to
s exactly once
block design i
15. We ente
cond column
ill the third co
ns are select
ppears now,
roups A,B,C
A" "B" "C"
" "B" "C" "
be randomly
e in each bloc
n MS Excel
er in the first
(Column B) is
olumn with p
ted and the S
we select Co
C x 7
"A" "B"
"C" "A" "A
y assigned to
k.
column of th
s filled with a
seudo‐rando
Sort comman
olumn B as fir
A"
4
he
an
m
nd
rst
51
sort criterion and Column C as second sort criterion. The treatment codes in column A are now for
each block in random order, i.e. the first animal in block 1 will receive treatment A, the second
treatment D, etc.
B. R LANGUAGE
> set.seed(3223) > treat<- rep(c("A",”B”,”C”,”D”),5) # treatments > design<-data.frame(treat=treat,block=rep(1:5,rep(4,5)))#treats & blocks > rdesign<-design[sample(dim(design)[1]),] # random sequence > rdesign<-rdesign[order(rdesign$block),] # order by blocks for convenience > rdesign
treat block 11 C 1 6 B 1 16 D 1 1 A 1 17 D 2 2 A 2 12 C 2 7 B 2 13 C 3 18 D 3 3 A 3 8 B 3 14 C 4 9 B 4 4 A 4 19 D 4 5 A 5 15 C 5 10 B 5 20 D 5