Introduction to study designs - case-control studies
Introduction
Learning objectives: You will learn the basics of case-control
studies, their analysis and the interpretation of their results.
Case-control studies are one of the most frequently used study
designs because they are relatively easy to apply in comparison with
other study designs. This section introduces the basic concepts,
application and strengths of the case-control study. It also
covers:
1. Issues in the design of case-control studies
2. Common sources of bias in a case-control study
3. Analysis of case-control studies
4. Strengths and weaknesses of case-control studies
5. Nested case-control studies
Case-control studies start with the identification of a group
of cases (individuals with a particular health outcome) in a given
population and a group of controls (individuals without the health
outcome) to be included in the study.
In a case-control study the prevalence of exposure to a
potential risk factor(s) is compared between cases and controls. If
the prevalence of exposure is higher among cases than among
controls, it may be a risk factor for the outcome under
investigation. A major characteristic of case-control studies is
that data on potential risk factors are collected retrospectively
and as a result may give rise to bias. This is a particular problem
associated with case-control studies and therefore needs to be
carefully considered during the design and conduct of the study.
1. Issues in the design of case-control studies
Formulation of a clearly defined hypothesis
As with all epidemiological investigations, a case-control study
should begin with the formulation of a clearly defined hypothesis.
Case definition
It is essential that the case definition is clearly stated at the
outset of the investigation to ensure that all cases included in
the study are based on the same diagnostic criteria.
Source of cases
The source of cases needs to be clearly defined.
Selection of cases
Case-control studies may use incident or prevalent cases.
Incident cases comprise cases newly diagnosed during
a defined time period. The use of incident cases is generally
preferred, as the recall of past exposure(s) may be more
accurate among newly diagnosed cases. In addition, the temporal
sequence of exposure and disease is easier to assess among incident
cases.
Prevalent cases comprise individuals who have had the outcome
under investigation for some time. The use of prevalent cases may
give rise to recall bias, as prevalent cases may be less likely to
accurately report past exposure(s). As a result, the
interpretation of results based on prevalent cases may prove more
problematic, as it may be more difficult to ensure that reported
events relate to a time before the development of disease rather
than to the consequence of the disease process itself. For example,
individuals may modify their exposure following the onset of
disease. In addition, unless the effect of exposure on duration of
illness is known, it will not be possible to determine the extent
to which a particular characteristic is related to the prognosis of
the disease once it develops rather than to its cause.
Source of cases
Cases may be recruited from a number of sources; for example,
they may be recruited from a hospital, clinic or GP register, or
may be population based. Population-based case-control studies are
generally more expensive and difficult to conduct.
Selection of controls
A particular problem inherent in case-control studies is
the selection of a comparable control group. Controls are used to
estimate the prevalence of exposure in the population which gave
rise to the cases. Therefore, the ideal control group would
comprise a random sample from the general population that gave rise
to the cases. However, this is not always possible in practice. The
goal is to select individuals in whom the distribution of exposure
status would be the same as that of the cases in the absence of an
exposure disease association. That is, if there is no true
association between exposure and disease, the cases and controls
should have the same distribution of exposure. The source of
controls is dependent on the source of cases. In order to minimize
bias, controls should be selected to be a representative sample of
the population which produced the cases. For example, if cases are
selected from a defined population such as a GP register, then
controls should comprise a sample from the same GP register.
In case-control studies where cases are hospital based, it is
common to recruit controls from the hospital population. However,
the choice of controls from a hospital setting should not include
individuals with an outcome related to the exposure(s) being
studied. For example, in a case-control study of the association
between smoking and lung cancer the inclusion of controls being
treated for a condition related to smoking (e.g. chronic
bronchitis) may result in an underestimate of the strength of the
association between exposure (smoking) and outcome. Recruiting more
than one control per case may improve the statistical power of the
study, though including more than four controls per case is
generally considered to add little further efficiency.
Measuring exposure status
Exposure status is measured to assess the presence or level
of exposure for each individual for the period of time prior to the
onset of the disease or condition under investigation when the
exposure would have acted as a causal factor. Note that in
case-control studies the measurement of exposure is established
after the development of disease and as a result is prone to both
recall and observer bias. Various methods can be used to ascertain
exposure status. These include:
Standardized questionnaires
Biological samples
Interviews with the subject
Interviews with spouse or other family members
Medical records
Employment records
Pharmacy records
The procedures used for the collection of exposure
data should be the same for cases and controls.
2. Common sources of bias in case-control studies
Due to the retrospective nature of
case-control studies, they are particularly susceptible to the
effects of bias, which may be introduced as a result of a poor
study design or during the collection of exposure and outcome data.
Because the disease and exposure have already occurred at the
outset of a case control study, there may be differential reporting
of exposure information between cases and controls based on their
disease status. For example, cases and controls may recall past
exposure differently (recall bias). Similarly, the recording of
exposure information may vary depending on the investigator's
knowledge of an individual's disease status (interviewer/observer
bias). Therefore, the design and conduct of the study must be
carefully considered, as there are limited options for the control
of bias during the analysis.
Selection bias in case-control studies
Selection bias is a particular problem inherent in
case-control studies, where it gives rise to non-comparability
between cases and controls. Selection bias in case-control studies
may occur when: 'cases (or controls) are included in (or excluded
from) a study because of some characteristic they exhibit which is
related to exposure to the risk factor under evaluation' [1]. The
aim of a case-control study is to select study controls who are
representative of the population which produced the cases. Controls
are used to provide an estimate of the exposure rate in the
population. Therefore, selection bias may occur when those
individuals selected as controls are unrepresentative of the
population that produced the cases.
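A small numerical sketch may help show why unrepresentative controls matter. It uses the odds ratio described in the analysis section of this text. The counts are hypothetical and are not taken from any study, and `odds_ratio` is an illustrative helper rather than a library function. When exposure is over-represented among the controls, the estimated odds ratio is pulled towards 1.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio from the four cells of a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

# Hypothetical cases: 60 exposed, 40 unexposed.
cases = (60, 40)

# Controls assumed to be representative of the source population (25% exposed).
representative_controls = (100, 300)

# Hospital controls in which exposure is assumed to be over-represented (40% exposed).
hospital_controls = (160, 240)

or_representative = odds_ratio(*cases, *representative_controls)
or_hospital = odds_ratio(*cases, *hospital_controls)

print(f"OR with representative controls: {or_representative:.2f}")
print(f"OR with hospital controls:       {or_hospital:.2f}")
```

With these assumed counts the same cases give a weaker apparent association when compared with the hospital controls, which is the underestimation that representative control selection is meant to avoid.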
The potential for selection bias in case-control studies is a
particular problem when cases and controls are recruited
exclusively from hospitals or clinics. Hospital patients tend to
have different characteristics from the general population; for example,
they may have higher levels of alcohol consumption or cigarette
smoking. If these characteristics are related to the exposures
under investigation, then estimates of the exposure among controls
may be different from that in the reference population, which may
result in a biased estimate of the association between exposure and
disease. Berkson's bias (also called Berksonian bias) is a selection
bias introduced in hospital-based case-control studies due to
varying rates of hospital admission.
As the potential for selection bias is likely to be less of a
problem in population-based case-control studies, neighbourhood
controls may be a preferable choice when using cases from a
hospital or clinic setting. Alternatively, the potential for
selection bias may be minimized by selecting controls from more
than one source, such as by using both hospital and neighbourhood
controls. Selection bias may also be introduced in case-control
studies when exposed cases are more likely to be selected than
unexposed cases.
3. Analysis of case-control studies
The odds ratio (OR) is used in case-control studies to estimate the
strength of the association between exposure and outcome. Note that
it is not possible to estimate the incidence of disease from a
case-control study unless the study is population based and all
cases in a defined population are obtained.
The results of a case-control study can be presented in a 2x2 table
as follows:
               Cases   Controls
Exposed          a        b
Unexposed        c        d
The odds ratio is a measure of the odds of disease in the
exposed compared to the odds of disease in the unexposed. In a
case-control study it is estimated by comparing the odds of
exposure among cases (a/c) with the odds of exposure among
controls (b/d), and is calculated as:
OR = (a/c) / (b/d) = (a x d) / (b x c)
Example: Calculation of the OR from a hypothetical case-control
study of smoking and cancer of the pancreas among 100 cases and 400
controls.
Table 1. Hypothetical case-control study of smoking and
cancer of the pancreas.
               Cases   Controls
Smokers          60      100
Non-smokers      40      300
Total           100      400
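The calculation for this example can also be sketched in Python. The cell counts are those implied by the example's own arithmetic (60 exposed and 40 unexposed cases; 100 exposed and 300 unexposed controls). The confidence interval uses the log-based (Woolf) approximation, which is an assumption of this sketch because the text does not say which method should be used. The function name `odds_ratio_with_ci` is illustrative.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table.

    a: exposed cases,   b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    The interval is the log-based (Woolf) approximation, chosen for
    this sketch only; z=1.96 gives an approximate 95% interval.
    """
    or_estimate = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_estimate) - z * se_log_or)
    upper = math.exp(math.log(or_estimate) + z * se_log_or)
    return or_estimate, lower, upper

# Hypothetical smoking and pancreatic cancer example:
# 60 of 100 cases and 100 of 400 controls were smokers.
or_estimate, lower, upper = odds_ratio_with_ci(a=60, b=100, c=40, d=300)
print(f"OR = {or_estimate:.2f}, approximate 95% CI {lower:.2f} to {upper:.2f}")
```

This is an unadjusted estimate; it does not account for confounding.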
OR = (60 x 300) / (100 x 40) = 4.5
The OR calculated from the hypothetical data in Table 1 estimates
that the odds of cancer of the pancreas among smokers are 4.5 times
the odds among non-smokers. NB: This odds ratio has been
calculated without adjusting for potential confounders. Further
analysis of the data would involve stratifying by levels of
potential confounders such as age. The 2x2 table can then be
extended to allow stratum-specific estimates for each level of the
confounding variable(s) to be calculated and, where appropriate, an
overall summary measure, adjusted for the effects of confounding,
and a statistical test of significance can also be calculated. In
addition, confidence intervals for the odds ratio would also be
presented.
4. Strengths and weaknesses of case-control studies
Strengths
Cost effective relative to other analytical studies such as cohort studies.
Case-control studies are retrospective, and cases are identified at the beginning of the study; therefore there is no long follow-up period (as compared to cohort studies).
Efficient for the study of diseases with long latency periods.
Efficient for the study of rare diseases.
Good for examining multiple exposures.
Weaknesses
Particularly prone to bias, especially selection, recall and observer bias.
Case-control studies are limited to examining one outcome.
Unable to estimate incidence rates of disease (unless the study is population based).
Poor choice for the study of rare exposures.
The temporal sequence between exposure and disease may be difficult to determine.
References
1. Hennekens CH, Buring JE. Epidemiology in Medicine. Lippincott Williams & Wilkins, 1987.
Case-Control Studies
Cohort studies have an intuitive logic to them, but they can be
very problematic when:
1. The outcomes being investigated are rare;
2. There is a long time period between the exposure of interest and the development of the disease; or
3. It is expensive or very difficult to obtain exposure information from a cohort.
In the first case, the rarity of the disease requires enrollment of
very large numbers of people. In the second case, the long period
of follow-up requires efforts to keep contact with and collect
outcome information from individuals. In all three situations, cost
and feasibility become important concerns.
A case-control design
offers an alternative that is much more efficient. The goal of a
case-control study is the same as that of cohort studies, i.e. to
estimate the magnitude of association between an exposure and an
outcome. However, case-control studies employ a different sampling
strategy that gives them greater efficiency. As with a cohort
study, a case-control study attempts to identify all people who
have developed the disease of interest in the defined population.
This is not because they are inherently more important to
estimating an association, but because they are almost always rarer
than non-diseased individuals, and one of the requirements of
accurate estimation of the association is that there are reasonable
numbers of people in both the numerators (cases) and denominators
(people or person-time) in the measures of disease frequency for
both exposed and reference groups. However, because most of the
denominator is made up of people who do not develop disease, the
case-control design avoids the need to collect information on the
entire population by selecting a sample of the underlying
population.
Rothman describes the case-control strategy as
follows:
"Case-control studies are best understood by considering as
the starting point a source population, which represents a
hypothetical study population in which a cohort study might have
been conducted. The source population is the population that gives
rise to the cases included in the study. If a cohort study were
undertaken, we would define the exposed and unexposed cohorts (or
several cohorts) and from these populations obtain denominators for
the incidence rates or risks that would be calculated for each
cohort. We would then identify the number of cases occurring in
each cohort and calculate the risk or incidence rate for each. In a
case-control study the same cases are identified and classified as
to whether they belong to the exposed or unexposed cohort. Instead
of obtaining the denominators for the rates or risks, however, a
control group is sampled from the entire source population that
gives rise to the cases. Individuals in the control group are then
classified into exposed and unexposed categories. The purpose of
the control group is to determine the relative size of the exposed
and unexposed components of the source population."
To illustrate this consider the following hypothetical scenario
in which the source population is the state of Massachusetts.
Diseased individuals are red, and non-diseased individuals are
blue. Exposed individuals are indicated by a whitish midsection.
Note the following aspects of the depicted scenario:
1. The outcome being investigated is rare.
2. There is a fairly large number of exposed individuals in the state, but most of these are not diseased.
3. The proportion of exposed individuals among the disease cases (7/13) is higher than the proportion of exposure among the controls.
If I somehow had exposure and outcome information on all of the
subjects in the source population and looked at the association
using a cohort design, it might look like this:
               Diseased   Non-diseased   Total
Exposed            7          1,000      1,007
Non-exposed        6          5,634      5,640
Therefore, the incidence in the exposed individuals would be
7/1,007 = 0.70%, and the incidence in the non-exposed individuals
would be 6/5,640 = 0.11%. Consequently, the risk ratio would be
(7/1,007) / (6/5,640), or approximately 6.5, suggesting that those
who had the risk factor (exposure) had 6.5 times the risk of
getting the disease compared to those without the risk factor.
This is a strong association.
In this hypothetical example, I had data on all 6,647 people in the
source population, and I could compute the probability of disease
(i.e., the risk or incidence) in both the exposed group and the
non-exposed group, because I had the denominators for both the
exposed and non-exposed groups.
The problem, of course, is that I
usually don't have the resources to get the data on all subjects in
the population. If I took a random sample of even 5-10% of the
population, I might not have any diseased people in my sample.
An alternative approach would be to use surveillance databases or
administrative databases to find most or all 13 of the cases in the
source population and determine their exposure status. However,
instead of enrolling all of the other 6,634 residents, suppose I
were to just take a sample of the non-diseased population. In fact,
suppose I only took a sample of 1% of the non-diseased people and I
then determined their exposure status. The data might look
something like this:
               Diseased   Non-diseased   Total
Exposed            7           10        unknown
Non-exposed        6           56        unknown
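As a check on the arithmetic, the two tables can be compared in a short Python sketch. The variable and function names are illustrative and not from the source. It computes the risk ratio from the full source population and the odds ratio, which the text goes on to introduce, from the case-control sample.

```python
# Full source population (cohort view): exposure -> (diseased, non_diseased)
population = {"exposed": (7, 1000), "non_exposed": (6, 5634)}

# Case-control view: all 13 cases plus a 1% sample of the non-diseased
sample = {"exposed": (7, 10), "non_exposed": (6, 56)}

def risk(diseased, non_diseased):
    """Probability of disease: diseased divided by everyone in the group."""
    return diseased / (diseased + non_diseased)

def odds(diseased, non_diseased):
    """Odds of disease: diseased divided by non-diseased."""
    return diseased / non_diseased

# The risk ratio needs the full denominators, so it uses the population table.
risk_ratio = risk(*population["exposed"]) / risk(*population["non_exposed"])

# The odds ratio needs only the sampled counts.
odds_ratio = odds(*sample["exposed"]) / odds(*sample["non_exposed"])

print(f"Risk ratio from the full population:     {risk_ratio:.2f}")
print(f"Odds ratio from the case-control sample: {odds_ratio:.2f}")
```

Both values come out at roughly 6.5, even though the second calculation uses only 79 of the 6,647 people.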
With this sampling approach I can no longer compute the
probability of disease in each exposure group, because I no longer
have the denominators in the last column. In other words, I don't
know the exposure distribution for the entire source population.
However, the small control sample of non-diseased subjects gives me
a way to estimate the exposure distribution in the source
population. So, I can't compute the probability of disease in each
exposure group, but I can compute the odds of disease in each
group.
The Odds Ratio
The odds of disease in the exposed group are
7/10, and the odds of disease in the non-exposed group are 6/56. If
I compute the odds ratio, I get (7/10) / (6/56), or approximately 6.5, very close
to the risk ratio that I computed from data for the entire
population. We will consider odds ratios and case-control studies
in much greater depth in a later module. However, for the time
being the key things to remember are that:
1. The sampling strategy for a case-control study is very different from that of cohort studies, despite the fact that both have the goal of estimating the magnitude of association between the exposure and the outcome.
2. In a case-control study there is no "follow-up" period. One starts by identifying diseased subjects and determines their exposure distribution; one then takes a sample of the source population that produced those cases in order to estimate the exposure distribution in the overall source population. [In cohort studies none of the subjects have the outcome at the beginning of the follow-up period.]
3. In a case-control study, you cannot measure incidence, because you start with diseased people and non-diseased people, so you cannot calculate relative risk directly.
4. The case-control design is very efficient. In the example above the case-control study of only 79 subjects produced an odds ratio of about 6.5, a very close approximation to the risk ratio obtained from the data in the entire population.
5. Case-control studies are particularly useful when the outcome is rare in both exposed and non-exposed people.
The Difference Between "Probability" and "Odds"?
The probability that an event will occur is the fraction of
times you expect to see that event in many trials. Probabilities
always range between 0 and 1. The odds are defined as the
probability that the event will occur divided by the probability
that the event will not occur.
If the probability of an event
occurring is Y, then the probability of the event not occurring is
1-Y. (Example: If the probability of an event is 0.80 (80%), then
the probability that the event will not occur is 1-0.80 = 0.20, or
20%.)
The odds of an event represent the ratio of the (probability that
the event will occur) / (probability that the event will not
occur). This could be expressed as follows:
Odds of event = Y / (1-Y)
So, in this example, if the probability of the event occurring
= 0.80, then the odds are 0.80 / (1-0.80) = 0.80/0.20 = 4 (i.e., 4
to 1).
If a race horse runs 100 races and wins 25 times and loses the
other 75 times, the probability of winning is 25/100 = 0.25 or 25%,
but the odds of the horse winning are 25/75 = 0.333 or 1 win to 3
losses. If the horse runs 100 races and wins 5 and loses the other
95 times, the probability of winning is 0.05 or 5%, and the odds of
the horse winning are 5/95 = 0.0526. If the horse runs 100 races
and wins 50, the probability of winning is 50/100 = 0.50 or 50%,
and the odds of winning are 50/50 = 1 (even odds). If the horse
runs 100 races and wins 80, the probability of winning is 80/100 =
0.80 or 80%, and the odds of winning are 80/20 = 4 to 1.
NOTE that
when the probability is low, the odds and the probability are very
similar.
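The conversions in the race horse examples can be written as two small functions. This is a minimal sketch; the function names are illustrative and not from the source.

```python
def odds_from_probability(p):
    """Odds = p / (1 - p); undefined when p equals 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def probability_from_odds(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

# The four race horse scenarios: wins out of 100 races.
for wins in (5, 25, 50, 80):
    p = wins / 100
    print(f"wins={wins:3d}  probability={p:.2f}  odds={odds_from_probability(p):.4f}")
```

Running the loop shows the gap between probability and odds widening as the probability grows, which is why the two can be treated as similar only when the event is uncommon.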
On Sept. 8, 2011 the New York Times ran an article on the
economy in which the writer began by saying "If history is a guide,
the odds that the American economy is falling into a double-dip
recession have risen sharply in recent weeks and may even have
reached 50 percent."
Further down in the article the author quoted
the economist who had been interviewed for the story. What the
economist had actually said was, "Whether we reach the technical
definition [of a double-dip recession] I think is probably close to
50-50."
Question: Was the author correct in saying that the "odds" of
a double-dip recession may have reached 50 percent?
Hepatitis Outbreak in Marshfield, MA
In 2004 there was an outbreak of hepatitis
A on the South Shore of Massachusetts. Over a period of a few weeks
there were 20 cases of hepatitis A that were reported to the
Massachusetts Department of Public Health (MDPH),
and most of the infected persons were residents of Marshfield, MA.
Marshfield's health department requested help from MDPH in
identifying the source. The investigators quickly performed descriptive
epidemiology. The epidemic curve indicated a point source epidemic,
and most of the cases lived in the Marshfield area, although some
lived as far away as Boston. They conducted hypothesis-generating
interviews, and taken together, the descriptive epidemiology
suggested that the source was one of five or six food
establishments in the Marshfield area, but it wasn't clear which
one. Consequently, the investigators wanted to conduct an analytic
study to determine which restaurant was the source. They invited all
20 cases of hepatitis A to answer questions from a questionnaire
designed for this study, and 19 of the cases agreed to complete the
survey.
Summary
Note that the lower three study designs (retrospective and
prospective cohort studies and clinical trials) are similar in that
an initially disease free cohort is divided into groups based on
their "exposure" status, i.e., whether or not they have a
particular "risk factor," and for all three, the investigator
measures and compares the incidence of disease. In contrast,
case-control studies identify diseased and non-diseased subjects
and then measure and compare their likelihood of having had certain
prior exposures.