Further Poisson regression (AS05)
EPM304 Advanced Statistical Methods in Epidemiology
Course: PG Diploma/ MSc Epidemiology
This document contains a copy of the study material located
within the computer assisted learning (CAL) session. If you have
any questions regarding this document or your course, please
contact DLsupport via [email protected]. Important note: this
document does not replace the CAL material found on your module
CDROM. When studying this session, please ensure you work through
the CDROM material first. This document can then be used for
revision purposes to refer back to specific sessions. These study
materials have been prepared by the London School of Hygiene &
Tropical Medicine as part of the PG Diploma/MSc Epidemiology
distance learning course. This material is not licensed either for
resale or further copying.
London School of Hygiene & Tropical Medicine September 2013
v1.0
Section 1: Further Poisson regression
Aim
To discuss the Poisson distribution, review Poisson regression
and learn about overdispersion within a Poisson model.
Objectives
By the end of this session you will be able to:
describe the Poisson distribution
calculate Poisson probabilities
summarise the procedures involved in developing a Poisson model
explain overdispersion in a Poisson model
explain how to deal with overdispersion
This session should take you between 1.5 and 2 hours to complete.
Section 2: Planning your study
This session will look at the Poisson distribution and Poisson
regression in more detail. We will first consider the Poisson
distribution, then review the main issues of model development for
a Poisson regression model, and finally discuss what to do if
Poisson regression does not provide the most appropriate model for
your data. To work through this session you should be familiar with
the analysis of cohort studies and regression modelling. Students
who completed EPM202 can refer to the appropriate sessions listed
below. Occasional students should refer to their preferred text.
Cohort studies SM02
Likelihood SM05
Introduction to Poisson and Cox regression SM11
Interaction: Hyperlink: SM02: SM02 window opens.
Interaction: Hyperlink: SM05: SM05 window opens.
Interaction: Hyperlink: SM11: SM11 window opens.
2.1: Planning your study
To illustrate the methods discussed in this session three examples
are used. Click on each of the studies listed below for further
details.
Study 1: Weekly registrations of new breast cancer cases within a region
Study 2: A study of optic nerve disease carried out in Nigeria
Study 3: A pneumococcal vaccine trial based in Papua New Guinea
Interaction: Hyperlink: Study 1: (card appears on right hand side):
Study 1
In the UK, all new cases of breast cancer are reported to the
regional cancer registry via GPs or a hospital. The cases are
reported on a weekly basis.
Interaction: Hyperlink: Study 2: (card appears on right hand side):
Study 2
Individuals living in an area of rural northern Nigeria that is
endemic for onchocerciasis formed the placebo arm of a trial of
ivermectin. All individuals were screened for optic nerve disease
(OND) at the start of the trial. Only individuals without optic
nerve disease at baseline were included. Individuals were
re-examined at about 2 and 3 years post-baseline to identify
incident cases of optic nerve disease.
Interaction: Hyperlink: Study 3: (card appears on right hand side):
Study 3
A trial was carried out in Papua New Guinea to assess the effect of
a pneumococcal vaccine on morbidity from acute lower respiratory
tract infections in schoolchildren. Information was collected on
all children in the study and throughout the trial each episode of
infection was recorded.
Section 3: The Poisson distribution
You should already be familiar with some statistical distributions,
e.g. the Normal and Binomial distributions (covered in sessions
SC04 and SC05). During parts of the EPP course we have applied the
Poisson distribution but not fully discussed the distribution
itself. On this page we will look more closely at this
distribution.
Poisson distribution
3.1: The Poisson distribution
The Poisson distribution is named after the French mathematician
Siméon Denis Poisson (1781-1840). Let's first think about the type
of data it can be applied to, i.e. data in discrete counts. Shown
below are some examples of instances in which you would count the
number of occurrences of an event over a period of time.
Examples
1. The weekly number of new cases of breast cancer notified to the
regional cancer registry.
2. The number of deaths that occur in a cohort of civil servants in
a year.
3. The number of cases of salmonella reported in a regional health
authority within a week.
3.2: The Poisson distribution
So, the Poisson distribution is appropriate for describing:
The number of occurrences of an event during a period of time.
...but only when the events are independent of each other and occur
at random.
Poisson distribution
3.3: The Poisson distribution The simplest way to describe the
Poisson distribution is to consider the average rate at which
events occur over time. For example, imagine you were told that the
weekly number of new cases of breast cancer reported to the
regional register was 4 on average. Does this mean that every week
you should expect 4 new cases of breast cancer to be reported?
Interaction: Button: clouds picture (pop up box appears): It is
highly unlikely that you would get exactly 4 new cases of breast
cancer reported each week, but the weekly average over a few weeks
should be 4. In one week there might be no cases reported and the
next week there might be more than 6 cases reported.
3.4: The Poisson distribution
A Poisson distribution describes the sampling distribution of the
number of occurrences, r, of an event during a period of time. It
depends on only one parameter, μ, the mean number of occurrences in
periods of the same length. This is shown below.
Interaction: Hyperlink: sampling distribution (pop up box
appears):
Sampling Distribution The distribution of sample estimates
obtained when many random samples are generated independently from
the same population. Interaction: Hyperlink: parameter (pop up box
appears): Parameter A parameter is a measurement from a population
that characterises one of its features.
3.5: The Poisson distribution
Now, the plot below shows the distribution of the weekly
notifications of breast cancer to a regional register over 19
weeks. What are the values of the following parameters? (You may
wish to go back to the previous card. To work out μ you need to do
a calculation considering the frequency of each event.)
Maximum value of r observed =
Mean, μ =
Interaction: Calculation: Maximum value of r =____:
Correct Response 8: Correct
Yes, the greatest number of notifications in a week was 8.
Incorrect Response:
Sorry, that's not right. You can see from the graph that the
greatest number of notifications in a week was 8.
Interaction: Calculation: Mean, μ =____:
Correct Response 4.2:
That's right, the mean is given by the total number of
notifications over the 19-week period (80), divided by the number
of weeks (19). Therefore the mean is 80 / 19 = 4.2.
Incorrect Response:
No, that's not right. To find the mean, you first need to calculate
the total number of notifications over the 19-week period. Each
term below is (number of weeks x notifications per week):
(1x1) + (2x2) + (4x3) + (5x4) + (3x5) + (1x6) + (2x7) + (1x8) = 80
Therefore the mean is 80 / 19 = 4.2
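The mean calculation above can be checked with a short Python sketch. It assumes the frequencies read from the sum in the feedback (for example, 4 weeks with 3 notifications); the names weekly_frequencies and frequency_mean are illustrative and not part of the session.

```python
# Weekly notification counts from section 3.5, stored as
# {notifications in a week (r): number of weeks with that count}.
weekly_frequencies = {1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 1, 7: 2, 8: 1}


def frequency_mean(frequencies):
    """Return (total events, total periods, mean events per period)."""
    total_events = sum(r * weeks for r, weeks in frequencies.items())
    total_periods = sum(frequencies.values())
    return total_events, total_periods, total_events / total_periods


total, weeks, mean = frequency_mean(weekly_frequencies)
print(f"{total} notifications over {weeks} weeks, mean = {mean:.1f} per week")
```

The same function works for any table of counts, as long as each key is a count and each value is the number of periods in which that count was seen.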
3.6: The Poisson distribution Let's go back to the theoretical
Poisson distribution. If you assume the weekly registrations are
independent and follow a Poisson distribution, you can calculate
the probability that a specific number of cases will occur in a
week.
You will see how to do this calculation next, but first consider
the plot below. What is the probability of 2 registrations in a
week? Give your answer to 2 decimal places.
Prob (2 registrations/week) =
Interaction: Calculation: Prob (2 registrations/week) =____:
Correct Response 0.14 or 0.15: Correct
Yes, the plot shows that the probability of 2 registrations in a
week is somewhere between 0.14 and 0.15.
Incorrect Response:
No, the probability of 2 registrations in a week is shown as the
height of the vertical line at 2 events on the x-axis. This line
reaches to a point just below a probability of 0.15, so your answer
should be either 0.14 or 0.15.
3.7: The Poisson distribution
The probability of r occurrences can be calculated from the Poisson
formula shown below.
Probability (r occurrences) = (e^(-μ) x μ^r) / r!
where:
e is the mathematical constant 2.71828
μ is the mean number of events
r is the number of occurrences
r! is the factorial of r.
Interaction: Hyperlink: factorial (pop up box appears):
Factorial
The factorial of an integer n is written n! and is calculated as
1 x 2 x 3 x ... x n. For example, the factorial of 5 is:
5! = 1 x 2 x 3 x 4 x 5 = 120
Let's work this out for the registration of breast cancer cases.
The mean number of new cases of breast cancer registered each week
is 4.0. Using this number with the formula above, you can calculate
the probability of 0, 1, 2, 3... new registrations in a week. For
example, to calculate the probability of 2 registrations in a week
for a mean of 4:
Probability (2 registrations) = (e^(-4) x 4^2) / 2!
= (e^(-4) x 16) / (2 x 1)
= 0.1465
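As a check on the worked example, the Poisson formula can be written as a small Python function. This is a minimal sketch using only the standard library; the function name poisson_probability is an illustrative choice.

```python
import math


def poisson_probability(r, mean):
    """Probability of exactly r occurrences when the mean count is `mean`."""
    return math.exp(-mean) * mean ** r / math.factorial(r)


# Probability of 2 registrations in a week when the weekly mean is 4.
prob_two = poisson_probability(2, 4)
print(round(prob_two, 4))
```

Calling the function with other values of r gives the remaining probabilities for the same mean.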
Does this agree with the probability you read from the plot? Go
back and check.
3.8: The Poisson distribution
The table below shows the probabilities for μ = 4. Click on each
cell in the probability column to see how it was calculated. Then
calculate the missing probabilities, giving your answers to 4
decimal places.
Remember, prob(r) = (e^(-μ) x μ^r) / r!
Probability of r breast cancer registrations per week
Number of registrations (r)   Probability
0                             0.0183
1                             0.0733
2                             0.1465
3                             0.1954
4                             0.1954
5                             0.1563
6                             0.1042
7
8                             0.0298
9
Interaction: Button: 0.0183 (Number of registrations = 0):
Prob(0 registrations) = (e^(-4) x 4^0) / 0! = e^(-4) / 1 = 0.0183
Interaction: Button: 0.0733 (Number of registrations = 1):
Prob(1 registration) = (e^(-4) x 4^1) / 1! = 4e^(-4) / 1 = 0.0733
Interaction: Button: 0.1465 (Number of registrations = 2):
Prob(2 registrations) = (e^(-4) x 4^2) / 2! = 16e^(-4) / 2 = 0.1465
Interaction: Button: 0.1954 (Number of registrations = 3):
Prob(3 registrations) = (e^(-4) x 4^3) / 3! = 64e^(-4) / 6 = 0.1954
Interaction: Button: 0.1954 (Number of registrations = 4):
Prob(4 registrations) = (e^(-4) x 4^4) / 4! = 256e^(-4) / 24 = 0.1954
Interaction: Button: 0.1563 (Number of registrations = 5):
Prob(5 registrations) = (e^(-4) x 4^5) / 5! = 1024e^(-4) / 120 = 0.1563
Interaction: Button: 0.1042 (Number of registrations = 6):
Prob(6 registrations) = (e^(-4) x 4^6) / 6! = 4096e^(-4) / 720 = 0.1042
Interaction: Button: 0.0298 (Number of registrations = 8):
Prob(8 registrations) = (e^(-4) x 4^8) / 8! = 65536e^(-4) / 40320 = 0.0298
Interaction: Calculation: Number of registrations = 7, Probability = ___:
Correct Response 0.0595:
That's right, the probability of 7 registrations per week is:
Prob(7 registrations) = (e^(-4) x 4^7) / 7! = 16384e^(-4) / 5040 = 0.0595
Incorrect Response:
Sorry, that's not correct. Remember that the mean, μ = 4, so for
r = 7, you use the formula as follows:
Prob(7 registrations) = (e^(-μ) x μ^r) / r! = (e^(-4) x 4^7) / 7!
= 16384e^(-4) / 5040 = 0.0595
Interaction: Calculation: Number of registrations = 9, Probability = ___:
Correct Response 0.0132:
That's right, the probability of 9 registrations per week is:
Prob(9 registrations) = (e^(-4) x 4^9) / 9! = 262144e^(-4) / 362880 = 0.0132
Incorrect Response:
Sorry, that's not correct. Remember that the mean, μ = 4, so for
r = 9, you use the formula as follows:
Prob(9 registrations) = (e^(-4) x 4^9) / 9! = 262144e^(-4) / 362880 = 0.0132
(When Probability (7 registrations) is
answered the following text and interaction appears on the LHS.
Also an extra row with a calculation appears on the table)
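The whole probability column for a mean of 4 can be reproduced with the sketch below. It assumes the standard recurrence P(r) = P(r-1) x μ / r, which follows from the Poisson formula but is not stated in the session, and it also reports the probability left over for 10 or more registrations. The function name poisson_table is illustrative.

```python
import math


def poisson_table(mean, max_r):
    """Probabilities for r = 0..max_r, built with P(r) = P(r-1) * mean / r."""
    probabilities = [math.exp(-mean)]  # P(0) = e^(-mean)
    for r in range(1, max_r + 1):
        probabilities.append(probabilities[-1] * mean / r)
    return probabilities


table = poisson_table(mean=4, max_r=9)
for r, p in enumerate(table):
    print(f"{r:>3}  {p:.4f}")

# Probabilities sum to 1, so the upper tail is the complement of r = 0..9.
tail = 1 - sum(table)
print(f"10+  {tail:.4f}")
```

Building each probability from the previous one avoids computing large powers and factorials directly.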
Now, can you calculate the probability of 10 or more
registrations, to 4 decimal places? Make sure you have entered the
correct values in rows 7 and 9 before you do this.
Interaction: Button: Hint:
Hint
Remember that probabilities should add up to 1.
Number of registrations (r)   Probability
0                             0.0183
1                             0.0733
2                             0.1465
3                             0.1954
4                             0.1954
5                             0.1563
6                             0.1042
7
8                             0.0298
9
10+
Interaction: Calculation: Number of registrations = 10+, probability = ___:
Correct Response 0.0081: Correct
That's right,
Prob(10+) = 1 - [prob(0) + prob(1) + ... + prob(9)] = 1 - 0.9919 = 0.0081
Incorrect Response:
Sorry, that's not right. Remember that the probabilities must add
up to 1 in total, so by summing the probabilities for rows 0 to 9,
and subtracting that total from 1, you get the probability that
there are 10 or more registrations in a week.
Prob(10+) = 1 - [prob(0) + prob(1) + ... + prob(9)] = 1 - 0.9919 = 0.0081
3.9: The Poisson distribution
The mean and variance are often used to summarize a
distribution. For a Poisson distribution the variance is equal to
the mean, so it can be described by a single parameter.
Interaction: Button: More (text appears on bottom LHS):
So, if the variance is equal to the mean, the standard error equals
the square root of the mean, √μ. This standard error can be used to
obtain a confidence interval for the mean.
3.10: The Poisson distribution
The Poisson distribution for the weekly registrations of breast
cancer cases is shown below. Which of the boxes below best
describes the distribution?
Interaction: Hotspot: Left skewed: Incorrect Response (pop up
box appears): Left skewed Sorry, a left skewed distribution has a
low probability of a small number of events and a high probability
of a large number of events. That is not the case for this
distribution. Please try again. Interaction: Hotspot: Symmetrical:
Incorrect Response (pop up box appears): Symmetrical No, one tail
of the distribution is much longer than the other tail, therefore
the distribution is not symmetrical. Please try again. Interaction:
Hotspot: Right skewed: Correct Response (pop up box appears):
Correct The Poisson distribution is skewed to the right since the
probability of a small number of events is high and the probability
of a large number of events is low.
Symmetrical
Left skewed
Right skewed
3.11: The Poisson distribution Use the drop-down menu beneath
the plot below to view the shape of the Poisson distribution for
different values of the mean. Consider the plots. Which
distribution can we use as an approximation to the Poisson when the
mean is large? Interaction: Button: clouds picture (pop up box
appears): For large means, the shape of the Poisson distribution is
symmetrical. In such cases, the distribution can be approximated by
the Normal distribution.
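The Normal approximation can be explored numerically. The sketch below compares each Poisson probability with the density of a Normal distribution that has the same mean and variance, and reports the largest gap for several means. The choice of means (3, 5, 10, 50), the range of r examined and the function names are assumptions made for illustration.

```python
import math


def poisson_pmf(r, mean):
    # Work on the log scale so large means do not overflow.
    return math.exp(-mean + r * math.log(mean) - math.lgamma(r + 1))


def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(
        2 * math.pi * variance
    )


def max_gap(mean):
    """Largest |Poisson probability - Normal density| over r = 0..4*mean."""
    return max(
        abs(poisson_pmf(r, mean) - normal_pdf(r, mean, mean))
        for r in range(0, int(4 * mean) + 1)
    )


gaps = {mean: max_gap(mean) for mean in (3, 5, 10, 50)}
for mean, gap in gaps.items():
    print(f"mean = {mean:>2}: largest gap = {gap:.4f}")
```

The gap shrinks as the mean grows, which is the numerical counterpart of the plots becoming more symmetrical.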
Interaction: Pulldown: μ = 3:
Interaction: Pulldown: μ = 5:
Interaction: Button: μ = 10:
Section 4: The Poisson distribution and rates
In epidemiology we are more interested in the rate of occurrence of
an outcome than in counts.
The mean rate, λ, of the occurrence of an event is given by:
λ = r / t
where r is the number of events in a time period t. So for the
Poisson distribution what is the standard error of this rate?
Interaction: Button: clouds picture (text appears on the bottom RHS):
The Poisson distribution is described by one parameter, the mean,
which is equal to the variance. So the standard error of the count
r is √r, and the standard error of the rate is √r / t.
4.1: The Poisson distribution and rates
Usually the follow-up time differs for each individual, and in
calculating the rate we therefore divide by the person-time. So,
the mean rate is given by:
λ = r / person-time
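A rate and its standard error can be computed as in the sketch below. It assumes the number of events is a Poisson count, so that its variance is estimated by the count itself and the standard error of the rate is the square root of the count divided by the time at risk. The example reuses the 80 notifications over 19 weeks from section 3.5; the function name is illustrative.

```python
import math


def rate_with_standard_error(events, person_time):
    """Return (rate, standard error), treating `events` as a Poisson count."""
    rate = events / person_time
    standard_error = math.sqrt(events) / person_time
    return rate, standard_error


# Breast cancer notifications from section 3.5: 80 cases over 19 weeks.
rate, se = rate_with_standard_error(events=80, person_time=19)
print(f"rate = {rate:.2f} per week, standard error = {se:.2f}")
```

Replacing the 19 weeks with total person-years of follow-up gives a rate per person-year in the same way.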
Example
A study was carried out in a poor area of Mexico, in which 652
children aged less than 5 years were followed for a period of up to
2 years. During the study 73 cases of lower respiratory infection
were recorded. The total child-years of follow-up was 868.
Calculate the overall (incidence) rate of lower respiratory
infection estimated from this study and enter your answer in the
box below.
Rate =
Interaction: Calculation: Rate =____:
Correct Response 0.084: Correct
Yes, the incidence rate is given by the number of infections
divided by the total person-time: λ = 73 / 868 = 0.084
Incorrect Response:
Sorry, that's not right. The total number of infections during the
study, r, was 73, and the total person-time of follow-up was 868
person-years. So, using the formula: λ = 73 / 868 = 0.084
4.2: The Poisson distribution and rates
Such rates are usually expressed per 100 or per 1000 person-years.
For example, the incidence rate of lower respiratory infection in
the Mexico study was 0.084 x 100 = 8.4 per 100 child-years.
Section 5: Poisson regression model
A Poisson regression model is appropriate for the analysis of
cohort data where the outcome measure is a rate. The information
essential for analysis is:
1. Time of entry to the study
2. Whether the event of interest occurred
3. Time of exit / end of follow-up
This allows the calculation of rates, based on the information for
the whole cohort.
Rate = number of events / person-time at risk
Interaction: Button: More (card appears on RHS):
Using Poisson regression we can model the log rate as a function of
the exposure(s) of interest and adjust for many confounders
simultaneously. The general format of a Poisson model may be
written as:
log λ = α + β1X1 + β2X2 + ... + βiXi
Click below to compare this model to two other models you are
familiar with, namely the logistic model and the Cox model.
Interaction: Button: Logistic:
Logistic model
log (π / (1 - π)) = α + β1X1 + β2X2 + ... + βiXi
Interaction: Button: Cox:
log λ(t; X) = log λ(t; 0) + β1X1 + β2X2 + ... + βiXi
5.1: Poisson regression model As with all regression models, a
Poisson model can: Interaction: Timed pop up 1: Interaction: Timed
pop up 2: Interaction: Timed pop up 3: Interaction: Timed pop up 4:
5.2: Poisson regression model
You will now investigate the application of the Poisson model using
data from a study of optic nerve disease carried out in Nigeria.
Individuals living in an onchocercal area in Nigeria were followed
over a 3-year period to study the incidence of optic nerve disease
(OND). The exposure of interest was onchocerciasis, which was
measured by microfilarial load. The categories for this are shown
opposite.
Interaction: Hyperlink: onchocerciasis (pop up box appears):
Onchocerciasis (also known as river blindness)
A disease affecting the skin and eyes, caused by the filarial worm
Onchocerca volvulus. The disease is transmitted by blackflies and
has been endemic in large areas of West Africa.
Account for interaction
Model a linear effect
Adjust for many confounders
Simultaneously model different exposure types
Interaction: Hyperlink: microfilarial (pop up box appears):
Microfilariae
Offspring of the female worm Onchocerca volvulus, which migrate
through the skin causing onchocerciasis.
Category of microfilarial load   Microfilariae per mg
0                                0
1                                0.1 to 10
2                                10.1 to 50
3                                > 50
5.3: Poisson regression model
The rates of optic nerve disease for each category of microfilarial
load are calculated by dividing the number of OND cases by the
total person-time in each category of microfilarial load. Examine
the rates shown below. Do you think there is a relationship between
optic nerve disease and microfilarial load?
Rates of optic nerve disease by microfilarial load per 1000 person-years
Microfilarial load   D    Y (/1000)   Rate     95% confidence limits
0 mg                 14   1.158       12.091    7.161   20.415
0.1 to 10 mg         18   1.512       11.903    7.500   18.893
10.1 to 50 mg        31   0.992       31.256   21.981   44.444
> 50 mg               8   0.242       33.014   16.510   66.015
Rate = counts (D) / person-time (Y)
Person-time is the total follow-up time for all people in each group
Interaction: Button: clouds picture (pop up box appears and part of
table is highlighted):
The rates of optic nerve disease appear to increase with a higher
microfilarial load, from 12.1 per 1000 person-years in the baseline
microfilarial load group to 33.0 per 1000 person-years in
individuals with the highest microfilarial load.
5.4: Poisson regression model
To compare the
rate of optic nerve disease in individuals with higher levels of
microfilarial load to individuals with no onchocercal infection you
can look at the rate ratios. These are shown below. What can you
conclude from these estimates?
Rate ratios for each level of microfilarial load compared to the baseline level
                Rate ratio   χ²      P > |z|   95% confidence limits
0.1 to 10 mg    0.985        0.002   0.965     0.490   1.979
10.1 to 50 mg   2.585        9.370   0.002     1.375   4.859
> 50 mg         2.731        5.580   0.018     1.146   6.509
Interaction: Button: clouds picture (pop up box appears):
From the estimates in the table, there appears to be no difference
in the rates between the lowest category (0.1 to 10 mg) and the
baseline group (P = 0.97). However, the two highest categories of
microfilarial load have significantly higher rates of optic nerve
disease compared to the baseline group (P = 0.002, P = 0.018).
5.5: Poisson regression model
By assuming that the rate of OND follows the Poisson distribution,
we can model the rates using a Poisson regression model:
log (rate of OND) = α + β1X1 + β2X2 + ... + βiXi
where:
X1 = Mfload1 = 0.1 mg to 10 mg;
X2 = Mfload2 = 10.1 mg to 50 mg;
X3 = Mfload3 = > 50 mg.
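When the only term in the model is a single categorical exposure, the fitted rate ratios equal the ratios of the crude rates, so they can be checked by hand. The sketch below does this with the cases (D) and person-years (Y) from the table in section 5.3, using a Wald interval on the log scale with standard error sqrt(1/D1 + 1/D0). The dictionary layout and function name are illustrative, and small differences from the published figures are expected because Y is printed to three decimal places.

```python
import math

# (cases D, person-years Y) per microfilarial load category, from section 5.3.
# Y is printed there in thousands, so it is multiplied back up here.
groups = {
    "0 (baseline)": (14, 1158),
    "0.1 to 10": (18, 1512),
    "10.1 to 50": (31, 992),
    "> 50": (8, 242),
}


def rate_ratio(cases, person_time, base_cases, base_person_time, z=1.96):
    """Rate ratio versus baseline with a Wald 95% interval on the log scale."""
    ratio = (cases / person_time) / (base_cases / base_person_time)
    se_log = math.sqrt(1 / cases + 1 / base_cases)
    factor = math.exp(z * se_log)
    return ratio, ratio / factor, ratio * factor


base = groups["0 (baseline)"]
results = {}
for label, (d, y) in groups.items():
    if label == "0 (baseline)":
        continue
    results[label] = rate_ratio(d, y, *base)
    rr, low, high = results[label]
    print(f"{label:>10}: rate ratio {rr:.3f} ({low:.3f} to {high:.3f})")
```

Once confounders or other covariates are added, this shortcut no longer applies and the model has to be fitted with regression software.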
The model estimates are shown below. Click 'swap' below to see
the results from the classical analysis. How do the estimates from
the two analyses compare?
Poisson model for the effect of microfilarial load on optic nerve disease
           Rate ratio   SE      z        P > |z|   95% confidence limits
Mfload1    0.985        0.351   -0.044   0.965     0.490   1.979
Mfload2    2.585        0.832    2.950   0.003     1.375   4.859
Mfload3    2.731        1.210    2.266   0.023     1.146   6.509
Log likelihood = -275.95294
Interaction: Button: clouds picture (pop up box appears):
The results from the simple Poisson model produce exactly the same
rate ratios and confidence intervals as the classical analysis of
the association between optic nerve disease and microfilarial load.
However, the P values are slightly different for the 2 methods,
because in the regression analysis a Wald test is used, whereas in
the classical analysis a score test is used. Note that neither
analysis is adjusted for potential confounders.
Interaction: Button: Swap (table changes to the following):
Rate ratios for each level of microfilarial load compared to the baseline level
                Rate ratio   χ²      P > |z|   95% confidence limits
0.1 to 10 mg    0.985        0.002   0.965     0.490   1.979
10.1 to 50 mg   2.585        9.370   0.002     1.375   4.859
> 50 mg         2.731        5.580   0.018     1.146   6.509
5.6: Poisson regression model
Is there a single estimate in the table that shows whether
microfilarial load has an effect on the rate of optic nerve
disease? Interaction: Button: clouds picture (box appears on RHS):
No, from the table below we can say how each level of microfilarial
load compares to the baseline group but we cannot say what the
overall effect of microfilarial load is. For this you need to
calculate a likelihood ratio test.
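A likelihood ratio test compares the log likelihoods of two nested models. The sketch below uses the values quoted in section 5.7, taken as -275.95 with microfilarial load and -284.17 without it (negative signs assumed, consistent with the reported statistic of 16.44). It also assumes 3 degrees of freedom, one for each indicator variable, and uses the closed-form upper tail of the chi-squared distribution that holds for 3 degrees of freedom only. The function names are illustrative.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def likelihood_ratio_statistic(loglik_full, loglik_reduced):
    """Twice the gain in log likelihood from the reduced to the full model."""
    return 2 * (loglik_full - loglik_reduced)


lrs = likelihood_ratio_statistic(loglik_full=-275.95, loglik_reduced=-284.17)
p_value = chi2_sf_3df(lrs)
print(f"LRS = {lrs:.2f}, P = {p_value:.3f}")
```

For other degrees of freedom a general chi-squared routine from a statistics package would be needed in place of chi2_sf_3df.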
Poisson model for the effect of microfilarial load on optic nerve disease
           Rate ratio   SE      z        P > |z|   95% confidence limits
Mfload1    0.985        0.351   -0.044   0.965     0.490   1.979
Mfload2    2.585        0.832    2.950   0.003     1.375   4.859
Mfload3    2.731        1.210    2.266   0.023     1.146   6.509
Log likelihood = -275.95294
5.7: Poisson regression model
The likelihood ratio test for the effect of microfilarial load is
calculated as follows:
Log likelihood for:
Model with microfilarial load = -275.95
Model without microfilarial load = -284.17
Therefore, the likelihood ratio statistic (LRS):
LRS = 2 x (-275.95 - (-284.17)) = 16.44; P = 0.001
So what can you conclude about the overall effect of microfilarial
load on optic nerve disease?
Interaction: Button: clouds picture (pop up box appears):
P = 0.001, so we can say that microfilarial load has a significant
effect on the rate of optic nerve disease, before adjusting for
potential confounders.
5.8: Poisson regression model
Using the same procedures and comparing
models you can also: Assess the effect of confounding variables
Test for interaction (and the proportional rates assumption) Look
for a linear effect for increasing (or decreasing) exposures
Interaction: Hyperlink: Assess the effect of confounding variables
(card appears on RHS): Confounding variables (review this)
To assess confounding, you fit two models. One is a model with
the exposure AND the potential confounder, the other is a model
with the exposure but NOT the potential confounder. You then
compare the rate ratios for the exposure in the model that includes
the potential confounder, with the rate ratios for the exposure in
the model that does not include the potential confounder. If the
rate ratios for the exposure are considerably different between the
2 models (e.g. they change by around 10% or more between the 2
models), then we can say there is confounding. If there is
confounding, then we must use the model that includes the
confounder when estimating the effect of the exposure variable.
Interaction: Hyperlink: (review this): Window showing SM12 appears,
section 5
Interaction: Hyperlink: Test for interaction (and the proportional
rates assumption) (card appears on RHS):
Test for interaction (Click here to review this from SM10.) The simple
Poisson model assumes that the effect of the exposure is constant
over the follow-up period. We can test this assumption by splitting
the follow-up time into several categories (time bands), and then
seeing if there is evidence that the effect of the exposure varies
over time. For example, if the follow-up period is 10 years we
could split the follow-up period into 2 time bands, the first 5
years of follow-up and the second 5 years of follow-up. We could
then investigate whether there is evidence that the effect of the
exposure is different in the first 5 years (0-5 years) from the
second 5 years (6-10 years) of follow-up. We would do this by
fitting an interaction between the exposure variable and timeband,
and using a likelihood ratio test (LRT) to see if the interaction
was statistically significant (comparing the model with the
interaction term to the one without). Alternatively, we might think
that the effect of the exposure varied with calendar time. Suppose
that individuals were followed up over the period 1980-1989. We
could investigate whether the effect of the exposure was different
during 1980-1984 from the period 1985-1989. Or we might think that
the effect of the exposure varied with an individual's age. You
will see how to investigate this in Practical 5.
Interaction: Hyperlink: review this: Window showing SM10, section 5.
Interaction: Hyperlink: Look for a linear effect for increasing (or
decreasing) exposures (card appears on RHS):
Linear effect (Click here to review this from SM10.)
For a variable with increasing or decreasing exposure it is
possible to model a linear effect. The goal in all model analysis
should be to produce the simplest model that describes the data. If
a linear effect can be assumed then this reduces the number of
parameters required in a model. In addition, where a linear effect
shows a dose-response relationship this is further evidence of a
causal relationship.
Interaction: Hyperlink: review this: Window showing SM10, section 6.
Section 6: Overdispersion
The assumptions underlying the Poisson model are that all events
are independent, and that the event rates are similar within
categories of the covariates. In some situations these assumptions
are invalid, i.e. where for some reason an event may be more likely
to happen in certain individuals. For examples, click below:
Recurrent events
Clustered observations
The analysis of such data is covered in more detail in AS09.
Interaction: Hyperlink: covariates (pop up box appears): Covariates
This is another term for explanatory variables, or independent
variables. Interaction: Hyperlink: Recurrent events (pop up box
appears): Recurrent events The possibility of recurrent asthma
episodes is high for children who have previously had asthma, i.e.
the chance of a recurrent episode is dependent on whether a child
has previously had asthma.
Interaction: Hyperlink: Clustered observations (pop up box appears):
Clustered observations The chance
of contracting an infectious disease may be higher for subjects
within a certain community than for individuals living in another
community. This may be due to some environmental or social factors.
6.1: Overdispersion
For such data, the variance of the distribution of events will
be greater than that predicted by a Poisson model. This is known as
overdispersion. For example, clusters of individuals can have
different rates to each other; overall this will give a wider
spread of rates. Sometimes you might not be aware of clustering in
a dataset, because the characteristic that distinguishes the
clustering has not been measured. Overdispersion can occur because
of extra variability, which is not completely explained by the
covariates in the model.
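One informal way to look for overdispersion, which is not part of this session, is to compare the sample variance of the counts with their mean, since the two should be similar under a Poisson model. The sketch below applies this to the weekly breast cancer notifications from section 3.5; the function name and the use of the (n - 1) variance are assumptions made for illustration.

```python
# Weekly breast cancer notifications from section 3.5:
# {notifications in a week: number of weeks}.
weekly_frequencies = {1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 1, 7: 2, 8: 1}


def mean_and_variance(frequencies):
    """Sample mean and (n - 1) variance of counts given as a frequency table."""
    n = sum(frequencies.values())
    mean = sum(r * f for r, f in frequencies.items()) / n
    squared_deviations = sum(f * (r - mean) ** 2 for r, f in frequencies.items())
    return mean, squared_deviations / (n - 1)


mean, variance = mean_and_variance(weekly_frequencies)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, ratio = {variance / mean:.2f}")
```

A ratio well above 1 would point towards overdispersion. Here the ratio is below 1, so these particular data give no such indication, although 19 weeks is a small sample.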
6.2: Overdispersion You can account for the extra variability by
assuming that individual rates depend on the covariates that you
would normally put in the Poisson model plus an unobserved
component specific to each subject. This component is known as the
frailty. The frailty component varies between individuals. It is a
measure of how some individuals may be more prone to infection than
others. This may be because of the community they live in or their
disease history. Therefore, the model can be written:
log λ = α + β1X1 + β2X2 + ... + βiXi + frailty
6.3: Overdispersion
Statistically we can incorporate a second distribution to model the
frailty component. When the Poisson distribution is expanded in
this way, with the frailty assumed to follow a gamma distribution,
the resulting model is called the negative binomial model. This is
fitted to the data in the same way a Poisson model is fitted.
Regression methods can be used to estimate the usual exposure
effects accounting for the frailty term.
Negative binomial model
log λ = α + β1X1 + β2X2 + ... + βiXi + frailty
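The effect of a frailty term on the spread of counts can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: a hypothetical mean of 4 events per subject, a gamma frailty with mean 1 (shape 2, scale 0.5) that multiplies each subject's rate (equivalent to adding a term on the log scale), and a simple Poisson sampler (Knuth's method) written with the standard library. None of these values come from the session.

```python
import math
import random
import statistics


def draw_poisson(mean, rng):
    """Draw one Poisson count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(2013)  # fixed seed so the run is repeatable
base_mean = 4              # hypothetical mean count per subject
n_subjects = 20000

# No frailty: every subject shares the same mean.
plain = [draw_poisson(base_mean, rng) for _ in range(n_subjects)]

# Gamma frailty with mean 1 (shape 2, scale 0.5) multiplies each subject's mean.
mixed = [
    draw_poisson(base_mean * rng.gammavariate(2, 0.5), rng)
    for _ in range(n_subjects)
]

for label, counts in (("Poisson", plain), ("with frailty", mixed)):
    m, v = statistics.mean(counts), statistics.variance(counts)
    print(f"{label:>12}: mean = {m:.2f}, variance = {v:.2f}")
```

Both sets of counts have a similar mean, but the counts generated with frailty have a much larger variance, which is the pattern described as overdispersion.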
6.4: Overdispersion
To illustrate the difference in estimates obtained from a Poisson
model and a negative binomial model you will now consider a study
of recurrent infections. The data are from a trial of a
pneumococcal vaccine carried out in Papua New Guinea. The trial
involved schoolchildren, and occurrence of infection was recorded
over 2 years. Which of the calculations opposite do you think gives
the best estimate for the rate of infection in the population of
schoolchildren?
Interaction: Hotspot: The number of children who had an
infection during the two year period, divided by the total
children-years. Incorrect Response (pop up box appears): No,
remember each child could have a number of infections within the
two-year period. So, to measure the rate of infection in the
population, it is more relevant to consider the number of
infections. Interaction: Hotspot: The number of infections during
the two-year period, divided by the total children-years.
Correct Response (pop up box appears): Yes, calculating the rate
of infection in this group of schoolchildren using the number of
infections gives a better estimate of the burden of infection in
the population of schoolchildren.
6.5: Overdispersion
The table opposite gives the distribution of recurring infections
and infection rates by the number of previous infections observed
in the Papua New Guinea schoolchildren during the 2-year period.
Click 'swap' beneath the table to see a plot of these data. Then
answer the questions below.
1. Why might overdispersion occur with these data?
Interaction: Button: clouds picture (pop up box appears):
Overdispersion would occur if children who have had previous
infection(s) are more prone to infection than children who have
not, i.e. if re-infection is more probable with an increasing
number of previous infections.
2. Is there any evidence of overdispersion from the table and plot?
Interaction: Button: clouds picture (pop up box appears): You can
see from the data that the infection rate per 100 person-years
tends to increase with an increasing number of previous
infections. This is an indication that recurring
infections produce variability between individuals that is likely
to cause overdispersion.

Distribution of recurrent infections in children followed for 2 years

Number of previous   Number of children     Infection rate per
infections           with new infections    100 person-years
 0                   702                     51
 1                   444                     82
 2                   261                    102
 3                   147                    135
 4                    77                    141
 5                    34                    107
 6                    25                    252
 7                    11                    172
 8                     6                    217
 9                     3                    189
10                     2                    451

Interaction: Button: Swap (table changes to the following):
[Plot: Distribution of recurrent infections]
6.6: Overdispersion
The estimates from the Poisson model for the effect of vaccination
are shown below. Click 'swap' to see the corresponding estimates
from the negative binomial model.
How do the estimates compare?
Interaction: Button: clouds picture (pop up box appears): Both
models give similar estimates for the effect of vaccination.
However, the negative binomial model, which accounts for the
frailty component, gives larger standard errors, i.e. it allows
for the extra variation due to overdispersion. Because the Poisson
model underestimates the standard error, inference from this model
is incorrect: it shows an apparently stronger association between
vaccination and infection rates.

Poisson model for pneumococcal vaccine trial in Papua New Guinea

              Rate ratio   Standard error   z       P > |z|   95% confidence limits
Vaccination   0.9038       0.383            2.389   0.017     0.8318   0.9820

This model assumes no child is more prone to infection than
another, i.e. all children have the same probability of infection.
You could conclude that vaccination has a significant effect in
reducing the infection rate by almost 10%.

Interaction: Button: Swap (table and text changes to the following):

Negative binomial model for pneumococcal vaccine trial in Papua New Guinea

              Coefficient   Standard error   z       P > |z|   95% confidence limits
Vaccination   0.8845        0.5683           1.910   0.056     0.7788   1.003

This model assumes that some children who have been previously
infected are more likely to get infected than children who have
not, i.e. it assumes observations are not independent. Vaccination
appears to reduce the infection rate, but the effect is not
significant at the 5% level.
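The effect of the larger standard error on the test can be checked
by hand. The sketch below assumes that the 95% limits quoted for
each model are Wald limits calculated on the log scale (an
assumption; the session does not say how they were obtained) and
recovers the standard error of the log rate ratio from them. The
function name wald_from_limits is made up for this illustration.

```python
import math


def wald_from_limits(rate_ratio, lower, upper, z_crit=1.96):
    """Recover the log-scale standard error, z statistic and
    two-sided P value from a rate ratio and its 95% limits."""
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(rate_ratio) / se_log
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return se_log, z, p_value


# Estimates and limits as quoted for the two models.
se_pois, z_pois, p_pois = wald_from_limits(0.9038, 0.8318, 0.9820)
se_nb, z_nb, p_nb = wald_from_limits(0.8845, 0.7788, 1.003)

print(f"Poisson:           SE(log RR) = {se_pois:.4f}, |z| = {abs(z_pois):.3f}, P = {p_pois:.3f}")
print(f"Negative binomial: SE(log RR) = {se_nb:.4f}, |z| = {abs(z_nb):.3f}, P = {p_nb:.3f}")
print(f"Ratio of standard errors: {se_nb / se_pois:.2f}")
```

The z statistics come out negative because both rate ratios are
below 1; the tables quote them without the sign. The point
estimates differ only slightly, so the change in the P value is
driven mainly by the larger standard error in the negative
binomial model.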
6.7: Overdispersion
The distribution of the observed number of infections, and the
expected number of infections based on the Poisson distribution,
can be seen below. The expected values are calculated assuming the
rate of infection does not depend on the number of previous
infections. So the rate is assumed to be the same for each of the
11 groups of children (0-10 previous infections), and is equal to
the overall rate for the study population (1712 infections / 2376
person-years = 72 per 100 person-years). Notice how the number of
new infections among children with high numbers of previous
infections is much greater than that expected from the Poisson
distribution.
Infections in Papua New Guinea schoolchildren followed for 2 years

No. of previous   No. of children        Expected number of children with new
infections        with new infections    infections, if no effect of previous
                                         infections
 0                702                    982
 1                444                    390
 2                261                    184
 3                147                     78
 4                 77                     39
 5                 34                     23
 6                 25                      7
 7                 11                      5
 8                  6                      2
 9                  3                      1
10                  2                      0.3
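The expected values can be approximately reproduced from figures
given in this session. The sketch below is a rough check, not the
original calculation: the person-years in each group are not
given, so they are back-calculated here from the number of new
infections and the rounded rate per 100 person-years in the table
of rates by number of previous infections. That back-calculation
is an assumption and makes the results approximate.

```python
# (previous infections, new infections, rate per 100 person-years),
# copied from the table of rates by number of previous infections.
groups = [
    (0, 702, 51), (1, 444, 82), (2, 261, 102), (3, 147, 135),
    (4, 77, 141), (5, 34, 107), (6, 25, 252), (7, 11, 172),
    (8, 6, 217), (9, 3, 189), (10, 2, 451),
]

total_infections = 1712      # from the text
total_person_years = 2376    # from the text
overall_rate = total_infections / total_person_years  # per person-year

expected = []
for previous, infections, rate_per_100 in groups:
    # Assumption: person-years are recovered from the rounded rate,
    # so they are only approximate.
    person_years = infections / (rate_per_100 / 100)
    # Same overall rate applied to every group, whatever the history.
    expected.append(overall_rate * person_years)

print(f"Overall rate: {100 * overall_rate:.0f} per 100 person-years")
for (previous, infections, _), value in zip(groups, expected):
    print(f"{previous:>2} previous: observed {infections:>3}, expected {value:6.1f}")
```

Because the rates are rounded, the reconstructed values will not
match the expected column exactly, but they should be close enough
to show where the observed counts start to exceed what a single
common rate would predict.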
6.8: Overdispersion
There are two ways to check for overdispersion within your data:
Interaction: Tabs: Examine Rates: You can subjectively examine the
infection rates within clusters. If you observe variation in rates
within clusters of the data, this is most likely an indication of
overdispersion.
Interaction: Tabs: Model deviance: To formally check for
overdispersion, you can compare the model that allows for
differences between clusters with a model that does not. So for
this example, you would fit the model including "number of
previous infections" and compare it to the model without this
variable, using a likelihood ratio test. If the test is
significant, then you have evidence of overdispersion. For
this data set, the likelihood ratio statistic (LRS) is highly
statistically significant (p
-
It is the log rate that is modelled.
Using a Poisson regression model
As with all regression models, using a Poisson model you can:
Estimate the effect of many exposure variables at the same time
Adjust for potential confounding variables
Examine effect modification
Estimate the effect of a quantitative exposure
Overdispersion
A Poisson model assumes that all events are independent and that
event rates are similar within categories of the explanatory
variables. If this is not true, the variance of the number of
events will be greater than that predicted by the model. This is
known as overdispersion. The presence of overdispersion should
always be assessed in a Poisson model, and if it is present a more
complex model (such as the negative binomial model) is required.
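The formal check for overdispersion can be sketched with the
grouped infection counts from this session. This is a simplified
illustration, not the analysis behind the likelihood ratio
statistic quoted in the session. It assumes that number of
previous infections is fitted as a categorical variable with no
other covariates (so the fitted counts equal the observed counts),
and it takes the rounded expected counts under a single common
rate as the fit of the model without this variable. The function
names are made up for the example.

```python
import math

# New infections by number of previous infections (0 to 10).
observed = [702, 444, 261, 147, 77, 34, 25, 11, 6, 3, 2]
# Rounded expected counts if the rate does not depend on previous infections.
expected = [982, 390, 184, 78, 39, 23, 7, 5, 2, 1, 0.3]


def poisson_deviance(obs, fitted):
    """Deviance of the fitted counts relative to a model that
    reproduces the observed counts exactly."""
    total = 0.0
    for o, f in zip(obs, fitted):
        log_term = o * math.log(o / f) if o > 0 else 0.0
        total += log_term - (o - f)
    return 2 * total


def chi2_upper_tail_even_df(x, df):
    """Upper tail area of the chi-squared distribution.
    Closed form that is valid only when df is even."""
    half = x / 2
    term = 1.0
    series = 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


lrs = poisson_deviance(observed, expected)
df = len(observed) - 1   # 11 groups, one common rate
p_value = chi2_upper_tail_even_df(lrs, df)
print(f"LRS = {lrs:.1f} on {df} df, P = {p_value:.2g}")
```

A large statistic relative to its degrees of freedom, as here,
says that a single common rate fits the groups poorly, which is
the evidence of overdispersion the test is looking for.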