
Further Poisson regression (AS05)

EPM304 Advanced Statistical Methods in Epidemiology

Course: PG Diploma/ MSc Epidemiology

This document contains a copy of the study material located within the computer assisted learning (CAL) session. If you have any questions regarding this document or your course, please contact DLsupport via [email protected]. Important note: this document does not replace the CAL material found on your module CDROM. When studying this session, please ensure you work through the CDROM material first. This document can then be used for revision purposes to refer back to specific sessions. These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale or further copying.

London School of Hygiene & Tropical Medicine September 2013 v1.0

Section 1: Further Poisson regression

Aim

To discuss the Poisson distribution, review Poisson regression and learn about overdispersion within a Poisson model.

Objectives

By the end of this session you will be able to:

- describe the Poisson distribution
- calculate Poisson probabilities
- summarise the procedures involved in developing a Poisson model
- explain overdispersion in a Poisson model
- explain how to deal with overdispersion

This session should take you between 1.5 and 2 hours to complete.

Section 2: Planning your study

This session will look at the Poisson distribution and Poisson regression in more detail. We will first consider the Poisson distribution, then review the main issues of model development for a Poisson regression model, and finally discuss what to do if Poisson regression does not provide the most appropriate model for your data.

To work through this session you should be familiar with the analysis of cohort studies and regression modelling. Students who completed EPM202 can refer to the appropriate sessions listed below. Occasional students should refer to their preferred text.

- Cohort studies: SM02
- Likelihood: SM05
- Introduction to Poisson and Cox regression: SM11

2.1: Planning your study

To illustrate the methods discussed in this session three examples are used.

Study 1: Weekly registrations of new breast cancer cases within a region

In the UK, all new cases of breast cancer are reported to the regional cancer registry via GPs or a hospital. The cases are reported on a weekly basis.

Study 2: A study of optic nerve disease carried out in Nigeria

Individuals living in an area of rural northern Nigeria that is endemic for onchocerciasis formed the placebo arm of a trial of ivermectin. All individuals were screened for optic nerve disease (OND) at the start of the trial. Only individuals without optic nerve disease at baseline were included. Individuals were re-examined at about 2 and 3 years post-baseline to identify incident cases of optic nerve disease.

Study 3: A pneumococcal vaccine trial based in Papua New Guinea

A trial was carried out in Papua New Guinea to assess the effect of a pneumococcal vaccine on morbidity from acute lower respiratory tract infections in schoolchildren. Information was collected on all children in the study, and throughout the trial each episode of infection was recorded.

Section 3: The Poisson distribution

You should already be familiar with some statistical distributions, i.e. the Normal and Binomial distributions (covered in sessions SC04 and SC05). During parts of the EPP course we have applied the Poisson distribution but not fully discussed the distribution itself. On this page we will look more closely at this distribution.

3.1: The Poisson distribution

The Poisson distribution is named after the French mathematician Siméon Denis Poisson (1781-1840). Let's first think about the type of data it can be applied to, i.e. data in discrete counts. Shown below are some examples of instances in which you would count the number of occurrences of an event over a period of time.

Examples

1. The weekly number of new cases of breast cancer notified to the regional cancer registry.
2. The number of deaths that occur in a cohort of civil servants in a year.
3. The number of cases of salmonella reported in a regional health authority within a week.

3.2: The Poisson distribution

So, the Poisson distribution is appropriate for describing the number of occurrences of an event during a period of time, but only when the events are independent of each other and occur at random.

3.3: The Poisson distribution

The simplest way to describe the Poisson distribution is to consider the average rate at which events occur over time. For example, imagine you were told that the weekly number of new cases of breast cancer reported to the regional register was 4 on average. Does this mean that every week you should expect 4 new cases of breast cancer to be reported?

Answer: it is highly unlikely that you would get exactly 4 new cases of breast cancer reported each week, but the weekly average over a number of weeks should be close to 4. In one week there might be no cases reported and the next week there might be more than 6 cases reported.

3.4: The Poisson distribution

A Poisson distribution describes the sampling distribution of the number of occurrences, r, of an event during a period of time. It depends on only one parameter, $\mu$, the mean number of occurrences in periods of the same length. This is shown below.

Sampling distribution: the distribution of sample estimates obtained when many random samples are generated independently from the same population.

Parameter: a measurement from a population that characterises one of its features.

3.5: The Poisson distribution

Now, the plot below shows the distribution of the weekly notifications of breast cancer to a regional register over 19 weeks. What are the values of the following parameters? (You may wish to go back to the previous card. To work out $\mu$ you need to do a calculation considering the frequency of each event.)

Maximum value of r observed = 8. You can see from the graph that the greatest number of notifications in a week was 8.

Mean, $\mu$ = 4.2. The mean is given by the total number of notifications over the 19-week period divided by the number of weeks. The total is found by multiplying each weekly count by the number of weeks in which it was observed (weeks x notifications):

(1x1) + (2x2) + (4x3) + (5x4) + (3x5) + (1x6) + (2x7) + (1x8) = 80

Therefore the mean is 80 / 19 = 4.2.
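The mean calculation in 3.5 can be sketched in a few lines of Python. This is only an illustration: the dictionary name `weeks_by_count` and the helper `frequency_mean` are made up here, and the frequencies are the ones read from the sum shown in the answer above.

```python
# Keys are the number of notifications in a week (r);
# values are the number of weeks in which that count was observed.
weeks_by_count = {1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 1, 7: 2, 8: 1}

def frequency_mean(freq):
    """Return total events, number of periods, and the mean events per period."""
    total_events = sum(r * n for r, n in freq.items())
    total_periods = sum(freq.values())
    return total_events, total_periods, total_events / total_periods

total, n_weeks, mean = frequency_mean(weeks_by_count)
print(f"{total} notifications over {n_weeks} weeks, mean = {mean:.1f} per week")
```

The frequencies sum to 19 weeks, which confirms that the total of 80 should be divided by 19.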

3.6: The Poisson distribution

Let's go back to the theoretical Poisson distribution. If you assume the weekly registrations are independent and follow a Poisson distribution, you can calculate the probability that a specific number of cases will occur in a week.

You will see how to do this calculation next, but first consider the plot below. What is the probability of 2 registrations in a week? Give your answer to 2 decimal places.

Prob (2 registrations/week) = 0.14 or 0.15. The probability of 2 registrations in a week is shown as the height of the vertical line at 2 events on the x-axis. This line reaches a point just below a probability of 0.15, so the answer is either 0.14 or 0.15.

3.7: The Poisson distribution

The probability of r occurrences can be calculated from the Poisson formula shown below.

$$\text{Probability}(r \text{ occurrences}) = \frac{e^{-\mu}\,\mu^{r}}{r!}$$

where:

- e is the mathematical constant 2.71828...
- $\mu$ is the mean number of events
- r is the number of occurrences
- r! is the factorial of r.

Factorial: the factorial of an integer n is written n! and is calculated as 1 x 2 x 3 x ... x n. For example, the factorial of 5 is 5! = 1 x 2 x 3 x 4 x 5 = 120. By convention, 0! = 1.

Let's work this out for the registration of breast cancer cases. The mean number of new cases of breast cancer registered each week is 4.0. Using this number with the formula above, you can calculate the probability of 0, 1, 2, 3... new registrations in a week. For example, to calculate the probability of 2 registrations in a week for a mean of 4:

$$\text{Probability}(2 \text{ registrations}) = \frac{e^{-4}\,4^{2}}{2!} = \frac{e^{-4} \times 16}{2 \times 1} = 0.1465$$

Does this agree with the probability you read from the plot? Go back and check.

3.8: The Poisson distribution

The table below shows the probabilities for $\mu$ = 4, together with how each was calculated. Calculate the missing probabilities (r = 7 and r = 9), giving your answers to 4 decimal places.

Remember, $\text{prob}(r) = \dfrac{e^{-\mu}\,\mu^{r}}{r!}$

Probability of r breast cancer registrations per week

Number of registrations (r)   Calculation                                Probability
0                             (e^-4 x 4^0) / 0! = e^-4 / 1               0.0183
1                             (e^-4 x 4^1) / 1! = 4e^-4 / 1              0.0733
2                             (e^-4 x 4^2) / 2! = 16e^-4 / 2             0.1465
3                             (e^-4 x 4^3) / 3! = 64e^-4 / 6             0.1954
4                             (e^-4 x 4^4) / 4! = 256e^-4 / 24           0.1954
5                             (e^-4 x 4^5) / 5! = 1024e^-4 / 120         0.1563
6                             (e^-4 x 4^6) / 6! = 4096e^-4 / 720         0.1042
7                             to be calculated
8                             (e^-4 x 4^8) / 8! = 65536e^-4 / 40320      0.0298
9                             to be calculated

Answers: remember that the mean $\mu$ = 4, so:

Prob(7 registrations) = (e^-4 x 4^7) / 7! = 16384e^-4 / 5040 = 0.0595

Prob(9 registrations) = (e^-4 x 4^9) / 9! = 262144e^-4 / 362880 = 0.0132
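The probabilities for 0 to 9 registrations can be reproduced with a short Python sketch of the Poisson formula. The function name `poisson_probability` is made up for this illustration; only the standard `math` module is used.

```python
import math

def poisson_probability(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

mu = 4  # mean weekly registrations used in this session
table = {r: round(poisson_probability(r, mu), 4) for r in range(10)}
for r, p in table.items():
    print(f"{r:>2}  {p:.4f}")
```

Changing `mu` shows how the whole distribution shifts when the mean number of weekly registrations is different.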

Now, can you calculate the probability of 10 or more registrations, to 4 decimal places? Make sure you have the correct values for rows 7 and 9 before you do this.

Hint: remember that probabilities should add up to 1.

Answer: Prob(10+) = 0.0081. The probabilities must add up to 1 in total, so by summing the probabilities for rows 0 to 9 and subtracting that total from 1, you get the probability that there are 10 or more registrations in a week:

$$\text{Prob}(10+) = 1 - [\text{prob}(0) + \text{prob}(1) + \dots + \text{prob}(9)] = 1 - 0.9919 = 0.0081$$

3.9: The Poisson distribution

The mean and variance are often used to summarize a distribution. For a Poisson distribution the variance is equal to the mean, so it can be described by a single parameter.

So, if the variance is equal to the mean, the standard error equals the square root of the mean, $\sqrt{\mu}$. This standard error can be used to obtain a confidence interval for the mean.

3.10: The Poisson distribution

The Poisson distribution for the weekly registrations of breast cancer cases is shown below. Which of the following best describes the distribution: left skewed, symmetrical or right skewed?

Answer: right skewed. The Poisson distribution is skewed to the right since the probability of a small number of events is high and the probability of a large number of events is low. It is not left skewed, because a left skewed distribution has a low probability of a small number of events and a high probability of a large number of events. It is not symmetrical, because one tail of the distribution is much longer than the other.
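The statement in 3.9 that the variance of a Poisson distribution equals its mean can be checked numerically. The sketch below is an illustration only: it sums over r = 0 to 99, which is an arbitrary cut-off chosen here because the probability beyond it is negligible for a mean of 4.

```python
import math

def poisson_probability(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

mu = 4
# Truncate the infinite sum at r = 99 (assumption: the remaining tail is negligible).
probs = [poisson_probability(r, mu) for r in range(100)]
mean = sum(r * p for r, p in enumerate(probs))
variance = sum((r - mean) ** 2 * p for r, p in enumerate(probs))
standard_error = math.sqrt(mu)
print(f"mean = {mean:.4f}, variance = {variance:.4f}, standard error = {standard_error:.1f}")
```

Both the mean and the variance come out at 4, so the standard error is the square root of 4, i.e. 2.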

3.11: The Poisson distribution

The plots below show the shape of the Poisson distribution for different values of the mean. Consider the plots. Which distribution can we use as an approximation to the Poisson when the mean is large?

Answer: for large means, the shape of the Poisson distribution is symmetrical. In such cases, the distribution can be approximated by the Normal distribution.

[Plots: Poisson distribution for $\mu$ = 3, $\mu$ = 5 and $\mu$ = 10]
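One rough way to see the Normal approximation improving is to compare each Poisson probability with the height of a Normal curve that has the same mean and variance. This is a sketch under stated assumptions: the comparison measure (the largest absolute gap) and the cut-off `r_max` are choices made here for illustration, not something defined in the session.

```python
import math

def poisson_probability(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

def normal_density(x, mean, variance):
    """Height of the Normal curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def largest_gap(mu, r_max=60):
    """Largest absolute gap between the Poisson probability and the Normal density."""
    return max(
        abs(poisson_probability(r, mu) - normal_density(r, mu, mu))
        for r in range(r_max + 1)
    )

gaps = {mu: largest_gap(mu) for mu in (3, 5, 10)}
for mu, gap in gaps.items():
    print(f"mu = {mu:>2}: largest gap = {gap:.3f}")
```

The gap shrinks as the mean goes from 3 to 5 to 10, in line with the plots becoming more symmetrical.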

Section 4: The Poisson distribution and rates

In epidemiology we are more interested in the rate of occurrence of an outcome than in counts.

The mean rate, $\lambda$, of the occurrence of an event is given by:

$$\lambda = r / t$$

where r is the number of events in a time period t.

So for the Poisson distribution what is the standard error of this rate?

Answer: the Poisson distribution is described by one parameter, the mean, which is equal to the variance. The standard error of the count r is therefore estimated by $\sqrt{r}$, so the standard error of the rate is $\sqrt{r} / t$.

4.1: The Poisson distribution and rates

Usually the follow-up time differs for each individual, and in calculating the rate we therefore divide by the person-time. So, the mean rate is given by:

$$\lambda = r / \text{person-time}$$

Example

A study was carried out in a poor area of Mexico, in which 652 children aged less than 5 years were followed for a period of up to 2 years. During the study 73 cases of lower respiratory infection were recorded. The total child-years of follow-up was 868. Calculate the overall (incidence) rate of lower respiratory infection estimated from this study.

Answer: Rate = 0.084. The total number of infections during the study, r, was 73, and the total person-time of follow-up was 868 child-years. So, using the formula:

$$\lambda = 73 / 868 = 0.084$$

4.2: The Poisson distribution and rates

Such rates are usually expressed in units of 100 or 1000 person-years. For example, the incidence rate of lower respiratory infection in the Mexico study was 0.084 x 100 = 8.4 per 100 child-years.

Section 5: Poisson regression model

A Poisson regression model is appropriate for the analysis of cohort data where the outcome measure is a rate. The information essential for analysis is:

1. Time of entry to the study
2. Whether the event of interest occurred
3. Time of exit / end of follow-up

This allows the calculation of rates, based on the information for the whole cohort.

Rate = number of events / person-time at risk

Using Poisson regression we can model the log rates for the exposure(s) of interest and adjust for many confounders simultaneously. The general format of a Poisson model may be written as:

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$

Compare this model with two other models you are familiar with, namely the logistic model and the Cox model.

Logistic model:

$$\log\left(\frac{\pi}{1 - \pi}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$

Cox model:

$$\log \lambda(t; X) = \log \lambda(t; 0) + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$
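Because the Poisson model is linear on the log scale, exponentiating a coefficient gives a rate ratio. The sketch below illustrates this with a single binary exposure. The values of `alpha` and `beta_1` are hypothetical, chosen only for the illustration, and are not estimates from any study in this session.

```python
import math

# Hypothetical values, for illustration only (not estimates from this session).
alpha = math.log(0.010)  # log of the baseline rate: 10 events per 1000 person-years
beta_1 = math.log(2.0)   # log rate ratio for a binary exposure X1

def predicted_rate(x1):
    """Rate implied by the model log(rate) = alpha + beta_1 * x1."""
    return math.exp(alpha + beta_1 * x1)

rate_unexposed = predicted_rate(0)
rate_exposed = predicted_rate(1)
rate_ratio = rate_exposed / rate_unexposed
print(f"unexposed: {1000 * rate_unexposed:.1f}, exposed: {1000 * rate_exposed:.1f} per 1000 person-years")
print(f"rate ratio = {rate_ratio:.2f}, exp(beta_1) = {math.exp(beta_1):.2f}")
```

The ratio of the two predicted rates equals exp(beta_1) whatever the baseline rate, which is why model output is usually reported as rate ratios.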

5.1: Poisson regression model

As with all regression models, a Poisson model can:

- account for interaction
- model a linear effect
- adjust for many confounders
- simultaneously model different exposure types

5.2: Poisson regression model

You will now investigate the application of the Poisson model using data from a study of optic nerve disease carried out in Nigeria. Individuals living in an onchocercal area in Nigeria were followed over a 3-year period to study the incidence of optic nerve disease (OND). The exposure of interest was onchocerciasis, which was measured by microfilarial load. The categories for this are shown below.

Onchocerciasis (also known as river blindness): a disease affecting the skin and eyes, caused by the filarial worm Onchocerca volvulus. The disease is transmitted by blackflies and has been endemic in large areas of West Africa.

Microfilariae: offspring of the female worm Onchocerca volvulus, which migrate through the skin causing onchocerciasis.

Category of microfilarial load   Microfilariae per mg
0                                0
1                                0.1 to 10
2                                10.1 to 50
3                                more than 50

5.3: Poisson regression model

The rates of optic nerve disease for each category of microfilarial load are calculated by dividing the number of OND cases by the total person-time in each category of microfilarial load. Examine the rates shown below. Do you think there is a relationship between optic nerve disease and microfilarial load?

Rates of optic nerve disease by microfilarial load per 1000 person-years

Microfilarial load   D    Y (/1000)   Rate     95% confidence limits
0 mg                 14   1.158       12.091   7.161 to 20.415
0.1 to 10 mg         18   1.512       11.903   7.500 to 18.893
10.1 to 50 mg        31   0.992       31.256   21.981 to 44.444
over 50 mg           8    0.242       33.014   16.510 to 66.015

Rate = counts (D) / person-time (Y). Person-time is the total follow-up time for all people in each group.

Answer: the rates of optic nerve disease appear to increase with a higher microfilarial load, from 12.1 per 1000 person-years in the baseline microfilarial load group to 33.0 per 1000 person-years in individuals with the highest microfilarial load.

5.4: Poisson regression model

To compare the rate of optic nerve disease in individuals with higher levels of microfilarial load with the rate in individuals with no onchocercal infection you can look at the rate ratios. These are shown below. What can you conclude from these estimates?

Rate ratios for each level of microfilarial load compared to the baseline level

Microfilarial load   Rate ratio   Chi-squared   P value   95% confidence limits
0.1 to 10 mg         0.985        0.002         0.965     0.490 to 1.979
10.1 to 50 mg        2.585        9.370         0.002     1.375 to 4.859
over 50 mg           2.731        5.580         0.018     1.146 to 6.509

Answer: from the estimates in the table, there appears to be no difference in the rates between the lowest category (0.1 mg to 10 mg) and the baseline group (P = 0.97). However, the two highest categories of microfilarial load have significantly higher rates of optic nerve disease compared to the baseline group (P = 0.002 and P = 0.018).

5.5: Poisson regression model

By assuming that the rate of OND follows the Poisson distribution, we can model the rates using a Poisson regression model:

$$\log(\text{rate of OND}) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$

where:

- X1 = Mfload1 = 0.1 mg to 10 mg
- X2 = Mfload2 = 10.1 mg to 50 mg
- X3 = Mfload3 = over 50 mg

The model estimates are shown below. How do they compare with the results from the classical analysis in 5.4?

Poisson model for the effect of microfilarial load on optic nerve disease

          Rate ratio   SE      z        P value   95% confidence limits
Mfload1   0.985        0.351   -0.044   0.965     0.490 to 1.979
Mfload2   2.585        0.832   2.950    0.003     1.375 to 4.859
Mfload3   2.731        1.210   2.266    0.023     1.146 to 6.509

Log likelihood = -275.95294

Answer: the simple Poisson model produces exactly the same rate ratios and confidence intervals as the classical analysis of the association between optic nerve disease and microfilarial load. However, the P values are slightly different for the 2 methods, because in the regression analysis a Wald test is used, whereas in the classical analysis a score test is used. Note that neither analysis is adjusted for potential confounders.
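The classical rates and rate ratios, which the Poisson model reproduces, can be recomputed from the cases (D) and person-time (Y) tabulated in 5.3. The sketch below is an illustration under one assumption: the session does not state how the confidence limits were obtained, so the usual log-based error factors are used here. They agree with the tabulated limits to within rounding of the person-time.

```python
import math

# (microfilarial load, OND cases D, person-years Y in thousands), from the table in 5.3
groups = [
    ("0 mg", 14, 1.158),
    ("0.1 to 10 mg", 18, 1.512),
    ("10.1 to 50 mg", 31, 0.992),
    ("over 50 mg", 8, 0.242),
]

def rate_with_limits(cases, person_years):
    """Rate and 95% limits, using the error factor exp(1.96 / sqrt(cases))."""
    rate = cases / person_years
    error_factor = math.exp(1.96 / math.sqrt(cases))
    return rate, rate / error_factor, rate * error_factor

def rate_ratio_with_limits(cases, person_years, ref_cases, ref_person_years):
    """Rate ratio against the reference group, with 95% limits on the log scale."""
    ratio = (cases / person_years) / (ref_cases / ref_person_years)
    error_factor = math.exp(1.96 * math.sqrt(1 / cases + 1 / ref_cases))
    return ratio, ratio / error_factor, ratio * error_factor

_, ref_cases, ref_person_years = groups[0]  # baseline group: 0 mg
rates = {label: rate_with_limits(d, y) for label, d, y in groups}
rate_ratios = {
    label: rate_ratio_with_limits(d, y, ref_cases, ref_person_years)
    for label, d, y in groups[1:]
}

for label, (rate, low, high) in rates.items():
    print(f"{label:<14} rate {rate:6.2f} per 1000 ({low:.2f} to {high:.2f})")
for label, (ratio, low, high) in rate_ratios.items():
    print(f"{label:<14} rate ratio {ratio:.3f} ({low:.3f} to {high:.3f})")
```

Small differences from the tables in the third decimal place are to be expected, because the person-time in the table is itself rounded.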

5.6: Poisson regression model

Is there a single estimate in the table that shows whether microfilarial load has an effect on the rate of optic nerve disease?

Answer: no. From the table of Poisson model estimates we can say how each level of microfilarial load compares to the baseline group, but we cannot say what the overall effect of microfilarial load is. For this you need to calculate a likelihood ratio test.

5.7: Poisson regression model

The likelihood ratio test for the effect of microfilarial load is calculated as follows.

Log likelihood for:

- Model with microfilarial load = -275.95
- Model without microfilarial load = -284.17

Therefore, the likelihood ratio statistic (LRS) is:

$$\text{LRS} = 2 \times (-275.95 - (-284.17)) = 16.44; \quad P = 0.001$$

The model with microfilarial load has 3 more parameters than the model without it, so the LRS is referred to a chi-squared distribution on 3 degrees of freedom.

So what can you conclude about the overall effect of microfilarial load on optic nerve disease?

Answer: P = 0.001, so we can say that microfilarial load has a significant effect on the rate of optic nerve disease, before adjusting for potential confounders.

5.8: Poisson regression model

Using the same procedures and comparing models you can also:

- assess the effect of confounding variables
- test for interaction (and the proportional rates assumption)
- look for a linear effect for increasing (or decreasing) exposures
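The likelihood ratio comparison used in 5.7, which is the same procedure used for the model comparisons listed above, can be sketched as follows. The helper `chi_squared_upper_tail_3df` is written for this illustration only: it uses the closed-form upper tail of the chi-squared distribution for exactly 3 degrees of freedom, so it would need replacing for comparisons involving a different number of parameters.

```python
import math

def chi_squared_upper_tail_3df(x):
    """P(X > x) for a chi-squared distribution on 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

loglik_with = -275.95     # model with microfilarial load
loglik_without = -284.17  # model without microfilarial load

lrs = 2 * (loglik_with - loglik_without)
p_value = chi_squared_upper_tail_3df(lrs)
print(f"LRS = {lrs:.2f}, P = {p_value:.3f}")
```

Note that both log likelihoods are negative, and the statistic is positive because the larger model always has the higher (less negative) log likelihood.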

Confounding variables

To assess confounding, you fit two models. One is a model with the exposure AND the potential confounder, the other is a model with the exposure but NOT the potential confounder. You then compare the rate ratios for the exposure in the model that includes the potential confounder with the rate ratios for the exposure in the model that does not include the potential confounder. If the rate ratios for the exposure are considerably different between the 2 models (e.g. they change by around 10% or more), then we can say there is confounding. If there is confounding, then we must use the model that includes the confounder when estimating the effect of the exposure variable. (This is reviewed in SM12, section 5.)

Test for interaction

The simple Poisson model assumes that the effect of the exposure is constant over the follow-up period. We can test this assumption by splitting the follow-up time into several categories (time bands), and then seeing if there is evidence that the effect of the exposure varies over time. For example, if the follow-up period is 10 years we could split the follow-up period into 2 time bands, the first 5 years of follow-up and the second 5 years of follow-up. We could then investigate whether there is evidence that the effect of the exposure is different in the first 5 years (0-5 years) from that in the second 5 years (6-10 years) of follow-up. We would do this by fitting an interaction between the exposure variable and time band, and using a likelihood ratio test (LRT) to see if the interaction was statistically significant (comparing the model with the interaction term to the one without).

Alternatively, we might think that the effect of the exposure varied with calendar time. Suppose that individuals were followed up over the period 1980-1989. We could investigate whether the effect of the exposure was different during 1980-1984 from the period 1985-1989. Or we might think that the effect of the exposure varied with an individual's age. You will see how to investigate this in Practical 5. (This is reviewed in SM10, section 5.)

Linear effect

For a variable with increasing or decreasing exposure it is possible to model a linear effect. The goal in all model analysis should be to produce the simplest model that describes the data. If a linear effect can be assumed then this reduces the number of parameters required in a model. In addition, where a linear effect shows a dose-response relationship this is further evidence of a causal relationship. (This is reviewed in SM10, section 6.)

Section 6: Overdispersion

The assumptions underlying the Poisson model are that all events are independent, and that the event rates are similar within categories of the covariates (another term for explanatory variables, or independent variables). In some situations these assumptions are invalid, i.e. where for some reason an event may be more likely to happen in certain individuals. Two examples are given below.

Recurrent events: the possibility of recurrent asthma episodes is high for children who have previously had asthma, i.e. the chance of a recurrent episode is dependent on whether a child has previously had asthma.

Clustered observations: the chance of contracting an infectious disease may be higher for subjects within a certain community than for individuals living in another community. This may be due to some environmental or social factors.

The analysis of such data is covered in more detail in AS09.

6.1: Overdispersion

For such data, the variance of the distribution of events will be greater than that predicted by a Poisson model. This is known as overdispersion. For example, clusters of individuals can have different rates from each other; overall this will give a wider spread of rates. Sometimes you might not be aware of clustering in a dataset, because the characteristic that distinguishes the clusters has not been measured. Overdispersion can occur because of extra variability, which is not completely explained by the covariates in the model.
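Since a Poisson distribution has variance equal to its mean, a crude descriptive check for overdispersion is to compare the sample variance of a set of counts with their sample mean. The sketch below applies this to the 19 weekly breast cancer notification counts from 3.5. It is a rough guide only, under the assumptions that no covariates are involved and that a ratio well above 1 would be the warning sign; it is not a formal test.

```python
# Keys are notifications per week (r); values are the number of weeks with that count (from 3.5).
weeks_by_count = {1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 1, 7: 2, 8: 1}

# Expand the frequency table into one count per week.
counts = [r for r, n_weeks in weeks_by_count.items() for _ in range(n_weeks)]
n = len(counts)
mean = sum(counts) / n
variance = sum((r - mean) ** 2 for r in counts) / (n - 1)  # sample variance
ratio = variance / mean
print(f"mean = {mean:.2f}, variance = {variance:.2f}, variance / mean = {ratio:.2f}")
```

For these data the ratio is below 1, so this small sample gives no suggestion of overdispersion.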

6.2: Overdispersion

You can account for the extra variability by assuming that individual rates depend on the covariates that you would normally put in the Poisson model plus an unobserved component specific to each subject. This component is known as the frailty. The frailty component varies between individuals. It is a measure of how some individuals may be more prone to infection than others. This may be because of the community they live in or their disease history. Therefore, the model can be written:

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \text{frailty}$$

6.3: Overdispersion

Statistically we can incorporate a second distribution to model the frailty component. When the Poisson distribution is expanded in this way the resulting model is called the negative binomial model. This is fitted to the data in the same way a Poisson model is fitted. Regression methods can be used to estimate the usual exposure effects accounting for the frailty term.

Negative binomial model:

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \text{frailty}$$
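To see how a frailty term inflates the variance, the sketch below computes the mean and variance of a negative binomial distribution directly from its probabilities. It rests on an assumption the session does not state: that the frailty multiplies the rate and follows a gamma distribution with mean 1 and variance 1/k, which is the usual construction of the negative binomial. The values of `mu` and `k` are hypothetical, chosen only for illustration.

```python
import math

def negative_binomial_probability(r, mu, k):
    """P(r events) when a Poisson mean mu is multiplied by a gamma frailty
    with mean 1 and variance 1/k (assumed frailty distribution)."""
    log_p = (
        math.lgamma(r + k) - math.lgamma(k) - math.lgamma(r + 1)
        + k * math.log(k / (k + mu))
        + r * math.log(mu / (k + mu))
    )
    return math.exp(log_p)

mu, k = 4.0, 2.0  # hypothetical mean and frailty parameter
# Truncate the sum at r = 399; the remaining tail is negligible for these values.
probs = [negative_binomial_probability(r, mu, k) for r in range(400)]
mean = sum(r * p for r, p in enumerate(probs))
variance = sum((r - mean) ** 2 * p for r, p in enumerate(probs))
print(f"mean = {mean:.2f}, variance = {variance:.2f} (a Poisson variance would be {mu:.2f})")
```

The mean is unchanged at 4 but the variance is 12, three times what a Poisson distribution with the same mean would give. Larger values of `k` mean less variable frailty and bring the variance back towards the mean.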

6.4: Overdispersion To illustrate the difference in estimates obtained from a Poisson model and negative binomial model you will now consider a study of recurrent infections. The data is from a trial of a pneumococcal vaccine carried out in Papua New Guinea. The trial involved school children, and occurrence of infection was recorded over 2 years. Which of the calculations opposite do you think gives the best estimate for the rate of infection in the population of schoolchildren?

Interaction: Hotspot: The number of children who had an infection during the two year period, divided by the total children-years. Incorrect Response (pop up box appears): No, remember each child could have a number of infections within the two-year period. So, to measure the rate of infection in the population, it is more relevant to consider the number of infections. Interaction: Hotspot: The number of infections during the two-year period, divided by the total children-years.

The number of children who had an infection during the two-year

period, divided by the total children-years.

The number of infections during the two-year period, divided by the

total children-years.

Correct Response (pop up box appears): Yes, calculating the rate of infection in this group of schoolchildren using the number of infections gives a better estimate of the burden of infection in the population of schoolchildren. 6.5: Overdispersion The table opposite gives the distribution of recurring infections and infection rates by the number of previous infections observed in the Papua New Guinea schoolchildren during the 2-year period. Click 'swap' beneath the table to see a plot of these data. Then answer the questions below. 1. Why might overdispersion occur with these data? Interaction: Button: clouds picture (pop up box appears): Overdispersion would occur if children who have had previous infection(s) are more prone to infection than children who have not, i.e. if re-infection is more probable with an increasing number of previous infections. 2. Is there any evidence of overdispersion from the table and plot? Interaction: Button: clouds picture (pop up box appears): You can see from the data that the infection rate per 100 person-years increases with an increasing number of previous infections. This is an indication that recurring infections produce variability between individuals that is likely to cause overdispersion. Distribution of recurrent infections in children followed for 2 years

Number of previous    Number of children       Infection rate per
infections            with new infections      100 person-years
 0                    702                       51
 1                    444                       82
 2                    261                      102
 3                    147                      135
 4                     77                      141
 5                     34                      107
 6                     25                      252
 7                     11                      172
 8                      6                      217
 9                      3                      189
10                      2                      451

Interaction: Button: Swap (table changes to the following):

Plot: Distribution of recurrent infections

6.6: Overdispersion

The estimates from the Poisson model for the effect of vaccination are shown below.

Click 'swap' to see the corresponding estimates from the negative binomial model.

How do the estimates compare?

Interaction: Button: clouds picture (pop up box appears): Both models give similar estimates for the effect of vaccination. However, the negative binomial model, which accounts for the frailty component, gives a larger standard error, i.e. it allows for the extra variation due to overdispersion. Because the Poisson model underestimates the standard error, inference from this model is misleading: the association between vaccination and infection rates appears stronger than the data support.

Poisson model for pneumococcal vaccine trial in Papua New Guinea

              Rate ratio   Standard error   z       P > |z|   95% confidence limits
Vaccination   0.9038       0.0383           2.389   0.017     0.8318   0.9820

This model assumes no child is more prone to infection than another, i.e. all children have the same probability of infection.
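The link between the rate ratio, its confidence limits, z and the P value can be checked by hand. The sketch below assumes the usual Wald approach on the log scale (an assumption; the session does not say how the software produced these values), and the helper name `wald_from_ci` is hypothetical. It recovers the log-scale standard error from the 95% confidence limits and recomputes z and the two-sided P value for the Poisson estimates above and for the negative binomial estimates quoted later in this section.

```python
import math

def wald_from_ci(rate_ratio, lower, upper, z_crit=1.96):
    """Recover the log-scale SE, z and two-sided P from a rate ratio and its 95% limits."""
    # The confidence interval is symmetric on the log scale: log(RR) +/- z_crit * SE
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    # z is negative for a protective effect; the tables report its magnitude
    z = math.log(rate_ratio) / se_log
    # Two-sided tail area of the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return se_log, z, p

poisson = wald_from_ci(0.9038, 0.8318, 0.9820)   # Poisson model
negbin = wald_from_ci(0.8845, 0.7788, 1.003)     # negative binomial model

for name, (se_log, z, p) in [("Poisson", poisson), ("Negative binomial", negbin)]:
    print(f"{name}: SE(log RR) = {se_log:.4f}, z = {z:.2f}, P = {p:.3f}")
```

The recomputed values should agree with the tabulated z and P values to within rounding, and the log-scale standard error from the negative binomial model is roughly one and a half times that from the Poisson model.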

You could conclude that vaccination has a significant effect in reducing the infection rate by almost 10%.

Interaction: Button: Swap (table and text changes to the following):

Negative binomial model for pneumococcal vaccine trial in Papua New Guinea

              Rate ratio   Standard error   z       P > |z|   95% confidence limits
Vaccination   0.8845       0.05683          1.910   0.056     0.7788   1.003

This model assumes that some children who have been previously infected are more likely to get infected than children who have not, i.e. it assumes observations are not independent.

Vaccination appears to reduce the infection rate, but the effect is not significant at the 5% level.

6.7: Overdispersion

The distribution of the observed number of infections, and the expected number of infections based on the Poisson distribution, can be seen below. The expected values are calculated assuming the rate of infection does not depend on the number of previous infections. So the rate is assumed to be the same for each of the 11 groups of children (0-10 previous infections), and is equal to the overall rate for the study population (1712 infections / 2376 person-years = 72 per 100 person-years).

Notice how the number of new infections among children with high numbers of previous infections is much greater than that expected from the Poisson distribution.

Infections in Papua New Guinea schoolchildren followed for 2 years

No. of previous    No. of children with    Expected number of children with new
infections         new infections          infections, if no effect of previous infections
 0                 702                     982
 1                 444                     390
 2                 261                     184
 3                 147                      78
 4                  77                      39
 5                  34                      23
 6                  25                       7
 7                  11                       5
 8                   6                       2
 9                   3                       1
10                   2                       0.3
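The expected counts in this table can be reproduced approximately. The sketch below assumes that each expected count is the overall rate multiplied by the person-years in that group, and that the person-years per group can be back-calculated from the counts and rounded rates tabulated in section 6.5 (the person-years themselves are not listed in this document, so small rounding differences are to be expected). Variable names are illustrative.

```python
# Counts and rates per 100 person-years from section 6.5,
# indexed by number of previous infections (0 to 10)
observed = [702, 444, 261, 147, 77, 34, 25, 11, 6, 3, 2]
rate_per_100 = [51, 82, 102, 135, 141, 107, 252, 172, 217, 189, 451]

# Overall rate in infections per person-year, as quoted in the text
overall_rate = 1712 / 2376

# Assumption: person-years per group recovered as count / rate (rates are rounded)
person_years = [o / (r / 100) for o, r in zip(observed, rate_per_100)]

# Expected count in each group if the overall rate applied to everyone
expected = [overall_rate * py for py in person_years]

for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"{k:>2} previous infections: observed {o:>3}, expected {e:6.1f}")
```

With a common rate, far more of the infections would be expected in the group with no previous infections and far fewer in the groups with many, which is the pattern the table shows.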

6.8: Overdispersion

There are two ways to check for overdispersion within your data:

Interaction: Tabs: Examine Rates: You can subjectively examine the infection rates within clusters. If you observe variation in rates within clusters of the data, this is most likely an indication of overdispersion.

Interaction: Tabs: Model deviance: To formally check for overdispersion, you can compare the model that allows for differences between clusters with a model that does not. So for this example, you would fit the model including "number of previous infections" and compare it to the model without this variable, using a likelihood ratio test. If the test is significant, then you have evidence of overdispersion. For this data set, the likelihood ratio statistic (LRS) is highly statistically significant (p
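As a rough illustration of the second approach, a likelihood ratio statistic can be approximated from the grouped counts alone. The sketch below assumes that "number of previous infections" enters the model as a categorical variable with 11 levels, so that the fuller model reproduces the observed counts and the test has 10 degrees of freedom. It uses the observed counts as tabulated in section 6.5 and the rounded expected counts from section 6.7, so the result is approximate and is not the value reported in the session. The function names are hypothetical.

```python
import math

observed = [702, 444, 261, 147, 77, 34, 25, 11, 6, 3, 2]   # section 6.5
expected = [982, 390, 184, 78, 39, 23, 7, 5, 2, 1, 0.3]    # section 6.7 (common rate)

def poisson_lrs(obs, exp):
    """Poisson likelihood ratio statistic: 2 * sum(O * ln(O / E) - (O - E))."""
    return 2 * sum(o * math.log(o / e) - (o - e) for o, e in zip(obs, exp))

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability, using the closed form valid for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

lrs = poisson_lrs(observed, expected)
df = len(observed) - 1   # 11 groups compared with a single common rate
p_value = chi2_sf_even_df(lrs, df)
print(f"LRS = {lrs:.1f} on {df} df, P = {p_value:.2g}")
```

A statistic this far above its degrees of freedom indicates that a single common rate fits the groups poorly, in line with the conclusion drawn in the text.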

It is the log rate that is modelled.

Using a Poisson regression model

As with all regression models, using a Poisson model you can:

Estimate the effect of many exposure variables at the same time
Adjust for potential confounding variables
Examine for effect modification
Estimate the effect of a quantitative exposure

Overdispersion

A Poisson model assumes that all events are independent and event rates are similar within categories of the explanatory variables. If this is not true, the variance of events will be greater than that predicted by the model. This is known as overdispersion. The presence of overdispersion should always be assessed in a Poisson model, and if it is present a more complex model (such as the negative binomial model) is required.
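The effect of unequal susceptibility on the variance can be illustrated by simulation. The sketch below is not based on the trial data: it assumes a hypothetical cohort of 20,000 children followed for 2 years at the overall rate of 0.72 infections per child-year, and compares counts generated with a common rate against counts where each child's rate is multiplied by a gamma-distributed frailty with mean 1 (the mixture that gives rise to the negative binomial distribution). The frailty variance of 1, the cohort size and all function names are illustrative assumptions.

```python
import math
import random
import statistics

def poisson_draw(rng, mean):
    """Draw one Poisson count using Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate(n_children, mean_count, frailty_variance, seed=1):
    """Simulate infection counts; frailty_variance = 0 gives a pure Poisson model."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_children):
        frailty = 1.0
        if frailty_variance > 0:
            # Gamma with shape 1/v and scale v has mean 1 and variance v
            frailty = rng.gammavariate(1 / frailty_variance, frailty_variance)
        counts.append(poisson_draw(rng, mean_count * frailty))
    return counts

mean_count = 0.72 * 2   # expected infections per child over 2 years
equal_risk = simulate(20000, mean_count, frailty_variance=0)
unequal_risk = simulate(20000, mean_count, frailty_variance=1.0)

for label, counts in [("Common rate", equal_risk), ("Gamma frailty", unequal_risk)]:
    print(label, "mean", round(statistics.mean(counts), 2),
          "variance", round(statistics.variance(counts), 2))
```

Under the common rate the variance is close to the mean, as the Poisson distribution requires. With frailty the mean is similar but the variance is clearly larger (in theory, the mean plus the frailty variance times the squared mean), which is the signature of overdispersion.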
