Lecture Notes in Biostatistics
Prepared, edited and compiled by Kazaura, M. R.; Makwaya, C. K.; Masanja, C. M.; Mpembeni, R. C.
Muhimbili University College of Health Sciences, Institute of Public Health, Department of Epidemiology and Biostatistics, Dar es Salaam, 1997
The study of statistics deals with the collection, processing and interpretation of data. The concepts of statistics are applied in many scientific fields, including agriculture, business, engineering and health.
When focus is on biological and health sciences, the term biostatistics is used. This manual of
biostatistics was written for students of the health sciences and serves as an introduction to the study of
biostatistics. The contents of the manual are based on the requirements for the biostatistics courses
offered at the Muhimbili University College of Health Sciences for both undergraduates and
postgraduates.
Textbooks on mathematical statistics usually include theoretical examples and exercises. The task of
finding relevant data is so enormous that even textbooks on applied statistics rarely include practical
examples and exercises. In particular, a course in biostatistics which is not introduced via numerous
examples of real data renders a restrictive view of the subject and hence tends to discourage the
uninitiated student. This manual is intended to provide substantial contact with a variety of statistical
methods and data sets so that the student can appreciate their application and the contexts in which they
are used. In the process the manual will facilitate learning of the student and provide handy notes and
references for further reading.
The authors have performed a valuable service in compiling the present manual. Many of the examples
and exercises given in this collection are based on health-related data, and the techniques which the
student is expected to apply cover a wide range of commonly used techniques. The manual will be of
great value both as the basis for a taught course and for private study.
ACKNOWLEDGEMENT
This work would have been impossible without the generous financial support of SIDA (SAREC) as
part of Research Capability Strengthening in the Department of Epidemiology/Biostatistics.
too congested. Hence a bar chart is more appropriate.
Fig 2. Distribution of the population using different control methods.
Two-way tables:
Statistical information on two variables can be presented simultaneously in the form of a two-way table. Such a table makes the information easier to assimilate by showing at a glance many of the properties of the data.
In a two way table data are presented in rows and columns. The format for a table depends upon the data and
the aspects of the data which are important to portray.
A two-way table should include the following:
1. A clear title.
2. A caption for the rows and columns with units of measurement of the variable.
3. Labels for each individual row or column, i.e. the values taken by the variable concerned.
4. Marginal and grand totals.
Consider the following example:
In a study to investigate whether or not HIV1 infection is a risk factor for pulmonary tuberculosis (PTB), a total of 2165 individuals were examined. Blood samples were also collected from these individuals for laboratory diagnosis of HIV1 infection.
The following results were obtained:
Of the 2165 individuals examined, 651 were found to be negative for HIV1 infection. Of those who were negative, 57 were found to have PTB. Of the 1526 who were HIV1 positive, 875 were found to have PTB.
Table 2.4: Frequency distribution of the number of lesions caused by smallpox virus in egg membranes.

NUMBER OF LESIONS    FREQUENCY (NUMBER OF MEMBRANES)
0-                    1
10-                   6
20-                  14
30-                  14
40-                  17
50-                   8
60-                   9
70-                   3
80-                   6
90-                   1
100-                  0
110-119               1
Total                80
Note: "-" means up to but not including the next tabulated value. For example, 10- means 10 is the lower limit and 19 is the upper limit, and 14.5 is the midpoint of the class interval 10- .
The following rules are used to make a frequency distribution for grouped data.
1. Determine the range, R, of the values (R = largest value - smallest value).
2. Decide on the number, I, of classes. This number depends on the form of the data and the requirements of the frequency distribution, but it should usually be between 5 and 20 for convenience.
3. Determine the width of the class interval, W, such that W = R/I. A constant width for all classes is preferable.
4. Choose the upper and lower limits of the class intervals carefully to avoid ambiguities.
5. List the intervals in order. Use tallies to allocate each observation to the class in which it falls. Add the tally marks to obtain class frequencies.
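The five rules above can be sketched in code. This is a minimal illustration with a made-up data set; the choice of I and the rounding of W are arbitrary:

```python
# Sketch of the five rules for building a grouped frequency distribution.
# The data values below are hypothetical, for illustration only.
import math

data = [3, 7, 12, 15, 18, 21, 22, 25, 30, 34, 35, 41, 44, 52, 58]

# Rule 1: the range R
R = max(data) - min(data)            # 58 - 3 = 55

# Rule 2: choose the number of classes I (between 5 and 20)
I = 6

# Rule 3: class width W = R / I, rounded up to a convenient whole number
W = math.ceil(R / I)                 # ceil(55/6) -> 10

# Rules 4 and 5: set class limits and tally each observation
lower = (min(data) // W) * W         # start the first class at a round number
classes = {}
for start in range(lower, max(data) + W, W):
    label = f"{start}-"              # "-" means up to but not including start+W
    classes[label] = sum(start <= x < start + W for x in data)

for label, freq in classes.items():
    print(label, freq)
```

The class frequencies always add back up to the number of observations, which is a useful check on the tallying.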
Use of diagrams in quantitative data:
A: Histograms:
A histogram is a familiar bar-type diagram. Values of the variable are represented on the horizontal scale, and the vertical scale represents the frequency or relative frequency at each value. Each bar is centred at the midpoint of its class interval.
Fig.4 A histogram showing distribution of age at loss of last tooth
B: Line diagrams:
These are often used to express the change in some quantity over a period of time, or to illustrate the relationship between continuous quantities. Each point on the graph represents a pair of values, i.e. a value on the x-axis and a corresponding value on the y-axis. Adjacent points are then connected by straight lines.
[Figure: x-axis Year (1983-1992); y-axis Cumulative no. of cases (Thousands), 0-40]
Fig. 5 A line diagram showing cumulative number of AIDS cases in Tanzania from 1983 to 1992.
C: Frequency polygons
Frequency polygons are a series of points (located at the mid-point of the interval) connected by straight
lines. The height of these points is equal to the frequency or relative frequency associated with the values of
the variable (or the interval). The end points are joined to the horizontal axis at the mid points of the groups
immediately below and above the lowest and highest non-zero frequencies respectively.
Frequency polygons are not as popular as histograms, but they too are a visual equivalent of a frequency distribution. They can easily be superimposed, and are therefore superior to histograms for comparing sets of data.
Generally, when n (the number of observations) is odd, the median is the ½(n+1)th observation. When n is even there is no single middle observation, and the median is the mean of the two middle observations, i.e. the (n/2)th and the (n/2 + 1)th observations.
In frequency distributions, the median can be obtained by accumulating the frequencies and noting the value of the variable which divides the data into two equal halves, i.e. the point below which n/2 of the observations lie.
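The odd/even rules can be written directly as a small function (illustrative only):

```python
# Median following the rules above: the ((n+1)/2)th value when n is odd,
# the mean of the (n/2)th and (n/2 + 1)th values when n is even.
def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # 1-based ((n+1)/2)th observation
    return (s[n // 2 - 1] + s[n // 2]) / 2  # mean of the two middle values

print(median([7, 1, 5]))        # odd n
print(median([7, 1, 5, 3]))     # even n: mean of the 2nd and 3rd values
```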
Note:
1. The median is less efficient than the mean because it takes no account of the magnitude of most of the
observations.
2. If two groups of observations are pooled, the median of the combined group cannot be expressed in terms of the medians of the two component groups.
3. The median is much less amenable than the mean to mathematical treatments and so it is less used in more
elaborate statistical techniques.
However if the data are distributed asymmetrically, the median is more stable than the mean. Consider the
example on the duration of stay in hospital where the median is 7; this is more realistic than the calculated
mean of 22 days.
3. Mode:
The mode is the value with the highest frequency. i.e. The value which occurs most frequently. The modal
value (days) for the duration of stay in hospital, example given above, is 5.
Take the example of playing cards. A pack has 52 cards: 13 Spades, 13 Diamonds, 13 Hearts and 13 Clubs. If you draw two cards (one at a time) from a pack, what is the probability that the 1st and 2nd cards will both be Spades?
NOTE: P (spade on 1st draw) = 13/52
P (spade on 2nd draw / spade on 1st draw) = 12/51
This is because you have already drawn one spade, decreasing both the number of spades and the size of the pack by 1. So P (spade on 1st and 2nd draws) = 13/52 x 12/51 = 0.0588.
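The calculation can be checked with exact fractions:

```python
# Probability of drawing two spades in succession (without replacement),
# using exact fractions rather than decimals.
from fractions import Fraction

p_first = Fraction(13, 52)               # 13 spades in a 52-card pack
p_second_given_first = Fraction(12, 51)  # one spade (and one card) removed

p_both = p_first * p_second_given_first
print(p_both, float(p_both))             # 1/17, about 0.0588
```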
Definition:
Independent events:
Two events are independent if the occurrence of one does not affect in any way the occurrence of the other. Thus if A and B are independent events, P(B/A) = P(B). When a coin is tossed, the outcome of the 1st trial does not affect the outcome of the 2nd trial.
In independent trials, the multiplication rule assumes a simple form P(A and B) = P(A) P(B).
e.g. P(H and 5) = P(H) x P(5)
= 1/2 x 1/6 = 1/12.
EXERCISE
1. Define the following terms:
a) Probability
b) Mutually exclusive events
c) Independent events
d) Conditional probability
2. The following table shows 1000 nursing school applicants classified according to scores made on a
college entrance examination and the quality of the high school from which they graduated, as rated
by a group of educators.
QUALITY OF HIGH SCHOOL
SCORE        POOR (P)   AVERAGE (A)   SUPERIOR (S)   TOTAL
Low (L)         105          70            25          200
Medium (M)       60         175            65          300
High (H)         55         145           300          500
Total           220         390           390         1000
a) Calculate the probability that, an applicant picked at random from this group:
i) Made a low score on the examination.
ii) Graduated from a superior high school.
iii) Made a low score on the examination and graduated from a superior high school.
iv) Made a high score or graduated from a superior high school.
b) Calculate the following probabilities:
(i) P(A) (ii) P(H) (iii) P(M) (iv) P(A/H) (v) P(H/S).
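Probabilities of this kind are read off the table as ratios of cell counts, marginal totals and the grand total. A sketch, with the counts as read from the table above:

```python
# Reading probabilities off the two-way table (counts out of 1000).
table = {                      # rows: score; columns: quality of high school
    "L": {"P": 105, "A": 70,  "S": 25},
    "M": {"P": 60,  "A": 175, "S": 65},
    "H": {"P": 55,  "A": 145, "S": 300},
}
n = sum(sum(row.values()) for row in table.values())   # grand total

p_L = sum(table["L"].values()) / n                     # marginal: low score
p_S = sum(row["S"] for row in table.values()) / n      # marginal: superior school
p_L_and_S = table["L"]["S"] / n                        # joint probability
p_H_or_S = (sum(table["H"].values())                   # addition rule:
            + sum(row["S"] for row in table.values())  # P(H) + P(S) - P(H and S)
            - table["H"]["S"]) / n
p_A_given_H = table["H"]["A"] / sum(table["H"].values())  # conditional P(A/H)

print(p_L, p_S, p_L_and_S, p_H_or_S, p_A_given_H)
```

The conditional probability P(A/H) uses the row total for H as its denominator, not the grand total.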
Often in research work we are dealing with groups which are effectively infinite, such as the number of under-fives in a district. In sampling, part of a group (population) is chosen to provide information which can be generalized to the whole, even though in theory it would be possible to investigate the whole group. Sampling is adopted to reduce labour and hence costs.
Definition:
Sampling is the process of selecting a number of study units from a defined study population. If instead the whole population is studied, the process is referred to as taking a census. We can illustrate the process of sampling and the important activities involved with the following diagram:-
The diagram depicts drawing a sample of size n, using a particular sampling method, from a study population with N units (subjects). Inferential statistical techniques are then used to make inferences about the study population on the basis of results from the sample.
The steps:
1) Identifying the study population (note: it is possible to have different study populations in one study).
2) Drawing a sample from the study population.
3) Describing the sample (e.g. by calculating relevant statistics).
4) Making inferences about the parameters.
5) Drawing conclusions about the study population.
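The steps above can be sketched in code; the population below is simulated and all the numbers are made up purely for illustration:

```python
# A minimal sketch of the sampling steps: define a study population of N units,
# draw a simple random sample of size n, compute a sample statistic, and use it
# as an estimate of the population parameter. Values are hypothetical.
import random

random.seed(1)
population = [random.gauss(30, 5) for _ in range(10_000)]   # step 1: N units

n = 100
sample = random.sample(population, n)                       # step 2: draw sample

sample_mean = sum(sample) / n                               # step 3: a statistic
population_mean = sum(population) / len(population)         # the parameter

# Steps 4 and 5: the sample mean serves as an estimate of the (in practice
# unknown) population mean, from which conclusions are drawn.
print(round(sample_mean, 1), round(population_mean, 1))
```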
ii. The mean of the distribution of x̄ is the same as that of X (i.e. the mean of the sample means is the same as the mean µ of the parent population).
iii. The variance of x̄ is σ²/n, where σ² is the variance of X. It is easy to see that as the sample size n increases, the variance of x̄ decreases. From an earlier explanation, this observation is expected.
iv. The standard deviation of x̄ is the square root of its variance, and is often referred to as the standard error of the mean. That is, the standard error of the (sample) mean, usually written SE(x̄), is given by σ/√n.
Note: In practice, the value of σ² will be unknown. It can be replaced by the sample value, s², and the expression for the standard error SE(x̄) applies accordingly.
The fact that x̄ tends to follow a normal distribution is remarkable, since it implies that the properties of normal distributions apply to the distribution of the sample mean. In particular, we now know that x̄ follows a normal distribution with parameters µ and σ²/n as the mean and variance, respectively.
Hence it follows, for example, that 95% of the sample means lie within the interval µ ± 1.96×SE(x̄). This implies that there is a 95% chance of getting a sample mean within the interval µ ± 1.96×SE(x̄). Equivalently, we are saying that the probability of having a sample mean in the interval µ ± 1.96×SE(x̄) is 0.95.
Note: The limits of the interval µ ± 1.96×SE(x̄) are µ - 1.96×SE(x̄) and µ + 1.96×SE(x̄). That is, alternatively, we are talking of the interval ranging from µ - 1.96×SE(x̄) to µ + 1.96×SE(x̄).
We can express the above statements mathematically as follows:-
Pr{µ - 1.96×SE(x̄) < x̄ < µ + 1.96×SE(x̄)} = 0.95, where Pr{A} means "probability of event A".
Re-arranging the left-hand side of the above equation, we obtain the following equivalent equation:
Pr{x̄ - 1.96×SE(x̄) < µ < x̄ + 1.96×SE(x̄)} = 0.95.
In words, this says that the probability that the interval x̄ - 1.96×SE(x̄) to x̄ + 1.96×SE(x̄) includes the population value µ is 0.95.
When the values of x̄ and SE(x̄) are known, the interval x̄ - 1.96×SE(x̄) to x̄ + 1.96×SE(x̄), often also written (x̄ - 1.96×SE(x̄), x̄ + 1.96×SE(x̄)), is called the 95% confidence interval for µ.
The logic of this is that, for known values of x̄ and SE(x̄), the interval (x̄ - 1.96×SE(x̄), x̄ + 1.96×SE(x̄)) is known and fixed. Hence it no longer makes sense to talk of the interval including µ with probability 0.95, since the probability is definitely either 1 or 0; that is, the interval either includes or does not include µ.
Wider intervals, and therefore higher "confidence", can be set if required. For example, the value 2.58 can be used in place of 1.96 to set 99% confidence intervals. Indeed, an appropriate standardized normal deviate, z, can be used to obtain any desired confidence interval.
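As a worked sketch, a 95% confidence interval computed from a small hypothetical sample, with s² substituted for the unknown σ²:

```python
# Sketch: a 95% confidence interval for the mean, x̄ ± 1.96 × SE(x̄),
# with the sample variance s² substituted for the unknown σ².
# The data values are hypothetical.
import math

x = [12.1, 11.4, 13.2, 12.8, 11.9, 12.5, 13.0, 12.2, 11.7, 12.6]
n = len(x)
mean = sum(x) / n
s2 = sum((v - mean) ** 2 for v in x) / (n - 1)   # sample variance s²
se = math.sqrt(s2 / n)                           # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")

# For a 99% interval, replace 1.96 with 2.58 (a wider interval).
```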
Chapter 6 dealt with the estimation of population parameters by sample statistics. These sample statistics may further be used to answer questions about the population parameters. In the framework of statistical inference the question is reduced to a hypothesis, and the answer to it is expressed as the result of a test of the hypothesis.
Definition of terms
1. Statistical hypothesis: This is a statement about the parameter(s) or distributional
form of the population(s) being sampled.
2. Null hypothesis, H0: This term relates to the particular hypothesis under test. In many instances it is formulated for the sole purpose of being rejected or nullified. It is often a hypothesis of 'no difference'.
3. Alternative hypothesis, H1: This is a statistical hypothesis that disagrees with the null hypothesis.
The null hypothesis H0 and the alternative hypothesis H1 concern populations, but our conclusions are based on samples taken from these populations. Generalization from sample to population is dangerous since sampling errors are involved. Therefore we are unable to say that H0 or H1 is definitely true, because of this sampling effect.
If sampling errors are taken into account, it can be investigated how likely each of these hypotheses is. We have to measure the relevant information in the sampled data and weigh this information in relation to the sampling errors involved.
4. A statistic: is a value which depends on the outcomes of a variable for the sampled elements.
5. A test statistic: is a statistic which represents the relevant sample information for the question under investigation. It provides a basis for testing a statistical hypothesis and has a known sampling distribution with tabulated percentage points (e.g. standard normal, χ², t, etc.). The value of a test statistic differs from sample to sample.
6. Significance level: This is the probability of rejecting H0 when it is true. It is often
expressed as a percentage, i.e. the probability α is multiplied by 100. Often the 5% and 1%
levels (i.e. α=0.05, 0.01 respectively) are chosen as important, but the selection is fairly
arbitrary.
7. Critical value: This is the value of the test statistic corresponding to a given significance level, as determined from the sampling distribution of the test statistic (by using statistical tables).
Statistical significance and practical significance
There are many situations in which a result may reveal a statistically significant difference which might be quite unimportant clinically. For example, in a study to compare blood pressure in the left and right arms, a small difference of about 1 mmHg was found. This difference was highly statistically significant but of no importance clinically. Similarly, it is not reasonable to take a non-significant result as indicating no effect, just because we cannot rule out the null hypothesis.
ONE SAMPLE SIGNIFICANCE TEST FOR A MEAN (standard deviation, σ, known)
Problem: Is it reasonable to conclude that a sample of n observations, with mean x̄, could have come from a population with mean µ and standard deviation σ?
Null hypothesis: The difference between µ and x̄ is merely due to sampling error.
Calculate SND = (x̄ - µ)/(σ/√n) and consider the numerical (absolute) value of SND.
If |SND| < 1.96 we have no strong evidence against the null hypothesis and cannot convincingly show that it is wrong, i.e. p > 0.05.
If |SND| > 1.96 we have evidence that the null hypothesis is false: it is unlikely that the difference between x̄ and µ is due to sampling error only, i.e. p < 0.05. If |SND| > 2.58 we have strong evidence against the null hypothesis, p < 0.01. If 1.96 < |SND| < 2.58, we write 0.01 < p < 0.05.
Example:
A large number of patients with cancer at a particular site, and of particular clinical stage, are found
to have a mean survival time from diagnosis of 38.3 months with a standard deviation of 43.3 months.
100 patients are treated by a new technique and their mean survival time is 46.9 months. Is this apparent
increase in mean survival time associated with the new technique?
Solution:
Null hypothesis: There is no increase in mean survival time in the patients treated with the new
technique.
We have the standard normal deviate as
SND = (46.9 - 38.3)/(43.3/√100) = 8.6/4.33 ≈ 2.0
This value just exceeds the 5% value of 1.96, and the difference is therefore significant, i.e. p < 0.05.
Thus we conclude that it is likely that there is an increase in the mean survival time among patients treated by the new technique.
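The calculation in this example can be reproduced directly:

```python
# The worked example as code: SND = (x̄ - µ) / (σ/√n).
import math

mu, sigma = 38.3, 43.3       # population mean and sd of survival (months)
xbar, n = 46.9, 100          # sample mean and size under the new technique

snd = (xbar - mu) / (sigma / math.sqrt(n))
print(round(snd, 2))         # about 1.99, just past the 5% value of 1.96
```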
The χ² test (χ is the Greek letter chi, pronounced "kye") is used to determine whether a set of frequencies follows a particular distribution (e.g. Binomial, Normal, Poisson, etc.). In its basic form it tests whether the observed frequencies of individuals with some characteristic are significantly different from those expected on some hypothesis.
THE 2X2 TABLE
Consider our previous example, which arises from the comparison of two proportions. The results of the clinical trial, in which the proportions of patients dying who received either treatment A or treatment B were compared, are presented in the following table:
Outcome
Died Survived Total
Treatment A 41 216 257
Treatment B 64 180 244
Total 105 396 501
Such a table is called a 2x2 contingency table since there are 2 rows and 2 columns. (In general we
can have an "rxc" contingency table, i.e. a table with r rows and c columns).
From the above table, the observed frequencies are 41, 216, 64 and 180. We need to obtain the expected frequencies under the null hypothesis that "the two treatments have the same effect on the outcome".
The expected frequencies are calculated in the following way:-
Expected frequency, E = (row total × column total)/grand total
For example, in the top left cell, where we observed 41 deaths, the expected frequency under the null hypothesis is
(105 × 257)/501 = 53.86
These expected frequencies are shown in the table below. They add up to the same grand total as the
observed frequencies.
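The expected frequencies, and the χ² statistic built from them, can be computed directly. The statistic used below is the usual sum of (O - E)²/E over the cells, which is the standard form (it is not stated explicitly in the excerpt above):

```python
# Expected frequencies for the 2x2 treatment table,
# E = row total x column total / grand total,
# and the usual chi-squared statistic: sum of (O - E)^2 / E over all cells.
observed = [[41, 216],       # treatment A: died, survived
            [64, 180]]       # treatment B: died, survived

row_totals = [sum(row) for row in observed]          # 257, 244
col_totals = [sum(col) for col in zip(*observed)]    # 105, 396
grand = sum(row_totals)                              # 501

expected = [[r * c / grand for c in col_totals] for r in row_totals]
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

print(round(expected[0][0], 2))   # 53.86, matching the worked calculation
print(round(chi2, 2))
```

Note that the expected frequencies add up to the same row, column and grand totals as the observed frequencies, which is a useful arithmetic check.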
We can then compare the observed and the expected frequencies by looking at their differences. We also need to consider the importance of the magnitude of the differences (e.g. a difference of 5 between 995 and 1000 is not as important as a "discrepancy" of size 5 between 2
Fig.11.1 Scatter diagram of plasma volume and body weight
Examination of plasma volume and body weight suggests a tendency for plasma volume to increase with increasing body weight.
LINEAR REGRESSION
When a response variable appears to change with a change in values of the explanatory variable, we
may wish to summarize this relationship by a line drawn through the scatter of points.
Geometrically, any straight line drawn on a graph can be represented by the equation:
y = a + bx
Here y refers to the values of the response (dependent) variable and x to the values of the explanatory (independent) variable. The equation tells us how these variables, x and y, are related. The constant 'a' is the intercept, the point at which the line crosses the y-axis; that is, the value of y when x = 0.
The coefficient of x variable ('b') is the slope of the line. It tells us the average change (increase or
decrease) due to a unit change in x. It is sometimes called the regression coefficient.
Although we could draw a line through these points 'by eye', this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares. With this method we choose a and b such that the vertical distances of the points from the line are minimized; more precisely, we minimize the sum of squares of these vertical distances - hence the term 'least squares'.
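The least-squares estimates have the well-known closed form b = Σ(x - x̄)(y - ȳ)/Σ(x - x̄)² and a = ȳ - b x̄ (standard results, not derived above). A sketch with hypothetical data:

```python
# Least-squares estimates of the intercept a and slope b for y = a + bx,
# minimising the sum of squared vertical distances. Data are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²);  a = ȳ - b·x̄
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar

print(round(a, 2), round(b, 2))
```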
The (Pearson's) correlation coefficient has the following properties:-
1. It must lie between -1 and +1.
2. Positive values of r are obtained from upward-sloping lines (b > 0), i.e. y increasing with increasing x values. Negative values of r are obtained from downward-sloping lines (b < 0), i.e. y decreasing with increasing x values.
3. If |r| = 1, the relationship is perfectly linear, i.e. all points lie exactly on the regression line. For perfect positive correlation r = +1, and for perfect negative correlation r = -1.
4. If r lies between 0 and +1, or between 0 and -1, there is some scatter about the line. The less the scatter, the closer |r| is to 1.
5. If r = 0, there is no linear relationship between the explanatory and the response variable. This does not necessarily mean that there is no relationship at all; it suggests that any relationship which exists is NOT linear, for example a curved relationship between the independent and dependent variables.
LOGISTIC REGRESSION
Introduction
We have so far dealt with simple linear regression with a continuous dependent variable. We can extend the methods of simple linear regression to deal with more than one independent variable, in the form of multiple linear regression. That is, the multiple regression model yields an equation in which the dependent (outcome) variable is expressed as a combination of the independent (explanatory) variables. This takes the following form:
y = β0 + β1x1 + ... + βkxk, where
y is the dependent variable,
x1, x2, ..., xk are the k explanatory variables (sometimes called predictor variables or covariates), and
β0, β1, ..., βk are the regression coefficients.
As stated earlier on, these methods assume that the outcome variable of interest is numerical (and measured on a continuous scale), although the explanatory variables do not necessarily have to be continuous.
It is very common, however, in many kinds of medical research that the outcome variable of interest is a proportion (or a percentage) rather than a continuous measurement.
We cannot use ordinary multiple linear regression for the analysis of the individual and joint effects of a set of explanatory variables on an outcome variable which is in the form of a proportion. Two features of proportions based on counts (proportions based on measurements do not come in here) are important when considering a statistical analysis:
(a) if the denominator of the proportion is n and the population value is π, the variance of the proportion is π(1 - π)/n. For a given n this depends upon the value of π, being largest when π = 1/2 and smaller when π is in the neighbourhood of 0 or 1. Hence the usual assumption of constant variance σ² can no longer hold.
(b) when we relate a proportion variable to other quantities by some form of regression model, we need to take seriously the fact that the true proportion cannot go outside the range 0 to 1. Because of this, the parameters have a limited interpretation and range of validity. We can instead use a similar approach known as multiple linear logistic regression, or just logistic regression.
Transformed proportions
We can overcome some of the problems in (b) above by looking at the response proportion on a transformed scale which does not have the fixed boundaries at 0 and 1. Suppose p is the proportion of individuals with some characteristic of interest; or, equivalently, let p be the probability of a subject having a disease. Then 1 - p is the probability that the individual does not have the disease, and the odds of having the disease are p/(1 - p). As p changes from 0 to 1, the corresponding odds (i.e. the ratio p/(1 - p)) change from 0 to ∞. So this transformation removes one of the boundaries. To remove the other, we consider the odds on a logarithm (log) scale: the log odds go from -∞ to +∞ as p goes from 0 to 1. If we use natural logs (i.e. logarithms to base e), the transformation loge(p/(1 - p)) is called the logit of p.
That is, logit(p) = loge(p/(1 - p)), and this is the log odds. The estimated value of p can be derived from logit(p), and always lies in the range 0 to 1.
If y = logit(p), then we have e^y = p/(1 - p) and p = e^y/(1 + e^y).
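The logit and its inverse can be written as a pair of small functions:

```python
# The logit transform and its inverse: logit(p) = ln(p/(1-p)) maps (0, 1)
# onto the whole real line, and p = e^y/(1 + e^y) maps it back.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(y):
    return math.exp(y) / (1 + math.exp(y))

print(logit(0.5))                          # odds of 1 give log odds 0
print(round(inv_logit(logit(0.9)), 6))     # round trip back to 0.9
```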
If we wish to compare the risks of having some disease between individuals who are exposed to some factor and those who are not exposed, we can do so using our model. We estimate y1 = logit(p1) for the group with the factor present, and y0 = logit(p0) for the group without the factor. Then we have
y1 - y0 = logit(p1) - logit(p0) = loge(p1/(1 - p1)) - loge(p0/(1 - p0)) = loge[p1(1 - p0)/(p0(1 - p1))],
which is the log of the odds ratio.
Regression with transformed proportions
Just as with ordinary regression, we can develop regression equations with transformed proportions as
the y-variate. When the logit transform is used, this procedure is called logistic regression. The
mathematical calculations involved are generally heavy, but this is taken care of by several computer
packages such as GLIM (Generalised Linear Interactive Modelling), SAS (Statistical Analysis
System), SPSS (Statistical Package for the Social Sciences), etc., which are available on a wide range
of computers, from mainframes to micros. Other computer packages that can handle logistic regression analysis include Egret and a less familiar one known as Logxact, which is particularly useful with small samples as it employs exact (as opposed to asymptotic) methods. With the exception of SAS, these packages are fully available on some PCs in the Department of Epidemiology and Biostatistics, although the GLIM version that we have has only limited features.
Simple logistic regression
The simple logistic regression model takes the form:
logit(p) = β0 + β1x1, where the β's are the regression coefficients and x1 is a covariate.
Suppose we treat batches of about 50 mosquitoes with a series of concentrations of an insecticide,
record the number of mosquitoes killed, and obtain the following results:
Table 10.2 Number of mosquitoes killed in a batch at each dose of insecticide used

Dose of insecticide   Number of mosquitoes killed   Number of mosquitoes in a batch
10.2                  44                            50
7.7                   42                            49
5.1                   24                            46
3.8                   16                            48
2.6                   6                             50
Plotting the proportion killed in each batch against the dose of insecticide (a log scale for the dose or concentration is usually appropriate) is a recommended starting point. A simple linear regression model will not fit the data very well, and it would lead us to expect proportions which are negative for very low doses or greater than 1 for high doses. Fitting a logistic regression model to these data, working with ln(dose), gives the model
loge(p/(1 - p)) = -4.887 + 3.104 ln(dose)
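As a rough check of the fitted equation, the predicted proportion killed at one of the observed doses can be computed and compared with the data:

```python
# Checking the fitted model logit(p) = -4.887 + 3.104 ln(dose) against one
# of the observed batches (dose 5.1, where 24 of 46 mosquitoes were killed).
import math

b0, b1 = -4.887, 3.104                # fitted coefficients from the text

dose = 5.1
y = b0 + b1 * math.log(dose)          # predicted log odds
p = math.exp(y) / (1 + math.exp(y))   # back-transform to a proportion

print(round(p, 3))                    # about 0.54, close to 24/46 ≈ 0.52
```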
Multiple logistic regression
The main difference between multiple logistic regression and ordinary multiple regression is that in the former we use a combination of the set of values of the covariates to predict a transformed dependent variable, rather than the dependent variable on its original scale. Hence the multiple logistic regression model is represented in a similar manner, as follows:
logit(p) = β0+β1x1+...+βk xk
An example in which multiple logistic regression can be used is provided by the data below, from an article by Norton, P.G. and Dunn, E.V. (1985), Br. Med. J., 291, 630-632. These relate hypertension to smoking, obesity and snoring among men aged 40 years or over. In such a case logistic regression can be used to see which of the factors smoking, obesity and snoring are predictive of hypertension.
Table 10.3 Hypertension in men aged 40+ in relation to smoking, obesity and snoring
Smoking Obesity Snoring No. of men No (%) with hypertension
Table 10.4 Logistic regression analysis of the hypertension data shown above in Table 10.3

Variable       Regression coefficient (b)   Standard error se(b)   z      p-value
Constant       -2.378                       0.380
Smoking (x1)   -0.068                       0.278                  0.24   0.810
Obesity (x2)    0.695                       0.285                  2.44   0.015
Snoring (x3)    0.872                       0.398                  2.19   0.028
The significance of each variable can be tested by treating z = b/se(b) as a standard normal deviate. We can see that the p-value for smoking is very large (0.81), and hence we can say that smoking has no association with hypertension. Obesity and snoring, in contrast, each have a significant association with hypertension (in both cases p < 0.05).
The analyses presented relate only to the main effects of obesity, smoking and snoring. We also need to consider the possible presence of any important interaction between any two of these factors; that is, we should investigate whether the effect of a factor depends on the level of another factor. In fact this was done, and no interaction term was found to be statistically significant at any interesting level.
Omission of smoking in the model produced only minimal changes in the values of the other
coefficients. Hence the regression equation for this model is
logit(p) = -2.378 - 0.068x1 + 0.695x2 + 0.872x3, where
x1, x2, and x3 are codes for smoking, obesity, and snoring, respectively.
The above equation enables us to calculate the estimated probability of having hypertension, given
values of the three variables. In particular, we can obtain the odds ratio of hypertension associated
with any of the three factors. For example, let us consider variable x2, obesity:
putting x2 = 1 (for presence of obesity) gives:
logit(p1) = -2.378 - 0.068x1 + 0.695 + 0.872x3, and
putting x2 = 0 (for non-obese), gives:
logit(p0) = -2.378 - 0.068x1 + 0.872x3.
As discussed earlier, the difference logit(p1) - logit(p0) = 0.695 is the log odds ratio. Hence the odds ratio for hypertension associated with obesity = e^0.695 = 2.00. In general, for any binary variable the odds ratio (OR) can be estimated directly from the regression coefficient b as OR = e^b. Confidence limits follow immediately from the standard error of b, on taking b to have an approximate Normal distribution.
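The odds ratio and its approximate 95% confidence limits can be computed directly from b and se(b); the form of the limits, e^(b ± 1.96 se(b)), follows from treating b as approximately Normal:

```python
# Odds ratio and approximate 95% confidence limits from a logistic regression
# coefficient: OR = e^b, limits e^(b ± 1.96 se(b)). Using the obesity figures.
import math

b, se = 0.695, 0.285                  # coefficient and standard error

odds_ratio = math.exp(b)
lower = math.exp(b - 1.96 * se)
upper = math.exp(b + 1.96 * se)

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Since the interval excludes 1, this agrees with the significant p-value (0.015) for obesity in Table 10.4.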
The quality of data depends on many factors, one of which is the source of the data. Sources of data have a direct implication for quality in terms of coverage, completeness and cost.
In this chapter we will concentrate on the following sources of demographic data:
(a) Census
(b) Vital registration
(c) Sample surveys
Census
A census is a systematic, routine way of counting subjects within a defined boundary or area. A census produces reports on individuals and on population size and structure at a point in time.
Originally, censuses were limited to people only, but more recently we find censuses of agriculture, business, livestock, housing, etc., sometimes done concurrently with the population census.
The main characteristic of a census is that it covers the whole population. No sampling is involved, and each person should be enumerated separately. A census must have a legal basis to make it complete and compulsory. It reflects a single point in time, although the whole process can take longer.
Basic questions which should appear on the questionnaire are name, age, sex, relationship to the head of household, marital status, race/religion/ethnicity, education, occupation, employment status, migration and amenities. Additional questions depend on the availability and quality of vital registration.
A population census can be carried out using either of the methods mentioned below:
1. De facto method:
This method assigns persons to the area or location where they are found during enumeration: the population "in fact" there. Where a person normally lives does not count here. For example, in the 1988 Tanzania Population Census, Zanzibar had a population of 641,000; this means that these people spent the census night in Zanzibar. Tanzania follows this method of enumeration.
2. De jure method:
The de jure method of enumeration allocates persons to their normal place of residence, meaning "people who belong to the area or have the right to live there through citizenship, legal residence or whatever". For example, a businessman working in Dar es Salaam but living in Arusha would be assigned to Arusha under a de jure enumeration.
In Tanzania a census is normally conducted every ten years (decennially). This is a set-back for planning, in the sense that the population changes rapidly between censuses because of births, deaths and movements. To overcome this problem, inter-censal surveys or mini-surveys are normally conducted. An example of such a survey is the 1991 Tanzania Demographic and Health Survey (TDHS). Further surveys on morbidity and on specific diseases can be conducted whenever a need arises.
Vital registration
A vital registration system is very common in developed countries, where information on births, marriages, deaths and migrations is collected. In developing countries the system, where it is employed at all, is prone to incompleteness; otherwise it is non-existent.
Questions in a vital registration system are always very simple and few. Consider hospital or health service data here in Tanzania: examples of such registrations are information on deaths found in hospitals (death certificates), birth and marriage data found in churches, mosques and Area Commissioner's offices, and migration data found at airports and borders.
The short-fall of vital registration systems is that they are normally incomplete, selective, diverse and in practice unreliable. This does not mean that the system should be discarded; instead it should be improved to remove these errors.
Sample surveys
Sample surveys give the same information, in more detailed form, where a vital registration system does not exist. Only a sample of the population is involved; sample surveys are thus less costly than a census.
The other advantages of surveys include the pace of data collection: they are relatively quicker and more detailed than other systems such as the census. The cost of surveys is the error introduced through sampling.
COMMON RATES IN PUBLIC HEALTH
(a) Measures of fertility:
There are four common measures of fertility. These are crude birth rate, general fertility rate,
gross reproductive rate and the total fertility rate.
i. Crude birth rate:
It is called a 'rate' but in practice it is a ratio, defined as:
number of livebirths in a year x 1000
total mid-year population
The rate is 'crude' because it does not take into account who in the population is actually at risk of giving birth.
ii. General fertility rate:
The modern, conventional and much more acceptable 'rate' is the general fertility rate, also known simply as the 'fertility rate'. The denominator is restricted to women at risk of child-bearing rather than the general population. It is thus defined by:
number of livebirths in a year x 1000
mid-year population of women aged 15-49
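A short sketch of both 'rates' follows. The numbers of livebirths and of women aged 15-49 are the totals from Table 11.1 below; the total mid-year population is an invented figure used purely for illustration.

```python
# Totals from Table 11.1; the total population is an assumed figure
livebirths = 424000           # livebirths in a year
total_population = 12000000   # assumed total mid-year population
women_15_49 = 2741000         # mid-year women aged 15-49

cbr = livebirths / total_population * 1000  # crude birth rate
gfr = livebirths / women_15_49 * 1000       # general fertility rate

print(round(cbr, 1), round(gfr, 1))
```

The same numerator gives quite different 'rates' depending on the denominator, which is why the GFR is preferred: it relates births only to the women actually at risk.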
iii. The total fertility rate:
The total fertility rate is the average number of children a woman would have during her reproductive lifetime, given that the current age-specific fertility rates remained applicable throughout.
The total fertility rate is calculated from age-specific fertility rates (ASFRs). We get the ASFRs by dividing the number of livebirths by the number of women in each age interval. The following example shows the steps required to calculate the total fertility rate (TFR).
Table 11.1: Number of livebirths and maternal age, Tanzania, 1988.
Age        Number of women    Number of livebirths    Age-specific fertility rate
15-19           665000               21000                      0.0316
20-24           516000              114000                      0.2209
25-29           459000              118000                      0.2571
30-34           344000              123000                      0.3576
35-39           310000               37000                      0.1194
40-44           229000                6000                      0.0262
45-49           218000                5000                      0.0229
Total          2741000              424000                      1.0357
The total fertility rate (TFR) equals the sum of all age-specific fertility rates multiplied by the width of the age interval. In this case, TFR = 1.0357 x 5 = 5.1785.
The sum of the ASFRs is multiplied by 5 because the age groups are 5 years wide. If ages are in single years, there is no need to multiply the sum by 5.
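The steps above can be reproduced directly from the Table 11.1 figures. (The result differs from 5.1785 only in the fourth decimal place, because the table's ASFRs are rounded.)

```python
# Data from Table 11.1 (Tanzania, 1988)
women  = [665000, 516000, 459000, 344000, 310000, 229000, 218000]
births = [21000, 114000, 118000, 123000, 37000, 6000, 5000]

# Age-specific fertility rates: livebirths divided by women
# in each 5-year age group
asfr = [b / w for b, w in zip(births, women)]

# TFR: sum of the ASFRs times the 5-year width of each group
tfr = 5 * sum(asfr)
print(round(tfr, 4))
```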
The figure 5.1785 means that on average each woman will have about 5 children during her reproductive period, given that these age-specific fertility rates still apply until she finishes her reproductive life.
Unlike the CBR and GFR, the TFR is unaffected by the age distribution of the population, although its calculation requires age-specific data.
iv. The gross reproductive rate:
The gross reproductive rate (GRR) is similar to the total fertility rate except that it considers female livebirths rather than all births. This implies that the ASFRs for the GRR are based on female births only.
The GRR is interpreted as the average number of daughters a woman would have if she survived to at least age 50 and experienced the given female ASFRs. A figure of 1.0 means that women exactly replace themselves, while a figure of 2.0 means that the population is doubling itself: each woman on average produces two daughters.
Like the TFR, the GRR is a hypothetical measure. It is a period measure which does not take into account the effect of female mortality, either before age 15 or between ages 15 and 50.
Referring to Table 11.1 above, given the number of female livebirths the GRR is
computed as follows:
Age       Number of women    Number of livebirths    Female births    Female ASFR
15-19          665000               21000                11000           0.0165
20-24          516000              114000                58000           0.1124
25-29          459000              118000                60000           0.1307
30-34          344000              123000                63000           0.1831
35-39          310000               37000                19000           0.0613
40-44          229000                6000                 3000           0.0131
45-49          218000                5000                 3000           0.0138
Total         2741000              424000               217000           0.5309
Then GRR = 0.5309 x 5 = 2.6545
If the true sex ratio at birth is known, the GRR can be calculated using the TFR.
Thus, GRR = 5.1785 x 217/424 = 2.65
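Both routes to the GRR can be checked with the figures above: summing the female ASFRs directly, or scaling the TFR by the proportion of births that are female.

```python
# Data from the table above (Tanzania, 1988)
women         = [665000, 516000, 459000, 344000, 310000, 229000, 218000]
female_births = [11000, 58000, 60000, 63000, 19000, 3000, 3000]

# GRR from the female ASFRs (5-year age groups)
grr = 5 * sum(f / w for f, w in zip(female_births, women))

# Equivalent shortcut: scale the TFR by the female share of
# births (217000 female out of 424000 total)
tfr = 5.1785
grr_from_tfr = tfr * 217000 / 424000

print(round(grr, 4), round(grr_from_tfr, 2))
```

Both calculations give a GRR of about 2.65, i.e. each woman on average produces roughly 2.65 daughters under these rates.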
(b) Measures of morbidity:
i. Incidence rates:
Incidence measures the occurrence of new cases of a disease in a population, relative to the number of persons at risk of contracting the disease. The incidence rate is therefore the rate of contracting the disease among those still at risk. A distinction should be made between being at risk of contracting the disease at the beginning of a period and being at risk during the entire period: the former refers to the incidence risk and the latter to the incidence rate. The incidence rate is expressed as:
number of new cases of disease in a period of time x 10k
number of person-years of exposure in a period
where k = 2, 3, 4, 5 or 6 depending on the convenience or convention.
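A minimal sketch of the formula, with invented follow-up figures purely for illustration:

```python
# Hypothetical follow-up study; the figures below are invented
# purely to illustrate the incidence-rate formula.
new_cases = 24          # new cases of the disease during follow-up
person_years = 4800.0   # total person-years of exposure at risk

k = 3                   # report the rate per 10^3 person-years
incidence_rate = new_cases / person_years * 10 ** k
print(incidence_rate)   # rate per 1000 person-years
```

Here 24 cases over 4800 person-years gives a rate of 5 per 1000 person-years; the choice of k only changes the reporting scale, not the rate itself.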
LIFE TABLES
Standardized death rates, which have been discussed above, can be used to study the level of mortality of a population and to compare the mortality experience of two or more populations. The standardized death rate is, however, a single-figure index of the level of mortality: it contains no direct information about mortality at different ages. Life tables, on the other hand, summarize the mortality experience of a population at every age. They provide answers to questions like: suppose 100,000 babies are born in a population on the same day; how many will survive to celebrate their 1st, 2nd, etc. birthdays, assuming that babies die at the current rates of mortality? The use of current mortality rates for this calculation is, of course, unrealistic, since the babies would in fact die at the rates existing at the times when they die.
There are two distinct ways in which a life table may be constructed from mortality data:
In the current life table, the survival pattern of a group of individuals is described as if they were subject throughout life to the age-specific death rates currently observed in that particular community. This kind of life table is more often used for actuarial purposes and is less common in medical research.
The cohort life table, on the other hand, describes the actual survival experience of a group, or 'cohort', of individuals through time. The cohort may be babies born at the same time, an occupational group, patients following a particular treatment, etc. This type of life table has its most useful application in medical research in follow-up studies, e.g. an IUD retention study, or more generally in survivorship studies.
There are two types of life tables:
1. Full life table: Includes every single year of age from 0 to the highest age to which any person
survives.
2. Abridged life table: usually considers only 5-year age groups, except that the first five years of life may be considered singly.
THE FULL (COMPLETE) LIFE TABLE:
The number of imaginary births considered in the life table is called the radix. This is usually a power of ten, but its value is determined by convenience and accuracy. A life table comprises a set of six columns headed x, lx, dx, px, qx and e°x:
x - the age to which the numbers in the other columns relate.
lx - the number still surviving at exact age x.
dx - the number of deaths occurring between exact age x and exact age x+1, i.e. dx = lx - lx+1.
px - the probability of surviving from exact age x to exact age x+1: px = lx+1/lx.
qx - the probability of dying between exact age x and exact age x+1: qx = 1 - px = (lx - lx+1)/lx = dx/lx.
e°x - the expectation of life at age x, i.e. the average number of years still to be lived by persons who reach exact age x.
The following is an abridged life table for a certain country in a given year.
x         lx       10dx      10qx      10px      e°x
0       100000     2938     0.029     0.971     68.03
10       97062      847     0.009     0.991     59.94
20       96215     1489     0.015     0.985     50.42
30       94726     1867     0.020     0.980     41.13
40       92859     4386     0.047     0.953     31.86
50       88473    11017     0.124     0.876     23.15
60       77456    22512     0.291     0.709     15.70
70       54944    30275     0.551     0.449     10.20
80       24669    20869     0.846     0.154      6.57
90        3800     3720     0.979     0.021      5.21
100         80       80     1.000     0.000      5.00
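Given only the lx column, the remaining columns follow from the definitions above. The expectation of life at birth is sketched here with a rough trapezium-rule approximation of person-years lived in each 10-year interval (it ignores the few years lived beyond age 100, so it only approximates the tabulated e°x).

```python
# lx column of the abridged life table above (10-year age intervals)
lx = [100000, 97062, 96215, 94726, 92859, 88473,
      77456, 54944, 24669, 3800, 80]

# Deaths in each interval: dx = lx - lx+1 (everyone still alive in
# the last group is assumed to die within it)
dx = [a - b for a, b in zip(lx, lx[1:])] + [lx[-1]]

# Probability of dying / surviving within each interval
qx = [d / l for d, l in zip(dx, lx)]
px = [1 - q for q in qx]

# Rough expectation of life at birth: person-years lived in each
# interval approximated by the trapezium rule over the 10-year width
person_years = sum(10 * (a + b) / 2 for a, b in zip(lx, lx[1:]))
e0 = person_years / lx[0]
print(round(e0, 2))
```

The computed dx and qx reproduce the tabulated columns, and e0 comes out close to the tabulated 68.03 years.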
THE POPULATION PYRAMID
Both the age and the sex composition of a population can be represented by a special type of bar graph called a population pyramid. A population pyramid provides a graphic statement of the age and sex distribution of a population for a given year. It also shows the history of the population, including the effects of war, waves of in- or out-migration, fluctuations in fertility and mortality, etc.
The pyramid is a two-way histogram with the X and Y axes reversed, so that frequencies are represented by the horizontal axis and class intervals by the vertical axis. Thus the population pyramid consists of two bar graphs (or histograms) placed on their sides and back to back (see Figure 11.1). The length of each bar represents either the total number or the percentage in each age or age group (it is conventional to use either single-year or five-year age groups, though other groupings are possible). Pyramids are drawn with the male population on the left-hand side and the female population on the right. The young are usually at the bottom and the old at the top. The last, open-ended age group is normally omitted entirely from the pyramid because it is impossible to draw truthfully.
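The layout described above can be sketched as a simple character-based pyramid. The age-sex counts below are invented purely for illustration.

```python
# Hypothetical age-sex structure (thousands); the figures are
# invented purely to illustrate the pyramid layout.
groups  = ["0-14", "15-29", "30-44", "45-59", "60-74"]
males   = [900, 700, 450, 250, 100]
females = [880, 720, 470, 280, 130]

scale = 50  # one '#' represents 50 thousand people
lines = []

# Oldest group at the top, males on the left, females on the right
for g, m, f in zip(reversed(groups), reversed(males), reversed(females)):
    left  = ("#" * (m // scale)).rjust(20)
    right = "#" * (f // scale)
    lines.append(f"{left} |{g:>6}| {right}")

print("\n".join(lines))
```

Because each cohort is smaller than the one below it, the bars shorten towards the top, producing the characteristic pyramid shape.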
Since cohorts normally lose part of their number each year through death or emigration, each bar is usually shorter than the one below it, which gives the impression of a pyramid. A vertical comparison
REFERENCES
1. Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research, 3rd Edition. Oxford: Blackwell Scientific Publications. (Older editions are just as good for most topics.)
2. Brownlee, A., Pathmanathan, I., Varkevisser, C. (1991). Health Systems Research Training
Series, Volume 2 (Part 1): Designing and Conducting Health Systems Research Projects.
Canada: IDRC.
3. Healy, M.J.R., Hills, M. and Osborn, J. (1987). Manual of Medical Statistics. Volume II.
London: London School of Hygiene and Tropical Medicine.
4. Hill, A. Bradford (1984). A Short Textbook of Medical Statistics, 11th Edition. London:
Hodder and Stoughton.
5. Kirkwood, B.R. (1988). Essentials of Medical Statistics, 1st Edition. London: Blackwell
Scientific Publications.
6. Newell, C. (1988). Methods and Models in Demography. Blackwell Scientific Publications.
7. Osborn, J. (1988). Manual of Medical Statistics. Volume I. London: London School of
Hygiene and Tropical Medicine.
8. Petrie, Aviva (1990). Lecture Notes on Medical Statistics, 2nd Edition. Oxford: Blackwell