Biostat Manual

7/17/2019 Biostat Manual

http://slidepdf.com/reader/full/biostat-manual 1/97

Lecture notes

in

Biostatistics

Prepared, edited and compiled by

Kazaura, M. R.Makwaya, C. K.Masanja, C. M.

Mpembeni, R.C.

Muhimbili University College of Health Sciences

Institute of Public Health

Department of Epidemiology and Biostatistics

Dar es Salaam, 1997



2

ABOUT THE AUTHORS

Kazaura, M. R. is currently an Assistant Lecturer in the Department of Epidemiology and

Biostatistics. He is a Statistician - Demographer with research interests in population studies,specifically related to youths and family planning.

Makwaya, C. K. is a Lecturer in Biostatistics in the Department of Epidemiology and Biostatistics of

the Muhimbili University College of Health Sciences. His main interest include analysis of binary

data and mathematical modelling of infectious diseases, particularly STD’s.

Masanja, C. M. is an Assistant Lecturer in Biostatistics in the Department of Epidemiology and

Biostatistics. She has a Masters degree in Medical Statistics. Her research interests are in adult

morbidity and mortality.

Mpembeni, R. C. is an Assistant Lecturer in Biostatistics in the Department of Epidemiology and

Biostatistics. She has a Master’s degree in Medical Statistics. Her main research interest is in

reproductive health.



3

CONTENTS

Pages

Chapter 1 Introduction

Background

Definition of Biostatistics

Need for Biostatistics

Application of Biostatistical Methods

Chapter 2 Descriptive Statistics

Introduction

Descriptive Methods for Qualitative Data

Descriptive Methods for Quantitative Data

Chapter 3 Probability

Introduction

Probability Calculation (Addition and Multiplication Rules)

Chapter 4 The Normal Distribution

Introduction

Characteristics of the Normal distribution

Chapter 5 Introduction to Sampling Techniques

Introduction

Sampling Methods

Sample Size

Chapter 6 Estimation

Statistic and Parameter

The Standard Error of a Mean

The Standard Error of a Proportion

Chapter 7 Significance Tests: One Sample

Introduction

Concept of p-values

One Sample Significance Test for a Mean

One Sample Significance Test for a Proportion

Chapter 8 Significance Tests: Two Samples

Comparison of Two Means

Comparison of Two Proportions

Chapter 9 The Chi-Squared Test

The 2x2 Table

Larger Contingency Table (rxc)

Chapter 10 Association Between Quantitative Variables

Introduction

Scatter Diagram



4

Linear Regression

Correlation

Logistic Regression

Chapter 11 Vital Statistics and Demography

Sources of Demographic Information

Common Rates in Public Health

Standardization of Rates

Life Tables

Population Pyramids



5

FOREWORD

The study of statistics deals with the collection, processing and interpretation of data. The concepts ofstatistics are applied in many scientific fields that include agriculture, business, engineering and health.

When focus is on biological and health sciences, the term biostatistics is used. This manual of

biostatistics was written for students of the health sciences and serves as an introduction to the study of

biostatistics. The contents of the manual are based on the requirements for the biostatistics courses

offered at the Muhimbili University College of Health Sciences for both undergraduates and

postgraduates.

Textbooks on mathematical statistics usually include theoretical examples and exercises. The task of

finding relevant data is so enormous that even textbooks on applied statistics rarely include practical

examples and exercises. In particular, a course in biostatistics which is not introduced via numerous

examples of real data renders a restrictive view of the subject and hence tends to discourage the

uninitiated student. This manual is intended to provide substantial contact with a variety of statistical

methods and data sets so that the student can appreciate their application and the contexts in which they

are used. In the process the manual will facilitate learning of the student and provide handy notes and

references for further reading.

The authors have performed a valuable service in compiling the present manual. Many of the examples

and exercises given in this collection are based on health-related data, and the techniques which the

student is expected to apply cover a wide range of commonly used techniques. The manual will be of

great value both as the basis for a taught course and for private study.

ACKNOWLEDGEMENT

This work would have been impossible without the generous financial support of SIDA (SAREC) as

part of Research Capability Strengthening in the Department of Epidemiology/Biostatistics.

Japhet Z. J. Killewo

Associate Professor and Head of Department

Department of Epidemiology/Biostatistics



6

Chapter 1

INTRODUCTION

BACKGROUND

Biostatistics can be defined as the application of statistics to biological problems. To many

biomedical scientists, however, the term is considered to mean the application of statistics

specifically to medical problems. For this group of people, therefore, biostatistics and medical

statistics are synonymous. Indeed the kind of (bio)statistics taught in University Medical Schools is

medical statistics in which some applications which are specific for agricultural sciences, for

example, are not included.

Conversely, in Universities of Agriculture the term “BIOMETRY” is preferred to biostatistics.

Biometry (literally meaning “measurement of life”), refers to the application of statistical methods to

the analysis of biological data. In strict terms this should include analysis of data from (human)

medical sciences as well, but in practice less weight is attached to this.

Whether biometry or biostatistics (and in some places biomathematics is used) the word “statistics” is

implied. We attempt in the following section to define statistics by describing what it is.

DEFINITION OF BIOSTATISTICS:

We can define statistics in two forms:-

First “statistics” as a “noun” is a plural for the word “statistic” which simply means numerical

statements (i.e. information that is available in numbers). Examples of this include:-

(i) hospital data on the number of admissions for some condition in a defined time

period

(ii) How much drug (e.g. chloroquine tablets) is distributed to health units -

hospitals, health centres, dispensaries, etc.

Secondly “statistics” as a “discipline” is a field of study concerned in broad terms with:-

(i) Collecting, organizing and summarizing data in a systematic way.(ii) Drawing of inferences about a population on the basis of only a part of the

population targeted.

Note: Talking of the singular form of statistics here is as meaningless as it would be in putting

mathematic or physic as singulars for mathematics or physics, respectively.

The first part of the subject is usually referred to as Descriptive Statistics, while the second part,

which, provides objective means of drawing conclusions, constitutes Inferential Statistics.

In this course we are concerned mainly with the second sense of the meaning of statistics - that is, as

a discipline. Moreover, from the above background, the kind of BIOSTATISTICS here will be

specifically that of medical statistics.



7

NEED FOR BIOSTATISTICS

At first it may not be clear why statistics should be taught in medical schools. But the simple element

of variability in life is in evidence of the need for some standardized techniques to cope with the

inevitable biological variability.

In physical sciences, for example, we often deal with constants.

• The number of hydrogen atoms in any single molecule of water is always two.

• The velocity of electromagnetic waves in a given medium is always the same (e.g. the speed of

light in vacuum is always equal to 3x108

ms-1

.

• But in the biological and medical sciences, no constituent or characteristic of living organisms can be

defined by a single value which is identical for all individuals. Consider, for example, the following

general questions:-

(a) What is the normal blood pressure in man?

(b) What is the amount of haemoglobin in blood?

Or even some specific questions to medical specialists:

(c) "Mr. Physician, what are the limits of error in your blood pressure measurements?"

(d) "Mr. Radiologist, what is the probability that your colleague’s reports on these X-ray films would

agree with yours?"

(e) "Mr. Pathologist, what proportions of your diagnoses are correct at post mortem?"

Clearly, answers to the first two questions suggest variations and at the same time we need to quantify

answers to the last three questions in order to cope with the situation. That is, we need a numerical approach

- biostatistical methods.

We illustrate further the need for biostatistics in medical and health sciences generally with the following

(fictitious) example. A study to compare two treatments: new and standard, in which 400 patients (200 males

and 200 females) were recruited gave the following results:-

Table 1.1: Results of comparison between two treatments:

OUTCOME

TREATMENT Improved Did not Improve Total % Improved

Standard

New

80

100

120

100

200

200

40

50

Total 180 220 400

With these results one may be tempted to conclude that the new treatment is better than the old (standard).

But an analysis which looks at the results for male patients separately from the female patients revealed the

following:

Table 1.2: Results of comparison between two treatments among females:

FEMALES

OUTCOME


Standard

New

32

96

8

64

40

160

80

60



8

Total 128 72 200

From this table, we note that for female patients it is the standard treatment that is doing better. This is

exactly the opposite of what we saw in the overall assessment, and one might expect the new treatment to fair

better among the male patients.

If this holds the conclusion is going to be:-”to female patients give the old treatment while for male patients it

is better to give the new treatment”. In practical terms, the decision following this controversial conclusion

would be undesirable. However, when we look at the results relating to the male patients we see the

following:-

Table 1.3: Results of comparison between two treatments among males:

MALES

OUTCOME


StandardNew

484

11236

16040

3010

Total 52 148 200

Just as in female patients it is the standard treatment that produces a higher percent of improvement. You

should check and verify that all this hangs together; the overall rate of improvement for the standard

treatment, for example is (32+48)/(40+160) = 80/200 = 40% as shown above. With a proper statistical

method of analysis it becomes clear now that the difference in improvement between the two treatments when

sex has been taken into account is 20% in favour of the standard treatment. Such features are common in

medical surveys and are a typical aspect of observational studies. The situation would have been put under

control in an experimental study set up. These arguments emphasize the need for biostatistical methods not

only for data analysis but also for study designs .

APPLICATION OF BIOSTATISTICAL METHODS

Statistical methods have a role to play in:-

(a) Official health statistics (statements). (e.g. studying time trends of number of cases of a

disease)

(b) Epidemiology (e.g. association of diseases with some aetiological factors)

(c) Clinical studies (e.g. comparison of treatments in clinical trials)

(d) Human biology (e.g. growth pattern)

(e) Laboratory studies (e.g. dose-response investigations)

(f) Health service administration (e.g. with limited resources, there may be need to prioritisetarget groups for necessary interventions)



9

Chapter 2

DESCRIPTIVE STATISTICS

INTRODUCTION

Numerical information needs to be summarized before it can be used. The methods of summarizing data

(methods of descriptive statistics) vary with different types of data which are generated from different types

of variables. We first define what a variable is and then distinguish the different types of data. A variable is

an observation, characteristic or phenomenon that can take different values for different persons, times,

places. etc.

Examples of variables

VARIABLE POSSIBLE VALUES

Height (cm)

Weight (kg)

Parity

Outcome of disease

Marital status

Age (years)

Haemoglobin (g/dl)

Number of AIDS cases

158, 169.3 170, 200.6, etc.

10.2, 50, 69.4, 84, etc.

0, 1, 6, 8, 10, etc.

Recovery, Chronic illness, death

Single, Married, Widowed, Separated

1, 5, 30, 36, etc.

8.9, 14.2, 12.7, etc.

278, 301, 313, 350, etc.

Types of variables

There are two types of variables:

(1) Qualitative (categorical) variables

(2) Quantitative (numerical) variables.

1. Qualitative (categorical) variables:

These are variables which do not take numerical values.

Examples:Sex (male, female)

Outcome of disease (recovery, chronic illness, death)

Hair colour (black, blonde)

Marital status (single, married, widowed, separated) etc.



10

2. Quantitative (numerical) variables:

These are variables which take numerical values

Examples:Age (yrs) (10, 19, 45, 60), etc.

Height (in cms) 140, 50.6, 200 etc.

Parity 0, 3, 6, 10 etc

Haemoglobin (in g/dl) 16, 8.9, 12.7 etc.

Quantitative variables are of two types:

(i) Continuous variables:

These variables take any value within meaningful extremes.

Examples:

Height (in cm) 159.25cm, 160,35 cm etcWeight (in Kg) 71.127 Kg, 80.56kg.

Exact age like 21 yrs 6 months and 4 days

(ii) Discrete variables

These variables take only fixed values (in most cases whole numbers).

Examples:

Parity (0, 2, 6, 10 etc)

Age last birth day (5, 19, 45, 90 yrs, etc)

Counts (1, 4, 5, 9, etc)

Number of AIDS cases (100, 10000, 34278 cases, etc)

Levels of measurement

Variables are measured on different levels/scales (Note: The term “measurement” is used here in a broad

sense).

(a) Nominal measurement

These are used for identifying various categories that make up a given variable.

Example:

(1) Religion: 1= Muslim , 2 = Christian, 3 = Other

(2) Sex: 1=male, 2=FemaleNote that the numbering (codes) does not signify ranking.

The categories comprising a nominal variable can not occur together and are not related.

(b) Ordinal Measurement

These are used to reflect a rank order among categories comprising a variable.

Example: perceived level of pain

1=No pain, 2=moderate pain, 3=severe pain

Number used has no other meaning than indication of rank order.



11

Ordinal measurement enables one to make a qualitative comparison (such as more/less pain) but not a

quantitative comparison like how much more

(c) Interval measurement

Numbers used for this level of measurement are more meaningful than in the former levels.

Arithmetic operations (“+” and “-”) can be performed. The distance between any two consecutive

points is the same along the scale.

Examples:

i. Difference between 3 and 4 is the same as that between 8 and 9.

ii. Temperature measurement usingoC and

oF.

The zero point is arbitrarily defined and so multiplicative statements cannot be made i.e.

There is no TRUE ZERO. 0o

F does not signify an absence of heat.

(d) Ratio Measurement

This is the most sophisticated level of measurement. This level has all the characteristics of interval

measurement but it has an absolute zero point that represents an absence of the measured quantity.

Example, weight, length, height, age, etc.

Note: Measurement at ratio level can be converted to lower levels.

DESCRIPTIVE METHODS FOR QUALITATIVE DATA

Frequency and Relative Frequency Distribution.

A simple technique used to summarize data so as to show the important features is the formation of a

frequency distribution.

Definitions:

1. A frequency distribution is a table which shows the values taken by a variable and the frequency with

which each value has occurred.

2. Relative Frequency distribution: Is a frequency taken by a value relative to total frequency of a variable.

3. Cumulative relative frequency distribution. This is accumulated relative frequency distribution as thevalue of variable increases.

Use of tallies in making frequency distribution.

A frequency distribution is normally formed (manually) by a process known as tallying. This involves the

following steps:

1. To scan the data and determine the categories.

2. List the categories

3. Work through the data and allocate each observation the category where it belongs and use the tally marks

to keep a count of the number in each category.

4. Add the tally marks to give the frequency.



12

Example, the following data shows a qualitative variable "Result of sputum examination".

If: 1 stands for smear -ve, culture -ve.

2 stands for smear -ve, not cultured.

3 stands for smear or culture +ve.

1 2 1 1 3 1 1 3 3 2 1 3 1 1 2 3 1 1 3 1 2 3 1 1 3 1 1 3 1 3 2 3 2 1 1 3 1 1 2 1 1 2 3 1

1 1 2 1 2 2 3 1 1 2 1 3 1 1 1 1 1 2 1 3 1 1 3 1 1 1 2 1 1 1 3 2 3 3 3 1 1 1 2 1 1 1 2 1

1 2 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 3 3 3 1 3 1 3 2 1 1 3 2 1 1 3 1 3 1 1 1 2 1 1 2 1 1 1

3 1 2 1 1 1 3 3 1 1 1 1 2 2 2 1 3 1 1 1 1 1 1 1 3 1 1 2 1 1 3 1 1 2 1 1 3 1 1 3 1 2 3 1

1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 3 2 3 1 2 1 3 3 1 1 1 3 3 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1

2 1 2 1 1 1 1 1 1

From the above data:

Value Tally Frequency

Smear-ve, culture-ve ------ 144

Smear-ve, not cultured ------ 40

Smear or culture +ve ------ 45

Note: indicates 5 observations.

Table 2.1: A frequency distribution for the variable "Result of sputum examination".

VALUE Frequency Relative frequency Cumulative relative frequency

Smear -ve, culture -ve

Smear -ve, not cultured

Smear or culture +ve

144

40

45

62.9

17.5

19.6

62.9

80.4

100.0Total 229 100.0

Use of diagrams:

Frequency distributions can be illustrated visually by means of statistical diagrams. These diagrams serve two

main purposes:-

(i) Presentation of information/data (e.g. report) in articles for ease of appreciation

(ii) To serve as a private aid for further statistical analysis.

Two types of diagrams are commonly used to illustrate qualitative data. These are pie charts and bar charts.

1. Pie chart

Pie charts are used to express the distribution of individual observations into different categories (Note: The

frequencies should be converted into percentages totalling 100 for a pie chart to be used).



13

Example, below is a pie chart showing the distribution of first year students at Muhimbili University

College of Health Sciences (MUCHS) by course of study.

Fig 1:

D

istribution of First Year Students at MUCHS by Course of

Study.

2. Bar chart

These are the simplest and most effective means of illustrating qualitative data. The various categories of a

variable are represented on one axis and the frequency or relative frequency are represented on the other axis.

The length of each bar represents the number of observations (frequency) in each category or the relative

frequency in percentage. Example, cnsider the following birth control method mix in a certain population:

Abstinence 3%

Oral contraceptive 32%

Depo Provera 9%

Loop 17%

Spermicides 7%

Condoms 26%

Vasectomy 3%

Hysterectomy 2%

Norplant 1%

In this example, use of a pie chart for this variable would not be suitable because the diagram will be

BSc N12.0%

DDS14.0%

MD49.0%

BPharm25.0%



14

too congested. Hence a bar chart is more appropriate.

Fig 2. Distribution of the population using different control methods.

Two-way tables:

A statistical information on two variables can be presented simultaneously in a form of a two-way table. Thistable makes the information easier to assimilate by showing at a glance many of the properties of the data.

In a two way table data are presented in rows and columns. The format for a table depends upon the data and

the aspects of the data which are important to portray.

A two-way table should include the following:

1. A clear title.

2. A caption for the rows and columns with units of measurement of the variable.

3. Labels for each individual row or column. i.e The values taken by the variable concerned.

4. Marginal and grand totals.

Consider the following example:

In a study to investigate whether or not HIV1 infection is a risk factor to pulmonary tuberculosis (PTB), a

total of 2165 individuals were examined. Blood samples were also collected from these individual for

laboratory diagnosis of HIV1 infection.

The following results were obtained:

Of the 2165 individuals examined, 651 were found to be negative for HIV1 infection. Of those who were

negative, 57 were found to have PTB. 1526 of the HIV1 positive, 875 were found to have PTB.

Abstinance Oral Contrac. Depo Provera Loop Spermicides Condoms Vasectomy Hysteroctomy Norplant

Methods of Birth Control

0

5

10

15

20

25

30

35

Percentage



15

This information can be summarized in a two way table.



16

Table 2.2: PTB infection by HIV1 status

PTB STATUS

HIV STATUS Positive Negative Total

Positive

Negative

875 (57.0)

57 (8.9)

651 (43.0)

582 (91.0)

1526 (100.0)

639 (100.0)

Total 932 (34.0) 1233 (57.0) 2165 (100.0)

Numbers in brackets show the row percentages.

The cells of a two way table may contain percentages instead of the real counts. Calculation of percentages

may be row-wise or column-wise depending on the purpose of the table.

Example: In the above table our interest is to investigate whether HIV1 infection is a risk factor to PTB. So

our aim is to see whether PTB is higher in HIV1 positives than in HIV1 negatives. Hence, the row

percentages are more appropriate in this case.

Ratio, proportion and rate

These are terms which are used commonly in epidemiology and vital statistics so we should be able to

distinguish them.

Ratio: Any number (numerator) divided by any other number (denominator) gives a ratio.

Example: X/Y is a ratio

X is the numerator and,

Y is the denominator.It is not necessary that X and Y need not be counts or have the same units.

Example: Sex ratio at birth = Number of male births

Number of female births.

Proportion: A proportion is a special form of a ratio only that in a proportion the numerator is part of a

denominator.

Example: (i) Proportion of girls in the first year MUCHS= Number of girls in 1st Year

Total number of 1st year students

(ii) Proportion of male births= Number of male births

Total number of births.

A proportion is often expressed as a percentage.

Rate: A rate is a proportion with an extra dimension of time. i.e. one has to study the population for a

particular period, say, 1 year and count the number of times the particular event occurs. So a rate indicates the

frequency of events occurring in a population per unit of time.

Example: The death rate per year is given by the number of deaths during the year divided by the number of

person-years of exposure to the risk of death.



17

That is, crude death rate = Number of deaths in one year x 1000

Total population

Rates may be expressed per 1000, per 100,000 or per 1,000,000 population depending on convention and

convenience.

DESCRIPTIVE METHODS FOR QUANTITATIVE DATA

Frequency distributions are also used to summarize quantitative data. A frequency distribution for

quantitative data can be for ungrouped or grouped data.

Frequency distribution for ungrouped data:

For discrete variables, frequency may be tabulated for each value.

Table 2.3: The distribution of the counts of trypanosomes in the tail blood of a rat.

COUNT FREQUENCY RELATIVE FREQUENCY CUMULATIVE FREQUENCY

0

1

2

3

4

5

6

7

89

4

27

27

20

16

17

12

2

12

3.1

21.0

21.0

15.6

12.5

13.4

9.5

1.6

0.81.6

3.1

24.1

45.1

60.7

73.2

86.6

96.1

97.1

98.7100.0

Total 128 100.0

Frequency distribution for grouped data:

When dealing with a continuous variable or a discrete variable with a wide range of possible values, a

summary frequency table is produced by distributing the data into CLASSES or GROUPS and determine the

number of observations belonging to each class. Table 2.4 provides an example of frequency distribution for

grouped data.



18

Table 2.4: Frequency distribution of number of lesions caused by small pox virus in egg membranes.

NUMBER OF LESIONS FREQUENCY (NUMBER OF MEMBRANES)

0-

10-

20-

30-

40-

50-

60-

70-

80-

90-

100-

110-119

1

6

14

14

17

8

9

3

6

1

0

1

Total 80

Note: "-" means up to but not including the next tabulated value. Example, 10- means 10 is the lower limitwhile 19 is the upper limit. 14.5 is the mid point for the class interval 10- .

The following rules are used to make frequency distribution for a grouped data.

1. Determine the range, R, of values. (R=largest value -smallest value)

2. Decide on the number, I, of classes. This number depends on the form of data and the requirements of the

frequency distribution. But usually they should be between 5 and 20 for convenience.

3. Determine the width of the class interval, W, such that W=R/I. A constant width for all classes is

preferable.

4. Choose the upper and lower limits of the class interval careful to avoid ambiguities.

5. List the intervals in order. Use tallies to allocate each observation into the class in which it falls. Add the

tally marks to obtain class frequencies.

Use of diagrams in quantitative data:

A: Histograms:

A histogram is a familiar bar-type diagram. Values of a variable are represented on a horizontal scale and the

vertical scale represents the frequency or relative frequency at each value. Each bar centres at the mid point

of the class. Example, using data on Table 2.3,



19

Fig 3: Histogram representing the frequency distribution of counts of trypanosomes in the tail blood of a rat.

If the frequency distribution is made of class intervals which are not equal, it is necessary to calculate the

average frequency per standard interval.

Example:

Table 2.5: Frequency distribution of age at loss of last tooth

Age

Frequency Interval width Average No/year of age

11-15

16-19

20-24

25-29

30-34

35-44

45-54

55-74

1

7

21

35

40

58

28

10

5

4

5

5

5

10

10

20

0.20

1.75

4.20

7.00

8.00

5.80

2.80

0.50

0 1 2 3 4 5 6 7 8 9

Count

0

5

10

15

20

25

30Frequency



20

Fig.4 A histogram showing distribution of age at loss of last tooth

B: Line diagrams:

These are often used to express the change in some quantity over a period of time or to illustrate therelationship between continuous quantities. Each point on the graph represent represents a pair of values i.e. a

value on the x-axis and a corresponding value on the y-axis. The adjacent points are then connected by

straight lines.

1983 1984 1985 1986 1987 1988 1989 1990 1991 1992

Year

0

10

20

30

40Cumulative no. of cases (Thousands)

Fig. 5 A line diagram showing cumulative number of AIDS cases in Tanzania from 1983 to 1992.

C: Frequency polygons

Frequency polygons are a series of points (located at the mid-point of the interval) connected by straight

lines. The height of these points is equal to the frequency or relative frequency associated with the values of

the variable (or the interval). The end points are joined to the horizontal axis at the mid points of the groups

immediately below and above the lowest and highest non-zero frequencies respectively.

Frequency polygons are not as popular as histogram but are also a visual equivalent of a frequency distribution. They can

easily be superimposed and therefore superior to histograms for comparing sets of data.



21

0 1 2 3 4 5 6 7 8 9Counts

0

5

10

15

20

25

30Frequency

Fig.6 Frequency polygon for the number of trypanosomes in the tail blood of a rat.

D: Cumulative Frequency curve

This is similar to a frequency polygon but the vertical axis displays cumulative relative frequency and

the point is placed at the upper limit of the interval. Example,

0 1 2 3 4 5 6 7 8 9

Counts

0

20

40

60

80

100

120Frequency

Fig.7 Cumulative frequency curve for the number of trypanosomes in a tail blood of a rat.



22

Note: When making a statistical diagram:

- Axes should be clearly labelled and units of measurement indicated.

- Choice of scales should be made with care.

Measures of location/central tendency (mean, median, mode):

The measures of location give the overall magnitude of the values observed for each variable. The three

common measures of location are: arithmetic mean, median and the mode.

1. The arithmetic mean

This is the average or simply the mean. It is the sum of all observations divided by the number of

observations.

Example:, consider the heights of 10 men in cms: 165, 167, 169, 169, 171, 173, 175, 176, 176, 169

The mean height is calculated by adding the heights for the ten men and dividing the sum by 10.

Arithmetic mean

= 165+167+169+169+171+173+175+176+176+169

10

= 1710

10

= 171 cm

The arithmetic mean is denoted by x

Generally: x = ΣXi

n

where, ΣX = X1 + X2 + X3 + ... + Xn,

ΣXi = sum all the values of the variable X from i=1 to i=n.

n = number of observations.

The arithmetic mean can also be calculated from frequency distributions.

Refer data on Table 2.3. Multiply each value of the variable with its frequency. Add them up and divide by

the total frequency.

i.e. x = ΣXifi

Σfi

Where Xi stands for the value of the variable and

fi stands for frequency for value Xi.

Example: Mean count of trypanosomes in a tail blood of a rat is given by:

(0x4)+(1x27)+(2x27)+...+(7x2)+(8x1)+(9x2)

128

= 402 = 3.1 = 3



23

128

With the grouped data the class midpoint should be used when calculating the mean. Consider data on Table

2.4 The mean number of lesions caused by small pox virus in egg membranes is :

(5x1)+(15x6)+(25x14)+ ... +(95x1)+(105x0)+(115x1)

80

= 3670 = 45.8

80

The arithmetic mean is a preferred measure since it uses more information from each observation. However it

tends to be pulled by extreme values. Example, the following are duration of stay in hospital (in days) for

some condition.

5 5 5 7 10 20 102

The mean duration of stay x = 154 = 22, this does not reflect the mean duration of stay.

7

2. Median:

The median is the middle observations when all the observations are listed in increasing or decreasing order.

Example, below is a series of duration (in days) of absence from classes due to sickness.

1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80

The median is 5.

Generally, when n (number of observations) is odd the median is1 / 2 (n+1)

th observations. But when n is

even since there is no middle observation, the median is the mean of the two middle observations, i.e.1 / 2n

th

and (1 / 2n+1)

th observation.

In frequency distributions, the median can be obtained by accumulating the frequencies and noting the valueof the variable which divides the data into two equal halves i.e. An observation where

1 / 2n of the observation

lie.

Note:

1. The median is less efficient than the mean because it takes no account of the magnitude of most of the

observations.

2. If two groups of observations are pooled, the median of the combined group can not be expressed in terms

of the medians of the two component groups.

3. The median is much less amenable than the mean to mathematical treatments and so it is less used in more

elaborate statistical techniques.

However if the data are distributed asymmetrically, the median is more stable than the mean. Consider the

example on the duration of stay in hospital where the median is 7; this is more realistic than the calculated

mean of 22 days.

3. Mode:

The mode is the value with the highest frequency. i.e. The value which occurs most frequently. The modal

value (days) for the duration of stay in hospital, example given above, is 5.

Measures of variability:



24

These measures express the degree of variation or scatter of a series of observations. Common measures of

variation are range, variance and standard deviation.

1. The range:

This is the difference between the maximum value and the minimum value.

Example: if the lowest and highest of a series of diastolic blood pressure are 65 mm Hg and 95 mm Hg.

Then, the range = 95 mm Hg - 65 mm Hg = 30 mm Hg.

The range is seldom used in statistical analysis because:-

a) It wastes information since it uses information from only two extreme values.

b) The two extreme values are more likely to be faulty.

c) The range increases with increasing number of observations.

2. Variance and standard deviation:

The variance is a measure of variability which makes use of the differences from each observation and the

mean i.e. (Xi - x ). If all the differences are added together, and their mean calculated, it gives an indication

of the overall variability of the observations.

But Σ(Xi - x ) is always zero since some differences are positive while some are negative. Because of this, the

differences are squared.

The variance is the mean value of the squared deviations from the mean.

i.e. variance = Σ(Xi - x )2

n

and the numerator, Σ(Xi - x )2 is called, the sum of squares about the mean.

Since these differences are squared, the variance is measured in the square of the units in which the variable

X is measured. For example, if X is height in cm. The variance will be in cm2

.

A measure of variation which is measured in the original units of the variable is the standard deviation

which is the square root of the variance.

Standard deviation = √Σ(Xi - x )2

n

The standard deviation shows the average deviation of observations from the mean. And the interval x +

2SD covers roughly a 95% of all the observations.

The population variance is in most cases unknown because data are normally not available for the whole

population. When this is the case, the population variance is estimated by the sample variance, S2.

S2 = (Xi - x )

2

n - 1



25

Note a change in the denominator from n to n-1. When n-1 is used in the denominator, it gives a better

estimate of the population variance than when n is used.

Calculation of variance and standard deviation:

To calculate the variance and standard deviation for the following data:

Xi Xi - x (Xi -x )2

8 0 0

5 -3 9

4 -4 16

12 +4 16

15 7 49

5 -3 9

7 -1 1

56 100

ΣXi = 56

n = 7

x = 56 = 8

7

Σ(Xi - x )2 = 100

S2= 100 = 16.67

6

S =- √16.67 = 4.08

Variance and standard deviation can be calculated using the shortcut formula for Σ(Xi- x )2 (don't forget to

divide by n-1, later)

Σ(Xi - x )2 = ΣXi

2 - (ΣXi)

2

n

So using the same data above,

Xi Xi2

8 64

5 25 Σ(Xi - x )2 = 548 -(56)

2

4 16 7

12 144 = 548 - 3136

15 225 7

5 25 = 100

7 49 S2 = 100 = 16.67

ΣXi=56 ΣXi2=548 6

S = √16.67 = 4.08

Symmetric and skew distributions:



26

We defined the mode as the value of the variable which occurs most frequently. In other words it is the value

at which the frequency curve reaches a peak. When the frequency distribution has one peak (one mode), it is

called a unimodal distribution.

Table 2.6: Frequency distribution of number of males in sibships of eight children.

NUMBER OF MALES

FREQUENCY

(NUMBER OF SIBSHIPS)

0

1

2

3

4

5

6

78

161

1,152

3,951

7,603

10,262

8,498

4,948

1,655264

Total 38,495

The mode of this distribution is 4 and the distribution is Unimodal as seen in Fig. 8.

In some of the unimodal distributions, the frequency curve is "BELL SHAPED". i.e. the mode is somewhere

between the two extremes of the distribution. Such distributions are said to be symmetric.

In symmetric distributions the mean, mode and median coincide. Other Unimodal distributions are

asymmetric. Asymmetric distributions have the mode (peak) not at the centre of the distribution curve. An

asymmetric distribution is called skew distribution.

The distribution is positively skew if the upper tail is longer than the lower tail and is negatively skew if the

lower tail is longer than the upper tail.

Some distributions have more than one mode. If a distribution has two modes, it is called a bimodal

distribution. If the distribution is symmetric but bimodal the mean and the median are approximately the

same, But this common value can be somewhere between the two peaks.



27

0 1 2 3 4 5 6 7 8

NO.OF MALES

0

2

4

6

8

10

12NO.OF SIBSHIPS. (Thousands)

Fig.8: Distribution of the number of males in sibships of 8 children.

Normally a bimodal distribution indicates that within the population under study there are two distinct groups

which differ in the variable being measured.

Examples of variables that follow a Bimodal distribution are:

i. Body temperature of malaria patients

ii. Distribution of values of dilution levels of phenylthiourea solution to determine tasters and non tasters.



28

The table below shows data for 104 medical students who determined their taste threshold to phenylthiourea

(PTC):

Table 2.7: Phenylthiourea taste thresholds in a sample of 104 medical students

Solution number Concentration (mg/l) Number of students

1

2

3

4

5

6

7

8

9

1011

12

1.27

2.54

5.08

10.20

20.30

40.60

81.20

182.00

325.00

650.001300.00

>1300.00

11

16

23

12

3

0

3

5

8

108

5

If log concentration is plotted against frequency, a binomial distribution is observed.

0.1 0.4 0.7 1 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7

Log concentration

0

5

10

15

20

25No. who tasted

Fig.9 Taste threshold for PTC.

This shows that the students form two distinct groups one for "tasters" and the other for "non-tasters".



29

EXERCISE

1. The following table shows the numbers of viral infected patients not in hospital and in hospital

subdivided by sex and age.

NOT IN HOSPITAL IN HOSPITAL

Age (years) Males Females Males Females

0 - 14

15 - 29

30+

43

59

65

42

49

28

25

55

39

9

27

14

Obtain a two way summary table to show how the proportion (in percent) of patients who are in hospital

varies with : i. Age ii. Sex

2. The table below shows Accidental deaths by place and time.

YEAR

PLACE OF DEATH 1971 1976 1980 1982 1983

Transport

Work

Home

Other

8401

860

6917

3068

7306

712

6250

2831

6945

630

6009

2516

6407

457

5468

2781

6138

443

5514

2459

Construct:

a. A bar chart showing accidental deaths by place for each year shown.

b. A pie chart showing accidental deaths by place for 1983.

3. A sample of 11 patients admitted for diagnosis and evaluation to a newly opened psychiatric ward of

a general hospital experienced the following lengths of stay.

PATIENT NUMBER LENGTH OF STAY PATIENT NUMBER LENGTH OF STAY

1

2

3

4

5

6

29

14

11

24

14

14

7

8

9

10

11

14

28

14

18

22

Find:

a. The mean length of stay for these patients.

b. The variance

c. The mode.



30

4. The following are the fasting blood glucose levels of 100 children.

56 60 65 66 69 68 65 72 73 67 61 57 72 61 64 71 60 73 74 56

57 61 65 69 66 72 65 73 68 67 77 57 61 76 65 58 80 75 59 52

62 67 68 72 65 73 66 75 69 65 75 62 73 57 76 55 80 74 55 75

63 69 65 75 65 73 68 66 67 62 55 67 63 68 58 79 55 68 65 63

64 68 75 81 65 81 66 73 67 63 60 59 80 64 56 71 65 63 59 74.

Construct for these data:

i. A frequency distribution, the relative frequency distribution, cumulative relative frequency

distribution.

ii. A histogram, a frequency polygon and a cumulative relative frequency curve.

5. The following are the number of babies born during a year in 60 community hospitals.

30 37 32 39 52 55 55 26 56 57 45 43 28 58 46 27 52 40 59 43

56 54 53 49 54 48 42 54 53 31 45 32 29 30 22 49 59 42 53 31

32 35 42 21 24 57 46 54 34 24 47 24 53 28 57 56 57 59 50 29

From these data find:

(a) The mean, (b) The median, (c) The variance, (d) The standard deviation.

6. The following are the haemoglobin values (g/100ml) of 10 children receiving treatment for

haemolytic anaemia.

9.1 10.1 11.4 12.4 9.8 8.3 9.9 9.1 7.5 6.7

Compute:

The sample mean, median, variance and the standard deviation.



31

Chapter 3

PROBABILITY

INTRODUCTION

The theory of probability underlies the methods for drawing statistical inferences in medicine. The

knowledge of probability will therefore help you to set the groundwork for the development of statistical

inference.

Definition:

Probability of an event is defined to be the proportion of times the event occurs in a long series of random

trials.

Examples:

1. If an unbiased coin is tossed many times roughly 50% of the results will be heads. Thus when tossed once,

the probability of a Head (H) or a Tail (T) is ½.

2. If you are told that in a certain country 10% of the population are HIV positive. If a person is selected from

this population at random, it could be said that the probability that he/she is HIV positive is1 / 10 since this

event occurs on average to one person in 10.

3. A die has six sides numbered 1, 2, 3, 4, 5, 6. If an unbiased die is tossed once, the probability of any of the

sides showing is 1/6 i.e. P(1) = 1/6, P(2)= 1/6, P(3)= 1/6, P(4)= 1/6, P(5)= 1/6, P(6)= 1/6.

Note:

(1) Probabilities are proportions and so they take values between 0 and 1.

(2) A probability of 0 means that the event never occurs whereas the probability of 1 means the event

certainly occurs.

(3) The sum of probabilities of all possible outcomes is 1.

Example: in tossing a coin P(H) + P(T) = 1 and in tossing a die,

P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1.

PROBABILITY CALCULATIONS (ADDITION AND MULTIPLICATION RULES)

Mutually exclusive events and the addition law:

Two events that cannot happen together are defined as mutually exclusive events. That is, if events A and B

are mutually exclusive, only one of them (either A or B) will occur at only and one only particular trial.

Example: When a coin is tossed, the events Head(H) or Tail(T) are mutually exclusive. i.e. when you toss a

coin you either get a Head or a Tail and not both.



32

The Addition law:-

If A and B are possible events with known probabilities of occurrence then, P(A or B or both) = P(A) + P(B) -

P(A and B).

P(A and B) is the probability of the double event for non mutually exclusive events.

Consider a doctor's name being chosen haphazardly from the Tanzania medical register. If the probability that

this doctor is a male is 0.9 and the probability that the doctor qualified at Muhimbili Medical School is about

0.8,

what is the probability that the doctor is either a male or the doctor qualified at Muhimbili medical

school?

Let A be the event that the doctor is a male and B be the event that the doctor qualified at Muhimbili medical

school.

P(A or B) = P(A) + P(B) - P(A and B)

=0.9 + 0.8 -0.72

=0.98

Note that A and B are not mutually exclusive events because a doctor can be a male and qualified at

Muhimbili Medical School. In this case if the probability of the double event is not subtracted the probability

will exceed 1.

But if the two events are mutually exclusive, the probability of the double event is 0 and so the probability of

either A or B is given by the sum of the probabilities of two events.

That is, P(A or B) =P(A) + P(B)

Example: When a die is thrown, what is the probability of a 3 or 5 showing?

These are mutually exclusive because you cannot have 3 and 5 at the same time.

So P(3 or 5) = P(3) + P(5).

= 1/6 + 1/6

= 1/3

Multiplication rule:

If for example there are two random sequence of trials proceeding simultaneously. e.g. at each stage a coin

may be tossed and a die thrown.

How can we get the probability of a particular combination of results. e.g. P(H and 5)

we need to use the multiplication rule.

P(H and 5) = P(H) x P(5, given H) can be written as

P(H and 5) = P(H) x P(5/H).

The second term on the right side, P(5/H) is called Conditional Probability i.e. Probability of a 5 showing on

a die given that a Head appeared on the coin.



33

Take an example of playing cards. It is a pack of 52 cards which are 13 Spades, 13 Diamonds, 13 Hearts, 13

Flowers. If you draw two cards (one at a time) from a pack of cards, what is the probability that the 1st

and

2nd

cards will be Spades?

NOTE: P (spade) = 13/52

P (spade/spade on 1st draw) = 12/51

This is so because of the fact that you have already drawn 1 spade thus decreasing the number of spades and

the pack by 1. So P (spade on 1st

and 2nd

draw) = 13/52 x 12/51 = 0.0588.

Definition:

Independent events:

Two events are independent if the occurrence of one does not affect in anyway the occurrence of the other.Thus if A and B are independent events P(B/A) = P(B). When a coin is tossed, the outcome of the 1

st trial

does not affect the outcome of the 2nd

trial.

In independent trials, the multiplication rule assumes a simple form P(A and B) = P(A) P(B).

e.g. P(H and 5) = P(H) x P(5)

= 1/2 x 1/6 = 1/12.

EXERCISE

1. Define the following terms:

a) Probabilityb) Mutually exclusive events

c) Independent events

d) Conditional probability

2. The following table shows 1000 nursing school applicants classified according to scores made on a

college entrance examination and the quality of the high school from which they graduated, as rated

by a group of educators.

QUALITY OF HIGH SCHOOL

SCORE POOR (P) AVERAGE (A) SUPERIOR (S) TOTAL

Low (L)

Medium (M)High (H)

Total

105

7025

200

60

17565

300

55

145300

500

220

390390

1000

a) Calculate the probability that, an applicant picked at random from this group:

i) Made a low score on the examination.

ii) Graduated from a superior high school.

iii) Made a low score on the examination and graduated from a superior high school.

iv) Made a high score or graduated from a superior high school.

b) Calculate the following probabilities:

(i) P(A) (ii) P(H) (iii) P(M) (iv) P(A/H) (v) P(H/S).



34

Chapter 4

THE NORMAL DISTRIBUTION

INTRODUCTION

Breakdown of the total probabilities into the probabilities of each of the events is called probability

distribution. A variable, the different values of which follow a probability distribution, is known as a

random variable.

In genetical experiment, an example of a probability distribution can be obtained when we may cross two

heterozygotes with genotypes Aa. The progeny will be homozygotes (aa or AA) a heterozygotes (Aa) with the

probabilities shown below:

No. of A genes

Genotype in the genotype Probability

aa 0 ¼

Aa 1 ½

AA 2 ¼

Total 1

Probability distribution can be presented on the drawing as:

0 0.5 1 1.5 2 2.5

NO.OF A GENES

0

0.1

0.2

0.3

0.4

0.5

0.6PROBABILITY

Fig.10: Probability distribution for a random variable. The number of A genes in the genotype of progeny of

an Aa x Aa cross.



35

However, for continuous random variables, the probabilities of particular values of the variable are negligible

(sometimes zero). So to obtain the probability distribution of a continuous random variable, the concept of

probability should be confined to a specified interval on the continuous scale.

For example: when the probability that a man to be selected at random is exactly 70.2876 inches in height is

presumably zero; the probability that the man selected at random his height is between 70 and 72 inches is

0.12.

Continuous probability distribution.

Different random variables have different probability distributions, but the one which we will discuss here is

the Normal distribution.

The normal distribution:

The Normal (or Gaussian) distribution, is the most important continuous probability distribution.

Characteristics of the normal distribution

1. It is a distribution of a Continuous Random variable.

2. It is bell-shaped.

3. It is symmetrical about its mean.

4. It is determined by two Quantities, its mean µ, and its standard deviation, σ. Changing µ shifts the

whole curve to the left or right, increasing σ makes the curve flatter and more spread out.

5. The probability between the limits:

µ-σ and µ+σ is 0.68

µ-1.96σ and µ+1.96σ is 0.95µ-2.58σ and µ+2.58σ is 0.99

Normally, the probability distribution of the variables we observe are unknown. But if the smooth curve

depicting the probability distribution is bell shaped and reasonably symmetrical about the mean, use can be

made of the normal distribution.

The normal distribution as we have seen above, is determined by its mean and its standard deviation. These

quantities are different for different problems and so it is not possible to make tables of the Normal

distribution for all the values of µ and σ. So calculations are made by referring to the Standard Normal

distribution which has µ=0 and σ=1.

Thus an observation X from the normal distribution with mean µ and standard deviation σ can be related to a

standard normal deviate by calculating (SND).

SND = X - µ

σ

Thus, any normal distribution with mean µ and standard deviation σ, the probability between X1 and X2 is

the same as the probability between SND1 and SND2 in the standard Normal distribution where

SND1= X1 - µ and SND2 = X2 - µ

σ σ

The table showing probabilities of the standard normal distribution is found at the end of this manual.



36

The first two digits of the SND are shown in the 1st column and the third digit is given by the other column

headings. The figures in the body of the table for particular values of the SND show the area under the

standard normal curve to the right of the SND.

If SND = 0.00 , the area to the right of SND is 0.5.

If SND = 1.14, the area to the right of SND is 0.12714 and the area to the left is given by 1-0.12714 =

0.87286.

Examples of applications of the standard normal distribution

1. A study of blood pressure of Negro schoolboys gave a distribution of systolic blood pressure (SBP)

close to Normal with µ = 105.8mmHg and σ = 13.4 mmHg.

a) What proportion of boys would be expected to have SBP greater than 120mmHg?

Calculate SND = 120 - 105.8 = 1.06

13.5

From the tables, the area to the right of SND = 1.06 is 0.14457 or 14%. So about 14.5% of the boys would be

expected to have SBP greater than 120mmHg.

b) What proportion of boys would be expected to have SBP less than 120mmHg.

If 14.5% have SBP greater than 120 mmHg, then 100 - 14.5 = 85.5% will have SBP less than 120

mmHg.

c) What proportion of boys would be expected to have SBP between 85 and 120 mmHg.

calculate SND1 =, 85 - 105.8 = -1.55

13.4

and SND2 = 120 - 105.8 = 1.06

13.4



37

Now look for the area between SND1 and SND2.

The area to the right of SND 1.06 is 0.14457

The area to the left of SND 1.55 is 0.06571

So the proportion with SBP between 85 mmHg and 120 mmHg is

100 - 14.5 - 6.1 = 79.4.

4. Within what limits would the central 95% of SBPs be expected?

If µ = 105.8

σ = 13.4

then

µ + 1.96σ includes 95% of SPB.(105.8 - 19.6(13.4) to 105.8 + 1.96 (13.4)

i.e 79.5 to 132.1 mmHg.

i.e. 95% of the School boys have SBPs between 79.5 mm/Hg and 132.1 mmHg.



38

EXERCISE

1. Suppose the average length of stay in a chronic disease hospital of a certain type of patient is 60 days

with a standard deviation of 15. If it is reasonable to assume an approximately normal distribution oflengths of stay,

Find the probability that a randomly selected patient from this group will have a length of stay :

a) Greater than 50 days.

b) Less than 30 days.

c) Between 30 and 50.

d) Greater than 90 days.

2. If the total cholesterol values for a certain population are approximately normally distributed with a

mean of 200 mg/100 ml and a standard deviation of 20mg/100 ml,

Find the probability that an individual picked at random from this population will have a cholesterol

value:

a) Between 180 and 200 mg/100 ml.

b) Greater than 225 mg/100 ml.

c) Less than 150 mg/100 ml

d) Between 190 and 210 mg/100 ml.



39

Chapter 5

INTRODUCTION TO SAMPLING TECHNIQUES

INTRODUCTION

Often in research work we are dealing with groups which are effectively infinite, such as the number of

underfives in a district, for example. In sampling, part of a group (population) is chosen to provide

information which can be generalized to the whole, although in theory it would be possible to investigate the

whole group. Sampling is adopted to reduce labour and hence costs.

Definition:

Sampling is the process of selecting a number of study units from a defined study population. Otherwise, ifthe whole population is studied the process is referred to as taking a census. We can illustrate the process of

sampling and the important activities involved with the following diagram:-

The diagram depicts drawing a sample of size n using a particular sampling method from a study population

with N units (subjects). Inferential statistics techniques are then used to make inferences about the studypopulation on the basis of results from the sample.

The steps:1) Identifying the study population (note: it is possible to have different study populations in one study).

2) Drawing a sample from the study population.

3) Describing the sample (e.g. by calculating relevant statistics).

4) Making inferences about the parameters.

5) Drawing conclusions about the study population.

Study population

N

n

(1)

Parameters Statistics

(2)

(3)

(4)

(5)

sample



40

Random against biased sampling:-

Selection of the study units can be purposive or random. When it is purposive, no valid assessment of

sampling error can be made and in many instances this will lead to some bias. We will come back to "bias" in

detail later under "other aspects of sampling".

If conclusions that are valid for the whole population are to be drawn on the basis of a sample, then the

sample should be representative of that population. A representative sample is one that has all the important

characteristics of the population from which it is drawn. Selection of sample on random basis is a necessary

but not always sufficient condition to achieve representativeness.

We shall consider two main aspects of sampling, namely:-

i. the sampling methods

ii. sample size.

Moreover, in the discussion, we shall confine ourselves to surveys designed to provide estimates (particularly

the mean and proportion) of certain characteristics of populations as opposed to other study types.

SAMPLING METHODS

The choice of a particular sampling method is influenced by the availability of a list of all the units that

compose the study population. This is called the sampling frame.

Examples could be a list of villages, a list of eligible users of family planning methods, a list of University

students, etc.

Types of sampling:

We can classify sampling methods into two types:

i. non-probability sampling and

ii. probability sampling.

Non-probability sampling:

There are two common methods which fall under this method: These are (i) convenience sampling and (ii)

quota sampling.

i. Convenience sampling - sample is obtained on convenience basis, e.g. the study units that happen

to be available at the time of data collection are selected (many hospital based studies use

convenience samples).

A major limitation of this approach is that the sample drawn may be quite unrepresentative of the

study population.

ii. Quota sampling - a fixed predetermined number of sample units from different categories of the

study population is obtained. Obtaining a sample in this manner ensures that a certain number of

sample units from different categories with specific characteristics (such as sex, religion, age) are

represented in the sample. It is useful when one desires to provide a balance of study units

according to some characteristics of interest. Convenience sampling would not achieve this sort of

balance.



41

Probability sampling:

In this type of sampling the selection procedure has some element of probability/chance. In particular, a study

unit has some known probability of being selected into the sample. We shall discuss five forms of probability

(also known as random) sampling.

(1) Simple random sampling:

This is the simplest form of random sampling and forms the model for all the basic results of

sampling theory. Units in the study population have an equal chance of being selected. The steps

involved in simple random sampling include:-

(a) Obtaining a numbered list of all units in the study population (i.e. availability of complete

sampling frame).

(b) Deciding on the size of the sample.

(c) Selecting the required number of units using either the "lottery" system or tables of randomnumbers.

Use of random sampling numbers:

The following steps are important when one uses tables of random numbers in selecting a sample on

a random basis. To illustrate the steps we shall use the random number table (found at the end of this

manual) which has been taken from Hill, A.B. (1984), A Short Textbook of Medical Statistics, (page

290), 11th Edn. London: Hodder and Stoughton.

i. First, determine how many number digits you need. That is, see whether the sampling

frame is a one, two, or more digits. For example, if your sampling frame consists of 10 units,

this implies you will be choosing from a frame of a two-digit number in size. You must use

two digits from the random number table to choose from numbers 1-10.

If, however, your sampling frame is of three digits in size, then you obviously need to

choose from three digits. For example, the number 43 in columns 10, 11, and row 27, would

become 431. Going down the next numbers would be 107, 365, etc.

You would follow same reasoning if you needed a four digit number, for a sampling frame

which is of four digits. In our example of the number 431 on columns 10, 11,12, row 27, this

would now become 4316, the next down being 1075, and so on.

ii. Decide before hand whether you are going to go across the page to the right, down the page,across the page to the left, or up the page.

iii. Without looking at the table, and using a pencil, pen, or any sharp-ending object, pin-point a

number to establish your starting point.

iv. If this number (in step 3) is among those on your sampling frame, take it. If not, continue to

the next number in the direction you decided before hand in step 2 until you find a number

that is within the range you need. This process goes on until you achieve enough units for

your sample.



42

(2) Systematic sampling:

As the name suggests, this sampling method is such that elements in the sample are obtained in a

systematic way.

In carrying out systematic sampling, the following steps are important:

i) Obtain the sampling frame (and the size of the study population N, say)

ii) Decide on the sample size, n

iii) Calculate the sampling interval, k = N/n

iv) Select the first element at random from the first k units

v) Include every k th

unit from the frame into the sample.

For example, suppose a sample of 80 individuals is to be selected from a population comprising of

720 people.

Then n = 80, N = 720 and the sampling interval

k = N = 720 = 9 This is step (iii).

n 80

Step (iv) requires us to determine the first unit in the sample by selecting randomly one individual

from the first 9 individuals on the list. If, using simple random sampling for example, the initial

selection was 7 the selected individuals would be those occupying positions 7, 16, 25, 34, ..., etc,

according to step (v). This continues until 80 individuals have been obtained.

(3) Stratified sampling:

In this method the population is divided into subgroups, or strata whereby each stratum is sampled

randomly with a known sample size. Strata may be defined according to some characteristics of

importance in the survey. These could be occupation, religion, age groups or even locality whereby

regions of the country may be taken as strata in a national health survey.

The steps involved in stratified sampling are as follows:-

i. Divide the population into subgroups (strata)

ii. Draw a sample (of predetermined size) randomly from each of the stratum.

An important stratification principle is that the between-strata variability should be as high as

possible, or equivalently that each stratum should be as homogeneous as possible (i.e. units within a

stratum should be as much alike as possible and units in different stratum should be as much differentas possible).

(4) Cluster sampling:

There are situations in which obtaining a complete list of individuals in the study population is

practically not feasible or a complete sampling frame is not available before the investigation starts.

In such cases it would be easy and convenient to talk of a sampling frame in which the sampling units

are a collection (cluster) of study units.



43

Examples of such clusters would be schools, hospital wards, villages, etc. Since in this case the

sampling unit is a cluster (e.g. a school) the sampling method is known as cluster sampling. The

selection steps will be exactly the same as those for any of the above random sampling methods but

the sampling unit being the cluster.

Unlike in stratified sampling, an important principle in cluster sampling is that units within a cluster

should be as much heterogeneous as possible while the between-cluster variability should be as low

as possible.

(5) Multistage sampling:

Multi-stage (originating from the Latin word "multus" meaning "many") sampling is carried out in

many (more than 1) stages, and different sampling techniques can be employed at every stage. In this

method the sampling frame is divided into a population of ‘first-stage sampling units’, of which a

‘first-stage’ sample is taken. Each first-stage unit selected is subdivided into ‘second-stage sampling’

units, which are then sampled. The process continues till it is convenient to stop.

To illustrate multistage sampling consider a health survey of primary school children in Tanzania

mainland. An immediate problem to taking a sample of these children is that it is almost impossible

to construct a complete sampling frame. A multistage sample might be:

(a) to take a sample of regions;

(b) within each selected region take a sample of districts;

(c) within each selected district, take a sample of schools;

(d) within each selected school, take a sample of school children, and carry out the

investigation.

The sample would thus be accomplished in four stages and notice that the construction of a completesampling frame for each stage is relatively easy.

Apart from this advantage (of coming up easily with complete sampling frames), multistage sampling

procedure is likely to result in an appreciable saving in cost by concentrating resources at selected

schools instead of a sample made up of children scattered in all parts of the country.

Sometimes, in the final stage of sampling, complete enumeration of the available units is undertaken.

In the above example, once a survey team has reached the level of a school it may cost little extra to

examine all the children in the school; it may indeed be useful to avoid complaints from children not

included in the study within the same school.

Other aspects of sampling

(a) Bias in sampling

Bias in sampling refers to the systematic error in sampling procedures that may lead to distortion in

results. Sources of bias in sampling include the following:

i) Non-response:

This is encountered mainly when subjects refuse to give a reply during interview, or when

they (the subjects) forget to fill in a questionnaire. The non-respondents (particularly those

due to refusal) may differ systematically from those who respond.



44

ii) Studying volunteers only:

The fact that some people volunteer to participate in a study may mean that such people are

different from the general population on the factors being studied.

iii) Sampling registered patients only:

Patients going to a hospital are likely to differ from those being treated elsewhere.

iv) Missing cases of short duration:

In prevalence studies, cases of short duration (e.g. fatal cases, cases with short episodes, and

mild cases) are more likely to be missed.

v) Seasonal bias:

If the condition under study exhibits different characteristics in different seasons of the year,

this may lead to a distortion in the results, depending on the period of data collection.

vi) Tarmac bias:

Selecting a study area on the basis of "accessibility" will generally constitute a selection bias.

(b) Ethical considerations

If recommendations from a study are intended for the entire study population (e.g all relevant

individuals in a region), then one is bound ethically to ensure the sample studied is representative of

that population.

Remember that random selection of a sample does not guarantee representativeness.

SAMPLE SIZE

(NOTE: This sub-section can be skipped without loss of continuity until variance and standard deviation has

been covered).

In the planning of a study in almost any subject, one of the first and fundamental questions to be considered is

the size of the study. The trivial answer to the question ‘how big a sample do I need?' would be ‘make as large

a sample as possible, since in a given study an increase in sample size will increase the precision of the

sample results'.

Clearly issues of cost of collection and processing of data come in with a potential limiting effect on the

sample size. We shall discuss the aspect of sample size in the simplest situation whereby a study is designed

to estimate a parameter such as the mean or the proportion and confine ourselves to the statistical problems

involved in the calculations.

(1) Sample size for estimation of a mean

A study is to be designed to estimate the mean, say µ, of a population. Suppose the sample mean is x

, then the investigator is required to specify the maximum likely error, say ε=( µ -x ) he can accept.

From sampling distribution theory, we know that the interval µ + 2σ /n will include x 95% of the

time, where σ is the population standard deviation, and n is the sample size. (Note that the critical

value 1.96 has been approximated to 2). That is, the maximum likely error is 2 σ / √n.

Thus ε = 2σ / √n. Hence ε2 = 4σ2

/n, and the required sample size, n, is given by:

n = 4σ

2

/ ε

2



45

This formula implies knowledge of the population standard deviation σ, and in almost all

surveys this is unknown. It is necessary to replace σ with an estimate. This estimate may be

obtained from results of previous studies on the variable or alternatively be obtained as a

direct result from a pilot study.

(2) Sample size for estimation of a proportion

Estimating a proportion is the most common thing in surveys; that is, prevalence studies.

Like in the problem of estimating the population mean, the researcher is required to specify

the maximum likely error (sometimes also called the margin of error) he can tolerate.

Suppose the total proportion of a population with the characteristic of interest is π and a

sample of size n is to be taken to estimate π, then we know the standard error of a proportion

estimated from the sample is

√π /(1 - π) or √π(100-π) if π is in percent.

n n

If the maximum likely error which can be tolerated is ε, then assuming approximate

normality, this is given by

ε = 2√π(1-π) (i.e. as before, approximating 1.96 to 2)

n

Hence ε2 = 4π(1-π), and

n

n = 4π(1-π) or n = 4π(100-π) when p is expressed in percent form, in which case ε must also be

ε2 ε2 in percent.

Here, as before, it is necessary to know the population value π in order to calculate the

sample size n.

But it is π that the study aims at estimating! Again for the purpose of calculating the sample

size, π should be replaced with a value obtained either from previous studies or directly from

a pilot study. Alternatively, if neither of the two sources works, plugging in a value of 0.5

(or 50%) for π is justifiable since it can be shown mathematically that this value maximizes n

when ε is taken to be constant.

As a word of caution we need to point out that the formulae for determining the desired

sample sizes lead to rough approximations only giving us the order of the minimum numbers

required.

In reality we will find that we need to strike a balance between what we desire (as reflected

by the calculations) and what is feasible (as dictated by available resources such as the time,

transport, manpower, money, etc).

Moreover, it is not always necessary or possible to apply formulae to calculate the desired

sample size. In an exploratory study on shortage of drugs in Health Centres, for example, one

may simply choose the few clinics which are best-off and few others which are worst-off in

terms of availability of drugs, and go on with the study. Nevertheless, if no formulae can be



46

applied the following considerations are important in coming-up with a reasonable sample

size:-

(a) The type of study: -in exploratory studies, you usually have relatively small

samples.

(b) The number of variables to be used in the study: -the more variables, the smaller

the sample size, for practical reasons.

(c) The expected variation in the study population with respect to the most important

variables: -the bigger the variation, the larger the sample one needs.

(d) The scale on which the findings and recommendations from the study will be

used: -the larger the scale, the larger the sample.

Finally, we wish to point out that it is not generally true that the bigger the sample size, the

better the study becomes! In general, it is much better to increase in accuracy of data

collection (e.g. through careful pre-testing of the tools, or improving the quality ofinterviewers, if any) than to increase on the sample size.

EXERCISE:

1. A study is being planned to determine the mean birthweight of babies born at Muhimbili

Medical Centre. Birthweights are approximately normally distributed and 95% of the

weights are probably between 2000g and 4000g.

Determine the required sample size so that there is a 95% chance that the estimated mean

birthweight does not differ from the true value by more than 50g. (Hint: calculate thestandard deviation of the birthweights, first).

2. You have been assigned to conduct a study in order to determine the prevalence (i.e.

proportion of people affected with) of bancroftian filaria infection in Dar-es-Salaam region.

A review of literature on the subject reveals that, studies done along the East African coastal

strip some years back, showed the prevalence to be in the order of 30%. What sample size do

you require in order to come up with a reasonable estimate in your study? Give a complete

answer including describing any assumptions or prior decisions that you undertake.



47

Chapter 6

ESTIMATION

In Chapter 5 we mentioned that we study a sample with the view to learning something about the

population as a whole.

In general, we wish to estimate characteristics of the population such as:

i. the mean value of some measurement;

ii. the proportion of the population with some characteristic.

STATISTIC AND PARAMETER

When we take a sample the quantity that we get is called a statistic. This quantity is an estimate ofsome population value known as a parameter.

That is, quantities obtained by studying a sample are referred to as statistics, while population values

are referred to as parameters. Usually Roman letters are used to denote a statistic while Greek letters

denote a parameter.

Example:

Quantity Sample (Statistic) Population (Parameter)

mean x µ

variance S

2

σ2

proportion p π

Thus, the sample mean x estimates the population mean µ, for example.

In general, the sample mean or sample proportion is unlikely to be exactly equal to the mean or

proportion in the population, although the former is intended to estimate the latter. If the two are

exactly equal to one another, it is just by coincidence.

This amounts to saying that almost always our conclusion about a population on the basis of the

sample we have taken will have some error.

We distinguish between two sorts of error:

(i) Sampling errors and

(ii) non-sampling errors

Sampling errors are those which arise due to the fact that we have observed only part of the whole

population, and they get less important as the sample size increases.

For example, an estimate of the mean number of children per household in a certain district based on

two households only (in the district) will certainly be poorer than that based on a sample of say 100

households.



48

We say there is less sampling error in the latter situation than in the former. If we investigated the

whole population (i.e. all households in the district) the sampling error would be zero because we

would know the population mean exactly.

Non-sampling errors are due mainly to fault in the sampling process which is likely to create room

for the potential sources of bias (sometimes also referred to as systematic errors) highlighted in

Chapter 2. These errors are potentially serious since the bias they cause may lead to invalid

conclusions being drawn. Increasing the size of a sample will not necessarily reduce the non-

sampling errors.

For example, subjects may refuse to give a reply during interview or they may forget to fill in a

questionnaire. These non-respondents may differ systematically from those who respond.

Non-sampling errors also occur through equipment faults, observer errors and during data processing

through coding, data entry, etc.

However, in this section we will direct our attention to sampling (also known as random) errors.

THE STANDARD ERROR OF A MEAN

Consider the variable X. Suppose we take a sample of n units and measure this variable. We know

the sample mean x (given by ΣX/n) may be different from the population value, µ simply because we

have taken a sample. The question is, how do we measure this sampling variation.

Ideally, we could take several samples of size n each and calculate several values of x . It is unlikely

that the values of xwill be the same, but if they were all similar (i.e. at least close) this would imply

that the sampling error is small. If, in contrast, the values differed markedly, we would reasonablyconclude that the sampling error is large.

Let us revisit the issue of the sampling error in the situation of a sample taken once. Two properties

about the sampling error are apparent:-

1. The larger the sample size the better the precision in estimating (i.e. large samples are more

likely to produce closer estimates than small samples).

2. If the variability of the observations in the parent (study) population is small we would

expect the error to be small also, and vice-versa. Thus the sampling error depends on the

variability of observations in the population.

We mentioned earlier on the idea of taking repeatedly a random sample of size n and calculating each

time the sample mean x . This would lead to obtaining a series of values of x , and the natural

questions relating to this (new) variable x will be on its distribution as well as the mean and

variance of the variable. It can be shown mathematically that:-

i. x tends to follow a normal distribution irrespective of the type of the parent

distribution (i.e. the distribution of X). In fact the distribution of x becomes closer to

the normal distribution as n increases.



49

ii. The mean of the distribution of x is the same as that of X (i.e. the mean of the

sample means is the same as the mean µ of the parent population).

iii. The variance of x is σ2 /n where σ2

is the variance of X. It is easy to see that as the

sample size n increases, the variance of x decreases. From an earlier explanation,this observation is expected.

iv. The standard deviation of x is the square-root of its variance, and is often referred to

as the standard error of the mean. That is, the standard error of the (sample) mean,

usually written as SE(x ), is given by σ / √n.

Note: In practice, the value of σ2 will be unknown. It can be replaced by the sample value,

s2

, and the expression for the standard error SE(x ) applies accordingly.

The fact x that tends to follow a normal distribution is remarkable, since this implies that the

properties of normal distributions apply to the distribution of the sample mean. In particular, we nowknow that x follows a normal distribution with parameters µ and σ2

/n as the mean and variance,

respectively.

Hence, it follows, for example, that 95% of the sample means lie within the interval µ±1.96×SE(x ).

This implies that there is a 95% chance of getting a sample mean within the interval µ±1.96×SE(x ).

Equivalently, we are saying that the probability of having a sample mean in the interval µ±1.96×

SE(x ) is 0.95.

Note: The limits of the interval µ±1.96×SE(x ) are µ-1.96×SE(x ) and µ+1.96×SE(x ). That is,

alternatively, we are talking of the interval ranging from µ-1.96×SE(x ) to µ+1.96×SE(x )

We can express the above statements mathematically as follows:-

Pr{µ-1.96×SE(x ) < x < µ+1.96×SE(x )} = 0.95, where Pr{x} means "probability of x"

Re-arranging the left-hand side of the above equation, we obtain the following equivalent equation:

Pr{(x -1.96×SE(x ) < µ <x +1.96×SE(x )} = 0.95.

In words, this says that the probability that the interval

x -1.96×SE(x ) to x+1.96×SE(x ) includes the population value µ is 0.95.

When the value of x (and that of SE(x )) is known, then the interval x -1.96×SE(x ) to x+1.96×SE(x), often written also as (x -1.96×SE(x ), x+1.96×SE(x )), is called the 95% confidence interval of µ.

The logic of this is that, for known values of x and SE(x ), the interval (x -1.96×SE(x ), x+1.96×

SE(x )) is known and fixed. Hence, it no longer makes sense to talk of the interval including µ with

0.95 probability since the probability is definitely either 1 or 0. That is, either the interval includes

or does not include µ.

Wider intervals, and therefore higher "confidence" can be set if required. For example, the value

2.58 can be used in the place of 1.96 to set 99% confidence intervals. Indeed an appropriate

standardized normal deviate, z, can be used to obtain desirable confidence intervals.



50

While we have used a property of the normal distribution (notably, the one which states that 95% of

the values lie within 1.96 standard deviations about the mean) to define a confidence interval, it is

important to distinguish between the 95% spread (or tolerance) interval/limits and the 95%

confidence interval/limits. The former is a descriptive measure while the latter is used in estimation

problems as a measure of precision. In particular, the limits µ±1.96×σ include 95% of the values in

the population whereas the limits µ±1.96×σ / √n include 95% of the sample means.

THE STANDARD ERROR OF A PROPORTION

Recall that if in a random sample of size n there are r units with some characteristic of interest (and

n-r do not have the characteristic), then the proportion p in the sample with the characteristic is

given by p=r/n. This estimates the parameter π (i.e. the population proportion). Like in the sample

mean, the fundamental question is "how precise is p in estimating π?" Again, this can be measured

by the standard error of p, SE(p). This can be arrived at through the theory of repeated sampling.

In strict terms, and particularly with small samples, p follows the binomial distribution: we are not

going to learn about this distribution in this course! But for reasonably large samples, the

distribution of p can be approximated by the Normal distribution. It can be shown that the standard

error of p is given by SE(p)=√π(1-π)/n. Because the population proportion, π, will generally be

unknown, the standard error of p can be estimated by √p(1-p)/n.

Arguments used to develop interval estimates of a population mean µ above, also hold true for

estimating a population proportion π. Hence the 95% confidence interval of the population

proportion, is given by p±1.96×SE(p)≡(p-1.96×SE(p), p+1.96×SE(p)).

EXERCISE

1. The distribution of the duration of stay in a hospital for a certain condition is known to be

skewed to the right. The mean length of stay is 10 days and the standard deviation is 8 days.

It is proposed to study a sample of 100 patients admitted in hospital for that condition.

(a) What kind of distribution will the duration of stay of the patients in the sample

follow?

(b) Comment on the suitability of the use of the mean duration of stay as a summary

measure of central tendency in this case.

(c) If you took many such samples (i.e. repeatedly) what kind of distribution would thesample means follow?

(d) What would be the mean and the standard deviation of the distribution of the sample

means in (c) above? Give a complete numerical answer.

2. In a random sample of 150 University of Dar-es-Salaam students it was found that 38 of them

received or needed to receive treatment for defective vision.

(a) Estimate the proportion (in percentage) of students at the University who receive or

need to receive treatment for defective vision.



51

(b) Estimate 90%, 95% and 99% confidence intervals for the true proportion of

University of Dar-es-Salaam students who receive or need to receive treatment for

defective vision.



52

Chapter 7

SIGNIFICANCE TESTS: ONE SAMPLE

INTRODUCTION

Chapter 6 dealt with the estimation of population parameters by sample statistics. These

sample statistics may further be idealized to answer questions about the population

parameters. In the framework of statistical inference the question is reduced to a hypothesis

and the answer to it expressed as the result of a test of the hypothesis.

Definition of terms

1. Statistical hypothesis: This is a statement about the parameter(s) or distributional

form of the population(s) being sampled.

2. Null hypothesis, Ho: This term relates to the particular hypothesis under test. In many

instances it is formulated for the sole purpose of being rejected or nullified. It is often

a hypothesis of 'no difference’.

3. Alternative hypothesis, H1: This is a statistical hypothesis that disagrees with the

null hypothesis.

The null hypothesis H0 and the alternative hypothesis H1 concern populations but ourconclusions are based on samples taken from these populations. Generalization from

sample to population is dangerous since sampling errors are involved. Therefore we

are unable to say that H0 or H1 are definitely true because of this sampling effect.

If sampling errors are taken into account, it can be investigated how likely each of these

hypothesis is. We have to measure the relevant information in the sampled data and weigh

this information in relation to the sampling errors involved.

4. A statistic: is a value which depends on the outcomes on a variable for the sampled elements.

5. A test statistic: is a statistic which represents the relevant sample information for thequestion under investigation. It provides a basis for testing a statistical hypothesis and has a

known sampling distribution with tabulated percentage points (e.g. standard normal, χ2, t

etc). The value of a test statistic differs from sample to sample.

6. Significance level: This is the probability of rejecting H0 when it is true. It is often

expressed as a percentage, i.e. the probability α is multiplied by 100. Often the 5% and 1%

levels (i.e. α=0.05, 0.01 respectively) are chosen as important, but the selection is fairly

arbitrary.

7. Critical value: This is the value of the test statistic corresponding to a given significance

level as determined from the sampling distribution of the test statistic (by using statistical



53

tables which will be explained later). The critical value is the boundary value such that if the

value of the test statistic is more extreme (i.e. more unlikely) than the critical value, then H0

is rejected and the probability of rejecting H0 when it is true is less than the significance

level.

CONCEPT OF P-VALUES

The p-value is a probability associated with the observed test statistic value.

The p-value of an observed test statistic value is the probability to obtain a test statistic value as

extreme as, or more extreme than, the observed test statistic value, if H0 is true. For example, in a

clinical trial this statement refers to the observed difference between the treatment groups. We are

therefore relating our data to the likely variation in a sample due to chance when the null hypothesis

is true in the population.

Interpretation of p-value

Large p-value points the null hypothesis

Small p-values are evidence for the alternative hypothesis.

A proposed guideline is:

p > 0.05 No evidence against Ho

0.01 < p < 0.05 Evidence in favour of H1, but be careful

0.001 < p < 0.01 Substantial evidence in favour of H1

p < 0.001 Very strong evidence in favour of H . The possibility that H is true can be neglected.

However, for a proper interpretation of the p-value the sample size should be considered. If the

sample size is too small the sampling error will be large. This will prohibit us to find evidence against

Ho and result in high p-values, even if H

o is not true.

Relationship between p-values and sample size

Sample size is important in the interpretation of p-values.

Sample size

p-value Small Large

Small- evidence against Ho - results point away from H

- evidence against Ho - results support H

Large

- difficult to interpret

- can't distinguish between H and H

- no evidence against H1

- results point at H

The following results relating to malnutrition among underfives in Dodoma and Mwanza using

different sample sizes confirm the above explanation.

n Dodoma Mwanza P Conclusion

50

500

50000

40%

40%

31%

30%

30%

30%

0.29

0.0098

0.0012

No significant difference

Highly significant

Highly significant



54

Statistical significance and practical significance

There are many situations in which a result may reveal a statistically significant difference whichmight be quite unimportant clinically. For example, in a study to compare blood pressure in the left

and right arms, a small difference of about 1 mmHg was found. This difference was highly

statistically significant but of no importance clinically. Similarly, it is not reasonable to take a non-

significant result as indicating no effect, just because we cannot rule out the null hypothesis.

ONE SAMPLE SIGNIFICANCE TEST FOR A MEAN (standard deviation, σσσσ known)

Problem: Is it reasonable to conclude that a sample of n observations, with mean x could have been

from population with mean µ and standard deviation σ?

Null hypothesis: The difference between µ and x is merely due to sampling error.

Calculate SND = x - µ and consider the numerical value of SND.

σ / √n

If SND<1.96 we have no strong evidence against the null hypothesis and cannot convincingly show

that it is wrong. i.e p>0.05

If SND>1.96 we have evidence that the null hypothesis is false. It is unlikely that the difference

between x and µ is due to sampling error only, i.e.p<0.05. If SND>2.58 we have strong evidence

against the null hypothesis p<0.01. If 1.96 < SND < 2.58, we write 0.01<p<0.05.

Example:

A large number of patients with cancer at a particular site, and of particular clinical stage, are found

to have a mean survival time from diagnosis of 38.3 months with a standard deviation of 43.3 months.

100 patients are treated by a new technique and their mean survival time is 46.9 months. Is this apparent

increase in mean survival time associated with the new technique?

Solution:

Null hypothesis: There is no increase in mean survival time in the patients treated with the new

technique.

We have the standard normal deviate as

SND = 46.9-38.3 = 8.6 = 2.0

(43.3/ √100) 4.33

This value just exceeds the 5% value of 1.96, and the difference is therefore significant, i.e p<0.05

Thus we conclude that it is likely that there is an increase in the mean survival time among patients

who were treated by a new technique.



55

ONE SAMPLE SIGNIFICANCE TEST FOR A PROPORTION

Problem: Is it reasonable to conclude that a sample of n observations in which r have a characteristic,

could have been taken from a population in which the proportion with the characteristic is π?

Null hypothesis: The difference between the number with the characteristic, nπ, expected in the

sample, and the number observed, r, is merely due to sampling error.

If n is reasonably large, then calculate

SND = r - nπ

√nπ(1-π)

If SND<1.96 we have no strong evidence against the null hypothesis and cannot convincingly show

that it is wrong. i.e p>0.05

If SND>1.96 we have evidence that the null hypothesis is false (p<0.05). It is unlikely that the

difference between p and π is due only to sampling error.

If SND>2.58 we have strong evidence against the null hypothesis p<0.01. If 1.96 < SND < 2.58 we

write 0.01<p<0.05.

Example:

In a clinical trial to compare two analgesics A and B, 100 patients were each given the two drugs on

different occasions. Of the 100 patients, 65 say they prefer A, 35 prefer B. Is this reasonably good

evidence that more patients prefer A than B?

Solution:

If patients in general showed no preference for A or B, the proportion of A preferences, π, would be

0.5.

Null hypothesis: The proportion of all patients of this type who prefer A is π = 0.5.

Let r be the observed number of A preferences out of n=100 patients. Then r=65 and SD(r) = √nπ(1-π) = √100x0.5x0.5 = 5.

Therefore SND = 65-50 = 3

5Since SND>2.58, then p<0.01. Therefore we have strong evidence against the null hypothesis and we

can conclude that there is good evidence of more patients preferring drug A.

Significance tests from confidence intervals

Recall that a confidence interval for a population parameter (µ or π) provides limits which have a

high probability, for example 95%, of including the unknown parameter.

In a significance test of a hypothesis the question of whether the population parameter takes a

particular value is posed.



56

Clearly these two approaches are related. If, for example, the 95% confidence interval includes the

value of the parameter proposed by the hypothesis then the result of the test must be non significant

at the 5% level (i.e. p>0.05).

If, on the other hand, the 95% confidence interval does not include the value of the parameter

specified in the null hypothesis, then the result of the test must be significant at the 5% level (i.e.

p<0.05).

For example, in the test of a sample mean (example on mean survival time of patients after being

treated by a new technique), x=46.9 months and SE(x )=4.33 months.

Thus a 95% confidence interval for the true mean survival time due to this new technique is

x ± 1.96 x SE(x )

= 46.9 ± 1.96 x 4.33

= 46.9 ± 8.49

= 38.4 to 55.4

The value proposed in the null hypothesis is 38.3 months and we note that it is not included in the

confidence interval. It would thus be concluded that 38.3 is an unlikely value for the mean survival

time of cancer patients after treatment. Equivalently, we are saying, the null hypothesis is rejected at

5% level (i.e. p<0.05).

The t-test

As already shown above, the standard normal deviate test involves the calculation of

SND = x - µ = x -µ

SE(x ) σ / √n

The SND is then compared with the critical values 1.96 or 2.58. This was applied since the

population standard deviation, σ, was known. If, as is usually the case, σ is unknown, the SND cannot

be calculated. However, the value of σ can be estimated from the sample by the standard deviation s.

Replacing σ in the above formula by s, we obtain a new quantity t, given by

t = x - µ

s/ √n

t follows the t-distribution on n-1 degrees of freedom.

As the sample size increases, s should be nearly equal to σ and t will be very close to the standard

normal deviate.

At the end of this manual, we find a table which shows the critical values of t, for each number of

degrees of freedom.



57

Example:

The following data are uterine weights (in mg) of each of 20 rats drawn at random from a large stock.

Is it likely that the mean weight for the whole stock could be 24 mg, a value observed in some

previous work?

9 18 21 26 14 18 22 27 16 20

15 19 22 29 15 19 24 30 24 32

Here n=20, µ=24, Σx=420, x =420/20 =21 and s=5.91.

The null hypothesis is that the mean weight for the whole stock is 24 mg.

Therefore, t = 21 - 24 = -2.27

1.3219

The degrees of freedom, df, are 20-1=19. From the t-Table, t(0.05, 19)= 2.093. Since 2.27>2.093,

then p<0.05.

Thus there is sufficient evidence to suggest that the mean uterine weight of the stock is different from

24 mg.

The 95% confidence interval for the true mean is

x ± t(0.05,19) x SE(x )

i.e. 21 ± 2.093 x 1.3219

= 18.2 to 23.8

The exclusion of the value 24 corresponds to the significant result of testing this value at the 5%level.

EXERCISE

1. The mean level of prothrombin in the normal population is known to be 20.0 mg/100 ml of

plasma and the standard deviation is 4 mg/100 ml. A sample of 40 patients showing vitamin

K deficiency has a mean prothrombin level of 18.5 mg/100 ml.

(a). How reasonable is it to conclude that the true mean for patients with vitamin Kdeficiency is the same as that for a normal population?

(b). Within what limits would the mean prothrombin level be expected to lie for all

patients with vitamin K deficiency? (Give the 95% confidence limits).



58

Chapter 8

SIGNIFICANCE TESTS: TWO SAMPLES

COMPARISON OF TWO MEANS

We shall distinguish between two situations: the unpaired case, in which the two samples are of equal

size and the individual members of one sample are paired with particular members of the other

sample; and the unpaired case, in which the samples are quite independent.

Matched/paired observations

So far, the problem arising from the comparison of a single sample mean with some value proposed

under the null hypothesis has been considered. We had only one sample which was compared with afixed value µ, that is µ has no sampling error.

A common problem which normally arises in medical trials is the comparison of the responses to 2 or

more treatments. It is sometimes possible to reduce this problem of comparing 2 sets of responses to

treatments to a single sample problem previously described.

Suppose we have 10 patients as experimental units and they each have responses to 2 treatments, i.e

we are using the same patient as his own control assuming the order of administration has no effect.

Example: The following are anxiety scores recorded for 10 patients receiving a new drug and a

placebo in random order.

Anxiety score Difference (d)

Patient Drug Placebo (drug-placebo)

1

2

3

4

5

6

7

8

910

19

11

14

17

23

11

15

19

118

22

18

17

19

22

12

14

11

197

-3

-7

-3

-2

1

-1

1

8

-81

-13

Null hypothesis: The mean difference in the anxiety scores in the population from which this sample

was taken is zero. (i.e the mean difference observed in the sample is merely due to sampling error).

n=10, Σd = -13, d =-1.3, s=4.548

Estimated standard error of d =s/ √n = 1.438

Calculate t = d - 0



59

SE(d )

Therefore, t = -1.3 - 0 = -0.90 on 9 df.1.438

From the table of the t distribution, the critical value of t, with 9 d.f. at the 5% level is 2.26. Since

0.96<2.26, then p>0.05. Therefore the difference is not statistically significant.

Conclusion: These data show no strong evidence of a difference between the drug and the placebo in

their effect on anxiety scores.

The 95% confidence for the 'population' mean difference is

d ± t(0.05,9)xSE(d ) = -1.30 ± 2.26 x 1.438

That is, -4.55 to 1.95, where t(0.05, 9) denotes the critical value of t corresponding to the 5% level of

significance with 9 df.

Difference between two independent sample means (σσσσ1, σσσσ2 known)

It is often of interest to determine whether two populations differ from one another with respect to

certain characteristics which summarize the values of the variable, such as their means. A test is

performed by selecting a random sample of ni observations from population i (i=1, 2). Two samples

are independent when an individual in one sample is unrelated to any particular individual in the

other sample.

The null hypothesis is: The two samples were taken from populations whose means µ1 and µ2 are

equal.

i.e. µ1 = µ2 or µ1-µ2=0

This is equivalent to saying that the observed difference, x1-x 2 is merely due to sampling error.

To test this hypothesis we calculate the standard normal deviate (SND) which is

SND x x

SE x x

= −

−

( )

( )

1 2

1 2

where

SE x xn n

SE x SE x( ) ( ) ( )1 2

1 2

2

1

2

21

2

2

2

− = + = +σ σ

Example:

In a study of the age of menarche in women in the USA the following distributions were observed for

samples of women aged 21-30 and 31-40 years.

Age of menarche women now aged 31-40 women now aged 21-30



60

10 3

11 2 11

12 8 28

13 14 23

14 27 12

15 5 1

16 8

17 1

18 1

Total, n 66 78

Σx 916 969

X 13.88 12.42

Σx2 12838 12127

(Σx2)/n 12713 12038

Σ(x-)2= Σx2-(Σx)2 /n 125 89

s2 1.923 1.156

s 1.387 1.075

SE(X) 0.171 0.122

= √0.04413 = 0.2101

Therefore, SND = 13.88 - 12.42 = 1.46

0.2101 0.2101

= 6.95, p<0.001

Conclusion: There is very strong evidence that on average, younger women's age of menarche is

less than the older women's age.

The 95% confidence limits for the true difference are:

(x - x ) x - x )1 2 1 2±1 96. (SE

i.e. 13.88-12.42 ± 1.96 x 0.2101

i.e. 1.05 to 1.87 years

Difference between two sample means (σσσσ1, σσσσ

2 unknown)

Null hypothesis: 2 samples taken from normal populations whose means µ1 and µ

2 are equal and

whose variances s1

2 and s

2

2are also equal i.e the two samples are taken from identical populations.

SE ( x1 - x2 )= 0.1712+0.1222



61

Suppose as proposed under the null hypothesis σ1

2 = σ2

2 =σ2

But our two sample variances s1

2 and s2

2 are two separate estimates of the same quantity σ2

. So they

are combined to give a single best estimate of σ2

, s2

with degrees of freedom equal to (n1-1)+(n

2-1) or

n1+n2-2. Therefore,

and

s x x

n2

22 2

2

2

1=

−−

Σ( )

Hence, s

2

= (n1 - 1)s1

2

+ (n2 -1)s2

2

(n1 -1) + (n2 -1)

Therefore the standard error of the difference between the means is estimated by:

and the t value is given as:

Example:

The following data show the abrasiveness of two brush-on denture cleaners A and B, measured by

weight loss in mg.

A: 10.2, 11.0, 9.6, 9.8, 9.9, 10.5, 11.2, 9.5, 10.1, 11.8

B: 9.6, 8.5, 9.0, 9.8, 10.7, 9.0, 9.5, 9.9

A B

Σx

n

X Σx

2

(Σx)2 /n

Σ(x-X)2

103.6

10

10.36

1078.44

1073.296

5.144

76.0

8

9.50

725.20

722.0

3.20

s2

1 =

S( x1- x1 )2

n1-1

SE(x - x ) = s (1

n+

1

n)1 2

1 2

t =

x1 - x2

SE ( x 1- x2 ), d . f .=n1+n2-2



62

s2 = 5.144+3.20 = 0.5215

(10-1)+(8-1)

s = √0.5215 = 0.7221

The standard error of the difference between the two groups , A and B is estimated by

SE x x A B( ) . .− = + =0 72211

10

1

80 3425

Therefore t = 10.36-9.50 = 0.86 = 2.51

0.3425 0.3425

Here we have (10-1)+(8-1) = 16 d.f. From the table of the t distribution, t(0.05, 16) = 2.12.

Since 2.51>t(0.05,16), then p<0.05.

Thus, the difference is significant.

The 95% confidence interval for the true mean difference is

The 95% confidence interval for

the true mean difference is

i.e. 0.86 ± 2.12 x 0.3425 = 0.13 to 1.59 mg.

COMPARISON OF TWO PROPORTIONS

Remember that when the variable under investigation is qualitative, i.e when we are interested in the

presence or absence of some characteristic we take a random sample of n individuals, observe the

number r with the characteristic and calculate the sample proportion p =r/n. If n is large and p not too

small, we use the normal approximation of the binomial. Now with two samples the approximation is

similarly used.

Suppose there are two populations in which the probabilities that an individual shows a certain

characteristic are π1 and π2. A random sample of size n1 from the first population has r1 individuals

showing the characteristic and the proportion is p1 = r1 / n1. The values for a sample from the secondpopulation are n

2, r

2, and p

2 = r

2 / n

2 .

Suppose we wish to test the hypothesis that π1=π

2.

If the null hypothesis is true, both samples are from the same population, and the best estimate of π

will be obtained by p which is given by

The standard error of p1-p

2 is

x A - x B +- 2.12 xSE ( x A- x B )

p =r 1+r 2

n1+n2



63

SE p p p pn n

( ) ( )( )1 2

1 2

11 1

− = − +

The null hypothesis is thus tested approximately by using the standard normal deviate which is

Example:

A clinical trial was undertaken to assess the value of a new method of treatment A, in comparison

with the old treatment B. The patients were divided into two groups randomly.

Of 257 patients treated with treatment A 41 died.

Of 244 patients treated with treatment B 64 died.

The two proportions of patients dying are:

p1 = 41/257= 0.1595 and p

2 = 64/244= 0.2623

Null hypothesis: The two treatments are equally effective. ie. the population proportions π1 and π

2

are equal.

If the null hypothesis is true, then the two equal population proportions π1 and π

2 can be written

simply as π, i.e. π1 = π

2 = π.

We replace π by the best single estimate available. This estimate is the proportion say p, obtained by

pooling the two samples.

This gives

p = 41+64 = 105 = 0.21

257+244 501

Therefore SE(p1-p

2) = √0.21x.79(1/257 +1/244) = √0.001327= 0.0364

Standard normal deviate, SND =(p1-p2) - (π1-π2) = 0.1595 - 0.2623 -0SE(p

1-p

2) 0.0364

SND = -2.82, p<0.01

The result is highly significant and suggests that treatment A (with a smaller proportion of patients

dying) is better than treatment B.

95% confidence limits for the true difference in the proportions dying are

p1 - p2 ± 1.96 x SE(p1-p2)

-0.1028 ± 1.96 x 0.0364

i.e. -0.174 to -0.031

SND = p1- p2

SE ( p1- p2)



64



65

EXERCISE

1. A clinical trial to test the effectiveness of a sleeping drug was conducted among 11 patients.

They were observed during one night with the drug and one night with a placebo. One patient

died before the placebo reading was taken.

The following are the results of testing the effectiveness of the drug:

Hours of sleep

Patient number Drug Placebo

1 6.1 5.2

2 7.0 7.9

3 8.2 3.9

4 7.6 4.7

5 6.5 5.3

6 8.4 5.4

7 6.8 (died)

8 6.9 4.2

9 6.7 6.1

10 7.4 3.8

11 5.8 6.3

Σx 77.4 52.8

Σx2 551.16 292.88

(a). Establish whether there is or there is no real difference in sleeping time between the

drug and the placebo.

(b). Determine the 95% confidence interval for the difference in the mean sleeping time.

2. Comparison of birth weights of children born to 15 non-smokers with those born to 14 heavy

smokers gave the following results:

Non-smokers Heavy smokers

Mean 3.5933 3.2029

Standard deviation 0.3707 0.4927

Is there enough evidence that on average children born to non-smokers are heavier than

children born to heavy smokers? Confirm your results by a 95% confidence interval of their

difference in birth weights.

3. In a study of the cariostatic properties of dentrifices 423 children were issued with dentrifice

A and 408 were issued with dentrifice D. After 3 years, 163 of the children on A and 119 of

the children on D had withdrawn from the trial. The authors suggest that the main reason for

withdrawal from the trial was because the children disliked the taste of the dentrifices. Do

these data indicate that one of the dentrifices is disliked more than the other?



66

Chapter 9

THE CHI-SQUARED (χχχχ2) TESTS

INTRODUCTION

The χ2 (Greek letter chi, pronounced kye) squared test is used to determine whether a set of

frequencies follow a particular distribution (e.g. Binomial, Normal, Poisson, etc). In its basic form it

tests whether the observed frequencies of individuals with some characteristics are significantly

different to those expected on some hypothesis.

THE 2X2 TABLE

Considering our previous example which arises from the comparison of two proportions. The resultsof the clinical trial in which the proportion of patients dying who received either treatment A or B

were compared, are presented in the following table:

Outcome

Died Survived Total

Treatment A 41 216 257

Treatment B 64 180 244

Total 105 396 501

Such a table is called a 2x2 contingency table since there are 2 rows and 2 columns. (In general we

can have an "rxc" contingency table, i.e. a table with r rows and c columns).

From the above table, the observed frequencies are 41, 216, 64 and 180. We need to obtain the

expected frequencies under the null hypothesis that "the row treatments have the same effect on

the outcome".

The expected frequencies are calculated in the following way:-

Expected frequency, E = row total x column total

grand total

For example, in the top left cell, where we observe 41 deaths the expected frequency under the null

hypothesis is

105x257 = 53.86

501

These expected frequencies are shown in the table below. They add up to the same grand total as the

observed frequencies.

We can then compare between the observed and the expected frequencies by looking at their

differences. We need also to consider the importance of the magnitude of the differences (eg. a

difference of 5 between 995 and 1000 is not as important as the "discrepancy" of size 5 between 2

and 7).



67

O E O - E (O - E)2

41

216

64

180

53.86

203.14

51.14

192.86

-12.86

12.86

12.86

-12.86

3.07

0.81

3.24

0.86

501 501.00 0.00 7.98

The chi-squared value is obtained by calculating (observed-expected)2 /expected for each of the four

cells in the contingency table and then summing them.

The general formula for χ2 is

χ2 = Σ (O-E)2

E

The percentage points of the chi-squared distribution are given at the back of this manual. The values

depend on the degrees of freedom.

If a contingency table has r rows and c columns then the degrees of freedom are given by

df = (r-1)(c-1). From our example the degrees of freedom is, df= (2-1)(2-1) = 1

Therefore from the above table, χ2 = 7.98 on 1 df.

The Chi-Squared Table at the end of this manual shows that the observed value of 7.98 is beyond the

0.01 point of the chi-squared distribution. Therefore p<0.01. We conclude as before that the

differences between the two treatments are highly significant.

Note that previous analysis yielded Z = -2.82. It can be shown that for d.f.=1, Z2 = χ2 i.e. 2.82

2 =

7.978. A short-cut formula for computing χ2 for a 2x2 table is given as follows.

Variable y

Variable x

x x Row marginal total

y1

y

a b

c d

r1 = a+b

r = c+d

Column marginal total s =a+c s =b+d n = a+b+c+d

χ2 = (ad -bc)2n

r1r

2s

1s

2



68

LARGER CONTINGENCY TABLES (rxc)

The following data show a sample of 10 years old children classified according to the state of oral

hygiene and type of school attended.

Oral hygiene

Type of school Good Fair+ Fair- Bad Total

Below average

Average

Above average

62

50

80

103

36

69

57

26

18

11

7

2

233

119

169

Total 192 208 101 20 521

Null hypothesis (Ho): There is no association between oral hygiene classification and type of school

attended. i.e the proportions of children attending below average, average and

above average schools are the same in children with good, fair+, fair- or bad

oral hygiene.

The expected number of children attending below average schools in a sample of 192 children with

good oral hygiene is

233 x 192 = 85.9

821

Similarly, the expected number of children attending below average schools out of 208 children with

fair+ oral hygiene is

233 x 208 = 93.0

521

Thus the expected frequencies are given in the table below:-

Oral hygiene

Type of school Good Fair+ Fair- Bad Total

Below average

Average

Above average

85.9

43.9

62.3

93.0

47.5

67.5

45.2

23.1

32.8

8.9

4.6

6.5

233.0

119.1

169.1

Total 192.1 208.1 101.1 20.0 521.2

We then apply the general formula for χ2 which is

χ2 = Σ(O-E)2

E

χ2 = (62-85.9)2 + (103-93.0)

2 + ........ + (2-6.5)

2

85.9 93.0 6.5

χ2 = 6.6 + 1.1 + 3.1+ .... + 0.0 + 6.7 + 3.1

χ2 = 31.43; df = (3-1)(4-1) = 6. Therefore, p< 0.001

Thus, we reject Ho and conclude that it is highly probable that there is an association between oral

hygiene and type of school attended.



69

The observed proportions of children with good oral hygiene:-

Type of school Proportion with good oral hygiene

Below average 62 = 0.27

233

Average 50 = 0.42

119

Above average 80 = 0.47

169

From above, we note that a large proportion of children with good oral hygiene attended above

average schools compared to those who attended below average schools.

Comments regarding the use of χχχχ2 tests

1. The Chi-squared test is only valid for comparing observed and expected frequencies. It is not

valid for other variables such as percentages, means, rates, etc.

2. The Chi-squared test is not valid for cells with expected frequencies less than 5. With very

small frequencies in a 2x2 table, the Fisher's exact test should be used.

EXERCISE

1. Pregnancies with retarded intra-uterine growth observed in a clinic were classified as to

whether threatened miscarriage occurred and whether the placenta was circumvallate (3

degrees, normal, minor, and major) at delivery.

Types of placenta

Miscarriage Normal Minor Major Total

Threatened

Not threatened

10

36

18

12

14

8

42

56

Total 46 30 22 98

Investigate the association between threatened miscarriage and the degree to which the

placenta is circumvallate at delivery.



70

Chapter 10

ASSOCIATION BETWEEN QUANTITATIVE VARIABLES

INTRODUCTION

Examples of quantitative variables have been seen in Chapter 2. The methods for analyzing the

relationships between two or more of such variables are linear regression and correlation.

In order to illustrate the methods of linear regression and correlation, we will use data on body weight

and plasma volume of eight healthy men.

The objective of the analysis is to see whether a change in plasma volume is associated with a change

in body weight.

Table 10.1: Plasma volume and body weight in eight healthy men.

Subject Body weight (kg) Plasma volume (l)

1

2

3

4

5

6

7

8

58.0

70.0

74.0

63.5

62.0

70.5

71.0

66.0

2.75

2.86

3.37

2.76

2.62

3.49

3.05

3.12

SCATTER DIAGRAM

When two related variables, also called bivariate data, are plotted on a graph in the form of points or

dots, the graph is called a scatter diagram. Each point on the diagram represents a pair of values, one

based on X-scale and the other based on Y-scale. Usually, making a scatter diagram is the first step in

investigating the relationship between two variables, because the diagram shows visually the shape

and degree of closeness of the relationship.

Values on the X-scale refer to the explanatory or independent variable and on the Y-scale refer to the

response or dependent variable. In situations where it is not clear which is the response variable, the

choice of axes is arbitrary.

In the above example, take the independent variable (x) to be body weight and the response variable

(y) to be plasma volume. The scatter diagram would look like the one drawn below.



71

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

56 58 60 62 64 66 68 70 72 74 76

Body weight (kg)

P

l

a

s

ma

v

o

l

u

m

e

Fig.11.1 Scatter diagram of plasma volume and body weight

Examination of plasma volume and body weight suggests that there is a trend of plasma volume to

increase with increasing body weight.

LINEAR REGRESSION

When a response variable appears to change with a change in values of the explanatory variable, we

may wish to summarize this relationship by a line drawn through the scatter of points.

Geometrically, any straight line drawn on a graph can be represented by the equation:y = a + bx

y refers to the values of the response (dependent) variable and x to values of the explanatory

(independent) variable. The equation tells us how these variables, x and y, are related. The constant 'a'

is the intercept, the point at which the line crosses the y-axis; that is, the value of y when x = 0.

The coefficient of x variable ('b') is the slope of the line. It tells us the average change (increase or

decrease) due to a unit change in x. It is sometimes called the regression coefficient.

Although we could draw the line through these points 'by eye', this would be a subjective approach

and therefore unsatisfactory. An objective approach, and therefore better, way of determining theposition of a straight line is to use the method of least squares. Through this method, we choose a

and b such that the vertical distances of the points from the line are minimized; or, we minimize the

sum of squares of these vertical distances - hence the term 'least squares'.

b is computed as follows:

b = Sxy = Σ(x -x )(y -y )

Sxx Σ(x -x )2



72

The numerator (Sxy) in the above formula can be simplified to get:

Σxy -(ΣxΣy)/n,

while the denominator (Sxx

) becomes Σx2 - (Σx)

2 /n

'a' is obtained through a = y - Sxy

Sxx

= y - bx

where y = Σy/n and x = Σx/n

Nowadays, the method of least-squares is available in most computer packages (even in some

calculators) and is usually referred to as linear regression. The resultant line is called the regression

line, which, estimates the average value of y for a given value of x. This line passes through the point

defined by the mean of the y and the mean of x.

Example:

When the above data on body weight and plasma volume were analyzed using the above formula, the

intercept and slop values are obtained as follows:

n = 8

Σx = 535

Σx2 = 35983.5

Σy = 24.02

Σy

2

= 72.798Σxy = 1615.295

Using the above formula we get:

b = 1615.296 - (535)(24.02)/8

35983.5 - (535)2 /8

= 8.96/205.38

= 0.043615

and

a = 3.0025 - 0.043615 x 66.875

= 0.0857

The resultant regression line is therefore

Plasma volume = 0.09 + 0.04 x Body weight.

This can be superimposed on the scatter diagram as shown below:



73

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

56 58 60 62 64 66 68 70 72 74 76

Body weight (kg)

P

l

a

s

ma

v

o

l

u

m

e

Fig.11.2 Scatter diagram of plasma volume and body weight showing the linear

regression line.

CORRELATION

Linear regression provides us with a straight line with which to summarize the relationship between

two variables. However, it does not tell us how closely the data lie on a straight line. The closeness

with which the points lie along the straight line is measured by the (Pearson's) correlation coefficient,

r.

r = Sxy = Σ(x -x )(y -y )

√SxxSyy √[Σ(x -x )2

Σ(y -y )2

]

As we noted with the regression coefficients' calculation, here also further simplification in the

denominator can be made when calculating the correlation coefficient:

Σ(x - x )2 = Σx

2 - (Σx)

2 /n

Σ(y -y )2 = Σy

2 - (Σy)

2 /n

Considering the above example,

Σ(x -x )(y -y ) = 1615.295 - 535 x 24.02/8 = 8.96

Σ(x -x )2 = 35983.5 - 5352 /8 = 205.38

Σ(y -y )2 = 72.798 - 24.02

2 /8 = 0.678

Therefore,

r = 8.96

√(205.38 x 0.678)

= 0.76



74

The (Pearson's) correlation coefficient has the following properties:-

1. It must lie between -1 and +1.

2. Positive values of r are obtained from upward sloping lines (b > 0) ie. y increasing with

increasing x values. Negative values of r are obtained from downward sloping lines (b<0) ie.

y decreasing with increasing x values.

3. If |r| = 1, the relationship is perfectly linear, ie. all points lie exactly on the regression line.

For perfect positive correlation, r = +1 and for perfect negative correlation r = -1.

4. If r lies between 0 and +1, or between 0 and -1 there is some scatter about the line. The less

the scatter, the closer |r| is to 1.

5. If r = 0, there is no linear relationship between the explanatory and the response variable.

This does not necessarily mean that there is no relationship at all, what it suggests is that if

the relationship exists is NOT linear. Example, a curved relationship between the

independent and dependent variables.

LOGISTIC REGRESSION

Introduction

We have so far dealt with simple linear regression with a continuous dependent variable. We can extend

the methods of simple linear regression to deal with more than one independent variables in the form of

multiple linear regression. That is the multiple regression model yields an equation in which the

dependent (outcome) variable is expressed as a combination of the independent (explanatory) variables.

This takes the following form:

y = β0+β1x1+...+βk xk , wherey is the dependent variable,

,1 x 1 x2, ...xk are the k explanatory variables (sometimes called predictor variables or covariates), and

β0, β1, ... βk are the regression coefficients.

As stated earlier on, these methods assume that the outcome variable of interest is numerical ( and

measured on a continuous scale), although the explanatory variables do not necessarily have to be

continuous.

It is very common, however, that in many kinds of medical research the outcome variable of interest is a

proportion ( or a percentage) rather than a continuous measurement.

We cannot use the ordinary multiple linear regression for the analysis of the individual and joint effectsof a set of explanatory variables on the outcome variable which is in the form of a proportion. Two

features of proportions based on counts (proportions based on measurements do not come in here) are

important when considering a statistical analysis:

(a) if the denominator of the proportion is n and the population value is π, the variance of this

proportion is π (1 - π )/n and for a given n, this depends upon the value of π ,being largest when π=1/2

and smaller when π is in the neighbourhood of 0 or 1. Hence the usual assumption of constant

variance σ2 can no longer hold.

(b) when we relate a proportion variable to other quantities by some form of a regression model, we

need to take seriously the fact that the true proportion cannot go outside the range 0 to 1. Because of



75

this the parameters have a limited interpretation and range of validity. We can instead use a similar

approach known as multiple linear logistic regression or just logistic regression.

Transformed proportions

We can overcome some of the problems in (b) above by looking at the response proportion on a

transformed scale which does not have the fixed boundaries at 0 and 1. Suppose p is the proportion of

individuals with some characteristic of interest. Or equivalently, let p be the probability of a subject

having a disease, then 1-p is the probability that the individual does not have the disease, and the odds

of having the disease is p/(1-p). As p changes from 0 to 1, the corresponding odds (i.e the ratio

p/(1-p)) change from 0 to ∞. So this transformation removes one of the boundaries. To remove the

other, we consider the odds on a logarithm (log) scale: the log odds will go from -∞ to +∞ as p goes

from 0 to 1. If we use the natural logs (i.e. logarithms to the base e), the transformation loge(p/1-p)) is

called the logit of p.

That is, logit(p) = loge pp1-

2, and this is the log odds. The estimated value of p can be derived from

logit(p), and always lies in the range 0 to 1.

If y =logit(p), then we have ey=p/ (1-p) and p=e

y /(1+e

y).

If we wish to compare risks of having some disease between individuals who are exposed to some

factor and those who are not exposed to the factor, we can do that using our model. We estimate

y1=logit( p1) for the group with the factor present, and y0=logit( p0) for the group without the factor.

Then we have y1- y0 = logit( p1)-logit( p0) = loge1

1

p

1- p

3-loge

0

0

p

1- p

= loge

1 0

0 1

p (1- p )

p (1- p )

4, which is

the log of the odds ratio.5

Regression with transformed proportions

Just as with ordinary regression, we can develop regression equations with transformed proportions as

the y-variate. When the logit transform is used, this procedure is called logistic regression. The

mathematical calculations involved are generally heavy, but this is taken care of by several computer

packages such as GLIM (Generalised Linear Interactive Modelling), SAS (Statistical Analysis

System), SPSS (Statistical Package for the Social Sciences), etc., which are available on a wide range

of computers, from mainframes to micros. Other computer packages that can handle logistic regression

analysis include Egret and a less familiar one known as Logxact which is particularly useful with smallsamples as it employs exact (as opposed to asymptotic) methods. With the exception of SAS, these

packages are fully available on some PC's in the Department of Epidemiology and Biostatistics,

although the GLIM version that we have appears only in limited features.

Simple logistic regression

The simple logistic regression model takes the form:

logit(p) = β0+β1x1, where β's are the regression coefficients and x a covariate.

Suppose we treat batches of about 50 mosquitoes with a series of concentrations of an insecticide,

record the number of mosquitoes killed, and obtain the following results:



76

Table 10.2 Number of mosquitoes killed in a batch a the dose of insecticide used

Dose of insecticide Number of mosquitoes

killed

Number of mosquitoes in a batch

10.27.7

5.1

3.8

2.6

4442

24

16

6

5049

46

48

50

Plotting the proportion killed in each batch against the dose of insecticide (a log scale for the dose or

concentration is usually appropriate) is a recommendable starting point. The simple linear regression

model will not fit very well the data, and it will lead us to expect responses which are negative for very

low doses or greater than 1 for high responses. In fitting a logistic regression model to these data,

working with ln(dose), the model is such that loge

p

p1-

6 = -4.887 + 3.104 ln(dose)

Multiple logistic regression The main difference between multiple logistic regression and ordinary multiple regression is that in the

former, we use a combination of the set of values of covariates to predict a transformed dependent

variable rather than the dependent variable on its original scale. Hence multiple logistic regression

model is represented in a similar manner as follows:

logit(p) = β0+β1x1+...+βk xk

An example in which multiple logistic regression can be used is provided by the data below, from an

article by Norton, P.G. and Dunn, E.V. (1985), Br. Med J., 291, 630-632. These relate hypertension to

smoking, obesity and snoring among men aged 40 years or over. In such a case logistic regression canbe used to see which of the factors smoking, obesity and snoring are predictive of hypertension.

Table 10.3 Hypertension in men aged 40+ in relation to smoking, obesity and snoring

Smoking Obesity Snoring No. of men No (%) with hypertension

0

1

0

1

0

1

0

1

0

0

1

1

0

0

1

1

0

0

0

0

1

1

1

1

60

17

8

2

187

85

51

23

5 ( 8%)

2 (11%)

1 (13%)

0 ( 0%)

35 (19%)

13 (15%)

15 (29%)

8 (35%)

Total 433 79 (18%)

Codes 0 and 1 are for No and Yes, respectively.

The full model is as shown in Table 10.4



77

Table 10.4 Logistic regression analysis of the hypertension data shown above in Table 10.3

Regression coefficient

(b)

Standard error

se (b)

z p-value

Constant

Smoking (x1)

Obesity (x2)

Snoring (x3)

-2.378

-0.068

0.695

0.872

0.380

0.278

0.285

0.398

0.24

2.44

2.19

0.810

0.015

0.028

The significance of each variable can be tested by treating z=b/se(b) as a standard normal deviate. Can

see that the P-value for smoking is very large (0.81) and hence we can say that, smoking has no

association with hypertension. Obesity and hypertension have a significant association with

hypertension. (in both cases P<0.05).

The analyses presented relate only to the main effects of obesity, smoking and snoring. We need to

consider also the possible presence of any important interaction between two of these factors. That is,we should investigate whether the effect of a factor depends on the level of another factor. In fact this

was done, and no interaction term was found to be statistically significant at any interesting level.

Omission of smoking in the model produced only minimal changes in the values of the other

coefficients. Hence the regression equation for this model is

logit(p) = -2.378 - 0.068x1 + 0.695x2 + 0.872x3, where

x1, x2, and x3 are codes for smoking, obesity, and snoring, respectively.

The above equation enables us to calculate the estimated probability of having hypertension, given

values of the three variables. In particular, we can obtain the odds ratio of hypertension associated

with any of the three factors. For example, let us consider variable x2, obesity:putting x2 = 1 (for presence of obesity), gives:

logit(p1) = -2.378 - 0.068x1 + 0.695 + 0.872x3, and

putting x2 = 0 (for non-obese), gives:

logit(p0) = -2.378 - 0.068x1 + 0.872x3.

As discussed earlier, the difference logit(p1)-logit(p0) = 0.695 is the log odds ratio. Hence the odds

ratio for hypertension associated with obesity = e0.695

= 2.00. In general, for any binary variable the

odds ratio (OR) can be estimated directly from the regression coefficient b as OR = eb. Confidence

limits follow immediately from the standard error of b and on taking b to have an approximate Normal

distribution.



78

EXERCISE

1. In the following data, four doses (on a long-scale) of vitamin D were tested, each on a

number of rats, and the results were assessed by means of a line test on bones in terms of

arbitrary scores.

Dose, x -0.45 0.25 0.77 1.46

(Mean) response, y 2.64 7.36 12.29 16.56

Σx = 2. 032; Σy = 38.85

x = 0.51; y = 9.71

Σx2 = 2.9895; Σy

2 = 486.4169

Σxy = 34.2929

(a) What diagram would be appropriate to examine these data?

(b) Calculate the regression equation that explains the dose-relationship over the range

of doses considered.



79

Chapter 11

VITAL STATISTICS AND DEMOGRAPHY

SOURCES OF DEMOGRAPHIC INFORMATION

Quality of data depend on many factors, one of which is the source of data. Sources of data have a

direct implication to the quality in terms of coverage, completeness and cost.

In this chapter we will concentrate on the following sources of demographic data:

(a) Census

(b) Vital registration

(c) Sample surveys

Census

Census is a systematic, routine way of counting subjects in a defined boundary or limits of land.

Census produces reports of individuals, population size and structure at a point in time.

Originally, census was limited to people only; but very recently we find censuses of agriculture,

business, livestock, housing, etc and sometimes done concurrently with population census.

The main characteristics of census is that it covers the whole population. No sampling is involved and

each person should be enumerated separately. Census must have a legal basis to make it complete and

compulsory. It reflects a single point in time although the whole process can take a longer time.

Basic questions which should appear on the questionnaire are name, age, sex, relationship with the

head of household, marital status, race/religion/ethnicity, education, occupation, employment status,

migration and amenities. Additional questions would depend on the availability and quality of vital

registration.

Population census can be carried out using either of the below mentioned methods:

1. De facto method:

This method designates persons to an area or location they are found during enumeration.

The population "in fact" there. The question of originality does not count here. It is

considered that, say, in 1988 Tanzania Population Census, Zanzibar had a population of641,000. This implies that, these people spent a night in Zanzibar before a census night.

Tanzania follows this method of enumeration.

2. De jure method:

De jure method of enumeration allocates persons to their normal residence. Meaning "people

who belong to the area or have the right to live there through citizenship, legal residence or

whatever". For example, a businessman working in Dar es Salaam but living in Arusha would

be assigned to Arusha on a de jure type of enumeration.

In Tanzania census is normally conducted after every ten years (decennial). This has a planning set-

back implication in a sense that population is changing rapidly because of births, deaths and



80

movements. To overcome this problem, normally inter-censal surveys or mini-surveys are conducted.

Example of such surveys is a 1991 Tanzania Demographic and Health Survey (TDHS). However,

further surveys on morbidity and for specific diseases can be conducted whenever a need arise.

Vital registration

Vital registration system is very common to developed counties where information on births,

marriages, deaths and migrations are collected. In developing countries the system whenever

employed is prone to incompleteness otherwise they are non-existent.

Questions in the vital registration system are always very simple and few. Consider hospital or health

service data here in Tanzania. Examples of such registrations are information on deaths found in

hospitals (death certificates). Birth and marriage data found in churches, mosques, Area

Commissioner's offices and migration data found at airports and borders.

The short-fall of vital registration system is that they are normally incomplete, selective samples,diverse and are practically unreliable. This does not mean that the system should be discarded,

instead it should be improved to remove these errors.

Sample surveys

Sample surveys give the same information in a more detailed form where vital registration system

does not exist. Only a sample of a population is involved. Sample surveys are thus, less costly when

compared with census.

The other advantages of surveys include the pace of collecting the information. They are relatively

quicker, more detailed than other systems like census. The cost of surveys are the errors introducedthrough sampling.

COMMON RATES IN PUBLIC HEALTH

(a) Measures of fertility:

There are four common measures of fertility. These are crude birth rate, general fertility rate,

gross reproductive rate and the total fertility rate.

i. Crude birth rate:

It is called the 'rate' but in practice it is the ratio defined as:

number of livebirths in a year x 1000

Total population

The rate is 'crude' because it does not take into account the risk of giving birth

according to age and sex differences.

ii. Fertility rate (General fertility rate):



81

The modern, conventional and much more acceptable 'rate' is the general fertility rate

or simply known as 'fertility rate'. The denominator is restricted to women at risk of

child-bearing rather than the general population. It is thus, defined by:

number of livebirths in a year x 1000

mid-year population of women aged 15-49

iii. The total fertility rate:

Total fertility rate means the average number of children a woman would have during

her reproductive life time given that the current specific fertility rates would still be

applicable at that time.

The total fertility rate is calculated from age-specific fertility rates (ASFRs). We get

the ASFRs when we divide the number of livebirths by the number of women in eachage interval. The following example shows required steps to calculate total fertility

rate (TFR).

Table 11.1: Number of livebirths and maternal age, Tanzania, 1988.

Age Number of women Number of live

births

Age specific fertility

rate

15-19

20-24

25-29

30-3435-39

40-44

45-49

665000

516000

459000

344000310000

229000

218000

21000

114000

118000

12300037000

6000

5000

0.0316

0.2209

0.2571

0.35760.1194

0.0262

0.0229

Total 2741000 424000 1.0357

Total fertility rate (TFR) equals the sum of all age specific fertility rates. In this case,

TFR = 1.0357 x 5 = 5.1785.

The sum of all ASFRs is multiplied times 5 because of the 5 year age group interval. If ages

are in single years, then there is no need to multiply this sum times 5.

The figure 5.1785 means on average each woman will have 5 children during her

reproductive period given that these age specific fertility rates will still apply until she

finishes her reproductive life.

Unlike the CBR and GFR, the calculation of TFR greatly depends on the age composition

although its use is independent of age distribution.



82

iv. Gross reproduction rate (GRR):

The gross reproductive rate is similar to the total fertility rate only that it considers

female live births rather than all births. This implies that, ASFR for GRR is based on

females.

GRR is interpreted as the average number of daughters a woman would have if she

survived to at least age 50 and experienced the given female ASFRs. A figure of 1

means that women are able to replace themselves while a figure of 2.0 means that the

population is doubling itself: each woman is on average producing two daughters.

Like the TFR, GRR is also a hypothetical measure. It is a period measure which does

not take into account the effect of female mortality either before age 15 or 15 to 50

years.

Referring to Table 11.1 above, given the number of female livebirths the GRR is

computed as follows:

Age Number of women Number of live

births

Female births Female

ASFR

15-19

20-24

25-29

30-34

35-39

40-44

45-49

665000

516000

459000

344000

310000

229000

218000

21000

114000

118000

123000

37000

6000

5000

11000

58000

60000

63000

19000

3000

3000

0.0165

0.1124

0.1307

0.1831

0.0613

0.0131

0.0138

Total 2741000 424000 217000 0.5309

Then GRR = 0.5309 x 5 = 2.6545

If the true sex ratio at birth is known, the GRR can be calculated using the TFR.

Thus, GRR = 5.1785 x 217/424 = 2.65

(b) Measures of morbidity:

i. Incidence rates:

Incidence measures the occupance of new cases of a disease in a population relative

to the number of persons at risk of contracting the disease. Therefore, the incidencerate is the rate of contracting a disease among those still at risk. It should be noted to

make a difference between being at risk of contracting the disease at the beginning of

a period and being at risk during the entire period. The former would refer to the

incidence risk and the latter to the incidence rate. Incidence rate is expressed as:

number of new cases of disease in a period of time x 10k

number of person-years of exposure in a period

where k = 2, 3, 4, 5 or 6 depending on the convenience or convention.



83



84

ii. Prevalence rates:

The prevalence measures the extent to which a disease exists in a population. It is

based on the total number of existing cases among the entire population. It can be

measured at an instant time (point prevalence) or looking for cases over a stretch

period of time (period prevalence).

• Point prevalence 'rate' = number of subjects with the disease at time t x 10k

Total population at time t

k = 2, 3, 4, 5 or 6 depending on the convenience or convention.

This index is prone to bias because cases with long duration have a higher

probability of being in the sample than those with short duration (Refer to screening

tests).

• Period prevalence 'rate' = total number of persons with disease during a period

x10k

Total population at mid-point of the interval

k = 2, 3, 4, 5 or 6 depending on the convenience or convention.

iii. Case fatality rate:

This is defined as the proportion of persons whose deaths is due to that particular

disease. In practice, a time limit is imposed, say, a proportion of cases who die from

malaria within the last two years.

iv. Specific rates:

These are rates which apply only to different geographical areas, to specific age

groups, to sex separately, to educational or marital stratification, etc. They are called

rates to that specification.

(c) Measures of mortality (death):

i. The crude death 'rate' = number of deaths in a year x 1000

mid-year population

When the denominator is approximated by the 'total population', then the index

obtained is not the actual rate but it turns out to be a 'crude mortality ratio'

ii. Infant mortality rate = number of deaths in a year under 1 year of age x 1000

number of livebirths in the same period

Infant mortality rate is often broken down into several indices depending on the age

categories of an infant.

• Neonatal mortality rate = number of deaths in a year under 28 days of age x1000

number of livebirths in a year



85

• The early neonatal mortality 'rate' =

number of deaths aged under one week in a year x 1000


• The late neonatal mortality 'rate' =

number of deaths between 1 - 4 weeks in a year x 1000


• The post neonatal mortality 'rate' =

number of deaths between 4 - 52 weeks in a year x 1000


iii. Stillbirth 'rate' = number of stillbirths in a year x 1000Total number of live and still-births in a year

iv. The perinatal mortality 'rate' =

number of still and neonate deaths in a year x 1000

Total number of live and still-births in a year

This index is important because it removes ambiguity when an outcome of pregnancy

dies very soon after delivery that it was born dead or alive.

STANDARDIZATION OF RATES

There are situations in which one intends to compare two or more different populations (geographical

areas, different hospital populations, experimental groups, etc.) using the already mentioned crude

rates (mortality, morbidity, fertility, etc). For instance, consider the crude mortality rate. The risk of

dying depends very much to age, and sometimes differs according to sex. Age specific death rates are

high for infants and very old people and low for middle age groups.

It is therefore true that the crude mortality rate and overall incidence rates will depend on age-sex

composition of the population concerned. Crude rates may be misleading indicators of the level of

mortality, morbidity, fertility, etc.when comparing two different populations if the populations do not

have the same age and sex structure.

Standardization provides an overall summary measure of the event occurrence which does not depend

on the age, sex, race or other distribution of the group. It therefore permits comparisons of those

events occurrence in two or more study groups which are adjusted for differences in the variable of

interest of the groups.

Two methods of standardization which are commonly used are: (1) Direct standardization and

(2) Indirect standardization.

1. Direct standardization:



86

The direct standardization is applied when, for example, age and sex rates from each of the

populations under study are applied to the standard population. The outcome is the age-sex adjusted

mortality, morbidity or fertility rates.

To use the indirect method, age and sex specific rates from the standard population are treated to the

study populations to give the standardized mortality, morbidity or fertility ratios.

The choice of which method to use depend very much on the availability of data. However, it has

been found that in general, direct standardization is used for prevalence while indirect method is used

for incidence.

The following information should be available when one intends to use direct standardization

method:

(a) Study population(s) characteristics eg. age, sex rates.(b) Standard population characteristics composition.

Once the two data have been obtained, (a) is applied to (b) to get, say, age and sex adjusted rate.

Since the standard population may or may not be one of those populations to be compared, it has to

be defined arbitrarily. But a common choice of standard population is the larger population from

which the index (study) population(s) came.

The detailed steps in calculating the standardized rate for the index population are:-

(a) Define your standard population.

(b) Apply the age, sex (or any other characteristic) specific rates of the index population to

the standard population to get what we would expect if the index population rates would

be rolling in the standard population.

(c) Add these cases to get the total expected number of subject in all age groups.

(d) Divide the total expected number of cases by the total in the standard population to get

the crude rate known as the standardized incidence rate for the index population.

Table 11.2a: Use of direct standardization to compare the prevalences of malaria in two villages,

A and B. The total population of the two villages has been taken as the standard.

VILLAGE A VILLAGE B

Age Number examined Number of cases Number examined Number of cases Total examined

0- 45- 9

10-14

15-29

30-49

50+

7194

27

30

36

29

38

6

18

28

23

3143

19

22

28

15

26

13

21

28

15

102137

46

52

64

44

Total 287 86 158 85 445

Table 11.2b: Proportion of A and B villagers with malaria.



87

Percent of villagers diseased Expected malaria cases using proportions of

Age Village A Village B Village A Village B

0-4

5-910-14

15-29

30-49

50+

4.23

8.5122.20

60.00

77.78

79.31

6.45

13.9568.42

95.45

100.00

100.00

4.31

11.6610.21

31.20

49.78

34.90

6.77

19.1131.47

49.63

64.00

44.00

Total 29.97 53.80 142.06 214.98

Total malaria cases for each age group are obtained by multiplying the proportion of villagers

diseased by the "standard" population (2 villagers total examined per age group). The results are

expected cases of malaria if the prevalence of malaria in village A and B respectively were applying.

The Age adjusted prevalence of malaria = Expected cases

Total standard populationThus,

Village A: 142.06 = 31.92

445

Village B: 214.98 = 48.31

445

Conclusion: Village B has higher prevalence (%) of malaria adjusted for age.

Considerations over direct standardization method:

(a) The direct method of standardization requires stratum-specific (eg. age-specific) rates in

the index population(s) which sometimes are not available. In this case the method can

not be applied.

(b) The number of cases observed in the study population should be large enough to give

meaningful stratum-specific rates necessary for direct standardization. Short of this, the

method can not be used.

(c) In general, comparing disease rates in two or more groups via direct standardization is

subject to less bias than the indirect method. The reasons for this will not be discussed

here.

2. Indirect method of standardisation:

Unlike direct standardisation, the indirect standardisation of rates entails the use of known specific rates

applied to the actual (observed) population characteristic of interest being compared to generate the

expected events.

When the observed events are divided by the expected events, the resultant is the "standardised

incidence ratio. In case of death, it will be the standardised mortality ratio (SMR).



88

SMR= Observed events in the index population

Expected events from the standard rates

In order to get the indirectly standardised rates for the index population being compared, the standard

crude rates are multiplied by the SMR.

Using the previous example of malaria cases in two villages, the standard rates (prevalence from two

villages combined) and the expected malaria cases are:

TOTAL VILLAGE A VILLAGE B

1 2 3 4 5

Age Rate per 100 Population Expected cases Population Expected cases

00 - 04

05 - 09

10 - 14

15 - 2930 - 49

50+

4.92

10.22

41.30

75.0087.50

86.50

71

94

27

3036

29

3.5

9.6

11.2

22.531.5

25.0

31

43

19

2228

15

1.5

4.4

7.8

16.524.5

13.0

Total 38.43 287 103.3 158 67.7

Dividing the observed number of malaria cases by the expected number would give us the the

standardised morbidity ratio.

Village A: 86 = 0.83 = 83%

103.3

Village B: 85 = 1.25 = 125%

67.7

We conclude using the standardised ratios by computing the actual age-adjusted morbidity rates for each

group to control for the effect of age:

Village A: 38.43 x 0.83 = 0.32 = 32%

100

Village B: 38.43 x 1.25 = 0.48 = 48%

100

The primary advantage of the indirect standardisation lies on the fact that this method does not

necessitate one to know the specific rates on the population being compared with, of which sometimesare not available.



89

EXERCISE:

Consider the following data for cancer mortality in the US in 1940 and 1986:

1940 1986

Age Population (000) Deaths Population (000) Deaths

00 - 04

05 - 14

15 - 24

25 - 34

35 - 44

45 - 54

55 - 64

65 - 74

75+

10,541

22,431

23,922

21,339

18,333

15,512

10,572

6,377

2,643

494

667

1,287

3,696

11,198

26,180

39,071

44,328

31,279

18,152

33,860

39,021

42,779

33,070

22,815

22,232

17,332

11,836

666

1,165

2,115

5,604

14,991

37,800

98,805

146,805

161,381

All ages 131,670 158,200 241,097 469,330

(a) Compute the crude cancer mortality rates for 1940 and 1986 and compare these rates.

(b)Using the US population in 1940 as the standard population, apply the direct method of

standardisation. What are the age-adjusted cancer mortality rates for 1940 and 1986?

(c)Using the age-specific cancer mortality rates for 1940 as the standard, apply the indirect method to

compute the standard mortality ratios for 1940 and 1986.

(d) How does the 1986 population compare with the 1940 population in terms of cancer mortality rate?



90

LIFE TABLES

Standardizes death rates which have been discussed above can be used to study the levels of mortality of

a population and also they can be used to compare the mortality experience of two or more populations.The standardized death rates are however a single figure index of the level of mortality. They contain no

direct information about mortality levels of different age groups. Life tables on the other hand can

summarize the mortality experience of a population at every age. They provide answers to questions

like, Suppose in a population 100,000 babies are born on the same day, how many babies will survive to

celebrate their 1st, 2nd etc birthdays assuming that babies die at the current rates of mortality?

However, the use of current mortality rates for this calculation is unreal since the babies would die at

the rates existing at the time when they die.

There are two distinct ways in which a life table may be constructed from mortality data:

In the current life table the survival pattern of a group of individuals is described subject through out

life to the age specific death rates currently observed in that particular community. This kind of a life

table is more often used for actuarial purposes and is less common in medical research.

On the other hand the cohort life table describes the actual survival experience of a group or 'cohort' of

individuals through time. The cohort may be babies born at the same time or an occupational group or

patients following a particular treatment etc. This type of life table has its most useful application in

medical research in follow-up studies eg an IUD retention study or more generally survivorship studies.

There are two types of life tables:

1. Full life table: Includes every single year of age from 0 to the highest age to which any person

survives.

2. Abridged life tables: usually considers only 5 year age groups except that the first five years of life

may be considered singly.

THE FULL (COMPLETE) LIFE TABLE:

The number of imaginary births considered in the life table is called the radix. This is usually the power

of ten but its value is determined by convenience and accuracy. A life table comprises a set of six

columns headed x, lx, dx, px, n qx and ex0

x - The age to which the numbers in other columns relate.lx - The number still surviving at axact age x.

dx - The number of deaths occurring between exact age x and exact age x+1, i.e dx =lx - lx+1

Px - This gives the probability of surviving from exact age x to exact age x+1.

Px = lx+1

lx

qx = The probability of dying between exact age x and exact age x+1.

qx = 1 - Px = 1 - lx+1 = dx

lx lx

ex0 gives the expectation of life at age x. i.e the average number of years to be lived by persons who

reach age x.



91

Example:

The intra-uterine device (IUD) is a method of contraception which is not well tolerated by all women

because of medical side effects such as abdominal pain, excessive bleeding, infection etc. If such side

effects occur the IUD is removed, although the IUD may also be removed for non medical reasons such

as the woman's wanting to become pregnant.

In an IUD retention study 2,479 women who had an IUD insertion during the month of January were

interviewed. They were asked whether they had retained their IUD until the 24th month during which

month it was the practise to arrange a special medical check-up and remove the IUD. For those who

their IUDs were removed, reasons for the removal and the duration of use were determined. The results

of the survey indicated that 180 women lost their IUD during the first month after insertion and 162

during the second month after insertion. The corresponding figures for the third to the twenty third

month were 90, 85, 76, 180, 162, 90, 85,76, 63, 51, 72, 85, 87, 72, 78,70, 65, 90, 92,89, 88.

This information can be represented on a life table as follows:

x qx Px lx dx ex

0 0.073 0.927 2479 180 11.75

1 0.07 0.93 2299 162 11.63

2 0.04 0.96 2137 90 11.47

3 0.04 0.96 2047 85 10.95

4 0.04 0.96 1962 76 10.41

5 0.09 0.91 1886 180 9.79

6 0.09 0.91 1706 162 9.79

7 0.06 0.94 1544 90 9.76

8 0.06 0.94 1454 85 9.34

9 0.06 0.94 1369 76 8.88

10 0.05 0.95 1293 63 8.38

11 0.04 0.96 1230 51 7.78

12 0.06 0.94 1179 72 7.09

13 0.08 0.92 1107 85 6.52

14 0.09 0.91 1022 87 6.02

15 0.08 0.92 935 72 5.73

16 0.09 0.91 863 78 4.96

17 0.09 0.91 785 70 4.40

18 0.09 0.91 715 65 3.78

19 0.13 0.87 650 90 3.11

20 0.16 0.84 560 92 2.50



92

21 0.19 0.81 268 89 1.93

22 0.23 0.77 379 88 1.27

23 1.00 0.00 291 291 0.50

We can use the life table above to calculate the following probabilities:

a) What is the probability that a woman who retained an IUD for the first six months will have it by the

end of the 20th month

=l20

l6

=560

1706

=0.33

b) What is the probability that a woman who retained an IUD up to the beginning of 10th month will

lose it after the 18th month?

= l18

l9

=715

1369

=0.52

ABRIDGED LIFE TABLES

These are constructed in a very similar way to that of complete life tables but instead of calculating d x

one calculates ndx where n is the length of the interval, not its start. For example, 5d10 refers to the

interval 10-15 years and not 5 - 10 years. ndx denotes the number of deaths occurring between exact age

x to exact age x + n.

Px and qx need to be altered similarly. npx is the probability of surviving between exact age x to exact

age x+n and nqx is the probability of dyeing between exact age x to exact age x + n.

lx remain unchanged and specifies the number of survivors from the radix who reach exact age x. e0x

remains unchanged as it refers to exact age x.

Example:



93

The following is an abridged life table for a certain country in a given year.

x lx 10dx 10qx 10px e0x

0

10

20

30

40

50

60

70

80

90

100

100000

97062

96215

94726

92859

88473

77456

54944

24669

3800

80

2938

847

1489

1867

4386

11017

22512

30275

20869

3720

80

0.029

0.009

0.015

0.012

0.047

0.124

0.291

0.551

0.846

0.979

1.000

0.971

0.991

0.985

0.988

0.953

0.876

0.709

0.449

0.154

0.021

0.000

68.03

59.94

50.42

41.13

3.86

23.15

15.7

10.20

6.57

5.21

5.00

THE POPULATION PYRAMID

Both age and sex compositions can be represented by a special type of bar graph called a population

pyramid. Population pyramids provide graphic statements of the age and sex distribution of a

population for a given year. It also shows the history of a population including effects of war, waves of

in- or out-migration, fluctuations in fertility and mortality, etc.

The pyramid is a two-way histogram with the X and Y axes reversed, so that frequencies are represented

by the horizontal axis and class intervals by the vertical axis. Thus the population pyramid consists of

two bar graphs (or histograms) placed on their sides and back to back (see Figure 11.1). The length of

each bar represents either the total number of the percentage size for each age or age group (it is

conventional to use either single-year or five-year age groups, though other groupings are possible).Pyramids are drawn showing the male population on the left hand side and the female population on the

right. The young are usually at the bottom and the old at the top. The last open-ended age group is

normally omitted entirely from the pyramid because it is impossible to draw truthfully.

Since every year cohorts normally lose part of their number through death or emigration, each bar is

usually shorter than the previous one, which gives the impression of a pyramid. A vertical comparison



94

of the bars shows the relative proportions of each age or age group in the population, while a horizontal

comparison shows the proportion of males and females in each age or age group.

The population pyramid can be based on absolute numbers or on percentages, the latter is more

common. The percentages are calculated using the total population of both sexes combined as the

denominator. If the percentages are calculated separately for males and females, then the pyramid will

present a false picture.

Types of Pyramids

There are several types of population pyramids but we will discuss the three more frequent forms.

The first class of population pyramid looks like an ordinary triangle. It reflects a population with

relatively high vital rates and a low median age. The age structure in the Netherlands in 1849 (Figure

11.2) fits this category.

The second variety has a broader base than the first. The 0-14 group is larger because this population is

beginning to control mortality but not fertility, and the most impressive gains in reduction are made in



95

the younger age groups. The steeply sloping sides reflect the large proportion of younger people and the

small percentage of aged people. The population structure of Hai District, Tanzania in 1994 fits this

description (Figure 11..3).

The third class of pyramids looks like a beehive. The numbers in this age-sex profile are roughly equal

for all age groups, gradually decreasing at the apex. Many Western populations conformed to this

pattern in the 1930's as seen in the Figure 11.4.

Population pyramids of Immigration

Population pyramids of immigrants reflect the age selectivity of migrant populations, and so tend to take

the form of a cross, with low populations at the top and bottom and a disproportionately large middle

section. The pyramid in Figure 11.5 showing immigrants reaching Dar es Salaam in 1994 is typical in

that the large proportion of young adults (in this case 15-29 years) clearly stands out, although there isquite a large proportion of under fives too.



96

EXERCISE

1. Discuss the advantages and disadvantages for each of the systems of collecting data.

2. During an epidemic of gastro-enteritis the number of cases and deaths in a city hospital and

all hospitals were as shown below:

CITY HOSPITAL ALL HOSPITALS

Age group (years) Cases Deaths Cases Deaths

Under 1

1 - 4

Above 4

240

140

20

41

21

8

1550

1880

500

341

235

16

Total 400 70 3930 592

(a) Calculate for the city hospital and all hospitals the case mortalities in each age group

and for all ages combined.(b) Find the standardized mortality rate (Comparative Mortality Ratio) for the city

hospital by the direct method using the case mortalities by age group of all hospitals

as the standard rates.



BIBLIOGRAPHY

1. Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research, 3rd Edition.Oxford: Blackwell Scientific Publications. (older versions are just as good for most topics).

2. Brownlee, A., Pathmanathan, I., Varkevisser, C. (1991). Health Systems Research Training

Series, Volume 2 (Part 1): Designing and Conducting Health Systems Research Projects.

Canada: IDRC.

3. Healy, M.J.R., Hills, M. and Osborn, J. (1987). Manual of Medical Statistics. Volume II.

London: London School of Hygiene and Tropical Medicine.

4. Hill, A. Bradford (1984). A Short Textbook of Medical Statistics, 11th Edition. London:

Hodder and Stoughton.

5. Kirkwood, B.R. (1988). Essentials of Medical Statistics, 1st Edition. London: Blackwell

Scientific Publications.

6. Newell C. (1988). Methods and Models in Demography.

Blackwell Scientific Publications.

7. Osborn, J. (1988). Manual of Medical Statistics. Volume I. London: London School of

Hygiene and Tropical Medicine.

8. Petrie, Aviva (1990). Lecture Notes on Medical Statistics, 2nd Edition. Oxford: Blackwell

Scientific Publications.

Biostat Manual

Documents