An Application of Cluster Analysis to HIV/AIDS Prevalence in Nigeria
Data clustering is a vital tool for identifying and grouping items with similar characteristics in a data set. Clustering may be done for understanding or for utility. Clustering for understanding, which is the focus of this work, deals with grouping items with common characteristics in order to better understand a dataset and to identify possible sub-groups of interest that could be formed from such data. HIV prevalence in Nigeria, which is measured biennially across the 36 states and the FCT (zoned under 6 geo-political zones), is a suitable data set for this purpose. Cluster analysis was implemented through the general methods of hierarchical clustering (agglomerative nesting) and partitioning (K-means). These techniques were implemented in R (a statistical computing language) to cluster HIV prevalence rates in Nigeria, so as to find out which states could be considered to be in the same category and to investigate the concentration of the disease with respect to geo-political zones. The relative type of validation was used for cluster validation (a mechanism for evaluating the correctness of clustering).
Raheem Kabir Kola, Department of Statistics, University of Abuja, Nigeria, 2012
AN APPLICATION OF CLUSTER ANALYSIS TO HIV/AIDS
PREVALENCE IN NIGERIA
RAHEEM KABIR KOLA
DEPARTMENT OF STATISTICS
FACULTY OF SCIENCE, UNIVERSITY OF ABUJA,
NIGERIA
IN PARTIAL FULFILMENT FOR
THE AWARD OF BACHELOR OF SCIENCE (B.Sc.) DEGREE IN
STATISTICS
2012
CHAPTER ONE
INTRODUCTION
1.0 Introduction
Cluster analysis divides data into groups called clusters such that items in the
same cluster are similar while items in different clusters are distinct. Such
grouping may be done for understanding or for utility. Clustering for
understanding has to do with understanding the behaviour of the data, or knowing
the characteristics that data items have in common; this is the focal point of
this work. Clustering for utility, meanwhile, has to do with classifying data for
further usage or analysis.
Over the years, the world has suffered many deaths due to the Human
Immunodeficiency Virus (HIV). Several institutes, governments and international
organisations have invested heavily in measures to curb the spread of the
disease, but HIV still poses a significant challenge, particularly in Africa.
Nigeria, the case study here, has the second highest number of people living with
HIV (CIA World Factbook, 2012), with an estimated 3.3 million people infected,
which amounts to about 2% of the entire population. The country has 36 states
zoned under 6 geo-political zones. The HIV prevalence rate is measured
periodically across these states by means of a sentinel survey.
This research work applied the technique of cluster analysis, using R (a
statistical computing environment), to classify the spread of the virus among
the populace of Nigeria. Such a classification would support better decisions
about which parts of the country need particular attention.
1.1 Objective
This study intends to implement cluster analysis using R (statistical computing)
and to understand the nature of the state-wise spread of the Human
Immunodeficiency Virus by classifying the prevalence rate among states and
investigating whether states in a particular geographical location have similar
prevalence. These goals are itemised as follows:
1. Categorize states according to level of prevalence
2. Determine the areas where we have high concentration or dominance
3. Implement Cluster Analysis using R (Statistical Computing) and
4. Make further recommendation on the prevalence of most affected areas.
1.2 Scope
This study is concerned with identifying the categories into which Nigerian
states could be grouped according to their rate of HIV prevalence, by means of
cluster analysis. It does not investigate the factors that could be responsible
for the concentration of the disease in particular geographical areas.
The cluster analysis was based on the specific algorithms of agglomerative
nesting (hierarchical method) and K-means (partitioning method). All
computations were made using the cluster library available in R, so less
attention is given to the detailed theoretical computations usually involved in
clustering techniques. Graphs (dendrogram and banner plot) were used in
appropriate places. Advanced concepts, such as model-based clustering and
numerical cluster validation, are beyond the scope of this work.
The relative type of cluster validation was used in an attempt to validate the
result of our clustering.
1.3 Limitation
Research works that depend on data acquired over long periodic cycles and
involving a large population like Nigeria's are likely to be challenged by one or
two drawbacks.
A major restriction to this study is the inability to access sufficient and
consistent data, such as quarterly prevalence rates, a problem attributable to an
inefficient medical system. The HIV sentinel survey is normally conducted
biennially, forming a sequence with a 2-year lag from 1991, but no survey was
conducted in 2007; the next survey took place in 2008, which resulted in a 3-year
lag and altered the sequence. Also, until 1999 the HIV sentinel survey did not
cover all the states in Nigeria. For these reasons, average prevalence rates were
used (the average prevalence rate from 1999 to 2003 and from 2005 to 2010).
Unwillingness of infected persons to disclose their HIV status due to
stigmatization is another factor of concern, as it restricted the source of data
to antenatal patients. The available data may therefore not be truly
representative of the whole population.
1.4 Definition of terms
AIDS: Acquired Immune Deficiency Syndrome
Cluster: a group of objects that have some common characteristics.
Cluster Membership: Items in a particular cluster have same cluster membership.
Cluster Size: number of items or members in a cluster.
Console: command line (not graphical)
Class: In object-oriented programming, a class is a construct that serves as a
template for creating, or instantiating, specific objects within a program (plural:
classes).
Dendrogram: a tree-like structure used to display the result of hierarchical
clustering.
Function: A chunk of code written to achieve a particular goal in programming;
it usually returns a result.
GUI: Graphical User Interface
HIV: Human Immunodeficiency Virus, a retrovirus that causes AIDS.
HSS: HIV Sentinel Survey
IDE: Integrated Development Environment (an interface that enhances
programming and code writing)
Object: an instance of a class (see Class above).
Seroprevalence: the number of persons in a population who test positive for a
specific disease based on serology (blood serum) specimens.
Sentinel survey: a survey system in which a designated group of reporting sources
(hospitals, agencies, etc.) from a certain geographical location agrees to report all
cases of one or more notifiable conditions for the purpose of acquiring high
quality data.
CHAPTER TWO
LITERATURE REVIEW
2.0 HIV prevalence in Nigeria
The HIV and AIDS pandemic has constituted the greatest health challenge in
Nigeria over time. In 2012 Nigeria was adjudged to have the second highest
burden of HIV in the world after South Africa.
Since the first reported case of HIV and AIDS in Nigeria in 1986, the epidemic has
continued to deal a heavy blow to the country, with about 3.3 million
Nigerians currently infected. To respond to this epidemic, the Federal Government
of Nigeria put in place several programmes aimed at controlling and mitigating its
effect. A key part of these intervention programmes is the continuous monitoring
of the HIV epidemic via a biennial sentinel survey among pregnant women attending
antenatal clinics in Nigeria.
In the African region, active HIV sero-surveillance using pregnant women
attending antenatal clinics as the survey population is employed, in line with the
World Health Organization (WHO) and the Joint United Nations Programme on HIV
and AIDS (UNAIDS). The HIV sentinel sero-surveillance survey has been conducted
biennially in Nigeria since 1991.
HIV prevalence in Nigeria rose consistently from 1.8% in 1991 to 5.8% in 2001,
before declining to 5% in 2003 and 4% in 2005. The report of the 2005 survey
further reaffirms that no state or community is spared by the epidemic. There are
wide variations in HIV prevalence among states and between urban and rural areas
across the country. Data from the sentinel surveys may not support direct
comparisons between the aggregate figures obtained in different surveys, because
the location and number of survey sites differ.
2.1 Research works on HIV /AIDS
Several agencies are responsible for the monitoring and control of HIV/AIDS.
Examples include NACA (National Agency for the Control of AIDS) and UNFPA (United
Nations Fund for Population Activities). These agencies organise periodic surveys
to gather the data needed to investigate the nature of HIV prevalence, and also
conduct research based on the specifications of international organisations such
as WHO (World Health Organisation) and UNAIDS (Joint United Nations Programme on
HIV/AIDS).
There is little or no literature on the study of HIV prevalence using cluster
analysis, though certain clustering techniques were employed in the available
works, mainly from the Federal Ministry of Health and some HIV/AIDS agencies in
the country.
The Federal Ministry of Health's recent research on HIV prevalence presented a
clustering of HIV spread by state in a map view, as found in the technical report
of the National HIV Sero-prevalence Sentinel Survey 2010 (figure 2.0 below).
Fig 2.0: Geographical distribution of HIV Prevalence by States (HSS 2010)
Source: Federal Ministry of Health, 2010 National HIV Sero-prevalence
Sentinel Survey - Technical Report
2.2 Cluster Analysis
Cluster analysis is the study of dividing data elements into groups such that the
elements in each group are as homogeneous as possible.
The most common use of cluster analysis is classification. That is, subjects are
separated into groups such that each subject is more similar to other subjects in
its group than to subjects outside the group. Cluster analysis is thus concerned
ultimately with classification and represents a set of techniques that are part of
the field of numerical taxonomy (Frank and Green, [1968]; Punj and Stewart
[1983]; Aldenderfer and Blashfield [1984]).
Although clustering is often confused or used interchangeably with classification,
the two can be distinguished. In classification the objects are assigned to
predefined classes or groups, whereas in clustering the classes or groups
themselves have to be found by the algorithm. Thus classification is a supervised
approach while clustering is unsupervised.
The term cluster analysis does not identify a particular statistical method or
model, as do discriminant analysis, factor analysis, and regression. We do not have
to make any assumptions about the underlying distribution of the data. Using
cluster analysis, groups of related variables can be formed, similar to the case of
factor analysis. There are numerous ways to sort cases into groups. The choice of
a method depends on the size of the data and other factors. Methods commonly
used for small data sets are impractical for data files with thousands of cases;
for such data, CLARA (Clustering Large Applications) was suggested by Kaufman and
Rousseeuw (1990).
There are generally two major approaches to clustering: the hierarchical method
and the partitioning method.
2.2.1 Hierarchical Clustering
Hierarchical procedures are characterized by the construction of a hierarchy
or tree-like structure. In some methods every single point starts as a unit
(single-point) cluster. At the next level the two closest points are placed in a
cluster. At the following level a third point joins the first two, or else a
second two-point cluster is formed, depending on the criterion used for
assignment. A popular technique for hierarchical clustering is agglomerative
nesting, which is discussed in a subsequent section. The results of hierarchical
clustering are usually presented in a dendrogram. This clustering technique relies
on a dissimilarity (distance) matrix.
2.2.1.1 Dissimilarity Matrix:
A dissimilarity matrix (also called a distance matrix) describes the pairwise
distinction between $p$ objects. It is a square symmetric $p \times p$ matrix whose
$(a,b)$th element is equal to the value of a chosen measure of distinction between
the $a$th and the $b$th object. The diagonal elements are either not considered or
are usually equal to zero, i.e. the distinction between an object and itself is
postulated as zero.
If we denote distance by $d$, then $d(a,b)$ is the distance between object $a$ and
object $b$ (the value of the distance depends on the chosen metric, which may be
Euclidean or Manhattan as explained in further discussion). The pairwise distance
matrix is given as:
\[
D = \big[d(a,b)\big] =
\begin{pmatrix}
0 & d(1,2) & \cdots & d(1,p) \\
d(2,1) & 0 & \cdots & d(2,p) \\
\vdots & \vdots & \ddots & \vdots \\
d(p,1) & d(p,2) & \cdots & 0
\end{pmatrix}
\]
In order to decide which clusters should be combined (for agglomerative methods),
or where a cluster should be split (for divisive methods), a measure of
dissimilarity between sets of observations is required. In most methods of
hierarchical clustering, this is achieved by use of an appropriate metric and a
linkage criterion, which specifies the dissimilarity of sets as a function of the
pairwise distances of observations in the sets. A metric is a measure of distance
between pairs of observations. Examples of some commonly used metrics are:
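For two objects $a = (a_1, \dots, a_m)$ and $b = (b_1, \dots, b_m)$ measured on
$m$ variables:
Manhattan distance:
\[ d(a,b) = \sum_{i=1}^{m} |a_i - b_i| \]
Euclidean distance:
\[ d(a,b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2} \]
Squared Euclidean distance:
\[ d(a,b) = \sum_{i=1}^{m} (a_i - b_i)^2 \]
Maximum distance:
\[ d(a,b) = \max_{1 \le i \le m} |a_i - b_i| \]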
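As an illustration, these four metrics can be computed with the `dist()` function
in base R. The two objects and their values below are made up for the example and
are not taken from the prevalence data.

```r
# Two hypothetical objects measured on three variables (illustrative values only)
x <- rbind(a = c(1, 4, 2),
           b = c(3, 1, 2))

manhattan <- dist(x, method = "manhattan")   # sum of absolute differences
euclidean <- dist(x, method = "euclidean")   # square root of summed squared differences
sq_euclid <- euclidean^2                     # squared Euclidean distance
maximum   <- dist(x, method = "maximum")     # largest absolute difference

print(c(manhattan = as.numeric(manhattan),
        euclidean = as.numeric(euclidean),
        squared   = as.numeric(sq_euclid),
        maximum   = as.numeric(maximum)))

# The full dissimilarity matrix is symmetric with zeros on the diagonal
print(as.matrix(manhattan))
```

With more than two rows, `dist()` returns all pairwise distances at once, which is
the form that hierarchical clustering functions expect as input.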
2.2.1.2 Linkage Criteria
A linkage criterion determines how clusters should be formed: it specifies the
distance (dissimilarity) between two clusters as a function of the pairwise
distances between the observations they contain. Examples of commonly used
linkage criteria are single linkage, complete linkage, average linkage and Ward's
method. Some of these methods are discussed below.
Single Linkage Rule
The single linkage or minimum distance rule starts out by finding the two points
with the minimum distance. These are placed in the first cluster. At the next stage
a third point joins the already-formed cluster of two if the minimum distance to
any of the members of the cluster is smaller than the distance between the two
closest unclustered points. Otherwise, the two closest unclustered points are
placed in a cluster. The process continues until all points end up in one cluster.
The distance between two clusters is defined as the shortest distance from a point
in the first cluster to a point in the second cluster.
Complete Linkage Rule
The complete linkage option starts out in just the same way by clustering the two
points with the minimum distance. However, the criterion for joining points to
clusters or clusters to clusters involves maximum (rather than minimum) distance.
That is, a third point joins the already formed cluster of two if the maximum
distance to any of the members of the cluster is smaller than the distance
between the two closest un-clustered points. In other words, the distance
between two clusters is the longest distance from a point in the first cluster to a
point in the second cluster.
Average Linkage
The average linkage option starts out in the same way as the other two. However,
in this case the distance between two clusters is the average distance from points
in the first cluster to points in the second cluster.
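The three linkage rules differ only in how they summarise the pairwise distances
between two clusters. The following is a minimal sketch in R; the two clusters and
their values are hypothetical and chosen only to make the arithmetic easy to check.

```r
# Two hypothetical clusters of one-dimensional observations (illustrative values)
cluster1 <- c(1, 2, 4)
cluster2 <- c(7, 10)

# All pairwise absolute (Manhattan) distances between members of the two clusters
pairwise <- abs(outer(cluster1, cluster2, "-"))

single_link   <- min(pairwise)    # single linkage: shortest pairwise distance
complete_link <- max(pairwise)    # complete linkage: longest pairwise distance
average_link  <- mean(pairwise)   # average linkage: mean of all pairwise distances

print(c(single = single_link, complete = complete_link, average = average_link))
```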
2.2.1.3 Dendrogram
A dendrogram (from Greek dendron "tree", -gramma "drawing") is a tree diagram
frequently used to illustrate the arrangement of the clusters produced by
hierarchical clustering. Dendrograms are often used in computational biology to
illustrate the clustering of genes.
The x-axis of a dendrogram usually shows the objects or clusters, while the y-axis
(height) denotes the distance at which clusters are merged.
Fig 2.1: Sample Dendrogram (x-axis: objects; y-axis: distance or height)
2.2.2 Partitioning Method
Partitioning methods generally result in a set of M clusters, with each object
belonging to one cluster. Each cluster may be represented by a centroid or a
cluster representative; this is some sort of summary description of all the objects
contained in the cluster. The precise form of this description depends on the type
of object being clustered. Where real-valued data are available,
the arithmetic mean of the attribute vectors of all objects within a cluster
provides an appropriate representative; alternative types of centroid may be
required in other cases, e.g., a cluster of documents can be represented by a list of
those keywords that occur in some minimum number of documents within the
cluster. If the number of clusters is large, the centroids can be further
clustered to produce a hierarchy within a dataset. K-means is the most widely
known partitioning technique and is discussed below.
2.2.2.1 K Means:
K-means (MacQueen, 1967) is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The procedure follows a
simple way to classify a given data set through a certain number of
clusters (denoted by k) fixed a priori. The main idea is to define k centroids, one
for each cluster. These centroids are placed as far away from each other as
possible. The next step is to take each point belonging to the data set
and associate it with the nearest centroid. When no point is pending, the first
step is completed and an early grouping is done. At this point we need to
re-calculate k new centroids as the barycentres (means) of the clusters resulting
from the previous step. After we have these k new centroids, a new binding has to
be done between the same data set points and the nearest new centroid. A loop has
thus been generated. As a result of this loop, the k centroids change their
location step by step until no more changes occur; in other words, the centroids
do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a
squared error function.
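For reference, the squared error function is commonly written as below. The
notation is an assumption introduced here, not taken from the original text:
$C_j$ denotes the set of objects assigned to cluster $j$, $\mu_j$ its centroid,
and $x_i$ an object in the data set.

```latex
\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]
```

Each reassignment step and each centroid update can only keep $J$ the same or
reduce it, which is why the loop described above eventually stops.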
A. Algorithm for K-means
1. Place K points into the space represented by the objects that are being
clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K
centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a
separation of the objects into groups from which the metric to be
minimized can be calculated.
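The four steps above can be sketched directly in R. This is a minimal
illustration under simplifying assumptions (one-dimensional toy data, initial
centroids picked by hand, no handling of empty clusters); it is not the routine
used for the analysis in this work.

```r
# Toy one-dimensional data with two visible groups (illustrative values only)
x <- c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9)
K <- 2

# Step 1: place K initial centroids (here, simply the first and last observation)
centroids <- x[c(1, length(x))]

repeat {
  # Step 2: assign each object to the closest centroid
  assignment <- sapply(x, function(xi) which.min(abs(xi - centroids)))
  # Step 3: recalculate each centroid as the mean of its assigned objects
  new_centroids <- sapply(seq_len(K), function(j) mean(x[assignment == j]))
  # Step 4: stop once the centroids no longer move
  if (all(new_centroids == centroids)) break
  centroids <- new_centroids
}

# Squared error that the procedure tries to minimise
sse <- sum((x - centroids[assignment])^2)
print(list(centroids = centroids, assignment = assignment, sse = sse))
```

In practice the built-in `kmeans()` function performs the same kind of iteration
with more careful initialisation and stopping rules.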
B. Benefits of using K-means
1. With a large number of variables, K-Means may be computationally faster
than hierarchical clustering (if K is small).
2. K-Means may produce tighter clusters than hierarchical clustering,
especially if the clusters are globular.
C. Drawbacks of using K-means
1. It is difficult to compare the quality of the clusters produced (e.g. different
initial partitions or values of K affect the outcome).
2. The number of clusters is fixed in advance, and it can be difficult to predict
what K should be.
3. It does not work well with non-globular clusters.
4. Different initial partitions can result in different final clusters. It is helpful to
rerun the program using the same as well as different K values, to compare
the results achieved.
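Rerunning with different starting partitions and different K values, as suggested
in the last point, can be sketched with the `kmeans()` function in base R, whose
`nstart` argument repeats the fit from several random initial partitions and
keeps the best one. The data, the seed and the choice of 20 restarts are
arbitrary illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the example is repeatable

# Toy data: three well-separated one-dimensional groups (illustrative values only)
x <- matrix(c(1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9), ncol = 1)

# Compare the total within-cluster sum of squares for several values of K,
# using 20 random initial partitions per value and keeping the best one
for (k in 2:4) {
  fit <- kmeans(x, centers = k, nstart = 20)
  cat("K =", k, " total within-cluster SS =", fit$tot.withinss, "\n")
}
```

A sharp drop in the total within-cluster sum of squares followed by only small
further gains is often read as a hint about a reasonable K.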
2.3 Cluster Validation
Since there are diverse ways of conducting cluster analysis, and each method has
a combination of possible alternatives whose selection depends on the user's
discretion, there is a need to check the resulting output. Cluster
validation has to do with the evaluation of the result obtained from a particular
clustering technique.
Conducting a cluster analysis is not an easy task, but cluster validation has
proven to be even more difficult.
“The validation of clustering structures is the most difficult and frustrating part of
cluster analysis. Without a strong effort in this direction, cluster analysis will
remain a black art accessible only to those true believers who have experience
and great courage.” - Algorithms for Clustering Data, Jain and Dubes.
Techniques for validating the result of cluster analysis are generally categorized as
intrinsic, extrinsic and relative. The intrinsic technique uses certain properties
present in the result of the conducted analysis for evaluation, while the extrinsic
technique uses parameters outside the result for evaluation. The relative technique
evaluates a clustering structure by comparing it to other clustering schemes. The
relative type of cluster validation was used in an attempt to validate the result of
our clustering.
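One simple way to compare two clustering schemes, in the spirit of relative
validation, is to cross-tabulate their cluster memberships. The sketch below is
an illustration only: the data are hypothetical, and the choice of average
linkage, two clusters and the seed are assumptions made for the example.

```r
# Toy data: two well-separated one-dimensional groups (illustrative values only)
x <- matrix(c(1.0, 1.3, 0.7, 1.1, 6.0, 6.4, 5.8, 6.1), ncol = 1)

# Clustering scheme 1: agglomerative hierarchical clustering cut into 2 groups
hc_groups <- cutree(hclust(dist(x, method = "manhattan"), method = "average"), k = 2)

# Clustering scheme 2: K-means with K = 2
set.seed(1)  # arbitrary seed for repeatability
km_groups <- kmeans(x, centers = 2, nstart = 10)$cluster

# Cross-tabulate the two sets of memberships; agreement shows up as a table
# in which every row and every column has a single non-zero cell
agreement <- table(hierarchical = hc_groups, kmeans = km_groups)
print(agreement)
```

The cluster labels themselves are arbitrary, so two schemes agree when the table
can be rearranged into a diagonal, not only when the labels match.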
CHAPTER THREE
METHODOLOGY
Cluster analysis is the study of dividing data items into groups such that the
elements of each group are as homogeneous as possible. There are basically
two major approaches to grouping in cluster analysis: the hierarchical
and partitioning methods.
This work focuses on agglomerative nesting and K-means, which are
hierarchical and partitioning clustering methods respectively.
3.0 Agglomerative Nesting (Hierarchical)
Agglomerative nesting, often called AGNES, works in a bottom-up
manner until a single rooted tree-like diagram (dendrogram) is formed. The algorithm
is as follows:
1. Initially, put each object in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two
original ones.
4. Repeat the above two steps until there is only one remaining cluster in the
pool.
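The steps above can be sketched as a short loop in R. This is a naive
illustration, not the AGNES implementation itself: the data are hypothetical, and
single linkage is assumed for the distance between clusters simply because it is
the easiest rule to write down.

```r
# Toy data: five one-dimensional objects (illustrative values, not thesis data)
x <- c(a = 1.0, b = 1.5, c = 5.0, d = 5.2, e = 9.0)
d <- as.matrix(dist(x, method = "manhattan"))

# Step 1: every object starts in its own cluster
clusters <- as.list(names(x))
heights  <- c()   # distance at which each merge happens

# Distance between two clusters under single linkage (smallest pairwise distance)
cluster_dist <- function(ci, cj) min(d[ci, cj])

while (length(clusters) > 1) {
  # Step 2: find the pair of clusters with the smallest distance
  pairs <- combn(length(clusters), 2)
  dists <- apply(pairs, 2, function(p) cluster_dist(clusters[[p[1]]], clusters[[p[2]]]))
  best  <- pairs[, which.min(dists)]
  cat("merge {", clusters[[best[1]]], "} and {", clusters[[best[2]]],
      "} at distance", min(dists), "\n")
  heights <- c(heights, min(dists))
  # Step 3: replace the two clusters with their union
  merged   <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))
  # Step 4: the loop repeats until a single cluster remains
}
```

The recorded merge distances are what a dendrogram displays on its height axis.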
3.0.1 Distance Measure
In order to compute a clustering, a measure of dissimilarity between sets of
observations is required. This is sometimes referred to as a proximity matrix (a
matrix of distance measures, which could be either similarity measures or
dissimilarity measures, as in the case of hierarchical clustering). The proximity
matrix is obtained by use of an appropriate metric such as the Euclidean or
Manhattan distance. In the formulas below, $a = (a_1, \dots, a_m)$ and
$b = (b_1, \dots, b_m)$ denote two objects measured on $m$ variables.
Euclidean distance: an inter-point distance that takes the magnitude of the
data values into account and therefore preserves more information about
the data. Euclidean distance is the length of the shortest path between two points;
the variables should be standardized before it is used. Mathematically, it is given as:
\[ d(a,b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2} \]
Manhattan distance:
The Manhattan distance, also known as the city-block distance, is the sum of
distances along each dimension. This distance measure corresponds to the
distance travelled between two points along a grid of city blocks. By formula:
\[ d(a,b) = \sum_{i=1}^{m} |a_i - b_i| \]
The cluster analysis computation of this work uses the Manhattan distance.
3.0.2 Ward’s Linkage Criterion
A linkage criterion determines how clusters should be formed; it computes the
distance between clusters.
Ward's method is a criterion applied in hierarchical cluster analysis. Ward's
minimum variance method is a special case of the objective function approach
originally presented by Joe H. Ward, Jr. (1963). Ward suggested a general
agglomerative hierarchical clustering procedure, where the criterion for choosing
the pair of clusters to merge at each step is based on the optimal value of an
objective function.
Ward's minimum variance criterion minimizes the total within-cluster variance. At
each step, the pair of clusters whose merger gives the smallest increase in total
within-cluster variance is merged.
This method is distinct from other methods because it uses an analysis of variance
approach to evaluate the distances between clusters. In general, this method is
considered very efficient.
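To make the criterion concrete, the sketch below computes the increase in the
within-cluster sum of squares caused by merging two clusters, and checks it
against the standard closed form for that increase (the product of the cluster
sizes over their sum, times the squared distance between the centroids). The
clusters are hypothetical one-dimensional examples and the helper `wss` is
introduced here only for illustration.

```r
# Within-cluster sum of squares of a one-dimensional cluster (illustrative helper)
wss <- function(v) sum((v - mean(v))^2)

# Two hypothetical clusters (illustrative values only)
A <- c(1, 2, 3)
B <- c(7, 9)

# Increase in total within-cluster sum of squares if A and B are merged
increase <- wss(c(A, B)) - (wss(A) + wss(B))

# Closed form: |A||B| / (|A| + |B|) * (mean(A) - mean(B))^2
closed_form <- length(A) * length(B) / (length(A) + length(B)) * (mean(A) - mean(B))^2

print(c(increase = increase, closed_form = closed_form))
```

At each step Ward's method would evaluate this quantity for every pair of current
clusters and merge the pair with the smallest value.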
3.0.3 Agglomerative Coefficient
Agnes computes a coefficient, called Agglomerative Coefficient (AC), which
measures the clustering structure of the data set.
Agglomerative Coefficient is a dimensionless quantity, varying between 0 and 1.
When AC is close to 1 there is an indication that a very clear structuring has been
found. Otherwise when AC is close to 0 it indicates that the algorithm has not
found a natural structure. In other words, the data consists of only one big cluster.
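A minimal sketch of obtaining the agglomerative coefficient with the `agnes()`
function of the cluster package is given below. The toy data matrix is
hypothetical; the Manhattan metric and Ward's linkage are used to mirror the
choices described in this chapter.

```r
library(cluster)  # provides agnes()

# Toy data matrix: six objects on one variable (illustrative values only)
x <- matrix(c(1.0, 1.2, 0.9, 5.0, 5.1, 5.3), ncol = 1,
            dimnames = list(paste0("obj", 1:6), "value"))

# Agglomerative nesting with Manhattan distance and Ward's linkage
fit <- agnes(x, metric = "manhattan", method = "ward")

print(fit$ac)                             # agglomerative coefficient, between 0 and 1
groups <- cutree(as.hclust(fit), k = 2)   # cut the tree into two clusters
print(groups)
```

Calling `plot(fit)` on the fitted object draws the banner plot and the
dendrogram for the same clustering.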
3.1 K-Means (Partitioning)
This is a method of cluster analysis which aims to partition n observations into k
clusters in which each observation belongs to the cluster with the nearest mean.
K-means is a prototype-based approach that requires the number of clusters
to be specified before the analysis is conducted.
Determining the number of clusters to specify is a major challenge of using
K-means. There are several approaches to obtaining an appropriate
number; these are not discussed since they are beyond the scope of this work.
For convenience, there is a rule of thumb which gives the estimate:
\[ k \approx \sqrt{n/2} \]
where: $k$ is the number of clusters to specify
$n$ is the sample size
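As a worked example, assuming the rule of thumb is the commonly quoted
$k \approx \sqrt{n/2}$ and taking the 36 states plus the FCT as the sample
($n = 37$), the estimate can be computed as follows.

```r
n <- 37               # 36 states plus the FCT
k <- sqrt(n / 2)      # rule-of-thumb estimate of the number of clusters
print(round(k, 2))    # roughly 4.3, suggesting about 4 clusters
```

The result is only a starting point; it would normally be rounded and then
checked against the clustering output.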
3.2 R-Console
R is a platform-independent, open-source environment built for statistical
computing; the R console is its command-line interface. It supports object-oriented
programming (it works with objects and classes). The power of R lies in the
flexibility to write customized functions and classes that make up an R package.
This feature of R makes it possible to have
thousands of packages contributed by many developers. These packages are often
referred to as libraries, and they can be found on CRAN (Comprehensive R Archive
Network).
The cluster library in R provides the functions required for cluster analysis
computation. It is loaded with the call
library(cluster)
3.2.1 R-Studio
R-Studio is an IDE (Integrated Development Environment) for R. It is a powerful and
productive graphical interface (GUI) for R that makes coding more convenient.
Since the R console is purely command-line based, R-Studio makes a good
alternative for computation. Below is the interface of R-Studio.
Fig 3.0 Interface of R-Studio 0.96.330
CHAPTER FOUR
DATA PRESENTATION AND ANALYSIS
This section presents a detailed examination and explanation of the findings
made in the course of this work. Graphical illustrations were employed to depict
the results.
Table 4.0 Average HIV Prevalence rate in Nigerian States (1999-2003 and 2005-2010)