University of Iowa
Iowa Research Online
Theses and Dissertations
2008
Shape and scale in detecting disease clusters
Soumya Mazumdar
University of Iowa
Copyright 2008 Soumya Mazumdar
This dissertation is available at Iowa Research Online:
http://ir.uiowa.edu/etd/208
Follow this and additional works at: http://ir.uiowa.edu/etd
Part of the Geography Commons
Recommended Citation
Mazumdar, Soumya. "Shape and scale in detecting disease clusters."
PhD (Doctor of Philosophy) thesis, University of Iowa, 2008.
http://ir.uiowa.edu/etd/208.
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
An Abstract
Of a thesis submitted in partial fulfillment of the requirements
for the Doctor of
Philosophy degree in Geography in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
ABSTRACT
This dissertation offers a new cluster detection method that looks at
the cluster detection problem from a new perspective. I change the
question "What do real clusters look like?" to the questions "What do
spurious clusters look like?" and "How do spurious clusters affect the
ability to recover real clusters?" Spurious clusters can be identified
from their geographical characteristics. These are related to the
spatial distribution of people at risk, the shape and scale of the
geographic units used to aggregate these people, the shape and scale of
the spatial configurations that the disease mapping or cluster detection
method may impose on the data, and the shape and scale of the area of
increased risk. The statistical testing process may also create spurious
clusters. I propose that the problem of spurious clusters can be
resolved using a computational geographic approach. I argue that Monte
Carlo simulations can be used to estimate the patterns of spurious
clusters in any situation of interest, given knowledge of the first
three of these four determinants of spurious clusters. Then, given these
determinants, where real measurements of disease or mortality are known,
it is possible to show which areas of increased risk are true clusters
and which are spurious clusters. This distinction is made in a
three-dimensional signature space, with shape, size and rate as the
three axes. The extent of similarity (or dissimilarity) of a cluster to
the simulated spurious clusters influences whether it can be recovered.
Experiments on simulated data show that this method is successful in
detecting clusters. This method can also predict with reasonable
certainty which clusters can be recovered and which cannot. I compare
this method with Rogerson's Score statistic method. These comparisons
expose the weaknesses of Rogerson's method. Finally, these two methods
and the Spatial Scan Statistic are applied to a search for possible
clusters of prostate cancer incidence in Iowa. The implications of the
findings are discussed.
Abstract Approved: ___________________________________
Thesis Supervisor
___________________________________
Title and Department
___________________________________
Date
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
A thesis submitted in partial fulfillment of the requirements
for the Doctor of
Philosophy degree in Geography in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
Graduate College
The University of Iowa
Iowa City, Iowa
CERTIFICATE OF APPROVAL
_______________________
PH.D. THESIS
_______________
This is to certify that the Ph.D. thesis of
Soumya Mazumdar
has been approved by the Examining Committee for the thesis
requirement for the Doctor of Philosophy degree in Geography at the
December 2008 graduation.
Thesis Committee:
___________________________________
Gerard Rushton, Thesis Supervisor
___________________________________
David Bennett
___________________________________
Naresh Kumar
___________________________________
Marc Linderman
___________________________________
Dale Zimmerman
ACKNOWLEDGMENTS
I would like to acknowledge the help I have received during the
course of my stay
in Iowa. I would like to thank Dr Rushton for supervising my
research. I would also like
to thank my committee members for their contributions. The last
four years of my life
have been emotionally challenging for me. I thank the great
masters before us who have
helped me through. I am thankful to the writings of M. Scott
Peck, Viktor Frankl, Swami
Vivekananda, and the yogic practices of Sri Sri Ravishankar of the
Art of Living Foundation.
I would also like to thank my family members, especially my mom,
mishtimashi and late
Dr Mazumdar for their support. Thanks are also due to all my
friends and well wishers.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS
1.1 Statement of Purpose
1.2 Introduction
1.3 Organization of the dissertation
1.4 Review of existing methods of cluster detection
1.4.1 Map data without further geographic processing
1.4.1.1 Methods that do not smooth the data
1.4.1.2 Methods that smooth the data
1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk
1.4.2.1 Non-combinatorial approaches
1.4.2.2 Combinatorial approaches
1.4.2.3 Hybrid approaches
1.4.3 Significance testing and spurious clusters
1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters
1.4.4.1 The spatial distribution of the locations of people in the map
1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas
1.4.5 Identifying spurious clusters and distinguishing true clusters from spurious clusters
1.4.6 Why use size, shape and rate
2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS
2.1 Theoretical foundations of the S.S.S method
2.2 Hypothesis testing
2.3 The simulated dataset
2.3.1 Hypothetical study area and population
2.3.2 Hypothetical case population
2.3.3 Datasets under the null hypothesis of no clustering
2.3.4 Extracting the cluster candidates
2.3.5 Datasets under the alternative hypothesis of clustering
2.3.5.1 Rationale behind the choice of these configurations of synthetic clusters
2.4 Rogerson's Score Statistic
2.4.1 Theory
2.5 Diagnostics
2.6 Computational Scheme
2.7 Results
2.8 Discussions and future directions
3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA
3.1 Background
3.2 Methods
3.3 Results
3.4 Discussion
3.5 Conclusion
3.6 Contribution that this dissertation makes to the geography literature
REFERENCES
LIST OF TABLES
Table
2.1 Hold one validation for null hypothesis.
2.2 Hold one validation for alternative hypothesis.
2.3 Summary statistics of the simulated 3675 spurious clusters.
2.4 Shape, size, risk (signature) and the ability to recover simulated clusters.
2.5 The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such).
2.6 This table compares sensitivity and specificity with which clusters are recovered for SSS and Rogerson's method and the higher the sensitivity the better the cluster is recovered.
2.7 Cluster recovery using only rates and only shapes.
2.8 How do true clusters differ in shape and size from spurious clusters.
LIST OF FIGURES
Figure
1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters.
1.2 This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value.
1.3 In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy.
1.4 A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of 0.24. Diseased people are randomly distributed over the map. These diseased people are color coded black to indicate a diseased state.
1.5 A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster.
1.6 In contrast to Figure 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bound by the dotted lines to a greater risk than other areas of the map. These people are at a risk of 0.56. In one realization of the process, a cluster of 10 people is therefore diseased in this area.
1.7 The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people).
1.8 People are distributed non-uniformly over space.
1.9 The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown.
1.10 The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of 14.
1.11 The estimated cluster shape and size are very different from the shape and size of the cluster in reality (the dotted line in Figure 1.10). They are also very different from what was obtained for a homogeneous distribution of people in Figure 1.6.
1.12 Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease.
1.13 Assuming an inhomogeneous distribution of people as in Figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased.
1.14 The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster.
1.15 Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogeneous distribution of people.
1.16 The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality.
1.17 In this example a slightly larger area of increased risk is considered than in the earlier example. Six people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases (falling ill).
1.18 The clusters that are generated have very different shapes. In fact, the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster.
1.19 In this example people are inhomogeneously distributed. The same cluster generating process as in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.5.
1.20 The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape.
2.1 Using echelons to extract cluster candidates.
2.2 A set of 50,000 cardiovascular disease mortality cases is randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS.
2.3 An example set of spurious cluster signatures S(ZN) in signature space.
2.4 An example set of spurious cluster signatures S(ZN) in signature space with a few candidate clusters (grey squares).
2.5 Bounding rectangle for elliptical footprint.
2.6 Flowchart of the S.S.S method.
2.7 Population distribution of ZCTAs in Iowa, 2000.
2.8 This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942.
2.9 The simulated datasets follow a multinomial distribution.
2.10 Summary of shapes of simulated spurious clusters, frequency and cumulative frequency.
2.11 Summary of sizes of simulated spurious clusters, frequency and cumulative frequency.
2.12 Summary of rates of simulated spurious clusters, frequency and cumulative frequency.
2.13 Characteristics of the four clusters simulated under the alternative hypothesis.
2.14 Cluster detection diagnostics (the key to the numbers is in the text).
2.15 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%.
2.16 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%.
3.1 Spatial patterns of prostate cancer incidence (1999-2004) in Iowa.
3.2 Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.
3.3 Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal.
3.4 Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular.
3.5 Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical.
3.6 ZCTAs in Iowa with a significant value of Rogerson's Score statistic.
3.7 Expected number of cases in ZCTAs: entire Iowa versus areas with a significant value of Rogerson's Score statistic.
3.8 ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.9 County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.10 Change in mortality and incidence rates from 1990-2004 in five counties (Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The expected counts for the particular year (1990, 1991...2000) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization).
3.11 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County for the years 1990-2004.
3.12 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County for the years 1990-2004.
CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING
SPURIOUS CLUSTERS
1.1 Statement of Purpose
This dissertation offers a new cluster detection method that looks at
the cluster detection problem from a new perspective. I change the
question "What do real clusters look like?" to the questions "What do
spurious clusters look like?" and "How do spurious clusters affect the
ability to recover real clusters?" Spurious clusters can be identified
from their geographical characteristics. These are related to the
spatial distribution of people at risk, the shape and scale of the
geographic units used to aggregate these people, the shape and scale of
the spatial configurations that the disease mapping or cluster detection
method may impose on the data, and the shape and scale of the area of
increased risk. The statistical testing process may also create spurious
clusters. I propose that the problem of spurious clusters can be
resolved using a computational geographic [1] approach. I argue that
Monte Carlo simulations can be used to estimate the patterns of spurious
clusters in any situation of interest, given knowledge of the first
three of these four determinants of spurious clusters. Then, given these
determinants, where real measurements of disease or mortality are known,
it is possible to show which areas of increased risk are true clusters
and which are spurious clusters. The extent of similarity (or
dissimilarity) of a cluster to the simulated spurious clusters
influences whether it can be recovered. Experiments on simulated data
show that this method is successful in detecting clusters. This method
can also predict with reasonable certainty which clusters can be
recovered and which cannot. I compare this method with Rogerson's Score
statistic method [2]. These comparisons expose the weaknesses of
Rogerson's method. Finally, these two methods and the Spatial Scan
Statistic [3] are applied to a search for possible clusters of prostate
cancer incidence in Iowa. The implications of the findings are
discussed.
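The Monte Carlo idea in this section can be sketched in a few lines of Python. This is only an illustration under assumed inputs, not the procedure developed in Chapter 2: the four area populations, the total of 125 cases, and the function names `simulate_null_counts` and `null_rate_envelope` are all hypothetical. Cases are scattered over areas in proportion to population, so that no area carries excess risk, and the highest rate that each area reaches by chance alone is recorded.

```python
import random


def simulate_null_counts(populations, total_cases, rng):
    """Scatter total_cases over areas with probability proportional to population."""
    counts = [0] * len(populations)
    areas = range(len(populations))
    for area in rng.choices(areas, weights=populations, k=total_cases):
        counts[area] += 1
    return counts


def null_rate_envelope(populations, total_cases, n_sims=200, seed=0):
    """For each area, return the highest rate seen across n_sims null datasets."""
    rng = random.Random(seed)
    max_rates = [0.0] * len(populations)
    for _ in range(n_sims):
        counts = simulate_null_counts(populations, total_cases, rng)
        for i, (cases, people) in enumerate(zip(counts, populations)):
            max_rates[i] = max(max_rates[i], cases / people)
    return max_rates


# Hypothetical study area: four areas of very different population size.
populations = [50, 200, 1000, 5000]
total_cases = 125
overall_rate = total_cases / sum(populations)

envelope = null_rate_envelope(populations, total_cases)
for people, rate in zip(populations, envelope):
    print(f"population {people:>5}: highest null rate {rate:.3f} "
          f"(overall rate {overall_rate:.3f})")
```

In runs of this kind the least populated area tends to reach the highest chance rate, which is one way a spurious cluster can arise from the distribution of people alone.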
1.2 Introduction
Disease mapping has a long history. From John Snow's cholera map to the
"intelligent agents" [4] of the present century, disease mapping has
progressed with developments in science, especially Geographical
Information Systems (GIS) and epidemiology. Some of the first disease
maps were simple dot maps indicating the location of disease cases.
These gave way to maps of statistical summaries known as "thematic
maps". These maps convey more information than simple dot maps and are
therefore powerful exploratory and decision making tools. For example,
when mortality maps of lung cancer for the United States were made in
the 1960s, high rates were found in areas of the Eastern Seaboard
[5, 6]. Later, these high rates were attributed to exposure to asbestos
among shipyard workers in these areas. A disease map can thus be used to
map spatial variations in disease risk. A decision maker can ask "Is a
person living in a given area at a greater risk of disease than a person
living in another area?" or "In which areas of the map do people have
the greatest risk of disease?" In the disease mapping literature the
problem of finding areas of excess risk is often called "cluster
detection", a cluster being defined as "a geographically bounded group
of occurrences of sufficient size and concentration to be unlikely to
have occurred by chance" [7] or, in plain English, a geographic area of
high disease risk. A geographical cluster is therefore the spatial
analogue of a statistical cluster [8], where the question of interest is
finding things that are near in statistical space instead of
geographical space.
While investigating the causal factors (or etiology) of areas of
increased risk is important, these methods have other important
applications. Public health agencies are often interested in allocating
resources to areas with an increased burden of disease [9, 10], and
cluster detection methods are used to identify such areas. Sometimes,
environmental policy is formulated on the basis of such studies. In one
instance, the Vatican was taken to task for operating radio transmitters
at illegal frequencies after studies showed an increased risk of cancer
among people living close to these transmitters [11, 12]. Note that
policies are often formulated on the basis of evidence that an increased
risk exists even though the etiological basis for the increased risk may
not have been established. An interesting extension to etiological
research is that the presence of spatial clusters of increased risk
could also be used to prove the existence of disease risk factors that
are spatially non-random. For example, it has been claimed that clusters
of autism in California prove the existence of risk factors that are not
related to genetics or the vaccine hypothesis¹ (barring selective
migration) [13]. Many public health agencies maintain on-the-fly cluster
investigation infrastructure to address cluster-related enquiries [14].

¹ The vaccine hypothesis is that exposure to thimerosal, a mercury-based
additive in vaccines, is a risk factor for autism.

A number of methods exist that can be used to delineate clusters. A
persistent problem with many of these methods is that areas not at high
risk are identified as being at high risk. Some convenient terms for
such false positives are "noise" [15], "noisy clusters" or "spurious
clusters" [16-19]. In this research I develop a method to detect and
adjust for the occurrence of spurious clusters in cluster detection
studies.

The cluster detection literature identifies at least three types of
spurious clusters. The first arises when the estimate of risk in an area
is based on a small number of people [15]. These estimates of risk are
unreliable, and therefore the area may not have a significant excess
risk. A number of solutions to this problem exist [20-26]. The second
type of spurious cluster stems from statistical issues in the cluster
detection method. For example, failing to adjust for multiple hypothesis
testing may give rise to spurious clusters [18, 27]. This problem is an
area of active research [28].
Kulldorff's SaTScan method resolves this problem by adopting a
likelihood-based hypothesis testing framework [3].

The third type of spurious cluster is created by a mismatch between the
scale and spatial structure of the process that generates the cluster
and the scale and spatial structure used to measure the process. The
scale and spatial structure, or spatial form, of the cluster search
process (which measures or samples the underlying data) can generate
spurious clusters. Unlike the other sources of spurious clusters, very
little research exists on this form of noise. There are a number of
reasons for this. Until recently, the computational power available to
researchers for cluster detection problems was limited. A cluster can
have any geometry or spatial form in reality. However, limited
computational power confined researchers to searching for clusters
within a small range of spatial forms. For instance, it is a common
strategy to search for circular clusters. This strategy was adopted by
some of the first cluster search methods [27], and remains common today
[29]. If the real cluster is not circular in shape, then the power to
detect it is greatly reduced. But a limited search also implies that the
likelihood of mismatch between the circles and the underlying true
cluster is also limited (given that the spatial form of this true
cluster is unknown). In contrast, if the cluster search incorporates a
number of different spatial forms, then the likelihood of mismatch
increases.

Since computational power is no longer a limiting factor, some
researchers have developed "shape free" disease cluster detection
methods. These methods, which draw on the work of geographers in the
1960s and 70s [30], measure spatial attributes (like disease counts or
rates) at a large number of possible shapes, sizes and scales. The
measured spatial attributes, or some functions of them, are used to
decide whether an area of a given shape and size at a given scale is a
cluster. For example, Duczmal's [31] scan assigns a likelihood value to
each cluster it finds, where the likelihood is a function of attributes
such as the observed number of cases in the cluster. The clusters with
the highest likelihood are the most likely to be true clusters. These
methods thus promise to seek out the true clusters, no matter what their
spatial form. However, this also means that at some shape and scale,
noise or spurious clusters will be detected. These spatial forms will
represent a mismatch between the shape and scale of the process that
generated the data and the shape and scale of the process being used to
detect it. The closest analogy to this is what is known in the disease
mapping literature as the "Texas Sharpshooter Effect". If a shotgun is
fired at a wall, the wall is splattered with seemingly random bullet
holes. At the scale of the wall, the process is random. However, it is
always possible to draw targets a posteriori around the bullet holes.
The act of drawing a target is similar to searching for a cluster at a
scale different from the scale at which the original process occurred
(the entire wall). Duczmal's search procedure thus often finds clusters
that are spurious. Such spurious clusters will be found by any method
that offers even the least amount of geometric freedom to the cluster
search. In fact, these spurious clusters have been found even when the
search is limited to circular geometries (for example, see Kulldorff
[32]).

Tackling this problem therefore requires (a) a thorough understanding of
what gives rise to these spurious clusters and (b) a method to solve or,
at the very least, manage this problem. This dissertation is an attempt
at both.
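The Texas Sharpshooter Effect described above can be illustrated with a short simulation. The sketch below is a toy example under assumed settings, not any published scan method: 200 points are scattered uniformly at random over a unit square (the "wall"), and the count inside one square window fixed in advance is compared with the count inside the fullest window found after scanning many positions. The window size, the 17 x 17 grid of positions, and the names `count_in_window` and `fullest_window` are hypothetical choices made for illustration.

```python
import random


def count_in_window(points, x0, y0, size):
    """Number of points inside the square window with lower-left corner (x0, y0)."""
    return sum(1 for x, y in points
               if x0 <= x < x0 + size and y0 <= y < y0 + size)


def fullest_window(points, size, n_positions):
    """Scan an n_positions x n_positions grid of window origins over the unit
    square and return (count, x0, y0) for the window holding the most points."""
    span = 1.0 - size
    origins = [span * i / (n_positions - 1) for i in range(n_positions)]
    best = (-1, 0.0, 0.0)
    for x0 in origins:
        for y0 in origins:
            best = max(best, (count_in_window(points, x0, y0, size), x0, y0))
    return best


# A purely random process at the scale of the whole wall.
rng = random.Random(42)
n_points = 200
points = [(rng.random(), rng.random()) for _ in range(n_points)]

size = 0.2
expected = n_points * size * size                      # mean count in any one fixed window
pre_specified = count_in_window(points, 0.4, 0.4, size)  # target drawn before looking
scan_count, x0, y0 = fullest_window(points, size, n_positions=17)  # target drawn after

print(f"expected count per window: {expected:.1f}")
print(f"window chosen in advance:  {pre_specified}")
print(f"fullest scanned window:    {scan_count} at ({x0:.2f}, {y0:.2f})")
```

The window chosen in advance should sit near the expected count, while the fullest of the scanned windows will typically hold noticeably more points even though no part of the square carries excess risk. Allowing more window sizes or shapes widens the gap further.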
It is clear that an understanding of this problem requires an
understanding of the scale and shape of the spurious cluster or noise
generating process. The shape, size and risk elevation of a cluster,
whether spurious or real, are unique to each disease mapping or cluster
detection situation. These characteristics depend on (a) the cluster
generating process, especially the shape and size of the area of excess
risk, (b) the spatial distribution of people over space and (c) the
scale at which the spatial data are aggregated [19]. These factors are
unique to each disease mapping situation, and they are responsible for
creating spurious clusters. Two facts follow: (1) every disease mapping
situation has a unique noise or spurious cluster signature, and (2) it
is not possible to guess this signature a priori. However, this
signature may be computed, as explained below.
Since each disease mapping situation has a unique noise or spurious
cluster signature, it follows that in every disease mapping situation
there will be some clusters that are hard to detect. These clusters will
be similar in some ways to the spurious or noisy clusters. This issue,
the issue of recoverability, has only just started being discussed in
the disease mapping literature [33, 34]. The method I describe
incorporates the following features. First, it extracts cluster
candidates using an exploratory approach. Second, shape, size and rate
are used to distinguish true clusters from spurious clusters. Third, the
method incorporates the recoverability of clusters into the analysis.
The researcher is able to know (computationally) a priori what spatial
forms of clusters are recoverable. The method utilizes computational
geography and two fundamental geographic aspects of clusters, shape and
size, to analyze the recoverability of clusters and to separate true
clusters from spurious clusters. This dissertation diverges from the
traditional disease clustering literature in taking shape and size into
consideration. Traditionally, only the rate at a given location, or some
function of the rate, is used to separate a true cluster from a spurious
one. Since the method incorporates the shape and size of the cluster in
its analysis, I call it the Shape Size Sensitive disease cluster
detection method, or the S.S.S method. The S.S.S method is tested and
validated on simulated data. This method demonstrates the power of
computational geography over traditional methods [35].

The ideas and methods developed and tested in this dissertation are
either new or have been discussed only in scant detail in the
literature. Yet they are fundamental to geography and disease mapping.
This research thus makes an important contribution to the disease
mapping literature.
1.3 Organization of the dissertation
In this chapter (Chapter 1) I discuss how various disease
mapping and cluster detection techniques approach the problem of
spurious clusters. I then argue that these
methods do not address the issue of spurious clusters
adequately. I suggest that a
geographical approach can help us better understand the problem
and explain how
geography gives rise to spurious clusters. Then, having
understood the geographical
bases for spurious clusters I propose a geographically sensitive
disease cluster detection
method. I explain this method, the Shape Size Sensitive (S.S.S)
method, in Chapter 2. Then, using simulated data, I test the
sensitivity of this method. I also compare the
performance of the S.S.S method with Rogerson's Score statistic
method for detecting
disease clusters. The final, short chapter is Chapter 3. Here I
use the S.S.S method,
Rogerson's Score statistic and Kulldorff's Spatial Scan Statistic
to investigate the spatial
patterns of prostate cancer risk in Iowa. The implications of
the findings are discussed.
1.4 Review of existing methods of cluster
detection
All disease mapping and cluster detection approaches share a
common goal. This
is to uncover the underlying pattern of risk. These methods
calculate statistics as rates or
likelihoods which serve as measures of risk. The "patterns" on a
map are obtained by
mapping either these statistics, or those areas that cross some
threshold of the calculated
statistic. When the second procedure is followed, that is, when the
rate or the likelihood of an
area having an excess risk is statistically tested, the method
is often called a cluster
detection method. Most cluster detection methods test a large
number of areas which
could possibly be clusters. These are called candidate clusters
[31, 36] or cluster candidates. If a cluster passes the statistical
test, but demarcates an area where no
cluster exists in reality, then, it is a noisy cluster [31] or
spurious cluster [16-19]. The term true cluster may be used to
indicate geographic areas of excess risk. It is also
possible that a true cluster is suppressed by the cluster
detection process. In the disease
cluster detection literature this problem is usually not
discussed separately, but forms an
integral part of the spurious cluster detection problem.
Spurious clusters may be created
at various stages in the disease mapping/cluster detection
process. The first step for
applying a cluster detection method is to collect spatial data.
These data may come pre-aggregated
into administrative regions, or they may come in
individual form [37, 38]. If the data are in individual form,
they need to be processed and aggregated
such that summary statistics may be gleaned from them and the
summary statistics
mapped. The process of aggregation may create spurious clusters.
One solution is to use
the individual level data to search for clusters [39]. While a
number of methods will work with both aggregated and individual
level data, there are very few methods that have
been developed exclusively for individual level data [40, 41].
With better quality data being increasingly available, such
analyses will become more common [37, 42]. The majority of disease
mapping situations start with aggregated data and summary
statistics are calculated from these datasets. When the summary
statistics are calculated based on a
small base population (also called a small support size), then
these statistical estimates are likely to be unreliable. This is
the small number problem. Some methods carry out
a process called "smoothing", where information from neighboring
regions is used to
obtain a better estimate of the mapped statistic for a given
region. This to some extent
alleviates the problem of spurious clusters created from small
numbers. The statistical
testing procedure could also create spurious clusters. If
multiple hypothesis tests are carried out without
adjustment, this process may also give rise
to spurious clusters. In a famous example, Openshaw [27] carried
out multiple hypothesis tests when searching for leukemia clusters
in Northern England. Whenever a test was significant, a circle
was
drawn. Some of these circles were spurious clusters, and would
not have existed if
adjustments for multiple testing had been carried out. Sometimes,
using the wrong reference distribution may also create spurious
clusters. Conversely, using overly conservative
multiple testing correction techniques may suppress true
clusters [28]. Waller and Gotway [4] write of situations where for
a Poisson reference distribution, it is not possible to distinguish
a lack of fit to the Poisson distribution (spurious cluster) from a
rejection of the null hypothesis (true cluster). This is an area of
active statistical research, and some new and innovative solutions
have been proposed to these problems [43, 44]. Kulldorff's SaTScan
method uses a likelihood based hypothesis testing framework to
solve the problem of multiple testing [3]. Instead of testing
multiple hypotheses, this method tests only one hypothesis. This
hypothesis test is carried out on the cluster
candidate that is most likely to be a cluster. The likelihood is
a statistical function,
that is calculated under the assumption that the observed data
conform to certain known
distributions (e.g., Poisson or binomial). There still remains the
third source of spurious clusters. Unlike the first two, there
is little research on this source of spurious clusters. This is
when spurious clusters are
created from mismatch between the process that generates the
disease map patterns, and
the processes used to recover the patterns. This mismatch could
arise when the data are
aggregated to administrative regions, or to other shapes and
scales by the method of
analysis. In this section I discuss the various methods of
cluster
detection in the context of their ability to handle this problem.
Among the various methods
available, some offer the opportunity of multiscalar
analysis. In these methods,
the data may be geographically rescaled. While these methods
geographically process the
data before mapping patterns, other methods consider the sanctity
of geographic
boundaries unbreachable. The latter attempt to expose the
underlying risk pattern by
mapping summary statistics within existing geographic boundaries
without any further
geographic processing of the data.
1.4.1 Map data without further geographic
processing
In these methods the geographic boundaries of regions are left
as they are;
however, various statistical manipulations are carried out on the
data. Some researchers
prefer to call this group of methods disease mapping methods
[45]. As I discussed earlier, these methods can again be subdivided
into two groups: methods that smooth the
data and methods that do not smooth the data.
1.4.1.1 Methods that do not smooth the data
The vast majority of disease maps are maps of raw rates, where
the number of cases per unit population within existing geographic
regions such as counties or states is
mapped [46]. Another approach is a "map of probabilities" [47,
48], where instead of mapping a rate, the probability of observing
the rate within a geographic region is
mapped. Mapping raw rates is often problematic when the rates
are based on small base
populations [15]. The maps thus produced are likely to display
noisy patterns (the small number problem).
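The small number problem can be illustrated with a minimal simulation sketch. It assumes a constant underlying risk everywhere and Poisson-distributed case counts; the risk value, the two population sizes and the function name are hypothetical choices made only for illustration, not values from this study.

```python
import numpy as np

def simulated_raw_rates(population, risk, n_areas, rng):
    """Raw rates for n_areas areas that all share the same true risk."""
    # Case counts vary only by chance around risk * population.
    cases = rng.poisson(risk * population, size=n_areas)
    return cases / population

rng = np.random.default_rng(42)
risk = 0.001  # hypothetical constant risk: no true clusters exist

small = simulated_raw_rates(500, risk, 1000, rng)     # small base populations
large = simulated_raw_rates(50_000, risk, 1000, rng)  # large base populations

# The spread of raw rates is much wider for the small populations,
# even though the underlying risk is identical everywhere.
print("std of rates, small populations:", small.std())
print("std of rates, large populations:", large.std())
```

Under these assumptions the standard deviation of a raw rate is sqrt(risk / population), so areas with small populations show extreme rates far more often by chance alone.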
1.4.1.2 Methods that smooth the data
In these methods various statistical manipulations are used to
smooth the rates
in each region while at the same time keeping the geographic
boundaries intact.
Information from neighboring regions is used to stabilize the
rates in a given region.
Some examples of this approach can be found in the Bayesian
disease mapping literature
[23, 24]. Other examples are the method of moving averages and
headbanging [20, 22]. These methods are not very successful in
dealing with the problem of spurious clusters. A study by Kafadar
[22] has shown that many of the popular smoothers such as
headbanging and empirical Bayes are unable to detect true patterns
in the data or have
issues with detecting spurious patterns or clusters. Some of the
methods smooth the data
by averaging rates over kernels or filters. For example, Sabel et
al. [49] investigate rates of Amyotrophic Lateral Sclerosis (Lou
Gehrig's disease) incidence in Finland by smoothing rates using
Gaussian kernels. Another method is Rogerson's Local Score
statistic [2, 4, 50]. In this method the deviations from the
expected rate are smoothed using Gaussian kernels. As with other
methods, if the rates are based on small numbers,
then smoothing these unreliable rates may create spurious
clusters. I use Rogerson's
Score statistic in my research and therefore, this method is
discussed in detail in later
sections. Spurious clusters are often created by these methods.
First, because these
methods map the rates based on small areas before smoothing
them, they are prone to the
small number problem. Second, these methods do not in any way
attempt to deal with the
problem of spurious clusters from spatial mismatch discussed
earlier. Third, the statistical
tests that these methods carry out may not be able to
distinguish spurious clusters from
true clusters. For example, there is no consensus on what the
correct reference
distribution is for Rogerson's Score statistic [2, 4, 50]. A
separate group of methods that often smooth the data are local
measures of
spatial similarity. These methods, which are also known as LISA
(Local Indicators of Spatial Autocorrelation) [51], address the
question, "How similar is the risk at a given small area to that
of its neighbors?" The greater the similarity, the higher the
likelihood
that the small area belongs to (or is) a cluster. Some of the
LISA statistics are local Moran's I and local Geary's C [50-54].
Since the underlying philosophy of this approach is that things
nearer are more similar than things farther away [55], the implicit
definition of scale here is the distance at which this similarity
is manifested. Thus a process that acts
at a large scale may cause more similarity among immediately
neighboring local areas than
processes that work at a smaller scale. As with other methods, if
the statistics are calculated
on small areas, they could be unreliable. The reference
distributions of LISA statistics are
often not known [4] and the scale at which a process operates is
not investigated before
LISA statistics are calculated. Any of these factors could lead
to the creation of spurious
clusters.
1.4.2 Methods that pre-process the data
before calculating and/or testing for significant
disease risk
These methods allow the modification of geographic boundaries to
extract the
underlying risk surface and/or to find which area has the
greatest excess in disease risk.
One group of methods, often called density estimation methods
[56], simply ignores existing geographic boundaries. Drawing from the
"field" theory of geographic
phenomena [20], these methods consider that disease risk patterns are
continuous in nature and that they do not change or stop abruptly
at geographic boundaries. When appropriately used,
these methods provide the opportunity to control the spatial
basis of support, and thus, the
scale of the analysis [57, 58]. The other group of methods draws
from concepts of region building which were developed by
geographers [30]. One approach to building regions is to coalesce
groups of areas to build aggregate regions. These methods attempt
to find
that combination of areas which has the greatest likelihood of
being a zone of high
disease risk. A third group of methods combine concepts of
region building methods with
the first group of methods or with methods discussed in the last
section. The ability of all
these methods is limited by the scale of the data. Often the
data come aggregated into
small areas and the analysis must be carried out at scales equal
to or greater than the scale of
aggregation. Nevertheless, these methods are better equipped
than other methods to
control the shape and the scale of the data, and this gives them
an edge over other
methods when dealing with the problem of spurious clusters.
1.4.2.1 Non combinatorial approaches
These methods ignore geographic boundaries and attempt to
extract the
underlying patterns of risk. They often lay a uniform grid over
the map area and measure
the statistic of interest at each grid point. Irrespective of
whether the data are aggregated
or not, a value can be obtained at each grid point. While there
are a number of approaches
to calculating the statistic at each grid point [21], a simple
and common approach is to "filter" the data using circular spatial
filters [3, 9, 21, 27]. Some methods map the statistic calculated
at each grid point [9] while others do not [3]. These circles can
be of fixed or varying sizes. However, since these filters are of a
certain shape, they bias the cluster
search. The bias is in favor of detecting clusters of, or similar
to, the shape of the filter
(circles in this case). Statistically, clusters that have
the shape of the filter have a higher power of detection than
clusters of other shapes. This approach therefore
overcomes the limitation outlined in the methods discussed
earlier, but is limited in its
treatment of geographic shape. Ellipses and other geometric
shapes have also been
studied [29, 59]. One of the methods, based on Rushton's Adaptive
DMap [9], maps rates at grid points using adaptive filters and
interpolates these with an IDW (Inverse Distance Weighting)
interpolation algorithm. The adaptive filter [58, 60] ensures that
the rates are based on the same number of people or the same
support size. Thus, unlike the
LISA methods, all statistics are equally reliable. Also, the use
of an adaptive filter
ensures that the scale of the analysis can be precisely
controlled. The Inverse Distance
Weighting Algorithm used for creating the final pattern was also
found by Kafadar [22] to be the least noisy of all
smoothing/interpolation methods. Thus, by allowing
multiscalar analysis, relative freedom of cluster shape
(clusters don't have to conform to geographic boundaries) and using
a robust interpolation technique, Rushton's Adaptive Filtering
method is best suited for dealing with the problem of spurious
clusters from
mismatch between the process and analysis scales. I use this
method in my analyses.
Another important density estimation method is Kulldorff's
SaTScan [3]. While the
DMap method maps the extracted pattern, and is therefore good
for visualizing and
exploring the underlying pattern, SaTScan can be used to map
only those areas that are
significant clusters. SaTScan has found wide acceptance in the
public health community
because of its ability to account for the multiple hypotheses
testing problem and a robust,
freely available software. Some of the recent developments in
the disease clustering
literature have followed the combinatorial approaches that I
discuss next, and their
method of choice has been based on the Spatial Scan Statistic
method of cluster
detection. Since multiple testing is an issue with these
combinatorial approaches, the
Spatial Scan Statistic is a reasonable choice. Since I use the
Spatial Scan Statistic in
Chapter 3 to investigate clusters of prostate cancer in North
West Iowa, some of the
details of the Spatial Scan Statistic are provided next:
The scan statistic originated as a one-dimensional test. Its
objective was to test if a one-dimensional point process is purely
random. The one-dimensional scan
statistic was extended by Kulldorff into the spatial domain [3].
The spatial scan statistic moves a circle across the study area.
The circle is centered on a centroid. The centroid
could be the location of a single individual for unaggregated
data, the centroid of a census
tract (for example) for aggregated data, or one of a set of grid
points. Kulldorff (1997) [3] states: "The zone defined by a circle
consists of all individuals in those cells whose
centroids lie inside the circle and each zone is uniquely
identified by these individuals.
Thus, although the number of circles is infinite the number of
zones will be finite. For
unaggregated data the zones are perfectly circular, that is, the
individuals in the zone are
exactly those located within a defining circle. With data
aggregated into census districts,
a zone may have irregular boundaries that depend on the size and
the shape of the several
contiguous census districts it includes." The Spatial Scan
Statistic is implemented
through the freely available software SaTScan [32]. The
methodology of the Spatial Scan Statistic is explained as follows.
The method involves two steps: 1. Confounder
adjustment and 2. Hypothesis testing.
In disease cluster detection studies known risk factors or
confounders are
adjusted for, before the cluster detection algorithm is
implemented. Thus, for example, it is known that age is associated
with prostate cancer. It may be desirable to remove the
effect of age from the analyses, such that the clusters that are
detected reflect the presence
of other, yet unknown, risk factors. The confounder adjustment
procedure that SaTScan utilizes is known as the indirect
standardization method. It is as follows:
If
$e_i$ = expected number of cases in local area/ZCTA $i$ after confounder adjustment,
$n_i$ = observed number of cases in local area/ZCTA $i$ after confounder adjustment,
$r$ = a specific confounder group, for example the age group from 45-65 years,
$n_r$ = total number of cases in G in age group $r$,
$N_{ir}$ = total number of people in G in local area $i$, in age group $r$,
then the confounder adjustment procedure, summing over all confounder groups $r$, is:

$$e_i = \sum_{r} \left[ \frac{n_r}{\sum_{i'} N_{i'r}} \cdot N_{ir} \right]$$
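The indirect standardization step can be sketched in a few lines. This is a minimal illustration, not SaTScan's code: the function name and the toy numbers (three areas, two age groups) are hypothetical, and the arrays follow the definitions above, with `cases_by_group` holding the region-wide cases per group and `population` holding the people per area and group.

```python
import numpy as np

def expected_cases(cases_by_group, population):
    """Indirectly standardized expected counts per area.

    cases_by_group: shape (R,), region-wide cases in each confounder group.
    population:     shape (A, R), people in area i and group r.
    Returns shape (A,): sum over groups of (region-wide group rate) * N_ir.
    """
    cases_by_group = np.asarray(cases_by_group, dtype=float)
    population = np.asarray(population, dtype=float)
    # Region-wide rate in each group: n_r divided by the group's total population.
    group_rates = cases_by_group / population.sum(axis=0)
    # Apply each group's rate to the local population of that group.
    return population @ group_rates

# Hypothetical toy data: 3 areas, 2 age groups.
population = np.array([[1000, 200],
                       [500, 500],
                       [1500, 300]])
cases_by_group = np.array([30, 50])
e = expected_cases(cases_by_group, population)
print(e)
```

A useful check on any such computation is that the expected counts sum to the total number of observed cases in the region.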
The adjusted numbers of cases are then used to test the
hypothesis that a given local
area/ZCTA i has an excess risk/belongs to a cluster. The
hypothesis testing procedure is
explained next. The Spatial Scan Statistic tests the hypothesis
that a given area of the map
(for example a collection of ZCTAs) has a greater (or lesser)
risk than the rest of the
ZCTAs in the entire geographic region G.
If Zj is the jth cluster candidate in Z (the collection of k possible clusters
in G), and R(inside, j) is the risk inside Zj while R(outside, j) is the
risk outside Zj, then the null hypothesis and alternative
hypothesis are:
H0: R(inside, j) = R(outside, j)
H1: R(inside, j) > R(outside, j)
The observed number of cases $n_j$ inside (or outside) a cluster
candidate is assumed to be Poisson distributed, and a function of
the expected number of cases in the cluster $e_j$ and the risk
R(inside, j). Let $n$ be the total number of cases in G, and let

$$n_j \sim \text{Poisson}\left[ e_j \cdot R(\text{inside}, j) \right]$$

The likelihood ratio that is used, from these null and alternative
hypotheses, is as follows:

$$\text{Likelihood ratio} = \frac{\text{Likelihood}\left(R(\text{inside}, j) > R(\text{outside}, j)\right)}{\text{Likelihood}\left(R(\text{inside}, j) = R(\text{outside}, j)\right)}$$

This likelihood ratio can be solved and written in the
logarithmic form as follows:

$$\mathrm{LLR}_j = n_j \ln\left(\frac{n_j}{e_j}\right) + (n - n_j) \ln\left(\frac{n - n_j}{n - e_j}\right)$$
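The log likelihood ratio above can be evaluated directly for a single cluster candidate. The sketch below is an illustration, not SaTScan's code; it assumes that `n` is the total number of cases in G, that the expected counts are scaled to sum to `n`, and that only candidates with more cases than expected are of interest (consistent with H1). The function name and example numbers are hypothetical.

```python
import math

def log_likelihood_ratio(n_j, e_j, n):
    """LLR_j = n_j ln(n_j/e_j) + (n - n_j) ln((n - n_j)/(n - e_j)).

    n_j: observed cases inside the candidate
    e_j: expected cases inside the candidate
    n:   total cases in the whole region (assumed to equal total expected)
    """
    if n_j <= e_j:
        # Assumption: candidates without an excess of cases score zero.
        return 0.0
    inside = n_j * math.log(n_j / e_j)
    # If every case falls inside the candidate, the outside term vanishes.
    outside = 0.0 if n_j == n else (n - n_j) * math.log((n - n_j) / (n - e_j))
    return inside + outside

# Hypothetical candidate: 20 observed cases where 10 were expected,
# out of 100 cases in the whole region.
llr = log_likelihood_ratio(n_j=20, e_j=10, n=100)
print(llr)
```

The value grows both with the size of the excess inside the candidate and with the corresponding deficit outside it, which is why a candidate's score depends on its size as well as its rate.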
The significance of the log likelihood ratio is tested using a
Monte Carlo
hypothesis test. The SaTScan program carries out a
user-specified number of Monte
Carlo randomizations of the data and tests for the presence of a
cluster at the 0.001 % significance level (the percentage can be user
specified too). A p
value is reported. This is
calculated as "p value = Rank of LLR / (1 + #simulation)." Note
that the spatial scan
statistic procedure does not adjust for multiple testing in the
traditional sense, for example
by carrying out a Bonferroni or other multiple testing
adjustment procedure. Instead, it
avoids the problem of testing multiple hypotheses by
concentrating on those cluster
candidates that are most likely to be true clusters (and thus
have the highest log likelihood
value). Also note that the Spatial Scan Statistic procedure
explained above is the spatial
Poisson model, which is the model used in disease mapping. There
are numerous other
modifications to the Spatial Scan Statistic procedure [29].
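The Monte Carlo test of the most likely cluster can be sketched as follows. This is a simplified illustration under stated assumptions, not the SaTScan implementation: candidate zones are supplied as hypothetical lists of area indices rather than generated from circles, null datasets are drawn by distributing the observed total number of cases among areas in proportion to their expected counts, and only high-risk candidates are scored.

```python
import numpy as np

def max_llr(cases, expected, zones):
    """Largest log likelihood ratio over all candidate zones."""
    n = cases.sum()
    best = 0.0
    for zone in zones:
        n_j = cases[zone].sum()
        e_j = expected[zone].sum()
        if n_j > e_j:  # only candidates with an excess of cases
            value = n_j * np.log(n_j / e_j)
            if n_j < n:
                value += (n - n_j) * np.log((n - n_j) / (n - e_j))
            best = max(best, value)
    return best

def monte_carlo_p_value(cases, expected, zones, n_sim=999, seed=0):
    """p value = rank of the observed maximum LLR / (1 + number of simulations)."""
    cases = np.asarray(cases)
    expected = np.asarray(expected, dtype=float)
    n = int(cases.sum())
    expected = expected * n / expected.sum()  # expected counts sum to n
    rng = np.random.default_rng(seed)
    observed = max_llr(cases, expected, zones)
    probs = expected / expected.sum()
    # Each null dataset scatters the n cases according to expected counts only.
    simulated = [max_llr(rng.multinomial(n, probs), expected, zones)
                 for _ in range(n_sim)]
    rank = 1 + sum(s >= observed for s in simulated)
    return observed, rank / (1 + n_sim)

# Hypothetical toy map: five areas with equal expected counts,
# and a strong excess of cases in the first area.
cases = np.array([50, 5, 5, 5, 5])
expected = np.full(5, 14.0)
zones = [[0], [1], [2], [3], [4], [0, 1], [1, 2], [2, 3], [3, 4]]
observed_llr, p_value = monte_carlo_p_value(cases, expected, zones, n_sim=99)
print(observed_llr, p_value)
```

Because each simulated dataset contributes only its own maximum, the observed maximum is compared against a single reference distribution, which is how the procedure sidesteps a separate test for every candidate.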
1.4.2.2 Combinatorial Approaches
Some geographers are interested in creating or building regions
[30, 61-64]. Regions are built up by assigning small areas to
groups such that they fulfill certain
criteria. Regional geographers have called this the assignment
problem. Small areas
are so assigned to regions, that a certain attribute of the
region is optimized [30, 62]. Sometimes, the problem could involve
maximizing the variation in an attribute of the
newly built region as a proportion of the variation within the
entire map [30, 65]. The general question in this approach is "What
combination of areas will optimize a given
objective?". In the disease mapping context disease risk or the
likelihood of risk can be maximized. An example in the disease
mapping context was investigated by Alvanides
[61]. A similar strategy was also suggested (but not
implemented) by Rushton [66]. These ideas were implemented in
computer programs first by Openshaw [64] and later by other
researchers [63, 67, 68]. Independently Duczmal suggested a similar
solution to finding disease clusters of any shape. He operationally
achieved this by maximizing the
Spatial Scan Statistic likelihood function over possible
combinations of areas. While it is
sometimes possible to look at all possible combinations/
collections of areas, for most
realistic geographical areas this is not possible (for example,
see Cliff and Haggett [62]). Neither are there theoretical
solutions to the problem. In operations research, such
problems are called NP-complete. This means that for a
collection of n areas, no algorithm is known that solves the problem
in polynomial computer time. Heuristics are
used to solve such
problems. Duczmal uses the Simulated Annealing (SA) and Genetic
Algorithm (GA) heuristics in his research [31, 69]. An important
aspect of these methods is that they provide enormous freedom of
analysis of shape and scale. The analysis scale and shape
vary across a multitude of combinations. Thus instead of asking
the question "Is there a
cluster at a given scale of the following shape?" these methods
demand, "Find clusters
of any shape at any scale." This makes these methods immensely
powerful. But this
strength also brings about a weakness. If spurious clusters are
created from a mismatch
between the process and analysis scales and shapes, and if a
large number of scales and
shapes are evaluated by this analysis method, then it follows
that noisy clusters will
almost always be detected by these methods alongside genuine or
true clusters. At the
end of this section we shall see an example of this. The next
section discusses some of
the modifications that researchers have proposed to these
methods. These modifications
offer better power of detecting clusters.
1.4.2.3 Hybrid Approaches
These approaches combine some of the strategies of the
non-combinatorial
approaches with a combinatorial search. Some examples are the
approaches proposed by
Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango
proposed that the search begin with a circular cluster as a "seed",
but that regions adjacent to the circular cluster then be coalesced with
it and the resulting hybrid be tested as a possible cluster. With
every
level of adjacency enumerated the problem becomes
computationally more complex, and
therefore in their example Tango
suggested that three levels of adjacency be tested. Patil and
Taillie's [70] approach restricts the search space
to areas with the highest rates, which Patil and Taillie call the
"Upper level sets". These methods provide
interesting extensions to the combinatorial shape-free methods
of cluster search.
We are now in a position to summarize the various methods
discussed. All the
methods outlined above have one goal: to extract the
underlying pattern of
significant excess risk. Some methods are good at mapping the
entire pattern [9], while others are good at testing for
significant excess risk [3]. In the next section, I discuss how
problems with significance testing can introduce spurious
clusters.
1.4.3 Significance Testing and Spurious
Clusters
In general all methods, at some point, address the following
question: "Of all the
candidate clusters in the pattern of risk (whether mapped or
not), which clusters are true clusters?" Each candidate cluster has a
specific risk elevation, a size, and a shape.
Traditionally most "cluster detection" techniques have used some
function of the risk
elevation or rate of a given area to decide if the area is a
true cluster. The question that is
asked is "How likely are we to observe this risk elevation or
rate in this area if the
underlying process is noise?" If the probability is small then
the area is considered a cluster.
The distribution of risks/rates under the process of noise is
also known as the reference
distribution. Traditionally, the reference distribution is
normatively chosen. Some
choices are the normal distribution [2, 50], the chi-squared
distribution [2, 50], the Poisson [3] distribution and the Gumbel
distribution [43]. However, using such distributions is
problematic. If the populations are small, the normal distribution
cannot
be used. It is often hard to distinguish a lack of fit to the
Chi-Squared distribution from
a genuine deviation from the Chi-Squared distribution
(indicating clustering) [4]. A more robust method is to use a
Monte Carlo simulation approach to
empirically determine the reference distribution.
Methodologically this may be achieved
by simulating a series of maps, in each of which noise is the
underlying process. Multiple
Monte-Carlo simulations of the data are used to mimic the noise
process. If the observed
risk elevation (or some function of the risk value such as the
rate) for the area is significantly different from the ones in the
simulated maps, then the area is considered to
be a cluster. However Monte Carlo simulations do not guarantee
that spurious clusters
will not be detected. Steenberghen et al. [72] carried out an
experiment that illustrates this problem. This is displayed in Fig
1.1. Fig 1.1 is a map in which simulated locations
of traffic accidents (points) were randomly scattered [72],
filtered using 600 meter filters,
the density of points estimated, the resulting clusters tested
for significance and the level
of significance was displayed (also known as a p-map). If areas
which show 0.025 % significance are called clusters, the black
shapes in Figure 1.1 are spurious clusters.
Some methods attempt to tackle this problem with a combination
of both Monte
Carlo and normative statistical techniques. Examples are
Duczmal's and Kulldorff's
methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from
Kulldorff's method) generates a large number of irregular cluster
candidates. For each candidate the rate is
calculated. The rate is then fed into a function known as a
likelihood function to yield a
likelihood value of the cluster candidate being a true cluster.
This value is divided by
the likelihood of the cluster candidate not being a true
cluster. This ratio is known as the
likelihood ratio. The likelihood ratios for all cluster
candidates are calculated. The
cluster candidates with the highest ratios are the most likely
clusters. Multiple Monte
Carlo simulations are carried out, and the rates at all the
candidate clusters calculated.
Again, the rates are fed into the likelihood function, thus
generating a reference
distribution of likelihood ratios for each cluster candidate.
The likelihood ratio value of
the cluster candidate is compared with the reference
distribution to decide if the cluster
candidate is a true cluster. However when Duczmal applied this
approach to some of his
data, problems with this approach were dramatically exposed. In
one of his studies
Duczmal [31] simulated breast cancer cases and randomly
distributed them over 245 counties in New England (Fig 1.2). When
he instructed his Simulated Annealing (SA) SaTScan based irregular
cluster search algorithm to search for clusters, one of the
clusters
that it found was a large and extremely irregular cluster
encompassing 122 counties, and
enclosing a large percentage of the randomly scattered cases.
This cluster is an example
of a noisy cluster. The noise generating process (random
distribution of cases) operated at the scale of 245 counties
(aggregated). The shape of the area at which this process operated
is the shape of the New England region that we see in Fig 1.2. At
this scale and
shape, the process generates noise. However, if this process is
studied at the scale of an
aggregation of 122 counties and at the shape that follows the
darker (orange if your copy of this document is in color) shaded
counties in Figure 1.2, then, a noisy or spurious cluster is
generated. It is known that the process that generated this cluster
is noise.
This example thus illustrates a situation where spurious
clusters are created from a
mismatch between the scale and shape of the process that
generates the cluster and the
scale and the shape imposed by the method of analysis. Duczmal
[31] noted that this noisy cluster was large in size and extremely
irregular in shape. Duczmal [73] suggests that large and irregular
clusters like the one found in his study (above) are likely to be
spurious. He and some other researchers [36] therefore incorporate
a penalty for irregularity of shape in the cluster search
algorithm. The extent of this penalty is decided
on a priori knowledge of the shape of the cluster. Therefore, if
researchers believe that
the clusters in an area are likely to be circular, they place a
high penalty on clusters that
are not circular in shape and vice versa. The spurious cluster
detected by Duczmal's
method and the proposed solution raise some important
questions. Is this spurious
cluster, large and irregular with a high risk/rate elevation, an
artifact of his particular
method, or is it possible that if a cluster detection method is
given freedom of shape and
size then such clusters are likely to be detected? We note that
the shape and size of the
spurious clusters in Fig 1.1 are different from the shape and
size of Duczmal's spurious
cluster. Thus not all spurious clusters are large and
irregular.
Duczmal's problem has reintroduced the otherwise rarely discussed
issue of shape
and size in the disease cluster detection literature [69, 74,
75]. Risk elevation is just one possible characteristic of a
cluster. McCullagh [76] states, "In map analysis, features of prime
importance may be size, shape, orientation and spacing". It is
possible for clusters
of different shapes and sizes to have the same risk elevation.
It is also possible for
clusters of the same shape and size to have different risk
elevations. The first objective of any cluster search should
therefore be to distinguish spurious or noisy clusters from
everything else. The risk or rate value of a possible cluster
alone is not sufficient to make
this distinction. The shape and size of the cluster must also be
factored in, when
considering if a cluster is a true cluster. Duczmal proposes a
solution that makes certain a
priori assumptions about the shape and size of a cluster. This
solution is interesting.
However, the problem of spurious clusters may be approached from
a different angle.
Instead of asking the question "What is the shape of a true
cluster?", which is what these
methods do, and which is hard if not
impossible to answer, the
question that should be asked is "What is the shape of a spurious
cluster?". Unlike the
first question, this is easier to answer. This is because the
shape of a spurious cluster,
unlike that of a true cluster, can be mined a posteriori from the data.
To know how this can be
done, we first need to understand how spurious clusters are
generated in the first place.
Thus, in the chapter that follows I discuss in depth, the
phenomenon of noise and the
creation of spurious clusters.
1.4.4 Identifying spurious clusters and
distinguishing true clusters from spurious
clusters
Spurious clusters enclose noise. Across disciplines noise is
defined as "... a random and unpredictable signal" [77]. By this
definition, if the nature of the signal is known, then noise can be
detected and filtered out. For example, in a satellite image, it
may be known that certain frequencies are the signal frequencies
and therefore a spectral
analysis and subsequent filtering may help remove the
undesirable noise. In a satellite
image the signal has a physical existence. For example, infrared
radiation emitted by
vegetation can be measured with certain instruments. In
contrast, in mapping disease the
signal cannot be physically measured. The signal is conceptual
and has to be estimated
from the available data. Some geographers and statisticians
attempt to tackle the problem by developing statistical models that
attempt to separate signal from noise [21, 23, 78-80]. Perhaps a
better approach to understanding signal and noise in a disease map is
to understand the physical process that gives rise to the signal (as
in a satellite image). It is known that in a disease map, the
observed patterns are the result of underlying processes.
The observed patterns are patterns obtained from mapping
statistical summaries of
disease outcomes. For example, a map of patterns of cholera
mortality in England could
be displaying the number of cholera deaths per unit population
in each county. The
outcome in this case is cholera mortality which is the outcome
of a disease process. Since
cholera is a communicable disease it is possible that the spread
of cholera can be modeled
as a contact network process [81]. There exist many other
spatially explicit disease processes (see footnote 2). For example,
patterns of disease could be the result of processes that reflect
an underlying lack of access to healthcare [10, 56, 82-84].
Whatever the specific process may be, these processes share the
trait of having a spatial form [85], which means that they
predispose some areas of the map to have a greater risk than
others.
It is also possible that the underlying process does not cause
any region of the
map to have a greater risk than any other. Since a disease case
may appear at any point on
the map by random chance, by the earlier definition of noise,
this is a noise generating
process. A cluster defined by enclosing some of these disease
cases is a spurious cluster.
On any given map disease patterns can be the result of one or
more processes. It could be
the result of one process that generates clusters and another
process that generates noise.
The challenge, therefore, is to distinguish the areas of a
pattern that are the result of a cluster generating process from
those that are not. Moreover, given a disease process that generates
patterns on a map, a number of other factors also influence the
patterns we actually observe. Given a cluster generating process, the
following factors influence the pattern that is then extracted:
1. The spatial distribution of the locations of people in the
map.
2. The shape and size of the geographic units that are used to
aggregate individuals into discrete small areas.
3. The shape and size of the spatial configuration that the disease
mapping or cluster detection method may impose on the data (in
addition to 2).
Understanding these factors is essential to understanding noise
and spurious clusters. I discuss them next.

2 It is important to distinguish between a spatially explicit
disease process and a spatial disease process. Some scientists
attempt to model diseases as purely spatial processes. Examples of
this can be seen in the cellular automata based disease modeling
literature. No disease process is purely spatial and therefore such
models are misleading.
1.4.4.1 The spatial distribution of the locations
of people in the map
A cluster generating process causes an area of the map to have a
greater risk than
other areas of the map. Cluster detection methods seek to
estimate the shape, size and risk
elevation of the area of increased risk using the locations of
people as proxy sample sites.
A representative spatial sample of the area of risk would be a
uniform grid [86]. People are never distributed uniformly over
space; instead, a likely spatial distribution consists
of dense settlements interspersed with sparsely populated areas.
This creates a challenge in estimating the true shape of the cluster.
As I illustrate in Figures 1.3 to 1.11, a cluster that in reality has
a uniform shape may be estimated as having a highly irregular shape
because of the way people are distributed over space [75]. The shape
of the actual area of increased risk, or true cluster, created by the
cluster generating process also influences the shape of the cluster
that is finally estimated. If the shape of the true cluster is highly
irregular, it is quite likely that the shape of the cluster that is
estimated is also highly irregular, but the converse may also be
true. This is illustrated in Figures 1.12 to 1.14. Another
phenomenon long observed by geographers is that the
same risk process
may give birth to different shaped clusters in different areas
of the map or, in more
general terms, the same cluster generating process may give rise
to different patterns
[87]. While the shape of the original area of increased risk, or
true cluster, may be the same in two areas and the spatial
distribution of the people may be the same, it is not necessary that
the pattern of people who are diseased (and who are not) will be the
same in both areas. This means that the shape of the estimated area
of increased risk need not be the same in both areas. This is further
complicated by the fact that people are almost never distributed
similarly over space in two different regions (Figures 1.15 to 1.20).

First, for the purposes of understanding
this issue, let us assume the highly
improbable situation that people are uniformly distributed over
space. Let the distribution
be over a uniform grid. Figure 1.3 illustrates the situation.
Next, let us consider that out of the 42 people in the region, 10
are afflicted by some disease. However, we assume that the process
that causes disease is a noise generating process. Therefore, we
expect diseased people (or cases) to be randomly distributed over the
region among the 42 people, as shown in Figure 1.4. A convex hull
boundary of these cases is seen in Figure 1.5. In contrast, if there
is a cluster generating process, we would expect the diseased people
to be clustered together. Figure 1.6 illustrates such a situation.
People enclosed within a dotted area of increased risk are diseased,
the risk being 0.24 (the risk in other areas being 0). We observe in
Figure 1.6 one realization of the risk process, so 10 people are
diseased. Figure 1.7 displays the convex hull boundary of this
cluster of diseased people. The smooth and regular shape of this
cluster is in sharp contrast to the irregular cluster shape that we
observe in Figure 1.5. Since it is highly unlikely that people will
be uniformly distributed over space, Figure 1.8 illustrates the more
realistic possibility of people being non-uniformly distributed over
space. If the entire geographic area in Figure 1.8 is subject to a
risk, we expect some people to become diseased (again, one
realization of the process). Figure 1.9 illustrates this and the
boundary that demarcates the cluster. The shape of the cluster is
very different from what was obtained in Figure 1.5. An area of
increased risk on such a heterogeneously distributed population gives
rise to clusters of unpredictable shapes (Figures 1.10 and 1.11).
These examples show how the spatial distribution of the people
affects the shape and size of the risk surface detected.
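
The contrast between the noise generating and cluster generating
cases on a uniform grid can also be explored numerically. The Python
sketch below is illustrative only and is not the procedure used to
draw the figures. It assumes a 7 x 6 grid for the 42 people, draws 10
cases at random to mimic a noise generating process, and compares the
convex hull area of those cases with the hull area of 10 cases packed
around a hypothetical centre. The function names, the grid
dimensions, the centre and the use of hull area as a summary are all
assumptions made for illustration.

```python
import random


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Assumed layout: 42 people on a uniform 7 x 6 grid.
people = [(x, y) for x in range(7) for y in range(6)]


def noise_cases(rng, n_cases=10):
    """Noise generating process: every person is equally likely to be a case."""
    return rng.sample(people, n_cases)


def clustered_cases(n_cases=10, centre=(3, 2.5)):
    """Stand-in for a cluster generating process: the people nearest a centre."""
    def dist2(p):
        return (p[0] - centre[0]) ** 2 + (p[1] - centre[1]) ** 2
    return sorted(people, key=dist2)[:n_cases]


rng = random.Random(0)  # fixed seed so the run is repeatable
noise_areas = [polygon_area(convex_hull(noise_cases(rng))) for _ in range(1000)]
mean_noise_area = sum(noise_areas) / len(noise_areas)
cluster_area = polygon_area(convex_hull(clustered_cases()))

print("mean hull area, random cases:", round(mean_noise_area, 2))
print("hull area, clustered cases:  ", cluster_area)
```

Under these assumptions the compact set of cases has a hull area of 5
grid units, and the hulls of randomly scattered cases are larger on
average. Hull area captures the extent of a set of cases rather than
the regularity of its outline, so it is only a crude proxy for the
shape differences discussed above.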
From these examples it may seem that for a given distribution of
people over
space, a cluster generating process gives rise to patterns on a
map that are regular
compared to the shapes generated by a noise generating process.
Indeed, some scientists use measures of regularity of a cluster's
shape to distinguish a true cluster from a spurious cluster [73].
Also, people are never distributed uniformly over geographic space.
Next, we see how this affects the shape and size of the cluster
detected. In the example I have discussed, I assumed that the cluster
generating process gives rise to a very regularly shaped area of
increased risk (the area within the dotted line). In reality this may
not be true. The area of increased risk may have a very irregular
shape. Some examples of geographic features that can be areas of
increased risk are rivers, roads, groundwater streams, plumes of
aerial pollution or a combination of some of these. We therefore
observe that the shape and size of a cluster cannot be predicted
a priori and is unique to the risk elevation of the cluster
generating process and the spatial
distribution of the people. Another aspect of a cluster
generating process is that the same
process can give rise to different shaped clusters in different
regions of the map. This can
happen even if people are uniformly distributed. The examples
below illustrate this:
From the discussion and the examples, we can conclude that both
the spatial
distribution of people and the shape and size of the area of
increased risk, have an
important bearing on the shape and size of the cluster that is
finally detected. The area of
increased risk or the true cluster may have a very different
spatial configuration from
the cluster that is detected. Parts of the true cluster may be
suppressed or spurious areas
of increased risk may arise. Spurious clusters are created from
the method used to measure the outcome of the process of clustering.
By definition, the method uses a scale and/or shape of measurement
that is dependent on the spatial distribution of people. Since this
distribution is not representative of the underlying area of
increased risk, there is a mismatch between the measurement
shape/scale and the process shape/scale. While
the above examples are with individual level data, the
conclusions drawn can be
generalized to aggregated data. The act of data aggregation
itself could introduce noise
over and above the problem of heterogeneously distributed
people. This is discussed in
the next section.
1.4.4.2 The scale and spatial configuration
of the geographic units that are used to
aggregate data into discrete small areas
In the geography literature the term scale is used to refer to
three different kinds
of scales, two of which are of relevance here. The first is the
phenomenon scale, or the
scale at which a spatial process operates. The second is the
analysis scale, the scale at which data are aggregated for
measurement and analysis [88]. When a phenomenon such as a disease
operates at a given scale, its outcome is often registered as
heterogeneity in disease rates at that scale [89]. Geographers have
often attempted to find the scale at which a process operates [90].
Two well known methods are spectral analysis [65] and variogram
modeling [91]. The latter approach is often used in the health
geography literature. Studies in China have shown that esophageal and
liver cancers operate at scales of less than 150 km while stomach
cancers operate at scales of less than 90 km [91]. In Sweden,
substance related disorders operate at
scales of less than 3 km [92]. Unfortunately, the scale at which a
given process operates is not known in most
geographic studies. A geographer attempts to study a process by
collecting and analyzing
spatial data. This process involves analysis through the
calculation of statistical
summaries of data aggregated at an appropriate scale. When the
process scale is not
known there is every possibility of a mismatch between the
process scale and the analysis
scale. This mismatch or misalignment arises from two sources.
First, geographic data are often aggregated into discrete units for
purposes different from the analyses for which they are being used.
These units of aggregation could differ in shape and scale from the
process scale and shape. As Haining states in "Conceptual models of
spatial variation" [93], "... This might be referred to as
process-induced spatial heterogeneity. This source of heterogeneity
may be compounded in the case of regional data by measuring
attributes through spatial units of different size. This might be
referred to as measurement-induced heterogeneity because it is a
product of how attributes are observed and measured." A second source
of mismatch is the spatial structures that a
disease mapping/cluster detection method imposes on the data.
For example, spatial
filtering [9, 10] and Spatial Scan Statistic based methods
calculate summary statistics by aggregating data along circular
filters. In the geography literature the problems that arise from
spatial mismatch are grouped under the Modifiable Areal Unit Problem,
or MAUP [91, 94]. MAUP phenomena are in turn grouped into two broad
subgroups, the zone effect and the scale effect. The creation of
spurious heterogeneity or destruction of true heterogeneity with
changing scales is a manifestation of the scale effect. If the scale
is kept fixed but the shape of the zones of aggregation is changed,
then the zone effect is likely to be seen. Geographic data aggregated
to administrative units often display both the zone and scale effects
of MAUP. Aggregating data has a smoothing effect on disease rates
[95], and therefore clusters at scales smaller than the scale of
aggregation could be missed when analyses are done using these data.
Conversely, if the scale of aggregation is smaller than the process
scale, then noisy clusters could be detected. A recent study by
Ozonoff et al. [19] demonstrated that when individual level data are
aggregated and a Spatial Scan Statistic cluster search method is used
on the data,
then noise increases with increasing levels of aggregation.
Therefore, analysis and
process scales interact in complex ways to create noisy clusters
and suppress true clusters.
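
The smoothing effect of aggregation, and the difference between the
scale and zone effects, can be seen in a small deterministic example.
The following Python sketch is an illustration under assumed numbers,
not an analysis from this dissertation: an 8 x 8 grid of small areas
with 100 people each, a background of 2 cases per area, and a 2 x 2
area of increased risk with 10 cases per area. The grid size, counts,
cluster positions and function names are all hypothetical.

```python
def make_cases(cluster_cells, n=8, base_cases=2, cluster_cases=10):
    """Case counts on an n x n grid with an assumed area of increased risk."""
    cases = [[base_cases] * n for _ in range(n)]
    for r, c in cluster_cells:
        cases[r][c] = cluster_cases
    return cases


def aggregate_rates(cases, block, pop_per_cell=100):
    """Disease rate in each block x block zone (block must divide the grid size)."""
    n = len(cases)
    rates = []
    for r0 in range(0, n, block):
        row = []
        for c0 in range(0, n, block):
            zone_cases = sum(
                cases[r][c]
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
            )
            row.append(zone_cases / (pop_per_cell * block * block))
        rates.append(row)
    return rates


def peak_rate(cluster_cells, block):
    """Highest zone rate seen after aggregating to the given block size."""
    rates = aggregate_rates(make_cases(cluster_cells), block)
    return max(max(row) for row in rates)


# Same 2 x 2 area of increased risk, placed in two positions.
aligned = [(2, 2), (2, 3), (3, 2), (3, 3)]      # falls inside one 2 x 2 zone
straddling = [(3, 3), (3, 4), (4, 3), (4, 4)]   # split across four 2 x 2 zones

for block in (1, 2, 4, 8):
    print(block, peak_rate(aligned, block), peak_rate(straddling, block))
```

With these numbers the background rate is 0.02 and the rate inside
the area of increased risk is 0.10. Coarsening the zones lowers the
peak rate towards the overall rate of 0.025 (a scale effect), while
at a fixed 2 x 2 zone size the peak rate is 0.10 when the zones line
up with the area of increased risk and 0.04 when they cut across it
(a zone effect).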
We can conclude from our discussions above, that a number of
complex factors
influence the shape, size and the risk elevation of the clusters
that are detected and the
spurious clusters created. These factors are dependent on the
spatial distribution of the
people and the process and analysis scales. It is not possible
to make a priori assumptions
about these factors, and it is certainly not possible to predict
the shape of a noisy cluster a
priori. What approach is then appropriate if the spurious
clusters have to be separated
from the true clusters? The section that follows answers this
question.
1.4.5 Identifying the "noisy" or spurious
components of the pattern
A reasonable cluster detection technique should take into
consideration not only
the risk elevation but also the shape and size of the cluster. I
propose a spatially enabled
computational process that uses these attributes of a cluster,
to identify the signature of
spurious clusters from patterns on a disease map. Earlier, I
introduced the idea that a
pattern is the outcome of a process. Analyzing a pattern or the
components of a pattern
such as individual clusters may yield clues about the underlying
process. A map of
disease patterns represents one realization of the underlying
process. It may not be
possible to draw conclusions on the process that generated the
pattern or components of
the pattern by analyzing just one map. However, if multiple maps
were available, representing multiple realizations of the process,
then analyzing the patterns may yield
clues about the underlying process. A classic example of this
approach can be found in Hägerstrand's paper [96], in which he
simulates multiple maps assuming an underlying process. He then
compares maps of empirical data with the maps that he has simulated
to draw conclusions about the validity with which he represents the
process in his model. Another example can be seen in Diggle [97].
Therefore, if maps were created using a known process, then analysis
of the simulated patterns on the maps would yield clues on the
"signature" of that particular process. Once this "signature" is
known, a pattern could imply (or not imply) the existence of this
process. More specifically, this scheme can help identify a
"signature" for spurious clusters. These signatures can then be used
to distinguish clusters that are spurious from clusters that are
"true", in any given pattern of disease risk. Shape, size and risk
elevation are part of this "signature". For example, the signature of
spurious clusters in Duczmal's [73] method was that these clusters
were large in size and had irregular shapes. The next chapter is
devoted to the method I have developed based on these ideas. The
method is first
described, then tested and validated on simulated data.
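
The general idea of recovering a "signature" from maps simulated
under a known process can be sketched in a few lines of Python. This
is not the method described in the next chapter. It is a minimal
illustration under stated assumptions: a hypothetical 10 x 10 study
area with one dense settlement, a noise generating process that
scatters cases in proportion to population, and a deliberately simple
detection rule (4-connected groups of cells whose rate exceeds 1.5
times the overall rate). The "fill" measure (cells divided by
bounding box area) is an assumed stand-in for shape, and every name
and parameter below is hypothetical.

```python
import random


def simulate_null_cases(pops, total_cases, rng):
    """Noise generating process: cases fall on cells in proportion to population."""
    cells = list(pops)
    weights = [pops[cell] for cell in cells]
    counts = dict.fromkeys(cells, 0)
    for cell in rng.choices(cells, weights=weights, k=total_cases):
        counts[cell] += 1
    return counts


def elevated_components(pops, counts, ratio=1.5):
    """Group 4-connected cells whose rate exceeds ratio times the overall rate."""
    overall = sum(counts.values()) / sum(pops.values())
    hot = {cell for cell in pops if counts[cell] / pops[cell] > ratio * overall}
    seen, components = set(), []
    for start in hot:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            r, c = stack.pop()
            component.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in hot and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components, overall


def describe(component, pops, counts, overall):
    """Size, a crude shape measure and risk elevation of one detected cluster."""
    rows = [r for r, _ in component]
    cols = [c for _, c in component]
    bbox = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    rate = sum(counts[c] for c in component) / sum(pops[c] for c in component)
    return {
        "size": len(component),
        "fill": len(component) / bbox,
        "rate_ratio": rate / overall,
    }


def spurious_signature(pops, total_cases, n_sims=200, seed=0):
    """Pool the descriptors of every cluster found on maps that contain only noise."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_sims):
        counts = simulate_null_cases(pops, total_cases, rng)
        components, overall = elevated_components(pops, counts)
        signature.extend(describe(comp, pops, counts, overall) for comp in components)
    return signature


def share_at_least(signature, key, value):
    """Fraction of simulated spurious clusters whose descriptor is >= value."""
    return sum(1 for d in signature if d[key] >= value) / len(signature)


# Hypothetical study area: a dense settlement in one corner, sparse elsewhere.
pops = {(r, c): 500 if r < 3 and c < 3 else 50 for r in range(10) for c in range(10)}
signature = spurious_signature(pops, total_cases=300)
print(len(signature), "spurious clusters simulated")
print("share with 3 or more cells:", share_at_least(signature, "size", 3))
```

A cluster observed on a real map could then be placed against this
reference set, for example by asking how often noise alone produces a
cluster at least as large or at least as elevated. Because every
simulated map here contains only noise, all of the clusters collected
are spurious by construction.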
1.4.6 Why use size, shape and rate
The reason I add the dimensions of size and shape, in addition
to rate, is to
characterize the reference space in which spurious clusters are
located. I know from
theory (as discussed in this chapter) that spurious clusters
arise differently to the extent that the numbers of people at risk
in relation to the overall relative risk of the disease
exist differ across the space. When people are distributed
uniformly in space, the average
number and average size of spurious clusters in that space can
be determined from
theory. As Schinazi [98] shows, deterministic statistics can be
used to determine the chance of finding a given number of clusters
with a rate higher or lower than the expected
rate. However, when people at risk are distributed non-uniformly
in space, the equivalent
number is more difficult to determine directly from theory. The
same theory still applies;
it is just more difficult to implement in the case of
non-uniform distribution of people at risk. For this reason, I use
Monte Carlo simulation to discover the region of rate, size and shape space
in which typical spurious clusters lie, given the particular
distribution of people at risk
and the particular overall relative risk of the disease in the
study area in question. In his
seminal paper King [85] states The mathematics of stochast