Geographically masking health data to preserve confidentiality

* Correspondence to: Marc P. Armstrong, Department of Geography and Program in Applied Mathematical and Computational Sciences, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242, U.S.A. E-mail: [email protected]

Contract/grant sponsor: Obermann Center for Advanced Studies at the University of Iowa

CCC 0277-6715/99/050497-29$17.50. Copyright © 1999 John Wiley & Sons, Ltd. Received April 1997; Accepted April 1998

STATISTICS IN MEDICINE

Statist. Med. 18, 497-525 (1999)

GEOGRAPHICALLY MASKING HEALTH DATA TO PRESERVE CONFIDENTIALITY

MARC P. ARMSTRONG1*, GERARD RUSHTON2 AND DALE L. ZIMMERMAN3

1Department of Geography and Program in Applied Mathematical and Computational Sciences, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242, U.S.A.

2Department of Geography, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242, U.S.A.

3Department of Statistics and Actuarial Science, MacLean Hall, The University of Iowa, Iowa City, IA 52242, U.S.A.

SUMMARY

The conventional approach to preserving the confidentiality of health records aggregates all records within a geographical area that has a population large enough to ensure prevention of disclosure. Though this approach normally protects the privacy of individuals, the use of such aggregated data limits the types of research one can conduct and makes it impossible to address many important health problems. In this paper we discuss the design and implementation of geographical masks that not only preserve the security of individual health records, but also support the investigation of questions that can be answered only with some knowledge about the location of health events. We describe several alternative methods of masking individual-level data, evaluate their performance, and discuss both the degree to which we can analyse masked data validly and the relative security of each approach, should anyone attempt to recover the identity of an individual from the masked data. We conclude that the geographical masks we describe, when appropriately used, protect the confidentiality of health records while permitting many important geographically-based analyses, but that further research is needed to determine how the power of tests for clustering or the strength of other associative relationships is adversely affected by the characteristics of different masks. Copyright © 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

The past two decades have seen large increases in the size and number of computer-based systems that contain information on the health status of individuals, including vital statistics, disease registries, and other health records.1 These data are used for many purposes and support the routine surveillance of the health status of communities as well as the investigation of relationships between environmental exposures and disease. Much of this information, however, is confidential; it is furnished by people with the assurance that it not be made available in a form from which the identity of any one of them can be determined.2 Access to such information normally takes place under one of three modes. The first is when a data user is a member of the organization that is the custodian of the data. Here the user may access the data in all of its detail but must ensure that when results of analyses are disclosed, individual-level information is not released, nor is any other information from which one can infer individual-level data. The second mode is when a data user who does not belong to the custodial organization applies for and is granted privileged access to the information. This may occur, for example, after a user guarantees that individual records will not be released and that the information will be used only for purposes approved by an oversight committee. Such an approach is used by the National Death Index program established by the National Center for Health Statistics.3 The third mode of access is when information is released to the general public in a form that protects the security of all individual records.

Increasingly, it is becoming clear in the field of public health that health information aggregated to political or other administrative areas is inadequate to address many public health concerns. From the accident risk of particular environments, to the hazards of living close to hazardous waste sites,4,5 to the exposure risk from lead associated with urban highways, it is clear that geographical analyses must investigate relationships in areas that are defined far more flexibly than is possible using spatially aggregated data for current political and administrative entities. We argue that it is necessary to develop new methods because current practice, in which individual-level information is aggregated to pre-existing administrative or political units, often destroys information needed for geographical analysis. We are particularly concerned with the identification of procedures that preserve the confidentiality of individual records, and that also preserve, to the maximum degree possible, the geographic properties of the data. The purpose of this paper, therefore, is to describe and evaluate alternative approaches to encoding the geography of health records (which we call geographic masking) that protect the confidentiality of individuals but which also ensure the possibility of valid geographical analyses of the data. A central motivation for our work is that better geographic coding of health events will improve the quality of public health surveillance and enable the investigation of associations between adverse environmental conditions and health.

2. CONFIDENTIALITY AND MICRO-LEVEL INFORMATION

There is a strong demand by members of the research community for access to microdata.6 As the desire for such information increases, and as the number of organizations that routinely maintain individual-level data increases, problems with the release of information that could be associated with an individual are certain to increase as well. Although there are methods that reduce the probability of disclosure of individuals, while preserving the characteristics of the data essential for valid inference,7,8 this literature does not address adequately issues of preserving the spatial properties of data that influence risk of disclosure.9,10 Duncan and Pearson briefly discuss the geographical aspects of confidentiality, but end their discussion by stating, without further elaboration, that ‘the use of refined geographic identifiers’ is required to ‘achieve the goal of identifying and explaining human behavior at both the aggregate and individual levels’ (reference 6, p. 221). This geographic aspect of confidentiality has increased in importance during the last decade because of advances in computing technology. Though the principles were established long ago for matching address fields on administrative records to an address-range on a map in order to determine approximate geographical location,11 the process remained embedded in special-purpose mainframe computer systems and real applications were few. More recently, microcomputer-based geographic information system (GIS) software, along with the availability of low cost street centreline files for the entire nation,12-14 has made it possible for individuals with little or no training to determine the location of large numbers of health records for any area in the U.S. These locations essentially reveal the identity of individuals when they are linked to other widely-available information sources.

Data that describe health events, such as vital statistics, diseases and clinic and hospital use, are routinely made available for geographic areas. In the U.S., the most common areas used are the county, census tract and five-digit Zip code. The county is the unit used most frequently (see, for example, the Area Health Resources File) by national health institutions such as the Centers for Disease Control and Prevention (CDC), with smaller areas used, more typically, by state and local health organizations. In the U.K. the enumeration district or unit postcode is often used (reference 15, p. 847). Though this method of geographic encoding of health events continues in many countries, individual agencies may impose restrictions on the level of geography that can be reported. In the U.S., for example, the National Center for Health Statistics (NCHS) reported in 1978 that for all ‘micro data which are released outside of NCHS, geographic identification must be deleted for all areas below the State level which contain fewer than 250,000 inhabitants in the most recent official population Census’ (reference 7, p. 53). More recently, NCHS revised this criterion noting that ‘Geographic places and characteristics of areas with less than 100,000 people are not to be identified’ (reference 8, p. 33; reference 16, p. 1971). This policy, though successful at limiting disclosure, may preclude the release of information that directly supports conclusions reached by researchers about any local, geographic health effects of, for example, environmental contamination. National disease atlases produced by CDC, similarly, are limited in the geographic detail they reveal.17-19

The identification of disease clusters serves as an example of how the current policy of limiting disclosure by spatial aggregation of disease data to large areas confounds one of the principal purposes for which disease data are collected in the first place. When data are spatially aggregated to large areas, the ability of researchers to detect disease clusters or to investigate suspected relationships between environmental exposures and disease events is affected in four ways. First, absolute and relative locations within the geographical extent of each area are unobservable, and as a consequence, one cannot perform tests of clustering, except for those designed to operate specifically on data aggregated to areas.20 The second effect arises from the geographic scale of the aggregation with respect to the geographic scale of the clusters. Waller (reference 21, p. 772) determined that the power of the statistical cluster methods he investigated increased ‘when the level of aggregation results in cells that are small enough to contain clusters without including areas with the background incidence rate’ and that the aggregation level used in an analysis limits the size of clusters that one could detect. Waller identified the necessary, but not the sufficient, conditions under which one can use spatial aggregation with maximum statistical power to identify a disease cluster correctly. The third effect, which is separate from the size of the aggregated area, arises from the shape and the placement of the area in relation to the distribution of the disease. The issue here is the spatial conformity of the boundaries of the aggregation area with those of the cluster. The fourth effect arises from the fact that exposure assessment data are generally collected for different areas than health and demographic data. Accurate analyses might be conducted if the health data were spatially encoded to the boundaries of the areas with common levels of environmental exposure. Instead of geographically encoding the health data to the areas defined in terms of environmental exposure, current approaches use methods of areal interpolation that translate data to common areas.22,23 These interpolation methods, however, introduce errors into the data that are propagated in unknown ways in subsequent analyses.24


The logical conclusion from this previous work is that to achieve the greatest power, not only must the level of aggregation result in cells small enough to contain clusters, but their boundaries must be reasonably consistent with boundaries that encompass the cluster without including large areas with background incidence rates. When a spatial aggregation divides space into mutually exclusive territories, however, there is a substantial possibility that any given disease cluster might straddle two or more spatial units.10,25 Such a likely possibility means that the conclusion one reaches depends on the specific boundaries of the areas for which one collects the data. If the data were made available for one set of regions, the conclusion might be that disease clustering exists; for a different set of regions, but for the same original data, disease clustering may not exist! This effect is well-known and constitutes one component of the modifiable areal unit problem.26-28

Thus, the consideration of both accuracy and efficiency argues for maintaining health data in a spatially disaggregated form so that it can be aggregated, flexibly, to any spatial unit of interest or any set of small areas that share a common level of environmental exposure.
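To make the boundary dependence concrete, the following sketch re-aggregates the same point events to two sets of equal-width zones that differ only in where the boundaries fall. It is a one-dimensional simplification, and the event positions, the zone width and the function name `zone_counts` are hypothetical choices made for illustration.

```python
from collections import Counter

def zone_counts(positions, width, offset=0.0):
    """Count events per zone for zones [offset + k*width, offset + (k+1)*width)."""
    return Counter(int((x - offset) // width) for x in positions)

# Hypothetical events along a transect: six clustered near x = 10, four scattered.
events = [8, 9, 9.5, 10.5, 11, 12, 30, 55, 71, 93]

# Boundary at x = 10 splits the cluster between two zones.
split = zone_counts(events, width=10, offset=0.0)
# Shifting every boundary by 5 places the whole cluster inside one zone.
whole = zone_counts(events, width=10, offset=5.0)

largest_split = max(split.values())   # 3
largest_whole = max(whole.values())   # 6
```

With the first zoning no zone holds more than three events, whereas the second zoning reports six events in a single zone, although the underlying points are identical. Only data kept in disaggregated form allow this kind of re-aggregation.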

The effect of making health event information available only for large spatial aggregations is that we cannot detect clusters unless the boundaries of the areas coincide at least roughly with the spatial units for which the data have been encoded and their sizes correspond approximately with the size of the event cluster in question. These requirements, however, are clearly not met in many disease cluster investigations, and, consequently, one should not be surprised that a majority of such investigations to date have resulted in ambiguous, if not negative, results.29 This limitation has been widely recognized ever since the pioneering work of Choynowski,30 although efforts to deal with it have usually faltered over the absence of alternative methods for spatially encoding health data. Openshaw (reference 31, p. 89) argues that in the search for disease clusters, ‘the analyst should not be making scale specific pattern assumptions that are then imposed on the analysis process from a position of ignorance’. Aggregating point data to areas does just this unless the analyst knows of aspects of the disease process that differentially change the expected incidence of the disease between the areas used to aggregate the data. For example, the areas for which the aggregations are made may be administrative jurisdictions in which preventive health services are developed to different degrees in different jurisdictions. In these circumstances, spatial aggregation will often result in data for which area-based disease cluster tests will have more power than point-based tests, as described and illustrated by Oden et al.32 These problems in investigating disease clusters using spatially aggregated disease data also arise in addressing other problems. Investigations of small-area variations in surgery rates, for example, are similarly sensitive to the definitions of the areas for which data are collected. Clearly, we need alternative approaches to preserving the confidentiality of health records. In the next section we specifically consider geographic masks as a tool for protecting the privacy of individual records and limiting disclosure. A recent paper considers other types of masks and methods for protecting privacy and limiting disclosure.33

3. ALTERNATIVE APPROACHES TO GEOGRAPHICALLY MASKING PUBLIC HEALTH INFORMATION

One can develop several approaches to mask the specific location of individual health events. The approaches we describe are meant to serve as a framework for the development of purpose-specific masks rather than as an exhaustive enumeration of all possibilities. The best approach depends on the purpose of the data user as well as the degree of risk of disclosure that the data custodian wishes to tolerate.



Duncan and Pearson6 introduced, and Cox34 further developed, a framework for describing and comparing disclosure-limitation and microdata masking procedures. In this framework, the true and complete data file is represented by an n by p matrix X. Each of the n rows consists of data on p attributes for a single individual. For geographic health event data, variables that identify the geographic co-ordinates of the event are among the columns of X. The goal of the data custodian is to release a modified, or ‘masked’ version of X that preserves useful information for legitimate uses of the data but keeps the risk that a ‘data spy’ can infer the value of an illegitimate target, Y, below an acceptable level. Two important illegitimate targets are the identity of a particular individual and the identity of any individual in the database.

A large number of general masking methods have been proposed. We classify these methods into three main categories: record-transforming masks, attribute-transforming masks, and displacing masks. Record-transforming masks correspond to operations on the rows of X (the ‘records’) such as aggregation across similar records, suppression of certain records, sampling of records, and multiplication of records by random noise. Attribute-transforming masks operate on the columns of X and consist of such strategies as aggregation across commensurate attributes, suppression of certain attributes, and multiplication of attributes by random noise. Displacing masks are imposed by adding stochastic or deterministic noise to X or to a portion of X.

To introduce the geographic masking of health event data, we assume that X has been stripped of its obvious formal identifiers such as name and social security number, and that it has the following partitioned form:

$$X = [U, V, W, Z]$$

where: U and V are geographic co-ordinates; W contains explicit health variables (for example, presence or absence of disease, lead concentration level in blood); and Z contains other relevant attributes such as age, gender or socio-economic status.

By focusing herein on geographical masking only, we make the following assumptions about the non-geographical information in X. First, we assume that sufficient masking of the explicit health variables (W) to preserve confidentiality so thoroughly distorts the data that their usefulness is destroyed. For example, if W consists of a single binary variable such as disease incidence (0 = absent, 1 = present) and the data for release is to preserve this level of classification, then the only way to mask W is to change some of the 0s to 1s and vice versa. Obviously, this is undesirable. Second, we assume that the information in Z is sufficiently non-individual specific (or possibly has been made so by the initial application of a non-spatial mask) that we need not mask it.

In what follows, we characterize geographic masks as matrix masks, that is, as matrix-valued functions of X (and possibly of some additional information). Doing so provides a concise unifying framework for all the masks we consider, permits a clearer understanding of their similarities and differences, and indicates the type of computations required to implement the masks.

The most general representation of a geographic mask that we consider is

$$M = f(X, E) \qquad (1)$$

where: M is an $m \times q$ matrix ($m \le n$; $q \le p$) of masked data for release to the data user; E is an $n \times r$ matrix of individual-specific information external to the information in X; and f is a function that maps $n \times (p + r)$ matrices to $m \times q$ matrices.



In this representation we permit the mask to depend functionally not only on the data matrix X, but also on other information, E, available to the custodian and presumably also to the data user, and which may impact the confidentiality of the data. An important example of such external information is population density, particularly if it were available for sub-classes of variables that happen to be in Z (for example, age-specific or gender-specific population density).

The majority of geographic masks we consider are special cases of a mask less general than that specified by equation (1), namely

$$M = [g(X, E), W, Z] \qquad (2)$$

where g is a function that maps $n \times (p + r)$ matrices to $n \times 2$ matrices of geographic co-ordinates. This family of masks, which we call isomasks, differs from the more general case by leaving W and Z untouched and replacing the original co-ordinates of each record with a new pair of co-ordinates. The result is a matrix of exactly the same dimensions as X. Furthermore, in many important cases

$$g(X, E) = A(U, V)B + C \qquad (3)$$

where A, B and C are matrices that implement a particular mask; A effects a record-transforming mask, B effects an attribute-transforming mask, and C effects a displacement mask.

That is, we obtain the new co-ordinates by operating linearly on the rows and columns of (U, V). In such cases, the geographic mask is a special case of Duncan and Pearson’s general linear matrix mask. We now consider several specific geographic masks.
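As a minimal sketch of equation (3), assuming the co-ordinates (U, V) are held as a list of n rows of two numbers, the following pure-Python code evaluates A(U, V)B + C. The helper names (`matmul`, `identity`, `linear_mask`) and the sample co-ordinates are illustrative assumptions and do not come from the framework itself.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def linear_mask(A, UV, B, C):
    """Return A (U, V) B + C for an n x n A, n x 2 UV, 2 x 2 B and n x 2 C."""
    product = matmul(matmul(A, UV), B)
    return [[x + c for x, c in zip(row, c_row)] for row, c_row in zip(product, C)]

# Hypothetical co-ordinates for three records.
UV = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]

# A pure displacement mask: A and B are identities, C carries the shift.
C = [[5.0, -2.0] for _ in UV]
translated = linear_mask(identity(3), UV, identity(2), C)

# A pure attribute-transforming mask: B rescales both co-ordinate columns.
zero = [[0.0, 0.0] for _ in UV]
scaled = linear_mask(identity(3), UV, [[2.0, 0.0], [0.0, 2.0]], zero)
```

Choosing A, B or C as something other than an identity or zero matrix switches on the record-transforming, attribute-transforming or displacement component respectively.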

3.1. Individual and Concatenated Affine Transformations

We can subject data collected at discrete locations to a family of point transformations that move these locations deterministically to a new set of locations. For example, we can apply affine transformations (translation, scale and rotation) to geographic co-ordinates and these are special cases of the general linear matrix mask defined by equation (3).

3.1.1. Displacement using Translation

The translation approach shown in Figure 1 simply involves the displacement of each point from its original location by a fixed increment, an approach suggested by Aldrich and Krautheim.35 For example, we move each point $(u_i, v_i)$ to $(u_i + \Delta u, v_i + \Delta v)$, where $\Delta u$ and $\Delta v$ are displacement constants. We can express this as a linear matrix mask as follows:

$$M = [U + \Delta u \mathbf{1}_n, V + \Delta v \mathbf{1}_n, W, Z]$$

where $\mathbf{1}_n$ is an $n \times 1$ vector of ones. If we do not provide $\Delta u$ and $\Delta v$ to the data recipient, then the transformed data set does not support the detailed identification of individuals by, say, their home address location. In addition, the displacement approach preserves several important properties of the original data set including overall density, relative density and directional information.
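The translation mask can be sketched in a few lines. The event locations and offsets below are hypothetical, and the function names are our own. The second helper is included only to show that interpoint distances, and hence density and directional relationships, survive the shift.

```python
import math

def translate_mask(points, du, dv):
    """Shift every (u, v) pair by the same undisclosed offsets (du, dv)."""
    return [(u + du, v + dv) for u, v in points]

def pairwise_distances(points):
    """Distances between all pairs of points, listed in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical event locations in arbitrary map units.
events = [(2.0, 3.0), (5.0, 7.0), (6.0, 1.0)]
masked = translate_mask(events, du=120.0, dv=-45.0)

# Every interpoint distance is unchanged up to floating-point rounding.
unchanged = all(
    math.isclose(a, b)
    for a, b in zip(pairwise_distances(events), pairwise_distances(masked))
)
```

Because every distance is preserved, a recipient who learns the true location of a single masked event could recover both offsets, which is one reason to keep them undisclosed.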



Figure 1. A translation isomask

3.1.2. Change of Scale

Another affine transformation is a scale change in which we transform each point $(u_i, v_i)$ to $(cu_i, cv_i)$, where c is a scaling constant (Figure 2). We can express this as a linear matrix mask as follows:

$$M = [cU, cV, W, Z].$$

The scale transformation in this form changes both the distance from the origin of the co-ordinate system (thus executing a displacement) as well as the distances between point locations.
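A short sketch of the scale mask, using two hypothetical points and a hypothetical scaling constant, shows the effect on interpoint distance: the separation of any two points is multiplied by c.

```python
import math

def scale_mask(points, c):
    """Multiply both co-ordinates of every point by the scaling constant c."""
    return [(c * u, c * v) for u, v in points]

# Two hypothetical events that are 5 units apart.
events = [(2.0, 3.0), (5.0, 7.0)]
masked = scale_mask(events, c=2.0)

# Ratio of masked to original separation equals the scaling constant.
ratio = math.dist(*masked) / math.dist(*events)
```

Distance-based statistics computed on such data are therefore rescaled by the unknown c, although ratios of distances are unaffected.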

3.1.3. Rotation

A third affine transformation obtains by rotating each point by a fixed angle $\theta$ ($0 \le \theta \le 2\pi$) about a pivot point. We can define a linear matrix mask for rotation as

$$M = \left[ (U, V) \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, W, Z \right].$$

This mask uses the origin of the co-ordinate system as the pivot point around which we rotate events. An alternative approach, rotation about an arbitrary location (Figure 3), requires, first, that we translate all of the points by a constant amount given by the distance between the chosen pivot point and the origin of the co-ordinate system. For example, we might move the centroid of the points that require masking to the origin of the co-ordinate system and translate all other points by that same amount. We rotate these translated points, and then apply an inverse translation to restore them to their original (general) locations.
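The three steps of rotation about an arbitrary pivot (translate, rotate, inverse translate) can be sketched as follows. The function names and the square of sample points are hypothetical, and the rotation uses the row-vector convention of the matrix given above.

```python
import math

def centroid(points):
    """Arithmetic mean of the co-ordinates, used here as the pivot."""
    n = len(points)
    return (sum(u for u, _ in points) / n, sum(v for _, v in points) / n)

def rotate_about(points, theta, pivot=(0.0, 0.0)):
    """Rotate points by theta radians about the pivot."""
    pu, pv = pivot
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for u, v in points:
        du, dv = u - pu, v - pv          # translate the pivot to the origin
        # (du, dv) times [[cos, sin], [-sin, cos]]
        ru = du * cos_t - dv * sin_t
        rv = du * sin_t + dv * cos_t
        rotated.append((ru + pu, rv + pv))  # inverse translation
    return rotated

# Hypothetical events at the corners of a square, rotated a quarter turn
# about their centroid (1, 1); each corner moves to the next corner.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
rotated = rotate_about(square, math.pi / 2, pivot=centroid(square))
```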

3.1.4. Concatenation of Isomasks

While one can use each of the isomasks described in this section solely and independently, we can combine a set of such masks to introduce additional levels of uncertainty about the true location of individual health events. In practice, we can concatenate a set of masking operations, such as translate to the origin $(-u, -v)$, rotate by $\theta$, inverse translate $(u, v)$ and scale by c, into a single transformation matrix.36 We then multiply each point by this concatenated matrix to increase the efficiency of this approach.
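One common way to build such a concatenated matrix, assumed here because the text does not spell out the representation, is to write each point as a row vector (u, v, 1) in homogeneous co-ordinates so that translation also becomes a matrix product. The sketch below composes the four operations named above into one 3 x 3 matrix; all function names and numerical values are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def translation(du, dv):
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [du, dv, 1.0]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(c):
    return [[c, 0.0, 0.0], [0.0, c, 0.0], [0.0, 0.0, 1.0]]

def concatenate(*matrices):
    """Compose row-vector transformations in the order they are applied."""
    result = matrices[0]
    for m in matrices[1:]:
        result = matmul(result, m)
    return result

def apply_mask(points, matrix):
    """Apply a 3 x 3 homogeneous matrix to a list of (u, v) points."""
    return [tuple(matmul([[u, v, 1.0]], matrix)[0][:2]) for u, v in points]

# Translate the pivot (1, 1) to the origin, rotate a quarter turn,
# translate back, then double the scale: one matrix for all four steps.
pivot_u, pivot_v = 1.0, 1.0
combined = concatenate(
    translation(-pivot_u, -pivot_v),
    rotation(math.pi / 2),
    translation(pivot_u, pivot_v),
    scaling(2.0),
)
masked = apply_mask([(2.0, 1.0)], combined)  # approximately [(2.0, 4.0)]
```

The matrix is built once and then applied to every record, so the cost per point is a single small matrix product regardless of how many operations were concatenated.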



Figure 2. A scale isomask

Figure 3. A rotation isomask

3.1.5. Incorporating a Stochastic Mechanism in the Affine Transformations

We can choose input parameters for the affine masks described above arbitrarily or according to some random mechanism. For example, we could implement a stochastic translation mask by taking $\Delta u$ and $\Delta v$ as independent uniform random variables on the interval $(-\delta, \delta)$, where $\delta > 0$ is suitably chosen. A sensible way to construct a random scale change is to take c as a random variable for which $P(1/\kappa < c \le 1) = P(1 < c < \kappa) = 1/2$ and which is uniformly distributed on each of $(1/\kappa, 1)$ and $(1, \kappa)$, where $\kappa > 0$ is suitably chosen. Finally, we could introduce randomness into the rotation mask by taking $\theta$ to be a uniform random variable on the interval $(0, 2\pi)$.
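A minimal sketch of this random mechanism follows. The function name is hypothetical, and the sketch additionally assumes that the scale bound is greater than 1 so that both intervals for c are non-empty. A seeded generator is passed in so that a custodian could reproduce the draw.

```python
import math
import random

def draw_affine_parameters(delta, kappa, rng):
    """Draw (du, dv, c, theta) for stochastic translation, scale and rotation.

    Assumes delta > 0 and kappa > 1 (the latter is an assumption of this sketch).
    """
    if delta <= 0 or kappa <= 1:
        raise ValueError("need delta > 0 and kappa > 1")
    du = rng.uniform(-delta, delta)
    dv = rng.uniform(-delta, delta)
    # Shrink or enlarge with equal probability, uniformly within each interval.
    if rng.random() < 0.5:
        c = rng.uniform(1.0 / kappa, 1.0)
    else:
        c = rng.uniform(1.0, kappa)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return du, dv, c, theta

rng = random.Random(1999)  # hypothetical seed held by the data custodian
du, dv, c, theta = draw_affine_parameters(delta=500.0, kappa=1.5, rng=rng)
```

Splitting the draw for c into two equally likely intervals means that enlargement and reduction are equally probable, which a single uniform draw on $(1/\kappa, \kappa)$ would not achieve.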

3.2. Random Perturbation

A second general technique for geographic masking is random perturbation, in which we displace each point by a randomly determined amount, and in a randomly determined direction, specific to its original location. We can represent this mask as

$$M = [U + C_1, V + C_2, W, Z]$$

where the components of $C_1$ and $C_2$ are random variables; thus it, like the stochastic displacement mask described in Section 3.1.5, is a linear isomask. It differs, however, because it allows the amount and direction of displacement to vary among points. Thus, unlike rigid affine transformations, random perturbation does not preserve, exactly, the relative locations and orientations of points (Figure 4).

Figure 4. Health events (grey squares) are shown in their original locations on the left side. On the right side, each original location has been randomly perturbed and appears in a new location (black square). In this example, the possible magnitude of displacement, indicated by the size of the circle, is held constant

One might take the point-specific displacements of a random perturbation as independent and identically distributed random variables, but this need not be so, and, indeed, for some purposes is undesirable. For example, it may be beneficial to displace points in sub-regions of lower population density by larger amounts than points in high density areas in order to equalize risk of disclosure in low density areas. This requires displacements that are not identically distributed, but rather have lengths whose expectations are inversely related to site-specific population density. Alternatively, one might allow displacements to be identically distributed but require that displacements corresponding to nearest neighbours in the original pattern be highly correlated. This would reproduce closely the nearest-neighbour distances of the original points but would not so closely reproduce other interpoint distances. While such a mask is suitable for a user who wishes to test for clustering using a test statistic that is a function of a pattern’s nearest neighbour distances only, it may be less suitable for other users.

We must choose a particular distribution, or family of distributions, for the random displacements. Two potentially useful families are the bivariate uniform and bivariate normal distributions. For both families, we can link the level of perturbation (and the preferential direction of perturbation, if any) directly to a single parameter: the range in the uniform case, and the variance in the normal case. Two desirable features of the uniform distribution are: (i) it imposes an upper bound on the amount of displacement for each point; and (ii) it can easily accommodate irregularly-shaped regions within which the point must lie after perturbation. We might find this necessary, for example, if irregularly-shaped geographic features (for example, lakes, steep slopes that preclude residential development) exist within a study area. In contrast, a normal distribution could, with positive probability, result in one or a few extremely large displacements, and, due to the inherent ellipticity of its equiprobability contours, it cannot easily accommodate irregularly-shaped perturbation regions. Note, however, that the normal distribution allows one to impose correlation between neighbouring displacements in a simpler way than with a uniform distribution, owing to the equivalence of zero correlation and independence for the normal distribution.
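The two families can be contrasted with a short sketch. We assume, for illustration only, that the uniform displacement is drawn over a disc (as the circles in Figure 4 suggest), that the normal displacement has independent components with a common standard deviation, and that a radius inversely proportional to population density is one acceptable way to make expected displacement inversely related to density. All function names and the constant k are hypothetical.

```python
import math
import random

def radii_from_density(densities, k):
    """Hypothetical rule: maximum displacement inversely proportional to density."""
    return [k / d for d in densities]

def perturb_uniform_disc(points, radii, rng):
    """Displace each point uniformly within a disc of its own radius."""
    perturbed = []
    for (u, v), r in zip(points, radii):
        # The square root makes the draw uniform over area, not over radius.
        dist = r * math.sqrt(rng.random())
        angle = rng.uniform(0.0, 2.0 * math.pi)
        perturbed.append((u + dist * math.cos(angle), v + dist * math.sin(angle)))
    return perturbed

def perturb_normal(points, sigma, rng):
    """Displace each point by independent normal noise; displacement is unbounded."""
    return [(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma)) for u, v in points]

rng = random.Random(42)
events = [(10.0, 10.0), (50.0, 20.0)]          # hypothetical locations
radii = radii_from_density([1.0, 4.0], k=2.0)  # sparse area gets the larger radius
bounded = perturb_uniform_disc(events, radii, rng)
unbounded = perturb_normal(events, sigma=1.0, rng=rng)
```

Under the disc assumption the expected displacement is two-thirds of the radius, so any rule that scales the radius also scales the expected displacement by the same factor.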



3.3. Aggregation

Though we have noted earlier that we can mask individual health records by making them available as aggregates for regions, this can be done in two ways. First, we can obtain aggregated information by enumerating the total that exists within a region; this is called areal aggregation.10 A second approach (point aggregation) uses a single location to represent a defined subset of the original locations. In the latter case, we could represent regions by their geographic centroids, or we could compute other locations that have a defined relationship to the original locations. We consider this case first. In contrast with affine transformation and random perturbation masks, most useful types of aggregation cannot be represented as linear functions of geographic co-ordinates as in equation (3); hence they are non-linear masks.

3.3.1. Point Aggregation

When we assign individual records to a point location specified by a geographic co-ordinate, such as when we address-match health records to locations based on their street addresses, we can aggregate them, and thus mask them, to a new set of symbolic, or surrogate points that represent the location of several individual-level events. For example, if there are 1000 birth defects in an area, we could represent each by a symbol on a map. One could deduce an address from such a map, however, and as a consequence the identity of an individual would be disclosed. Point aggregation reduces this number, say to 200, and each symbol then represents some subset of the original distribution. One approach to the selection of symbolic locations that represent aggregates of the individual records is to compute a new set of locations such that we minimize the sum of distances from the original points to the closest symbolic locations. Methods for accomplishing this are known as location-allocation methods and we can compute locations that are optimum with respect to a large number of different objectives.37 Of particular interest for spatial masking of health data are the equal workload problem and the p-median problem. In the equal workload problem, we compute locations to minimize the average distance to their closest health events while ensuring that each location represents an approximately equal number of health events.38 In the p-median problem, we compute locations to minimize the average distance to their closest health events. We represent both of these as cases of the geographic mask given by equation (2), with a g-function determined by the particular location-allocation algorithm used. Recently, Lam and Liu39 have suggested the use of Peano curves, in place of location-allocation models, as space filling curves that one can use to define minimum workload regions that are represented by points.
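As a rough illustration of a p-median mask, the sketch below searches exhaustively over a small set of candidate sites and replaces each event with its nearest selected site. This is not a location-allocation algorithm from the literature cited above: exhaustive search is feasible only for very small problems, and the candidate sites, the function name and the sample events are hypothetical. Candidates are supplied separately from the events so that no surrogate point coincides with a true event location.

```python
import itertools
import math

def p_median_mask(points, candidates, p):
    """Pick p candidate sites minimizing total distance to the nearest site,
    then replace every event by its nearest selected site."""
    best_cost, best_sites = math.inf, None
    for sites in itertools.combinations(candidates, p):
        cost = sum(min(math.dist(pt, s) for s in sites) for pt in points)
        if cost < best_cost:
            best_cost, best_sites = cost, sites
    masked = [min(best_sites, key=lambda s: math.dist(pt, s)) for pt in points]
    return masked, best_cost

# Hypothetical events forming two groups, and three candidate symbolic sites.
events = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = [(0.0, 0.5), (10.0, 0.5), (5.0, 0.5)]
masked, total_distance = p_median_mask(events, candidates, p=2)
```

Minimizing the total distance is equivalent to minimizing the average distance because the number of events is fixed. An equal workload variant would add a constraint on how many events each site may represent.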

Microaggregation40 and blurring41 are, when applied to geographic co-ordinates, types of point aggregation. In this context, microaggregation involves carrying out the following two steps for each health record: (i) identify a set of records which are geographically similar to the record; (ii) replace its co-ordinates with a composite co-ordinate computed from the geographically similar records. There is considerable flexibility in how geographic similarity may be defined and how the composite is computed. Blurring is a special case, which, in this context, consists of: ranking the records’ u-co-ordinates from smallest to largest; forming the average u-co-ordinate of each successive group of ten (say) ranked records; and replacing each original u-co-ordinate with its group average. The same procedure is applied independently to the v-co-ordinates also. An alternative to computing average co-ordinates for non-overlapping groups is to form moving averages from a sliding interval of fixed length or containing a fixed number of records.
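The blurring procedure just described can be sketched directly. The function names and sample points are hypothetical, and the sketch assumes that when the number of records is not a multiple of the group size, the final group is simply smaller.

```python
def blur_coordinate(values, group_size=10):
    """Replace each value by the mean of its rank-ordered group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    blurred = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]   # indices of one ranked group
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            blurred[i] = mean
    return blurred

def blur_points(points, group_size=10):
    """Blur the u- and v-co-ordinates independently of one another."""
    us = blur_coordinate([u for u, _ in points], group_size)
    vs = blur_coordinate([v for _, v in points], group_size)
    return list(zip(us, vs))

# Hypothetical events, blurred in groups of two to keep the example small.
events = [(1.0, 10.0), (4.0, 30.0), (2.0, 20.0), (3.0, 40.0)]
blurred = blur_points(events, group_size=2)
# blurred == [(1.5, 15.0), (3.5, 35.0), (1.5, 15.0), (3.5, 35.0)]
```

Because the two co-ordinates are grouped independently, each blurred point takes one of a limited set of u values and one of a limited set of v values, which is why the masked pattern falls on a lattice.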



Figure 5. Blurring of 100 points in a unit square: (a) original pattern; (b) blurred pattern

Microaggregation and other types of point aggregation can be represented as isomasks satisfying equation (2). For example, blurring as described in the previous paragraph is a case of equation (2) with

$$g(X, E) = \left(P_1^{-1} J P_1 U,\; P_2^{-1} J P_2 V\right)$$

where $P_1$ and $P_2$ are permutation matrices that arrange the u-co-ordinates and v-co-ordinates, respectively, in increasing order and $J$ is the block-diagonal matrix

$$\mathrm{diag}\left[\tfrac{1}{10}\mathbf{1}_{10}\mathbf{1}_{10}',\ \tfrac{1}{10}\mathbf{1}_{10}\mathbf{1}_{10}',\ \ldots,\ \tfrac{1}{10}\mathbf{1}_{10}\mathbf{1}_{10}'\right].$$

Figure 5 illustrates an application of blurring to 100 points in the unit square. Note that blurring yields a map of points that lie, borrowing terminology from Bellhouse,42 on an irregular lattice aligned in both directions.
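As a check on the matrix form of blurring, the following sketch builds a permutation matrix P and a block-diagonal averaging matrix J for a single co-ordinate and compares the product with direct group averaging. The sample size of 20, the seed and the variable names are assumptions of the sketch; the inverse of a permutation matrix is its transpose.

```python
import numpy as np

n, m = 20, 10                         # 20 hypothetical records, groups of 10
rng = np.random.default_rng(1)
U = rng.random(n)                     # u-co-ordinates

order = np.argsort(U)
P = np.eye(n)[order]                  # P @ U arranges U in increasing order
J = np.kron(np.eye(n // m), np.full((m, m), 1.0 / m))   # block-diagonal averaging

blurred_matrix = P.T @ J @ P @ U      # P^{-1} J P U, with P^{-1} = P^T

# Direct computation: average each successive group of m ranked records
blurred_direct = np.empty(n)
for g in range(n // m):
    idx = order[g * m:(g + 1) * m]
    blurred_direct[idx] = U[idx].mean()
```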

3.3.2. Areal Aggregation

This approach is a special case of equation (1) in which we assign U and V to a set of polygonal regions, each element of which is large enough in population to protect the confidentiality of individuals. The resulting matrix reduces in dimensionality to M = [P*, W, Z] where we have defined W and Z previously and P* is a polygonal identifier that replaces the U, V co-ordinates in the masked data set. One often selects a particular set of polygons because other data are available for them. Common situations are when social, demographic or environmental information is known and required in an analysis. In practice, however, we normally suppress Z, or we provide a summary statistic, and sum W over each polygon.

After areal aggregation, we have reduced the masked data in dimensionality from individual event co-ordinates (2 columns) to polygon identifiers (1 column); hence, in this form aggregation is not an isomask. We can, however, restore the original dimensions of M by mapping P* to the geographic co-ordinates of the centroids of their respective polygons, or alternatively to one or more points located within each polygon. For example, we could use the weighted mean centre, or we could assign events to cities or settlements within each polygon. Restoration of the original dimensions in any such fashion yields an isomask.

Figure 6. A linear buffer is used as an area in which health events are aggregated

One can easily implement areal aggregation using this approach if a record contains a field that describes its membership in a particular area of an exhaustive partitioning, such as a county in a state. If such data are not included, one can compute the same result. The Jordan arc theorem specifies that a point is contained within a polygon if a perpendicular line from the point to an axis crosses the border of the polygon an odd number of times (reference 43, p. 207). It is important to note that many desktop GIS software programs routinely provide this point-in-polygon functionality.
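The odd-crossings rule can be sketched as follows. This is an illustrative variant that casts a horizontal ray to the right of the point rather than a perpendicular to an axis; the function names and the two example polygons are hypothetical, and points lying exactly on a boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) is inside the polygon (list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-co-ordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside      # each crossing flips the parity
    return inside

def assign_to_polygon(x, y, polygons):
    """Return the identifier of the first polygon containing the point, else None."""
    for identifier, polygon in polygons.items():
        if point_in_polygon(x, y, polygon):
            return identifier
    return None

# Hypothetical partition of a 8 x 4 study area into two square regions
polygons = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
region = assign_to_polygon(2.0, 2.0, polygons)
```

Replacing each record's co-ordinates with the returned identifier gives the column P* of the areally aggregated data.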

If the areal aggregation of data is not part of the basic geographic encoding process, but instead is considered as a means of protecting the confidentiality of the health records, then the information could be made available in flexible and context-dependent ways that enable researchers to provide the health information for polygonal areas that are defined as relevant for hypotheses that need to be tested. In Figure 6, for example, infant deaths and births have been aggregated for an area that is within 1 mile of a major highway that crosses Des Moines, Iowa. The infant mortality rate for this region is 11.2 per thousand compared with 9.2 per thousand for the region as a whole.

3.3.3. Non-conterminous spatial aggregation

Although by convention health data are aggregated for regions, it is possible to aggregate for non-conterminous 'regions'. One could release health data for all areas within a given distance of a specified hazard. An example is all children's accidents within 20 metres of stop signs. There are many other such 'structured environments' where we need to orient analyses to the specific structure of interest.44 Often, such aggregation uses additional information not in X (in this example the co-ordinates of stop signs) and is thus a case of equation (1) with non-null E. By grouping together several such regions, we preserve confidentiality but release information for geographic areas of real interest.

3.4. Neighbour Information

We create a nearest-neighbour mask when the g-function of equation (2) maps (U, V) to nearest-neighbour distances and/or related information pertaining to the neighbour(s) of each event. Such a mask is not an isomask, however, since (U, V) maps to quantities that are not geographic co-ordinates. A nearest-neighbour mask is particularly relevant to a data user whose goal is to detect clustering, since many test statistics for clustering are functions of only the neighbour information. Some statistics (the mean nearest-neighbour distance, Ripley's K-function, and the nearest-neighbour distance empirical distribution function) use distances to the nearest neighbour only, while others include distances to the k > 1 nearest neighbours.45 Still others (for example, the Cuzick-Edwards statistic) require knowledge of which event (or at least which type of event, a case or a control) is (are) nearest neighbour(s) of each event. When, in addition to this latter type of information, distances are provided to the data user, it may be possible to reconstruct completely the relative locations of events if k is greater than or equal to three, in which case the reconstructed pattern is merely an affine transformation (plus possibly a reflection) of the original pattern. Subsequently, when we refer to the 'nearest-neighbour distance mask', we mean a mask that consists of no more information than the distances to, and identities of, the two nearest neighbours of each event.
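A nearest-neighbour distance mask of this kind can be sketched as below. The function name and the four collinear example events are assumptions for illustration; ties in distance are broken by record order, which a production implementation would need to handle explicitly.

```python
import numpy as np

def nearest_neighbour_mask(coords, k=2):
    """Return identities of, and distances to, the k nearest neighbours of each event."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # all inter-event distances
    np.fill_diagonal(dist, np.inf)               # an event is not its own neighbour
    ids = np.argsort(dist, axis=1, kind="stable")[:, :k]
    dists = np.take_along_axis(dist, ids, axis=1)
    return ids, dists

# Hypothetical events on a line at u = 0, 1, 3 and 7
events = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
ids, dists = nearest_neighbour_mask(events, k=2)
```

Only `ids` and `dists` would be released; the co-ordinates themselves are withheld.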

There are other ways that one can use nearest-neighbour information to mask data. For example, suppose that we compute all inter-event distances and then use them to determine the 1st, 2nd, ..., (n−1)th nearest neighbours of each of n health events. These ordered nearest neighbours, considered apart from their corresponding distances, constitute ordinal proximity information that we can submit to any of several non-metric scaling algorithms. Such an algorithm returns a map of events that mimics, but generally does not replicate exactly, the relative locations of the actual events. One can ascertain the masked co-ordinates of events from this map and then attach them to W and Z to yield a non-linear isomask.

3.5. Contextual Information

A quite different approach to limiting disclosure is to remove all explicit geographic identifiers from the health record and replace them with contextual information of specific interest to the data user. Saalfeld et al.10 discuss this approach in detail and contrast the disclosure protection characteristics of 'spatial partitions' versus 'spatial windows'. They conclude that any partition that uses small, standard, geographic units, 'presents too many disclosure risks to be acceptable for building context' (p. 692). Instead, they argue that 'spatial windows' provide better disclosure protection where the characteristics of the area surrounding each point are the contextual attachment. This is an interesting argument and merits further research. We do not agree with their conclusion that windowing in two dimensions is too computationally complex and costly, however. An example of spatial windows for health information (cases of infant deaths in Des Moines, Iowa), organized to limit disclosure, is the output of the program for spatial filtering of disease events made available in Rushton and Armstrong.46

4. EVALUATION OF THE APPROACHES

One can implement the affine masking transformations described in the previous section using either spreadsheets to operate on X, or purpose-specific programs. These transformations are widely used in computer graphics36 and in analytical cartography.43 Random perturbations can be accomplished in much the same way. Random number generators are available in general purpose and scientific programming libraries and spreadsheets also can be used to implement random displacements. One must implement the remaining masks using purpose-specific or specialized commercial software. One can use GIS software, for example, to implement aggregation (both point and areal) and nearest-neighbour masks.
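As an indication of how little code the affine and random perturbation masks require, the following sketch operates on an n by 2 array of co-ordinates. All function names and parameter values are hypothetical; scaling is taken about the origin and the perturbation is uniform on a square, both of which are assumptions of the sketch rather than requirements of the masks.

```python
import numpy as np

def translate(X, du, dv):
    """Shift every event by the same displacement (du, dv)."""
    return X + np.array([du, dv])

def rotate(X, theta, pivot=(0.0, 0.0)):
    """Rotate events anticlockwise by theta radians about a pivot point."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    pivot = np.asarray(pivot, dtype=float)
    return (X - pivot) @ R.T + pivot

def scale(X, factor):
    """Multiply all co-ordinates by a common factor (scaling about the origin)."""
    return X * factor

def random_perturb(X, half_width, rng):
    """Add independent uniform displacements on a square of the given half-width."""
    return X + rng.uniform(-half_width, half_width, size=X.shape)

def pairwise(X):
    """Matrix of inter-event distances, used here to check what each mask preserves."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

rng = np.random.default_rng(0)
X = rng.random((50, 2))                      # hypothetical event co-ordinates
masked = scale(rotate(translate(X, 3.0, -2.0), 0.7, pivot=(1.0, 1.0)), 2.5)
```

Translation and rotation leave the matrix returned by `pairwise` unchanged, whereas scaling multiplies every entry by the scale factor.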

In this section we evaluate geographic masks with respect to their ability to preserve information in the data, links to other geo-referenced information, and confidentiality. A mask can be evaluated differently depending on whether the user is provided with knowledge about its characteristics. As more details about the spatial mask are revealed, disclosure risk increases, but the data become more useful for geographic analyses. An argument can be made, however, that it is in the interest of both the data custodian and the data user that one selects the type of mask and its parameters such that both can be revealed to the user. From the point of view of the custodian, the data released should be such that even if the type of mask and its parameters became known, the disclosure risk is still acceptable. Stated differently, the data custodian should not want to release data where disclosure is vulnerable to the efforts of a data spy who might acquire information about the mask and its parameters. Limiting disclosure means avoiding unnecessary risks, and, if it is possible to limit disclosure, it is preferable at the same time to eliminate any risk of disclosure arising from the identification of the mask.

4.1. Preserving Information

An important property of a geographic mask is the quantity and quality of the information content in the original data preserved in the masked data. Though there is information loss with any mask, some masks preserve more information than others. However, data users might not need some, or perhaps all, of the information lost by a particular mask to answer a particular question. Central to this issue is the notion of a user-sufficient mask. A mask, M, is user-sufficient if the vector, T, of quantities needed by the user to address a question of interest is a function of the masked data; that is, if T(X) = T(M(X)). For example, if a user intends to investigate the possible existence of disease clusters using the Cuzick-Edwards test statistic, then any mask that preserves whether each case's nearest neighbour is a case or a control is user-sufficient. Of course, it is not always possible for the data custodian to anticipate the needs of the data user and hence it is relevant to consider a mask's information-preserving capabilities along several dimensions.

The geographic masks described in Section 3 vary substantially with respect to the kinds of information they preserve as well as their quality. In Table I, we compare masks along five main information-preserving dimensions likely important in the study of the geographic distribution of health events. First, we consider pairwise relations among events: whether we preserve actual distances, relative distances (up to a scale factor), actual orientations and relative orientations (up to a rotation). Second, we examine event-geography relations, specifically whether the mask preserves distances and orientations (both actual and relative) from events to possibly relevant geographic features, such as putative hazard sources. Third, we consider whether we preserve information that one can use to detect clusters. In particular, we examine whether we retain sufficient information to detect the existence of clusters and whether we preserve cluster locations or relative locations (up to a translation, scale change, rotation or reflection). This information generally is important because it is the location of clusters, not their existence per se, that often interests public health officials. Fourth, we study whether the mask preserves information about trends; specifically we ask whether the mask preserves the existence of a monotonic trend in the spatial density of events, and if so, whether it preserves its directionality. Fifth, and finally, we examine whether the mask preserves the existence and directionality of local anisotropies (direction-dependent spatial correlation).

Table I. Information preservation characteristics of geo-masks

                            Translate  Rotate  Scale  Three   Random   Aggregate  Nearest
                                                      affine  perturb             neighbour
                                                                                  distances
Pair-wise relations
  Actual distances          Y          Y       N      N       A        A          N*
  Relative distances        Y          Y       Y      Y       A        A          N*
  Actual orientations       Y          N       Y      N       A        N          N
  Relative orientations     Y          Y       Y      N       A        N          N

Event-geography relations
  Actual distances          N          N†      N      N       A        A          N
  Relative distances        N          N†      N      N       A        A          N
  Actual orientations       N          N       N      N       A        N          N
  Relative orientations     N          N†      N      N       A        N          N

Clusters
  Existence                 Y          Y       Y      Y       A        A          Y
  Actual locations          N          N       N      N       A        A          N
  Relative locations        Y          Y       Y      Y       A        A          N

Trends
  Existence                 Y          Y       Y      Y       A        A          N
  Directionality            Y          N       Y      N       A        A          N

Anisotropies
  Existence                 Y          Y       Y      Y       A        N          N
  Directionality            Y          N       Y      N       A        N          N

Y = Yes; N = No; A = Approximately
* Actual and relative distances are preserved to the nearest neighbour (or two nearest neighbours) but not to other events
† Unless there is a single point geographic feature and it is used as the pivot point

We make some assumptions in order to place specific entries in Table I. First, we assume that the affine transformation of event co-ordinates is not likewise applied to identifiable geographic features such as rivers and boundaries. To do so divulges the particular transformation used and thus defeats the purpose of the mask. Second, we assume that the level of aggregation associated with the aggregation mask is large enough that most areas or points represent more than one event, and, hence, that the mask preserves virtually no information about local anisotropy.

Table I shows that each of the affine transformations preserves (exactly) the existence of clusters, trends and anisotropies but does not preserve, even approximately, the locations of clusters and event-geography relations. We can use nearest-neighbour distances only for detecting the existence of clusters. Random perturbation yields information approximately correct in all the categories considered. Consequently, this mask appears satisfactory from a comprehensive information-preservation standpoint.

For random perturbation, the quality of the approximation is inversely related to the level of perturbation; we investigate this effect empirically below. Aggregation, like random perturbation, preserves approximate information about clusters and trends but not about local anisotropy. For this reason, and also because analyses based on areal counts of events generally are less powerful than analyses based on inter-event distances, we argue that aggregation is inferior in terms of information preservation to random perturbation.

We conducted an empirical analysis to illustrate the information loss incurred by the use of random perturbation masks with increasing levels of perturbation. The analysis focuses on information about the existence of clusters in a data file, X, that consists of three columns: the geographic co-ordinates (U, V) of health events and a column W whose elements indicate whether the event is a case or a control. We assume that controls are selected at random from the population at risk. Cuzick and Edwards47 proposed a series of statistics, {$T_k$}, to test for clustering in this setting, defined as the number of cases that are k-nearest neighbours of other cases. Unlike many other tests for clustering, Cuzick and Edwards' approach accounts for the inherent lack of uniformity in the spatial distribution of the at-risk population.
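The statistic can be sketched as a count, over all cases, of the cases found among each case's k nearest neighbours. The function name and the five example events are hypothetical, and ties in distance are broken here by record order, which is an assumption of the sketch and not necessarily the treatment used in the published test.

```python
import numpy as np

def cuzick_edwards_T(coords, is_case, k=1):
    """T_k: total number of cases among the k nearest neighbours of each case."""
    coords = np.asarray(coords, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                        # exclude self-matches
    nn = np.argsort(dist, axis=1, kind="stable")[:, :k]   # k nearest neighbours
    # Rows are restricted to cases; entries flag neighbours that are cases
    return int(is_case[nn][is_case].sum())

# Hypothetical events: two adjacent cases, two adjacent controls, one isolated case
coords = np.array([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)])
is_case = np.array([True, True, False, False, True])
T1 = cuzick_edwards_T(coords, is_case, k=1)
T2 = cuzick_edwards_T(coords, is_case, k=2)
```

Because the statistic depends only on neighbour identities, a translation of all co-ordinates leaves it unchanged, which is an instance of the user-sufficiency property T(X) = T(M(X)).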

Cuzick and Edwards applied their method to the analysis of 62 cases of leukaemia and lymphoma in children in North Humberside, England, with 141 controls selected from birth registries (see Figure 7). They found substantial evidence for clustering of cases; for example, they obtained p-values of 0.055 and 0.006 corresponding to $T_1$ and $T_2$, respectively. Here, we used these same data to study the effects of increasing levels of perturbation of event co-ordinates on the statistical significance of these two tests. For each level of perturbation, we generated 500 simulated patterns of 203 perturbed events. Perturbations were independently and identically distributed uniform random variables on the square $(-dw, dw) \times (-dw, dw)$ where w is the east-west width of the study region at its widest point and d is the perturbation level parameter. Here we took d = 0.005, 0.01, 0.02 and 0.04. We computed p-values corresponding to $T_1$ and $T_2$ for each simulated pattern.
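The structure of such a simulation can be sketched as follows. Several elements are assumptions of the sketch and not the procedure used in the study: the data are synthetic (not the North Humberside data), the p-value is obtained by random relabelling of cases and controls as a stand-in for the published test, and the numbers of simulations and relabellings are reduced to keep the example fast.

```python
import numpy as np

def neighbour_indices(coords, k):
    """Indices of the k nearest neighbours of every event (ties broken by order)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

def t_statistic(nn, is_case):
    """Number of cases among the k nearest neighbours of each case."""
    return int(is_case[nn][is_case].sum())

def permutation_p_value(coords, is_case, k, n_perm, rng):
    """Assumed stand-in test: compare T_k with its value under random relabelling."""
    nn = neighbour_indices(coords, k)
    observed = t_statistic(nn, is_case)
    exceed = sum(t_statistic(nn, rng.permutation(is_case)) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def perturb(coords, d, w, rng):
    """Add i.i.d. uniform displacements on the square (-d*w, d*w) x (-d*w, d*w)."""
    return coords + rng.uniform(-d * w, d * w, size=coords.shape)

def simulate(coords, is_case, levels, w, k=1, n_sim=500, n_perm=199, seed=0):
    """Return, for each perturbation level d, the p-values of n_sim masked patterns."""
    rng = np.random.default_rng(seed)
    return {d: np.array([permutation_p_value(perturb(coords, d, w, rng),
                                             is_case, k, n_perm, rng)
                         for _ in range(n_sim)])
            for d in levels}

# Hypothetical synthetic pattern: 20 tightly clustered cases and 40 scattered controls
rng = np.random.default_rng(42)
cases = rng.normal([0.5, 0.5], 0.01, size=(20, 2))
controls = rng.random((40, 2))
coords = np.vstack([cases, controls])
is_case = np.array([True] * 20 + [False] * 40)
p_values = simulate(coords, is_case, levels=[0.005, 0.04], w=1.0, n_sim=20, n_perm=99)
```

Box plots of the arrays in `p_values`, one per perturbation level, would give a display of the same general form as the one described in the text.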

Figures 8 (a) and (b) show box plots of the p-values corresponding to $T_1$ and $T_2$ for each level of perturbation. The results indicate that the significance of the clustering test tends to persist at the lowest level of perturbation (d = 0.005), but vanishes in the majority of simulations at higher levels of perturbation. Diggle48 established theoretically that second-moment analyses (such as ours) of randomly perturbed data are likely to underestimate the true extent of spatial heterogeneity or clustering, and our results serve as an empirical verification of this point. Moreover, our results shed light on the scale at which clustering occurs relative to the scale, or level, of perturbation. Apparently, for these data, clustering occurs at a scale less than 1/2 of 1 per cent of the study region's width. This type of result suggests a general rule-of-thumb for how much the custodian can mask the data while adequately preserving its 'clusteredness'; the perturbation level should not exceed the largest scale at which one believes clustering is operating.



Figure 7. Location of cases (open circles) and controls (closed circles)

It should be noted that the North Humberside data reported by Cuzick and Edwards and used for the study just described have already been masked by assigning each case and control to the centroid of the postal code in which the corresponding individual resides (a form of point aggregation). However, the scale of this 'pre-masking' evidently is smaller than the scale at which clustering operates and is smaller than the perturbation levels employed in our study. Consequently, if the study could be repeated using the actual locations of cases and controls, the results and conclusions would be similar.

4.2. Preserving Links to Other Geo-referenced Information

A major motivation for maintaining access to individual health records, albeit in masked form, is to test hypotheses about relationships that may exist between individual characteristics and the social, physical or health resource characteristics of a related area. These characteristics are generally stored in a geographical information system. Most GIS software implements an approach to data management that is based on the metaphor of thematic map layers. One can handle and manipulate each layer independently or in concert with other layers to support a wide variety of analytical functions.49 However, polygons on different layers often do not coincide with each other or with aggregated health records data. In such circumstances, even if health data are not masked, spatial analyses depend upon the areal interpolation of such zonal data into commensurable units of interest to the analyst. There have been several methods of areal interpolation developed for this purpose,50-52 and recent research has investigated the degree of error introduced by common uncertainties in the geometric accuracy of base data used to allocate information from one set of polygons to another.52 Spatial masking adds similar kinds of uncertainty, and some of the results reported may therefore be valid for understanding the effects of spatial masking of health data.

Figure 8. Box plots for values of $T_1$ and $T_2$ for different levels of perturbation: (a) $T_1$; (b) $T_2$

Figure 9. Magnitude of error and the relative sizes of the masks and areas of interest

From intuition, we can see (Figure 9) that the larger a zone of interest is in relation to the size of the spatial mask, the smaller is the error introduced by use of the mask. Methods for computing the error introduced by the mask involve imputing the likelihood that the locations of masked data could have originated in a zone of interest. If, for example, 90 per cent of the area of a mask surrounding the location of a masked record lies in the zone of interest, then the assumption is that the record has a 90 per cent chance of originating in the zone. Thus, for the case of a circular mask with equal likelihood of placement within a fixed range distance, we can estimate the effect of the mask on counts of health events in a specified zone by computing the sum of the mask source areas that lie in the zone as a proportion of the total mask areas (Figure 9). This proportion, times the number of masks that overlap the zone, is the best estimate of the true count of incidences in the zone of the unmasked data. We can obtain a more accurate estimate of error by overlaying an information layer that geographically locates the potential locations of the health records. Xie53 used address-range information for street segments in TIGER/Line files to improve the allocation of households to zones. Similarly, in this case, instead of measuring the proportion of the area of the masking circles that lie inside the zone of interest, one could measure the proportion of the addresses in the masking circles that lie outside the zone. In both cases, however, this best estimate is less than the true value if the hypothesis is true that the specific zone characteristics lead to elevated disease incidence rates. In such cases, the proportion of the total masked area in the zone is correlated with the underestimate bias; the smaller the proportion, the higher is the bias. One can use Monte Carlo simulations that use the parameters of the mask to generate masked data from the zone of interest to establish empirical confidence limits for the estimates, assuming that one can estimate background incidence rates for areas outside the zone of interest.
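The area-proportion estimate can be sketched for circular masks of equal radius, in which case the estimate reduces to the sum, over masked records, of the fraction of each mask circle that lies in the zone. The sketch approximates each fraction by Monte Carlo sampling; the function names, the square zone, the three masked locations and the sample size are all hypothetical.

```python
import numpy as np

def fraction_in_zone(centre, radius, in_zone, n_samples=20000, rng=None):
    """Approximate the share of a mask circle's area that lies inside the zone."""
    rng = np.random.default_rng() if rng is None else rng
    r = radius * np.sqrt(rng.random(n_samples))   # sqrt gives uniform density over the disc
    theta = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    x = centre[0] + r * np.cos(theta)
    y = centre[1] + r * np.sin(theta)
    return in_zone(x, y).mean()

def estimated_zone_count(masked_points, radius, in_zone, rng=None):
    """Sum the in-zone fractions: the expected number of records originating in the zone."""
    return sum(fraction_in_zone(p, radius, in_zone, rng=rng) for p in masked_points)

def square_zone(x, y):
    """Hypothetical zone of interest: the square [0, 10] x [0, 10]."""
    return (x >= 0) & (x <= 10) & (y >= 0) & (y <= 10)

# One mask wholly inside the zone, one centred on its boundary, one wholly outside
masked = [(5.0, 5.0), (0.0, 5.0), (20.0, 20.0)]
estimate = estimated_zone_count(masked, 1.0, square_zone, rng=np.random.default_rng(0))
```

For this configuration the three fractions are approximately 1, 0.5 and 0, so the estimated count is close to 1.5.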



Confidence limits for the count of incidences for a zone of interest might have, as limits, the count for all masked data that could have originated in the zone of interest and the count of all masked data that might not have so originated. The estimate, described above, is then the most likely value within this range. In applications where the zone of interest is large relative to the size of the mask, this range is small and incidence rates based on either of the two limits reflect the change of uncertainty of any conclusion.

4.3. Preserving Confidentiality

We can also evaluate geographic masks by how well they preserve confidentiality, or equivalently, by the risk of disclosure of any individual in the data file. Generally there is a trade-off between information loss and disclosure risk. For example, we showed in Table I that the affine transformation masks preserve exact information about several aspects of a pattern of events. However, we can invert these transformations experimentally and evaluate them using criteria that enable matches between the actual and masked patterns using either known cases or other knowledge about the geography of an area. If, for example, we simply rotated the data about an origin whose location was disclosed, it would be a comparatively simple task to try 360 rotations of one degree to try to bring the masked data into conformance with ancillary information about an area. If, however, we also introduced small changes in scale and translation into the masked data, then the number of combinations that we have to search to find matches increases substantially. Obviously, if we use real-valued masks, the number of combinations goes to infinity. However, if we apply a fuzzy matching tolerance the combinations reduce substantially.
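The exhaustive search over one-degree rotations can be sketched as follows. The scenario is hypothetical: the data are synthetic, the rotation is about a disclosed origin at (0, 0), and the ancillary information is taken to be the true locations of five events, without knowledge of which masked records they correspond to.

```python
import numpy as np

def rotate(X, degrees):
    """Rotate points anticlockwise about the origin by the given angle in degrees."""
    t = np.deg2rad(degrees)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return X @ R.T

def best_rotation(masked, known, step=1.0):
    """Try each candidate rotation and score it by how closely the known
    locations are matched by some rotated masked record."""
    best_angle, best_cost = None, np.inf
    for angle in np.arange(0.0, 360.0, step):
        candidate = rotate(masked, angle)
        d = np.linalg.norm(known[:, None, :] - candidate[None, :, :], axis=2)
        cost = d.min(axis=1).sum()        # distance from each known point to its best match
        if cost < best_cost:
            best_angle, best_cost = float(angle), float(cost)
    return best_angle, best_cost

rng = np.random.default_rng(0)
original = rng.uniform(-1.0, 1.0, size=(30, 2))   # hypothetical true locations
masked = rotate(original, 37.0)                   # mask: rotation by 37 degrees
known = original[:5]                              # ancillary knowledge held by the spy
angle, cost = best_rotation(masked, known)
recovered = rotate(masked, angle)
```

A rotation of 323 degrees undoes the 37-degree mask, after which every record is restored to its original position. Adding unknown scale and translation terms multiplies the size of the search grid, which is the point made in the text.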

As noted by Duncan and Lambert,54 one must conceptualize disclosure risk adequately before one can measure it and compare masks. Here we consider five disclosure risk criteria: one proposed by Spruill55-57 and four that we have developed to assess additional dimensions of disclosure risk.

We define Spruill's measure, as adapted for use here, as follows. For each record in the masked file M, compute the distance between its geographic co-ordinates and those of each record in the original file X. Spruill's measure, S, is the proportion of records in M that are closer to their 'parent' records than to any other record in X. A mask for which a large number of records are closest to their parent records has a high disclosure risk.
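The measure as defined above can be sketched directly. The function name and the three-record example are hypothetical; the sketch assumes that record i of the masked file corresponds to record i of the original file, and an exact tie in distance is resolved in favour of the lower-numbered record.

```python
import numpy as np

def spruill_measure(original, masked):
    """Proportion of masked records whose nearest original record is their parent."""
    original = np.asarray(original, dtype=float)
    masked = np.asarray(masked, dtype=float)
    # d[i, j] = distance from masked record i to original record j
    d = np.linalg.norm(masked[:, None, :] - original[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return float(np.mean(nearest == np.arange(len(masked))))

# Hypothetical example: only the first masked record stays closest to its parent
original = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
masked = [(0.1, 0.0), (1.9, 0.0), (1.2, 0.0)]
S = spruill_measure(original, masked)
```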

Our second criterion is vulnerability to geographic knowledge. A data user who possesses knowledge of the local geography of an area could, in some instances, exploit this knowledge to re-locate the data to their original positions. For example, if the pattern of some occurrences is limited by a river, bay, or topography, one may readily deduce its orientation by the constraints imposed by these natural controls. Consider U.S. cities that have a distinctive shape; Boston, MA, Baltimore, MD and San Francisco, CA, for example, have a shape that is, in part, dictated by the presence of water features. In other cases, topography can play a defining role in shaping development patterns; consider cities in the ridge and valley province of the eastern U.S. such as Wilkes Barre, PA (Figure 10). In other much more common cases human influences might control the structure of settlement. Many cities in the U.S. Midwest were settled after the imposition of the U.S. Public Land Survey System which left an indelible rectilinearity on street networks in that region (Figure 11). Consequently, if the city to which a masked pattern refers were known, one could invert any simple rotational masking by snapping the data to one of four (NSEW) orientations. When one couples these snapped co-ordinates with ancillary information (for example, distributional lacunae due to large holdings of public land or industry, or a rare diagonal roadway), one could deduce the correct orientation of a set of masked point locations using pattern-matching approaches.

Figure 10. The street network in the vicinity of Wilkes Barre, PA. The pattern exhibits a distinctive elongation that parallels the terrain in the region

The third criterion we consider is whether the masked data provide information that one can use to infer subregions of the study area where the likelihood of a health event in the data file is different than in others. From a disclosure risk standpoint, this is an undesirable property of a mask because it allows a data spy to focus additional disclosure efforts in subregions of higher likelihood. Clearly, aggregation has this property. So too does random perturbation if one reveals the perturbation level, d, since the intersection of the sets of points lying a distance of at least dw from every masked event is a subregion where no actual event could be located.

Figure 11. The street network in the vicinity of Champaign-Urbana, IL. The pattern exhibits a generic rectilinearity, imposed by the U.S. Public Land Survey System, that is accented by distinctive deviations such as limited-access roads

These first three criteria are measures of inferential disclosure,54 in that they measure whether the data user (or spy) can infer new information about an individual record from the released data, even though the new information is inexact. The remaining two disclosure risk criteria we consider are measures of identity disclosure,54 defined as the linkage of a record in the released data to a specific person. In fact, we could term these last two criteria as measures of contingent, complete identity disclosure risk because they measure the risk of disclosure of the complete data file given certain information on a subset of records in the file. The first of these criteria is the minimum number of health events that, were the data user to link them to their obvious identifiers, compromises the entire data set. For example, if the data were masked by a simple translation, then the linkage of a single record to its obvious identifiers permits a data spy, upon determining the true address (and hence the geographic co-ordinates) of that event, to re-establish the original co-ordinates of every event. The second recognizes the fact that it may be rather easy for a data spy to obtain, from media accounts or eyewitness testimony, knowledge of the locations of a few health events in the data file without knowing which records in the file correspond to those events. The criterion is the minimum number of such events that compromises the entire data set. To illustrate, again suppose that the data were masked by a translation. If the data spy could ascertain the locations of only two health events included in the data set, then the spy could, in turn, place each event in the data set in one of these two locations until there is also an event in the other location. When this happens, the original co-ordinates of the event are re-established, unless, of course, there are duplicate pairs of events separated by the same distance and in the same direction.
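The two-location attack on a translation mask can be sketched as follows. The data, the translation vector, the matching tolerance and the function name are hypothetical; the function returns every candidate shift, so that more than one entry signals the duplicate-pair ambiguity mentioned in the text.

```python
import numpy as np

def recover_translation(masked, loc_a, loc_b, tol=1e-9):
    """Place each masked record at loc_a in turn; keep the shifts for which
    some record then also falls on loc_b."""
    candidates = []
    for m in masked:
        shift = loc_a - m                     # shift that moves this record onto loc_a
        shifted = masked + shift
        if np.any(np.linalg.norm(shifted - loc_b, axis=1) <= tol):
            candidates.append(shift)
    return candidates

rng = np.random.default_rng(0)
original = rng.random((20, 2))                # hypothetical true locations
masked = original + np.array([3.7, -1.2])     # mask: an undisclosed translation

# The spy knows two true locations but not which records they belong to
shifts = recover_translation(masked, original[4], original[11])
recovered = masked + shifts[0]
```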

Table II summarizes qualitative evaluations of masks with respect to these five disclosure risk criteria. We see that Spruill's measure, vulnerability to geographic knowledge, and the two contingent disclosure risk criteria divide the masks into two main groups: the group of affine transformations, which have high disclosure risk; and the remaining masks, all of which have rather low disclosure risk. The 'equal likelihood' criterion also divides the masks into two groups; however, these two groups are not the same as the groups formed by the other four criteria. Random perturbation and aggregation have higher disclosure risk than the other masks as measured by this criterion.

Table II. Qualitative evaluations of masks with respect to five disclosure risk criteria

                            Translate  Rotate  Scale  Three   Random   Aggregate  Nearest
                                                      affine  perturb             neighbour
                                                                                  distances
Spruill (S)                 1.0        1.0     1.0    1.0     0<S<1    0<S<1      N/A
Geographic knowledge        H          H       H      H       L        L          L
Equal likelihood            Y          Y       Y      Y       N        N          Y
Identifier link             1          2       2      2       N        N          N
Location link               2          3*      2      3       N        N          N

H = high; L = low; Y = yes; N = no; N/A = not applicable
* Knowledge of two locations is sufficient, however, to reduce the number of possible patterns of events to two

Figure 12. Box plots for Spruill's disclosure risk measure for different levels of perturbation

Figure 12 displays the results of a more quantitative analysis of disclosure risk based on an empirical study of the effect of the random perturbation mask's level of perturbation on Spruill's measure of disclosure. For this study we again used the childhood leukaemia and lymphoma data from North Humberside (see Section 4.1), but for simplicity we restricted attention to cases only. We generated 500 simulated patterns of the 62 cases from the original pattern by random perturbation at four levels, these being the same as in the empirical study reported in Section 4.1. Figure 12 clearly indicates the inverse relationship between Spruill's disclosure risk measure and perturbation level. Such a plot, used in tandem with a plot of information loss versus perturbation level, such as Figure 8, may help a data custodian to determine a level of masking that keeps disclosure risk below an acceptable level, yet provides good-quality information to the data user.

5. DISCUSSION

In this paper we have described and evaluated several types of geographical masks designed to maintain the confidentiality of health event records, while, at the same time, permitting the conduct of valid analyses. In this evaluation it is clear that the purpose of the mask partially determines its relative suitability. There is, however, a set of open questions that await further research to provide support for the specification of specific masks in any particular application domain.

5.1. Should one reveal the parameters of a spatial mask or should they be part of the disclosure-limiting process?

All useful spatial analyses of masked data require some knowledge about the characteristics of the mask used. Without such knowledge, we cannot conduct valid analyses. Although provision of information about a mask is essential to a legitimate data user, unfortunately it is also helpful to a data spy. Therefore, data custodians must carefully consider the type of information that they provide to the user along with the masked data. For example, a custodian may inform a user that the locations of events have been translated and rotated, or randomly perturbed. Analysts can find incomplete information at this level helpful; for example, if informed that event locations have only been translated, one can safely use the data to infer the existence and directionality of trends and local anisotropies, and the existence and relative locations, but not the actual locations, of clusters (see Table I). In some cases the custodian may provide more detailed information about a mask. For example, the user may be told that event locations have been randomly perturbed and that the perturbations are independently and identically distributed uniform random variables on the square $[-1, 1] \times [-1, 1]$. Armed with this knowledge, a sophisticated data user can, through fuzzy-set theory or otherwise, quantify the accuracy of statistical tests for attributes (for example, clustering) of the pattern of events.

5.2. How should one qualify statistical inferences based on masked data?

Many geographic masking procedures have the effect of reducing spatial structure (for example, clustering, anisotropy, or trend), if any exists, in the original data. For example, a random perturbation mask tends to make a clustered pattern of health events less so, as seen in the empirical results reported in Section 4.2. On the other hand, a random perturbation of a random pattern of events does not tend to make it more clustered. Thus, there is an asymmetry in the qualifications that should accompany inferences that a user of randomly perturbed data draws from statistical tests for structure. If a test statistic's magnitude is such that a structural attribute is deemed statistically significant, then no qualification is necessary. If, however, the test indicates that a structural attribute is not statistically significant, this inference merits qualification with a statement to the effect that masking could have prevented the detection of the attribute. In statistical terms, random perturbation tends not to affect the type I error probability but it tends to increase the type II error probability. Knowledge of the parameters of the perturbation process, as discussed in the answer to the previous question, can be very useful in quantifying the possible increase in the type II error probability. Other geographic masking procedures may have different effects on spatial structure, hence different effects on type I and type II error probabilities. For example, affine transformations tend to neither increase nor decrease the type I and type II error probabilities of tests for clustering, and point aggregation tends to increase the type I error probability of such tests.
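The attenuation described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the cluster centres, spreads, seed and the use of mean nearest-neighbour distance as a crude indicator of clustering are all hypothetical choices, and the indicator is not the clustering test evaluated in Section 4.2. A tightly clustered pattern has a small mean nearest-neighbour distance, and a uniform perturbation that is large relative to the cluster spread inflates it, which is the sense in which masking can hide structure and raise the type II error probability.

```python
import math
import random


def mean_nearest_neighbour(points):
    """Average distance from each event to its closest other event."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


rng = random.Random(1)

# Hypothetical clustered pattern: 20 events tightly grouped around each of 3 centres.
centres = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
clustered = [
    (cx + rng.gauss(0.0, 0.05), cy + rng.gauss(0.0, 0.05))
    for cx, cy in centres
    for _ in range(20)
]

# Random perturbation mask: independent Uniform(-1, 1) shifts in x and y.
masked = [
    (x + rng.uniform(-1.0, 1.0), y + rng.uniform(-1.0, 1.0))
    for x, y in clustered
]

before = mean_nearest_neighbour(clustered)
after = mean_nearest_neighbour(masked)
```

Here `after` exceeds `before` because the perturbation spreads each tight group over a much larger area. Repeating the comparison over a range of perturbation widths would trace how quickly the apparent clustering decays.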

5.3. What makes an individual record vulnerable to disclosure?

The greater the area of geographic space that we can eliminate as the source of any health events, the greater is the risk of disclosure. Conversely, the greater the number of other records that one can show to have possibly originated in the same area from which a particular record might have come, the less is the risk of disclosure. A contextual frame is specified by conditions of local geography and can be used to define exclusion zones in which a data spy would not search for matches between health records and their actual locations. The frame defines, therefore, those areas that can accept a point given a set of ancillary factors (e.g. land use/land cover) that control the locations of events. For example, a residence location normally is not in a public park. The remaining area is the feasible geographic space that could, conceivably, be the origin of a health record.
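A contextual frame of this kind can be sketched as a filter over a land-use raster. The grid, the land-use codes and the function names below are hypothetical and serve only to show how exclusion zones shrink the feasible geographic space. The larger the share of the study area that can be ruled out, the fewer candidate origins remain for each record.

```python
# Hypothetical land-use raster: R = residential, P = public park, W = water.
LAND_USE = [
    "RRPP",
    "RRPP",
    "RRWW",
    "RRRR",
]
EXCLUDED = {"P", "W"}  # assumed codes where a residence cannot be located


def feasible_cells(raster, excluded):
    """Return (row, col) of every cell that could be the origin of a record."""
    return [
        (r, c)
        for r, row in enumerate(raster)
        for c, code in enumerate(row)
        if code not in excluded
    ]


def eliminated_fraction(raster, excluded):
    """Share of the study area that a data spy can rule out."""
    total = sum(len(row) for row in raster)
    return 1.0 - len(feasible_cells(raster, excluded)) / total


feasible = feasible_cells(LAND_USE, EXCLUDED)
ruled_out = eliminated_fraction(LAND_USE, EXCLUDED)
```

In this toy raster 10 of the 16 cells remain feasible, so a data spy can discard 37.5 per cent of the area before attempting any match.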

5.4. Is there a way to equalize risk of disclosure? What criteria should one use to determine the amount of random perturbation?

Disclosure risk is a multi-dimensional problem that depends on a set of interacting factors such as the density of the underlying population, the density of the masked events and the context in which the data are located. An optimal spatial mask is one that ensures that for every location in the feasible geographic space, at least one record might have originated there, and that for every health record there are approximately the same number of other individuals, s, who might be associated with it. An optimal displacement is the minimum displacement that protects confidentiality and that simultaneously permits ‘valid’ statistical analyses to take place. One way to achieve this goal is to determine, for each individual record contained in the database, the number of persons at risk who could be associated with that location. Though this number depends on several factors, such as the number of variables used and the density of at-risk individuals in the ‘neighbourhood’ of that location, results can be computed from an at-risk surface derived from the density of the at-risk population as modified by the peculiar geographical characteristics that surround each location. Algorithms exist for determining the areas that contain an approximately equal number of people at risk from such a density surface.58
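As a simplified illustration of equalizing risk, suppose the at-risk density is approximately uniform in the neighbourhood of a record. That assumption is made here only for the sketch; a real at-risk surface varies from place to place. Under it, a circle of radius r contains about πr²d persons at density d, so the radius needed to enclose a target number k of persons (k plays the role of s above) is r = sqrt(k / (πd)). The function name and the densities below are hypothetical.

```python
import math


def radius_for_k(k, density):
    """Radius of a circle expected to contain k people at a uniform density."""
    if k <= 0 or density <= 0:
        raise ValueError("k and density must be positive")
    return math.sqrt(k / (math.pi * density))


# Hypothetical at-risk densities in persons per square kilometre.
urban_radius = radius_for_k(k=20, density=1000.0)
rural_radius = radius_for_k(k=20, density=5.0)
```

The rural radius is much larger than the urban one for the same k, which suggests why a single fixed perturbation distance cannot equalize disclosure risk across areas of differing density.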

5.5. Should the user of a given data set be permitted to select the mask to be used on the data set? If so, could the same data set be released under multiple spatial masks? Would this enhance the ability of a data spy to ‘crack the mask’?

It is possible that there is a many-to-one mapping between masking strategies and possible uses. Data can be released in several ways that are specific to a set of changing analytical requirements, such as when exploratory analyses are conducted. Consequently, multiple masked versions of a data set could be released to a user. One could design a user interface, for example, that enables users to ‘check off’ different characteristics of the data that they require, subject to (a) their personal ‘security clearance’ and (b) the ease with which the data could be ‘spied’ upon based on repeated requests. If a spy were to obtain multiple masked versions of the same data set, however, this would provide insight into the masking procedure and may, in some cases, provide sufficient insight to compromise the confidentiality of the data set; low density areas are especially vulnerable to such attacks. We conclude, therefore, that allowing a user to select a mask constitutes an unacceptable risk to most data custodians. One way to preclude such data spying activities is to release only a single version of any data that have been randomly masked.
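The danger of repeated releases can be seen in a short simulation. It is a sketch only: the location, the number of releases, the seed and the assumption that each release uses an independent zero-mean uniform perturbation are all hypothetical. Because independent zero-mean perturbations average out, a spy who collects many masked versions of the same record can estimate its true location far more precisely than any single release allows.

```python
import random


def release(true_location, half_width, rng):
    """Return one independently perturbed copy of a location."""
    x, y = true_location
    return (x + rng.uniform(-half_width, half_width),
            y + rng.uniform(-half_width, half_width))


def average(points):
    """Coordinate-wise mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


rng = random.Random(7)
true_location = (5.0, 5.0)  # hypothetical confidential location

# A spy obtains 400 separately masked versions of the same record.
releases = [release(true_location, 1.0, rng) for _ in range(400)]
estimate = average(releases)
error = max(abs(estimate[0] - true_location[0]),
            abs(estimate[1] - true_location[1]))
```

A single release can be off by up to one unit in each coordinate, while the standard error of the averaged estimate falls in proportion to the square root of the number of releases.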

5.6. Is an intelligent agent the best ultimate answer to protection of the confidentiality of health records while permitting their valid spatial analysis?

Software agents are emerging as an important computing paradigm.59–61 A key feature of agents is that they ‘travel’ outside of the host computer to conduct analyses and to search for needed resources stored elsewhere on the network. In addition, agents typically have embodied in them an organized collection of domain-specific knowledge. For example, successful agents have been developed to manage e-mail for users, who may specify that certain types of mail be deleted from the incoming stream of messages and that other messages, say from a particular sender or with a specific subject line, be brought to their immediate attention. If an agent were designed to support the analysis of public health data, users would not be required to have access to confidential health records. Rather, they would submit a request to an intelligent analysis agent that would assess the request along several dimensions such as its semantic correctness and level of specificity. If the agent accepts the query, it would complete the analysis and return a result to the data user. An agent could also prepare mapped results that are masked to protect the confidentiality of individuals.
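The request-assessment step might be sketched as follows. Everything in this example is hypothetical: the record store, the request format, the single supported analysis and the minimum window area are assumptions introduced for illustration, and no existing agent system is being described. The point is only that the user receives an aggregate result or a refusal, never the confidential coordinates.

```python
# Confidential event coordinates held by the agent; never returned to the user.
_RECORDS = [(1.0, 1.0), (1.5, 2.0), (2.5, 2.5), (8.0, 9.0)]

SUPPORTED_ANALYSES = {"count"}   # assumed vocabulary of valid requests
MIN_WINDOW_AREA = 4.0            # assumed specificity limit, in map units squared


def handle_request(request):
    """Assess a request and return either an aggregate result or a refusal."""
    analysis = request.get("analysis")
    window = request.get("window")  # (xmin, ymin, xmax, ymax)

    # Semantic correctness: known analysis and a well-formed window.
    if analysis not in SUPPORTED_ANALYSES or window is None or len(window) != 4:
        return {"accepted": False, "reason": "malformed request"}
    xmin, ymin, xmax, ymax = window
    if xmin >= xmax or ymin >= ymax:
        return {"accepted": False, "reason": "malformed request"}

    # Level of specificity: refuse windows small enough to isolate a location.
    if (xmax - xmin) * (ymax - ymin) < MIN_WINDOW_AREA:
        return {"accepted": False, "reason": "request too specific"}

    count = sum(1 for x, y in _RECORDS
                if xmin <= x <= xmax and ymin <= y <= ymax)
    return {"accepted": True, "result": count}


broad = handle_request({"analysis": "count", "window": (0.0, 0.0, 3.0, 3.0)})
narrow = handle_request({"analysis": "count", "window": (0.9, 0.9, 1.1, 1.1)})
```

The broad request is answered with a count, while the narrow one is refused because its window is small enough to single out one event. A practical agent would also need to track repeated requests, for the reasons discussed in Section 5.5.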

6. CONCLUSIONS

As the number of sources of health-related microdata continues to increase, and as pressure mounts on data custodians charged with the preservation of confidentiality to release data that can be statistically analysed and mapped, there is a commensurate need to investigate new approaches that support such activities while, at the same time, protecting the privacy of individuals. In this paper we have described a new approach to preserving the confidentiality of individual-level data based on the specification of geographical masks of individual health events. We described several such masks and evaluated their performance, in both theoretical and empirical contexts. We have shown that random perturbation of data, up to some limit, is superior to affine and aggregation masks for many analytical purposes. We also demonstrated that the significance of a clustering test we evaluated persists at low levels of perturbation, but with higher levels of perturbation used in a masking operation, as expected, the presence of clusters became less apparent.

These methods must become the subject of discussion by local, state and federal organizations as well as private groups that are the custodians of health and disease records. Though interest in the geographical analysis of disease has increased, there has been little discussion of practical ways to protect the confidentiality of data while permitting valid analyses. In many local areas public health officials, and even public groups, are reconstructing local geographic data and are drawing inferences about possible environmental influences on disease incidences with little supporting analysis. Many of the nation's most talented researchers also continue to work with disease data whose true geographical locations have been masked using spatial aggregation areas (often counties) that do not permit valid geographical analyses of important epidemiological questions. This must change.

ACKNOWLEDGEMENTS

We acknowledge support provided by an Interdisciplinary Research Grant at the Obermann Center for Advanced Studies at the University of Iowa. Comments provided by the referees and the editor led to improvements in the paper and are also gratefully acknowledged.

REFERENCES

1. Donaldson, M. S. and Lohr, K. S. (eds). Health Data in the Information Age: Use, Disclosure and Privacy, Committee on Regional Health Data Networks, National Academy Press, Washington, DC, 1994.

2. Gostin, L. O., Turek-Brezina, J., Powers, M., Kozloff, R., Faden, R. and Steinauer, D. D. ‘Privacy and security of personal information in a health care system’, Journal of the American Medical Association, 270, 2487–2493 (1993).

3. NCHS. Catalog of Publications 1990–95, National Center for Health Statistics, Hyattsville, MD, 1996.

4. Geschwind, S. A., Stolwijk, J. A. J., Bracken, M., Fitzgerald, E., Stark, A., Olsen, C. and Melius, J. ‘Risk of congenital malformations associated with proximity to hazardous waste sites’, American Journal of Epidemiology, 135, 1197–1206 (1992).

5. Stallones, L., Nuckols, J. R. and Berry, J. K. ‘Surveillance around hazardous waste sites: geographic information systems and reproductive outcomes’, Environmental Research, 59, 81–92 (1992).

6. Duncan, G. T. and Pearson, R. W. ‘Enhancing access to microdata while protecting confidentiality: Prospects for the future’, Statistical Science, 6, 219–239 (1991).

7. DOC. Report on Statistical Disclosure and Disclosure-Avoidance Techniques, Statistical Working Paper 2, U.S. Department of Commerce, Washington, DC, 1978.

8. Office of Management and Budget. Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22, U.S. Office of Management and Budget, Washington, DC, 1994.

9. Greenberg, B. and Voshell, L. ‘Relating risk of disclosure for microdata and geographic area size’, Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, 450–455 (1990).

10. Saalfeld, A., Zayatz, L. and Hoel, E. ‘Contextual variables via geographic sorting: A moving averages approach’, Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, VA, 691–696 (1992).

11. U.S. Bureau of the Census. Census Use Study Report 14, Geocoding with ADMATCH, Washington, DC, 1971.

12. Broome, F. R. and Meixler, D. B. ‘The TIGER database structure’, Cartography and Geographic Information Systems, 17, 39–47 (1990).

13. Marx, R. W. ‘The TIGER system: yesterday, today and tomorrow’, Cartography and Geographic Information Systems, 17, 89–97 (1990).

14. U.S. Bureau of the Census. TIGER/LINE™ Census Files, 1990: Technical Documentation, Washington, DC, 1991.

15. Gatrell, A. C. and Bailey, T. C. ‘Interactive spatial analysis in medical geography’, Social Science and Medicine, 42, 843–855 (1996).

16. Croner, C. M., Sperling, J. and Broome, F. R. ‘Geographic information (GIS): New perspectives in understanding human health and environmental relationships’, Statistics in Medicine, 15, 1961–1977 (1996).

17. Mason, T. J., McKay, F. W., Hoover, R., Blot, W. J. and Fraumeni, J. F. Atlas of Cancer Mortality for U.S. Counties 1950–1969, National Institute of Health, DHEW Publication (NIH) 75-780, 1975.



18. Mason, T. J., McKay, F. W., Hoover, R., Blot, W. J. and Fraumeni, J. F. Atlas of Cancer Mortality among U.S. Nonwhites 1950–1969, National Institute of Health, DHEW Publication (NIH) 76-1204, 1976.

19. Pickle, L. W., Mason, T. J., Howard, N., Hoover, R. and Fraumeni, J. F. Atlas of U.S. Cancer Mortality among whites: 1950–1980, National Institute of Health, DHHS Publication (NIH) 87-2900, 1987.

20. Devine, O. J., Louis, T. A. and Halloran, M. E. ‘Identifying areas with elevated disease incidence rates using Bayes estimators’, Geographical Analysis, 28, 187–199 (1996).

21. Waller, L. A. ‘Statistical power and design of focused clustering studies’, Statistics in Medicine, 15, 765–782 (1996).

22. Tobler, W. R. ‘Smooth pycnophylactic interpolation for geographic regions’, Journal of the American Statistical Association, 74, 519–536 (1979).

23. Fisher, P. F. and Langford, M. ‘Modeling sensitivity to accuracy in classified imagery: a study of areal interpolation by dasymetric mapping’, Professional Geographer, 48, 299–309 (1996).

24. Flowerdew, R. and Green, M. ‘Statistical methods for inference between incompatible zonal systems’, in Goodchild, M. F. and Gopal, S. (eds), Accuracy of Spatial Databases, Taylor and Francis, London, U.K., 1989, pp. 239–247.

25. Jacquez, G. M. and Waller, L. A. ‘The effects of uncertain locations on disease cluster statistics’, Proceedings of the Second International Symposium on Spatial Accuracy Assessment, Natural Resources and Environmental Sciences, United States Department of Agriculture General Technical Report RM-GTR-277, 1996, pp. 259–266.

26. Amrhein, C. G. and Reynolds, H. ‘Using spatial statistics to assess aggregation effects’, Geographical Systems, 3, 143–158 (1996).

27. Openshaw, S. and Schmidt, J. ‘Parallel simulated annealing and genetic algorithms for re-engineering zoning systems’, Geographical Systems, 3, 201–220 (1996).

28. Wong, D. and Amrhein, C. G. ‘Research on the MAUP: Old wine in a new bottle or real breakthrough?’, Geographical Systems, 3, 73–76 (1996).

29. Rothman, K. J. ‘A sobering start to the cluster busters conference’, American Journal of Epidemiology, 132, S6–S13 (1990).

30. Choynowski, M. ‘Maps based on probabilities’, Journal of the Royal Statistical Association, 54, 385–388 (1959).

31. Openshaw, S. ‘Two exploratory space-time-attribute pattern analysers relevant to GIS’, in Fotheringham, S. and Rogerson, P. (eds), Spatial Analysis and GIS, Taylor and Francis, Bristol, PA, 1994, pp. 83–104.

32. Oden, N., Jacquez, G. and Grimson, R. ‘Realistic power simulations compare point and area-based disease cluster tests’, Statistics in Medicine, 15, 783–806 (1996).

33. Cox, L. H. ‘Protecting confidentiality in small population health and environmental statistics’, Statistics in Medicine, 15, 1895–1905 (1996).

34. Cox, L. H. ‘Matrix masking methods for disclosure limitation in microdata’, Survey Methodology, 20, 165–169 (1994).

35. Aldrich, T. E. and Krautheim, K. R. ‘Protecting confidentiality in small area studies’, paper presented at the CDC Symposium on ‘Statistical Methods: Small Area Statistics in Public Health’, Atlanta, GA, 1995.

36. Foley, J. D., van Dam, A., Feiner, S. K. and Hughes, J. F. Computer Graphics: Principles and Practice, 2nd edn, Addison-Wesley, Reading, MA, 1990.

37. Daskin, M. S. Network and Discrete Location: Models, Algorithms, and Applications, Wiley, New York, NY, 1995.

38. Densham, P. J. and Rushton, G. ‘Providing spatial decision support for rural services that require a minimum workload’, Environment and Planning B, 23, 553–574 (1996).

39. Lam, N. S.-N. and Liu, K.-B. ‘Use of space-filling curves in generating a national rural sampling frame for HIV/AIDS research’, Professional Geographer, 48, 321–332 (1996).

40. Wolf, M. K. ‘Microaggregation and disclosure avoidance for economic establishment data’, Proceedings of the Section on Business and Economic Statistics, American Statistical Association, 355–360 (1988).

41. Strudler, M., Oh, H. L. and Scheuren, F. ‘Protection of taxpayer confidentiality with respect to the tax model’, Proceedings of the Section on Survey Research Methods, American Statistical Association, 375–381 (1986).



42. Bellhouse, D. R. ‘Some optimal designs for sampling in two dimensions’, Biometrika, 64, 605–611 (1977).

43. Clarke, K. Analytical and Computer Cartography, 2nd edn, Prentice-Hall, Englewood Cliffs, NJ, 1995.

44. Grimson, R. C. and Oden, N. ‘Disease clusters in structured environments’, Statistics in Medicine, 15, 851–871 (1996).

45. Diggle, P. J. Statistical Analysis of Spatial Point Patterns, Academic Press, London, UK, 1983.

46. Rushton, G. and Armstrong, M. P. Improving Public Health through Geographical Information Systems, CD-ROM, Version 3.0. Department of Geography, GIS & Public Health Project, The University of Iowa, Iowa City, IA, 52242, 1998.

47. Cuzick, J. and Edwards, R. ‘Spatial clustering for inhomogeneous populations’, Journal of the Royal Statistical Society, Series B, 52, 73–104 (1990).

48. Diggle, P. J. ‘Point process modelling in environmental epidemiology’, in Barnett, V. and Turkman, K. F. (eds), Statistics for the Environment, Wiley, New York, 1993, pp. 89–110.

49. Tomlin, C. D. Geographic Information Systems and Cartographic Modeling, Prentice Hall, Englewood Cliffs, NJ, 1990.

50. Fisher, P. F. and Langford, M. ‘Modeling the errors in areal interpolation between zonal systems by Monte Carlo simulation’, Environment and Planning A, 27, 211–224 (1995).

51. Goodchild, M. F. and Lam, N. S.-N. ‘Areal interpolation: a variant of the traditional spatial problem’, Geoprocessing, 1, 297–331 (1980).

52. Goodchild, M. F., Anselin, L. and Deichmann, U. ‘A framework for the areal interpolation of socioeconomic data’, Environment and Planning A, 25, 383–397 (1993).

53. Xie, Y. ‘The overlaid network algorithms for areal interpolation problem’ (sic), Computers, Environment and Urban Systems, 19, 287–306 (1995).

54. Duncan, G. and Lambert, D. ‘The risk of disclosure for microdata’, Journal of Business and Economic Statistics, 7, 207–217 (1989).

55. Spruill, N. L. ‘Measures of confidentiality’, in Statistics of Income and Related Administrative Record Research: 1982, U.S. Department of Treasury, Internal Revenue Service, Statistics of Income Division, Washington, DC, 1982, pp. 131–136.

56. Spruill, N. L. ‘The confidentiality and analytic usefulness of masked business microdata’, Proceedings of the Section on Survey Research Methods, American Statistical Association, 602–607 (1983).

57. Spruill, N. L. Protecting Confidentiality of Business Microdata by Masking, Public Research Institute, Alexandria, VA, 1984.

58. Rushton, G. ‘Map transformations of point patterns: Central place patterns in areas of variable population density’, Papers of the Regional Science Association, 28, 111–129 (1972).

59. Etzioni, O. and Weld, D. ‘A softbot-based interface to the Internet’, Communications of the Association for Computing Machinery, 37, 72–76 (1994).

60. Maes, P. ‘Agents that reduce work and information overload’, Communications of the Association for Computing Machinery, 37, 31–40 (1994).

61. Norman, D. A. ‘How might people interact with agents’, Communications of the Association for Computing Machinery, 37, 68–71 (1994).


