Journal of Epidemiology and Community Health 1995;49(Suppl 2):S34-S38

Geodemographic segmentation systems for screening health data

Stan Openshaw, Marcus Blake

Abstract
Aim - To describe how geodemographic segmentation systems might be useful as a quick and easy way of exploring postcoded health databases for potentially interesting patterns related to deprivation and other socioeconomic characteristics.
Design and setting - This is demonstrated using GB Profiles, a freely available geodemographic classification system developed at Leeds University. It is used here to screen a database of colorectal cancer registrations as a first step in the analysis of those data.
Results and conclusion - Conventional geodemographics is a fairly simple technology and a number of outstanding methodological problems are identified. A solution to some problems is illustrated by using neural net based classifiers and then by reference to a more sophisticated geodemographic approach via a data optimal segmentation technique.

(J Epidemiol Comm Health 1995;49(Suppl 2):S34-S38)

School of Geography, Leeds University, Leeds LS2 9JT
S Openshaw
M Blake

Correspondence to: Professor S Openshaw.

Geodemographic classifications are widely used in many different application areas as a means of obtaining a useful descriptive summary of the principal types of residential areas that exist in the UK, based on a multivariate classification of census data.1 Brown2 writes, "Geodemographics has come into popular use as a shorthand label for both the development and application of area typologies that have proven to be powerful discriminators of consumer behaviour and as aids to market analysis". A typical geodemographic classification starts with 1991 census data for all the 145 716 census enumeration districts in Britain, for which there are about 10 000 different census counts available (see ref 3 for a complete listing). These 10 000 variables are reduced by careful selection to about 50 to 100 composite indicator variables measuring a range of socioeconomic, demographic, and housing characteristics. This set of derived variables is then used to classify the census enumeration districts into clusters (or groups) of similar types of areas based on their multivariate data profiles. Current commercial geodemographic classifications contain between 10 and 161 residential area types; for example, the ACORN91 system has 6 categories, 17 groups, and 54 types; the MOSAIC91 system has 12 groups and 54 types; and the SuperProfiles94 system has 10 lifestyles, 40 target markets, and 161 clusters. Each of these clusters is given a simple descriptive label that offers an idealised and highly stylised portrait (or picture) of what the typical members are like. The resulting classification is then usually linked to postcodes via the OPCS/GRO(S) census to unit postcode directory.

In essence, geodemographics provides a means by which people can be characterised by the types of area in which they live, using postcodes as a simple indexing mechanism to a multivariate classification of small area census data. This is potentially relevant to health analysis because a geodemographic classification of this type might well be regarded as a more sophisticated approach to incorporating deprivation related effects than that provided by the more traditional ranking of index values for small areas such as wards (for example, see Jarman4 and various other indexes of deprivation developed for the Department of the Environment and Bradford et al5). Geodemographics might well be more useful because it offers a strongly multivariate view of the characteristics of areas and, being census enumeration district based, may well also provide a higher level of geographic discrimination. Geodemographic classifications also offer a proxy for lifestyle and prosperity, act as a substitute for census data, and generally provide a quick, albeit crude, means of adding a census based socioeconomic, demographic, and housing context to virtually any postcoded health data. The results are also simple enough to be understood by non-technical experts.
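The postcode indexing step described above can be illustrated with a small sketch. It is not the GB Profiles implementation: the two lookup tables, the postcodes, the enumeration district identifiers, and the function names are all hypothetical, and a real application would load the census to postcode directory and a published classification instead.

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only. Real tables would come
# from a postcode-to-enumeration-district directory and a classification file.
POSTCODE_TO_ED = {"S1 1AA": "ED001", "S1 1AB": "ED001", "S2 3CD": "ED002"}
ED_TO_CLUSTER = {"ED001": 16, "ED002": 49}


def normalise_postcode(raw):
    """Upper-case a postcode and put one space before its last three characters."""
    compact = "".join(raw.split()).upper()
    return compact[:-3] + " " + compact[-3:]


def attach_clusters(postcodes):
    """Return one cluster code per record, or None where the postcode is unmatched."""
    codes = []
    for raw in postcodes:
        ed = POSTCODE_TO_ED.get(normalise_postcode(raw))
        codes.append(ED_TO_CLUSTER.get(ed))
    return codes


records = ["s1 1aa", "S11AB", "S2  3CD", "S9 9ZZ"]
codes = attach_clusters(records)
# Count cases per residential area type, ignoring unmatched records.
cases_per_cluster = Counter(code for code in codes if code is not None)
```

Unmatched postcodes are kept as None rather than dropped, so that the proportion of records that could not be classified remains visible to the analyst.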

It is suggested that geodemographics are relevant to health database analysis because they offer a particularly quick and easy way of performing a broad brush screening of medical data in terms of different types of residential area. They offer answers to questions such as the following. Do disease rates vary by type of residential neighbourhood? Do poor housing areas with high levels of unemployment have significantly higher than expected mortality or morbidity rates? What sorts of residential area in what parts of the UK are associated with the highest incidence of a particular disease? An example of this type of analysis is that of Reading, Jarvis, and Openshaw.6

It is argued, however, that geodemographics does more than merely offer a convenient source of covariate information that statistical modellers can use to remove so called confounding socioeconomic effects. Health differences that vary in relation to the type of residential area, after allowances are made for age and sex effects, should perhaps become the focus of attention since (in theory) they may be treated or managed by social rather than purely medical means. They may also have political importance as a reflection of spatial inequalities in life expectancy and well being. It would seem, therefore, that a routine geodemographic style of screening of important health and death databases might be an extremely useful initial exploratory step in monitoring health information systems, both in epidemiological research and for assessing the health needs of populations. It is perhaps surprising that much greater use is not as yet routinely being made of this technology. The reasons for this relative neglect probably reflect the cost of acquiring a commercial geodemographic system, unfamiliarity, and lingering concerns about the quality of some of the commercial products.

This paper briefly describes the development and application of a particular geodemographic segmentation system which was produced as a part of an ESRC funded project and is available free for academic researchers. This paper illustrates the use of this system and then outlines how it can be further developed to generate robust, safe, data optimal segmentations of virtually any postcoded medical data.

Spatial classification of 1991 census data
The statistical technology needed to create a crude national classification of small area census data is now widely diffused via popular statistical packages; for example, SAS or SPSS. However, it is not often recognised that many of the available classification methods date from the 1960s and are not well suited to handling the special nature of spatial (as distinct from non-spatial survey) data; in particular, problems of non-normal distributions, non-linearity, and spatial dependency are endemic. Census data also introduce a number of additional difficulties, especially those related to small number effects and spatially varying levels of data precision. Whether these matter depends on the nature of the application, on the level of skill used to develop a classification, and on the context in which it is used. Better still would be the use of classifiers that at least attempt to handle some of the problems.

Accordingly, the classification method used here is based on a particular type of unsupervised neural net known as Kohonen's self organising map.7 This has been modified to handle the data uncertainty present in census data.1,8,9 The attractions of this method include its simplicity and flexibility. It can handle noisy census data and size related data precision issues, and there is a minimum amount of data preprocessing. However, as with all neurocomputing approaches, it is always useful to compare the results that are obtained with more conventionally produced classifications in order to provide performance benchmarks against which any improvements can be assessed. This naturally leads to having not one but multiple classifications based on different methods, perhaps also using different sets of variables, and offering different levels of data generalisation obtained by varying the numbers of clusters present. The user is expected to choose what is best in the context of a particular data analysis application.

This may be contrasted with a more traditional, commercial geodemographic approach, based on a single all purpose system which is applied to virtually everything. While of general descriptive utility and of considerable simplicity, it should be appreciated that particular applications may require either their own bespoke classifications, tuned to specific requirements (although this tuning is a very imprecise process), or a means of choosing the best from among many alternative classifications. A customised or tailored segmentation system is one in which a purpose specific classification is developed to meet the specialist need of a particular application.10 There is no reason why such systems cannot be developed and optimised for use in a health context. The customisation process may involve the selection of a specific set of variables to be used and the careful choice of the best number of clusters relevant to a specific context. This can be a lengthy and costly process. Fortunately, it seems that the principal unknown, but critical, variable in the classification process is often the best number of clusters to use. Research suggests that it may well even be more important than the choice of classifier as a sensitivity optimising device.11 There has been a tendency in a marketing context to seek only highly parsimonious geodemographic segmentations with few clusters, whereas some health applications may well require 5 to 10 times as many in order to allow place or region specific variations of mortality and morbidity rates to appear.
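The self organising map classifier named in this section can be sketched in its basic, unmodified form. The sketch below is illustrative only: it uses a one-dimensional map for brevity, it does not include the modification for census data uncertainty described in the text, and the function names, the decay schedules, and the toy data are assumptions rather than details taken from GB Profiles.

```python
import random


def best_unit(weights, row):
    """Index of the map unit whose weights are closest (squared Euclidean) to row."""
    return min(range(len(weights)),
               key=lambda u: sum((w - x) ** 2 for w, x in zip(weights[u], row)))


def train_som(data, n_units, epochs=50, seed=0):
    """Train a basic one-dimensional self organising map; return the unit weights."""
    rng = random.Random(seed)
    # Start each unit at a distinct, randomly chosen area profile.
    weights = [list(row) for row in rng.sample(data, n_units)]
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        rate = 0.5 * remaining                   # learning rate decays towards zero
        radius = int((n_units / 2) * remaining)  # neighbourhood shrinks to the winner
        for row in rng.sample(data, len(data)):  # visit areas in random order
            winner = best_unit(weights, row)
            for unit in range(n_units):
                if abs(unit - winner) <= radius:
                    for k, x in enumerate(row):
                        weights[unit][k] += rate * (x - weights[unit][k])
    return weights


# Toy profiles (two standardised indicator variables per area), hypothetical.
areas = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (1.0, 0.9), (0.9, 1.0), (1.0, 1.0)]
som = train_som(areas, n_units=2)
clusters = [best_unit(som, area) for area in areas]
```

Varying `n_units` is the simplest way to produce the family of classifications at different levels of generalisation that the text recommends comparing.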

The GB Profiles geodemographic system
Research performed in the School of Geography at Leeds has produced a series of over 100 different 1991 census data based geodemographic systems designed solely for academic research usage. The licence that makes the 1991 census data available to academics prohibits commercial or non-university based applications. This restriction made it feasible to develop census classification systems without being hindered by any market factors; for example, the need for similarity to previous products or the constraints of conventional geodemographic practices. The resulting GB Profiles system,12 based on the best available technology, is designed to use broadly representative census data, to offer multiple classifications at varying levels of resolution, and to be packaged so that it is easy to use; it is freely available for academic research and teaching purposes.

The GB Profiles system runs under both PC and UNIX environments. The Microsoft Windows PC-based system provides easy access to a whole series of census classifications with variable levels of resolution, ranging from 2 to 5000 clusters. However, to keep matters simple, the system currently restricts external users to four specific classifications (with 10, 49, 64, and 100 clusters) derived by both the neural net classifier and a more conventional K-means method (see ref 1 for details). These are provided with a full set of cluster labels and are designed to provide a broadly based description based on the available range of 1991 census data. It was thought that this would appeal most to the majority of potential social science and other research users. An easy to use Microsoft Windows front-end offers the user the ability to select a classification, attach cluster codes to a postcoded data file, export the results in one of several different formats, and perform some preliminary analysis. It is also possible to investigate the cluster labels that are used to represent each area type and, if appropriate, to change them. Figure 1 presents a mosaic of some of the GB Profiler screens.

Figure 1 A mosaic of the GB Profiles system.

A file of postcoded cancer registrations was run through GB Profiler to illustrate some of the benefits and problems of a geodemographic approach. Postcoded data were available for the 1622 colorectal cancer registrations that occurred in Sheffield between 1979 and 1983. The 100 cluster neural net based classification was applied to these data to identify which residential area types were associated with the highest disease incidence (see table 1). Here attention is restricted to areas with a cancer incidence at least 40% higher than expected. These areas are mapped in figure 2. The geodemographic labels associated with the cluster numbers suggest that most of these high incidence areas are relatively poor. They are characterised as struggling, unemployed families and single parents living in council housing. This is a quick and simple way of identifying the type of person and the kind of residential areas that have high numbers of, in this case, colorectal cancer registrations. This descriptive information would be of use as part of the larger picture of health needs assessment and disease monitoring. However, this is a very basic approach that may well be too simple to provide completely reliable results.

Figure 2 Sheffield District Health Authority showing areas of high colorectal cancer (CRC) incidence identified by the 100 cluster neural net classification.

Table 1 Residential cluster types with the highest disease rates

Cluster no   Observed cases   Expected cases*   Index value†
34            2                1                277
10            5                2                261
94           11                4                225
71            7                3                207
97            7                3                205
62            4                2                200
80           13                6                180
64            5                2                175
15           45               22                173
49           44               23                163
27           20               11                156
87           13                7                152
22           20               12                145
28           10                6                144
79            6                4                144
16           66               40                142

* This is an age-sex estimate of the expected numbers of cases.
† 100 is the average for Sheffield.
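The index values reported in table 1 compare observed cases with an age-sex adjusted expectation, scaled so that 100 is the Sheffield average. A minimal sketch of one way to compute such an index follows. It assumes indirect standardisation (reference rates per age-sex stratum applied to each cluster's population), which the text does not state explicitly; the strata, rates, and population figures are hypothetical.

```python
def expected_cases(population_by_stratum, reference_rates):
    """Cases expected if the cluster experienced the reference age-sex specific rates."""
    return sum(population_by_stratum[stratum] * reference_rates[stratum]
               for stratum in population_by_stratum)


def index_value(observed, expected):
    """Observed cases as a percentage of expected; 100 is the reference average."""
    return 100.0 * observed / expected


# Hypothetical reference rates (cases per person over the study period).
reference_rates = {("F", "65+"): 0.004, ("M", "65+"): 0.005,
                   ("F", "<65"): 0.0005, ("M", "<65"): 0.0006}
# Hypothetical population of one residential cluster, by sex and age band.
cluster_population = {("F", "65+"): 1000, ("M", "65+"): 800,
                      ("F", "<65"): 4000, ("M", "<65"): 4000}

expected = expected_cases(cluster_population, reference_rates)
index = index_value(18, expected)
flagged = index >= 140  # the 40% excess threshold used in the text
```

With small expected counts the index is very unstable, which is the small number problem taken up in the next section.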

A data optimal segmentation system
The example presented here is a very brief but not untypical illustration of how a geodemographic approach would be used in health analysis. There are a number of potential problems: (1) the choice of classification, (2) the best number of clusters to use, (3) problems of an ecological fallacy nature, and (4) possible small number problems that render the results uncertain. With GB Profiles, the greatest immediate source of uncertainty is which classification and how many clusters to use. Select too few clusters and the results might well be over generalised and important associations lost; select too many and the results might be spurious due to small number effects. This dilemma between "too few" and "too many" clusters is problem dependent and thus data specific. In a highly descriptive preliminary data screening exercise this may not matter. However, there is a world of difference between geodemographics as applied to "junk mail" in the commercial sector and its use with health database analysis of a more critical nature where higher standards should apply and the problems deserve more explicit consideration.

One way forward is to develop a more sophisticated geodemographic approach. We have outlined what is termed a data optimal segmentation system that might be developed.11 This so called "geodemographic targeting machine" (GTM/1) attempts to identify the best classification to use and within it the best set of clusters so as to maximise coverage of the data and minimise problems due to small number effects. The GTM/1 evaluates a number of different geodemographic classifications by using a mix of Monte Carlo significance testing and bootstrapping, both to delete unreliable classifications (that is, those yielding results little better than a random classification would) and to delete clusters within acceptable classifications for which the results appear to be either highly uncertain (due to small number effects) or not particularly interesting (in terms of predefined performance benchmarks). In essence, GTM/1 is an optimisation procedure that evaluates a set of different geodemographic classifications to find that which captures the largest number of cases in clusters that meet user defined constraints.

The GTM/1 approach is illustrated by reanalysing the colorectal cancer data. A selection of 33 different classifications is examined, covering a range from 10 to 5000 clusters. The following segmentation constraints are set:

* A minimum cluster size of 10 cases,
* A minimum cancer incidence 40% greater than expected taking into account age and gender factors, and
* Results significantly different from random.

Clearly these are arbitrary and can be readily changed as total run times are less than five minutes on a UNIX workstation.

The results are reported in table 2 and mapped in figure 3. It is very interesting that five of the classifications were dropped because they produced results that were not significantly different from random. The previous 100 cluster classification looked good in table 1 but did in fact only capture 278 cases compared with the 337 in the 25 cluster classification which produced the best results in table 2. Moreover, if the small and unreliable clusters are removed, then the 100 cluster classification only captures 219 cases. The labelling of the clusters in this 25 cluster classification would be performed using automatic labelling software and is the subject of continuing research. In general, the results reported here again identify poor housing areas, but with more precision than previously.
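The selection logic described for GTM/1 can be sketched in outline. This is not the GTM/1 code: it covers only the two numerical constraints and a simple Monte Carlo test, omits the bootstrapping step, and assumes a null hypothesis in which cases fall into clusters in proportion to their expected counts. The function names and the three toy classifications are hypothetical.

```python
import random


def captured(observed, expected, min_cases=10, min_index=140.0):
    """Total cases in clusters that meet both segmentation constraints."""
    total = 0
    for cluster, obs in observed.items():
        if obs >= min_cases and 100.0 * obs / expected[cluster] >= min_index:
            total += obs
    return total


def monte_carlo_p(observed, expected, n_sims=199, seed=0):
    """Share of random allocations that capture at least as many cases as observed."""
    rng = random.Random(seed)
    clusters = list(expected)
    weights = [expected[c] for c in clusters]
    n_cases = sum(observed.values())
    actual = captured(observed, expected)
    as_extreme = 0
    for _ in range(n_sims):
        # Assumed null: each case lands in a cluster with probability
        # proportional to that cluster's expected count.
        draw = rng.choices(clusters, weights=weights, k=n_cases)
        simulated = {c: draw.count(c) for c in clusters}
        if captured(simulated, expected) >= actual:
            as_extreme += 1
    return (1 + as_extreme) / (n_sims + 1)


def best_classification(candidates, alpha=0.05):
    """Name of the significant classification capturing most cases, else None."""
    scored = [(captured(obs, exp), name)
              for name, (obs, exp) in candidates.items()
              if monte_carlo_p(obs, exp) <= alpha]
    return max(scored)[1] if scored else None


# Toy candidates: name -> (observed cases, expected cases) per cluster.
candidates = {
    "coarse": ({"a": 60, "b": 40}, {"a": 30, "b": 70}),
    "fine": ({"p": 30, "q": 8, "r": 62}, {"p": 10, "q": 4, "r": 86}),
    "random-like": ({"x": 50, "y": 50}, {"x": 50, "y": 50}),
}
choice = best_classification(candidates)
```

In this toy example the "random-like" classification captures nothing that a random allocation would not, so it is dropped, and the choice between the remaining two is made on the number of cases captured.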

Figure 3 Sheffield District Health Authority showing areas of high colorectal cancer (CRC) incidence identified by the 25 cluster neural net classification.

Table 2 Data optimal segmentation results

Clusters in       Clusters in      No of observed   Expected
classification*   segmentation†    cases            no
  25               6               337              200
  20               4               324              200
  10               2               283              179
  35               5               269              153
  50               4               268              158
  45               5               255              144
  60               5               245              132
  40               4               235              130
 100               7               219              118
  15               3               205              113
 150               8               204              113
  55               4               198              113
 130               6               195              102
 400               8               192              101
 300               9               189               92
  90               6               179               92
 120               7               172               91
  30               5               159               83
 500              10               156               78
 750               9               155               71
  10               2               153               80
 140               6               142               68
  70               5               139               64
2000               9               137               62
  49               4               113               61
4000               5                90               44
3000               7                82               41
5000               4                50               21

* The classifications are ranked in descending order of the number of cases they represent.
† Clusters are deleted because they do not occur in the study region or they are too small or yield unreliable results.

Clearly there are all manner of interesting "trade-offs" that can occur between the choice of targeting constraints and the data being analysed. Nevertheless, it is quite clear that this optimal segmentation technology has considerable potential relevance in a health context. A particularly nice feature is its ability to detect and reject random results.

Conclusion
Geodemographic classifications provide a useful descriptive tool for the analysis of health data. The data optimal segmentation system goes further and provides a simple and quick means of exploring health databases for potentially interesting associations with residential area characteristics. The process is automated and designed to be intrinsically safe. With the large postcoded health databases that exist, such as those kept by the cancer registries, the ability to screen these data quickly and easily for interesting patterns is an important need that the geodemographic segmentation systems described here could be used to meet.

The research was partly funded by an ESRC research grant (R000234436). The cancer data were provided by Sheffield District Health Authority as part of an MRC funded PhD project at Sheffield University. The analysis is based on census and digital boundary data provided with the support of the ESRC and JISC and are copyright of the Crown and/or Post Office. Information about how to obtain the GB Profiler software and associated 1991 census classifications can be obtained by emailing marcus@geog.leeds.ac.uk or stan@geog.leeds.ac.uk. See also the World Wide Web page: http://www.geog.leeds.ac.uk/staff/m.blake/profile/hlpO-toc.htm

1 Openshaw S, Wymer C. Classifying and regionalising census data. In: Openshaw S, ed. Census users handbook. London: GeoInformation International, 1995;239-70.

2 Brown PJB. Exploring geodemographics. In: Masser I, Blakemore M, eds. Handling geographical information: methodology and potential applications. London: Longmans, 1991;221-58.

3 Openshaw S. Census users handbook. London: GeoInformation International, 1995.

4 Jarman B. Identification of underprivileged areas. BMJ 1983;286:1705-9.

5 Bradford MG, Robson BT, Tye R. Constructing an urban deprivation index: a way of meeting the needs for flexibility. Environment and Planning A 1995;27:519-33.

6 Reading R, Jarvis S, Openshaw S. Measurement of social inequalities in health and the use of health services among children in Northumberland. Arch Dis Child 1993;68:626-31.

7 Kohonen T. Self-organization and associative memory. Berlin: Springer-Verlag, 1984.

8 Openshaw S. Neuroclassification of spatial data. In: Hewitson DC, Crane RG, eds. Neural nets: applications in geography. Boston: Kluwer, 1994;53-70.

9 Openshaw S, Blake M, Wymer C. Using neurocomputing methods to classify Britain's residential areas. In: Fisher P, ed. Innovations in GIS 2. London: Taylor and Francis, 1994;97-112.

10 Openshaw S. Special classifications. In: Leventhal B, Moy C, Griffin J, eds. An introductory guide to the 1991 census. Henley: NTC Publications, 1993;69-82.

11 Openshaw S. Developing smart and intelligent target marketing system. Journal of Target Marketing, Measurement and Analysis for Marketing 1994;2:289-301 and 3:31-8.

12 Blake M, Openshaw S. GB Profiler: A Windows front end for GB Profiles. Leeds: School of Geography, University of Leeds. Working paper (in press).

Open discussion
BEN SHLOMO - Did you show post hoc classifications based on the best fit of the data, or who decided that that was the way you were going to classify those different groups?

OPENSHAW - Classifications were based on the census characteristics of the clusters. This was a labelling exercise. It was designed to reduce a set of 80 variables that would have different scores on all the enumeration districts in this particular cluster and come up with a label understandable to others as being broadly representative.

DOLK - Similar to an ACORN classification?

OPENSHAW - Yes. Most people identifying numbers of clusters do so by pretending they can detect break points in plots that show how within cluster variation diminishes as the number of clusters increases. Quite often, moreover, no real break points can be found. There is usually a smooth graph. What I have done is to say "why not have multiple classifications?" The 33 classifications I looked at went from 10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, and so on up to 5000 clusters and then let the data optimal segmentation system figure out what is best for your particular application. I think that neatly gets around the problem of having to identify an optimal break point on smooth graphs when none can be seen.
