Modeling Spatial Dependencies for Mining Geospatial Data

Sanjay Chawla, Shashi Shekhar, Weili Wu, and Uygar Ozesmi

1 Introduction

Widespread use of spatial databases [24] is leading to an increasing interest in mining interesting and useful but implicit spatial patterns [14, 17, 10, 22]. Efficient tools for extracting information from geospatial data, the focus of this work, are crucial to organizations which make decisions based on large spatial data sets. These organizations are spread across many domains including ecology and environment management, public safety, transportation, public health, business, travel and tourism [2, 12]. Classical data mining algorithms [1] often make assumptions (e.g., independent, identical distributions) which violate Tobler's first law of Geography: everything is related to everything else but nearby things are more related than distant things [25]. In other words, the values of attributes of nearby spatial objects tend to systematically affect each other. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this is called spatial autocorrelation [6].
Knowledge discovery techniques which ignore spatial autocorrelation typically perform poorly in the presence of spatial data. Spatial statistics techniques, on the other hand, do take spatial autocorrelation directly into account [3], but the resulting models are computationally expensive and are solved via complex numerical solvers or sampling-based Markov Chain Monte Carlo (MCMC) methods [15].

Supported in part by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory Cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, and by the National Science Foundation under grant 9631539.
Vignette Corporation, Waltham, MA 02451. Email: [email protected]
Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected]
Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected]
Department of Environmental Sciences, Erciyes University, Kayseri, Turkey. Email: [email protected]
Copyright by SIAM. Unauthorized reproduction of this article is prohibited.

In this paper we first review spatial statistical methods which explicitly model spatial autocorrelation, and we propose PLUMS (Predicting Locations Using Map Similarity), a new approach for supervised spatial data mining problems. PLUMS searches the parameter space of models using a map-similarity measure which is more appropriate in the context of spatial data. We will show that, compared to state-of-the-art spatial statistics approaches, PLUMS achieves comparable accuracy but at a fraction of the cost (two orders of magnitude).
Furthermore, PLUMS provides a general framework for specializing other data mining techniques for mining spatial data.

1.1 Unique features of spatial data mining

The difference between classical and spatial data mining parallels the difference between classical and spatial statistics. One of the fundamental assumptions that guides statistical analysis is that the data samples are independently generated: they are like the successive tosses of a coin, or the rolling of a die. When it comes to the analysis of spatial data, the assumption about the independence of samples is generally false. In fact, spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. The economies of a region tend to be similar. Changes in natural resources, wildlife, and temperature vary gradually over space. In fact, this property of like things to cluster in space is so fundamental that, as noted earlier, geographers have elevated it to the status of the first law of geography. This property of self-correlation is called spatial autocorrelation.
Another distinct property of spatial data, spatial heterogeneity, implies that the variation in spatial data is a function of its location. Spatial heterogeneity is measured via local measures of spatial autocorrelation [18]. We discuss measures of spatial autocorrelation in Section 2.

1.2 Famous Historical Examples of Spatial Data Exploration

Spatial data mining is a process of automating the search for potentially useful patterns.
We now list three historical examples of spatial patterns which have had a profound effect on society and scientific discourse [11].

1. In 1855, when the Asiatic cholera was sweeping through London, an epidemiologist marked all locations on a map where the disease had struck and discovered that the locations formed a cluster whose centroid turned out to be a water pump. When the government authorities turned off the water pump, the cholera began to subside. Later scientists confirmed the water-borne nature of the disease.

2. The theory of Gondwanaland, which says that all the continents once formed one land mass, was postulated after R. Lenz discovered (using maps) that all the continents could be fitted together into one piece like one giant jigsaw puzzle. Later fossil studies provided additional evidence supporting the hypothesis.

3. In 1909 a group of dentists discovered that the residents of Colorado Springs had unusually healthy teeth, and they attributed this to high levels of natural fluoride in the local drinking water supply. Researchers later confirmed the positive role of fluoride in controlling tooth decay. Now all municipalities in the United States ensure that drinking water supplies are fortified with fluoride.

In each of these three instances, spatial data exploration resulted in a set of unexpected hypotheses (or patterns) which were later validated by specialists and experts. The goal of spatial data mining is to automate the discovery of such patterns, which can then be examined by domain experts for validation.
Validation is usually accomplished by a combination of domain expertise and conventional statistical techniques.

1.3 An Illustrative Application Domain

We now introduce an example which will be used throughout this paper to illustrate the different concepts in spatial data mining. We are given data about two wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird (Agelaius phoeniceus). The data was collected from April to June in two successive years, 1995 and 1996.

A uniform grid was imposed on the two wetlands and different types of measurements were recorded at each cell or pixel. In total, values of seven attributes were recorded at each cell. Of course domain knowledge is crucial in deciding which attributes are important and which are not. For example, Vegetation Durability was chosen over Vegetation Species because specialized knowledge about the bird-nesting habits of the red-winged blackbird suggested that the choice of nest location is more dependent on plant structure and plant resistance to wind and wave action than on the plant species.

Our goal is to build a model for predicting the location of bird nests in the wetlands.
Typically the model is built using a portion of the data, called the Learning or Training data, and then tested on the remainder of the data, called the Testing data. For example, later on we will build a model using the 1995 data on the Darr wetland and then test it on either the 1996 Darr or 1995 Stubble wetland data. In the learning data, all the attributes are used to build the model; in the testing data, one value is hidden, in our case the location of the nests. Using knowledge gained from the 1995 Darr data and the values of the independent attributes in the test data, we want to predict the location of the nests in Darr 1996 or in Stubble 1995.

In this paper we focus on three independent attributes, namely Vegetation Durability, Distance to Open Water, and Water Depth. The significance of these three variables was established using classical statistical analysis. The spatial distribution of these variables and the actual nest locations for the Darr wetland in 1995 are shown in Figure 1.
These maps illustrate two important properties inherent in spatial data.

1. The values of attributes which are referenced by spatial location tend to vary gradually over space. While this may seem obvious, classical data mining techniques, either explicitly or implicitly, assume that the data is independently generated. For example, the maps in Figure 2 show the spatial distribution of attributes if they were independently generated. One of the authors has applied classical data mining techniques like logistic regression [20] and neural networks [19] to build spatial habitat models. Logistic regression was used because the dependent variable is binary (nest/no-nest) and the logistic function squashes the real line onto the unit interval. The values in the unit interval can then be interpreted as probabilities.
The study concluded that with the use of logistic regression, the nests could be classified at a rate 24% better than random [19].

2. The spatial distributions of attributes sometimes have distinct local trends which contradict the global trends. This is seen most vividly in Figure 1(b), where the spatial distribution of Vegetation Durability is jagged in the western section of the wetland as compared to the overall impression of uniformity across the wetland. This property is called spatial heterogeneity.
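
The logistic squashing mentioned under the first property can be illustrated with a minimal sketch. The function name `nest_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not the values fitted in the study. The three inputs stand for the three attributes used in this paper.

```python
import math

def logistic(z):
    """Map any real number z into the unit interval, in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nest_probability(veg_durability, dist_open_water, water_depth,
                     beta=(-1.0, 0.02, -0.03, -0.01)):
    # beta = (intercept, three slopes); hypothetical values, not fitted ones.
    b0, b1, b2, b3 = beta
    z = b0 + b1 * veg_durability + b2 * dist_open_water + b3 * water_depth
    return logistic(z)

if __name__ == "__main__":
    p = nest_probability(veg_durability=60, dist_open_water=10, water_depth=20)
    print(f"predicted nest probability: {p:.3f}")
```

Thresholding the returned probability (for example at 0.5) would give the binary nest/no-nest label for a pixel; the threshold is likewise an assumption of this sketch.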

In Section 2.2 we describe two measures which quantify the notion of spatial autocorrelation and spatial heterogeneity.

The fact that classical data mining techniques ignore spatial autocorrelation and spatial heterogeneity in the model building process is one reason why these techniques do a poor job. A second, more subtle but equally important reason is related to the choice of the objective function to measure classification accuracy. For a two-class problem, the standard way to measure classification accuracy is to calculate the percentage of correctly classified objects. This measure may not be the most suitable in a spatial context. Spatial accuracy (how far the predictions are from the actuals) is as important in this application domain due to the effects of discretizing a continuous wetland into discrete pixels, as shown in Figure 3. Figure 3(a) shows the actual locations of nests and 3(b) shows the pixels with actual nests. Note the loss of information during the discretization of continuous space into pixels. Many nest locations barely fall within the pixels labeled 'A' and are quite close to other blank pixels, which represent 'no-nest'. Now consider the two predictions shown in Figure 3(c) and 3(d). Domain scientists prefer prediction 3(d) over 3(c), since predicted nest locations are closer on average to some actual nest locations.
The classification accuracy measure cannot distinguish between 3(c) and 3(d), and a measure of spatial accuracy is needed to capture this preference.

Figure 1. (a) Learning dataset: the geometry of the wetland and the locations of the nests (nest sites for the 1995 Darr location), (b) the spatial distribution of vegetation durability over the marshland, (c) the spatial distribution of water depth, and (d) the spatial distribution of distance to open water.

A simple and intuitive measure of spatial accuracy is the Average Distance to Nearest Prediction (ADNP) from the actual nest sites, which can be defined as

$$\mathrm{ADNP}(A, P) = \frac{1}{K} \sum_{k=1}^{K} d\big(A_k,\ A_k.\mathrm{nearest}(P)\big).$$

Here $A_k$ represents the actual nest locations, $P$ is the map layer of predicted nest locations, and $A_k.\mathrm{nearest}(P)$ denotes the nearest predicted location to $A_k$. $K$ is the number of actual nest sites.
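
The ADNP definition above can be sketched in a few lines. The sketch assumes that d is the Euclidean distance between pixel coordinates (the text does not fix a particular distance), and the helper names and the toy 4 x 4 grid are hypothetical. It also shows two predictions that the classification accuracy measure cannot tell apart but ADNP can.

```python
import math

def adnp(actual, predicted):
    """Average Distance to Nearest Prediction over the K actual sites.

    actual, predicted: non-empty lists of (row, col) locations.
    Assumes d is the Euclidean distance.
    """
    if not actual or not predicted:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for a in actual:
        total += min(math.dist(a, p) for p in predicted)
    return total / len(actual)

def classification_accuracy(actual, predicted, shape):
    """Fraction of pixels whose nest/no-nest label is predicted correctly."""
    rows, cols = shape
    a, p = set(actual), set(predicted)
    correct = sum(((r, c) in a) == ((r, c) in p)
                  for r in range(rows) for c in range(cols))
    return correct / (rows * cols)

if __name__ == "__main__":
    shape = (4, 4)
    actual = [(0, 0), (0, 1), (1, 1)]
    far = [(3, 3), (3, 2)]    # predictions far from the actual nests
    near = [(1, 0), (1, 2)]   # predictions next to the actual nests
    for name, pred in (("far", far), ("near", near)):
        acc = classification_accuracy(actual, pred, shape)
        print(name, "accuracy:", acc, "ADNP:", round(adnp(actual, pred), 3))
```

Both toy predictions miss every actual pixel and therefore receive the same classification accuracy, while the ADNP of the nearer prediction is smaller. Note that ADNP is a distance, so a lower value means higher spatial accuracy.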

In Section 3 we will integrate the ADNP measure into the PLUMS framework. We now formalize the spatial data mining problem by incorporating notions of spatial autocorrelation and spatial accuracy in the problem definition.

1.4 Location Prediction: Problem Formulation

The Location Prediction problem is a generalization of the nest location prediction problem.
It captures the essential properties of similar problems from other domains including crime prevention and environmental management.

Figure 2. Spatial distribution satisfying the random distribution assumptions of classical regression: (a) pixel property with independent identical distribution (white noise, no spatial autocorrelation), (b) random nest locations.

Figure 3. (a) The actual locations of nests, (b) pixels with actual nests, (c) locations predicted by a model, (d) locations predicted by another model. Prediction (d) is spatially more accurate than (c). Legend: nest location marker; A = actual nest in pixel; P = predicted nest in pixel.

The problem is formally defined as follows:

Given:
- A spatial framework $S$ consisting of sites $\{s_1, \ldots, s_n\}$ for an underlying geographic space $G$.
- A collection of explanatory functions $f_{X_k}: S \to R_k$, $k = 1, \ldots, K$. $R_k$ is the range of possible values for the explanatory functions.
- A dependent function $f_Y: S \to R_Y$.
- A family $F$ of learning model functions mapping $R_1 \times \cdots \times R_K \to R_Y$.

Find: A function $\hat{f}_Y \in F$.

Objective: maximize

$$\mathrm{similarity}\big(\mathrm{map}_{s_i \in S}(\hat{f}_Y(f_{X_1}, \ldots, f_{X_K})),\ \mathrm{map}(f_Y(s_i))\big) = (1 - \alpha)\,\mathrm{classification\_accuracy}(\hat{f}_Y, f_Y) + \alpha\,\mathrm{spatial\_accuracy}(\hat{f}_Y, f_Y)$$

Constraints:
1. Geographic space $S$ is a multi-dimensional Euclidean space (footnote 1).
2. The values of the explanatory functions $f_{X_1}, \ldots, f_{X_K}$ and the response function $f_Y$ may not be independent with respect to those of nearby spatial sites, i.e., spatial autocorrelation exists.
3. The domain $R_k$ of the explanatory functions is the one-dimensional domain of real numbers.
4. The domain of the dependent variable is $R_Y = \{0, 1\}$.

The above formulation highlights two important aspects of location prediction. It explicitly indicates that (i) the data samples may exhibit spatial autocorrelation and (ii) the objective function, i.e., a map similarity measure, is a combination of classification accuracy and spatial accuracy.
The similarity between the dependent variable $f_Y$ and the predicted variable $\hat{f}_Y$ is a combination of the traditional classification accuracy and a representation-dependent spatial classification accuracy. The regularization term $\alpha$ controls the degree of importance of spatial accuracy and is typically domain dependent. As $\alpha \to 0$, the map similarity measure approaches the traditional classification accuracy measure. Intuitively, $\alpha$ captures the spatial autocorrelation present in spatial data.

The study of the nesting locations of red-winged blackbirds [19, 20] is an instance of the location prediction problem. The underlying spatial framework is the collection of 5m x 5m pixels in the grid imposed on the marshes. Explanatory variables, e.g., water depth, vegetation durability index, and distance to open water, map pixels to real numbers. The dependent variable, i.e., nest locations, maps pixels to a binary domain. The explanatory and dependent variables exhibit spatial autocorrelation, e.g., gradual variation over space, as shown in Figure 1.
Domain scientists prefer spatially accurate predictions which are closer to actual nests, i.e., $\alpha > 0$.

Finally, it is important to note that in spatial statistics the general approach for modeling spatial autocorrelation is to enlarge $F$, the family of learning model functions (see Section 3). The PLUMS approach (see Section 3) allows the flexibility of incorporating spatial autocorrelation in the model, in the objective function, or in both. Later on we will show that retaining the classical regression model as $F$ but modifying the objective function leads to results which are comparable to those from spatial statistical methods but which incur only a fraction of the computational costs.

1.5 Related Work and Our Contributions

Related work examines the area of spatial statistics and spatial data mining.

Spatial Statistics: The goal of spatial statistics is to model the special properties of spatial data. The primary distinguishing property of spatial data is that neighboring data samples tend to systematically affect each other. Thus the classical assumption that data samples are generated from independent and identical distributions is not valid.
Current research in spatial econometrics, geo-statistics, and ecological modeling [3, 16, 11] has focused on extending classical statistical techniques in order to capture the unique characteristics inherent in spatial data. In Section 2 we briefly review some basic spatial statistical measures and techniques.

Footnote 1: The entire surface of the Earth cannot be modeled as a Euclidean space, but locally the approximation holds true.

Spatial Data Mining: Spatial data mining [9, 13, 14, 22, 4], a subfield of data mining [1], is concerned with the discovery of interesting and useful but implicit knowledge in spatial databases. Challenges in spatial data mining arise from the following issues. First, classical data mining [1] deals with numbers and categories; in contrast, spatial data is more complex and includes extended objects such as points, lines, and polygons. Second, classical data mining works with explicit inputs, whereas spatial predicates (e.g., overlap) are often implicit. Third, classical data mining treats each input independently of other inputs, whereas spatial patterns often exhibit continuity and high autocorrelation among nearby features. For example, the population densities of nearby locations are often related.
In the presence of spatial data, the standard approach in the data mining community is to materialize spatial relationships as attributes and rebuild the model with these new spatial attributes [14]. In previous work [4] we studied spatial statistics techniques which explicitly model spatial autocorrelation. In particular we described the spatial autoregressive regression (SAR) model, which extends linear regression for spatial data. We also compared the linear regression and the SAR model on the bird wetland data set.

Our contributions: In this paper, we propose Predicting Locations Using Map Similarity (PLUMS), a new framework for supervised spatial data mining problems. This framework consists of a combination of a statistical model, a map similarity measure along with a search algorithm, and a discretization of the parameter space.
We show that the characteristic property of spatial data, namely spatial autocorrelation, can be incorporated in either the statistical model or the objective function. We also present results of experiments on the bird-nesting data to compare our approach with spatial statistical techniques.

Outline and scope of paper: The rest of the paper is organized as follows. Section 2 presents a review of spatial statistical techniques including the Spatial Autoregressive Regression (SAR) model [15], which extends regression modeling for spatial data. In Section 3 we propose PLUMS, a new framework for supervised spatial data mining, and compare it with spatial statistical techniques. In this paper we focus exclusively on classification techniques. Section 4 presents results of experiments on the bird nesting data sets and Section 5 concludes the paper.

2 Basic Concepts: Modeling Spatial Dependencies

2.1 Spatial Autocorrelation and Examples

Many measures are available for quantifying spatial autocorrelation. Each has strengths and weaknesses. Here we briefly describe the Moran's I measure.

In most cases, the Moran's I measure (henceforth MI) ranges between -1 and +1 and thus is similar to the classical measure of correlation. Intuitively, a higher positive value indicates high spatial autocorrelation. This implies that like values tend to cluster together or attract each other.
A low negative value indicates that high and low values are interspersed. Thus like values are de-clustered and tend to repel each other. A value close to zero is an indication that no spatial trend (random distribution) is discernible using the given measure.

Figure 4. A spatial neighborhood and its contiguity matrix: (a) map of the cells A, B, C, D; (b) contiguity matrix W.

All spatial autocorrelation measures are crucially dependent on the choice and design of the contiguity matrix W. The design of the matrix itself reflects the influence of neighborhood. Two common choices are the four-neighborhood and the eight-neighborhood. Thus given a lattice structure and a point S in the lattice, a four-neighborhood assumes that S influences all cells which share an edge with S. In an eight-neighborhood, it is assumed that S influences all cells which share either an edge or a vertex with S.
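
The two neighborhood choices and the Moran's I measure can be sketched as follows. The formula used is the standard one, I = (n / S0) * (sum over i, j of W_ij z_i z_j) / (sum over i of z_i^2), where z is the mean-centered attribute vector and S0 is the sum of all entries of W; it is not spelled out in the text above, and the regular lattice, the function names, and the two toy maps are assumptions of this sketch.

```python
def contiguity_matrix(rows, cols, neighborhood=4):
    """Binary contiguity matrix W for a rows x cols lattice (row-major cell order)."""
    if neighborhood == 4:      # cells sharing an edge
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif neighborhood == 8:    # cells sharing an edge or a vertex
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("neighborhood must be 4 or 8")
    n = rows * cols
    W = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[r * cols + c][rr * cols + cc] = 1
    return W

def morans_i(values, W):
    """Moran's I of a flattened attribute map under contiguity matrix W."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)
    num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den

if __name__ == "__main__":
    rows, cols = 6, 6
    W = contiguity_matrix(rows, cols, neighborhood=4)
    gradient = [c for r in range(rows) for c in range(cols)]           # gradual variation
    checker = [(r + c) % 2 for r in range(rows) for c in range(cols)]  # interspersed values
    print("gradient map:    ", round(morans_i(gradient, W), 3))
    print("checkerboard map:", round(morans_i(checker, W), 3))
```

A map that varies gradually yields a clearly positive value, while a checkerboard of alternating high and low values yields the most negative value under the four-neighborhood, matching the interpretation given above.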

An eight-neighborhood contiguity matrix is shown in Figure 4. The contiguity matrix of the uneven lattice (left) is shown on the right-hand side. The contiguity matrix plays a pivotal role in the spatial extension of the regression model.

2.2 Spatial Autoregression Models: SAR

We now show how spatial dependencies are modeled in the framework of regression analysis. This framework may serve as a template for modeling spatial dependencies in other data mining techniques.
In spatial regression, the spatial dependencies of the error term, or the dependent variable, are directly modeled in the regression equation [3]. Assume that the dependent values $y_i$ are related to each other, i.e., $y_i = f(y_j)$, $i \neq j$. Then the regression equation can be modified as

$$y = \rho W y + X\beta + \epsilon.$$

Here $W$ is the neighborhood relationship contiguity matrix and $\rho$ is a parameter that reflects the strength of spatial dependencies between the elements of the dependent variable. After the correction term $\rho W y$ is introduced, the components of the residual error vector $\epsilon$ are then assumed to be generated from independent and identical standard normal distributions.

We refer to this equation as the Spatial Autoregressive Model (SAR). Notice that when $\rho = 0$, this equation collapses to the classical regression model. The benefits of modeling spatial autocorrelation are many: (1) The residual error will have much lower spatial autocorrelation, i.e., systematic variation. With the proper choice of $W$, the residual error should, at least theoretically, have no systematic variation. (2) If the spatial autocorrelation coefficient is statistically significant, then SAR will quantify the presence of spatial autocorrelation. It will indicate the extent to which variations in the dependent variable ($y$) are explained by the average of neighboring observation values. (3) Finally, the model will have a better fit, i.e., a higher R-squared statistic.

As in the case of classical regression, the SAR equation has to be transformed via the logistic function for binary dependent variables. The estimates of $\rho$ and $\beta$ can be derived using maximum likelihood theory or Bayesian statistics. We have carried out preliminary experiments using the spatial econometrics matlab package (footnote 2), which implements a Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) methods [16].

The general approach of MCMC methods is that when the joint probability distribution is too complicated to be computed analytically, then a sufficiently large number of samples from the conditional probability distributions can be used to estimate the statistics of the full joint probability distribution. While this approach is very flexible and the workhorse of Bayesian statistics, it is a computationally expensive process with slow convergence properties. Furthermore, at least for non-statisticians, it is a non-trivial task to decide what priors to choose and what analytic expressions to use for the conditional probability distributions.

Footnote 2: We would like to thank James Lesage (http://www.spatial-econometrics.com/) for making the matlab toolbox available on the web.

3 Predicting Locations Using Map Similarity (PLUMS)

Recall that we proposed a general problem definition for the Location Prediction problem, with the objective of maximizing map similarity, which combines spatial accuracy and classification accuracy. In this section, we propose the PLUMS framework for spatial data mining.

Figure 5. The framework for the location prediction process. Diagram labels: the learning data (discretized raster maps of the independent variables and a binary raster map of the dependent variable) is the input to PLUMS; the components of PLUMS are the map similarity measures, the family of functions (i.e., spatial models), the discretization graph for the parameter space, and the algorithm to search the parameter space; the output is the learned spatial model.

3.1 Proposed Approach: Predicting Locations Using Map Similarity (PLUMS)

Predicting Locations Using Map Similarity (PLUMS) is the proposed supervised learning approach.
Figure 5 shows the context and components of PLUMS. It takes a set of maps for explanatory variables and a map for the dependent variable. The maps must use a common spatial framework, i.e., common geographic space and common discretization. PLUMS then produces a learned spatial model to predict the dependent variable using explanatory variables. PLUMS has four basic components: a map similarity measure, a family of parametric functions representing spatial models, a discretization of parameter space, and a search algorithm. PLUMS uses the search algorithm to explore the parameter space to find the parameter value tuple which maximizes the given map similarity measure.
Each parameter value tuple specifies a function from the given family as a candidate spatial model.

A simple map similarity measure focusing on spatial accuracy for nest-location maps (or point sets in general) is the average distance from an actual nest site to the closest predicted nest site. Other spatial accuracy and map similarity measures can be defined using techniques such as the nearest neighbor index [7] and the principal component analysis of a pair of raster maps.

3.2 Greedy Search Algorithm of PLUMS

Algorithm 1 greedy-search-algorithm

    parameter-value-set find-A-local-maxima(
            parameter-value-set PVS,
            discretization-of-parameter-space SF,
            map-similarity-measure-function MSM,
            learning-map-set LMS)
    {
        parameter-value-set best-neighbor, a-neighbor;
        real best-improvement = 1, an-improvement;
        while (best-improvement > 0) do {
            best-neighbor = PVS.get-a-neighbor(SF);
            best-improvement = MSM(best-neighbor, LMS) - MSM(PVS, LMS);
            foreach a-neighbor in PVS.get-all-neighbors(SF) do {
                an-improvement = MSM(a-neighbor, LMS) - MSM(PVS, LMS);
                if (an-improvement > best-improvement) {
                    best-neighbor = a-neighbor;
                    best-improvement = an-improvement;
                }
            }
            if (best-improvement > 0) then PVS = best-neighbor;
        }
        /* found a local maxima in parameter space */
        return PVS;
    }

A special case of PLUMS using greedy search is described in Algorithm 1. The function find-A-local-maxima takes a seed value-tuple of parameters, a discretization of parameter space, a map-similarity function, and a learning data set consisting of maps of explanatory and dependent variables. It evaluates the parameter-value tuples in the immediate neighborhood of the current parameter-value tuple in the given discretization.
An example of a current parameter-value tuple in a red-winged-blackbird application with three explanatory variables is $(a, b, c)$. Its neighborhood may include the following parameter value tuples: $(a+\delta, b, c)$, $(a-\delta, b, c)$, $(a, b+\delta, c)$, $(a, b-\delta, c)$, $(a, b, c+\delta)$, and $(a, b, c-\delta)$, given a uniform grid with cell-size $\delta$ discretization of parameter space. A more sophisticated discretization may use non-uniform grids.

PLUMS evaluates the map similarity measure on each parameter value tuple in the neighborhood. If some of the neighbors have higher values for the map similarity measure, the neighbor with the highest value of the map similarity measure is chosen. This process is repeated until no neighbor has a higher map similarity measure value, i.e., a local maximum has been found. Clearly, this search algorithm can be improved using a variety of ideas including gradient descent [5] and simulated annealing [23].
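
The greedy search described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the grid is uniform with one cell size `delta` for every parameter, the function names are hypothetical, and `toy_similarity` is a stand-in for a real map similarity measure, which in PLUMS would be evaluated on the learning maps.

```python
def neighbors(params, delta):
    """Tuples that differ from params by +delta or -delta in exactly one coordinate."""
    out = []
    for i in range(len(params)):
        for step in (delta, -delta):
            q = list(params)
            q[i] += step
            out.append(tuple(q))
    return out

def greedy_search(seed, delta, msm, max_steps=10_000):
    """Hill-climb from seed until no grid neighbor has a higher msm value."""
    current = tuple(seed)
    current_score = msm(current)
    for _ in range(max_steps):
        scored = [(q, msm(q)) for q in neighbors(current, delta)]
        best, best_score = max(scored, key=lambda t: t[1])
        if best_score <= current_score:
            break  # local maximum in the discretized parameter space
        current, current_score = best, best_score
    return current, current_score

def toy_similarity(params):
    # Hypothetical stand-in for a map similarity measure; peaks at (1, -2, 0.5).
    a, b, c = params
    return -((a - 1.0) ** 2 + (b + 2.0) ** 2 + (c - 0.5) ** 2)

if __name__ == "__main__":
    best, score = greedy_search(seed=(0.0, 0.0, 0.0), delta=0.5, msm=toy_similarity)
    print("local maximum at", best, "with similarity", score)
```

Because the search only ever moves to a strictly better neighbor, it stops at the first local maximum it reaches; on a similarity surface with several peaks the result depends on the seed, which is one motivation for the alternatives such as simulated annealing mentioned above.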

A simple function family is the family of generalized linear models, e.g., logistic regression [15], with or without autocorrelation terms. Other interesting families include non-linear functions. In the spatial statistics literature, many functions have been proposed to capture the spatial autocorrelation property. For example, econometricians use the family of spatial autoregression models [3, 16], geo-statisticians [12] use Co-Kriging, and ecologists use the Auto-Logistic models.

Table 1 summarizes several special cases of PLUMS by enumerating various choices for the four components. The design space of PLUMS is shown in Figure 6.
Each instance of PLUMS is a point in the four-dimensional conceptual space spanned by similarity measure, family of functions, discretization of parameter space, and external search algorithm. For example, the PLUMS implementation labeled A in Figure 6 corresponds to the spatial accuracy measure (ADNP), the generalized linear model (for the family of functions), a greedy search algorithm, and uniform discretization.

Table 1. PLUMS Component Choices

Component                            Choices
Map similarity                       avg. distance to nearest prediction from actual (ADNP), ...
Search algorithm                     greedy, gradient descent, simulated annealing, ...
Function family                      generalized linear (GL) (logit, probit), non-linear, GL with autocorrelation
Discretization of parameter space    uniform, non-uniform, multi-resolution, ...

4 Experiment Design and Evaluation

We carried out experiments to compare the classical regression and spatial autoregressive regression (SAR) models [4] and an instance of the PLUMS framework.

Goals: The goals of the experiments were (1) to evaluate the effects of including the spatial autoregressive term, $\rho W y$, in the logistic regression model and (2) to compare the accuracy and performance of an instance of PLUMS with spatial regression models.

[Figure 6, the design space of PLUMS. Diagram labels: map similarity measure (classification accuracy measure, spatial accuracy measure); function family (Generalized Linear, Generalized Linear with Autocorrelation, Non-Linear with Autocorrelation); search (Greedy (G), Simulated Annealing (SA)); discretization (Uniform (U), Non-Uniform (NU)).]

The 1995 Darr wetland data was used as the learning set to build the classical and spatial models. The parameters of the classical logistic and spatial regression models were derived using maximum likelihood estimation and MCMC methods (Gibbs Sampling). The two models were evaluated based on their ability to predict