Gaussian Processes for Active Data Mining of Spatial Aggregates

Naren Ramakrishnan†, Chris Bailey-Kellogg#, Satish Tadepalli†, and Varun N. Pandey†

†Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
#Department of Computer Science, Dartmouth College, Hanover, NH 03755

Abstract

Active data mining is becoming prevalent in applications requiring focused sampling of data relevant to a high-level mining objective. It is especially pertinent in scientific and engineering applications where we seek to characterize a configuration space or design space in terms of spatial aggregates, and where data collection can become costly. Examples abound in domains such as aircraft design, wireless system simulation, fluid dynamics, and sensor networks. This paper develops an active mining mechanism, using Gaussian processes, for uncovering spatial aggregates from only a sparse set of targeted samples. Gaussian processes provide a unifying framework for building surrogate models from sparse data, reasoning about the uncertainty of estimation at unsampled points, and formulating objective criteria for closing the loop between data collection and data mining. Our mechanism optimizes sample selection using entropy-based functionals defined over spatial aggregates instead of the traditional approach of sampling to minimize estimated variance. We apply this mechanism on a global optimization benchmark comprising a testbank of 2D functions, as well as on data from wireless system simulations. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify high-level structures in data, rather than choosing points that merely improve the quality of function approximation.

Keywords: spatial mining, active mining, sparse data, spatial aggregation, Gaussian processes.

1 Introduction

Many data mining applications in scientific and engineering contexts require analysis and mining of spatial datasets derived from computer simulations or field data, e.g., wireless system simulations, aircraft design configuration spaces, fluid dynamics simulations, and sensor network optimization. In contrast to traditional data mining contexts that are dominated by massive datasets, these domains are actually characterized by a paucity of data, owing to the cost and time involved in conducting simulations or setting up experimental apparatus for data collection. Nevertheless, the computational scientist has control of where data can be collected; it is hence prudent in such domains to focus data collection in only those regions that are deemed important to support a high-level data mining objective.

[Figure 1: plot over SNR1 (dB) versus SNR2 (dB), each axis spanning 10-40 dB.]

Figure 1: Mining configuration spaces from wireless system simulations. The shaded region denotes the largest portion of the configuration space where we can claim, with confidence at least 99%, that the average bit error rate (BER) is acceptable for voice-based system usage. Each 'cell' in the plot is the result of the spatial and temporal aggregation of hundreds of wireless system simulations, many of which take hours.

As a concrete example, consider the characterization of WCDMA (wideband code-division multiple access) wireless system configurations for a given indoor environment. A configuration comprises many adjustable parameters, and the goal of wireless system characterization is to assess the relationship between these parameters and performance metrics such as BER (bit error rate), a measure of the number of bits transmitted in error using the system. When a wireless engineer designs a system for a given indoor environment, he or she sets an acceptable performance criterion for BER (e.g., 10^-3 for a system designed to carry voice traffic, stricter thresholds for data traffic) and seeks a region in the configuration space that can satisfy this criterion (see Fig. 1). To collect the data necessary for mining configuration spaces, the engineer either performs a costly Monte Carlo simulation (where a model of radio propagation in the wireless channel is embedded inside a system-wide model encapsulating wireless protocols and communications standards), or installs channel sounding equipment and system instrumentation in the environment, and actually enacts usage scenarios. In either approach, it is not feasible to first organize a voluminous body of data and subsequently perform data mining on the collected dataset. It is thus imperative that we interleave data collection and data mining and focus sampling at only those locations that maximize well-defined notions of relevance and utility. Importantly, we will not need to sample the entire configuration space, only enough so as to identify a region with acceptable confidence.

Active data selection has been investigated in a variety of contexts [4, 25]. A sampling strategy typically embodies a human assessment of where might be a good location to collect data [1, 13] or is derived from the optimization of specific design criteria [5, 17, 22]. Many of these strategies, however, are either based on utility of data for function approximation purposes [24], or are meant to be used with specific data mining algorithms and tasks (e.g., classification [10]). In this paper, we present a formal framework that casts spatial data mining as uncovering successive multi-level aggregates of data, and uses properties of higher-level structures to help close the loop between mining and data collection. This approach helps us design sampling strategies that bridge higher-level quality metrics of structures (e.g., entropy) with lower-level considerations of data samples (e.g., locations and fidelity). While we focus on spatial contexts, we point out that 'spatial' can denote any dimension that affords a metric; our approach thus applies equally well to a wide range of datasets with more abstract notions of space (such as the wireless simulation example above).

Our active mining mechanism is based on the spatial aggregation language (SAL; [3]), a generic data mining framework for spatial datasets, and Gaussian processes (GPs; [27]), a powerful unifying theory for approximating and reasoning about datasets. Gaussian processes provide the 'glue' that enables us to perform active mining on spatial aggregates. In particular, they aid in (i) creation of surrogate models from data using a sparse set of samples (for cheap generation of dense approximate datasets), (ii) reasoning about the uncertainty of estimation at unsampled points, and (iii) formulation of objective criteria for active data collection.

[Figure 2 schematic, with elements: Input Field, Sample, Interpolate, Aggregate, Classify, Localize, Redescribe, N-graph, Equivalence classes, Spatial objects, Ambiguities, Lower-Level Objects, Higher-Level Objects, Abstract Description.]

Figure 2: SAL uncovers multi-layer spatial aggregates by employing a small set of operators (a spatial mining "vocabulary") utilizing suitable domain knowledge. A variety of spatial data mining tasks, such as vector field bundling, contour aggregation, correspondence abstraction, clustering, and uncovering regions of uniformity, can be expressed as multi-level spatial aggregate computations.

1.1 Contributions: This paper builds on our prior work in [1, 23] by presenting a novel integration of Gaussian processes with SAL:

• While classical active mining work in spatial modeling focuses on quality of function approximation, the mechanism presented here focuses on clarifying high-level structures. The entropy-based sampling approach introduced in this paper is applicable to mining a broad range of spatial structures.

• Unlike traditional data mining contexts that deal with voluminous amounts of data, the mechanism is targeted at scenarios where data collection costs far outweigh data mining costs. For instance, in the wireless simulation study, each data sample requires a few hours to acquire on a cluster of workstations, whereas the data mining (and sample selection optimization) algorithms as implemented here can be executed in a matter of minutes on a workstation.

• Since Gaussian processes are re-statements of kernel-based learning methods [7], this work helps bridge the qualitative nature of SAL algorithms with rigorous quantitative methodologies necessary to evaluate and assess active mining strategies.

This work assumes a moderate background of algorithms for spatial aggregation and spatial statistical modeling. Nevertheless, Sec. 2 and Sec. 3 overview earlier work on SAL and GPs for the SIAM DM audience and instantiate such work for a motivating case study — identifying and characterizing pockets in a space (such as the wireless system design configuration space). Sec. 4 then moves to active mining and introduces two important classes of sampling strategies integrating SAL and GPs. Sec. 5 evaluates the mechanism using both synthetic and real-world datasets. Sec. 6 provides a discussion and reviews related work.

Figure 3: de Boor's 'pocket' function in 2D, depicting contours around basins of local minima.

2 Spatial Aggregation Language

The Spatial Aggregation Language (SAL) [3, 28, 30] is a generic framework to study the design and implementation of spatial data mining algorithms. SAL is centered on a field ontology, in which the spatial data input is a field mapping from one continuum to another (e.g. 2D temperature field: R^2 → R^1; 3D fluid flow field: R^3 → R^3). SAL programs employ vision-like routines, in an imagistic reasoning style [29], to uncover and manipulate multi-layer geometric and topological structures in fields. Due to continuity, fields exhibit regions of uniformity, and these regions can be abstracted as higher-level structures which in turn exhibit their own continuities. Task-specific domain knowledge specifies how to uncover such uniformity, defining metrics for closeness of objects and similarity of features. For example, streamlines are connected curves of nearby points with vectors flowing in similar enough directions, while pressure cells are connected regions of similar (and extreme) enough pressure.

SAL supports structure discovery through a small set of generic operators, parameterized with domain-specific knowledge, on uniform data types. These operators and data types mediate increasingly abstract descriptions of the input data (see Fig. 2) to form higher-level abstractions and mine patterns. The primitives in SAL are contiguous regions of space called spatial objects; the compounds are (possibly structured) collections of spatial objects; the abstraction mechanisms connect collections at one level of abstraction with single objects at a higher level. This vocabulary has proved effective for expressing the mechanisms required to uncover multi-level structures in spatial datasets in applications ranging from decentralized control design [2] and object manipulation [30] to analysis of weather data [12], diffusion-reaction morphogenesis [21], and matrix perturbation analysis [22].

The identification of structures in a field is a form of data reduction: a relatively information-rich field representation is abstracted into a more concise structural representation (e.g. pressure data points into isobar curves or pressure cells; isobar curve segments into troughs). Navigating the mapping from field to abstract description through multiple layers rather than in one giant step allows the construction of modular data mining programs with manageable pieces that can use similar processing techniques at different levels of abstraction. The multi-level mapping also allows higher-level layers to use global properties of lower-level objects as local properties of the higher-level objects. For example, the average temperature in a region is a global property when considered with respect to the temperature data points, but a local property when considered with respect to a more abstract region description. As this paper demonstrates, analysis of higher-level structures in such a hierarchy can guide interpretation of lower-level data.

Let us begin with a spatial mining task motivated by the wireless study — determining the number and locations of pockets, or basins of local minima, in a vector field. Fig. 3 illustrates four pockets in a field defined by Carl de Boor's function in 2D (from [22]):

α(X) = cos( Σ_{i=1}^{n} 2^i (1 + x_i/|x_i|) ) − 2        (2.1)

δ(X) = ‖X − 0.5 I‖        (2.2)

p(X) = α(X) (1 − δ^2(X) (3 − 2 δ(X))) + 1        (2.3)

where X is the n-dimensional point (x_1, x_2, · · · , x_n) at which the pocket function p is evaluated, I is the identity n-vector, and ‖·‖ is the L2 norm. The property of this function is that it assumes a pocket in each corner of the cube, just outside the sphere enclosed in the cube. Since the ratio of cube volume (2^n) to that of the sphere (π^{n/2}/(n/2)!) grows unboundedly, global optimization algorithms cannot exploit any special properties and must consider every one of the 2^n corners!
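For concreteness, here is a minimal NumPy sketch of Eqs. 2.1-2.3; the function name `pocket` is ours, and we assume every coordinate x_i is nonzero (the sign term x_i/|x_i| is undefined at zero).

```python
import numpy as np

def pocket(x):
    """de Boor's 'pocket' function p(X) of Eqs. 2.1-2.3 at a point x in R^n.

    Assumes every coordinate is nonzero, since x_i/|x_i| is undefined at 0.
    """
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    # Eq. 2.1: alpha(X) = cos( sum_i 2^i (1 + x_i/|x_i|) ) - 2
    alpha = np.cos(np.sum(2.0 ** i * (1.0 + x / np.abs(x)))) - 2.0
    # Eq. 2.2: delta(X) = ||X - 0.5 I||  (I is the all-ones n-vector)
    delta = np.linalg.norm(x - 0.5)
    # Eq. 2.3: p(X) = alpha(X) (1 - delta^2 (3 - 2 delta)) + 1
    return alpha * (1.0 - delta ** 2 * (3.0 - 2.0 * delta)) + 1.0
```

Evaluating `pocket` on a dense grid over [−1, 1]^2 reproduces the field plotted in Fig. 3.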


Hence, the de Boor function is a well-known benchmark for global optimization (esp. in high dimensions), but we focus here on a somewhat different objective of characterizing the high-level structure of the field. The algorithmic encoding of the calculus definition of local minima suggests that the four pockets in Fig. 3 can be identified via convergent flows in the gradient underlying the vector field. Let us assume we are given a dense set of samples covering the region of interest. Fig. 4 illustrates an example of key spatial aggregation operations:

(a) Establish the input field, here by calculating the gradient field (normalized, since we're interested only in direction in order to detect convergence).

(b) Localize computation with a neighborhood graph, so that only spatially proximate objects are compared. Here, an 8-adjacency neighborhood graph is employed, which results in somewhat 'blocky' streamlines but fast computation.

(c)-(f) Use a series of local computations to find equivalence classes of neighboring objects with similar features. Here, we systematically eliminate all neighborhood graph edges but those whose directions best match the vector direction at both endpoints. 'Forward neighbor' computation compares graph edge direction with the average of the vector directions, and keeps only those that are similar enough (implemented as a cosine angle similarity threshold). 'Best forward neighbor' at junction points then selects from among these neighbors, by a third metric combining similarity in direction with closeness in point location. Backward calculations are analogous, but deal with the predecessor along a streamline rather than the successor.

(g) Move up a level in the spatial object hierarchy by redescribing equivalence classes into more abstract objects. Here, connected vectors are abstracted into curve objects, which have both a reduced representation and additional semantic properties (e.g. curvature is well-defined).

(h) Apply the same mechanism — aggregate, classify, and redescribe — at the new level, using the exact same operators but with different metrics. Here, curves are grouped into coherent pockets with convergent flow. Neighborhood (not shown) is derived from neighborhood of constituent vectors, and equivalence tests direction of flow for convergence.

Notice that SAL is not a specific data mining algorithm, but rather a language to construct complex mining operations (such as in Fig. 4) from a small core set of operations. As such, the quality of results from a SAL implementation depends on suitable choices of abstraction levels and appropriate settings of any relevant parameters. For instance, in the above example, three parameters control the relationship from input field to output structures: adjacency neighborhood size (used in step (b)), angle for vector similarity (used in step (c)), and distance penalty metric (used in step (d) to combine distance with direction). For Fig. 4, we set these parameters to 1.5 (generates an 8-adjacency neighborhood), 0.75, and 0.1 respectively. This paper is not concerned with evaluating particular SAL implementations but instead focuses on using them from within an active mining context.
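To ground steps (a)-(c) above, the following sketch (our own helper names, not part of any SAL release) computes a normalized gradient field and prunes an 8-adjacency graph with the cosine-angle test, using the parameter values just quoted (radius 1.5, similarity threshold 0.75); the quadratic neighbor scan is for clarity only.

```python
import numpy as np

def gradient_field(values, spacing=1.0):
    """Step (a): normalized gradient of a 2D array of function samples."""
    gy, gx = np.gradient(values, spacing)
    mag = np.hypot(gx, gy) + 1e-12           # avoid division by zero
    return gx / mag, gy / mag

def forward_neighbor_edges(points, vectors, radius=1.5, cos_thresh=0.75):
    """Steps (b)-(c): 8-adjacency graph (radius 1.5 on a unit grid), keeping
    only edges aligned with the averaged vector direction at the endpoints."""
    edges = []
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i == j or np.linalg.norm(q - p) > radius:
                continue                     # not an 8-adjacency neighbor
            d = (q - p) / np.linalg.norm(q - p)
            v = vectors[i] + vectors[j]      # average of the two directions
            v = v / (np.linalg.norm(v) + 1e-12)
            if d @ v >= cos_thresh:          # cosine-angle similarity test
                edges.append((i, j))
    return edges
```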

Localized computations are integral to SAL, and hence an effective SAL application relies on a dense set of samples covering the domain. When data is scarce, we can first build an approximation to the underlying field with the given samples, and use the approximation to generate a dense field of data (e.g., on a uniform grid). Such an approximation is called a surrogate model — a cheap-to-compute substitute for a complex function. One way to build surrogate models relies on Gaussian processes.

3 Gaussian Processes

The use of Gaussian processes in machine learning and data mining is a relatively new development, although their origins can be traced to spatial statistics and the practice of modeling known as kriging [14]. In contrast to global approximation techniques such as least-squares fitting, GPs are local approximation techniques, akin to nearest-neighbor procedures. In contrast to function approximation techniques that place a prior on the form of the function, GP modeling techniques place a prior on the covariance structures underlying the data.

The basic idea in GPs is to model a given dataset as a realization of a stochastic process. Formally, a GP is a set of random variables, any finite subset of which has a (multivariate) normal distribution. For our purposes, we can think of these variables as spatially distributed (scalar) response variables t_i, one for each 2D location x_i = [x_i1, x_i2] where we have collected a data sample. In our vector field analysis application, t_i denotes the modeled response, i.e., the value of de Boor's function at x_i. Given a dataset D = {x_i, t_i}, i = 1 . . . n, and a new data point x_{n+1}, a GP can be used to model the posterior P(t_{n+1} | D, x_{n+1}) (which would also be a Gaussian). This is essentially what many Bayesian modeling techniques do (e.g., least squares approximation with normally distributed noise), but it is the specifics of how the posterior is modeled that make GPs distinct as a class of modeling techniques.

Figure 4: Example steps in SAL implementation of vector field analysis of de Boor's function. (a) Input vector field. (b) 8-adjacency neighborhood graph. (c) Forward neighbors. (d) Best forward neighbors. (e) Neighborhood graph transposed from best forward neighbors. (f) Best backward neighbors. (g) Resulting adjacencies redescribed as curves. (h) Higher-level aggregation and classification of curves whose flows converge.

To make a prediction of t_{n+1} at a point x_{n+1}, GPs place greater reliance on t_i's from nearby points. This reliance is specified in the form of a covariance prior for the process and will be central to how we embed SAL in a broader GP framework:

Cov(t_i, t_j) = α exp( −(1/2) Σ_{k=1}^{2} a_k (x_ik − x_jk)^2 )        (3.4)

Intuitively, this function captures the notion that response variables at nearby points must have high correlation. The reader will note that this idea of influence decaying with distance has an immediate parallel to how SAL programs localize computations. In Eq. 3.4, α is an overall scaling term whereas a_1, a_2 define the length scales for the two dimensions. However, this prior (or even its posterior) does not directly allow us to determine t_j from t_i, since the structure only captures the covariance; predictions of a response variable for new sample locations are thus conditionally dependent on the measured response variables and their sample locations. Hence, we must first estimate the covariance parameters (a_1, a_2, and α) from D and then use these parameters along with D to predict t_{n+1} at x_{n+1}.
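A direct transcription of Eq. 3.4 in NumPy follows (helper names are ours); the small diagonal 'jitter' in the matrix builder is a standard numerical safeguard, not part of the model.

```python
import numpy as np

def cov(xi, xj, alpha, a):
    """Covariance prior of Eq. 3.4; a = (a_1, a_2) are the length-scale terms."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return alpha * np.exp(-0.5 * np.sum(a * (xi - xj) ** 2))

def cov_matrix(X, alpha, a, jitter=1e-8):
    """n x n covariance matrix Cov_n over sample locations X (an n x 2 array)."""
    n = len(X)
    C = np.array([[cov(X[i], X[j], alpha, a) for j in range(n)]
                  for i in range(n)])
    return C + jitter * np.eye(n)            # keep the matrix invertible
```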

3.1 Using a GP: Before covering the learning procedure for the covariance parameters (a_1, a_2, and α), it is helpful to develop expressions for the posterior of the response variable in terms of these parameters. Since the jpdf of the response variables P(t_1, t_2, · · · , t_{n+1}) is modeled Gaussian (we will assume a mean of zero), we can write:

P(t_1, t_2, · · · , t_{n+1} | x_1, x_2, · · · , x_{n+1}, Cov_{n+1}) = (1/λ_1) exp( −(1/2) [t_1, t_2, · · · , t_{n+1}] Cov_{n+1}^{-1} [t_1, t_2, · · · , t_{n+1}]^T )

where we ignore λ_1 as it is simply a normalizing factor. Here, Cov_{n+1} is the covariance matrix formed from the (n + 1) data values (x_1, x_2, · · · , x_{n+1}). A distribution for the unknown variable t_{n+1} can then be obtained as:

P(t_{n+1} | t_1, t_2, · · · , t_n, x_1, x_2, · · · , x_{n+1}, Cov_{n+1})
    = P(t_1, t_2, · · · , t_{n+1} | x_1, x_2, · · · , x_{n+1}, Cov_{n+1}) / P(t_1, t_2, · · · , t_n | x_1, x_2, · · · , x_{n+1}, Cov_{n+1})
    = P(t_1, t_2, · · · , t_{n+1} | x_1, x_2, · · · , x_{n+1}, Cov_{n+1}) / P(t_1, t_2, · · · , t_n | x_1, x_2, · · · , x_n, Cov_n)

where the last step follows by conditional independence of {t_1, t_2, · · · , t_n} w.r.t. x_{n+1} and the part of Cov_{n+1} not contained in Cov_n. The denominator in the above expression is another Gaussian random variable, given by:

P(t_1, t_2, · · · , t_n | x_1, x_2, · · · , x_n, Cov_n) = (1/λ_2) exp( −(1/2) [t_1, t_2, · · · , t_n] Cov_n^{-1} [t_1, t_2, · · · , t_n]^T )


Putting it all together, we get:

P(t_{n+1} | t_1, t_2, · · · , t_n, x_1, x_2, · · · , x_{n+1}, Cov_{n+1}) = (λ_2/λ_1) exp( −(1/2) [t_1, t_2, · · · , t_{n+1}] Cov_{n+1}^{-1} [t_1, t_2, · · · , t_{n+1}]^T + (1/2) [t_1, t_2, · · · , t_n] Cov_n^{-1} [t_1, t_2, · · · , t_n]^T )

Computing the mean and variance of this Gaussian distribution, we get an estimate of t_{n+1} as:

t_{n+1} = k^T Cov_n^{-1} [t_1, t_2, · · · , t_n]^T        (3.5)

and our uncertainty in this estimate as:

σ^2_{t_{n+1}} = k − k^T Cov_n^{-1} k        (3.6)

where k^T represents the n-vector of covariances with the new data point:

k^T = [Cov(x_1, x_{n+1}) Cov(x_2, x_{n+1}) · · · Cov(x_n, x_{n+1})]

and k is the (n + 1, n + 1) entry of Cov_{n+1}. Eqs. 3.5 and 3.6, together, give us both an approximation at any given point and an uncertainty in this approximation; they will serve as the basic building blocks for closing the loop between data modeling and higher-level mining functionality.

The above expressions can be alternatively derived by positing a linear probabilistic model and optimizing for the MSE (mean squared error) between observed and predicted response values (e.g., see [24]). In this sense, the Gaussian process model considered here is also known as the BLUE (best linear unbiased estimator), but GPs are not restricted to linear combinations of basis functions.

To apply GP modeling to a given dataset, we must first ensure that the chosen covariance structure matches the data characteristics. We have chosen a stationary structure above under the assumption that the covariance is translation invariant. Various other functions have been studied in the literature (e.g., see [18, 19, 24]), all of which satisfy the required property of positive definiteness. The simplest covariance function yields a diagonal matrix, but this means that no data sample can have an influence on other locations, and the GP approach offers no particular advantages. In general, by placing a prior directly on the function space, GPs are appropriate for modeling 'smooth' functions. The terms a_1, a_2 capture how quickly the influence of a data sample decays in each direction and, thus, the length scales for smoothness.

An important point to note is that even though the GP realization is one of a random process, we can nevertheless build a GP model for deterministic functions (like de Boor's function) by choosing a covariance structure that ensures the diagonal correlations to be 1 (i.e., perfect reproducibility when queried for a sample whose value is known). Also, the assumption of zero mean for the Gaussian process can be relaxed by including a constant term (which gives another parameter to be estimated) in the covariance formulation. This approach is used for our experimental studies.

3.2 Learning a GP: Learning the GP parameters θ = (a_1, a_2, α) can be undertaken in the ML and MAP frameworks, or in the true Bayesian setting where we obtain a distribution over values. The log-likelihood for the parameters is given by:

L = log P(t_1, t_2, · · · , t_n | x_1, x_2, · · · , x_n, θ)
  = c + log P(θ) − (n/2) log(2π) − (1/2) log |Cov_n| − (1/2) [t_1, t_2, · · · , t_n] Cov_n^{-1} [t_1, t_2, · · · , t_n]^T

To optimize for the parameters, we can compute partial derivatives of the log-likelihood for use with a conjugate gradient or other optimization algorithm:

∂L/∂θ = ∂ log P(θ)/∂θ − (1/2) tr( Cov_n^{-1} (∂Cov_n/∂θ) ) + (1/2) [t_1, t_2, · · · , t_n] Cov_n^{-1} (∂Cov_n/∂θ) Cov_n^{-1} [t_1, t_2, · · · , t_n]^T

where tr(·) denotes the trace function. In our running example, we need only estimate three parameters for θ, well within the purview of modern numerical optimization software. For larger numbers of parameters, we can resort to the use of MCMC methods [19].
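As a sketch of the fitting step, the code below minimizes the negative log-likelihood with SciPy's conjugate-gradient routine as a stand-in for Netlab's scaled conjugate gradient (which the experiments in Sec. 5 actually use); the gradient is approximated numerically here rather than via the trace formula above, and the parameters are optimized in log space to keep them positive.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_theta, X, t):
    """-L of Sec. 3.2, up to constants, with a flat prior on theta."""
    alpha, a1, a2 = np.exp(log_theta)        # log parametrization => positivity
    Cn = cov_matrix(X, alpha, np.array([a1, a2]))
    _, logdet = np.linalg.slogdet(Cn)        # log |Cov_n|
    quad = t @ np.linalg.solve(Cn, t)        # [t] Cov_n^{-1} [t]^T
    return 0.5 * (logdet + quad)

def fit_gp(X, t):
    """Return (alpha, a1, a2) maximizing the likelihood of data (X, t)."""
    res = minimize(neg_log_likelihood, x0=np.zeros(3),
                   args=(X, np.asarray(t)), method="CG")
    return np.exp(res.x)
```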

4 Active Data Mining Strategies

The above section showed two important uses of GPs for spatial mining: designing a surrogate function for generating a dense field (via Eq. 3.5), and assessing uncertainties in our estimates of the function at unsampled points (using Eq. 3.6). We are now ready to formulate objective criteria for active data selection, a precursor to active mining.

4.1 Variance-Reducing Designs: A simple strategy for sampling is to target locations to reduce our uncertainty in modeling, i.e., select the location that minimizes the posterior generalized variance of the function. This approach can be seen as optimizing sample selection for the functional:

Φ_V = (1/2) log [∂t/∂θ] H^{-1} [∂t/∂θ]^T        (4.7)

where [∂t/∂θ] is the (row) vector of sensitivities w.r.t. each GP parameter computed at a sample location, and H is the Hessian (second order partial derivatives) of t, again w.r.t. the parameters. A straightforward derivation will show that optimizing Φ_V suggests a location whose 'error bars' σ^2 are highest.

To implement this strategy, we can adopt either a block design (optimize for K locations simultaneously), or apply it sequentially to determine one extra sampling location at a time. The former is appropriate when we can farm out function evaluations across nodes in a compute cluster, whereas the latter will track the design functional better. We adopt the sequential approach here; Fig. 5 shows this strategy for the pocket function of Fig. 3 and concomitant results from pocket mining of the surrogate model data. At each step, we determine the best sample location (from among unsampled locations on a regular 21 × 21 grid), build the GP model from the data collected thus far, and apply our SAL-based vector aggregation mechanism to the gradient field derived from the function values.
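One round of this sequential loop, expressed with our sketched helpers `fit_gp` and `gp_predict` from above (the `oracle` callback standing in for a simulation run, and `grid` for the 21 × 21 candidate lattice), might look like:

```python
import numpy as np

def variance_based_round(X, t, grid, oracle):
    """Fit the GP, score unsampled grid points by predictive variance
    (the sigma^2 'error bars' that Phi_V favors), and query the largest."""
    alpha, a1, a2 = fit_gp(X, t)
    sampled = {tuple(x) for x in X}
    candidates = [g for g in grid if tuple(g) not in sampled]
    variances = [gp_predict(X, t, g, alpha, np.array([a1, a2]))[1]
                 for g in candidates]
    x_next = candidates[int(np.argmax(variances))]
    return np.vstack([X, x_next]), list(t) + [oracle(x_next)]
```

The returned design and responses feed the next round; after each round the SAL pocket miner is re-run on the dense surrogate field.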

The initial design has one point in the center of each quadrant, and one at the center. Not surprisingly, we find a significant number (16) of basins in the gradient field. The next four points added are actually at the corners; this is because estimated variances are typically high toward the boundaries of an interpolation region. As MacKay points out [17], such a metric has a tendency to 'repeatedly gather data at the edges of the input space.' Continuing the sampling, we see that the 13-point design actually has the samples organized in a diagonal design (a layout that has been referred to as 'whimsical' [9]). The emphasis on overall quality of function approximation more than data mining is evident from the fact that it takes over 30 points before the SAL-based pocket finder can infer that there are four pockets. In further experiments not reported here, we have found that pushing the initial points outward (or inward) does not have any appreciable effect on future samplings, and the variance-based metric favors the outer envelope of the design space.

4.2 Entropy-Based Functionals: It is a classical result in experiment design (e.g., see [8]) that, for Gaussian priors, the variance-reducing design is actually equivalent to the design minimizing the (expected) posterior entropy of the distribution t_D̄ | D, where D̄ denotes the unsampled locations in the design space. For a proof, see [16]. This criterion is also equivalent to the D-optimality design criterion in spatial statistics, under the assumption that the noise factor on all measurements is the same. MacKay generalizes this idea [17], and pre-specifies a collection of points requiring high-quality approximation; the goal then is to minimize the entropy of the data distribution w.r.t. these points. This strategy does not apply here since our understanding of which locations are relevant improves as active mining proceeds.

To develop a better active mining strategy, notice that our goal is the identification of regions defined by convergent flows. If we view the SAL program as an information processor that maps a data field into a class field (defined over the same underlying space), then the utility of sampling in a region is directly related to our inferential capabilities about the corresponding region in the class field. Intuitively, we should be more interested in samples that tell us something about the boundary between regions than those that capture the insides of a region, even though the latter might have high variance in its current estimate. Repeatedly sampling function values inside an already classified and abstracted region is not as useful as sampling to clarify an emerging boundary classification. This means that we must bridge high-level information about pockets from SAL into a preference of where to collect data.

An idea that suggests itself is to adopt variance-based design, but instead of minimizing the entropy of the data distribution, minimize the entropy of the class distribution as revealed by the SAL pocket finder. By positing a class distribution at each point, based on the class labels occupied by neighboring points, we achieve our goal of ranking locations along region boundaries higher. While this basic strategy appears reasonable, it will repeatedly gather information at the region boundaries, just as variance-based design repeatedly focuses on the edges. So a point with high entropy is a good location to sample only as long as the variance surrounding it is sufficiently high. As our confidence in the data value increases, our preference for this location should decrease even if the class entropy remains large (as it will, if it lies on a boundary). This suggests using class entropy to define a distribution P_E(x) over points, and using that distribution to scale the variance-based design criterion:

Φ_E = (1/2) Σ_x P_E(x) log [∂t/∂θ] H^{-1} [∂t/∂θ]^T        (4.8)

The expression inside the summation contains the same term as in Eq. 4.7 but is now evaluated across the design space and scaled by the amount of interest in location x:

P_E(x_i) ∝ − Σ_{x ∈ N(x_i)} P(C(x)) log P(C(x))        (4.9)


where N(x_i) is a neighborhood around x_i, C(x) denotes the (flow) class of point x as inferred by the SAL miner, and P(C(x)) denotes the probability of encountering this class in the neighborhood. The proportionality constant in Eq. 4.9 must be set to ensure that Σ_x P_E(x) = 1. Formal characterization of this criterion (i.e., convergence properties) is difficult since P_E(x) changes during every iteration of data mining, and we do not have a model of how P_E(x) varies across samplings. Operationally, to apply this criterion, we can identify the location that gives the highest information gain, given that we are intending to make a measurement at that location. Fig. 6 shows a design that optimizes Φ_E and successfully reveals all four pockets with only 11 points.

[Figure 5 panels: 'Currently design space: 5, 9, 13, 31 points' (top row); 'GP model with 5, 9, 13, 31 points' (middle row); pockets mined: 16, 3, 3, 4 (bottom row).]

Figure 5: How variance-based sampling selects locations. (top row) initial design of 5 points, followed by snapshots taken at later stages (9, 13, and 31 points). Old sample locations are shown with red circles and new locations are shown with blue diamonds. (middle row) GP model fits to the given samples. (bottom row) Number of pockets identified by SAL pocket miner.
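A sketch of the entropy weighting (names ours): P_E is estimated by Eq. 4.9 from the SAL class labels in each point's fixed 8-neighborhood, and then scales a per-point uncertainty score, for which we use the predictive variance in line with the σ^2 observation of Sec. 4.1.

```python
import numpy as np

def entropy_distribution(labels_2d):
    """P_E of Eq. 4.9: class entropy over each point's 8-neighborhood,
    normalized to sum to one over the grid."""
    rows, cols = labels_2d.shape
    H = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = labels_2d[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            _, counts = np.unique(window, return_counts=True)
            p = counts / counts.sum()
            H[r, c] = -np.sum(p * np.log(p))  # zero inside a pure region
    total = H.sum()
    return H / total if total > 0 else np.full((rows, cols), 1.0 / H.size)

def entropy_weighted_scores(variances_2d, labels_2d):
    """Rank candidate locations: ambiguous class structure AND high variance."""
    return entropy_distribution(labels_2d) * variances_2d
```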

4.3 Computational Considerations: Other than any data collection costs, the primary costs to implementing the active mining mechanisms involve the nested optimizations and the necessary matrix computations. There are two optimizations per round of data collection: a multi-dimensional optimization over θ to fit the surrogate model, and a 2D optimization over x to identify the next sample point. Both can be done either locally or globally, depending on our fidelity requirements and availability of resources. Here, to reduce the computational complexity in building the surrogate model, we adopt the public domain Netlab scaled conjugate gradient algorithm [18], which runs in O(|D||θ|) time. While this algorithm avoids having to work with the Hessian explicitly, the active sample selection step requires the computation of the Hessian inverse, which takes O(|D||θ|^2 + |θ|^3) time. To reduce the cost of optimization, we use a discrete lattice search or hill climbing, restricting our attention to locations over a uniform grid. If the number of locations on the grid is |G|, then each round of active mining requires O(|G||θ|^2) time, plus the cost of computing the inverse Hessian. This expression applies to both variance-based mining and entropy-based mining, since the computation of P_E(x) is just linear in |G| for a fixed neighborhood calculation of entropy. Recall that this cost is typically negligible compared to the actual cost of running a simulation (to acquire a data sample).

[Figure 6 panels: 'Current design space: 11 points' (top); 'GP model with 11 points' (bottom).]

Figure 6: Active data mining with an entropy-based functional. This sampling strategy picks six additional points (top), in various quadrants; the SAL miner finds four pockets (bottom) when a GP model is constructed using the given points. Contrast this with the performance of variance-based sampling for a comparable number of samples.

4.4 Stopping Criteria: How do we know when to stop sampling? If a cost metric is defined over data collection, and if it can be determined that we can sample at most K points within the given resources, then we should ideally perform a K-dimensional optimization, rather than adopting a sequential sampling strategy. In the absence of such a cost metric, a sampling strategy could terminate when the estimated dataset log-likelihood is within bounds. In this paper, we primarily evaluate sampling strategies using classes of problems for which the 'right' answer is known, and pose questions such as: 'starting from an initial grid, how many samples does it take to mine the right number of higher-level structures?' The answer to this question gives us an indication of how aggressive the sampling strategy is, its stability (i.e., once mined, does it continue to mine the patterns?), and comparisons with the other strategy.
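In the absence of a cost metric, the log-likelihood-based termination mentioned above can be as simple as the following heuristic sketch (the window and tolerance are illustrative values of ours, not from the paper):

```python
def should_stop(loglik_history, window=5, tol=1e-2):
    """Stop when the estimated dataset log-likelihood has stabilized,
    i.e., its range over the last few rounds falls within a tolerance."""
    if len(loglik_history) < window:
        return False
    recent = loglik_history[-window:]
    return max(recent) - min(recent) < tol
```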

5 Experimental Results

We now present empirical results demonstrating the effectiveness of our active mining strategy on both synthetic and real datasets. We employed the Netlab suite of algorithms for GP modeling. Netlab supports a covariance formulation similar to Eq. 3.4, along with a bias term that overcomes our earlier assumption of zero mean. In addition, the model includes a noise term that can capture uncertainties in individual measurements; while this is not required for the deterministic functions considered here, it ensures that the numerical computation doesn't become unstable. All GP parameters are given a relatively broad Gaussian prior. A surrogate model was fit on a regularly spaced grid (more below), with a limit of 100 iterations for conjugate gradient search. The SAL parameters were set to (1.5, 0.75, 0.1), as before. The standard variance-based sampling has no adjustable parameters; a fixed 8-adjacency neighborhood was utilized for defining P_E(x) in entropy-based sampling. Optimization for Φ_V and Φ_E was conducted over the same grid as the domain of the surrogate function.

[Figure 8 panels: number of pockets mined and log likelihood versus rounds of data collection (20-140), for variance-based and entropy-based sampling.]

Figure 8: Pocket mining performance on the 7-pocket function from Fig. 7. (top) variance-based and (bottom) entropy-based sampling. (left) number of pockets found and (right) negative log-likelihood.

5.1 Synthetic Datasets: For the synthetic benchmark, we adopted the suite of test functions from [11], an ACM TOMS algorithm to readily generate classes of functions with known local and global minima. The algorithm systematically distorts a convex quadratic function with cubic or quintic polynomials to yield continuously differentiable (D-type) and twice continuously differentiable (D2-type) functions over the closed interval [−1, 1]^2. Since our active mining proceeds by discrete search over a pre-defined grid, we evaluated the generated functions over a regular 21 × 21 grid in [−1, 1]^2 (|G| = 441) and used these function values as the 'oracle' that is queried by the active mining mechanism. We verified whether in each instance the SAL miner is able to resolve all pockets when given a complete 21 × 21 dataset. This is necessary because the radii of the basins of attraction interact with the spacing of the sampling grid, and hence influence the number of samples available for aggregation by the SAL miner. We found that the pocket miner is able to resolve only those generated functions that have up to 7 local minima; functions with more (e.g., 8-12) local minima use only a handful of points (typically 3-9) to represent some of their pockets, too few to be aggregated into a flow class under the SAL miner's parameter settings. Hence, we pruned the automatically generated functions by requiring that each local minimum have at least 12 samples per pocket, when sampled over the 21 × 21 grid. This yields a collection of 43 functions (21 D-type and 22 D2-type), with numbers of pockets ranging from 4 to 7. Fig. 7 depicts some of these functions.


[Figure 7: four surface plots over [−1, 1]^2.]

Figure 7: Example test functions with 4, 5, 6, and 7 pockets (respectively). Note that the viewpoint chosen makes visible only some of the pockets in these functions.

Both algorithms were initially seeded with a 5 × 5 design, comprising 25 points (about 5% of the design space of 441 points). Sampling was conducted for an additional 100 sample values (a total of 125 points, or about 25% of the design space). We reasoned that this is a good interval over which to monitor the performance of the sampling strategies, as even a regularly spaced grid covering 25% of the design space would mine the pockets correctly! Fig. 8 reveals the results for the 7-pocket function of Fig. 7. Both sampling strategies systematically reduce the (negative) log likelihood (as estimated from the GP model parameters) but variance-based sampling shows more oscillatory behavior w.r.t. the number of pockets mined. On close inspection, we found that this strategy goes through stages where adjacent pockets are periodically re-grouped around sample values (which are mostly at the boundaries), causing rapid fluctuations in the SAL miner's output. We say that this strategy is more prone to 'being surprised.' The number of pockets stabilizes around 7 only toward the end of the data collection interval. In contrast, the entropy-based sampling first mines the seven pockets with 68 points, and proceeds to stabilize beyond this point. Similar results have been observed with other test functions.

Next, we analyzed the performance of both algorithms across all 43 test functions. We tested for what fraction of the datasets the mining was correct by, and stayed correct following, a given number of rounds of sampling. Our hypothesis was that the D2-type functions, being smoother, are more easily modeled using GPs and should lend themselves to more aggressive sampling strategies. In addition, the entropy-based sampling strategy should be more effective w.r.t. number of rounds than the variance-based sampling. Fig. 9 shows that this is indeed the case.

[Figure 9 panels: % correct versus # samples, for variance and entropy sampling.]

Figure 9: Overall pocket mining performance (fraction of cases correctly identified) with increasing number of samples, for (left) D-type and (right) D2-type functions.

5.2 Mining Wireless System Configuration Spaces: Our second application involves characterization of configuration spaces of wireless system designs (see again Fig. 1). The goal is to understand the joint influence of selected configuration parameters on system performance. This can be achieved by identifying spatial aggregates in the configuration space, aggregating low-level simulation data (typically multiple samples per configuration point) into regions of constrained shape. In particular, the setup in Fig. 1 is from a study designed to evaluate the performance of STTD (space-time transmit diversity) wireless systems, where the base station uses two transmitter antennas separated by a small distance, in an attempt to improve received signal strength. In this application, the aim is to assess how the power imbalance between the two branches impacts the performance (measured by bit error rate, BER) of the simulated system, across a range of signal-to-noise ratios (SNRs). When the signal components are significant compared to the noise components, and when the SNR ratios of the two branches are comparable, then it is well known that the system will yield high-quality BER performance. What is not so clear is how the performance will degrade as the SNRs move apart. Posed in the spatial aggregation framework, this objective translates into identifying and characterizing (in terms of width, or power imbalance) the pocket in the central portion of the configuration space. Identifying and characterizing other pockets is not as important, since some of them will actually contain suboptimal configurations.

We adopted an experimental methodology similar to that in the previous case studies, and created an 'oracle' from the simulation data described in [26].


[Figure 10: surface of log(BER) (−1 to −5) over SNR1 (dB) and SNR2 (dB), each 10-40.]

Figure 10: Estimates of BER performance in a space of wireless system configurations.

Fig. 10 demonstrates that the dataset is quite noisy, especially when the SNR values are low. The design of the oracle, surrogate model building, and sample selection all employ a 55 × 55 grid over the configuration space (SNR levels ranging from 3dB to 58dB for each antenna). Both variance-based sampling and entropy-based sampling were initialized using an 11 × 11 design (about 4% of the configuration space). Sampling was conducted for an additional 650 points, yielding a total of 771 points (25% of the design space, as with the earlier studies). For each round of active mining, we determined the majority class occupied by points having equal SNR and determined the maximum width of this class. This measure was periodically tracked across the rounds of data collection. Fig. 11 shows how the sampling strategies fare compared to the correct estimate of 12dB, as reported in [26] by applying a spatial data mining algorithm over the entire dataset. Entropy-based sampling once again selects data that systematically clarify the nature of the pockets, and causes a progressive widening of the trough in the middle. However, it doesn't mine the ideal width of 12dB (within the given samples). We reason that this is because the GP model has difficulty approximating the steep edge of the basin. Variance-based sampling fares worse and demonstrates a slower growth of width across samples. This application highlights the utility of our framework for mining both qualitative and quantitative properties of spatial aggregates.
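The width measure is only loosely specified above; one plausible reading, sketched below with hypothetical helper names, takes the majority class among the equal-SNR (diagonal) cells of the mined class grid and reports its widest contiguous extent across the diagonal, converted to dB.

```python
import numpy as np

def majority_class_width(labels, db_per_cell):
    """Max contiguous width (in dB) across the diagonal of the majority
    class among equal-SNR cells; `labels` is the mined class grid."""
    n = labels.shape[0]
    diag = np.array([labels[i, i] for i in range(n)])
    vals, counts = np.unique(diag, return_counts=True)
    major = vals[np.argmax(counts)]
    widest = 0
    for i in range(n):
        if labels[i, i] != major:
            continue
        off = 1                              # grow symmetrically off-diagonal
        while (i + off < n and i - off >= 0
               and labels[i + off, i - off] == major
               and labels[i - off, i + off] == major):
            off += 1
        widest = max(widest, 2 * off - 1)
    return widest * db_per_cell
```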

6 Discussion

This paper has presented a novel integration of approaches from three areas, namely spatial structure discovery, probabilistic modeling using GPs, and active data mining. The spatial aggregation language provides a methodology for identifying multi-level structures in field data, Gaussian processes provide a probabilistic basis for reasoning about uncertainty in field data, and active data mining closes the loop to optimize new samples for uncertainty in field data as well as information content relevant to high-level structures. Entropy-based sampling is suitable whenever we can define information-theoretic functionals over spatial aggregates. In this paper, we have primarily focused on characterizing region boundaries, but this same strategy is applicable to any case where we expect locations with spatial proximity but informational impurity, e.g., identifying breaks and fissures in volumetric data, picking outliers from geographical maps, and detecting violations of coherence in spatio-temporal datasets.

[Figure 11: region width versus # samples (200-800) for variance and entropy sampling, with the ideal width marked.]

Figure 11: Performance of active mining strategies on wireless simulation data, characterizing width of the main pocket in Fig. 10 with increasing numbers of samples.

There are several extensions to the work presented here. First, our assumption of sampling over a defined grid can be relaxed and the scope of active mining can be expanded to include subsampling. Second, the modeling of vector fields using GPs warrants further investigation, in particular to address the issue of how to model data fields given only (or also) derivative information or when the underlying function is not smooth or differentiable. Other investigators have done related work in this area [6]. Third, we assume here that the model (of flow classes) posited by SAL is correct, and use this information to drive the sampling. To overcome this assumption, we must create a probabilistic model of SAL's computations (including uncertainty and non-determinism in aggregation procedures) and integrate this model with the GP model for the data fields. Instantiating SAL to popular spatial mining algorithms investigated in the data mining community (e.g. [15, 20]) and applying them in an active mining context is a final direction we are pursuing. These and similar ideas will help establish the many ways in which mathematical models of data approximation can be integrated with data mining algorithms.


Acknowledgements

This work is supported in part by US NSF grants IIS-0237654, EIA-9984317, and IBN-0219332. The wireless simulation dataset is courtesy of Alex Verstak.

References

[1] C. Bailey-Kellogg and N. Ramakrishnan. Ambiguity-Directed Sampling for Qualitative Analysis of Sparse Data from Spatially Distributed Physical Systems. In Proc. IJCAI, pages 43-50, 2001.

[2] C. Bailey-Kellogg and F. Zhao. Influence-Based Model Decomposition for Reasoning about Spatially Distributed Physical Systems. Artificial Intelligence, Vol. 130(2):pages 125-166, 2001.

[3] C. Bailey-Kellogg, F. Zhao, and K. Yip. Spatial Aggregation: Language and Applications. In Proc. AAAI, pages 517-522, 1996.

[4] K. Brinker. Incorporating Diversity in Active Learning with Support Vector Machines. In Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), pages 59-66, 2003.

[5] D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, Vol. 4:pages 129-145, 1996.

[6] D. Cornford, I.T. Nabney, and C.K.I. Williams. Adding Constrained Discontinuities to Gaussian Process Models of Wind Fields. In Proceedings of NIPS, pages 861-867, 1998.

[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[8] C. Currin, T. Mitchell, M. Morris, and D. Ylvisaker. Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments. J. Amer. Stat. Assoc., Vol. 86:pages 953-963, 1991.

[9] R.G. Easterling. Comment on 'Design and Analysis of Computer Experiments'. Statistical Science, Vol. 4(4):pages 425-427, 1989.

[10] J. Garcke and M. Griebel. Data Mining with Sparse Grids using Simplicial Basis Functions. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 87-96, 2001.

[11] M. Gaviano, D.E. Kvasov, D. Lera, and Y.D. Sergeyev. Algorithm 829: Software for Generation of Classes of Test Functions with Known Local and Global Minima for Global Optimization. ACM Transactions on Mathematical Software, Vol. 29(4):pages 469-480, Dec 2003.

[12] X. Huang and F. Zhao. Relation-Based Aggregation: Finding Objects in Large Spatial Datasets. In Proceedings of the 3rd International Symposium on Intelligent Data Analysis, 1999.

[13] J.-N. Hwang, J.J. Choi, S. Oh, and R.J. Marks II. Query-based Learning Applied to Partially Trained Multilayer Perceptrons. IEEE Transactions on Neural Networks, Vol. 2(1):pages 131-136, 1991.

[14] A.G. Journel and C.J. Huijbregts. Mining Geostatistics. Academic Press, New York, 1992.

[15] G. Karypis, E.-H. Han, and V. Kumar. Chameleon: Hierarchical Clustering using Dynamic Modeling. IEEE Computer, Vol. 32(8):pages 68-75, 1999.

[16] J. Koehler and A. Owen. Computer Experiments. In S. Ghosh and C. Rao, editors, Handbook of Statistics: Design and Analysis of Experiments, pages 261-308. North Holland, 1996.

[17] D.J. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, Vol. 4(4):pages 590-604, 1992.

[18] I.T. Nabney. Netlab: Algorithms for Pattern Recognition. Springer-Verlag, 2002.

[19] R.M. Neal. Monte Carlo Implementations of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Department of Statistics, University of Toronto, Jan 1997.

[20] R.T. Ng and J. Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, Vol. 14(5):pages 1003-1016, 2002.

[21] I. Ordonez and F. Zhao. STA: Spatio-Temporal Aggregation with Applications to Analysis of Diffusion-Reaction Phenomena. In Proc. AAAI, pages 517-523, 2000.

[22] N. Ramakrishnan and C. Bailey-Kellogg. Sampling Strategies for Mining in Data-Scarce Domains. IEEE/AIP CiSE, Vol. 4(4):pages 31-43, 2002.

[23] N. Ramakrishnan and C. Bailey-Kellogg. Gaussian Process Models of Spatial Aggregation Algorithms. In Proc. IJCAI, pages 1045-1051, 2003.

[24] J. Sacks, W.J. Welch, T.J. Mitchell, and H.P. Wynn. Design and Analysis of Computer Experiments. Statistical Science, Vol. 4(4):pages 409-435, 1989.

[25] S. Tong and D. Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, Vol. 2:pages 45-66, 2001.

[26] A. Verstak et al. Using Hierarchical Data Mining to Characterize Performance of Wireless System Configurations. Technical Report cs.CE/0208040, CoRR, Aug 2002.

[27] C.K.I. Williams. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. In M.I. Jordan, editor, Learning in Graphical Models, pages 599-621. MIT Press, Cambridge, MA, 1998.

[28] K.M. Yip and F. Zhao. Spatial Aggregation: Theory and Applications. JAIR, Vol. 5:pages 1-26, 1996.

[29] K.M. Yip, F. Zhao, and E. Sacks. Imagistic Reasoning. ACM Computing Surveys, Vol. 27(3):pages 363-365, 1995.

[30] F. Zhao, C. Bailey-Kellogg, and M.P.J. Fromherz. Physics-Based Encapsulation in Embedded Software for Distributed Sensing and Control Applications. Proceedings of the IEEE, Vol. 91:pages 40-63, 2003.