
Computational Statistics & Data Analysis 50 (2006) 760–774
www.elsevier.com/locate/csda

Nonparametric density estimation and clustering in astronomical sky surveys

Woncheol Jang∗
Institute of Statistics and Decision Sciences, Old Chemistry Building 223B, Duke University, Durham, NC 27708, USA

Received 5 April 2004; received in revised form 28 September 2004; accepted 1 October 2004
Available online 28 October 2004

Abstract

We present a nonparametric method for galaxy clustering in astronomical sky surveys. We show that the cosmological definition of clusters of galaxies is equivalent to density contour clusters (Hartigan, 1975), $S_c = \{f > c\}$, where $f$ is a probability density function. The plug-in estimator $\hat{S}_c = \{\hat{f} > c\}$ is used to estimate $S_c$, where $\hat{f}$ is the multivariate kernel density estimator. To choose the optimal smoothing parameter, we use cross-validation and the plug-in method, and show that cross-validation outperforms the plug-in method in our case. A new cluster catalog (a database of the locations of clusters) based on the plug-in estimator is compared to two existing cluster catalogs, the Abell catalog and the Edinburgh/Durham Cluster Catalog I (EDCCI). Our result is more consistent with the EDCCI than with the Abell catalog, which is the most widely used catalog. We use the smoothed bootstrap to assess the validity of the clustering results.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Galaxy clustering; Density contour cluster; Plug-in estimator; Cross-validation; Smoothed bootstrap

1. Introduction

It is widely assumed that the universe is homogeneous and isotropic, which means that it looks the same in all directions and that matter is distributed evenly throughout space. This is a key assumption in modern cosmology. There is evidence that supports this assumption on large scales, but on smaller scales one can find significant deviations from homogeneity and isotropy, such as walls, filaments, and clusters of galaxies (Martínez and Saar, 2002).

∗ Tel.: +919 6843437; fax: +412 6848594. E-mail address: [email protected].

0167-9473/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2004.10.001

Knowing how the universe has evolved would help explain the discrepancy in structure at different scales. According to the classic scenario, the universe has been expanding since the Big Bang. Because of small fluctuations that were present at early epochs, matter in the universe has clumped and clustered since then. Clusters of galaxies play an important role in finding where the local structure fades away into a homogeneous and isotropic distribution. However, the availability of complete, accurate cluster catalogs for such studies is very limited.

The Abell catalog has been one of the most widely used catalogs for cosmological research. It was published by George Abell in 1958 and covers the whole northern hemisphere. A catalog of the southern hemisphere of the sky was completed by Abell and his colleagues in 1989. It contains 4073 clusters over the entire sky. Much cosmological research has been done based on the Abell catalog. However, the Abell catalog was constructed by a visual scan of photographic material; thus it is subjective and not consistent with other catalogs. Cosmologists have suspected that the inconsistency of the Abell catalog would undermine scientific results based on it, and they have started to build new, objective and accurate large-area cluster catalogs that would eventually replace it. Thus, large-area sky surveys using new technologies (8-m optical telescopes and new X-ray and microwave satellites) are currently planned or underway.

The power of modern technology is opening a new era of massive astronomical data that is beyond the capabilities of traditional methods for galaxy clustering. The advent of new massive sky surveys brings statisticians and cosmologists together, and the explosion of data in cosmology is both a blessing and a curse to statisticians. It is a blessing because the amount of data makes nonparametric statistical methods very effective. It is a curse because the same amount of data limits the use of nonparametric statistical methods without efficient data management. Therefore, efficient automated clustering algorithms will be critical to making use of nonparametric statistical methods in cosmology.

There has been great progress in the development of new automated clustering algorithms in cosmology. The first generation of automated clustering algorithms are simple variants of the peak-finding algorithm (Lumsden et al., 1992). In recent years, several new, more sophisticated algorithms have become available, including the adaptive matched-filter (AMF) algorithm (Kepner et al., 1999). The idea behind the AMF algorithm is to identify clusters by finding peaks in a kind of likelihood map. To generate the map, a filter based on a model of the distribution of galaxies is matched with the data, and contours are produced based on the model. To find clusters, one chooses a threshold given by cosmological theory; all the data below this threshold are then removed, and what remains is regarded as clusters. The AMF algorithm is very complicated and not rigorously justified. Moreover, the size of the filter (the smoothing parameter) is chosen arbitrarily.

The goal of this paper is to produce a catalog based on nonparametric methods and compare the catalog with the Abell and EDCCI, the first catalog based on the AMF algorithm.

This paper is organized as follows. Section 1 is a brief introduction. In Section 2, we explain how to model the spatial locations of galaxy clusters. Then the data used in our analysis, the Edinburgh–Durham Southern Galaxy Catalog (EDSGC), are described in Section 3. Section 4 outlines nonparametric methods. Section 5 summarizes the results. Finally, we discuss scientific contributions of our results and address possible extensions in Section 6.

2. Stochastic model of the galaxy distribution

Let $Y_1, \ldots, Y_n$ be the positions of the galaxies in a region $C$. We assume that $Y_i$ is a realization of a Poisson process with the intensity measure

$$\Lambda_t(C) = \int_C \lambda(y)\,dy,$$

the mean number of galaxies inside $C$ at time $t$. Here $\lambda(y)$ is the intensity function.

Let $\rho_t(y)$ be the mass density function of objects such as galaxies at time $t$, i.e.,

$$\int_A \rho_t(y)\,dy \equiv \text{total mass in a region } A.$$

Cosmologists assume the mean number of galaxies inside a region $C$ is directly proportional to the total mass inside the region. Hence the intensity measure $\Lambda_t(C)$ satisfies

$$\Lambda_t(C) \propto \int_C \rho_t(y)\,dy.$$

Define the overdensity

$$\delta_t(y) \equiv \frac{\rho_t(y) - \bar{\rho}}{\bar{\rho}},$$

where $\bar{\rho} \equiv \int_U \rho(y)\,dy \big/ \int_U dy$.

The overdensity is a kind of normalized mass density function and is used as a scale-free reference threshold. For example, it is believed that galaxies form at overdensity $\approx 1$.

Cosmological theory also assumes that the overdensity $\delta_0(x)$ is a realization of a Gaussian process and has evolved, at time $t$, to $\delta_t = H(\delta_0, t)$. Here, $H(\cdot, \cdot)$ is a complicated nonlinear function.

The early universe was very hot and dense, but began to cool down due to the expansion after the Big Bang. As a result, small fluctuations began to exist due to temperature differences. The fluctuations, large localized overdensities, eventually collapsed to form virialized objects, or self-gravitating objects, such as clusters due to gravitational instability. After overdensities are virialized, the universe is in a state of stability or balance. In other words, the universe is in some form of dynamic equilibrium. To reach dynamic equilibrium, the mass of a virialized object must satisfy the following geometric condition:

$$C = \{y \mid \delta_t(y) > \delta_c\}, \qquad (1)$$

where $\delta_c$ is the present day ($t = 0$) overdensity that has collapsed and virialized at time $t$ and can be expressed as a function of unknown cosmological parameters. See Reichart et al. (1999) for the details.

Since the mass density function $\rho_t$ is modeled as a random process, the Poisson model results in a doubly stochastic process, i.e., the Cox process.


From condition (1), it is clear that one must estimate the intensity function to understand the spatial locations of clusters of galaxies. In fact, similar problems can be found in the spatial statistics literature (Diggle, 1985; Cressie, 1991), where the kernel density estimator was used to estimate the intensity function. Since the mass density function $\rho_t$ can be considered as a probability density function, the cosmological definition of clusters is indeed the same as Hartigan's density contour clusters (Hartigan, 1975): the connected components of level sets. In other words, galaxy clustering is equivalent to level set estimation.

A naive estimator for the level set is the plug-in estimator $\hat{S}_c \equiv \{\hat{f} > c\}$, where $\hat{f}$ is a nonparametric density estimator. The consistency of the plug-in estimator with the kernel density estimator was proved by Cuevas and Fraiman (1997) in terms of set metrics such as the symmetric difference metric $d_\mu$ and the Hausdorff metric $d_H$,

$$d_\mu(T, S) \equiv \mu(T \,\Delta\, S), \qquad d_H(T, S) \equiv \inf\{\epsilon > 0 : T \subset S^{\epsilon},\ S \subset T^{\epsilon}\},$$

where $\Delta$ is the symmetric difference, $\mu$ is Lebesgue measure and $S^{\epsilon}$ is the union of all open balls with radius $\epsilon$ around points of $S$.
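To make the two set metrics concrete, the following minimal Python sketch evaluates them for sets represented on a grid. It is an illustration only: the function names are hypothetical, the Lebesgue measure is approximated by counting equal-area grid cells, and the Hausdorff distance is computed for finite point sets, for which the infimum in the definition reduces to a max-min of pairwise distances.

```python
import math

def symmetric_difference_measure(cells_t, cells_s, cell_area):
    """Approximate mu(T delta S) when T and S are unions of equal-area grid cells."""
    return len(set(cells_t) ^ set(cells_s)) * cell_area

def hausdorff_distance(points_t, points_s):
    """Hausdorff distance between two finite, non-empty point sets."""
    def directed(a, b):
        # Largest distance from a point of a to its nearest point of b.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(points_t, points_s), directed(points_s, points_t))

# Two small sets that share two of their three grid cells.
T = [(0, 0), (1, 0), (1, 1)]
S = [(1, 0), (1, 1), (2, 1)]
d_mu = symmetric_difference_measure(T, S, cell_area=1.0)
d_h = hausdorff_distance(T, S)
```

Here the two sets differ in two unit cells, and every point of one set lies within distance one of the other set.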

Therefore, we can use the plug-in estimator to estimate a level set given density estimates and use the connected components of level set estimates as clusters of galaxies.

3. Astronomical sky survey data

The source of galaxy data used in our analysis is the EDSGC, which consists of survey plates scanned with the Edinburgh plate measuring machine COSMOS.

The equatorial coordinate system is used to list the locations of objects in the catalog. Each object is listed with its right ascension (RA) and declination (DEC), the analogues of longitude and latitude with respect to the Earth. The right ascension takes values from 0 to 360°, expressed in hours, minutes and seconds (24 h = 360°), while the declination takes values from 0 to 90° for objects in the northern hemisphere and from 0 to −90° for objects in the southern hemisphere. The EDSGC extends from 22 to 3 h in RA (through 0 h) and from −23 to −42° in DEC. A 10 × 10° subset of the EDSGC is used in our data analysis; it extends from 0 h to 0 h 40 min and from −28 to −38°.
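As a quick check of the coordinate arithmetic, an RA range of 0 h to 0 h 40 min spans 10° of right ascension because 24 h corresponds to 360°. A minimal Python sketch of the unit conversion follows; the function name is hypothetical and not part of the catalog software.

```python
def ra_to_degrees(hours, minutes=0.0, seconds=0.0):
    """Convert right ascension in hours, minutes, seconds to degrees (24 h = 360 degrees)."""
    total_hours = hours + minutes / 60.0 + seconds / 3600.0
    return total_hours * 15.0  # 360 / 24 = 15 degrees per hour

# RA extent of the subset used in the analysis: 0 h to 0 h 40 min.
width_deg = ra_to_degrees(0, 40) - ra_to_degrees(0, 0)
```

The resulting width is 10° up to floating-point rounding.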

The catalog contains $1.5 \times 10^6$ galaxies and, because of its size, a 10° subset of the EDSGC is used in our data analysis. The origin of the subset is 0 h and −28°. It contains 41171 galaxies, and each galaxy has 7 attributes.

Fig. 1 shows the 10° subset of the EDSGC. According to the Abell catalog, there are 43 clusters in that area, while the EDCCI, the first catalog based on the AMF algorithm, shows 42 clusters (Lumsden et al., 1992).

Fig. 2 shows the Abell and EDCCI catalogs. The points in the Abell and EDCCI catalogs represent the locations of clusters. Both of them find clusters at the upper left corner and the lower right corner, which are obvious from the data. However, there is inconsistency between the two catalogs. A cluster on the upper right side in the Abell catalog is not found in the EDCCI, and the cluster patterns do not look similar. When the two catalogs are overlaid, roughly half of the clusters are located within what cosmologists consider a reasonable Euclidean distance of each other.


Fig. 1. 10 × 10° subset of EDSGC. (Axes: X (Degree) and Y (Degree), each from 1 to 10.)

Fig. 2. Abell and EDCCI catalog.

4. Methodology

Suppose $Y_1, \ldots, Y_n$ are independent observations from a distribution with density $f$, where $Y_i = (Y_{i1}, \ldots, Y_{id})$ is a $d$-dimensional vector. In our case, $Y_i$ is the location of the $i$th galaxy with $d = 2$. We define clusters as connected components of the plug-in estimator. Hence one must first estimate the density $f$. We use the kernel density estimator for the estimation of $f$.

4.1. Multivariate kernel density estimation

The general $d$-dimensional kernel estimator is given by

$$\hat{f}(y) = \frac{1}{n|H|^{1/2}} \sum_{i=1}^{n} K\left(H^{-1/2}(y - Y_i)\right),$$

where the bandwidth matrix $H$ is a symmetric positive definite $d \times d$ matrix and $K$ is a bounded, compactly supported $d$-variate kernel satisfying

$$\int K(y)\,dy = 1, \qquad \int yK(y)\,dy = 0, \qquad \text{and} \qquad \int yy^{T}K(y)\,dy = \mu_2(K)I.$$

Here $\mu_2(K) = \int y_i^2 K(y)\,dy$ is independent of $i$ (Wand and Jones, 1995).

We assume that the contours of the kernel are ellipsoidal and that the axes of the ellipsoids are aligned with the coordinate axes. In other words, we assume the bandwidth matrix is diagonal:

$$H \in \mathcal{H}_2 = \left\{\mathrm{diag}\left(h_1^2, \ldots, h_d^2\right) : h_1, \ldots, h_d > 0\right\}.$$

Then, the density estimator is given by

$$\hat{f}(y) = \frac{1}{nh_1 \cdots h_d} \sum_{i=1}^{n} K\left(\frac{y_1 - Y_{i1}}{h_1}, \ldots, \frac{y_d - Y_{id}}{h_d}\right).$$

However, choosing the bandwidths is then a $d$-dimensional optimization problem, which could be computationally very expensive.

We often assume a simpler form of the bandwidth matrix. Suppose that the contours of the kernel are spherically symmetric. Then the bandwidth matrix belongs to the class

$$\mathcal{H}_1 = \left\{h^2 I : h > 0\right\}.$$

Another simple class of bandwidth matrices is

$$\mathcal{C} = \left\{h^2 \cdot \mathrm{diag}\left(\sigma_1^2, \ldots, \sigma_d^2\right)\right\},$$

where $\sigma_i^2$ is the variance of the $i$th coordinate variable. As pointed out in Wand and Jones (1993), this scaling approach is not always better than using $H \in \mathcal{H}_1$. In our case, the variances of the coordinates are almost the same because the universe is almost homogeneous. Therefore, there was little difference between using $H \in \mathcal{C}$ and using $H \in \mathcal{H}_1$ in our case.

4.2. Binned kernel estimation

To make nonparametric density estimation effective with huge datasets, one must provide efficient algorithms to calculate the density estimates. The original formula requires $O(n^2)$ evaluations to calculate density estimates at every data point, which can easily be a daunting task even with data of moderate size.

The binned kernel estimator is an appealing approximation to the kernel estimator that can be computed with the fast Fourier transform (FFT).

Suppose that $K$ is a symmetric kernel and define bins by an equally spaced mesh of points over the support. If the support is infinite, one can replace it with an "effective support", outside of which the kernel is negligible. For example, we can use $[-4, 4]$ as the effective support of the standard normal distribution.

Let $n_j$ be the number of data points in bin $B_j$, which is centered at $t_j$, for $j = 1, \ldots, m$. Then the binned kernel estimator is given by

$$\hat{f}(y) = \frac{1}{n} \sum_{j=1}^{m} \frac{n_j}{|H|^{1/2}} K\left(H^{-1/2}(y - t_j)\right).$$

The binned kernel estimator only requires $O(m)$ evaluations to calculate a density estimate. Furthermore, with the FFT, it only requires $O(m \log m)$ operations to evaluate density estimates at every grid point.

The rule of thumb for the number of grid points is between 100 and 500, and the approximation improves as $m$ increases (Wand and Jones, 1995).

For higher dimensions, one may consider the Weighted Averaging of Rounded Points (WARPing) of Härdle and Scott (1992) to reduce computational complexity. The costs of WARPing and the FFT are quite similar for the bivariate case, since both are based on equally spaced grid points.

4.3. Bandwidth selection

Cosmologists have paid more attention to the shape of the filter (kernel) than to the size of the filter (bandwidth). However, it is well known in statistics that the choice of bandwidth, not the choice of kernel, is crucial in density estimation.

Define the risk function

$$R(f, \hat{f}) = E[\mathrm{ISE}(H)].$$

Here $\mathrm{ISE}(H)$ is the integrated squared error (ISE), which is given by

$$\mathrm{ISE}(H) = \int \left(\hat{f}(y; H) - f(y)\right)^2 dy = R(\hat{f}) - 2\int \hat{f}(y; H) f(y)\,dy + C,$$

where $C = \int (f(y))^2\,dy$ does not depend on $H$ and $R(\hat{f}) = \int \left(\hat{f}(y; H)\right)^2 dy$.

In terms of minimizing the risk function, the choice of the optimal bandwidth is far more important than the choice of kernel.

To choose the optimal density estimator (optimal bandwidth) is to find the estimator which minimizes the risk function. Therefore, finding the optimal bandwidth is equivalent to finding the bandwidth matrix which minimizes the first two terms of the ISE. The idea of the cross-validation method is to find the bandwidth matrix which minimizes an unbiased estimator of the first two terms using leave-one-out estimates.

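Define

$$\mathrm{CV}(H) = \int \hat{f}(y; H)^2\,dy - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(Y_i; H),$$

where

$$\hat{f}_{-i}(y; H) = \frac{1}{(n - 1)h_1 \cdots h_d} \sum_{j \neq i} K\left(H^{-1/2}(y - Y_j)\right)$$

is the leave-one-out estimator computed without the $i$th observation.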

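Assuming $K$ is the multivariate standard normal density, $\mathrm{CV}(H)$ is given by

$$\mathrm{CV}(h_1, \ldots, h_d) = \frac{1}{(2\sqrt{\pi})^d\, n h_1 \cdots h_d} + \frac{1}{(2\sqrt{\pi})^d\, n^2 h_1 \cdots h_d}\,\Omega,$$

where

$$\Omega = \sum_{i \neq j} \left[\exp\left\{-\frac{1}{4} \sum_{k=1}^{d} \left(\frac{y_{ik} - y_{jk}}{h_k}\right)^2\right\} - 2^{(d+2)/2} \exp\left\{-\frac{1}{2} \sum_{k=1}^{d} \left(\frac{y_{ik} - y_{jk}}{h_k}\right)^2\right\}\right]$$

and $y_{ik}$ is the $k$th coordinate of the $i$th observation. See Sain et al. (1994) for the details.

For $H \in \mathcal{C}$,

$$\mathrm{CV}(h) = \frac{1}{(2\sqrt{\pi}\,h)^d\, n \sigma_1 \cdots \sigma_d} + \frac{1}{(2\sqrt{\pi}\,h)^d\, n^2 \sigma_1 \cdots \sigma_d}\,\Omega^{*},$$

where

$$\Omega^{*} = \sum_{i \neq j} \left[\exp\left\{-\frac{1}{4h^2} \sum_{k=1}^{d} \left(\frac{y_{ik} - y_{jk}}{\sigma_k}\right)^2\right\} - 2^{(d+2)/2} \exp\left\{-\frac{1}{2h^2} \sum_{k=1}^{d} \left(\frac{y_{ik} - y_{jk}}{\sigma_k}\right)^2\right\}\right].$$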
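A closed-form criterion of this kind is straightforward to evaluate directly. The following Python sketch is an illustration under stated assumptions, not the R/Fortran program used in the paper: the function names are hypothetical, the kernel is the product standard normal, the convolution term uses $\exp(-\tfrac{1}{4}\,\cdot)$ and the leave-one-out term uses $\exp(-\tfrac{1}{2}\,\cdot)$, and $n - 1$ is approximated by $n$ in the leave-one-out term. A crude grid search stands in for a numerical optimizer.

```python
import math

def cv_gaussian(data, bandwidths):
    """Closed-form cross-validation score for a product Gaussian kernel.

    data: list of d-dimensional points; bandwidths: (h_1, ..., h_d).
    Uses the large-sample approximation n - 1 ~ n in the leave-one-out term.
    """
    n, d = len(data), len(bandwidths)
    const = (2.0 * math.sqrt(math.pi)) ** d * math.prod(bandwidths)
    omega = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sq = sum(((a - b) / h) ** 2
                     for a, b, h in zip(data[i], data[j], bandwidths))
            omega += math.exp(-0.25 * sq) - 2.0 ** ((d + 2) / 2.0) * math.exp(-0.5 * sq)
    return 1.0 / (const * n) + omega / (const * n * n)

def best_common_bandwidth(data, candidates):
    """Return the candidate h (shared by all coordinates) with the smallest CV score."""
    d = len(data[0])
    return min(candidates, key=lambda h: cv_gaussian(data, (h,) * d))

# Example on a handful of hypothetical positions.
positions = [(0.10, 0.20), (0.12, 0.21), (0.11, 0.19), (0.60, 0.70), (0.62, 0.71)]
h_cv = best_common_bandwidth(positions, [0.01, 0.02, 0.05, 0.1, 0.2])
```

The double loop costs $O(n^2 d)$ per bandwidth, which is why a compiled subroutine or binning becomes attractive for catalogs with tens of thousands of galaxies.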

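Whereas the cross-validation method takes a direct approach by seeking the minimum of the ISE, the plug-in method uses an asymptotic expansion of the MISE, the expectation of the ISE.

Assuming $H \in \mathcal{C}$,

$$\mathrm{MISE}(h) = E\int \left(\hat{f}(y; H) - f(y)\right)^2 dy \approx \frac{1}{4} h^4 \sigma_1^4 \cdots \sigma_d^4\, a_0^2 \int \{\nabla f(y)\}^2\,dy + \frac{a_1}{nh^d \sigma_1 \cdots \sigma_d},$$

where $\nabla f(y) = \sum_{i=1}^{d} \left(\partial^2 / \partial y_i^2\right) f(y)$. Here $a_0^2 = \int y^2 K(y)\,dy$ and $a_1 = \int K^2(y)\,dy$.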


Then, one can find the optimal bandwidth by numerical methods such as the Newton–Raphson method. However, the solution still depends on the unknown functional $\nabla f$, which we need to estimate. Wand and Jones (1995) proposed a $d$-stage estimator to address the issue.

4.4. Assessment of variability

To address the reliability of other catalogs based on our result, we first want to assess the variability of our kernel estimates. The bootstrap is a compelling method to assess the variability of an estimate when the distribution of the estimator is intractable. The key idea of the bootstrap is to use $\theta(F_n)$ to estimate a functional $\theta(F)$, where $F$ is the cumulative distribution function and $F_n$ is the empirical cumulative distribution function. Since $F_n$ is discrete, in some situations a smooth estimate of $F$ might be better. Silverman and Young (1987) showed when the smoothed bootstrap works better.

The smoothed bootstrap is usually used where the effects of discreteness cause serious problems, such as when estimating a density or a sample median. For example, Silverman (1985) used a smoothed bootstrap test for multimodality.

The main idea of the smoothed bootstrap is to resample $y^{*}$ from the kernel estimate $\hat{f}$ instead of the raw data. The resampling steps are as follows:

Step 1. Choose integers $I_1, \ldots, I_n$ with equal probability from $1, \ldots, n$.
Step 2. Generate a random variable $z_i$ from the kernel $K$ for $i = 1, \ldots, n$.
Step 3. Let $y^{*}_i = y_{I_i} + h \cdot z_i$ for $i = 1, \ldots, n$.
Step 4. Repeat steps 1–3 $N$ times.
Step 5. Construct the bootstrap estimate $\hat{f}^{*}_j$ based on each resample $y^{*j} = \left(y^{*j}_1, \ldots, y^{*j}_n\right)$ for $j = 1, \ldots, N$.
Step 6. Find clusters defined by $\{y : \hat{f}^{*}_j(y) > c\}$ in the $j$th resample for $j = 1, \ldots, N$, where $c$ is a threshold.

If $h = 0$, the smoothed bootstrap estimate is the same as the naive bootstrap estimate.

Once we have the smoothed bootstrap estimates, we are able to find a cluster map from each estimate. The frequency of appearance of a cluster within a reasonable range from the same position is a measure of consistency of the cluster in that position. In other words, the more often we observe a cluster within a range from the same position across the bootstrap estimates, the more confidence we have that the cluster is not due to noise.
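The resampling part of the procedure (Steps 1 to 4) can be sketched as follows. This is a minimal Python illustration rather than the code used for the paper: the function name and the toy data are hypothetical, and the noise $z_i$ is assumed to come from a standard normal kernel with mean zero, scaled by one bandwidth per coordinate.

```python
import random

def smoothed_bootstrap_resample(data, bandwidths, rng):
    """Draw one smoothed bootstrap resample: y*_i = y_{I_i} + h * z_i with z_i ~ N(0, I)."""
    n = len(data)
    resample = []
    for _ in range(n):
        base = data[rng.randrange(n)]  # Step 1: index I_i drawn uniformly from the data
        # Steps 2-3: add kernel noise scaled by the bandwidth of each coordinate.
        point = tuple(b + h * rng.gauss(0.0, 1.0) for b, h in zip(base, bandwidths))
        resample.append(point)
    return resample

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
data = [(0.1, 0.2), (0.4, 0.4), (0.9, 0.7)]
# Step 4: repeat N = 20 times.
resamples = [smoothed_bootstrap_resample(data, (0.05, 0.05), rng) for _ in range(20)]
```

With all bandwidths set to zero the function returns points of the original sample only, which matches the remark that $h = 0$ gives the naive bootstrap. Steps 5 and 6 would then apply the density estimator and the level set rule to each resample.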

5. Results

The optimal bandwidth was selected by cross-validation and by the plug-in method. The contour plots of density estimates and the two-dimensional plug-in bandwidth selection were implemented with the R library "KernSmooth" developed by Matt Wand.

To calculate CV(h), R with a Fortran subroutine was used, and the minimum of CV(h) was found by the "nlm" function in R. The program for the cross-validation method is available upon request.


Fig. 3. Contour plot by plug-in and cross-validation.

Fig. 4. Density estimates by plug-in and cross-validation. (Axes: X, Y and Z.)

The optimal bandwidth matrices by the plug-in method and cross-validation are diag(0.005046, 0.005065) and diag(0.000734, 0.000701), respectively. Figs. 3 and 4 show contour plots and density estimates by the plug-in method and cross-validation.

As pointed out by Loader (1999), the "plug-in" method often oversmooths. Finding clusters is equivalent to finding sharp and high peaks, which requires the smaller bandwidth. To find clusters from the plug-in estimator $\{\hat{f} > c\}$, one must determine the threshold $c$.

While determining the exact threshold from cosmological considerations is an ongoing research problem, contemporary cosmological theory provides a narrow range of possible threshold values. Within this range, we first compare the plug-in method with cross-validation. Fig. 4 gives a snapshot of density contours by both methods given a threshold within the range. It is clear that we can find at most a dozen clusters by the plug-in method, while cross-validation captures the features of the data well within the range.

Neither the Abell nor the EDCCI catalog was built based on the threshold from cosmological theory. Hence, we use an ad hoc method to choose a threshold to compare our catalog with them. Specifically, we choose a threshold within the range such that the cross-validation method finds 42–43 clusters, the number of clusters in the Abell and EDCCI catalogs.

Fig. 5. Density contours by plug-in and cross-validation.

To convert density contours into clusters, we first find the subset of equally spaced grid points belonging to $\{\hat{f} > c\}$ and find clusters by agglomerating the grid points. In other words, we use $\bigcup_{i=1}^{k_n} B(t_i, r)$ to approximate $\{\hat{f} > c\}$, where $B(t_i, r)$ is a closed ball centered at $t_i$ with radius $r$. Here the $t_i$ are equally spaced grid points such that $\hat{f}(t_i) > c$, and $r$ is half of the grid size. If two balls are adjacent to each other, we consider them to belong to the same cluster. If a density contour encloses a region with more than one peak, we consider it as one cluster as long as the peaks lie in the same connected component of the density contour. See Jang (2004) for the details.
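The agglomeration of grid points can be sketched as a connected-component search. The Python code below is an illustration only; the function name is hypothetical and the details of the author's implementation are in Jang (2004). It assumes that two balls count as adjacent when their grid points are horizontal or vertical neighbours, since closed balls of radius half the grid size touch exactly in that case, while diagonal neighbours do not.

```python
from collections import deque

def grid_clusters(density, threshold):
    """Group grid points with density > threshold into clusters of adjacent points.

    density: 2-D list of density estimates on an equally spaced grid.
    Returns a list of clusters, each a list of (row, col) indices.
    Adjacency means sharing an edge (4-connectivity), an assumption of this sketch.
    """
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or density[r][c] <= threshold:
                continue
            # Breadth-first search over neighbouring grid points above the threshold.
            seen[r][c] = True
            queue, members = deque([(r, c)]), []
            while queue:
                i, j = queue.popleft()
                members.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if inside and not seen[ni][nj] and density[ni][nj] > threshold:
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            clusters.append(members)
    return clusters

# Toy density grid with two separated high-density regions.
toy = [[5, 5, 0, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 3]]
found = grid_clusters(toy, threshold=1)
```

A region with several peaks is still returned as one cluster as long as its grid points are linked through neighbours above the threshold.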

In practice, it is not easy to check the frequencies of appearance of clusters. Furthermore, cosmologists are not interested in the locations of clusters, but in the number and size of the clusters, that is, the mass distribution of the clusters. Hence we focus on the number of clusters. In Fig. 5, we find only a dozen clusters by the plug-in method, while 43 clusters are found by the cross-validation method. The number of common clusters within a reasonable Euclidean distance between the EDCCI and our catalog is 27, while the number between the Abell catalog and ours is 20. Also, the pattern of clusters in the EDCCI is more similar to ours.

Fig. 6 shows 20 smoothed bootstrap estimates. Each bootstrap estimate was constructed based on a resample $y^{*j} = \left(y^{*j}_1, \ldots, y^{*j}_n\right)$ for $j = 1, \ldots, 20$. To draw the resample $y^{*j}$, we used the following steps:

(1) Choose $I_i$ from $\{1, \ldots, n\}$ with probability $1/n$ for $i = 1, \ldots, n$.
(2) Generate $z_i$ from a bivariate normal distribution with mean $y_i^{T} = (y_{i1}, y_{i2})$ and variance matrix $\mathrm{diag}\left(\sigma_1^2, \sigma_2^2\right)$. Here $\sigma_j^2$ is the variance of the $j$th coordinate variable.
(3) Let $y^{*j}_i = y_{I_i} + h \cdot z_i$, where $h$ is the optimal bandwidth by cross-validation.
(4) Find clusters from the density contours $\{y : \hat{f}^{*}_j(y) > c\}$, where $c$ is the threshold which was used to find clusters in the Abell and EDCCI catalog and $\hat{f}^{*}_j$ is the kernel density estimator based on $y^{*j}$.

Fig. 6. Smoothed bootstrap estimates.

It is not easy to match the clusters from one smoothed bootstrap estimate to those from another, since the number of clusters may not be the same. Instead of comparing the locations of clusters, we count the number of clusters in each quadrant and use it as a heuristic measure of consistency.

To compare them more precisely, we divide each catalog into 4 quadrants. Table 1 reports the number of clusters in each quadrant for the Abell, EDCCI and our new catalog. Since we use a union of balls to approximate clusters, we assigned each cluster to a quadrant according to the centroid of its union of balls. While the total number of clusters in each catalog is almost the same, the number of clusters in each quadrant is quite different. Indeed, almost half of the clusters in the EDCCI and in our catalog are located in the upper left quadrant, but in the Abell catalog one finds more clusters in the lower left quadrant than in any other quadrant.
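The quadrant bookkeeping can be sketched in Python as follows. This is a hypothetical illustration: the function names, the representation of a cluster as a list of grid-point coordinates, and the tie-breaking rule for centroids that fall exactly on a dividing line are all assumptions of the sketch.

```python
def centroid(cells):
    """Centroid of a cluster given as a list of (x, y) grid-point coordinates."""
    n = len(cells)
    return (sum(x for x, _ in cells) / n, sum(y for _, y in cells) / n)

def quadrant_counts(clusters, x_mid, y_mid):
    """Count clusters per quadrant, assigning each cluster by its centroid."""
    counts = {"upper left": 0, "upper right": 0, "lower right": 0, "lower left": 0}
    for cells in clusters:
        cx, cy = centroid(cells)
        # Ties on a dividing line go to the upper / right side (arbitrary choice).
        vertical = "upper" if cy >= y_mid else "lower"
        horizontal = "left" if cx < x_mid else "right"
        counts[f"{vertical} {horizontal}"] += 1
    return counts

# Three toy clusters on a 10 x 10 degree field split at (5, 5).
toy_clusters = [[(1, 8), (2, 9)], [(7, 2)], [(8, 8), (9, 9)]]
toy_counts = quadrant_counts(toy_clusters, x_mid=5, y_mid=5)
```

Applying such a count to each catalog and to each bootstrap cluster map gives tables of the kind reported in this section.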


Table 1
Number of clusters in the 4 quadrants of each catalog

Catalog            Upper left   Upper right   Lower right   Lower left   Total
Abell                  12            5            11            15         43
EDCCI                  18            2            12            10         42
Cross-validation       21            3            11             8         43

Table 2
Number of clusters in the 4 quadrants of the smoothed bootstrap catalogs

Resample   Upper left   Upper right   Lower right   Lower left   Total
1              16            3             8             7         34
2              17            4             8            12         41
3              20            4             7            11         42
4              19            2             7            11         39
5              17            3             7             9         36
6              19            3             7            10         39
7              18            2             8            11         39
8              15            3             8             8         34
9              22            4             7            13         46
10             12            3             8             8         31
11             17            2             7            11         37
12             18            2             7            12         39
13             19            2             8             8         37
14             10            4             9            11         34
15             13            3             8             7         31
16             19            4             8            11         42
17             18            4             6            13         41
18             17            5             9             7         38
19             22            2             8             7         39
20             19            3             8            10         40
Average        17.35         3.1           7.65          9.85      37.95
Median         18            3             8            10.5       39

Table 2 shows the smoothed bootstrap results, which are consistent with the EDCCI and our catalog: the upper left quadrant is the most crowded area. One discrepancy between our result and the smoothed bootstrap is the total number of clusters. The average and median of the total number of clusters in the smoothed bootstrap are 37.95 and 39. In our catalog we found more clusters in the upper left and lower right quadrants, but fewer in the lower left quadrant. We suspect that some of the tiny clusters in the upper left and lower right quadrants in our catalog are random noise. For the lower left quadrant, the smoothed bootstrap result shows bimodality in the number of clusters, and certainly our result belongs to the higher mode.
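The summary rows and the spread of the lower left counts can be recomputed from the tabulated values with a few lines of Python. The sketch below only restates numbers from Table 2; the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Lower left quadrant counts and totals for the 20 smoothed bootstrap catalogs (Table 2).
lower_left = [7, 12, 11, 11, 9, 10, 11, 8, 13, 8, 11, 12, 8, 11, 7, 11, 13, 7, 7, 10]
totals = [34, 41, 42, 39, 36, 39, 39, 34, 46, 31, 37, 39, 37, 34, 31, 42, 41, 38, 39, 40]

summary = {
    "lower_left_mean": mean(lower_left),
    "lower_left_median": median(lower_left),
    "total_mean": mean(totals),
    "total_median": median(totals),
}
# How often each lower left count occurs across the 20 resamples.
lower_left_frequencies = sorted(Counter(lower_left).items())
```

The frequency list makes the two groups of lower left counts visible: the values 7 and 8 occur seven times in total and the value 11 occurs six times, with few resamples in between.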


Another interesting result is that a cluster near the origin appears in the Abell and EDCCI catalogs but is not found in our new catalog. In Fig. 6, only 7 of the 20 smoothed bootstrap estimates have a cluster near the origin. Another cluster, in the upper right corner, which is not found in the EDCCI, is found in every smoothed bootstrap estimate. Based on the smoothed bootstrap, we suspect that the cluster near the origin is random noise but that the cluster in the upper right is real.
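For readers who want to reproduce this kind of check, one smoothed bootstrap replicate can be drawn as sketched below. The sketch assumes a Gaussian product kernel with one bandwidth per coordinate, which may differ from the kernel and bandwidth matrix used for the catalog; the function name, the toy data and the bandwidth values are hypothetical.

```python
import random


def smoothed_bootstrap(points, bandwidths, rng):
    """Draw one smoothed bootstrap sample from 2-D `points`.

    Each draw picks an observed point uniformly at random (with
    replacement) and perturbs it with Gaussian noise whose standard
    deviation in each coordinate is the corresponding bandwidth. This
    amounts to sampling from a Gaussian product-kernel density estimate,
    which is an assumption of this sketch.
    """
    hx, hy = bandwidths
    sample = []
    for _ in range(len(points)):
        x, y = rng.choice(points)
        sample.append((x + rng.gauss(0.0, hx), y + rng.gauss(0.0, hy)))
    return sample


# Hypothetical toy data and bandwidths, for illustration only.
rng = random.Random(0)
galaxies = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
replicates = [smoothed_bootstrap(galaxies, (0.05, 0.05), rng) for _ in range(20)]
```

Each replicate would then be passed through the same clustering procedure as the original data. The fraction of replicates in which a given cluster reappears (7 of 20 for the cluster near the origin) serves as a rough measure of its stability.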

6. Discussion

The choice of bandwidth selector is still widely debated. In our case, contemporary cosmological theory agrees with the result provided by the cross-validation method. While the convergence rate of the cross-validation selector to the true bandwidth is much slower than that of the plug-in selector (Duong and Hazelton, 2004a), we are more interested in the behavior of the plug-in estimator $\{\hat{f} > c\}$. Also, the faster convergence rate of a bandwidth selector does not imply that the plug-in estimator is an asymptotically inefficient estimate (Loader, 1999). Furthermore, the difference between the two selectors' convergence rates in the bivariate case is small (≈ 2.6) even for n = 100 000.
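To make the dependence of the plug-in estimator on the bandwidth concrete, the sketch below evaluates a bivariate kernel density estimate and keeps the observations at which it exceeds a level c. It is a simplified illustration under stated assumptions: a Gaussian product kernel, evaluation at the data points only rather than on a grid, and hypothetical data, bandwidths and level. It is not the procedure used to build the catalog.

```python
import math


def kde(x, y, points, hx, hy):
    """Bivariate Gaussian product-kernel density estimate at (x, y)."""
    norm = 1.0 / (2.0 * math.pi * hx * hy * len(points))
    return norm * sum(
        math.exp(-0.5 * (((x - px) / hx) ** 2 + ((y - py) / hy) ** 2))
        for px, py in points
    )


def level_set(points, hx, hy, c):
    """Observations at which the density estimate exceeds the level c.

    This is a discrete stand-in for the plug-in set {f_hat > c}.
    """
    return [(x, y) for x, y in points if kde(x, y, points, hx, hy) > c]


# Hypothetical toy data: a tight clump of four points plus two isolated points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0), (-3.0, 2.0)]

# A small bandwidth keeps the clump above the level c = 1.0.
dense = level_set(points, hx=0.2, hy=0.2, c=1.0)

# A large bandwidth flattens the estimate so that nothing exceeds the same level.
oversmoothed = level_set(points, hx=1.0, hy=1.0, c=1.0)
```

With the small bandwidth only the four clumped points are retained, while the large bandwidth returns an empty set. The estimated level set, and not the bandwidth itself, is the quantity of interest here.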

Recently, Duong and Hazelton (2004b) suggested the use of a full bandwidth matrix and introduced a new version of the smooth cross-validation method for full bandwidth matrices, which might be beneficial in our case.
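For reference, the standard form of the multivariate kernel density estimator with a full bandwidth matrix is given below, following the usual textbook notation (e.g. Wand and Jones, 1995). The symbols may differ from those used earlier in this paper, so this should be read as a reminder of the general form and not as a restatement of our estimator.

```latex
\hat{f}(x; H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i),
\qquad
K_H(x) = |H|^{-1/2} \, K\!\left(H^{-1/2} x\right),
```

Here $H$ is a symmetric positive definite $d \times d$ matrix. Restricting $H$ to $\mathrm{diag}(h_1^2, \ldots, h_d^2)$ gives one bandwidth per coordinate, while a full $H$ also allows the kernel to be oriented away from the coordinate axes.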

We also tested the reliability of other catalogs. The Abell catalog shows inconsistency whereas the EDCCI works reasonably well. We used the smoothed bootstrap to assess the variability of the estimates.

Acknowledgements

The author would like to thank Larry Wasserman and Bob Nichol for their helpful comments and suggestions.

References

Cressie, N., 1991. Statistics for Spatial Data. Wiley, New York.
Cuevas, A., Fraiman, R., 1997. A plug-in approach to support estimation. Ann. Statist. 25, 2300–2312.
Diggle, P., 1985. A kernel method for smoothing point process data. Appl. Statist. 34, 138–147.
Duong, T., Hazelton, M.L., 2004a. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. J. Multivariate Anal., to appear.
Duong, T., Hazelton, M.L., 2004b. Cross-validation bandwidth matrices for multivariate kernel density estimation, to appear.
Härdle, W.K., Scott, D.W., 1992. Smoothing by weighted averaging of rounded points. Comput. Statist. 7, 97–128.
Hartigan, J., 1975. Clustering Algorithms. Wiley, New York.
Jang, W., 2004. A fast clustering algorithm with application to cosmology. Technical Report 803, Department of Statistics, Carnegie Mellon University.
Kepner, J., Fan, X., Bahcall, N., Gunn, J., Lupton, R., 1999. An automated cluster finder: the adaptive matched filter. Astrophys. J. 517, 78–91.


Loader, C., 1999. Bandwidth selection: classical or plug-in. Ann. Statist. 27, 415–438.
Lumsden, S.L., Nichol, R.C., Collins, C.A., Guzzo, L., 1992. The Edinburgh/Durham Southern Galaxy Catalogue-IV. The cluster catalogue. Monthly Notices Roy. Astronom. Soc. 258, 1–22.
Martínez, V., Saar, E., 2002. Statistics of the Galaxy Distribution. Chapman & Hall, London.
Reichart, D., Nichol, R., Castander, F., Burke, D., Romer, A.K., Holden, B., Collins, C., Ulmer, M., 1999. A deficit of high-redshift, high-luminosity X-ray clusters: evidence for high value of $\Omega_m$. Astrophys. J. 518, 521–532.
Sain, S.R., Baggerly, K.A., Scott, D.W., 1994. Cross-validation of multivariate densities. J. Amer. Statist. Assoc. 89, 807–817.
Silverman, B.W., 1985. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Silverman, B.W., Young, G.A., 1987. The bootstrap: to smooth or not to smooth. Biometrika 74, 469–479.
Wand, M.P., Jones, M.C., 1993. Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc. 88, 520–528.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.