Global, exact cosmic microwave background data analysis using Gibbs sampling

arXiv:astro-ph/0310080v2 5 Oct 2003

Global, Exact Cosmic Microwave Background Data Analysis Using Gibbs Sampling

Benjamin D. Wandelt,1, 2, ∗ David L. Larson,1 and Arun Lakshminarayanan1

1Department of Physics, UIUC, 1110 W Green Street, Urbana, IL 61801

2Department of Astronomy, UIUC, 1002 W Green Street, Urbana, IL 61801

(Dated: October 5, 2003)

We describe an efficient and exact method that enables global Bayesian analysis of cosmic microwave background (CMB) data. The method reveals the joint posterior density (or likelihood for flat priors) of the power spectrum $C_\ell$ and the CMB signal. Foregrounds and instrumental parameters can be simultaneously inferred from the data. The method allows the specification of a wide range of foreground priors. We explicitly show how to propagate the non-Gaussian dependency structure of the $C_\ell$ posterior through to the posterior density of the parameters. If desired, the analysis can be coupled to theoretical (cosmological) priors and can yield the posterior density of cosmological parameter estimates directly from the time-ordered data. The method does not hinge on special assumptions about the survey geometry or noise properties, etc. It is based on a Monte Carlo approach and hence parallelizes trivially. No trace or determinant evaluations are necessary. The feasibility of this approach rests on the ability to solve the systems of linear equations which arise. These are of the same size and computational complexity as the map-making equations. We describe a pre-conditioned conjugate gradient technique that solves this problem and demonstrate in a numerical example that the computational time required for each Monte Carlo sample scales as $n_p^{3/2}$ with the number of pixels $n_p$. We test our method using the COBE-DMR data and explore the non-Gaussian joint posterior density of the COBE-DMR $C_\ell$ in several projections.

I. INTRODUCTION

The observation and analysis of cosmic microwave background (CMB) anisotropies have attracted a great deal of attention in recent years due to their unique relevance for cosmological theory (see [1] for a recent review). A slew of observational results have been published during the last two years [2]. These were obtained from maps of the microwave sky at ever increasing sensitivity and resolution. Since the recent release of the first year WMAP data, an all-sky microwave survey has been available down to angular scales of 12 minutes of arc [3]. By the end of the decade the Planck satellite is expected to generate 1 Terabyte of high resolution, high sensitivity all-sky data.

The basic assumption is that the CMB anisotropy signal and the instrumental noise are Gaussian and that the signal statistics are isotropic on the sky. Contact between theory and observation is then best made by extracting the angular power spectrum $C_\ell$ from the data [4, 5, 6]. Methods for efficiently estimating the power spectrum have been investigated since the computational infeasibility of using the brute-force approach was realized [7, 8, 9]. This effort has yielded two classes of methods: exact methods, applicable only to two narrowly defined classes of observational strategies [10, 11], and approximate but more broadly applicable methods [12, 13, 14][29].

∗Benjamin D. Wandelt is an NCSA/UIUC Faculty Fellow; Electronic address: [email protected]

We will describe here a solution to the problem of inference from microwave background data which combines the advantages of exact methods with the practicality of the approximate methods. The computational cost of our method scales like the best approximate method for the same experiment, albeit with a larger pre-factor. Power spectrum estimates and any desired characterization of the (multivariate) statistical uncertainty in the estimates can be computed free from any approximations in the estimator which could lead to sub-optimality or biases.

The solution we propose is to sample the power spectrum (as well as other desired quantities, such as the underlying CMB signal, foregrounds or the noise properties of the instrument) directly from the joint likelihood (or posterior) density given the data. We can efficiently sample from this multi-million dimensional density using the Gibbs sampler. This approach obviates the need to evaluate the likelihood or its derivatives in order to analyze CMB data.

Our approach shares certain algorithmic features with the approach independently discovered in [17] which describes a maximum likelihood estimator of the power spectrum using Bayesian Monte Carlo methods. However, our goal from the outset was to design a method that allows a full exploration of the multivariate probability density of the power spectrum and the parameter estimates, given the data.

Our method seamlessly integrates with parameter estimation without recourse to semi-analytic Gaussian, offset log-normal [18], $\chi^2$ [19] or hybrid [20] approximation schemes. If desired, theoretical priors can be applied in the analysis by restricting the space of power spectra to those which arise from a physical model of the CMB anisotropy.

By design, the sample of power spectra and reconstructed sky maps will reflect the statistical uncertainty given the data through the full non-Gaussian statistical dependence structure of the $C_\ell$ estimates. This information can be propagated losslessly to the cosmological parameter estimates.

One aspect of our method which is of general interest in astrophysics beyond CMB analysis is that it generalizes the results on globally optimal interpolation, filtering and reconstruction of noisy and censored data sets in [25] to self-consistently include inference of the signal covariance structure. This defines a generalized Wiener filter that does not need a priori specification of the signal covariance. A byproduct of our method is a prescription for “unbiasing” the Wiener filter which clearly reveals the tight relation between Wiener filtering and power spectrum estimation.

Our methods differ from traditional methods of CMB analysis in a fundamental aspect. Traditional methods consider the analysis task as a set of steps, each of which arrives at intermediate outputs which are then fed as inputs to the next step in the pipeline. Our approach is a truly global analysis, in the sense that the statistics of all the science products are computed jointly, respecting and exploiting the full statistical dependence structure between the various components.

In summary, our method is a Monte Carlo technique which samples power spectra and other science products from their exact, multivariate a posteriori probability density, and which does so without explicitly evaluating it. The result is a detailed characterization of the statistics of the CMB signal on the sky, reconstructed foregrounds, the CMB power spectrum, and the cosmological parameters inferred from it with a cost which is proportional to the cost of a least squares map-making algorithm for the same set of observations.

In section II we introduce notation and a general statistical model of CMB observations. Our method is described in detail in section III. In section IV we comment on the perspective our Bayesian approach offers on cosmic variance. We discuss the numerical and computational techniques used to implement our method in section V and apply it to the COBE-DMR data in section VI. We reflect on where we stand and conclude with comments on further work to be done in section VII.

II. MODEL AND NOTATION

We begin by defining our model of CMB observations and introduce our notation. We imagine that the actual CMB sky $s$ is observed with some optical system and according to some observing strategy encoded in a pointing matrix $A$, which maps the signal on the sky into a collection of $n_o$ time-ordered observations of the sky. This results in the “raw” data $d$, represented by a vector with $n_o$ elements (an $n_o$–vector). Our model of this process is encoded in the model equation

$$ d = A(s + f) + n_{\rm tod}, \tag{1} $$

where $n_{\rm tod}$ is a realization of Gaussian instrumental noise added to the data and $f = \sum_i f_i$ is the sum of a collection of foregrounds (assumed spatially varying and constant in time). We represent maps on the sky with $n_p$ resolution elements (pixels) as $n_p$–vectors. Note that while we do not explicitly consider multi-channel data, the model is easily generalized to that case by adding a frequency index to $d$, $A$, $n_{\rm tod}$ and $f$.

The “map” vector $m$ is the least squares estimate of the signal $s + f$ from $d$. Because we assume Gaussian noise with zero mean this is also the maximum likelihood estimate (or maximum a posteriori estimate assuming a flat prior). It can be found as the solution of the normal equation

$$ A^T N_{\rm tod}^{-1} A\, m = A^T N_{\rm tod}^{-1} d. \tag{2} $$

Here the matrix $N_{\rm tod}$ is the covariance matrix of the noise in the time-ordered data space, $N_{\rm tod} = \langle n_{\rm tod} n_{\rm tod}^T \rangle$. Then $m = s + f + n$ where $n$ describes the residual noise on the map estimate with covariance matrix $N = \langle n n^T \rangle = (A^T N_{\rm tod}^{-1} A)^{-1}$.

The cosmological model specifies the signal covariance matrix $S$. For isotropic theories $S$ is diagonal in the spherical harmonic basis, with the special form $S_{\ell m \ell' m'} = C_\ell \delta_{\ell\ell'} \delta_{mm'}$.
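As a minimal illustration of the map-making equation, Eq. (2), the sketch below assumes a pointing matrix with one pixel per time sample and uncorrelated noise, so that $N_{\rm tod}$ is diagonal. The pixel count, scan pattern and noise level are invented for the example and do not correspond to any real experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

n_p, n_o = 16, 4000                          # pixels and time samples (toy sizes)
pix = rng.integers(0, n_p, size=n_o)         # hypothetical scan: pixel seen by each sample
sky = rng.normal(size=n_p)                   # stands in for s + f
sigma = 0.5                                  # assumed white-noise rms per sample
d = sky[pix] + sigma * rng.normal(size=n_o)  # Eq. (1) with a 0/1 pointing matrix

# Dense form of Eq. (2): (A^T Ntod^-1 A) m = A^T Ntod^-1 d
A = np.zeros((n_o, n_p))
A[np.arange(n_o), pix] = 1.0
ninv_tod = np.full(n_o, 1.0 / sigma**2)      # diagonal of Ntod^-1
lhs = A.T @ (ninv_tod[:, None] * A)          # this is N^-1 on the map
rhs = A.T @ (ninv_tod * d)
m = np.linalg.solve(lhs, rhs)

# For uniform white noise the same map is the per-pixel average of the samples.
hits = np.bincount(pix, minlength=n_p)
m_binned = np.bincount(pix, weights=d, minlength=n_p) / hits
```

For realistic data sets the matrix $A$ is never stored densely; the dense form is used here only to make the correspondence with Eq. (2) explicit.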

In keeping with the majority of the literature in the field, we restrict our discussion to theories which predict a Gaussian CMB signal $s$. It will be convenient to abbreviate Gaussian multivariate densities as

$$ G(m, C) \equiv \frac{1}{\sqrt{|2\pi C|}} \exp\left( -\frac{1}{2} m^T C^{-1} m \right). \tag{3} $$

III. METHOD

A. Overview

For a cleaner exposition of the method, we will ignore the foregrounds $f$ for now and return to their inclusion later. We are trying to explore the a posteriori density

$$ P(C_\ell | m) \propto G(m, S(C_\ell) + N)\, P(C_\ell) \tag{4} $$

where $P(C_\ell)$ is the density encoding prior information on the $C_\ell$. Up to normalization this can be obtained by marginalizing the joint density

$$ P(C_\ell, s, m) = P(m|s)\, P(s|C_\ell)\, P(C_\ell) \tag{5} $$

over the signal $s$. Setting $P(C_\ell) = {\rm const}$ makes this analysis equivalent to an exact frequentist likelihood analysis. We will discuss other choices of prior later.

Traditionally, the approach to exploring the a posteriori density has been to define an estimator, such as the least squares quadratic (LSQ) estimator [5] or the maximum likelihood (ML) estimator [6]. Then some measure of uncertainty in the values of this estimator was defined, for instance by approximating the shape of $P(C_\ell|m)$ around the maximum by a multivariate Gaussian and evaluating elements of the curvature matrix at the extremum.

Evaluating the LSQ or ML estimators is a very costly operation, taking $O(n_p^3)$ operations [30]. In general, evaluating the curvature matrix is even more costly because it has $O(n_p)$ elements each of which requires $O(n_p^3)$ operations, making the overall operation count of order $O(n_p^4)$. In addition a Gaussian approximation fails at low $\ell$ where the small number of degrees of freedom makes the posterior significantly non-Gaussian, and also at high $\ell$ in the regime of small signal-to-noise ($S/N \lesssim 1$).

Instead, we propose to sample parameter values $C_\ell$ from the posterior directly. There is no known way to directly sample from Eq. (4), but if a way can be found to sample $s$ and $C_\ell$ from the joint distribution Eq. (5) then the $C_\ell$ taken by themselves are exact samples from the marginalized distribution.

At first, sampling from the joint distribution seems even less feasible. But powerful theorems can be proved [26] that show that if it is possible to sample from the conditional distributions $P(s|C_\ell, m)$ and $P(C_\ell|s, m) \propto P(C_\ell|s)$ then one can sample from the joint distribution in an iterative fashion. Begin with some starting guess $C_\ell^0$. Then iterate the following equations

$$ s^{i+1} \leftarrow P(s | C_\ell^i, m) \tag{6} $$

$$ C_\ell^{i+1} \leftarrow P(C_\ell | s^{i+1}) \tag{7} $$

After some “burn-in” the $(C_\ell^i, s^i)$ converge to being samples from the joint distribution Eq. (5). This technique of sampling from the joint distribution is called the Gibbs sampler.
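The iteration of Eqs. (6) and (7) can be sketched in a deliberately simplified setting. The example below assumes full-sky data with uniform white noise, so that both $S$ and $N$ are diagonal in harmonic space and no linear solver is needed; it treats the harmonic coefficients as real numbers and uses a flat prior on $C_\ell$. The function name `gibbs_power_spectrum`, the noise level and the input spectrum are all illustrative choices, not part of the method as described in this paper.

```python
import numpy as np

def gibbs_power_spectrum(m_lm, noise_var, n_iter, rng):
    """Toy Gibbs sampler for C_ell from noisy harmonic coefficients.

    m_lm maps ell -> array of 2*ell+1 real coefficients of the map;
    noise_var is the (assumed uniform) noise variance per coefficient.
    """
    ells = sorted(m_lm)
    c_l = {l: np.mean(m_lm[l] ** 2) for l in ells}   # starting guess C^0
    chain = np.empty((n_iter, len(ells)))
    for i in range(n_iter):
        for k, l in enumerate(ells):
            # Eq. (6): draw s from the Gaussian conditional P(s | C, m).
            # For diagonal S and N this is the Wiener-filtered mean plus a
            # fluctuation of variance C N / (C + N) per coefficient.
            w = c_l[l] / (c_l[l] + noise_var)
            s = w * m_lm[l] + np.sqrt(w * noise_var) * rng.normal(size=2 * l + 1)
            # Eq. (7): draw C from P(C | s) for a flat prior.
            sigma_l = np.sum(s ** 2)
            rho = rng.normal(size=2 * l - 1)
            c_l[l] = sigma_l / np.sum(rho ** 2)
            chain[i, k] = c_l[l]
    return ells, chain

rng = np.random.default_rng(0)
true_cl, noise_var = 1.0, 0.1
m_lm = {l: rng.normal(scale=np.sqrt(true_cl + noise_var), size=2 * l + 1)
        for l in range(2, 31)}
ells, chain = gibbs_power_spectrum(m_lm, noise_var, n_iter=500, rng=rng)
posterior_median = np.median(chain[100:], axis=0)    # discard burn-in
```

Because the toy problem factorizes over $\ell$, the two draws can be made multipole by multipole; in the general case with sky cuts and correlated noise the signal draw couples all multipoles and requires the linear solves discussed in the next subsection.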

B. Sampling Techniques

To implement these ideas one needs the forms of the conditional densities and recipes for sampling from these distributions. These follow now.

The conditional density of the signal given the most recent $C_\ell$ sample is just a multivariate Gaussian

$$ P(s | C_\ell^i, m) \propto G\left( S^i (S^i + N)^{-1} m,\; \left((S^i)^{-1} + N^{-1}\right)^{-1} \right), \tag{8} $$

where $S^i \equiv S(C_\ell^i)$. This will be recognized as the posterior density of the Wiener Filter given the most recent power spectrum estimate.

The density for the power spectrum coefficients $C_\ell$ factorizes due to the special form of $S$:

$$ P(C_\ell | s^i) \propto P(C_\ell) \prod_\ell \frac{1}{\sqrt{C_\ell^{2\ell+1}}} \exp\left( -\frac{1}{2 C_\ell} \sum_{m=-\ell}^{+\ell} |s^i_{\ell m}|^2 \right) \tag{9} $$

The $s^i_{\ell m}$ are the spherical harmonic coefficients of $s^i$. This density is known as the inverse Gamma distribution of order $2\ell - 1$. This result has interesting implications for cosmic variance in this Bayesian framework, which we will discuss below.

To sample from Eq. (8) we need to generate a Gaussian variate with the given mean and covariance. A numerically convenient choice (see section V) of the equation for the mean $x$ is

$$ \left(1 + (S^i)^{1/2} N^{-1} (S^i)^{1/2}\right) (S^i)^{-1/2} x = (S^i)^{1/2} N^{-1} m. \tag{10} $$

In fact it is easier to solve for $z = (S^i)^{-1/2} x$ and to then solve for $x$ trivially. Note from its definition above that $N^{-1}x$ is easier to compute than $Nx$. If $N_{\rm tod}$ is circulant (stationary noise) or block-circulant (a popular choice for non-stationary noise), $N_{\rm tod}^{-1}x$ can be computed using FFTs. If $N$ is diagonal to very good accuracy then computing $N^{-1}$ is easy. We will drop the $i$ superscript in what follows.

We chose to write the equation in terms of the map made from the data. It is easy to see from Eq. (2) and from $N = (A^T N_{\rm tod}^{-1} A)^{-1}$ that writing Eq. (10) in terms of the TOD saves some computations: $N^{-1} m = A^T N_{\rm tod}^{-1} d$. This replacement can be made throughout in the equations that follow in the remainder of this paper.

Then we need to add a fluctuation term to this mean to get a random variate. This is non-trivial, because we need to simulate a map with covariance $(S^{-1} + N^{-1})^{-1}$ without being able to compute square roots of this matrix. We can, however, compute the square root of $S$ because it is diagonal in spherical harmonic space and we can compute $N^{-1/2} \equiv A^T N_{\rm tod}^{-1/2}$ by using FFTs on the time-ordered data. This leads to the following solution: generate two p-vectors $\xi$ and $\chi$ of independent Gaussian random variates, with zero mean and unit variance (these are called normal variates). Then solve the linear set of equations

$$ \left(1 + S^{1/2} N^{-1} S^{1/2}\right) S^{-1/2} y = \xi + S^{1/2} N^{-1/2} \chi \tag{11} $$

for $y$. It is easy to verify that this does give the right covariance by computing $\langle y y^T \rangle$. The final result is then

$$ s^{i+1} = x + y, \tag{12} $$

where we have re-introduced the superscript.

Note that each $s$ is a perfect pure signal sky (up to the assumed band-limit) with covariance $S$. While $x$ is the Wiener filter, whose power spectrum would be a biased estimator of the $C_\ell$, $s$ is “unbiased”. The addition of the fluctuating term $y$ has replaced filtered noise fluctuations with synthetic signal fluctuations.
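The claim that Eqs. (10) and (11) reproduce the mean and covariance of Eq. (8) can be checked numerically on a small dense problem. The sketch below is only a consistency check under invented inputs: the 6-dimensional matrices are random, $S$ is taken diagonal, and $N^{-1/2}$ is represented by a Cholesky factor $L$ with $L L^T = N^{-1}$ rather than by an operation on time-ordered data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
S = np.diag(rng.uniform(0.5, 2.0, size=n))      # toy signal covariance (diagonal)
R = rng.normal(size=(n, n))
N = R @ R.T + n * np.eye(n)                     # toy noise covariance (symmetric positive definite)
m = rng.normal(size=n)

S_half = np.sqrt(S)                             # elementwise sqrt is valid because S is diagonal
N_inv = np.linalg.inv(N)
N_inv_half = np.linalg.cholesky(N_inv)          # L with L L^T = N^-1
M = np.eye(n) + S_half @ N_inv @ S_half

# Eq. (10): solve for z = S^-1/2 x, then x = S^1/2 z
x = S_half @ np.linalg.solve(M, S_half @ N_inv @ m)
x_wiener = S @ np.linalg.solve(S + N, m)        # mean of Eq. (8)

# Eq. (11): y = B xi + B S^1/2 N^-1/2 chi with B = S^1/2 M^-1, so for
# unit-variance xi and chi, <y y^T> = B B^T + B S^1/2 N^-1 S^1/2 B^T.
B = S_half @ np.linalg.inv(M)
cov_y = B @ B.T + B @ S_half @ N_inv @ S_half @ B.T
cov_target = np.linalg.inv(np.linalg.inv(S) + N_inv)   # covariance of Eq. (8)

def draw_signal(rng):
    """One signal sample s = x + y, Eq. (12)."""
    xi, chi = rng.normal(size=n), rng.normal(size=n)
    y = S_half @ np.linalg.solve(M, xi + S_half @ N_inv_half @ chi)
    return x + y
```

In this toy setting `x` agrees with `x_wiener` and `cov_y` agrees with `cov_target` to machine precision, which is the algebraic identity behind the sampling recipe.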

Drawing the $C_\ell^{i+1}$ from the inverse Gamma distribution, Eq. (9), is very simple. For each $\ell$, compute $\sigma_\ell = \sum_{m=-\ell}^{+\ell} |s^i_{\ell m}|^2$ and generate a $(2\ell - 1)$-vector $\rho_\ell$ of Gaussian random variates with zero mean and unit variance. Then

$$ C_\ell^{i+1} = \frac{\sigma_\ell}{|\rho_\ell|^2}, \tag{13} $$

where the denominator is the square norm of $\rho_\ell$.
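The recipe of Eq. (13) translates almost directly into code. The following sketch draws many samples for a single multipole and compares their average with $\sigma_\ell/(d-2)$ for $d = 2\ell - 1$, the mean quoted in Eq. (22); the function name `draw_cl` and the numerical values of $\ell$ and $\sigma_\ell$ are arbitrary choices for illustration.

```python
import numpy as np

def draw_cl(sigma_l, ell, rng, size=None):
    """Draw C_ell = sigma_ell / |rho|^2, with rho a (2*ell - 1)-vector of normal variates."""
    k = 2 * ell - 1
    n = 1 if size is None else size
    rho = rng.normal(size=(n, k))
    c = sigma_l / np.sum(rho ** 2, axis=1)
    return c[0] if size is None else c

rng = np.random.default_rng(2)
ell, sigma_l = 10, 21.0
samples = draw_cl(sigma_l, ell, rng, size=200_000)
expected_mean = sigma_l / (2 * ell - 3)      # sigma / (d - 2) with d = 2*ell - 1
```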

C. Foregrounds

Traditionally, regions on the sky are excised if the residual error after foreground subtraction is large. However, modeling the signal on the remainder of the sky after foreground cuts complicates the structure of the signal covariance matrix $S$. Instead, we choose to model foregrounds as an additional component in the model equation, as shown in Eq. (1). Then the joint density in Eq. (5) becomes

$$ P(C_\ell, s, \{f_j\}, d) = P(d | s, \{f_j\})\, P(s|C_\ell)\, P(C_\ell) \prod_k P(f_k) \tag{14} $$

where each $P(f_k)$ contains prior information about the $k$th foreground.

Following the Gibbs sampler approach we draw from the foreground components given the data. We group different logically separate foregrounds by adding in additional steps in the sampling chain

$$ \begin{aligned} \text{for every } j:\quad f_j^{i+1} &\leftarrow P(f_j | C_\ell^i, s^i, \{f_{k<j}\}^{i+1}, \{f_{k>j}\}^i) \\ s^{i+1} &\leftarrow P(s | C_\ell^i, \{f_j\}^{i+1}) \\ C_\ell^{i+1} &\leftarrow P(C_\ell | s^{i+1}) \end{aligned} \tag{15} $$

Where appropriate, different foreground components may be grouped together into one $f_j$. The algorithm to sample from the conditional foreground densities is analogous to the signal sampling algorithm described in the previous subsection. We will return to algorithmic issues after discussing the foreground prior $\prod_j P(f_j)$.

How do we specify the foreground prior? For instance, we might want to be completely insensitive to certain foreground terms $\{f_i\}$. This would mean setting $P(f) = G(f, FF^T)$, where $FF^T \equiv \sigma_f^2 \sum_i f_i f_i^T$ and each vector $f_i$ represents a foreground contribution we would like to project out. The matrix $F$ is just constructed by columns from the $f_i$. The variance $\sigma_f^2$ is numerically “infinite”, i.e. large compared to any other noise source. This specifies maximal ignorance about the amplitude of this foreground component. As an example, $f_i$ could be the monopole and the three dipole components. Or, if the foreground contribution in a pixel $j$ was completely unknown, then $f_i = 1_j$, where $1_j$ is the vector representing the map which is zero everywhere except in the pixel $j$. Essentially any spatial template to which we want the power spectrum estimation to be insensitive can be added in here, and they can be grouped in computationally convenient ways in Eq. (15).

It is important to note that even though we may have specified “infinite” variance in our prior, the foreground samples produced will be constrained by the data and hence will be informative about the level of the foreground contribution supported by the data. For example, the sample of the three dipole components generated during the iteration of Eq. (15) in the example above would be informative about the direction and amplitude of the CMB dipole, and could be used to calibrate the experiment.

Different choices for the foreground prior $P(f)$ are possible. It could include information on foreground templates as well as a specification of our uncertainty in these templates. For example if the template is $\bar f$ and our uncertainty could be described by a Gaussian centered on $\bar f$ with covariance $FF^T$ then $P(f) = G(f - \bar f, FF^T)$. One way to specify $\bar f$ and $FF^T$ would be to simulate a set of possible theoretical foreground models $f_i$ with weights $w_i$, such that $\sum w_i = 1$, and to then set $\bar f = \sum w_i f_i$ and $FF^T \equiv \sum_i w_i (f_i - \bar f)(f_i - \bar f)^T$.

Note that the assumption of a Gaussian prior $P(f)$ only assumes that our ignorance of the foreground contribution can be expressed through a Gaussian covariance structure; the foregrounds are not assumed to have Gaussian statistics. Non-Gaussianity can be explicitly assumed by choosing a non-Gaussian template $\bar f$. For the case of multi-frequency data, $P(f)$ could encode what is known about the dependence of certain physical foreground components on the frequency.
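The construction of a template prior from a weighted set of foreground models can be written in a few lines. The sketch below is one possible realization, assuming the models are stored as rows of an array; the helper name `template_prior` and the random stand-in models are hypothetical. It returns the mean template and a factor $F$ whose columns are $\sqrt{w_i}\,(f_i - \bar f)$, so that $FF^T$ equals the weighted covariance defined above.

```python
import numpy as np

def template_prior(models, weights):
    """Return (f_bar, F) with F @ F.T = sum_i w_i (f_i - f_bar)(f_i - f_bar)^T."""
    models = np.asarray(models, dtype=float)     # shape (n_models, n_p)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # enforce sum(w) = 1
    f_bar = w @ models                           # weighted mean template
    dev = models - f_bar
    F = dev.T * np.sqrt(w)                       # columns sqrt(w_i) (f_i - f_bar)
    return f_bar, F

rng = np.random.default_rng(3)
models = rng.normal(size=(5, 8))                 # five stand-in foreground models, 8 pixels
weights = [0.4, 0.3, 0.1, 0.1, 0.1]
f_bar, F = template_prior(models, weights)
```

Storing the factor $F$ rather than the full $n_p \times n_p$ covariance keeps the memory cost proportional to the number of models, which matters when $F$ has far fewer columns than pixels.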

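Returning to the mechanics of sampling Eq. (15) we write $\mathcal{F}_j = F_j F_j^T$, and solve at the $j$th step

$$ (\mathcal{F}_j + \mathcal{F}_j N^{-1} \mathcal{F}_j)\, x_j = \bar f_j + \mathcal{F}_j N^{-1} \Big( m - s - \sum_{k \neq j} f_k \Big), \tag{16} $$

and

$$ (\mathcal{F}_j + \mathcal{F}_j N^{-1} \mathcal{F}_j)\, y_j = F_j \xi + \mathcal{F}_j N^{-1/2} \chi. \tag{17} $$

Then $f_j = \mathcal{F}_j (x_j + y_j)$. Since $\mathcal{F}_j$ may not be full rank in $n_p$ dimensions, the equations here may be understood as shorthand for the projected equations in the subspace on which $\mathcal{F}_j$ has full rank.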

Note that when foregrounds are considered, the $m$ on the right hand side of Eq. (10) has to be replaced with $(m - \sum_j f_j)$.

In special cases it may be desirable to perform the marginalization over $f$ analytically. Appendix A gives techniques for doing so.

D. Noise model

It is straightforward to extend our methods to include estimation of the noise covariance from the data themselves. In the case that $N_{\rm tod}$ is non-stationary and block-diagonal with circulant blocks (the standard assumption in CMB analysis), we can easily add a sampling step symbolically written as

$$ N_j^{i+1} \leftarrow P(N | \{N_{k \neq j}\}^i, s, C_\ell, \{f_j\}) = P(N | s, \{f_j\}). \tag{18} $$

The second equality expresses two facts: (1) for a block diagonal noise matrix the conditional density of one block does not depend on the other blocks and (2) $N$ is conditionally independent of the $C_\ell$ given $s$; that is, given $s$ the $C_\ell$ do not add more information about the $N$.

In practice, the noise model would assume smoothness of the noise power spectra. If we write $N_{jk}$ for the $k$th band power spectral coefficient of the $j$th block of the noise covariance of the TOD, sampling it simply involves computing the FFT of the $j$th segment of $d - A(s + \sum f)$, adding the power in bands of width $d$ and then sampling $N_{jk}$ from the inverse Gamma distributions of order $d - 2$.

More general noise models can be implemented. We will explore the effect of more sophisticated modeling of non-stationary noise in future work.

E. Parameter estimation

Currently power spectrum estimation algorithms rely on approximate representations of the posterior density $P(C_\ell|d)$ [31], for example in terms of multivariate Gaussian, shifted log-normal or hybrid representations. These approximations have to be fitted to sets of Monte Carlo simulations [20]. Since they take simple analytical forms they can only be expected to be accurate near the peak of the posterior density. In order to faithfully propagate all the information in the $C_\ell$ estimates through to the parameter estimation step, efficient ways must be found to accurately represent and communicate $P(C_\ell|d)$.

The Bayesian estimation technique described in this paper provides a natural answer to this problem. The method generates a set of samples from $P(C_\ell|d)$ which can simply be published electronically. Meaningful summaries of the properties of $P(C_\ell|d)$ can all be calculated arbitrarily exactly, given a sufficient number of samples.

The disadvantage of using this sample set for parameter estimation is that it does not lend itself easily to computing a numerical probability density for a theoretical $C_\ell$ power spectrum computed from a set of cosmological parameters $\theta$.

However, a fortunate circumstance solves the problem of finding an arbitrarily exact numerical representation of $P(C_\ell|d)$. At each iteration of the Gibbs sampler the $C_\ell$ are drawn from $P(C_\ell|s)$ which is in fact $P(C_\ell|\sigma_\ell)$ where $\sigma_\ell = \sum_m |s_{\ell m}|^2$. We can therefore write

$$ P(C_\ell|d) = \int ds\, P(C_\ell, s|d) = \int ds\, P(C_\ell|s)\, P(s|d) = \int D\sigma_\ell\, P(C_\ell|\sigma_\ell)\, P(\sigma_\ell|d) \approx \frac{1}{n_G} \sum_i P(C_\ell|\sigma_\ell^i). \tag{19} $$

The sum (where the index runs over $n_G$ Gibbs samples) becomes an arbitrarily exact approximation to the integral as the number of samples increases. It is called the Blackwell-Rao estimator for the density and can be shown to be superior to binned representations. This sum yields a numerical representation of the posterior density of the power spectrum given the signal samples. All the information about $P(C_\ell|d)$ is contained in the $\sigma_\ell^i$, which generate a data set of size $O(\ell_{\max} n_G)$.

It is noteworthy that in the limit of perfect data, using Eq. (19) returns the exact posterior density after only one iteration of the Gibbs sampling algorithm.
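A minimal sketch of the Blackwell-Rao sum in Eq. (19) for a single multipole is given below. It assumes a flat prior, so that $P(C_\ell|\sigma_\ell)$ from Eq. (9) is proportional to $C_\ell^{-(2\ell+1)/2} \exp(-\sigma_\ell / 2C_\ell)$, and normalizes it using the standard inverse Gamma normalization with shape $(2\ell-1)/2$ and scale $\sigma_\ell/2$. The function names and the three stand-in values of $\sigma_\ell^i$ are illustrative only; in practice the $\sigma_\ell^i$ would come from the Gibbs chain.

```python
import math
import numpy as np

def log_p_cl_given_sigma(c, sigma, ell):
    """log of the normalized P(C_ell | sigma_ell) for a flat prior, cf. Eq. (9)."""
    a = 0.5 * (2 * ell - 1)          # shape: exponent of 1/C is a + 1 = (2*ell + 1)/2
    b = 0.5 * sigma                  # scale
    return a * np.log(b) - math.lgamma(a) - (a + 1) * np.log(c) - b / c

def blackwell_rao(c, sigma_samples, ell):
    """Average of P(C_ell | sigma_ell^i) over Gibbs samples, Eq. (19)."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    sig = np.asarray(sigma_samples, dtype=float)
    logp = log_p_cl_given_sigma(c[:, None], sig[None, :], ell)
    return np.exp(logp).mean(axis=1)

sigma_samples = np.array([9.0, 11.0, 13.0])   # stand-ins for sigma_ell^i from a chain
c_grid = np.logspace(-2, 3, 20001)
density = blackwell_rao(c_grid, sigma_samples, ell=5)
```

Since each term in the average is a normalized density, the estimate integrates to one for any number of samples. For the joint density over several multipoles, the product over $\ell$ would be taken inside the sum over samples, which preserves the correlations between multipoles.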

In addition to being a faithful representation of $P(C_\ell|d)$ it is also a computationally efficient representation. Evaluating the Gaussian or the shifted log-normal approximations to $P(C_\ell|d)$ takes $O(\ell_{\max}^3)$ operations, while our approach requires only $O(\ell_{\max} n_G)$ operations. Note also that any moments of $P(C_\ell|d)$ can be calculated through

$$ \langle C_\ell C_{\ell'} \ldots C_{\ell''} \rangle|_{P(C_\ell|d)} \approx \frac{1}{n_G} \sum_i \langle C_\ell C_{\ell'} \ldots C_{\ell''} \rangle|_{P(C_\ell|\sigma_\ell^i)}. \tag{20} $$

This is a far more efficient representation than would be afforded by a Monte Carlo sample of a pseudo-$C_\ell$ estimator since each of the terms on the right hand side can be computed analytically.

Another feature of this framework is that it is possible to include cosmological parameter estimation in the joint analysis of the data. If we assume a class of theoretical models, we can solve the estimation problem of power spectrum and cosmological parameters concurrently. The assumption of such a class of models amounts to choosing a prior for the power spectra which excludes spectra that could not possibly be the result of a solution of the Boltzmann equation for any combination of the parameters about which we wish to make inferences.

With such an assumed class of models the relationship between $C_\ell$ and the cosmological parameters $\theta$ is a non-stochastic one, $C_\ell = C_\ell(\theta)$, and $P(C_\ell|\theta)$ is a delta function. We can integrate out this delta function in the posterior and then obtain the conditional density for sampling the cosmological parameters given the data. This procedure results in removal of the $C_\ell$ sampling step and the addition of the following step to the list in Eq. (15):

$$ \theta^{i+1} \leftarrow P(C_\ell(\theta) | s^{i+1}). \tag{21} $$

Here $P(C_\ell(\theta)|s^{i+1})$ denotes the inverse Gamma distribution, Eq. (9), and $C_\ell(\theta)$ is defined through cosmological theory. Instead of sampling from the $\ell_{\max}$ power spectrum coefficients given the $\sigma_\ell$ we sample from $\theta$ assuming that we just measured the $\sigma_\ell$ on a perfect signal sky (the last draw). In practice, that can be achieved by running a Markov Chain using the Metropolis Hastings algorithm until one independent $\theta$ sample is produced.
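To make the step in Eq. (21) concrete, the sketch below runs a short random-walk Metropolis chain for a single parameter. The model is a deliberately artificial one, chosen only for illustration: $\theta$ is an overall amplitude multiplying a fixed spectral shape, $C_\ell(\theta) = \theta\, T_\ell$, with a flat prior on $\theta > 0$. A real application would replace `shape` by the output of a Boltzmann code and $\theta$ by a vector of cosmological parameters; the function names, step size and chain length are assumptions of this example.

```python
import numpy as np

def log_post(theta, ells, sigma, shape):
    """log P(C_ell(theta) | sigma_ell) from Eq. (9), up to a constant, flat prior in theta."""
    if theta <= 0:
        return -np.inf
    cl = theta * shape                       # toy model C_ell(theta) = theta * T_ell
    return np.sum(-0.5 * (2 * ells + 1) * np.log(cl) - 0.5 * sigma / cl)

def metropolis_theta(theta0, ells, sigma, shape, n_steps, step, rng):
    """Random-walk Metropolis chain; returns the final state as the new theta sample."""
    theta, lp = theta0, log_post(theta0, ells, sigma, shape)
    for _ in range(n_steps):
        prop = theta + step * rng.normal()   # symmetric Gaussian proposal
        lp_prop = log_post(prop, ells, sigma, shape)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
    return theta

rng = np.random.default_rng(6)
ells = np.arange(2, 31)
shape = 1.0 / (ells * (ells + 1.0))          # stand-in spectral shape T_ell
sigma = (2 * ells + 1) * 2.0 * shape         # sigma_ell of a sky with amplitude 2
theta_new = metropolis_theta(1.0, ells, sigma, shape, n_steps=2000, step=0.1, rng=rng)
```

For this toy model the conditional density peaks at $\theta = 2$, so the returned sample should scatter around that value. The number of Metropolis steps needed for an effectively independent draw depends on the proposal and would have to be tuned for any realistic parameter space.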

If we believe strongly in the theoretical framework, using this prior information is desirable: it reduces the number of parameters in the problem and therefore improves the signal and hence also the foreground reconstruction from the data. The set of $C_\ell$ for the draws of $\theta$ represents stochastically what is known about the theoretical power spectrum. This method defines an optimal non-linear filter which returns the best power spectrum and a characterization of the error while including physical constraints on the analysis (for example the smoothness of the $C_\ell$ which is related to the natural frequency of oscillation modes in the primordial plasma).

FIG. 1: Computing time averaged over 30 iterations of the Gibbs sampler required for solving Eq. (10) and Eq. (11) as a function of the number of pixels in the map. These timings are for a single AthlonXP 1800+ CPU. Solid line: actual timings. Dashed lines show $n_p^x$ for $x \in \{3, 5/2, 2, 3/2\}$ from the top to the bottom on the right side of the figure.

However, just as we are interested in making maps from the data without inputting information about the foregrounds and the statistical properties (e.g. isotropy) of the CMB, we are also interested in the model independent power spectrum constraints.

IV. COSMIC VARIANCE

In Eq. (9) we have written down the conditional posterior $P(C_\ell|s)$. This encodes what we know about the $C_\ell$ if we have perfect knowledge of the signal on the sky. The full posterior distribution $P(C_\ell|d)$ would reduce to this if we had perfect (noiseless, all-sky) data.

As shown in Eq. (9) the $C_\ell$ only depend on the data through $\sigma_\ell = \sum_{m=-\ell}^{+\ell} |s^i_{\ell m}|^2$. These $\sigma_\ell$ have a physical interpretation. They measure the actual fluctuation power on our sky. Therefore, if we had perfect data it would be possible to measure the $\sigma_\ell$ with zero variance.

The residual uncertainty in $C_\ell$ even for perfect data is a well known fact, known as cosmic variance. It is the consequence of having only one sky at our disposal, which means that there are a limited number of degrees of freedom for each $C_\ell$. Hence we cannot measure the $C_\ell$ arbitrarily precisely.

In this Bayesian treatment the functional form of the conditional posterior density may be unexpected. In the frequentist approach where the true underlying theory (i.e. the $C_\ell$) is thought of as fixed and the data as random, the variances on our sky $\sigma_\ell = \sum_m |s_{\ell m}|^2$ are thought of as $\chi^2$ variates with $2\ell + 1$ degrees of freedom.

From a Bayesian perspective the data is fixed and our knowledge of the underlying theory is uncertain, so our knowledge about the $C_\ell$ is encoded in the inverse Gamma distributions of order $2\ell - 1$, Eq. (9).

FIG. 2: The COBE-DMR power spectrum. The vertical bands display the marginalized densities at each $\ell$. Horizontal bars mark off bins of constant probability. These bins are assigned their color in $C_\ell$ space and then projected into the diagram. The bin with the highest probability density is shown in black. The dark and light shaded regions are the 1-σ and 2-σ highest posterior density regions, respectively. The $C_\ell$ are measured in units mK$^2$ in this and all subsequent figures.

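The mean and variance of the inverse Gamma distribution of order $d$ are

$$ \langle C_\ell \rangle = \frac{\sigma_\ell}{d - 2}, \qquad d > 2, \tag{22} $$

and

$$ \langle \Delta C_\ell^2 \rangle = \frac{2 \sigma_\ell^2}{(d - 4)(d - 2)^2}, \qquad d > 4. \tag{23} $$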

For the case of a flat prior $P(C_\ell) = {\rm const}$ we obtain $d = 2\ell - 1$. In this case the variance only becomes finite for $\ell > 2$. This is a result of having chosen a flat prior for a variance measurement. There are in fact no arguments for doing so: when measuring a variance (which is a scale parameter) a flat prior does not correspond to maximal ignorance.

The Jeffreys ignorance prior [32] for this case is $P(C_\ell) = 1/C_\ell$. This would lead to $d = 2\ell + 1$ and finite variance for $\ell > 1$. In this case $\langle C_\ell \rangle = \frac{\sigma_\ell}{2\ell - 1}$ and $\langle \Delta C_\ell^2 \rangle = \frac{2 \sigma_\ell^2}{(2\ell - 3)(2\ell - 1)^2}$.

In order to obtain the frequentist expectation $\langle C_\ell \rangle = \frac{\sigma_\ell}{2\ell + 1}$ the prior $P(C_\ell) = 1/C_\ell^2$ would have to be used. In this case we still obtain a variance for the estimator which is larger by a factor $\frac{2\ell + 1}{2\ell - 1}$ than the frequentist chi-square variance [21]. So the mean of $P(C_\ell|d)$ is an unbiased estimator of $C_\ell$ for perfect data and hence has the same expectation as the maximum likelihood estimator [17].

FIG. 3: Marginalized posterior densities for each individual $C_\ell$ from the COBE-DMR data. At each $\ell$ the fluctuations in the $C_\ell$ at all other $\ell$ were integrated out. The axis ranges are the same for all panels.

These considerations are potentially relevant to the discussion about the statistical significance of the low $\ell$ $C_\ell$ estimates in the WMAP data in the Bayesian approach (e.g., [16] and references therein). We will explore this issue in more detail in a future publication.

V. COMPUTATIONAL CONSIDERATIONS

The computationally most demanding part of implementing this method is solving Eq. (10) and Eq. (11) at each iteration of the Gibbs sampler. Each of these is a linear system of equations of the form $Mv = w$, where $M = (1 + S^{1/2} N^{-1} S^{1/2})$. It should be noted that these systems are of the same size as the map-making equation, Eq. (2). Maps also have to be made for approximate estimators. Therefore we expect the computational complexity of the Gibbs sampler to be no worse than the computational complexity of an approximate method.

For large $n_p$ ($n_p > 10^5$ on the largest supercomputers available at the time of writing), direct solution of either of these equations becomes infeasible, because neither of them is sparse. This means the operation count scales as $n_p^3$ and the memory requirements for storing the coefficient matrices scale as $n_p^2$. Therefore large systems of this type are usually solved using iterative techniques, such as the conjugate gradient (CG) technique [27]. The memory savings can be very large: the components of $M$ do not have to be stored as long as matrix vector products $Mv$ can be computed somehow. In terms of CPU time, iterative techniques outperform direct techniques if either $Mv$ can be computed in less than $n_p^2$ operations or the number of iterations required to converge to a solution of sufficient accuracy is much less than $n_p$.

We chose to write Eq. (10) and Eq. (11) in a form which satisfies all of these requirements. The memory required is of order $n_p$ since we never need to store the components of the coefficient matrix.

The action of any power of $S$ on a vector can be computed in much less than $n_p^2$ operations using spherical harmonic transforms (or FFTs in the flat sky approximation). The action of $N^{-1} = A^T N_{\rm tod}^{-1} A$ on a vector is generally easier to compute than the action of $N$ on a vector. As long as noise correlations can be modeled in a simple way in the time-domain (e.g. as piecewise stationary) the time required for applying $N^{-1}$ to a vector is similar to that required for a forward simulation of the data.

The number of CG iterations until convergence can be reduced far below the theoretical maximum $n_p$ if $M$ is nearly proportional to the unit matrix. This goal can be approached by finding an approximate inverse for $M$, a preconditioner.

FIG. 4: Correlation matrix of $C_\ell$ estimates from the COBE-DMR data. The diagonal components have been set to zero to enhance the contrast of the off-diagonal components. The surface is shaded according to height. We see that correlations between the power spectrum estimates vary between 8% correlation at $(\ell, \ell') = (6, 10)$ and 15% anti-correlation at $(\ell, \ell') = (8, 12)$. See Figure 5.

If $N^{-1}$ were diagonal in the spherical harmonic basis, M would be, too. Therefore, as long as this is approximately true on scales where S ≫ N, a good preconditioner for this system would be the inverse of the diagonal part of M in the spherical harmonic basis. These are easy to compute if we approximate the diagonal components of $N^{-1}$ by counting the number of TOD samples in each pixel and weighting by the current noise temperature of the detector. Due to the way Eq. (10) and Eq. (11) have been written, the structure of $N^{-1}$ in the noise dominated regime does not matter, since if S ≪ N, M ≈ 1.
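The effect of such a diagonal preconditioner can be illustrated with a small dense toy problem. The sketch below assumes the coefficient matrix has the form $M = 1 + S^{1/2} N^{-1} S^{1/2}$ (consistent with M ≈ 1 for S ≪ N), works in a basis where S is diagonal, and takes $N^{-1}$ to be diagonal plus a weak coupling. The spectrum, coupling strength and problem size are hypothetical choices, not values from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
ell = np.arange(1, n + 1)
S = 1e4 / ell**3                        # hypothetical steeply falling signal variance

# Hypothetical inverse noise: dominant diagonal plus a weak symmetric coupling.
diag_part = rng.uniform(0.5, 1.5, n)
C = rng.standard_normal((n, n))
coupling = 0.02 * (C @ C.T) / n
N_inv = np.diag(diag_part) + coupling

# Assumed coefficient matrix M = 1 + S^(1/2) N^(-1) S^(1/2).
sqrt_S = np.sqrt(S)
M = np.eye(n) + sqrt_S[:, None] * N_inv * sqrt_S[None, :]

# Diagonal preconditioner, applied symmetrically as D^(-1/2) M D^(-1/2);
# this matrix has the same eigenvalues as D^(-1) M.
d = np.diag(M)
M_pre = M / np.sqrt(d[:, None] * d[None, :])

cond_raw = np.linalg.cond(M)            # large: M spans signal- and noise-dominated modes
cond_pre = np.linalg.cond(M_pre)        # close to 1: nearly proportional to the unit matrix
```

Since the CG iteration count grows with the condition number, the preconditioned system should converge in far fewer iterations than the original one.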

This preconditioner can be computed in $O(n_p^{3/2})$ operations. Figure 1 shows the results of a timing study for simulated data sets of varying size with WMAP-like scanning strategy and uncorrelated noise. The preconditioner performs well. The number of iterations does not increase with problem size over three orders of magnitude in $n_p$ and the computing time is dominated by the spherical harmonic transforms.

VI. APPLICATION TO THE COBE DATA

In order to test our method we applied it to the well-studied COBE-DMR data. The exact maximum likelihood estimator [6, 22, 23] and the least square quadratic estimator [5] have been computed for this data set. However, even for this small data set, the marginalized probability densities of each individual Cℓ, or the joint posterior density of pairs of Cℓ, have not been computed because doing so would require numerical integration over ∼ 20 dimensions. We will show these densities here for the first time.

The COBE-DMR data [24] is published in the quadcube data structure, at a resolution which has 6144 pixels on the sphere. We use a noise-weighted average of the 53 GHz and 90 GHz maps. Because much of our code was already written for a HealPix data structure, we put the COBE data into a HealPix pixelization at resolution nside = 64 with 49152 pixels. HealPix pixels whose centers lie within the same quadcube pixel get the same data (temperature) value.

Because the noise is completely correlated between sets of HealPix pixels in the same quadcube pixel, the noise covariance matrix N is block diagonal, where each element of the block is $\sigma^2$, the published (noise) variance of that quadcube pixel. This means that N is not strictly invertible, so we have to use a pseudo-inverse for $N^{-1}$. Our pseudo-inverse is also block diagonal, with constant-valued blocks, and correctly inverts the action of N on a vector that is constant valued on the same blocks as N.
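A minimal sketch of such a pseudo-inverse for a single block is given below. It assumes a block of k HealPix pixels sharing one quadcube pixel (49152/6144 = 8 on average) and uses the Moore-Penrose form of a constant block; the variance value and the function name are hypothetical.

```python
import numpy as np

def block_pseudo_inverse(sigma2, k):
    """Pseudo-inverse of the k x k block sigma2 * ones((k, k)).

    The block has rank one, so its Moore-Penrose inverse is again a
    constant block, with entries 1 / (sigma2 * k**2).
    """
    return np.ones((k, k)) / (sigma2 * k**2)

sigma2 = 0.04      # hypothetical noise variance of one quadcube pixel
k = 8              # HealPix pixels per quadcube pixel (average for nside = 64)

N_block = sigma2 * np.ones((k, k))
N_pinv = block_pseudo_inverse(sigma2, k)

# A vector that is constant on the block is recovered after applying N, then N_pinv.
v = 3.7 * np.ones(k)
recovered = N_pinv @ (N_block @ v)
```

The full pseudo-inverse is obtained by repeating this construction for each quadcube pixel with its own published variance.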

We project out the mean and dipoles from the uncut region of the COBE-DMR map, and model the data within the custom galactic cut as Gaussian random white noise with large variance. This corresponds to claiming complete ignorance of the foregrounds at low galactic latitudes (within the custom cut) and assuming that no residual foregrounds are present at high latitudes (outside the cut region). This is the simplest possible way of treating the monopole, dipole, and galactic foregrounds.

Our noise matrix has values published by the COBE team, but with the $\sigma^2$ noise variance increased to 1000 mK$^2$ in the galactic cut region, a numerically large value that exceeds any other variance in the problem.

FIG. 5: 2-D marginalized posterior densities. Each plot shows the full joint posterior of the data, integrated over all dimensions except for the two shown. From bottom left anti-clockwise: $P(C_2, C_3)$, $P(C_2, C_4)$, $P(C_8, C_{12})$, and $P(C_6, C_{10})$. The latter two were chosen because these Cℓ pairs were maximally anti-correlated and correlated, respectively.

For the first iteration of the Gibbs sampler we choose

$$C_0 = C_1 = 10^{-30}\,\mathrm{mK}^2, \qquad C_\ell = \frac{10^{-4}}{\ell(\ell+1)}\,\mathrm{mK}^2. \tag{24}$$

We chose these values because they very roughly approximate the true Cℓ values to reduce burn-in time. The first two are numerically small, because we consider the monopole and dipole to be non-cosmological. During the Cℓ estimation step of the Gibbs sampler, the $C_0$ and $C_1$ values are not changed. This corresponds to enforcing the prior that the cosmological signal does not contain such components.
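The initialization and the fixed monopole and dipole entries can be sketched as follows, reading Eq. (24) as $C_\ell = 10^{-4}/[\ell(\ell+1)]$ mK$^2$ for ℓ ≥ 2. The function names and the array layout (index equals ℓ) are hypothetical, and the draw of new Cℓ values is assumed to happen elsewhere.

```python
import numpy as np

def initial_power_spectrum(lmax):
    """Starting C_l in mK^2, indexed by l = 0 .. lmax."""
    ell = np.arange(lmax + 1)
    cl = np.empty(lmax + 1)
    cl[:2] = 1e-30                                   # monopole and dipole: effectively no signal
    cl[2:] = 1e-4 / (ell[2:] * (ell[2:] + 1.0))      # rough guess for l >= 2
    return cl

def update_power_spectrum(cl_current, cl_draw):
    """Accept a new C_l draw but leave C_0 and C_1 at their fixed values."""
    cl_new = cl_draw.copy()
    cl_new[:2] = cl_current[:2]
    return cl_new

cl0 = initial_power_spectrum(lmax=50)
cl1 = update_power_spectrum(cl0, np.full(51, 2e-5))  # placeholder draw for illustration
```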

The Gibbs sampler is run through 10,000 iterations (sets of Cℓ values). This takes approximately 24 hours on an Athlon XP1800+ workstation. We ignore the first 1000 iterations to ensure that the Gibbs sampler has converged to the true distribution. This is very conservative; in fact, by computing correlations of our Cℓ draws along the chain we infer that about every 20th sample is uncorrelated.

We plot the power spectrum in figure 2. For each ℓ value, we display vertically a binned representation of the marginalized posterior densities P(Cℓ|m). The bins all hold an equal number of points. The bins that are thinnest (points are densest in Cℓ space) are colored more darkly. The top 68% are dark gray; from 68% to 95% are lighter gray, and the rest are white. The highest density bin is shown in black.
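The chain post-processing described above (discarding burn-in, estimating the correlation length, and building equal-count bins) might look as follows for a single ℓ. This is a sketch on a synthetic correlated chain, not the COBE-DMR chain; the 0.05 correlation threshold, the bin count and all function names are hypothetical.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Normalized autocorrelation of a 1-D chain for lags 0 .. max_lag."""
    x = chain - chain.mean()
    norm = x @ x
    return np.array([1.0] + [(x[:-k] @ x[k:]) / norm for k in range(1, max_lag + 1)])

def thinning_step(rho, threshold=0.05):
    """First lag at which the autocorrelation drops below the threshold."""
    below = np.nonzero(rho < threshold)[0]
    return int(below[0]) if below.size else len(rho)

def equal_count_bin_edges(samples, n_bins):
    """Bin edges such that every bin holds (nearly) the same number of samples."""
    return np.quantile(samples, np.linspace(0.0, 1.0, n_bins + 1))

# Synthetic stand-in for the C_l draws at one l: a correlated, skewed chain.
rng = np.random.default_rng(2)
n_iter, burn_in = 10_000, 1_000
z = np.empty(n_iter)
z[0] = 5.0                                   # deliberately poor starting point
for i in range(1, n_iter):
    z[i] = 0.8 * z[i - 1] + rng.standard_normal()
chain = np.exp(0.3 * z)

kept = chain[burn_in:]                       # discard burn-in
rho = autocorrelation(kept, max_lag=100)
step = thinning_step(rho)
thinned = kept[::step]                       # approximately uncorrelated samples

edges = equal_count_bin_edges(kept, n_bins=20)
counts, _ = np.histogram(kept, bins=edges)   # narrow bins mark high posterior density
```

Because every bin holds the same number of samples, the posterior density in a bin is inversely proportional to its width.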

To explore the marginalized posterior Cℓ distribution in more detail we plot their histograms in Figure 3. It is noteworthy that not a single one of these is even nearly Gaussian. Within the context of the discussion of the lack of large scale power in the CMB, it is worth pointing out that all inferences about $C_2$ from COBE-DMR can be based on the $P(C_2|d)$ shown here.

FIG. 6: Reconstructed signal maps in Galactic coordinates. A: The signal component of the COBE-DMR data marginalized over the power spectrum: $\langle x \rangle|_{P(x|m)}$. This is a generalized Wiener filter which does not require knowing the signal covariance a priori. B: The solution y of Eq. (11) at one Gibbs iteration. C: The sample pure signal sky s = x + y at the same iteration (band-limited at $\ell_{\max} = 50$). D: The WMAP internal linear combination map smoothed to an FWHM of 5 degrees. The corresponding features in parts A and D are clearly visible. Note that in this map low galactic latitudes are not masked, which leads to some artifacts that are not visible in the masked COBE-DMR data.

The correlation structure of the estimates contains information about how well we were able to account for the effects of the galactic cut. It is clear from figure 4 that the residual correlations are at most of order 10% even at very small ℓ.
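A correlation matrix of this kind, with the diagonal set to zero as in figure 4, can be estimated directly from the chain. The sketch below assumes the Cℓ draws are stored as an array with one row per Gibbs iteration and one column per ℓ; the synthetic samples and the function name are hypothetical.

```python
import numpy as np

def offdiagonal_correlation(cl_samples):
    """Correlation matrix of C_l draws (rows: iterations, columns: l), diagonal zeroed."""
    corr = np.corrcoef(cl_samples, rowvar=False)
    np.fill_diagonal(corr, 0.0)      # suppress the trivial unit diagonal
    return corr

# Synthetic, skewed samples with weak built-in correlations for illustration.
rng = np.random.default_rng(3)
n_samples, n_ell = 5000, 6
base = rng.standard_normal((n_samples, n_ell))
base[:, 1] += 0.1 * base[:, 0]       # weak positive coupling
base[:, 3] -= 0.15 * base[:, 2]      # weak negative coupling
cl_samples = np.exp(0.5 * base)

corr = offdiagonal_correlation(cl_samples)
```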

However, since the posterior densities are non-Gaussian, the two-point correlations do not contain all the information. We therefore show the marginalized posteriors for four pairs of Cℓs in figure 5. Again, all four of these densities are strongly non-Gaussian.
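With samples from the joint posterior in hand, marginalizing over all but two dimensions amounts to histogramming the two columns of interest and ignoring the rest. A minimal sketch, using synthetic skewed samples in place of the actual chain (the function name and bin count are hypothetical):

```python
import numpy as np

def marginal_2d(samples, i, j, bins=30):
    """Normalized 2-D histogram of columns i and j; all other columns are integrated out."""
    density, edges_i, edges_j = np.histogram2d(
        samples[:, i], samples[:, j], bins=bins, density=True
    )
    return density, edges_i, edges_j

# Synthetic stand-in for the chain: positive, heavy-tailed draws in four dimensions.
rng = np.random.default_rng(4)
samples = 1.0 / rng.gamma(shape=3.0, scale=1.0, size=(20000, 4))

density, edges_i, edges_j = marginal_2d(samples, 0, 1)
cell_area = np.outer(np.diff(edges_i), np.diff(edges_j))
total_probability = (density * cell_area).sum()
```

No numerical integration over the remaining dimensions is required, which is what makes these projections affordable.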

Lastly, we show the reconstructed signals. Figure 6A shows the expectation of the signal component (the solution of Eq. (10) at each iteration of the Gibbs sampler) of the COBE-DMR data with respect to the posterior density marginalized over the power spectrum: $\langle x \rangle|_{P(x|m)}$. This is a generalized Wiener filter (GWF) which does not require knowing the signal covariance a priori. The smoothing of the map automatically adapts locally depending on how much detail the data support. The more strongly smoothed central horizontal band was obscured by the galaxy. Still the GWF reconstructs large scale modes in the galactic cut.

The power spectrum of figure 6A would be biased low, since the Wiener filter removes everything that could be noise. At each iteration of the Gibbs sampler the solution to Eq. (11) (shown in figure 6B) adds in a fluctuating term that replaces filtered noise with synthetic signal. It is noticeable that this fill-in signal is larger in the regions of the map where the Galaxy obscures the CMB. The resulting draw s from Eq. (6) is shown in figure 6C. Every s draw is one possible pure signal sky that could have given rise to the data. Since we know that the COBE data has no statistical power above an $\ell_{\max}$ of about 20, we imposed a bandlimit of $\ell_{\max} = 50$.
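The interplay between the Wiener-filtered mean x and the fluctuation term y can be illustrated in a toy setting where both S and N are diagonal in the same basis. This is a simplifying assumption that does not hold for the real problem, where Eq. (10) and Eq. (11) must be solved iteratively, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pix = 20000
S = 1.0                               # signal variance, identical for every element
N = np.full(n_pix, 0.5)               # noise variance outside the cut
N[: n_pix // 4] = 1e3                 # "galactic cut": numerically large variance

s_true = rng.normal(0.0, np.sqrt(S), n_pix)
d = s_true + rng.normal(0.0, np.sqrt(N))

x = S / (S + N) * d                               # Wiener-filtered mean
y = rng.normal(0.0, np.sqrt(S * N / (S + N)))     # fluctuation with variance (1/S + 1/N)^-1
s = x + y                                         # one draw of the pure signal

power_x = np.mean(x**2)               # biased low relative to S
power_s = np.mean(s**2)               # statistically consistent with S
fill_in_cut = np.mean(y[: n_pix // 4] ** 2)
fill_in_clean = np.mean(y[n_pix // 4:] ** 2)
```

In this toy the variance of x is $S^2/(S+N)$ and that of y is $SN/(S+N)$, so their sum is S. Inside the cut the data carry almost no weight and nearly all of the signal variance comes from y.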

For comparison with the inferences we draw from the COBE-DMR data, we show in figure 6D the internal linear combination map from the WMAP satellite [3] smoothed down to five degrees FWHM, an intermediate scale between the slightly larger average smoothing of panel A and the somewhat smaller smoothing of panels B and C due to the bandlimit of $\ell_{\max} = 50$. Nearly every hot and cold spot that is identified by the GWF can be found in the high signal-to-noise WMAP data. Figure 6C fills in signal very plausibly up to the imposed bandlimit. Even more striking is the similarity of our figure 6A to the combination of Q and V band WMAP data shown in figure 8 of [3], which is intended to mimic the COBE-DMR 53 GHz map.

VII. CONCLUSIONS AND FUTURE WORK

We have described a framework for global and lossless analysis of cosmic microwave background data. This framework is based on a Bayesian analysis of CMB data. It has several advantages compared to traditional methods. It is computationally feasible. It is optimal and exact under the assumption of Gaussian fields and the ability to encode our prior knowledge about foreground components in terms of multivariate Gaussian densities. It uses controlled approximations (e.g. the number of samples of the Gibbs sampler controls the accuracy of the result but this can be increased by spending more computing time). It allows joint analysis of the CMB signal, foregrounds and noise properties of the instrument, while modeling and exploiting the statistical dependence between these different inferences.

Traditional methods of inference from CMB data divide the data analysis into several steps: map-making from TOD, component separation, power spectrum estimation from the CMB signal and cosmological parameter estimation. Our method allows treating all these inferences jointly and self-consistently, if desired. The traditional results can be understood as special cases of our method for certain uninformative prior choices. For example, pure map-making could be viewed as applying this framework with $P(s|C_\ell)P(f)P(C_\ell) = \mathrm{const}$.

In spite of this generality, the framework for analyzing CMB data described here is very modular: the structure of the Gibbs sampling scheme separates the different steps of the inference process, focusing on each component in turn. The framework described here therefore holds the promise of making more data analysis steps part of a self-consistent framework rather than sequential stages in a data pipeline.

Our method turns out to give an unbiased Wiener filter and generalizes the global filtering and reconstruction methods in [25] to include power spectrum estimation, obviating the need for a priori knowledge of the signal covariance.

We require the use of iterative techniques to solve the most computationally demanding step in this method. We find that our simple-minded preconditioned gradient iteration works well over 3 orders of magnitude in problem size. It remains to be studied whether other preconditioners may be even more effective (e.g. [28]).

We applied our formalism to the well-studied COBE-DMR data set. We demonstrate that our methods enable new analyses for even such a small data set. We quote posterior densities for individual Cℓ as well as posterior densities for pairs of Cℓ as examples of results that would be prohibitively expensive to obtain with traditional algorithms. Our results are consistent with the most sophisticated brute force $O(n_p^3)$ analyses available in the literature.

The approach can be extended straightforwardly to polarized maps, data that spans different frequency bands and joint estimation of different data sets. There is nothing that prevents the application of these ideas to random fields on manifolds other than the sphere, such as one-, two- or three-dimensional Euclidean space. We are investigating the formalism for joint inference from CMB data about the power spectrum and map of the pure CMB sky with the power spectrum and map of the projected gravitational potential. We will report on these developments in a future publication.

Acknowledgments

We thank I. O’Dwyer, J. Jewell and the members of the Planck CTP working group for comments and stimulating conversations. This work was supported in part by the University of Illinois at Urbana-Champaign and the NCSA/UIUC Faculty Fellowship program.

APPENDIX A: ANALYTIC MARGINALIZATION OVER FOREGROUNDS

From a statistical point of view we consider the a posteriori distribution to be a function of the CMB signal, Cℓ and the foregrounds. Then we marginalize over the foregrounds f. This can be done either implicitly through Gibbs sampling from the joint posterior density including f (as described in the main text) and then marginalizing over f in the generated samples or explicitly through analytic marginalization of the posterior over f. Then the ignorance about the foreground is included in additional terms in the noise covariance matrix. The first route is more general, but there may be occasions where the second is preferable; for example if the main goal is to make the CMB analysis insensitive to a small number of known foreground templates $f_i$.

The effect of analytic marginalization is that the new noise covariance matrix $N'$ including the foreground term becomes

$$N' \equiv N + \sigma_f^2 F F^T \equiv N + \sigma_f^2 \sum_i f_i f_i^T. \tag{A1}$$

In order to implement the Gibbs sampler including this new term we need to be able to apply $N'^{-1}$ to vectors.

If only a small number (up to ∼ 1,000) of foreground templates need to be projected out this can best be done by grouping all the vectors using the following limit of the Sherman-Morrison-Woodbury formula [25]

$$N'^{-1} \equiv \lim_{\sigma_f^2 \to \infty} \left( N + \sigma_f^2 F F^T \right)^{-1} = N^{-1} - \left[ N^{-1} F \left( F^T N^{-1} F \right)^{-1} F^T N^{-1} \right]. \tag{A2}$$

This operation will project out the directions in $N^{-1}$ corresponding to the foreground contributions. The action of this new inverse noise covariance matrix on a vector can be computed using methods similar to those described in Eqs. (2.7.16ff) in [27].
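The limit in Eq. (A2) can be checked numerically on a small dense example. The sketch below uses a hypothetical diagonal N and two hypothetical templates (a constant and a linear ramp) as the columns of F; it compares the projected inverse with a brute-force inverse at large but finite $\sigma_f^2$.

```python
import numpy as np

def projected_inverse(N_inv, F):
    """N^-1 - N^-1 F (F^T N^-1 F)^-1 F^T N^-1 for symmetric N_inv, as in Eq. (A2)."""
    NiF = N_inv @ F
    return N_inv - NiF @ np.linalg.solve(F.T @ NiF, NiF.T)

n = 8
N = np.diag(np.linspace(1.0, 2.0, n))                          # hypothetical noise covariance
F = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])   # hypothetical templates

N_prime_inv = projected_inverse(np.linalg.inv(N), F)

# Brute-force inverse with a numerically large foreground variance.
sigma_f2 = 1e6
brute_force = np.linalg.inv(N + sigma_f2 * F @ F.T)

# The templates are annihilated: N'^-1 F = 0.
residual = N_prime_inv @ F
```

Only a small system of the size of the number of templates has to be solved, which is why this route is attractive when few templates are involved.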

Alternatively one can solve iteratively the set of equations

$$\left( N + \sigma_f^2 F F^T \right) x = v \tag{A3}$$

every time $x = N'^{-1} v$ is required on the LHS of Eq. (10) and Eq. (11). When $N'^{-1/2}$ is required for the RHS of Eq. (11) we solve

$$\left( N + \sigma_f^2 F F^T \right) y = N^{1/2} \xi + \sigma_f F \chi. \tag{A4}$$


The term $N^{1/2} \xi$ is obtained by simulating a noise-only map solving Eq. (2) with $d = n_{\rm tod}$. In both of these equations one can choose $\sigma_f$ numerically large.
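That the solution y of Eq. (A4) has the required covariance can be verified with dense matrices, assuming ξ and χ are independent unit-variance Gaussian vectors. Writing $B = [N^{1/2}, \sigma_f F]$, the right-hand side has covariance $B B^T = N + \sigma_f^2 F F^T$, so y has covariance $(N + \sigma_f^2 F F^T)^{-1}$. The matrices below are hypothetical toy choices.

```python
import numpy as np

n = 8
N = np.diag(np.linspace(1.0, 2.0, n))                          # hypothetical noise covariance
F = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])   # hypothetical templates
sigma_f = 1e3                                                  # numerically large

A = N + sigma_f**2 * F @ F.T         # coefficient matrix of Eq. (A3) and Eq. (A4)
N_half = np.sqrt(N)                  # elementwise sqrt is the matrix sqrt for diagonal N
B = np.hstack([N_half, sigma_f * F])

# Covariance of y = A^-1 (N^(1/2) xi + sigma_f F chi) for unit-variance xi, chi.
A_inv = np.linalg.inv(A)
cov_y = A_inv @ B @ B.T @ A_inv

def draw_y(rng):
    """One draw of y by solving Eq. (A4) with fresh random vectors."""
    xi = rng.standard_normal(n)
    chi = rng.standard_normal(F.shape[1])
    return np.linalg.solve(A, N_half @ xi + sigma_f * F @ chi)

y = draw_y(np.random.default_rng(5))
```

For large problems the dense solve would be replaced by an iterative one, as in the main text.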

However, the method in the main text is more flexible, since it allows grouping different foregrounds together in ways that are computationally convenient.

[1] W. Hu and S. Dodelson, Ann. Rev. Astron. Astrophys. 40, 171 (2002)

[2] A. D. Miller et al., Astrophys. J. 524, L1-L4 (1999); N. W. Halverson et al., astro-ph/0104489 (2001); S. Hanany et al., Ap. J. 545, L5 (2000); K. Grainge et al., Mon. Not. R. Astron. Soc. 000, 15 (2002); C. L. Kuo et al., Ap. J., astro-ph/0212289 (2002); J. E. Ruhl et al., astro-ph/0212229 (2002); S. Padin et al., Ap. J. 549, L1 (2001)

[3] C. L. Bennett et al., astro-ph/0302208, Ap. J., in press.

[4] J. R. Bond and G. Efstathiou, MNRAS 226, 655 (1987)

[5] M. Tegmark, Phys. Rev. D 55, 5895 (1997)

[6] J. R. Bond, A. H. Jaffe, and L. Knox, Physical Review D 57, 2117 (1998)

[7] K. M. Gorski, E. Hivon, B. D. Wandelt, Evolution of large scale structure: from recombination to Garching, edited by A. J. Banday, R. K. Sheth, L. N. da Costa. Garching, Germany: European Southern Observatory, 37 (1999)

[8] J. R. Bond, R. Crittenden, A. Jaffe, L. Knox, Comput. Sci. Eng. 1, 21 (1999)

[9] J. Borrill, Phys. Rev. D 59, 027302 (1999)

[10] S. P. Oh, D. N. Spergel, G. Hinshaw, Ap. J. 510, 551 (1999)

[11] B. D. Wandelt, F. Hansen, astro-ph/0106515, Phys. Rev. D 67, 023001 (2003)

[12] B. D. Wandelt, E. Hivon, and K. M. Gorski, Phys. Rev. D 64, 083003 (2001)

[13] E. Hivon et al., Ap. J. 567, 2 (2002)

[14] Szapudi et al., Ap. J. 548, L115 (2001)

[15] G. P. Efstathiou, astro-ph/0307515, submitted to MNRAS.

[16] G. P. Efstathiou, astro-ph/0306431, submitted to MNRAS.

[17] J. Jewell, S. Levin, and C. H. Anderson, astro-ph/0209560, submitted to Astrophysical Journal.

[18] J. R. Bond, A. H. Jaffe, and L. Knox, Astrophysical Journal 533, 19 (2000)

[19] J. Bartlett et al., Astrophysical Letters and Communications 37, 321 (2000)

[20] L. Verde et al., astro-ph/0302218, Ap. J., in press (2003).

[21] L. Knox, Phys. Rev. D 52, 4307 (1995)

[22] K. M. Gorski et al., Ap. J. 464, L11 (1996)

[23] K. M. Gorski, Cosmic Microwave Background Anisotropy in the COBE DMR 4-yr Sky Maps, Proceedings of the XXXIst Recontres de Moriond, ’Microwave Background Anisotropies’, astro-ph/9701191 (1998)

[24] C. L. Bennett et al., Ap. J. 464, L1 (1996)

[25] G. B. Rybicki and W. H. Press, Astrophysical Journal 398, 169 (1992)

[26] Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd Edition. Springer Verlag, Heidelberg, Germany. (1996)

[27] W. H. Press et al., Numerical Recipes. Cambridge University Press, Cambridge, England. (1992)

[28] U. L. Pen, astro-ph/0304513, MNRAS in press.

[29] An exception to this classification is a hybrid method which has been proposed very recently and which combines a maximum likelihood approach on large scales with an approximate approach on small scales [15].

[30] See however [28] for fast numerical techniques that were applied successfully to weak lensing data.

[31] We write P(Cℓ|d) as shorthand for the multivariate posterior density, a function of {Cℓ : ℓ = 1, . . . , ℓmax}.

[32] Jeffreys’ ignorance prior is constructed by requiring that the probability measure P(Cℓ)dCℓ be invariant under transformations which leave our prior knowledge about the parameter invariant. In the case of power spectrum estimation (which is essentially a variance measurement) we are estimating a positive semi-definite scale parameter Cℓ. Our a priori ignorance about the scale implies that the measure must be invariant under scale transformations. This is uniquely satisfied if P(Cℓ) ∝ 1/Cℓ.
