Massive Science with VO & Grids


Bob Nichol

ICG, Portsmouth

Thanks to all my colleagues in AG2, VOtech, & PiCA

Special thanks to Chris Miller, Alex Gray, Gauri Kulkarni, Garry Smith, Brent Bryan, Chris Genovese, Jeff Schneider

ADASS 2005 2

Outline

1. VO + Grid provides a powerful emerging infrastructure for massive scientific calculations

2. Discussion of VO infrastructure and VOtechbroker

3. Examples:
  • N-point correlation functions
  • Nonparametric analyses and massive model fitting


Euro-VO

• The Euro-VO Data Centre Alliance (DCA):
  – A collaborative and operational network of European data centres that publish data and metadata to the Euro-VO and provide a research infrastructure of GRID-enabled processing and storage facilities.

• The Euro-VO Facility Centre (VOFC):
  – An organization that provides the Euro-VO with a centralized resource registry, standards definition and promotion, as well as community support for VO technology take-up and scientific program support using VO technologies and resources.

• The Euro-VO Technology Centre (VOTC):
  – A distributed organisation that coordinates a set of research and development projects on VO technology, systems and tools.


EuroVO: VOTech Project

• Aims:
  – Complete all technical preparatory work necessary for the construction of the European Virtual Observatory.
  – Responsible for development of infrastructure and tools:
    • Intelligent resource discovery (ontology and the semantic web), data interoperability, data mining, and visualisation capabilities.
  – Provide the ability to offload large-scale computational processes onto the Enabling Grids for E-sciencE (EGEE) backbone.


Existing infrastructure

• VOTech is tasked with building upon existing infrastructure:
  – IVOA for standards
  – AstroGrid for middleware
    • Web Services based

• Presumably IVOA will continue to look towards other standards bodies:
  – World Wide Web Consortium (W3C)
  – Global Grid Forum (GGF)


IVOA Standards

• VOTable Format Definition Version 1.1:
  – An XML language.
  – Flexible storage and exchange format for tabular data, with emphasis on astronomical tables.
  – Allows metadata and data to be stored separately, with links to remote data.
• Resource Metadata for the VO Version 1.01:
  – For describing what data and computational facilities are available and, once identified, how to use them.
• Unified Content Descriptor (UCD) (Proposed):
  – A formal (and restricted) vocabulary for astronomical data.
• IVOA Identifiers Version 1.10 (Proposed):
  – Syntax for globally unique resource names.
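As a rough illustration of the VOTable layout described above (FIELD elements carrying the metadata, TABLEDATA rows carrying the values), the sketch below reads a tiny inline table using only the Python standard library. The table name, column names, UCD strings and values are made up for the example, and the reader assumes every column is numeric; a real client would use a full VOTable library.

```python
import xml.etree.ElementTree as ET

# Namespace used by VOTable 1.1 documents.
NS = {"v": "http://www.ivoa.net/xml/VOTable/v1.1"}

# Hypothetical two-row table; names, UCDs and values are illustrative only.
EXAMPLE_VOTABLE = """
<VOTABLE version="1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
  <RESOURCE>
    <TABLE name="lrg_sample">
      <FIELD name="ra" datatype="double" unit="deg" ucd="pos.eq.ra"/>
      <FIELD name="dec" datatype="double" unit="deg" ucd="pos.eq.dec"/>
      <FIELD name="z" datatype="double" ucd="src.redshift"/>
      <DATA><TABLEDATA>
        <TR><TD>180.1</TD><TD>2.5</TD><TD>0.31</TD></TR>
        <TR><TD>181.7</TD><TD>3.0</TD><TD>0.42</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def read_votable(text):
    """Return the first table as a list of {column name: float} rows."""
    root = ET.fromstring(text.strip())
    table = root.find(".//v:TABLE", NS)
    # Column metadata lives in FIELD elements, separate from the data rows.
    names = [field.get("name") for field in table.findall("v:FIELD", NS)]
    rows = []
    for tr in table.findall("v:DATA/v:TABLEDATA/v:TR", NS):
        values = [float(td.text) for td in tr.findall("v:TD", NS)]
        rows.append(dict(zip(names, values)))
    return rows
```

Calling `read_votable(EXAMPLE_VOTABLE)` yields one dictionary per TR row, keyed by the FIELD names.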


AstroGrid Components

• MySpace: Distributed file store for workflows, results.
• Common Execution Architecture (CEA):
  – Codes need wrapping before use.
  – Takes command-line apps and presents them as a Web Service.
• Algorithm Registry:
  – Metadata from wrapped codes are published in a yellow pages, for searching.
• Portal:
  – Web interface for interacting with preceding services.
  – Workflow: Coordinate data flow/control of components within a larger system of work.
  – Submit jobs and observe status, and access files in MySpace.
• Dashboard/Workbench:
  – Interact with MySpace, Registry, CEA from any language that provides an XML-RPC library. Web Start application.


AG Portal

AG rollout and access via portal


VOtechbroker

• Execute potentially thousands of sequential processes simultaneously, repeat multiple times:
  – Parameter sweep.
• Utilise existing infrastructure at remote sites:
  – e.g. computational resources: Condor, Globus.
  – Transparent to the user.
• Locate suitable compute nodes (e.g. processor type, available libraries, CPU load, memory).
• Stage code and observe status of running processes.
• Combine results for further analysis, e.g. as input to a post-mortem visualisation component in the AG workflow.
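To make the parameter-sweep idea concrete, here is a minimal local sketch: it expands a grid of parameter values into independent jobs, runs them concurrently, and collects the results. It does not talk to Condor, Globus or the broker itself; `run_job` is a hypothetical stand-in for a wrapped sequential code, and the parameter names are illustrative.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def build_sweep(grid):
    """Expand {name: [values]} into one parameter dict per job."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def run_job(params):
    """Hypothetical stand-in for one sequential process on a remote node."""
    return {"params": params, "result": params["r_min"] + params["r_max"]}


def run_sweep(grid, max_workers=4):
    """Run every job in the sweep and gather results in submission order."""
    jobs = build_sweep(grid)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, jobs))
```

For example, `run_sweep({"r_min": [0.5, 1.5], "r_max": [1.5, 2.5, 10.5]})` produces six independent jobs. In a real deployment the thread pool would be replaced by whichever job submission system the broker plugs into.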


Architecture

Job Submission Description Language (JSDL) from the Global Grid Forum

[Architecture diagram; component labels: London eScience Center, AstroGrid, GGF]


Web form


Broker Summary

• A broker to submit parameter sweeps to the Grid, and other distributed resources, in a transparent way.
• Aim to allow arbitrary algorithms to be added easily, with just a new web form.
• Aim to interoperate with a wide range of job submission systems using a plug-in system.
• Distributed architecture based on Web Services, which allows for multiple types of client.
• AstroGrid workflow important:
  – CEA command line to thin algorithm wrappers.
  – Wrapper and Broker interaction with MySpace.
• X.509 and MyProxy for authentication/authorisation.
• Ready for full-scale testing: n-point functions.

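N-point Correlation Functions

The 2-point function ($\xi(r)$) has a long history in cosmology (Peebles 1980). It is the excess joint probability ($dP_{12}$) of a pair of points over that expected from a Poisson process.

$dP_{12} = n^2 \, dV_1 \, dV_2 \, [1 + \xi(r)]$

$dP_{123} = n^3 \, dV_1 \, dV_2 \, dV_3 \, [1 + \xi_{23}(r) + \xi_{13}(r) + \xi_{12}(r) + \xi_{123}(r)]$

[Diagram: two volume elements $dV_1$ and $dV_2$ separated by a distance $r$]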

Same 2pt, different 3pt

Measure of the topology of the large-scale structure in the universe

Credit: Alex Szalay

Multi-resolutional KD-trees

• Scale to n-dimensions
• Use a cached representation (store summary sufficient statistics at each node) and compute counts from these statistics
• Prune the tree, which is stored in memory
• (Moore et al. 2001 astro-ph/0012333)

[Figure panels: top level, 1st level, 2nd level and 5th level of the tree]

Just a set of range searches

• Prune cells outside range
• Also prune cells inside! Greater saving in time
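The two pruning rules can be sketched with a small kd-tree that caches a point count and bounding box at every node. This is an illustrative toy, not the NPT code: the `Node` class, the leaf size and the decision to keep a copy of the points at every node are simplifying assumptions.

```python
import numpy as np


class Node:
    """kd-tree node caching its point count and bounding box."""

    def __init__(self, points, leaf_size=8, depth=0):
        self.points = points              # kept at every node for brevity
        self.count = len(points)          # cached sufficient statistic
        self.lo = points.min(axis=0)      # bounding-box corners
        self.hi = points.max(axis=0)
        self.left = self.right = None
        if self.count > leaf_size:
            axis = depth % points.shape[1]
            order = np.argsort(points[:, axis])
            half = self.count // 2
            self.left = Node(points[order[:half]], leaf_size, depth + 1)
            self.right = Node(points[order[half:]], leaf_size, depth + 1)


def range_count(node, centre, radius):
    """Count points within `radius` of `centre`, pruning whole cells."""
    nearest = np.clip(centre, node.lo, node.hi)
    d_min = np.linalg.norm(centre - nearest)
    far_corner = np.where(
        np.abs(centre - node.lo) > np.abs(centre - node.hi), node.lo, node.hi
    )
    d_max = np.linalg.norm(centre - far_corner)
    if d_min > radius:
        return 0                          # cell entirely outside the range
    if d_max <= radius:
        return node.count                 # cell entirely inside: cached count
    if node.left is None:                 # leaf cutting the boundary
        dist = np.linalg.norm(node.points - centre, axis=1)
        return int(np.sum(dist <= radius))
    return (range_count(node.left, centre, radius)
            + range_count(node.right, centre, radius))
```

Only cells that cut the boundary of the search sphere are ever opened; everything else is answered from the cached counts.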


Dual Tree Algorithm

[Diagram: two tree nodes containing N1 and N2 points, with minimum and maximum node-to-node separations dmin and dmax]

Usually binned into annuli: rmin < r < rmax

Thus, for each r, traverse both trees and prune pairs of nodes:

• No count: dmin > rmax or dmax < rmin
• Count all N1 x N2 pairs: rmin < dmin and dmax < rmax

Therefore, only need to calculate pairs cutting the boundaries.

Scales to n-point functions; can also do all r values at once.
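A minimal sketch of dual-tree pair counting follows, assuming simple bounding-box kd-trees. Node pairs whose separations lie wholly outside the bin are skipped, pairs lying wholly inside contribute N1 x N2 at once, and only pairs cutting a bin boundary are opened further. The `Node` class and leaf size are illustrative choices, not the NPT implementation.

```python
import numpy as np


class Node:
    """kd-tree node caching its point count and bounding box."""

    def __init__(self, points, leaf_size=16, depth=0):
        self.points, self.count = points, len(points)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
        self.left = self.right = None
        if self.count > leaf_size:
            axis = depth % points.shape[1]
            order = np.argsort(points[:, axis])
            half = self.count // 2
            self.left = Node(points[order[:half]], leaf_size, depth + 1)
            self.right = Node(points[order[half:]], leaf_size, depth + 1)


def separation_bounds(a, b):
    """Smallest and largest possible distance between points of a and b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)


def dual_tree_count(a, b, r_min, r_max):
    """Count pairs (p in a, q in b) with r_min <= |p - q| < r_max."""
    d_min, d_max = separation_bounds(a, b)
    if d_min >= r_max or d_max < r_min:
        return 0                          # no pair can fall in the bin
    if r_min <= d_min and d_max < r_max:
        return a.count * b.count          # every pair falls in the bin
    if a.left is None and b.left is None:  # two leaves: count directly
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=2)
        return int(np.sum((d >= r_min) & (d < r_max)))
    # Otherwise open the larger node that still has children.
    if b.left is None or (a.left is not None and a.count >= b.count):
        return (dual_tree_count(a.left, b, r_min, r_max)
                + dual_tree_count(a.right, b, r_min, r_max))
    return (dual_tree_count(a, b.left, r_min, r_max)
            + dual_tree_count(a, b.right, r_min, r_max))
```

Passing a data tree and a random-catalogue tree gives a cross (DR-style) count; the same recursion extends to triplets by carrying three nodes instead of two.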


3-point Correlation Function of SDSS Luminous Red Galaxies

Details of the dataset:

0.15 < z < 0.55

-23.2 < Mg < -21.2

(~50000 LRGs from Eisenstein et al. 2005)

Each bin can be 100’s of individual calculations (errors)


Employing npt on TeraGrid - I

• Scale of computing npt:
  – For the value of the 2-point correlation function within any given bin, we need 3 types of pair counts (DD, DR and RR), while for the value of the 3-point correlation function we need 4 types of triplet counts (DDD, DDR, DRR, RRR).
  – The memory requirement depends upon the size of the dataset and random catalog. For ~50,000 LRGs and a random dataset of ~800,000, the NPT code makes a tree of ~50MB.
  – Each bin requires an error estimate, which can mean 30 jack-knifes. Therefore, each bin can be hundreds of individual jobs, each of which can be sent to a separate node.
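The slides do not say which estimator combines the DD, DR and RR counts. The sketch below uses the widely used Landy-Szalay form as an assumption, together with the standard delete-one jackknife error formula that would be applied to the per-resample estimates; the function names are illustrative.

```python
import numpy as np


def landy_szalay(dd, dr, rr, n_data, n_random):
    """2-point estimate for one bin from raw pair counts (assumed estimator)."""
    dd_norm = dd / (n_data * (n_data - 1) / 2)      # data-data pairs
    dr_norm = dr / (n_data * n_random)              # data-random pairs
    rr_norm = rr / (n_random * (n_random - 1) / 2)  # random-random pairs
    return (dd_norm - 2 * dr_norm + rr_norm) / rr_norm


def jackknife_error(estimates):
    """Delete-one jackknife standard error from per-resample estimates."""
    estimates = np.asarray(estimates, dtype=float)
    n = len(estimates)
    return float(np.sqrt((n - 1) / n * np.sum((estimates - estimates.mean()) ** 2)))
```

With 30 jackknife resamples and four triplet-count types per resample, a single 3-point bin already corresponds to 120 independent counting jobs, which is why each bin maps naturally onto many Grid nodes.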


Employing npt on TeraGrid - II

[Figure: time taken on TeraGrid to compute DDD triplets for LRG data. Axes: time (sec) against angle (radian). Panels: 0.5 < s < 1.5 Mpc/h; 1.5 < s < 2.5 Mpc/h; 9.5 < s < 10.5 Mpc/h; 19.5 < s < 20.0 Mpc/h]

[Figure: time taken on TeraGrid to compute RRR triplets. Same axes and panels]


Employing npt on TeraGrid - III

• Limitations:
  – Long queue time (sometimes stretching to 6 hours).
  – After 24 hrs, jobs are terminated, so bigger datasets need to be processed on a different cluster.


Non-parametric techniques

• The complexity and wealth of the data demand non-parametric techniques, i.e., can one describe phenomena using the fewest assumptions?


CMB Power Spectrum

[Figure panels: before WMAP; WMAP data]

Are the 2nd and 3rd peaks detected?


In parametric models of the CMB power spectrum the answer is likely “yes”, as all CMB models have multiple peaks. But that has not really answered our question!

Can we answer the question non-parametrically? e.g.,

$Y_i = f(X_i) + c_i$

where $Y_i$ is the observed data, $f(X_i)$ is an orthogonal function ($\beta_i \cos(i \pi X_i)$), and $c_i$ is the covariance matrix. The challenge is to “shrink” $f(X_i)$; we use:

• Beran (2000) to shrink $f(X_i)$ to N terms equal to the number of data points: optimal for all smooth functions and provides valid confidence intervals

• Monotonic shrinkage of $\beta_i$, specifically nested subset selection (NSS)

See Genovese et al. (2004) astro-ph/0410104
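A toy version of nested subset selection can be sketched as follows. It assumes equally spaced points on [0, 1], a cosine basis of the form sqrt(2) cos(pi j x), and independent noise of known constant size `sigma`. The analysis in the talk uses a full covariance, so this is a simplification. The cutoff is chosen with a Stein-type unbiased risk estimate, which is an assumed stand-in for the procedure in the cited paper.

```python
import numpy as np


def cosine_basis(x, n_terms):
    """Columns are phi_0 = 1 and phi_j = sqrt(2) cos(pi j x), j >= 1."""
    j = np.arange(n_terms)
    basis = np.sqrt(2.0) * np.cos(np.pi * j[None, :] * x[:, None])
    basis[:, 0] = 1.0
    return basis


def nss_fit(x, y, sigma):
    """Keep the first J cosine coefficients, with J minimising estimated risk."""
    n = len(x)
    basis = cosine_basis(x, n)
    z = basis.T @ y / n                   # empirical coefficients
    noise = sigma**2 / n                  # variance of each coefficient
    excess = z**2 - noise                 # unbiased estimate of beta_j^2
    # tail[J] = sum of excess over the dropped coefficients j >= J.
    tail = np.concatenate([np.cumsum(excess[::-1])[::-1], [0.0]])
    risk = np.arange(n + 1) * noise + tail
    best = int(np.argmin(risk[1:])) + 1   # always keep at least the mean
    kept = np.where(np.arange(n) < best, z, 0.0)
    return basis @ kept, best
```

Keeping more terms lowers the bias term (`tail`) but raises the variance term (`J * noise`); the selected `best` is where the estimated sum of the two is smallest.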


Results (optimal smoothing through bias-variance trade-off)

[Figure legend: Concordance; our f(Xi)]

Note: the WMAP-only fit is not the same as the concordance model


Testing models

• The main advantage of this method is that we can construct a “confidence ball” (in N dimensions) around f(Xi) and thus perform non-parametric inferences, e.g. is the second peak detected?

Not at 95% confidence!


Using CMBfast we can make parametric models (11 parameters) and test if they are within the “confidence ball”. Varying $\Omega_b$ we get a range of 0.0169 to 0.0287.

Gray are models in the 95% confidence ball
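Testing a model against the confidence ball amounts to a distance check between the model curve and the non-parametric estimate. The sketch below assumes both curves are sampled on a common grid and that the ball radius is already known; deriving the radius is outside its scope. The root-mean-square distance and the helper names are illustrative assumptions.

```python
import numpy as np


def in_confidence_ball(model, estimate, radius):
    """True if the model lies within `radius` (RMS distance) of the estimate."""
    model = np.asarray(model, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distance = np.sqrt(np.mean((model - estimate) ** 2))
    return bool(distance <= radius)


def accepted_range(values, model_fn, estimate, radius):
    """Smallest and largest parameter value whose model falls in the ball."""
    kept = [v for v in values if in_confidence_ball(model_fn(v), estimate, radius)]
    return (min(kept), max(kept)) if kept else None
```

Here `model_fn` stands in for a call to a parametric code such as CMBfast with one parameter varied and the others held fixed.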

ASA “Outstanding Application of the year” (2005)


Testing in high D

• We can now jointly search 7 cosmological parameters in the parametric model and determine which models fit in the confidence ball (at 95%).

• Traditionally this is done by marginalising over the other parameters to gain confidence intervals on each parameter separately. This is a problem in high-D, where the likelihood function could be degenerate, ill-defined and under-identified.

• This is computationally intensive, as millions of models need to be searched and each takes ~3 minutes to run.


Find boundaries

We use kriging:

– “method of interpolation which predicts unknown values from data observed at known locations”
– Also known as Gaussian process regression; a form of Bayesian inference
– Different metrics for evaluation (variance, entropy, least probable)

Variance: pick points far from other searches

Straddle: pick points far from other searches and near the predicted boundary
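A one-dimensional sketch of the straddle idea is given below. It scores each candidate by 1.96 times the predictive standard deviation minus the distance of the predictive mean from the threshold, so points that are both uncertain and near the predicted boundary score highest. The squared-exponential kernel, its length scale, the zero-mean prior and `toy_model` are all assumptions made for illustration; `toy_model` stands in for an expensive model evaluation.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)


def gp_predict(x_train, y_train, x_test, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    k_train = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_cross = rbf_kernel(x_test, x_train)
    mean = k_cross @ np.linalg.solve(k_train, y_train)
    variance = 1.0 - np.sum(k_cross * np.linalg.solve(k_train, k_cross.T).T, axis=1)
    return mean, np.sqrt(np.clip(variance, 0.0, None))


def straddle_score(mean, std, threshold):
    """High where the prediction is uncertain and close to the threshold."""
    return 1.96 * std - np.abs(mean - threshold)


def toy_model(x):
    """Hypothetical stand-in for an expensive model evaluation."""
    return x**2


def find_boundary(threshold=0.25, n_steps=15):
    """Sample where the straddle score peaks; return samples and boundary."""
    candidates = np.linspace(0.0, 1.0, 201)
    x = np.array([0.0, 1.0])
    y = toy_model(x)
    for _ in range(n_steps):
        mean, std = gp_predict(x, y, candidates)
        score = straddle_score(mean, std, threshold)
        score[np.isin(candidates, x)] = -np.inf   # never resample a point
        pick = candidates[np.argmax(score)]
        x = np.append(x, pick)
        y = np.append(y, toy_model(pick))
    mean, _ = gp_predict(x, y, candidates)
    return x, candidates[np.argmin(np.abs(mean - threshold))]
```

For this toy function the true boundary is at x = 0.5. The same loop carries over to several dimensions by swapping the candidate grid and kernel for multi-dimensional versions.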


50 samples 200 samples


Results

[Figure: search results; axis labels: dark matter, baryons]

Add two heuristics:
• Path - explore between peaks
• Depth - flood peaks


1.2 million models

6.8 yrs of CPU time

Purple: 68%; Red: 95%
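The CPU-time figure follows directly from the quoted run time of roughly 3 minutes per model. A quick check, assuming exactly 3 minutes per model and a 365.25-day year:

```python
def cpu_years(n_models, minutes_per_model=3.0):
    """Total CPU time in years for n_models sequential model evaluations."""
    minutes_per_year = 60 * 24 * 365.25
    return n_models * minutes_per_model / minutes_per_year


print(round(cpu_years(1_200_000), 1))  # 6.8
```

At the same rate, a 10-million-model run would need roughly 57 CPU years, which is the motivation for spreading the search across Grid nodes.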


Future

• Marriage with VOtechbroker and run 10 million models on TeraGrid (300Gb of models)
• Java code exists to query dataspace - provide a webservice
• Add other data (CMB, LSS)
• Convergence test: shape of surface
• Visualization of 7D space


Future applications

• Selection function for XMM Cluster Survey

• Add fake clusters and then analyse

• Over a million combinations, or 4 yrs of CPU time

Fake cluster added to XMM field


Summary

• VO infrastructure with emerging Grids provides a powerful framework within which to do massive calculations

• VOtechbroker will abstract the Grid from the user and interface with VO MySpace

• Registry of advanced algorithms (npt, kriging, nonparametric statistics etc.)
