A Web Service for Distributed Covariance Computation on Astronomy Catalogs
Presented by Haimonti Dutta
CMSC 691D
ROADMAP
• Background Information
• Interesting Astronomy Data Mining Problems
• What has and has not been done (Literature review)
• My project objectives
• The problem of Alignment in astronomy catalogs
• The Fundamental Plane
• A case study for recreating the Fundamental Plane from astronomy catalogs
• Experimental Results
• Efforts towards building Web services
Background Information
Next generation Astronomy catalogs will contain data for most of the sky
Existing astronomy sky surveys – SDSS, 2Mass, FIRST, etc
Terabytes and petabytes of data
Data Avalanche in Astronomy
Getting useful information is like looking for a needle in a haystack
National Virtual Observatory (NVO) has been set up to facilitate scientific discovery
Obvious need for Distributed Data Mining
What kinds of data mining activities are astronomers interested in?
Detection of transient objects such as supernovae (Online transient object detection in real time)
Obtain statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data)
Parameterize shapes of objects using rotationally invariant quantities
Efficient cluster and outlier detection
Supervised data mining problems (match objects detected in multiple bands, derive photometric redshifts)
What has and has not been done
Many efforts in centralized data mining (NVO, FMass, Class X, FIRST, etc.)
Some grid mining (notably the GRIST project)
Very few distributed data mining efforts, all in their preliminary stages
(http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)
Objectives of this project
Aligning of Catalogs (The Fundamental Plane Problem)
Implementation of algorithms for Distributed Data Mining on Astronomy Catalogs
Development of web services for the catalogs and investigation into what needs to be done to integrate this into the NVO
Alignment of Astronomy Catalogs
Cross matching is a nontrivial problem in itself. We assume cross matching happens offline and that there is an indexing scheme by which the catalogs know the exact cross matched tuples
Some interesting numbers
The current SDSS catalogs are 3.0 TB in size and contain about 180 million objects (as per Data Release 4)
2Mass has already observed 99% of the sky and reports 470,992,970 Point sources and 1,647,599 Extended sources
Portion of the sky observed by SDSS
Problems
Cross matching is an inherently difficult problem for the astronomy catalogs
We assume data sets are cross matched and this computation is done offline
This is a strong assumption and often may not be acceptable to astronomers
A real life cross matching exercise
Problems encountered
- Which catalogs to use? We tried several: SDSS, 2Mass, HyperLeda, CfA RedShift Catalog
- Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even names of objects
- Some attributes are really not available! (SDSS has -9999 for most of its red shift values)
- Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in the latest release while 2Mass covers the entire sky). Select subsets to cross match wisely!
The successful cross matching
- Chose a region of the sky between 0 and 15 degrees (dec) and 150 and 200 degrees (ra), observed by both SDSS and 2Mass
- Used a web interface provided by SDSS to do the cross matching
- Selected the K-band for obtaining red shift and surface brightness (astronomical significance)
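The selection step above can be sketched in a few lines of Python. This is only an illustration: the row layout and the key names `ra`, `dec` and `redshift` are hypothetical and do not reflect the actual SDSS or 2Mass schemas, and the sample rows are made up.

```python
# Sketch only: the row layout and key names ("ra", "dec", "redshift") are
# hypothetical, not the actual SDSS or 2Mass schema.
MISSING = -9999  # sentinel the slides report for unavailable SDSS red shifts

RA_RANGE = (150.0, 200.0)  # degrees, region chosen in the slides
DEC_RANGE = (0.0, 15.0)    # degrees, region chosen in the slides


def in_region(row):
    """True if the object lies inside the chosen (ra, dec) window."""
    return (RA_RANGE[0] <= row["ra"] <= RA_RANGE[1]
            and DEC_RANGE[0] <= row["dec"] <= DEC_RANGE[1])


def usable(rows):
    """Keep objects inside the window whose red shift is actually present."""
    return [r for r in rows if in_region(r) and r["redshift"] != MISSING]


# Made-up rows, for illustration only.
sample = [
    {"ra": 160.2, "dec": 3.1, "redshift": 0.05},     # kept
    {"ra": 160.9, "dec": 4.0, "redshift": MISSING},  # dropped: no red shift
    {"ra": 20.0, "dec": 3.0, "redshift": 0.02},      # dropped: outside window
]
kept = usable(sample)
```

Filtering out the sentinel before cross matching avoids treating -9999 as a real measurement in later statistics.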
Case Study
Centralized database of 1249 cross matched objects
Attributes are size, surface brightness, velocity dispersion
Does not really make a case for a distributed data mining scenario!
Solution
- try a larger subset of the data from both catalogs
The Fundamental Plane
Interesting problem in astronomy: identify correlations in high dimensional spaces
For the class of elliptical and spiral galaxies
Observed features: radius, mean surface brightness and central velocity dispersion
A two dimensional plane exists in the observed space of 3D parameters, called THE FUNDAMENTAL PLANE
An illustration of the Fundamental Plane
Experimental Results
First PC captured 69.4193% of variance
Second PC captured 12.1333% of the variance
Together the first two PCs capture 81.5526% of the variance; the astronomy literature suggests they should capture about 88%
Reasonably close recreation of the Fundamental Plane from two cross matched data sets in the centralized setting
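To make the "variance captured" figures concrete, here is a minimal sketch of how such fractions can be obtained from a covariance matrix: each principal component's share is its eigenvalue divided by the trace. The slides do not say how the eigenvalues were computed; power iteration with deflation is used here only because it needs no external library. The data are synthetic (points near a plane plus noise), not the cross matched catalog values.

```python
import random
from math import sqrt


def covariance(rows):
    """Covariance matrix (divide by n) of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
             for j in range(d)] for i in range(d)]


def top_eigenpair(M, iters=1000):
    """Largest eigenpair of a symmetric PSD matrix by power iteration."""
    d = len(M)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * M[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v


def explained_variance(M, k):
    """Fractions of total variance captured by the first k PCs."""
    d = len(M)
    total = sum(M[i][i] for i in range(d))  # trace = sum of eigenvalues
    M = [row[:] for row in M]               # work on a copy
    fractions = []
    for _ in range(k):
        lam, v = top_eigenpair(M)
        fractions.append(lam / total)
        for i in range(d):                  # deflate the component just found
            for j in range(d):
                M[i][j] -= lam * v[i] * v[j]
    return fractions


# Synthetic 3-attribute data lying close to a plane (illustration only).
rng = random.Random(0)
rows = []
for _ in range(500):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([x, y, 0.6 * x + 0.4 * y + 0.1 * rng.gauss(0, 1)])
fractions = explained_variance(covariance(rows), 2)
```

When the points lie near a plane, the first two fractions add up to almost 1, which is the signature the case study looks for.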
Algorithm for Distributed Covariance Computation
Site A holds data matrix A and site B holds data matrix B, each with n rows (the cross matched objects)
A central co-ordination site S sends A and B a random number generation seed
A and B each generate the same n x l random matrix R, where l << n
A and B send S the projections R^T A and R^T B
S computes (R^T A)^T (R^T B) / n
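The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the slides do not give the distribution of R, so the entries are assumed here to be i.i.d. Gaussian with mean 0 and variance 1/l, which makes R R^T equal to the identity in expectation and so makes (R^T A)^T (R^T B) / n an estimate of A^T B / n. The columns are assumed to be mean-centered locally at each site, and the data are synthetic.

```python
import random
from math import sqrt


def random_matrix(seed, n, l):
    """n x l matrix with i.i.d. N(0, 1/l) entries (assumed distribution)."""
    rng = random.Random(seed)
    sigma = 1.0 / sqrt(l)
    return [[rng.gauss(0.0, sigma) for _ in range(l)] for _ in range(n)]


def center(X):
    """Subtract each column's mean (done locally at each site)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]


def project(R, X):
    """Compute R^T X for R (n x l) and X (n x p); the result is l x p."""
    l, p = len(R[0]), len(X[0])
    out = [[0.0] * p for _ in range(l)]
    for r_row, x_row in zip(R, X):
        for k in range(l):
            for j in range(p):
                out[k][j] += r_row[k] * x_row[j]
    return out


def combine(PA, PB, n):
    """Site S: (R^T A)^T (R^T B) / n, a p x q covariance estimate."""
    return [[sum(a[i] * b[j] for a, b in zip(PA, PB)) / n
             for j in range(len(PB[0]))] for i in range(len(PA[0]))]


# Synthetic data: 2 attributes at site A, 1 correlated attribute at site B.
n, l, seed = 2000, 200, 42
data_rng = random.Random(7)
A, B = [], []
for _ in range(n):
    x, y = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    A.append([x, y])
    B.append([0.8 * x + 0.2 * data_rng.gauss(0, 1)])
A, B = center(A), center(B)

# Each site regenerates R from the shared seed and sends only an l x p block.
msg_a = project(random_matrix(seed, n, l), A)
msg_b = project(random_matrix(seed, n, l), B)
estimate = combine(msg_a, msg_b, n)

# Exact cross-covariance, available only if the data were centralized.
exact = [[sum(A[i][a] * B[i][0] for i in range(n)) / n] for a in range(2)]
```

Each site transmits l rows instead of n, so communication shrinks by a factor of n / l, at the price of an approximation error that decreases as l grows.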
Experimental Results: Distributed Setting
Case Study
1249 cross matched objects at sites A and B
2 attributes at site A and 1 attribute at site B
More results
Development of a Web Service
Architecture of the Proposed System
[Diagram labels: CLIENT; SITE A; SITE B; WEB SERVICE for Distributed Covariance Computation; SOAP Message (twice)]
Current Implementation
Using Apache Axis (SOAP engine: a framework for making SOAP processors such as clients and servers)
Tomcat version 4.1
SOAP version 1.2
Short Demo
Further System Developmental Issues (use of SOAP with attachments)
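For readers unfamiliar with SOAP, the sketch below builds a SOAP 1.2 request envelope of the kind a client might send to such a service. The slides do not describe the service interface, so the operation name `computeCovariance`, its parameters `seed` and `projectionDim`, and the namespace `urn:example:distributed-covariance` are all hypothetical. Only the SOAP 1.2 envelope namespace is standard. The actual system uses Apache Axis; Python is used here purely for illustration.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope
SVC_NS = "urn:example:distributed-covariance"          # hypothetical namespace


def build_request(seed, projection_dim):
    """Return a SOAP 1.2 envelope (as text) for a hypothetical operation."""
    ET.register_namespace("env", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # Hypothetical operation and parameter names, for illustration only.
    op = ET.SubElement(body, f"{{{SVC_NS}}}computeCovariance")
    ET.SubElement(op, f"{{{SVC_NS}}}seed").text = str(seed)
    ET.SubElement(op, f"{{{SVC_NS}}}projectionDim").text = str(projection_dim)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request(seed=42, projection_dim=200)
```

Large payloads such as projected matrices do not fit comfortably inside an XML body like this, which is one motivation for the SOAP with attachments issue listed above.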
QUESTIONS?