Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)
Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti (Statistics)
Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)
Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)
Pittsburgh Computational AstroStatistics (PiCA) Group
(See http://www.picagroup.org)
First Motivation
- Cosmology is moving from a "discovery" science into a "statistical" science
- Drive for "high precision" measurements:
  - Cosmological parameters to a few percent
  - An accurate description of the complex structure in the universe
  - Control of observational and sampling biases
- New statistical tools (e.g. non-parametric analyses) are often computationally intensive
- Also, we often want to re-sample or Monte Carlo the data
Second Motivation
- The last decade was dedicated to building more telescopes and instruments; more are coming this decade as well (SDSS, Planck, LSST, 2MASS, DPOSS, MAP). Also, larger simulations.
- We have a "data flood": SDSS is terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by the end of the '00s.
- Highly correlated datasets and high dimensionality
- Existing statistics and algorithms do not scale into these regimes
- A new paradigm, where we must build new tools before we can analyze and visualize the data
SDSS Data

  Quantity     Value            Factor over previous surveys
  Area         10,000 sq deg    x3
  Objects      2.5 billion      x200
  Spectra      1.5 million      x200
  Depth        R = 23           x10
  Attributes   144 presently    x10

  Combined improvement: a factor of 12,000,000 (3 x 200 x 200 x 10 x 10).
SDSS Science
- Most distant object!
- 100,000 spectra!
- Start with tree data structures: multi-resolutional kd-trees
- Scale to n dimensions (although for very high dimensions, use new tree structures)
- Use a cached representation: store summary sufficient statistics at each node and compute counts from these statistics
- Prune the tree, which is stored in memory!
- See Moore et al. 2001 (astro-ph/0012333)
- Many applications; a suite of algorithms!
Goal: to build new, fast and efficient statistical algorithms

Range Searches
- Fast range searches and catalog matching
- Prune cells outside the range
- Also prune cells inside! Greater saving in time
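Both prunes can be sketched with a toy kd-tree whose nodes cache a point count and bounding box as sufficient statistics. The class and function names below are illustrative assumptions, not the PiCA group's actual code:

```python
import numpy as np

class Node:
    """Toy kd-tree node caching a count and bounding box (sufficient stats)."""
    def __init__(self, pts):
        self.count, self.lo, self.hi = len(pts), pts.min(axis=0), pts.max(axis=0)
        if self.count <= 8:                      # small node: keep the raw points
            self.points, self.left, self.right = pts, None, None
        else:                                    # split on the widest dimension
            d = int(np.argmax(self.hi - self.lo))
            order, mid = pts[:, d].argsort(), self.count // 2
            self.points = None
            self.left, self.right = Node(pts[order[:mid]]), Node(pts[order[mid:]])

def range_count(node, center, r):
    """Count points within distance r of center, pruning whole nodes."""
    nearest = np.clip(center, node.lo, node.hi)          # closest point of the box
    dmin = float(np.linalg.norm(center - nearest))
    far = np.where(np.abs(center - node.lo) > np.abs(center - node.hi),
                   node.lo, node.hi)                     # farthest corner of the box
    dmax = float(np.linalg.norm(center - far))
    if dmin > r:
        return 0                  # prune: node entirely outside the range
    if dmax <= r:
        return node.count         # prune: node entirely inside -- use cached count
    if node.points is not None:   # leaf: test the individual points
        return int((np.linalg.norm(node.points - center, axis=1) <= r).sum())
    return range_count(node.left, center, r) + range_count(node.right, center, r)
```

The inside-prune is the "greater saving in time": a node wholly inside the range contributes its cached count without any per-point work.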
N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process. It also has a long history (as a point process) in statistics.
Similarly, the three-point function is defined for triples of points (and so on for higher orders!).
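In the standard notation (Peebles 1980), with mean number density \bar{n}, these definitions read:

```latex
% 2-point function: excess probability of a pair in volumes dV_1, dV_2
dP = \bar{n}^{2}\,\bigl[\,1 + \xi(r_{12})\,\bigr]\,dV_{1}\,dV_{2}

% 3-point function: the connected term \zeta adds to the three pair terms
dP = \bar{n}^{3}\,\bigl[\,1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
        + \zeta(r_{12}, r_{23}, r_{31})\,\bigr]\,dV_{1}\,dV_{2}\,dV_{3}
```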
Same 2-point function, very different 3-point function. Naively, an N-point function over n points is an n^N process, but all it really is, is a set of range searches.
Dual Tree Approach
- Pairs are usually binned into annuli rmin < r < rmax. Thus, for each r, traverse both trees and prune any pair of nodes with either dmax < rmin or dmin > rmax.
- Also, if dmin > rmin and dmax < rmax, all pairs in these nodes lie within the annulus. Therefore, we only need to explicitly count pairs cutting the annulus boundaries.
- Extra speed-ups are possible by doing multiple r's together and by controlled approximations.
- The time depends on the density of points, the bin size and the scale.
Scaling: naive pair counting is N*N (and N*N*N for the 3-point function); the dual-tree approach scales as N log N.
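The dual-tree annulus count can be sketched as follows, again with a toy kd-tree that caches each node's count and bounding box; names and leaf sizes are illustrative assumptions, not the actual implementation:

```python
import numpy as np

class Node:
    """Toy kd-tree node caching a count and bounding box (sufficient stats)."""
    def __init__(self, pts):
        self.count, self.lo, self.hi = len(pts), pts.min(axis=0), pts.max(axis=0)
        if self.count <= 16:
            self.points, self.left, self.right = pts, None, None
        else:
            d = int(np.argmax(self.hi - self.lo))
            order, mid = pts[:, d].argsort(), self.count // 2
            self.points = None
            self.left, self.right = Node(pts[order[:mid]]), Node(pts[order[mid:]])

def dual_count(a, b, rmin, rmax):
    """Ordered pairs (p in a, q in b) with rmin < |p - q| <= rmax."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    dmin = float(np.linalg.norm(gap))                         # closest box-box distance
    dmax = float(np.linalg.norm(np.maximum(a.hi - b.lo, b.hi - a.lo)))
    if dmin > rmax or dmax <= rmin:
        return 0                     # exclusion prune: no pair can lie in the annulus
    if dmin > rmin and dmax <= rmax:
        return a.count * b.count     # inclusion prune: every pair lies in the annulus
    if a.points is not None and b.points is not None:         # both leaves: brute force
        d = np.linalg.norm(a.points[:, None] - b.points[None, :], axis=2)
        return int(((d > rmin) & (d <= rmax)).sum())
    if a.points is not None or (b.points is None and b.count > a.count):
        return dual_count(a, b.left, rmin, rmax) + dual_count(a, b.right, rmin, rmax)
    return dual_count(a.left, b, rmin, rmax) + dual_count(a.right, b, rmin, rmax)
```

Only node pairs that straddle an annulus boundary ever reach the per-point loop, which is where the N log N-like behaviour comes from.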
Fast Mixture Models
- Describe the data in N dimensions as a mixture of, say, Gaussians (the kernel shape matters less than the bandwidth!)
- The parameters of the model are then N Gaussians, each with a mean and a covariance
- Iterate, testing with BIC and AIC at each iteration. Fast because of kd-trees (20 minutes for 100,000 points on a PC!)
- Employ a heuristic splitting algorithm as well
- Details in Connolly et al. 2000 (astro-ph/0008187)
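The iterate-and-score loop can be sketched with a plain 1-D EM fit plus a BIC score; this is a minimal sketch without the kd-tree acceleration, and `em_gmm_1d` with its quantile initialisation is an illustrative assumption, not the Connolly et al. code:

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return params and BIC."""
    n = len(x)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread the initial means
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0)
        w, mu = nk / n, (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-9)
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    loglik = float(np.log(dens.sum(axis=1)).sum())
    nparams = 3 * k - 1              # k means, k variances, k-1 free weights
    bic = nparams * np.log(n) - 2.0 * loglik        # lower is better
    return mu, var, w, bic
```

Fitting with increasing k and keeping the minimum-BIC model is the model-selection step the slide describes.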
EM-Based Gaussian Mixture Clustering: steps 1, 2, 4 and 20 (figure slides)
Applications
- Used in SDSS quasar selection, to map the multi-color stellar locus (Gordon Richards @ PSU)
- Anomaly detector: look for low-probability points in N dimensions
- Optimal smoothing of large-scale structure
SDSS QSO target selection in 4D color-space
- Cluster 9999 spectroscopically confirmed stars
- Cluster 8833 spectroscopically confirmed QSOs (33 Gaussians)
- 99% for stars, 96% for QSOs
Bayes Net Anomaly Detector
- Instead of using a single joint probability function (fitted to the data), factorize it into a smaller set of conditional probabilities
- The graph is directed and acyclic
- If we know the graph and the conditional probabilities, we have a valid probability function for the whole model
- Use 1.5 million SDSS sources to learn the model (25 variables each)
- Then evaluate the likelihood of each data point being drawn from the model
- The lowest 1000 are anomalous; look at 'em and follow 'em up at Keck
- Unfortunately, there is a lot of error. An advantage of the Bayes net is that it tells you why a point was anomalous: the most unusual conditional probabilities. Therefore, iterate the loop and get a scientist to highlight obvious errors, then suppress those errors so they do not return again. This is an issue of productivity!
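The factorize-and-rank idea can be sketched with a hand-fixed toy DAG over three binary attributes; the structure, names and add-alpha smoothing are assumptions for illustration, not the 25-variable SDSS model:

```python
from collections import Counter
import math

STRUCTURE = {"a": [], "b": ["a"], "c": ["a"]}   # toy DAG: a -> b, a -> c

def fit(records):
    """Tally (parent-values, value) counts for each variable's conditional table."""
    counts = {v: Counter() for v in STRUCTURE}
    parents = {v: Counter() for v in STRUCTURE}
    for rec in records:
        for v, ps in STRUCTURE.items():
            key = tuple(rec[p] for p in ps)
            counts[v][key + (rec[v],)] += 1
            parents[v][key] += 1
    return counts, parents

def loglik(rec, counts, parents, alpha=1.0, arity=2):
    """Sum of log conditional probabilities, with add-alpha smoothing."""
    total = 0.0
    for v, ps in STRUCTURE.items():
        key = tuple(rec[p] for p in ps)
        num = counts[v][key + (rec[v],)] + alpha
        den = parents[v][key] + alpha * arity
        total += math.log(num / den)
    return total
```

Ranking all records by `loglik` and inspecting the lowest scores is the anomaly loop; the per-variable terms show which conditional probability made a record unusual.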
Will Only Get Worse
- LSST will do an SDSS every 5 nights looking for transient objects, producing petabytes of data (2007)
- VISTA will collect 300 terabytes of data (2005)
- Archival science is upon us! The HST database has 20 GBytes per day downloaded (10 times more than goes in!)
Will Only Get Worse II
- Surveys spanning the electromagnetic spectrum
- Combining these surveys is hard: different sensitivities, resolutions and physics
- A mixture of imaging, catalogs and spectra
- The difference between continuum and point processes
- Thousands of attributes per source
What is the VO?
The "Virtual Observatory" must:
- Federate multi-wavelength data sources