The Sloan Digital Sky Survey From Big Data to Big Database to Big Compute Heidi Newberg Rensselaer Polytechnic Institute


Jan 07, 2022

Transcript
Page 1: The Sloan Digital Sky Survey From Big Data to Big Database ...

The Sloan Digital Sky Survey

From Big Data to Big Database to Big Compute

Heidi Newberg

Rensselaer Polytechnic Institute

Page 2: The Sloan Digital Sky Survey From Big Data to Big Database ...

Summary

• History of the data deluge from a personal perspective.

• The transformation of astronomy with the Sloan Digital Sky Survey.

• The discovery of density substructure in the Milky Way stellar spheroid.

• Using MilkyWay@home to fit more complex models to the data.

Page 3: The Sloan Digital Sky Survey From Big Data to Big Database ...

The new 1024x1024 CCD camera required a new computer to store the data from just one night of observing (2 megabytes every five minutes). We also needed to write to exabyte tape drives rather than magnetic tapes, so the data would be easier to carry home on the airplane.

Page 4: The Sloan Digital Sky Survey From Big Data to Big Database ...

The beginning of the data deluge (1990’s)

• New CCD cameras produced enough data that we could no longer look at each astronomical object individually. Automated algorithms were needed.

• Mag tapes hold 100 Mbytes each, ~2 hrs of observing time per tape. (Requires large backpack to transport home.) Exabyte tapes made data transport easier.

• I still own all of these tapes, but it is likely that they are not readable. All astronomical data from that era is lost forever.

Page 5: The Sloan Digital Sky Survey From Big Data to Big Database ...

The Sloan Digital Sky Survey (SDSS) is a joint project of The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Princeton University, the U.S. Naval Observatory, and the University of Washington. (11 institutions)

Funding for the project has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.

Page 6: The Sloan Digital Sky Survey From Big Data to Big Database ...

The Data

• Images of 14,000 square degrees of sky in 5 passbands (raw data 20 TB)

• A catalog of a billion objects detected in those images (20 TB SQL database), ~400 parameters per object

• Other data products (DAS – 34 TB)

• 1.5 million spectra of galaxies, stars, and quasars (3.3 TB)

• Spectral parameters (450 Gbytes)

Data reduction??

Page 7: The Sloan Digital Sky Survey From Big Data to Big Database ...
Page 8: The Sloan Digital Sky Survey From Big Data to Big Database ...

I discussed the data processing.

Page 9: The Sloan Digital Sky Survey From Big Data to Big Database ...

Alex Szalay and his group at Johns Hopkins took on the enormous task of putting all of this data into a database, preserving as much provenance as possible, and making the data as accessible as possible. There are serious issues with speed in a database of this size, so his group needed to think hard about how the data would be accessed, and thus how it should be organized.
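One organizing idea behind the SkyServer database is hierarchical spatial indexing (the "HTM triangles" mentioned in the queries below): nearby objects on the sky share an index prefix, so positional searches become compact range scans. Real HTM subdivides spherical triangles; the sketch below is a simplified flat-sky stand-in, written only to illustrate the idea.

```python
# Simplified illustration of hierarchical spatial indexing, in the spirit of
# the Hierarchical Triangular Mesh (HTM) used for the SDSS database. Real HTM
# recursively subdivides spherical triangles; this toy version subdivides a
# flat (ra, dec) rectangle into four cells per level, so nearby objects share
# long common ID prefixes.

def cell_id(ra, dec, depth=8, bounds=(0.0, 360.0, -90.0, 90.0)):
    """Return a list of quadrant digits locating (ra, dec) at each level."""
    ra_lo, ra_hi, dec_lo, dec_hi = bounds
    digits = []
    for _ in range(depth):
        ra_mid = (ra_lo + ra_hi) / 2
        dec_mid = (dec_lo + dec_hi) / 2
        quad = (2 if dec >= dec_mid else 0) + (1 if ra >= ra_mid else 0)
        digits.append(quad)
        ra_lo, ra_hi = (ra_mid, ra_hi) if ra >= ra_mid else (ra_lo, ra_mid)
        dec_lo, dec_hi = (dec_mid, dec_hi) if dec >= dec_mid else (dec_lo, dec_mid)
    return digits

# Objects close together on the sky share a long common prefix, so a B-tree
# index on the cell ID turns a cone search into a small range scan.
```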

Page 10: The Sloan Digital Sky Survey From Big Data to Big Database ...

The 20 Queries

Q1: Find all galaxies without unsaturated pixels within 1' of a given point (ra=75.327, dec=21.023).

Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, -10 < supergalactic latitude (sgb) < 10, and declination less than zero.

Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.

Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity >0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.

Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on the disk) and photometric colors consistent with an elliptical galaxy.

Q6: Find galaxies that are blended with a star; output the deblended galaxy magnitudes.

Q7: Provide a list of star-like objects that are 1% rare.

Q8: Find all objects with unclassified spectra.

Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.

Q10: Find galaxies with spectra that have an equivalent width in Hα >40 Å (Hα is the main hydrogen spectral line).

Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.

Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70 and 200<right ascension<210, on a grid of 2', and create a map of masks over the same grid.

Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75; output it in a form adequate for visualization.

Q14: Find stars with multiple measurements that have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.

Q15: Provide a list of moving objects consistent with an asteroid.

Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.

Q17: Find binary stars where at least one of them has the colors of a white dwarf.

Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color differences u-g, g-r, r-i are less than 0.05 mag.

Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.

Q20: For each galaxy in the BCG (brightest cluster galaxy) data set with 160<right ascension<170 and -25<declination<35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.

From talk by Jim Gray (2001)

Scientists were asked for example scientific queries, so the database could be optimized.
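In practice these queries run as SQL against the Catalog Archive Server. As a toy illustration of what a query like Q9 looks like in SQL form, here is a minimal sketch using an in-memory SQLite table; the table and column names (specObj, objClass, z, lineWidth) are hypothetical stand-ins, not the real SDSS schema.

```python
import sqlite3

# Toy version of Q9: "Find quasars with a line width > 2000 km/s and
# 2.5 < redshift < 2.7." The real SkyServer runs on SQL Server with a much
# richer schema; the table and columns below are invented for illustration.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specObj (objId INTEGER, objClass TEXT, z REAL, lineWidth REAL)"
)
conn.executemany(
    "INSERT INTO specObj VALUES (?, ?, ?, ?)",
    [
        (1, "QSO", 2.60, 3500.0),    # passes both cuts
        (2, "QSO", 2.60, 1500.0),    # line too narrow
        (3, "QSO", 1.20, 4000.0),    # redshift out of range
        (4, "GALAXY", 2.55, 2500.0), # not a quasar
    ],
)

rows = conn.execute(
    "SELECT objId FROM specObj "
    "WHERE objClass = 'QSO' AND lineWidth > 2000 AND z BETWEEN 2.5 AND 2.7"
).fetchall()
print(rows)  # only object 1 survives every cut
```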

Page 11: The Sloan Digital Sky Survey From Big Data to Big Database ...

Sky survey “Navigate” tool lets you browse through the images

Page 12: The Sloan Digital Sky Survey From Big Data to Big Database ...
Page 13: The Sloan Digital Sky Survey From Big Data to Big Database ...
Page 14: The Sloan Digital Sky Survey From Big Data to Big Database ...
Page 15: The Sloan Digital Sky Survey From Big Data to Big Database ...

Over a billion hits to the SDSS site, leveling off at 150 million per year. Over 2,000,000 SQL queries per month on the database.

Page 16: The Sloan Digital Sky Survey From Big Data to Big Database ...


Computational Science

• Traditional Empirical Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data

• Computational Science
  – Data captured by instruments, or data generated by simulator
  – Processed by software
  – Placed in a database
  – Scientist analyzes database

From talk by Jim Gray 10/10/2001

Page 17: The Sloan Digital Sky Survey From Big Data to Big Database ...


What’s needed? (not drawn to scale)

[Diagram: Scientists pose science data & questions; a database ("plumbers") stores the data and executes queries; data mining algorithms ("miners") extract answers; question & answer and visualization tools close the loop.]

Slide from talk by Jim Gray 4/10/2002

Page 18: The Sloan Digital Sky Survey From Big Data to Big Database ...

Astronomy Information Age

• Astronomical data is processed without anyone looking at the individual images/spectra. Astronomers used to classify galaxies by eye; sometimes a graduate student would classify thousands of galaxies from a computer screen. At three per minute, this might take hours, days, or even weeks. The SDSS found 10^8 galaxies; at three per minute, classification would take 63 years of 24 hours per day, seven days per week. "Galaxy Zoo" is a project that allows private citizens to look at data by eye and contribute classifications to scientists.

• More data is obtained than anyone can analyze alone (drinking from a fire hose). Projects like the SDSS SkyServer, the Virtual Observatory, Google Sky, and WikiSky are all aimed at letting people better access the data from SDSS.

• New surveys, including Pan-STARRS, LSST, Guo Shou Jing (LAMOST), DES, RAVE, SEGUE, HERMES, and WFMOS, are planned or in progress, patterned on the success of the Sloan Digital Sky Survey.
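The classification arithmetic above is easy to verify. A back-of-the-envelope check:

```python
# Back-of-the-envelope check of the slide's classification arithmetic:
# 10^8 galaxies classified by eye at 3 per minute, around the clock.

galaxies = 10**8
per_minute = 3

minutes = galaxies / per_minute
years = minutes / 60 / 24 / 365

print(round(years, 1))  # ≈ 63.4 years
```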

Page 19: The Sloan Digital Sky Survey From Big Data to Big Database ...

ρ ∝ r^{-3.5}, where r = √(x² + y² + (z/q)²)

Page 20: The Sloan Digital Sky Survey From Big Data to Big Database ...

The SDSS survey was funded as an extragalactic project, but Galactic stars could not be completely avoided.

Page 21: The Sloan Digital Sky Survey From Big Data to Big Database ...

Statistical Photometric Parallax

The use of statistical knowledge of the absolute magnitudes of stellar populations to determine the density distributions of stars.
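The relation underneath this technique is the distance modulus, m − M = 5 log₁₀(d / 10 pc): a single star's absolute magnitude M is uncertain, but the distribution of M for a population (such as F turnoff stars) is known statistically, which is what allows density fitting. A minimal sketch; the value M_g ≈ 4.2 below is an illustrative assumption, not a number from this talk.

```python
import math

# Distance modulus: m - M = 5 * log10(d / 10 pc), so
# d = 10 ** ((m - M + 5) / 5) parsecs.
# M_g ~ 4.2 for an F turnoff star is an illustrative assumed value.

def distance_pc(apparent_mag, absolute_mag):
    """Distance in parsecs implied by the distance modulus."""
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)

d = distance_pc(apparent_mag=20.0, absolute_mag=4.2)
print(round(d / 1000, 1))  # distance in kpc
```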

Page 22: The Sloan Digital Sky Survey From Big Data to Big Database ...

Newberg et al. 2002

Vivas overdensity, or Virgo Stellar Stream

Sagittarius Dwarf Tidal Stream

Stellar Spheroid?

Monoceros stream, Stream in the Galactic Plane, Galactic Anticenter Stellar Stream, Canis Major Stream, Argo Navis Stream

Page 23: The Sloan Digital Sky Survey From Big Data to Big Database ...

Squashed halo

Spherical halo

Exponential disk

Prolate halo

Newberg et al. 2002

Page 24: The Sloan Digital Sky Survey From Big Data to Big Database ...

Kathryn Johnston

Page 25: The Sloan Digital Sky Survey From Big Data to Big Database ...

David Law

Page 26: The Sloan Digital Sky Survey From Big Data to Big Database ...

A map of stars in the outer regions of the Milky Way Galaxy, derived from the SDSS images of the northern sky, shown in a Mercator-like projection. The color indicates the distance of the stars, while the intensity indicates the density of stars on the sky. Structures visible in this map include streams of stars torn from the Sagittarius dwarf galaxy, a smaller 'orphan' stream crossing the Sagittarius streams, the 'Monoceros Ring' that encircles the Milky Way disk, trails of stars being stripped from the globular cluster Palomar 5, and excesses of stars found towards the constellations Virgo and Hercules. Circles enclose new Milky Way companions discovered by the SDSS; two of these are faint globular star clusters, while the others are faint dwarf galaxies. Credit: V. Belokurov and the Sloan Digital Sky Survey.

Page 27: The Sloan Digital Sky Survey From Big Data to Big Database ...

Why is this important?

• Small dwarf galaxies are merging with the Milky Way at the present time.

• The Milky Way itself was created by a long history of merging smaller galaxies to make larger ones.

• The tidal streams are an archeological record of the merger history that created our galaxy

• The tidal streams encode the gravitational potential through which the dwarf galaxy traveled, and can therefore tell us about the distribution of dark matter in the Milky Way.

Page 28: The Sloan Digital Sky Survey From Big Data to Big Database ...

Newberg et al. 2002

Vivas overdensity, or Virgo Stellar Stream

Sagittarius Dwarf Tidal Stream

Stellar Spheroid?

Monoceros stream, Stream in the Galactic Plane, Galactic Anticenter Stellar Stream, Canis Major Stream, Argo Navis Stream

Page 29: The Sloan Digital Sky Survey From Big Data to Big Database ...

Fitting model parameters

Previous astronomers fit 3 parameters to the entire stellar halo. We want to fit 20 parameters to each of eighteen 2.5-degree-wide stripes = 360 parameters.

The number of iterations needed to compute the likelihood increases with the number of stars and with the required accuracy of the calculation.

At four hours per evaluation, 50 likelihood calculations per iteration in a conjugate gradient descent method, and 50 iterations, 10,000 hours are required to optimize one stripe. This would take more than 400 days on a single processor.
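The wall-clock estimate above, spelled out:

```python
# Cost of optimizing one stripe, per the figures on this slide:
# 4 hours per likelihood evaluation, ~50 evaluations per conjugate-gradient
# iteration, ~50 iterations to converge.

hours_per_eval = 4
evals_per_iteration = 50
iterations = 50

total_hours = hours_per_eval * evals_per_iteration * iterations
total_days = total_hours / 24

print(total_hours, round(total_days))  # 10,000 hours ≈ 417 days on one CPU
```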

Page 30: The Sloan Digital Sky Survey From Big Data to Big Database ...

Began: November 9, 2007
Computing power: 0.5 PetaFLOPS (high of over 2 PetaFLOPS)
Number of volunteers (total people): 146,863
Number of computers volunteered (total): 291,944
Number of active volunteers: 25,670
Number of active computers being volunteered: 35,686
(Volunteer numbers as of 10/4/2012)

Page 31: The Sloan Digital Sky Survey From Big Data to Big Database ...

206 countries (of which 193 are UN members)

Page 32: The Sloan Digital Sky Survey From Big Data to Big Database ...

Volunteer Computing with 150,000 volunteers:

• Let us use their CPUs for scientific calculations
• Continuously upgrade their hardware
• Populate extensive forum discussions on science, technical support, and, well, anything
• Monitor the health of our system (especially our volunteer moderator)
• Wrote the first GPU version of our software
• Donate money and hardware

Page 33: The Sloan Digital Sky Survey From Big Data to Big Database ...

Volunteer Computing with 150,000 volunteers also:

• Compete with each other for BOINC “credits”
• Become angry if another person or team is getting an unfair number of credits
• Return garbage results (which require zero computations) so they can earn credit faster
• Insult each other on public forum boards
• Link anti-Semitic websites to ours
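BOINC projects defend against garbage results with redundancy: the same work unit is sent to several volunteers, and credit is granted only when independently returned results agree. A minimal sketch of that idea; this is not MilkyWay@home's actual validator code.

```python
# BOINC-style validation sketch: issue the same work unit to several
# volunteers and accept a canonical result only when a quorum of returned
# results agree within a tolerance. Minimal illustration only.

def validate(results, tol=1e-6, quorum=2):
    """Return the canonical result if >= quorum results agree, else None."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if len(agreeing) >= quorum:
            return sum(agreeing) / len(agreeing)
    return None

# Two honest volunteers agree; the garbage result is outvoted.
print(validate([3.141592, 3.141593, -999.0], tol=1e-5))
```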

Page 34: The Sloan Digital Sky Survey From Big Data to Big Database ...

Astronomy students write algorithms.

The MilkyWay@home server sends out jobs to volunteers and collects results.

Algorithms are adapted to run in an asynchronous, heterogeneous, parallel computing environment. The code is compiled and tested on 16 platforms, including CPUs and GPUs, and attached to the server. Mechanisms are created to start and end “runs.” The MySQL database is maintained.

Page 35: The Sloan Digital Sky Survey From Big Data to Big Database ...

Astronomy students write algorithms.

The MilkyWay@home server sends out jobs to volunteers and collects results.

Algorithms are adapted to run in an asynchronous, heterogeneous, parallel computing environment. The code is compiled and tested on 16 platforms, including CPUs and GPUs, and attached to the server. Mechanisms are created to start and end “runs.” The MySQL database is maintained.

This was originally accomplished with a $750,000 grant shared between astronomy and computer science faculty. But there is no model for maintaining this, since it is no longer an interesting computer science problem, and it is very expensive for an individual astronomy grant. We need lighter tools.

Page 36: The Sloan Digital Sky Survey From Big Data to Big Database ...

Data from one stripe = Stream 1 (6 parameters) + Stream 2 (6 parameters) + Stream 3 (6 parameters) + Smooth (3 parameters)

We can fit 20 parameters to each 2.5-degree-wide stripe of data.

We recently analyzed 18 stripes of data from DR7 (300–400 parameters).

Page 37: The Sloan Digital Sky Survey From Big Data to Big Database ...

Law & Majewski (2010) Newby et al., submitted

We can compare the position of the stream on the sky (left) with N-body simulations of Sgr dwarf galaxy disruption (right). The stream positions in the left panel are calculated stripe by stripe, in 2.5-degree-wide stripes.

Page 38: The Sloan Digital Sky Survey From Big Data to Big Database ...

1.9 million F turnoff stars

160,000 stars with Sgr density 1.7 million non-Sgr stars

Polar plots of SDSS F turnoff stars in the north Galactic Cap (top). Using our density model, we place each star in either the Sgr (lower left) or non-Sgr panel (lower right), with the probability given by the model. The stars in the Sgr panel are not guaranteed to be from the stream, but they collectively have the spatial properties of the Sgr stream.
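The separation described above can be sketched as a mixture model: each star goes into the Sgr catalog with probability equal to the stream's share of the total model density at the star's position. The density values below are made up for illustration.

```python
import random

# Probabilistic star separation: a star is assigned to the Sgr stream with
# probability proportional to the stream model's share of the total density
# at the star's position. Densities here are illustrative, not fitted values.

def split_star(stream_density, smooth_density, rng):
    """Return 'sgr' or 'smooth' with probability given by the mixture model."""
    p_stream = stream_density / (stream_density + smooth_density)
    return "sgr" if rng.random() < p_stream else "smooth"

rng = random.Random(42)

# A location where the stream model carries 80% of the total density:
counts = {"sgr": 0, "smooth": 0}
for _ in range(10000):
    counts[split_star(0.8, 0.2, rng)] += 1
print(counts)  # roughly 8000 sgr / 2000 smooth
```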

Page 39: The Sloan Digital Sky Survey From Big Data to Big Database ...
Page 40: The Sloan Digital Sky Survey From Big Data to Big Database ...

Determining the total mass, lumpiness, and flattening of the Galaxy’s dark matter halo

We now want to fit parameters of the Milky Way galaxy and the dwarf galaxies that fell in, by using n-body simulations of the merging and comparing them to the density parameters we measured in the data.

(1) We would like to fit N-body simulations (100,000 particles in the dwarf) instead of orbits (1 particle).

(2) We would like to fit multiple streams at the same time.

(3) We would like to fit distances, velocities, positions, and densities of the streams, and simultaneously fit measurements of the Milky Way’s rotation curve.

(4) We need to consider internal properties of the dwarfs.

Since modeling one dwarf requires ~30 minutes on a CPU, this requires substantial computational power. But then, we have MilkyWay@home.

Page 41: The Sloan Digital Sky Survey From Big Data to Big Database ...

Sample 100,000-particle (sub-sampled above) semi-analytic N-body simulations of the tidal disruption of the Orphan Stream. We fit only the Plummer sphere parameters for the dwarf galaxy. Right now we have a version of the Barnes & Hut (1986) code that works across CPU platforms for MilkyWay@home, with checkpointing, and we hope it will be running on GPUs sometime within the coming year.
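The Plummer sphere mentioned above has just two parameters, a total mass M and a scale radius a, with density ρ(r) = (3M/4πa³)(1 + r²/a²)^(−5/2) and potential Φ(r) = −GM/√(r² + a²). A short sketch of the model (the numeric values are illustrative, not fitted values from the talk):

```python
import math

# Plummer sphere: the dwarf galaxy's initial state, parameterized by total
# mass M and scale radius a.
#   rho(r) = (3M / 4 pi a^3) * (1 + r^2/a^2)^(-5/2)
#   Phi(r) = -G M / sqrt(r^2 + a^2)
# Values used below are illustrative only.

def plummer_density(r, mass, a):
    return (3 * mass / (4 * math.pi * a**3)) * (1 + (r / a) ** 2) ** -2.5

def plummer_potential(r, mass, a, G=1.0):
    return -G * mass / math.sqrt(r**2 + a**2)

def mass_enclosed(r, mass, a):
    return mass * r**3 / (r**2 + a**2) ** 1.5

# Half of the mass lies inside r ≈ 1.305 a for a Plummer sphere:
print(round(mass_enclosed(1.3048, 1.0, 1.0), 3))  # ≈ 0.5
```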

Page 42: The Sloan Digital Sky Survey From Big Data to Big Database ...