Astrostatistics: The Role of Statistics in Astronomical Research. Eric Feigelson, Center for Astrostatistics, Penn State University. BigSkyEarth, DLR Germany, April 2016

05 astrostat feigelson

Jan 25, 2017

Marco Quartulli
Transcript
Page 1: 05 astrostat feigelson

Astrostatistics: The Role of Statistics in Astronomical Research

Eric Feigelson, Center for Astrostatistics, Penn State University

BigSkyEarth, DLR Germany, April 2016

Page 2: 05 astrostat feigelson

The underlying situation

Astronomers are well-trained in the mathematics underlying physics, but not in applied fields associated with statistical methodology.

Consequently, many astronomers use a narrow suite of familiar statistical methods that are often non-optimal, and sometimes incorrectly applied, for a wide range of data and science analysis challenges.

This talk highlights some common problems in recent astronomical studies, and encourages use of improved methodology.

Page 3: 05 astrostat feigelson

Outline of this talk

•  Astrostatistics = Astronomy + Statistics: not so simple

•  History of astronomy & statistics: good → bad

•  Astrostatistics today: improving

•  R: The premier statistical computing environment

•  Common statistical problems in astronomical research

Page 4: 05 astrostat feigelson

What is astronomy?

Astronomy is the observational study of matter beyond Earth: planets in the Solar System, stars in the Milky Way Galaxy, galaxies in the Universe, and diffuse matter between these concentrations.

Astrophysics is the study of the intrinsic nature of astronomical bodies and the processes by which they interact and evolve. This is an indirect, inferential intellectual effort based on the assumption that physics (gravity, electromagnetism, quantum mechanics, etc.) applies universally to distant cosmic phenomena.

Page 5: 05 astrostat feigelson

What is statistics? (No consensus!)

–  “… briefly, and in its most concrete form, the object of statistical methods is the reduction of data” (R. A. Fisher, 1922)

–  “Statistics is the mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.” (Wikipedia, 2014.0)

–  “Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.” (Wikipedia, 2014.7)

–  “A statistical inference carries us from observations to conclusions about the populations sampled” (D. R. Cox, 1958)

Page 6: 05 astrostat feigelson

Does statistics relate to scientific models?

The pessimists …

“Essentially, all models are wrong, but some are useful.” (Box & Draper 1987)

“There is no need for these hypotheses to be true, or even to be at all like the truth; rather … they should yield calculations which agree with observations” (Osiander’s Preface to Copernicus’ De Revolutionibus, quoted by C. R. Rao)

“The object [of statistical inference] is to provide ideas and methods for the critical analysis and, as far as feasible, the interpretation of empirical data ... The extremely challenging issues of scientific inference may be regarded as those of synthesising very different kinds of conclusions if possible into a coherent whole or theory ... The use, if any, in the process of simple quantitative notions of probability and their numerical assessment is unclear.” (D. R. Cox, 2006)

Page 7: 05 astrostat feigelson

The positivists …

“The goal of science is to unlock nature’s secrets. … Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. …

“Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface, these go under the general heading of statistical inference.”

(P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, 2005)

Page 8: 05 astrostat feigelson

Recommended steps in the statistical analysis of scientific data

The application of statistics can reliably quantify information embedded in scientific data and help adjudicate the relevance of theoretical models. But this is not a straightforward, mechanical enterprise. It requires:

•  exploration of the data
•  careful statement of the scientific problem
•  model formulation in mathematical form
•  choice of statistical method(s)
•  calculation of statistical quantities
•  judicious scientific evaluation of the results

Astronomers often do not adequately pursue each step


Page 9: 05 astrostat feigelson

•  Modern statistics is vast in its scope and methodology. It is difficult to find what may be useful (jargon problem!), and there are usually several ways to proceed. Very confusing.

•  Some statistical procedures are based on mathematical proofs which determine the applicability of established results. It is perilous to violate mathematical truths! Some issues are debated among statisticians, or have no known solution.

•  Scientific inferences should not depend on arbitrary choices in methodology & variable scale. Prefer nonparametric & scale-invariant methods. Try multiple methods.

•  It can be difficult to interpret the meaning of a statistical result with respect to the scientific goal. Statistics is only a tool towards understanding nature from incomplete information.

We should be knowledgeable in our use of statistics and judicious in its interpretation
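As a small illustration of the point about variable scale: a rank-based correlation (Spearman) gives the same answer whether the variables are analyzed in linear or logarithmic units, while the Pearson correlation does not. The sketch below uses only the Python standard library; the data values are made up for illustration, and the helper names (`pearson`, `ranks`, `spearman`) are not from the talk.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up illustrative data (e.g. two positive quantities spanning decades)
x = [1, 2, 3, 5, 8, 13, 21, 34]
y = [3.5, 1.2, 10, 20, 160, 70, 450, 1100]
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

pearson_raw, pearson_log = pearson(x, y), pearson(log_x, log_y)
spearman_raw, spearman_log = spearman(x, y), spearman(log_x, log_y)

print("Pearson  (linear, log):", pearson_raw, pearson_log)
print("Spearman (linear, log):", spearman_raw, spearman_log)
```

Because the logarithm is monotonic, the ranks are unchanged and the two Spearman values agree, whereas the two Pearson values differ.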

Page 10: 05 astrostat feigelson

Astronomy & Statistics: A glorious past

For most of western history, the astronomers were the statisticians!

Ancient Greeks to 18th century

Best estimate of the length of a year from discrepant data?
•  Middle of range: Hipparchus (2nd century B.C.)
•  Observe only once! (medieval)
•  Mean: Brahe (16th c), Galileo (17th c), Simpson (18th c)
•  Median (20th c)

19th century

Discrepant observations of planets/moons/comets used to estimate orbital parameters using Newtonian celestial mechanics

•  Legendre, Laplace & Gauss develop least-squares regression and normal error theory (c.1800-1820)

•  Prominent astronomers contribute to least-squares theory (c.1850-1900)
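The three historical estimators of a "best value" from discrepant data can be compared directly. This is a minimal sketch with hypothetical measurements of the length of the year (in days), one of which is discrepant; the numbers are invented for illustration only.

```python
import statistics

# Hypothetical measurements of the year length in days; the last is discrepant
obs = [365.24, 365.25, 365.23, 365.26, 365.24, 365.25, 366.10]

midrange = (min(obs) + max(obs)) / 2   # "middle of range"
mean = statistics.mean(obs)            # arithmetic mean
median = statistics.median(obs)        # middle value of the sorted data

print("midrange:", midrange)
print("mean:    ", mean)
print("median:  ", median)
```

With a single discrepant value, the midrange is pulled furthest from the bulk of the data, the mean less so, and the median hardly at all.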


Page 11: 05 astrostat feigelson

The lost century of astrostatistics….

In the late-19th and 20th centuries, statistics moved towards human sciences (demography, economics, psychology, medicine, politics) and industrial applications (agriculture, mining, manufacturing). During this time, astronomy recognized the power of modern physics: electromagnetism, thermodynamics, quantum mechanics, relativity. Astronomy & physics were wedded into astrophysics. Thus, astronomers and statisticians substantially broke contact; e.g. the curriculum of astronomers heavily involved physics but little statistics. Statisticians today know little modern astronomy.


Page 12: 05 astrostat feigelson

The state of astrostatistics today (not good!)

Many astronomical studies are confined to a narrow suite of familiar statistical methods:

–  Fourier transform for temporal analysis (Fourier 1807)
–  Least squares regression for model fits (Legendre 1805, Pearson 1901)
–  Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov 1933)
–  Principal components analysis for tables (Hotelling 1936)

Even traditional methods are often misused: final lecture on Friday

Page 13: 05 astrostat feigelson

Under-utilized methodology:

•  modeling (MLE, EM Algorithm, BIC, bootstrap)
•  multivariate classification (LDA, SVM, CART, RFs)
•  time series (autoregressive models, state space models)
•  spatial point processes (Ripley’s K, kriging)
•  nondetections (survival analysis)
•  image analysis (computer vision methods, False Discovery Rate)
•  statistical computing (R)

Advertisement …

Modern Statistical Methods for Astronomy with R Applications, E. D. Feigelson & G. J. Babu, Cambridge Univ Press, 2012

Winner 2012 PROSE Award for Best Astronomy & Cosmology Book
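Of the under-utilized methods listed, the bootstrap is perhaps the easiest to sketch. The example below is a minimal percentile bootstrap for the confidence interval of a median, using only the Python standard library. The function name `bootstrap_ci`, its parameters, and the sample values are assumptions made for illustration; in R, the talk's recommended environment, packaged implementations exist.

```python
import random
import statistics

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Resamples the data with replacement n_boot times, evaluates the
    statistic on each resample, and returns the alpha/2 and 1-alpha/2
    empirical quantiles of those replicates.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(data)
    reps = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up sample with one outlying value
sample = [12.1, 9.8, 11.4, 10.2, 30.5, 10.9, 11.7, 9.5, 10.4, 11.0]
lo, hi = bootstrap_ci(sample, statistics.median)
print("sample median:", statistics.median(sample))
print("95% bootstrap interval:", lo, hi)
```

The same function works for any statistic that takes a list, which is the main appeal of the method: no analytic error formula is needed.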


Page 14: 05 astrostat feigelson

An astrostatistics lexicon …

Cosmology                        Statistics
Galaxy clustering                Spatial point processes, clustering
Galaxy morphology                Regression, mixture models
Galaxy luminosity fn             Gamma distribution
Power law relationships          Pareto distribution
Weak lensing morphology          Geostatistics, density estimation
Strong lensing morphology        Shape statistics
Strong lensing timing            Time series with lag
Faint source detection           False Discovery Rate
Multiepoch survey lightcurves    Multivariate classification
CMB spatial analysis             Markov fields, ICA, etc
ΛCDM parameters                  Bayesian inference & model selection
Comparing data & simulation      under development
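As one worked entry from the lexicon: an astronomer's "power law" corresponds to the statistician's Pareto distribution, whose shape parameter has a closed-form maximum likelihood estimate, alpha_hat = n / sum(ln(x_i / x_min)). The sketch below assumes the density is proportional to x^-(alpha+1) for x >= x_min with x_min known; the function name and the synthetic data are illustrative assumptions, not material from the talk.

```python
import math

def pareto_index_mle(x, x_min):
    """MLE of the Pareto shape alpha, for density ~ x^-(alpha+1), x >= x_min.

    Returns (alpha_hat, std_err), where std_err = alpha_hat / sqrt(n) is the
    large-sample approximation to the standard error.
    """
    tail = [v for v in x if v >= x_min]
    n = len(tail)
    alpha_hat = n / sum(math.log(v / x_min) for v in tail)
    std_err = alpha_hat / math.sqrt(n)
    return alpha_hat, std_err

# Synthetic data: evenly spaced quantiles of a Pareto with alpha = 2.5, x_min = 1
alpha_true, x_min, n = 2.5, 1.0, 1000
data = [x_min * ((i - 0.5) / n) ** (-1.0 / alpha_true) for i in range(1, n + 1)]

alpha_hat, std_err = pareto_index_mle(data, x_min)
print("alpha_hat:", alpha_hat, "+/-", std_err)
```

No binning of the data is involved, so the estimate does not depend on an arbitrary choice of histogram bins.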


Page 15: 05 astrostat feigelson

Recent resurgence in astrostatistics

•  Improved access to statistical software. R/CRAN public-domain statistical software environment with thousands of functions. Increasing capability in Python.

•  Papers in astronomical literature doubled to ~500/yr in past decade (“Methods: statistical” papers in NASA-Smithsonian Astrophysics Data System)

•  Short training courses (Penn State, India, Brazil, Spain, Greece, China, Italy, France, ESO, ESA, conferences)

•  Cross-disciplinary research collaborations (Harvard/ICHASC, Carnegie-Mellon, Penn State, NASA-Ames/Stanford, CEA-Saclay/Stanford, Cornell, UC-Berkeley, Michigan, Imperial College London, LSST Statistics & Informatics Science Collaboration, …)

•  Cross-disciplinary conferences (Statistical Challenges in Modern Astronomy 1991--, Astronomical Data Analysis, PhysStat, SAMSI programs 2012/16, Astroinformatics 2012--, CosmoStat 2014/16, IAU/WSC/JSM, …)

•  Scholarly society working groups and a new integrated Web portal asaip.psu.edu serving: Int’l Astrostatistical Assn (~ Int’l Statistical Institute), Int’l Astro Union Working Group, Amer Astro Soc Working Group, Amer Stat Assn Interest Group, IEEE Task Force, LSST Science Collaboration

•  Increased review of statistical methodology by journals (Nature, Science, ApJ)

Page 16: 05 astrostat feigelson

Textbooks

Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support, Gregory, 2005

Practical Statistics for Astronomers, Wall & Jenkins, 2nd ed 2012

Modern Statistical Methods for Astronomy with R Applications, Feigelson & Babu, 2012

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data, Ivezić, Connolly, VanderPlas & Gray, 2014

Page 17: 05 astrostat feigelson

A new imperative: Large-scale surveys & megadatasets

Huge imaging, spectroscopic & multivariate datasets are emerging from specialized survey projects & telescopes:

–  10^9-object photometric catalogs from USNO, 2MASS, SDSS, …
–  Spectroscopic catalogs with 10^(6-8) entries from SDSS, LAMOST, …
–  10^(6-7)-source radio/infrared/X-ray catalogs from WISE, eROSITA, …
–  Spectral-image datacubes from VLA, ALMA, IFUs, …
–  10^9-object x 10^2-epoch (3D) surveys (PTF, CRTS, SNF, VVV, Pan-STARRS, Stripe 82, DES, …, LSST)

The Virtual Observatory is an international effort to federate many distributed on-line astronomical databases.

Powerful statistical tools are needed to derive scientific insights from terabyte- to exabyte-scale (TBy-PBy-EBy) databases

Page 18: 05 astrostat feigelson

To treat massive data streams and databases …

Rapid rise of astroinformatics

Statistics guides the scientist on what to compute
Informatics helps the scientist perform the computation

Methodology: Computationally intensive astronomy, data mining, multivariate regression & classification, machine learning, Monte Carlo methods, N log N algorithms, etc.

Software & hardware: Parallel processing on multi-processor machines, cloud computing, CUDA & GPU computing, database management & promulgation, etc.

Workshops & training schools emerging. IAU Symposium #325 Astroinformatics in Sorrento IT, October 2016. Growing perception that more community training is needed.

Page 19: 05 astrostat feigelson

Join a Working Group and the Astrostatistics and Astroinformatics Portal

http://asaip.psu.edu

Recent papers, meetings, jobs, blogs, courses, forums, …

Page 20: 05 astrostat feigelson

A vision of astrostatistics in 2025 …

•  Astronomy graduate curriculum has 1 year of statistical and computational methodology

•  Some astronomers have M.S. in statistics and computer science

•  Astrostatistics and astroinformatics is a well-funded, cross-disciplinary research field involving a few percent of astronomers (cf. astrochemists) pushing the frontiers of methodology.

•  Astronomers regularly use many methods coded in R.

•  Statistical Challenges in Modern Astronomy meetings are held annually with ~400 participants

Page 21: 05 astrostat feigelson

Prelude to R ….

A brief history of statistical computing

1960s – c2000: Statistical analysis developed by academic statisticians, but implementation relegated to commercial companies (SAS, BMDP, Statistica, Stata, Minitab, etc).

1980s: John Chambers (ATT, USA) develops S system, C-like command line interface.

1990s: Ross Ihaka & Robert Gentleman (Univ Auckland NZ) mimic S in an open source system, R. R Core Development Team expands, GNU GPL release.

Early-2000s: Comprehensive R Archive Network (CRAN) for user-provided specialized packages grows exponentially. Important packages incorporated into base-R.

Page 22: 05 astrostat feigelson

Growth of CRAN contributed packages

4 April 2016: 8206 packages (~6/day) ~150,000 functions

See The Popularity of Data Analysis Software, R. A. Muenchen, http://r4stats.com

2 year doubling time
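The two growth figures on this slide can be cross-checked with simple arithmetic, under the assumption (not stated on the slide) that growth is a smooth exponential N(t) = N0 * exp(r t) with r = ln 2 / (doubling time). The variable and function names below are illustrative.

```python
import math

doubling_time_yr = 2.0   # from the slide
n_packages = 8206        # CRAN count on 4 April 2016, from the slide

# Exponential growth rate r in N(t) = N0 * exp(r * t)
growth_rate_per_yr = math.log(2) / doubling_time_yr

# Instantaneous rate of new packages implied by that growth rate
new_per_day = n_packages * growth_rate_per_yr / 365.25

def projected_packages(years_ahead):
    """Extrapolated count, assuming the exponential trend simply continues."""
    return n_packages * 2 ** (years_ahead / doubling_time_yr)

print("implied new packages per day:", new_per_day)
print("projected count two years on:", projected_packages(2))
```

A two-year doubling time implies several new packages per day at the April 2016 count, the same order as the slide's ~6/day. Any projection is only as good as the assumption that the trend continues.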


Page 23: 05 astrostat feigelson

R’s growing importance in data science

[Figure panels: Rexer Analytics Data Miner Survey 2013; Posts on software forums 2013; Job trends from Indeed.com. Plot labels: R, SPSS]

See R vs. Python debates on ASAIP Software Forum

Page 24: 05 astrostat feigelson

The R statistical computing environment

•  R integrates data manipulation, graphics and extensive statistical analysis. Uniform documentation and coding standards. But quality control is limited for community-provided CRAN packages.

•  Fully programmable C-like language, similar to IDL & Matlab. Specializes in vector/matrix inputs.

•  Easy download from http://www.r-project.org for Windows, Mac or Linux. On-the-fly installation of CRAN packages. Quick communication with C, Fortran, Python. Emulator of Matlab.

•  >8000 user-provided add-on CRAN packages, ~150,000 statistical functions

Page 25: 05 astrostat feigelson

•  Many resources: R help files (3500p for base R), CRAN Task Views and vignette files, on-line tutorials, >150 books, >400 blogs, Use R! conferences, galleries, companies, The R Journal & J. Stat. Software, etc.

Principal steps for using R in astronomical research:

–  Knowing what you want [education, consulting, thought]
–  Finding what you want [Google, Rseek, Rdocumentation]
–  Writing R scripts [R Help files, StackOverflow, books]
–  Understanding what you find [education, consulting, thought]

Page 26: 05 astrostat feigelson

Some functionalities of base R

•  arithmetic & linear algebra
•  bootstrap resampling
•  empirical distribution tests
•  exploratory data analysis
•  generalized linear modeling
•  graphics
•  robust statistics
•  linear programming
•  local and ridge regression
•  max likelihood estimation
•  multivariate analysis
•  multivariate clustering
•  neural networks
•  smoothing
•  spatial point processes
•  statistical distributions
•  statistical tests
•  survival analysis
•  time series analysis

Page 27: 05 astrostat feigelson

Selected methods in Comprehensive R Archive Network (CRAN)

Bayesian computation & MCMC, classification & regression trees, genetic algorithms, geostatistical modeling, hidden Markov models, irregular time series, kernel-based machine learning, least-angle & lasso regression, likelihood ratios, map projections, mixture models & model-based clustering, nonlinear least squares, multidimensional analysis, multimodality test, multivariate time series, multivariate outlier detection, neural networks, non-linear time series analysis, nonparametric multiple comparisons, omnibus tests for normality, orientation data, parallel coordinates plots, partial least squares, periodic autoregression analysis, principal curve fits, projection pursuit, quantile regression, random fields, Random Forest classification, ridge regression, robust regression, Self-Organizing Maps, shape analysis, space-time ecological analysis, spatial analysis & kriging, spline regressions, tessellations, three-dimensional visualization, wavelet toolbox

Page 28: 05 astrostat feigelson

CRAN Task Views (http://cran.r-project.org/web/views)

CRAN Task Views provide brief overviews of CRAN packages by topic & functionality. Maintained by expert volunteers.

Partial list:
•  Bayesian            ~110 packages
•  Chem/Phys           ~75 packages (incl. 20 for astronomy)
•  Cluster/Mixture     ~100 packages
•  Graphics            ~40 packages
•  HighPerfComp        ~75 packages
•  Machine Learning    ~70 packages
•  Medical imaging     ~20 packages
•  Robust              ~50 packages
•  Spatial             ~135 packages
•  Survival            ~200 packages
•  TimeSeries          ~170 packages

Page 29: 05 astrostat feigelson

Since c.2005, R has been the world’s premier public-domain statistical computing package

Data scientists recommend both Python and R (https://asaip.psu.edu/forums/software-forum/195790576)

Page 30: 05 astrostat feigelson

Some common statistical problems in astronomical papers

o  Overuse of Kolmogorov-Smirnov test: incorrect significance levels, less sensitive than Anderson-Darling test

o  Overuse of histograms for inference

o  Overuse of heuristic parametric regression (e.g. linear, power law). Use new local regression methods (splines, LOESS, Gaussian Processes regression)

o  Overuse of ‘minimum chi-squared’ regression, assuming scatter is due to measurement errors
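To make the first item concrete, the sketch below computes the Kolmogorov-Smirnov and Anderson-Darling statistics for a sample against a fully specified standard normal model, using their textbook definitions and only the Python standard library. It computes the statistics only, not significance levels. The function names and both synthetic samples are illustrative assumptions: `good` lies exactly on the model quantiles, and `tails` differs from it only in its two extreme points.

```python
import math
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov D: largest gap between the empirical and model CDFs."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def ad_statistic(sample, cdf):
    """Anderson-Darling A^2: weighted CDF discrepancy emphasizing the tails."""
    xs = sorted(sample)
    n = len(xs)
    f = [cdf(x) for x in xs]
    s = sum(
        (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s / n

model = NormalDist(0.0, 1.0)
n = 20

# Synthetic sample lying exactly on the model quantiles
good = [model.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Same sample, but with the two extreme points pushed far into the tails
tails = list(good)
tails[0], tails[-1] = -4.0, 4.0

for name, sample in (("good", good), ("tails", tails)):
    print(name, "D =", ks_statistic(sample, model.cdf),
          "A2 =", ad_statistic(sample, model.cdf))
```

In this toy case D stays small for both samples because it is bounded by the CDF gap at any one point, while A^2 grows substantially for the sample with the displaced extremes.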


Page 31: 05 astrostat feigelson

o  Overuse of regression when response variable not specified by science

o  Underuse of Poisson & logistic regression

o  Insufficient examination of regression results: R^2, residual analysis (test for normality, autocorrelation, outliers via Cook’s distance)

o  Overuse of Bayesian inference with uninformative priors

o  Overuse of ‘friends-of-friends’ algorithm or subjective evaluation for unsupervised clustering

o  Underuse of machine learning methods for supervised classification (CART/Random Forests, Support Vector Machines, neural networks, …)
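The regression diagnostics mentioned above (R^2, residuals, Cook’s distance) are cheap to compute for a straight-line fit. The sketch below is a minimal standard-library version for simple least-squares regression with an intercept; the function name and the synthetic data (an exact line with one discrepant point) are illustrative assumptions.

```python
def simple_ols_diagnostics(x, y):
    """Least-squares line y = intercept + slope*x with basic diagnostics.

    Returns slope, intercept, R^2, the residuals, and Cook's distance for
    each point (using leverage h_i = 1/n + (x_i - mean_x)^2 / Sxx).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx

    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    sst = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - sse / sst

    p = 2                      # number of fitted parameters
    s2 = sse / (n - p)         # residual variance estimate
    leverage = [1.0 / n + (a - mx) ** 2 / sxx for a in x]
    cooks = [e ** 2 * h / (p * s2 * (1.0 - h) ** 2)
             for e, h in zip(resid, leverage)]
    return slope, intercept, r2, resid, cooks

# Synthetic data: an exact line y = 2x + 1 with one discrepant final point
x = list(range(10))
y = [2.0 * a + 1.0 for a in x]
y[-1] += 15.0

slope, intercept, r2, resid, cooks = simple_ols_diagnostics(x, y)
worst = max(range(len(cooks)), key=cooks.__getitem__)
print("slope:", slope, "intercept:", intercept, "R^2:", r2)
print("most influential point:", worst, "Cook's distance:", cooks[worst])
```

Here R^2 alone looks respectable, while Cook’s distance singles out the final point as dominating the fit, which is the kind of information the slide says is often left unexamined.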


Page 32: 05 astrostat feigelson

Conclusion

While a vanguard of astronomers use and develop advanced methodologies for specific applications, many studies utilize a narrow suite of familiar methods.

Astronomers need to become more informed and more involved in statistical methodology, for both data analysis and for science analysis.

Areas of common weakness of statistical analyses in astronomical studies can be identified. Improvement is often not difficult.

Highly capable free software, such as R/CRAN, can be effective in bringing new methodology to bear on astronomical problems.