How to succeed in reproducible research without really trying
Geoffrey M. Oxberry
Lawrence Livermore National Laboratory
Computational Engineering Division
Energy Conversion and Storage
This work was performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Views expressed in this work are solely those of its author and do not reflect the views of Lawrence Livermore National Laboratory.
February 28, 2013
(LLNL-PRES-621574-DRAFT)
Computational science has culture problems
Lack of verification: Where’s the bug?
  - My method?
  - Its implementation?
  - Its dependencies?
Lack of transparency
  - No public code & data = tough for others to debug
  - Do results only show “good” case studies?
Efficiency
  - Costly to implement everything from scratch
  - Bad record-keeping makes research, revision, collaboration harder
We can fix these problems... we have the technology
Tools available to automate:
  - Record-keeping
  - Testing
  - Building paper
Services available for storing research outputs
Reproducible research practices: a solution
Reproducibility of work yields:
  - Verification: easier to find and fix bugs
  - Transparency: leads to increased citation count, broader impact
  - Efficiency: via de-duplication of effort
In this talk, I ignore restrictions due to:
  - Classified or sensitive material
  - Nondisclosure agreements
  - Software licensing issues
Partial reproducibility may still be possible despite restrictions
How to succeed in reproducible research without really trying
State of reproducible research to date
  - Defining “reproducible research”
  - Motivating reproducible research
Tools and services to do reproducible research at reasonable cost
  - How and where to host code & data
  - Automate verification
  - Where to host everything else, and getting credit for it
Challenges still exist
  - Objections, and overcoming them
  - Needed policies, tools, cultural changes
Reproducibility has a long history
Mathematical proof is one form of reproducibility
  - First: Greek mathematicians (ca. 400 BC)
  - Modern rigorous proof: 1800s
Notable experimental examples
  - Galileo (1620s) built several copies of his telescope
  - Pasteur added “Materials and Methods” sections to his journal articles [1]
Modern scientific movements
  - Structural and protein biology (1980s) [2]
  - Political science (1990s) [3]
  - Genomics and genetics (2000s) [2,4,5]
  - Statistics (2010s) [6]
Reproducible research = “post paper and all supporting materials”
Reproducible research has many definitions [7]
In this presentation, “reproducible research” means submitting at minimum:
  - the paper
  - all code & data to reproduce results under open source licenses [8]
  - README files describing code & data
Minimum standard chosen to minimize cost
Can be helpful to do more
Most computational science not reproducible currently
People do not post the code & data with their work [2,9,10]
Even reproducible research gurus have published papers without code & data
They recount those experiences as cautionary tales [1,7,11,12]
Lack of reproducibility causes problems
Typical anecdotes [1,7,9,11–13]:
  - Which version of code goes with paper?
  - Where’s the bug: method or implementation?
  - Easy to forget research set aside for months
  - Results can depend on “magic parameter settings”
  - New person in lab can’t repeat former grad student’s work
Reproducible research helps avoid these issues
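One low-cost way to guard against forgotten "magic parameter settings" is to save a small record next to every result. The sketch below is illustrative only and is not from the talk: the function names, the parameter names (`tolerance`, `max_iterations`), and the sample data are all hypothetical.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw input data."""
    return hashlib.sha256(data).hexdigest()


def make_run_record(params: dict, input_data: bytes) -> dict:
    """Bundle the details needed to repeat a run into one dictionary."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": params,  # the "magic parameter settings"
        "input_sha256": sha256_of_bytes(input_data),
    }


# Hypothetical parameters and input data, for illustration only.
record = make_run_record(
    {"tolerance": 1e-8, "max_iterations": 500},
    b"1.0,2.0,3.0\n",
)
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

Writing `record_json` to a file beside each figure or table gives a new lab member the settings and an input checksum to compare against when repeating the work.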
Doing reproducible research has benefits
Reproducible research tends to be cited more [7,14]
In addition, reproducible research has the following anecdotal benefits [1,7,11,12]:
  - Enhanced knowledge transfer
  - Easier to resume projects after hiatus
  - Easier to train new researchers
  - Decreases time to revision
  - Attracts collaborators
  - Decreases debugging time
Reaping benefits of reproducible research reduces to habits and practices
Basic principles of reproducibility in computational science like those in experimental sciences
Like experimentalists, computational scientists need to:
  - Keep good records: notebooks and version control!
  - Include code, data, proofs, derivations
  - Use tests as control experiments
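To make "tests as control experiments" concrete, here is a minimal sketch using Python's standard `unittest` module. The `trapezoid` function is a hypothetical stand-in for research code, and the two tests check it against integrals whose exact values are known, much as a control experiment checks an apparatus against a known outcome.

```python
import math
import unittest


def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


class TrapezoidControlTests(unittest.TestCase):
    def test_exact_for_linear_function(self):
        # The trapezoid rule is exact for straight lines:
        # the integral of 2x + 1 over [0, 3] is 12.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x + 1, 0.0, 3.0, 4), 12.0)

    def test_converges_for_sine(self):
        # The integral of sin over [0, pi] is 2.
        self.assertAlmostEqual(trapezoid(math.sin, 0.0, math.pi, 1000), 2.0, places=5)


# Run the controls programmatically so the sketch also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TrapezoidControlTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a later change to `trapezoid` breaks either control, the failure points at the implementation rather than the method, which speaks to the "where's the bug?" question.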
Tools enable adopting these practices at reasonable cost
Automating habits with tools and services reduces their cognitive burden
  - Version control systems
  - Repository hosting web sites
  - Unit testing frameworks
  - Build systems
  - Figure, data, preprint archives
Examples, payoffs, and estimated costs (to learn basics) given
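As a rough illustration of what a build system automates, the sketch below re-runs a step only when its output is missing or older than its inputs. It is a toy under stated assumptions, not a substitute for a real build tool; `is_stale`, `build`, and the file names are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run action only when target is stale; return whether it ran."""
    if is_stale(target, sources):
        action()
        return True
    return False


with tempfile.TemporaryDirectory() as workdir:
    data = os.path.join(workdir, "data.csv")        # hypothetical input
    summary = os.path.join(workdir, "summary.txt")  # hypothetical output
    with open(data, "w") as fh:
        fh.write("1\n2\n3\n")

    def summarize():
        # Regenerate the output from the input data.
        with open(data) as src, open(summary, "w") as out:
            out.write(str(sum(int(line) for line in src)))

    first_run = build(summary, [data], summarize)   # output missing, so it runs
    second_run = build(summary, [data], summarize)  # output up to date, so it is skipped
    with open(summary) as fh:
        summary_text = fh.read()

print(first_run, second_run, summary_text)
```

The same dependency idea extends to figures and the paper itself: each output lists the files it depends on, and only out-of-date outputs are regenerated.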
Version control systems track all code changes in repositories
Examples: Git, Mercurial, SVN, CVS, etc.
Payoffs
  - Eases collaboration
  - Can track changes in any file type, and who made them
  - Can revert file to any point in its tracked history
Costs
  - 2-3 days to learn; learn by using on everything you can
  - SVN, CVS require their own server (Git & Mercurial don’t)
  - Takes a long time to master (much like LaTeX, MATLAB)
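Once code is under version control, results can be stamped with the exact revision that produced them. The sketch below assumes Git is the system in use and calls the standard `git rev-parse HEAD` command; the helper names are hypothetical, and the function returns `None` rather than failing when Git or a repository is not available.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the commit hash checked out in repo_dir, or None if unavailable."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, the directory is missing,
        # or repo_dir is not a repository with at least one commit.
        return None
    return completed.stdout.strip()


def results_label(repo_dir="."):
    """Build a one-line label to store alongside generated results."""
    commit = current_commit(repo_dir)
    return f"code version: {commit}" if commit else "code version: unknown"


print(results_label())
```

Storing this label in figure captions or output files is one way to record which version of the code goes with which result.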
verification, transparency, and efficiency
Reproducible research is about sharing data & code with papers
Verification
  - Publicly posted code & data makes checking work easier
  - Use tools like version control, unit testing, file hosting
Transparency
  - Concerns and questions addressed by looking at code & data
  - GitHub and BitBucket list record of all changes to code, data, paper
  - FigShare used to share data publicly, track its citations
Efficiency
  - Others do not necessarily have to redo existing work
  - Get more citations per paper
  - Easier to remember what you did after a long break
  - Easier to build upon, collaborate, transfer knowledge
Available tools let you do reproducible research without really trying
Learning tools requires small investments to automate record-keeping
Good record-keeping protects investments in scholarship
Get more citations, save time later
Posting code much better than not
Acknowledgments
Victoria Stodden (posted a literature review)
ICERM (hosted reproducibility workshop)
Jaydeep Bardhan, Ahmed E. Ismail, Matthew Reuter, and friends (helpful discussions)
Matt McNenly, Dan Flowers, Russell Whitesides, and LLNL colleagues (helpful discussions)
Lawrence Livermore National Laboratory (funding via postdoc account)
Gurpreet Singh (program manager)
References
[1] Buckheit, J.B. and Donoho, D.L. Wavelab and reproducible research. (1995).
[2] Morin, A. et al. Shining light into black boxes. Science. 336, (2012), 159–160.
[3] King, G. Replication, Replication. PS: Political Science and Politics. (1995).
[4] Schofield, P.N. et al. Post-publication sharing of data and tools. Nature. 461, (2009), 171–173.
[5] Birney, E. et al. Prepublication data sharing. Nature. 461, (2009), 168–70.
[6] Peng, R.D. Reproducible research and Biostatistics. Biostatistics (Oxford, England). 10, (2009), 405–408.
[7] Vandewalle, P. et al. Reproducible research in signal processing – What, why, and how. IEEE Signal Processing Magazine. 26, (2009), 37–47.
[8] Stodden, V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright. Computing in Science & Engineering. 11, (2009), 35–40.
[9] Merali, Z. Error: Why scientific programming does not compute. Nature. (2010), 6–8.
[10] Barnes, N. Publish your computer code: it is good enough. Nature. 467, (2010), 753.
[11] LeVeque, R.J. Python tools for reproducible research on hyperbolic problems. Computing in Science & Engineering. (2009), 19–27.
[12] LeVeque, R.J. Wave propagation software, computational science, and reproducible research. Proceedings of the International Congress of Mathematicians (Madrid, Spain, 2006), 1–27.
[13] Price, K. Anything You Can Do, I Can Do Better (No You Can’t)... Computer Vision, Graphics, and Image Processing. (1986), 387–391.
[14] Piwowar, H.A. et al. Sharing detailed research data is associated with increased citation rate. PLoS ONE. 2, (2007), 308.
[15] Wilson, G. et al. Best Practices for Scientific Computing. 1–6.
[16] Drummond, C. Reproducible Research: a Dissenting Opinion. (2012), 1–10.
[17] Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nature Genetics. 41, (2009), 149–55.
[18] Savage, C.J. and Vickers, A.J. Empirical study of data sharing by authors publishing in PLoS journals. PLoS ONE. 4, (2009), 7078.
[19] Quirk, J. Computational Science “Same Old Silence, Same Old Mistakes” “Something More Is Needed...” Adaptive Mesh Refinement - Theory and Applications. (2005), 4–28.
[20] McCullough, B.D. Got Replicability? The Journal of Money, Credit and Banking Archive. Econ Journal Watch. 4, (2007), 326–337.
[21] McCullough, B.D. Do economics journal archives promote replicable research?. Economics Journal Archives. (2008).
[22] Manolescu, I. et al. Repeatability & Workability Evaluation of SIGMOD 2009. SIGMOD 2009 (2009), 2–4.
[23] Freire, J. et al. Computational reproducibility: state-of-the-art, challenges, and database research opportunities. SIGMOD 2012 (2012), 593–596.