Prepared for Program on Information Science – Brown Bag Talks MIT March 2015 Modeling Reproducibility from an Informatics Perspective Dr. Micah Altman <[email protected]> Director of Research, MIT Libraries Head/Scientist, Program on Information Sciences <informatics.mit.edu>
Prepared for
Program on Information Science – Brown Bag Talks
MIT
March 2015
Modeling Reproducibility from an Informatics Perspective
Director of Research, MIT Libraries
Head/Scientist, Program on Information Sciences
<informatics.mit.edu>
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
• Kobbi Nissim, Michael Bar-Sinai, Salil Vadhan & the Privacy Tools for Research Data Project <http://privacytools.seas.harvard.edu/>
• Jeff Gill
• Michael P. McDonald
Research Support
• Sloan Foundation
• National Science Foundation (Award #1237235)
Related Work
• Allen, Liz, et al. "Credit where credit is due." Nature 508.7496 (2014): 312-313.
• Altman, M., & Crosas, M. (2013). The evolution of data citation: From principles to implementation. IASSIST Quarterly, 37.
• Garnett, A., Altman, M., Andreev, L., Barbarosa, S., Castro, E., Crosas, M., ... & Yang, X. (2013, May). Linking OJS and Dataverse. In PKP Scholarly Publishing Conference 2013.
• Altman, M., Fox, J., Jackman, S., & Zeileis, A. (2011). An Introduction to the Special Volume on "Political Methodology". Journal of Statistical Software, 42(i01).
• Altman, M. (2008). A fingerprint method for scientific data verification. In Advances in Computer and Information Sciences and Engineering (pp. 311-316). Springer Netherlands.
• Altman, M., & King, G. (2007). A proposed standard for the scholarly citation of quantitative data. D-lib Magazine, 13(3/4).
• Altman, Micah, Jeff Gill, and Michael P. McDonald. (2004). Numerical issues in statistical computing for the social scientist. John Wiley & Sons.
• Altman, M., & McDonald, M. P. (2003). Replication with attention to numerical accuracy. Political Analysis, 11(3), 302-307.
• Altman, Micah. "A review of JMP 4.03 with special attention to its numerical accuracy." The American Statistician 56.1 (2002): 72-75.
• Altman, M., & McDonald, M. P. (2001). Choosing reliable statistical software. Political Science & Politics, 34(03), 681-687.
• Altman, M., Andreev, L., Diggory, M., King, G., Kolster, E., Sone, A., ... & Krot, M. (2001, January). Overview of the virtual data center project and software. In Proceedings of the JCDL 2001 (pp. 203-204). ACM.
Roadmap for this Talk
Reproducibility Concerns…
Modeling Reproducible Research from an Information Perspective
How can informatics improve reproducibility?
Information Perspective
Increased Retractions, Allegations of Fraud
What Goes in the File Drawer?
[Figure: Daniel Shechtman’s lab notebook, providing initial evidence of quasicrystals]
• Null results are less likely to be published → published results as a whole are biased toward positive findings
• Outliers are routinely discarded → unexpected patterns of evidence across studies remain hidden
Replicability of Published Results
Many journals have no replication policy
Even in journals with clear policy, success rate is low
Many Initiatives to Improve Scientific Reliability
• Retraction monitoring
• Data citation
• Clinical trial preregistration
• Registered replication
• Open data
• Badges
Reproducibility Concerns
Framing Reproducibility from an Informatics Perspective
Reproducibility claims are not formulated as direct claims about the world…
1. What claims about information are implied by reproducibility claims/issues?*
2. What properties of information and information flow are related to those claims?
3. What would possible changes to information processing and flow yield? (And how much would they cost?)
Improving Reproducibility
Some Types of Reproducibility Issues/Use Cases
(Each entry lists: Common Labels; Reproducibility Related Issue; Example Interventions)
• Common labels: Misconduct, Bit Rot, Author Responsibility
  Issue: Data was fabricated, corrupted, or radically misinterpreted prior to analysis
  Example interventions: Discipline/community data archives; NIH genomic data sharing policy; RetractionWatch; Collaborative Data Collection Projects
• Common labels: Misconduct, Negligence, Confusion, Typo, Proofreader error*, Dynamic Data Problem, Versioning problem
  Issue: Data {referenced by identifier | provided as an instance | described by method} has a nontrivial set of semantic differences from that used as input to the publication
  Example interventions: Dat, DataHub, Dataverse (versioning)
• Common labels: Misconduct, Negligence, Harmless Error
  Issue: Published analysis algorithm does not correspond to implemented analysis
  Example interventions: S/Weave; Compendia; VisTrails
• Common labels: Reproducibility [NSF; Donoho 1995]; Replicability [King 1995, many journals]
  Issue: Variance of estimates given data instance & analysis implementation
• Issue: Author bias toward creating significant results, resulting in a difference between stated method/analysis and actual (complete) method/analysis
  Example interventions: Holdout Data Escrow
• Common labels: Sensitivity, Robustness
  Issue: Variance of support for claims across specification change
  Example interventions: Sensitivity Analysis
• Common labels: Reliability
  Issue: Variance of support for claims across repeated measures, samples
  Example interventions: Meta-analysis; Cochrane Review; Data Integration
• Common labels: Generalizability
  Issue: Variance of support for claims across different frames
  Example interventions: Cochrane Review
• Common labels: Laws, Truth
  Issue: Variance of support for claims to other populations
  Example interventions: Grand Challenge ?
• …
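A minimal sketch of the "Sensitivity, Robustness" entry, assuming the claim of interest is a regression slope and that a "specification change" is a different rule for which observations to keep. The data, the specification names, and the helper functions are all hypothetical illustrations, not taken from the talk.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def estimates_across_specifications(xs, ys, specifications):
    """Re-estimate the slope under each inclusion rule (hypothetical helper)."""
    results = {}
    for name, keep in specifications.items():
        kept = [(x, y) for x, y in zip(xs, ys) if keep(x, y)]
        results[name] = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return results


# Made-up data: a clean linear trend plus one extreme final observation.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]

# Each specification is an assumed, "reasonable" alternative analysis choice.
specifications = {
    "full_sample": lambda x, y: True,
    "drop_large_y": lambda x, y: y <= 20,
    "first_four_only": lambda x, y: x <= 4,
}

estimates = estimates_across_specifications(xs, ys, specifications)
# A large spread signals that support for the claim depends on the specification.
spread = max(estimates.values()) - min(estimates.values())
```

Reporting the whole dictionary of estimates, rather than one preferred number, is one way to make the variance across specifications visible to a reader.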
Operational Reproducibility Claims
Reproducibility Related Issue / Related Informatics Claims
Label: Validation, Fact Checking
Reproducibility Issue: Variance of estimates given data identifier & analysis algorithm
Reproducibility Claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.
Use Case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…
Potential supporting informational claims
1. Instance of data retrieved via identifier is semantically equivalent to the instance of data used to support the published claim
2. Analysis algorithm is robust to choice of reasonable alternative implementation
3. Implementation of algorithm is robust to reasonable choice of execution details and context
4. Published direct claims about data are semantically equivalent to a subset of the claims produced by the authors' previous application of the analysis
5. …
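Claim 1 turns on semantic rather than byte-level equivalence. The sketch below is one hypothetical way to test it: normalize each value to a canonical text form, then hash the result. The normalization rules (seven significant digits, trimmed strings, "NA" for missing) are assumptions made for illustration; this is not the fingerprint algorithm from the Related Work list or any published standard.

```python
import hashlib


def canonical_value(value, digits=7):
    """Render one cell in a canonical text form (assumed normalization rules)."""
    if value is None:
        return "NA"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        # Keep a fixed number of significant digits so that storage format
        # (1 vs 1.0 vs 1.0000000001) does not change the fingerprint.
        return format(float(value), f".{digits - 1}e")
    return str(value).strip()


def fingerprint(rows, digits=7):
    """SHA-256 over the canonical form of every row, in order."""
    h = hashlib.sha256()
    for row in rows:
        line = "\t".join(canonical_value(v, digits) for v in row)
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Made-up example: the same data stored two ways, plus a genuinely altered copy.
published = [(1, 2.50, "a"), (2, 3.75, "b")]
retrieved = [(1.0, 2.5000000001, " a "), (2.0, 3.75, "b")]
altered = [(1, 2.51, "a"), (2, 3.75, "b")]

same_content = fingerprint(published) == fingerprint(retrieved)
detects_change = fingerprint(published) != fingerprint(altered)
```

What counts as a "semantic" difference is a disciplinary judgment; the number of significant digits here stands in for that judgment and would need to be chosen per dataset.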
Potential information systems properties supporting claims
1a. Detailed provenance history for data from collection through analysis and deposition
1b. Automatic replication of direct data claims from deposited source
1c. Cryptographic evidence (e.g. cryptographically signed {analysis output, including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier})…
2a. Standard implementation, subject to community review
2b. Report of results of application of implementation on standard testbed
2c. Availability of implementation for inspection
….
3. …
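Property 1c can be sketched with the Python standard library. The assumption here is that the author binds the analysis output to a hash of the input data and authenticates the pair, and that a reviewer later compares this against a hash of the data retrieved via the identifier. HMAC with a shared key stands in for a real signature only to keep the example self-contained; a deployed system would need a public-key signature to get non-repudiation. All function and field names are hypothetical.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sign_analysis_record(output_text: str, data_bytes: bytes, key: bytes):
    """Bind the analysis output to the hash of the data it was computed from."""
    record = {"output": output_text, "data_sha256": sha256_hex(data_bytes)}
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # HMAC is a keyed MAC, used here as a stand-in for a digital signature.
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record, tag


def verify(record, tag: str, retrieved_bytes: bytes, key: bytes) -> bool:
    """Check the record is untampered and matches the data retrieved later."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    record_ok = hmac.compare_digest(expected, tag)
    data_ok = record["data_sha256"] == sha256_hex(retrieved_bytes)
    return record_ok and data_ok


# Made-up example values.
key = b"demo-key-not-for-real-use"
deposited = b"id,value\n1,2.5\n2,3.75\n"
record, tag = sign_analysis_record("mean(value) = 3.125", deposited, key)

matches = verify(record, tag, deposited, key)
corrupted = verify(record, tag, b"id,value\n1,2.5\n2,9.99\n", key)
```

This check is byte-level: a harmless reformatting of the deposited file would fail it, which is why it complements rather than replaces a test of semantic equivalence.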
Conjectures: How Could Informatics Improve Reproducibility
Formal Properties
(Some formal properties on information flow and management tend to support reproducibility related inferences…)
• Transparency
• Auditability
• Provenance
• Fixity
• Identification
• Durability
• Integrity
• Repeatability
• Self-documentation
• Non-repudiation
Properties applied to different stages, entities, and to components of the information system itself
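Several of these properties (provenance, fixity, auditability, integrity) can be illustrated together with a hash-chained event log, in which each entry commits to the one before it. This is a minimal sketch under assumed field names (`actor`, `action`, `detail`), not a description of any existing system.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log: list, actor: str, action: str, detail: str) -> list:
    """Append a provenance event that commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "detail": detail, "prev": prev}
    log.append({**body, "hash": _digest(body)})
    return log


def audit(log: list) -> bool:
    """Recompute every hash and link; any edit to history breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Made-up provenance history for a dataset, from collection to deposit.
log = []
append_event(log, "lab", "collect", "survey wave 1")
append_event(log, "analyst", "clean", "dropped 3 incomplete responses")
append_event(log, "archive", "deposit", "version 1")

intact = audit(log)
```

The chain only shows that recorded history was not altered afterwards; it says nothing about whether the events were recorded truthfully in the first place.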
Systems Property*
(How does the system interact with users, and what incentives and culture does it engender?)
• Barriers to entry
• Ease of use
• Support for intellectual communities
• Speed and performance
• Security
• Access control
• Personalization
• Credit and attribution
• Incent well-founded trust among actors
• Disincent “glamour & deceit”
Systems Oriented
(How does the system integrate into the research ecosystem?)
• Sustainability
• Cost
• Incent well-founded trust in system and outputs
Discussion – How can we better support reproducibility with information infrastructure?*
• How can we better identify the inferential claims implied by a specific set of (non)reproducibility claims/issues?
• Which information flows and systems are most closely associated with these inferential claims?
• Which properties of information systems support generating these inferential claims?
Additional References
• de Waard, A. (2010). The story of science: a syntagmatic/paradigmatic analysis of scientific text. In Proceedings of the AMICUS Workshop (pp. 36-41).
• Gentleman, R., & Lang, D. T. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16(1).
• Freire, Juliana. "Making computations and publications reproducible with vistrails." Computing in Science & Engineering 14.4 (2012): 18-25.
• Kevles, Daniel J. The Baltimore case: A trial of politics, science, and character. WW Norton & Company, 2000.
• King, G. (1995). Replication, replication. PS: Political Science & Politics, 28(03), 444-452.
• McCullough, B. D. (2009). Open access economics journals and the market for reproducible economic research. Economic Analysis and Policy, 39(1), 117-126.