OPEN DATA: ENHANCING PRESERVATION, REPRODUCIBILITY, AND INNOVATION
Clarke Iakovakis, Scholarly Communications Librarian, Neumann Library
CC BY-SA 3.0-2.5-2.0-1.0 image courtesy Daniel Tenerife - Own work. Title: "Social Red" https://commons.wikimedia.org/wiki/File:Social_Red.jpg#mediaviewer/File:Social_Red.jpg
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
“data is no longer regarded as static or stale, whose usefulness is finished once the purpose for
which it was collected was achieved.”
- Kenneth Cukier and Viktor Mayer-Schönberger
“in some fields, the data are coming to be viewed as an essential end product of research, comparable in value to journal articles or
conference papers”
- Christine Borgman
OUTLINE
• Data-centric scholarship
• Benefits & challenges of open data
• Defining open data
• Reproducibility
• Public use & data management plans
• Data reuse
• Concerns and open questions
• Where to deposit data?
SECTION 1
FROM DOCUMENT TO DATA-CENTRIC VIEW OF SCHOLARSHIP
WHAT WE MEAN BY “DATA”
A wide definition:
any information that can be stored in digital form, including text, numbers, images, video or movies,
audio, software, algorithms, equations, animations, models, simulations, etc. Such data
may be generated by...observation, computation, or experiment
- National Science Board
National Science Board. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA (2005): 13. https://www.nsf.gov/pubs/2005/nsb0540/nsb0540_3.pdf
WHAT IS RESEARCH DATA?
collected, observed, accessed, or created, for the purposes of analysis to produce and validate original
research results.
What is a routine collection at one point can become research data in the future
Thus research data are very much about when they are used, as well as what they constitute, and
the purpose for which they are to be used
University of Edinburgh. “Research Data Explained.” http://mantra.edina.ac.uk/researchdataexplained/
WHAT IS RESEARCH DATA?
Hard science: scientific data generated by instrumented research projects
Social science: data generated from government statistics, online surveys, behavioral models
Humanities: bodies of text, digital images and video, models of historic sites
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
DATA INTENSIVE RESEARCH
Applying information technology to research problems
Collaborations across disciplines & increasing size of collaborations
Increasing the complexity and quantity of research data
DATA INTENSIVE RESEARCH
• Scientific instruments generate data at greater speeds, densities, and detail
• Digitization of older print & analog data
• Born digital data
• Data storage capacity increases & storage costs decrease, enabling preservation of data
• Improvements in searching, analysis & visualization tools
World’s technological installed capacity to store information (table SA1) (16).
M Hilbert, and P López Science 2011;332:60-65
SLOAN DIGITAL SKY SURVEY
The most distant quasar ever discovered (at least as of October 2003). The redshift 6.4 quasar is seen at a time when the universe was just 800 million years old. The light-travel time from this object to us is about 13 billion years.
http://www.sdss.org
Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It
has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives
profitable activity; so must data be broken down, analyzed for it
Pryor, Graham. “Why Manage Research Data?” Managing Research Data. London: Facet Publishing, 2012.
VALUE OF DATA
Value of a dataset can be:
• Immediate
• Gained over time
• Transient
• Little (i.e. it’s easier to recreate than curate)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
VALUE OF DATA
“Fundamentally, there is a shift from a document-centric view of scholarship to a data-
centric view of scholarship”
- Sayeed Choudhury
Choudhury, Sayeed. "Data curation: An ecological perspective." College & Research Libraries News 71, no. 4 (2010): 194-196.
WHY OPEN?
Data that underpin a journal article should be made concurrently available in an accessible database.
We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be
interoperable.
Adapted from Jonathan Tedds, Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data." Licensed under CC BY.
Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, et al. "The Availability of Research Data Declines Rapidly with Article Age." Current Biology 24, no. 1 (January 6, 2014): 94-97. https://linkinghub.elsevier.com/retrieve/pii/S0960-9822(13)01400-0
Researchers requested data sets from a relatively homogeneous set of 516 articles published 1991-2011 in the field of zoology
Tracking down the authors & getting a response was the first challenge.
For every yearly increase in article age, the odds of the data set being reported as extant decreased by 17%
When the authors did give the status of their data, the proportion of data sets that still existed dropped from 100%
Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, et al. "The Availability of Research Data Declines Rapidly with Article Age." Current Biology 24, no. 1 (January 6, 2014): 94-97. https://linkinghub.elsevier.com/retrieve/pii/S0960-9822(13)01400-0
Many of these missing data sets could be retrieved only with considerable effort by the
authors, and others are completely lost to science
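The 17% yearly decline in odds reported by Vines et al. compounds multiplicatively, which is easy to underestimate. The following Python sketch works through that arithmetic. The starting odds of 9 (a 90% chance that the data exist at publication) and the function names are assumptions made for illustration; they are not figures or methods taken from the study.

```python
# Illustrative arithmetic only: compounding a 17% yearly decline in the odds
# that a data set is reported as extant.
YEARLY_ODDS_RATIO = 1 - 0.17  # 0.83, from the reported 17% decrease per year


def odds_multiplier(years: int, yearly_ratio: float = YEARLY_ODDS_RATIO) -> float:
    """Factor by which the starting odds are multiplied after `years` years."""
    return yearly_ratio ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds, defined as p / (1 - p), back to a probability p."""
    return odds / (1 + odds)


if __name__ == "__main__":
    starting_odds = 9.0  # hypothetical: 90% chance the data exist at publication
    for years in (0, 5, 10, 20):
        odds = starting_odds * odds_multiplier(years)
        probability = odds_to_probability(odds)
        print(f"{years:>2} years: odds {odds:.2f}, probability {probability:.0%}")
```

Under this simple compounding assumption, the odds after 20 years are roughly 2% of their starting value, since 0.83 raised to the 20th power is about 0.024.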
Adapted from Mitcham, Jenny & Lindsey Myers. “Managing your research data”. Licensed under CC BY-NC-SA
DATA LOSS
• Human error
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• Format obsolescence
• Legal encumbrance
• Malicious attack
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
Peters, Christie. Research Data Management: Basics and Best Practices. http://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1000&context=rdsc_workshops. Licensed under CC BY
• Have you seen a shift to a data-centric research culture in your discipline?
• Is data availability a concern among you or your colleagues?
• Other ideas & questions
• Up next: Open Data Benefits and Challenges
SECTION 2
BENEFITS AND CHALLENGES OF OPEN DATA
WHAT IS OPEN DATA?
HIGH ASPIRATIONS, LOW UPTAKE
• Berlin Declaration for Access to Knowledge in the Sciences and Humanities (2003: 572 institutions)
• Recommendations for Access to Data from Publicly Funded Research (2006, all OECD member states)
CULTURE CHANGE?
A survey of 17,000 UK doctoral students showed that they are privately open to sharing resources.
In practice, however, they followed the behaviors of their supervisors and feared losing future publication opportunities.
Researchers of Tomorrow – The Research Behaviour of Generation Y Doctoral Students. London, United Kingdom: JISC. Retrieved from: http://www.jisc.ac.uk/publications/reports/2012/researchers-of-tomorrow.
Tenopir, C, Dalton, E D, Allard, S, Frame, M, Pjesivac, I, Birch, B, Pollock, D and Dorsett, K (2015). Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide. PLoS ONE 10(8): e0134826. DOI: https://doi.org/10.1371/journal.pone.0134826
JOURNAL MANDATES
• Mandatory requirement to archive data publicly unless there is a valid reason not to
• Response to low voluntary uptake
• To allow reproduction of reported results
• Ecology, evolution, biology
• These policies do work to increase data archiving
• However, the quality varies…
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
JOURNAL MANDATES
Researchers surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong public data archiving (PDA) policy.
Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely
prevented reuse.
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
REPLICATION/REPRODUCIBILITY
“True reproducibility requires deep engagement with the epistemological questions of a given
research specialty, and the very different ways in which investigators obtain and value evidence”
“As rationale for sharing research, reproducibility…risks reducing the research process to a set of mechanistic procedures”
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
REPRODUCIBILITY AS RATIONALE
• Where data deposit is required as condition of publication (e.g. Protein Data Bank), researchers will comply
• Data sharing is more likely if:
• Materials/documentation are automated
• Data is not sensitive/no licensing restrictions apply
• Publication is completed
• Data is not part of a long-term study integral to researcher’s career
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
DISCUSSION
Is there a reproducibility crisis?
If so, to what extent can data sharing remedy the crisis?
Other questions/comments
Up next: data management plans & sharing for public use
PUBLIC USE
Tax monies should be leveraged to serve the public good
Data should not be hoarded by researchers
Public understanding of research
Evidence-based advocacy
Education & teaching
Citizen science
Policymakers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
OPEN GOVERNMENT DATA
White House Office of Science & Technology Policy memo: “Expanding Public Access to the Results of Federally Funded Research” (Feb 2013)
digitally formatted scientific data resulting from unclassified research supported wholly
or in part by Federal funding should be stored and publicly accessible to search, retrieve,
WHO seeks a paradigm shift in the approach to information sharing in emergencies, from one limited by embargoes set for publication timelines, to open sharing using modern fit-for-purpose pre-publication platforms.
Opting in to data and results sharing should be the default practice and the onus should be placed on data generators and stewards at the local, national and international level to explain any decision to opt out from sharing data and results during public health emergencies
World Health Organization “Developing global norms for sharing data and results during public health emergencies”
Searchable database and single focal point of up-to-date information concerning funders'
policies and their requirements on open access, publication and data archiving.
http://v2.sherpa.ac.uk/juliet/
DISCUSSION
Do you have experience with data management plans?
Up next: Data reuse
DATA REUSE
ASKING NEW QUESTIONS OF EXTANT DATA
• Encourages meta-analyses & data combination
• Exploring new questions and identifying new relationships
HUBBLE SPACE TELESCOPE DATA REUSE
• General Observing (GO) paper: At least one author was investigator on the GO proposal that obtained the data.
• Archival Research (AR) paper: No overlap between the paper authors and investigators on the GO proposal that obtained the data.
• GO+AR: Combination of GO data sets with AR data sets.
Adapted from Jonathan Tedds, Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data." Licensed under CC BY.
Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
Papers based upon reuse of archived observations now
• Assessing veracity requires domain expertise & misinterpretation is a serious risk
• Depends on extensive documentation & description
• The farther the user is from the point of data origin:
• The more documentation required
• The more effort required by reuser
• The greater the risk of misinterpretation
• Benefits prospective users more than producers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
ADVANCING RESEARCH AND INNOVATION
• Data-intensive fields (astronomy, social sciences, economics)
• Comparisons across time and space (ecology, biology, sociology)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
ADVANCING RESEARCH AND INNOVATION
• Maximizing the use of data
• Increasing the impact of findings
• Progressing the state of research
• Laying broader foundation for knowledge
• Diversifying perspectives
Fischer, B.A., & Zigmond, M.J. (2010). The essential nature of sharing in science. Science and Engineering Ethics, 16(4), 783–799.
DATA SHARING ASSOCIATED WITH CITATION IMPACT
Examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data.
The 48% of trials with publicly available microarray data received 85% of the aggregate citations
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. https://doi.org/10.1371/journal.pone.0000308
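The two percentages above can be turned into a rough per-trial comparison. The Python sketch below does that arithmetic. The function name is hypothetical, and the result is an unadjusted ratio of averages that ignores journal, publication date, and every other factor, so it should not be read as the study's own estimate.

```python
def per_trial_citation_ratio(share_of_trials: float, share_of_citations: float) -> float:
    """Average citations per trial inside a group, divided by the average outside it.

    Both arguments are fractions between 0 and 1 (exclusive) describing the group.
    """
    inside = share_of_citations / share_of_trials
    outside = (1 - share_of_citations) / (1 - share_of_trials)
    return inside / outside


if __name__ == "__main__":
    # 48% of trials (those with public data) received 85% of aggregate citations.
    ratio = per_trial_citation_ratio(0.48, 0.85)
    print(f"Unadjusted per-trial citation ratio: {ratio:.1f}")
```

On these figures, trials with publicly available data averaged about six times as many citations as the rest, before any adjustment for confounding factors.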
DATA SHARING ASSOCIATED WITH CITATION IMPACT
Does not imply causation
• But there may be mechanisms through which data sharing did stimulate greater citations
• Exposure
• Reanalysis
• Enthusiasm and synergy around a specific research question
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. https://doi.org/10.1371/journal.pone.0000308
SECTION 3
CONCERNS AND OPEN QUESTIONS
RESEARCHER CONCERNS
• Data is competitive advantage
• Data is intellectual capital
• Time & effort required to prepare data for archiving
• Lack of recognition & other extrinsic incentives
• Concerns about data misinterpretation
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
OPEN QUESTIONS
• What data to share?
• What is sharing?
• What is interpretable and reusable?
• How to reward/give credit?
• How to document without extensive labor?
• How to handle misuse/misinterpretation?
• Restricting access/de-identification
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
OPEN QUESTIONS
• Lack of demonstrated demand for research data outside genomics, climate science, astronomy, social science, demographics
• How open is it?
• Who owns the copyright? Is data public domain?
• How to validate data?
• Preserving data
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634