Public sharing of research datasets: a pilot study of associations Heather Piwowar and Wendy Chapman Department of Biomedical Informatics University of Pittsburgh
Nov 01, 2014
Public sharing of research datasets:
a pilot study of associations
Heather Piwowar and Wendy Chapman
Department of Biomedical Informatics
University of Pittsburgh
data data
http://www.flickr.com/photos/vroomvroommm/3457772539
stale
http://www.flickr.com/photos/75166820@N00/5318468/
sounds great
http://www.flickr.com/photos/ryanr/142455033/
but not easy
http://www.flickr.com/photos/faerie-dust/2315927946/
persuade
http://www.flickr.com/photos/sunrise/35819369/
http://www.flickr.com/photos/fboyd/2156630044/
does it work?
http://www.flickr.com/photos/mesh/14102209/
aim
Prior work has focused on surveys and studies of intention.
Our aim: measure associations between observed data sharing behaviour and environmental variables
aim
Funder Journal Investigator Institution Study
Is research data shared after publication?
microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
data sample
Ochsner et al. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991.
Manually reviewed 20 journals for 2007:
400 studies
200 shared their microarray data
variables
Is research data shared after publication?
Funder mandates
Journal impact factor
Investigator “experience”
Journal mandates
variables
Funder mandates
NIH 2003 Data Sharing Requirement
Requires a data sharing plan
for studies funded after October 2003
that receive more than $500 000 in direct funding per year
variables
Funder mandates
Assumed the data sharing requirement was applicable if:
the NIH grant numbers associated with the PubMed entry had
$750 000 in total funding in any year since 2004
plus
an NIH grant number with a leading “1” or “2” since 2004
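The applicability rule above can be sketched in code. This is a minimal illustration, not the authors' script: the `GrantYear` record layout and the function name are hypothetical, the threshold is read as "at least $750 000", and the two conditions are checked independently of each other, which is one possible reading of the slide.

```python
from dataclasses import dataclass

@dataclass
class GrantYear:
    """One NIH award-year linked to a paper (hypothetical record layout)."""
    grant_number: str   # full grant number; its leading digit is the application type
    fiscal_year: int
    total_funding: int  # total dollars awarded in that fiscal year

FUNDING_THRESHOLD = 750_000  # from the slide; ">=" is an assumption
FIRST_YEAR = 2004            # "since 2004" on the slide

def sharing_plan_assumed(grants):
    """Return True if the NIH data sharing requirement is assumed to apply."""
    recent = [g for g in grants if g.fiscal_year >= FIRST_YEAR]
    # Condition 1: some award-year since 2004 reaches the funding threshold.
    large_award = any(g.total_funding >= FUNDING_THRESHOLD for g in recent)
    # Condition 2: some grant number since 2004 starts with "1" or "2".
    new_or_renewed = any(g.grant_number[:1] in ("1", "2") for g in recent)
    return large_award and new_or_renewed
```

A paper whose only linked award is a 2005 grant numbered with a leading "5" would return False under this sketch, even if the award exceeds the threshold.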
variables
Journal mandates
Piwowar and Chapman.
A review of journal policies for sharing research data.
International Conference on Electronic Publishing (ELPUB) 2008
Journal Policy Strength: Strong, Weak, or None
variables
Author experience
Publication history and impact
variables
“experience and impact” proxy:
• years since first publication
• h-index estimate
• a-index estimate
Scriptable, to allow scaling up to thousands of authors?
Author experience
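The three proxy measures are simple to script. The sketch below assumes the standard definitions (h-index: the largest h such that h papers each have at least h citations; a-index: mean citations of those h papers); the slides do not spell out the definitions used, and the function names are illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def a_index(citations):
    """Mean citation count of the h most-cited papers (0.0 if h is 0)."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    core = sorted(citations, reverse=True)[:h]
    return sum(core) / h

def years_since_first_publication(publication_years, reference_year):
    """Years between an author's earliest paper and the reference year."""
    return reference_year - min(publication_years)
```

For citation counts [10, 8, 5, 4, 3], `h_index` gives 4 and `a_index` gives 6.75.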
variables
Author experience
Author publication history
variables
Author experience
Citation counts
variables
Author experience
Author-ity web service: Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.
Author name disambiguation
variables
PubMed + PubMed Central + Author-ity to compute pubmedi citation estimates
Author experience
➡ not a comprehensive account of publication accomplishments
➡ for aggregate analysis: free, open, scriptable, flexible, reproducible.
variables
For each first and last author, we used the first principal component of:
• years since first publication
• pubmedi h-index estimate
• pubmedi a-index estimate
Author experience
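A first principal component can be computed without external libraries. This is a sketch under assumptions the slides do not state: each variable is standardized first (so the component comes from the correlation matrix), and the leading eigenvector is found by power iteration. The function names are illustrative.

```python
import math

def standardize(column):
    """Center to mean 0 and scale to sample SD 1 (fails on a constant column)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def first_principal_component(rows, iterations=200):
    """rows: one tuple per author, e.g. (years, h_estimate, a_estimate).

    Returns (loadings, scores) for the first principal component.
    """
    columns = [standardize(list(col)) for col in zip(*rows)]
    k, n = len(columns), len(rows)
    # Correlation matrix of the standardized variables.
    corr = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    # Power iteration from a uniform start vector; this suits variables
    # that are positively correlated with one another.
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each author's score is the projection onto the loadings.
    scores = [sum(v[j] * columns[j][i] for j in range(k)) for i in range(n)]
    return v, scores
```

The resulting score gives each author a single "experience" value that can enter a regression in place of three correlated measures.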
variables
Is research data shared after publication?
Funder mandates
Journal impact factor
Investigator “experience”
Journal mandates
stats
Univariate odds ratios
Multivariate logistic regression
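The univariate step can be illustrated with an odds ratio from a 2×2 table. This sketch covers only the univariate case and uses the common Wald (log-scale) 95% confidence interval, which is an assumption; the slides do not say how intervals were computed. The function and argument names are hypothetical.

```python
import math

def odds_ratio(shared_with, unshared_with, shared_without, unshared_without):
    """Odds ratio of sharing for studies with vs. without a factor.

    Returns (odds_ratio, ci_low, ci_high) using a Wald 95% interval.
    All four counts must be positive.
    """
    a, b, c, d = shared_with, unshared_with, shared_without, unshared_without
    ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ratio) - 1.96 * se)
    high = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, low, high
```

With made-up counts (not study data) of 30 shared and 10 unshared with the factor, and 20 shared and 40 unshared without it, the odds ratio is 6.0. An interval that excludes 1 would count as statistically significant at the 5% level.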
results
http://www.flickr.com/photos/paperpariah/3002687604/
results
Is research data shared after publication?
Funder mandates
Journal impact factor
Investigator “experience”
Journal mandates
Legend: statistically significant / not statistically significant
results
33%
Funder mandates
results
Strength of journal data sharing policy is strongly correlated with impact factor
Journal mandates
Journal impact factor
results
Investigator “experience”
limitations
• Association does not imply causation
• Only one datatype
• Small sample, limited variables
• Dataset contains disproportionate number of high-impact studies
http://www.flickr.com/photos/vlastula/300102949/
prelim conclusions
• NIH data sharing plan applies to a minority of NIH microarray studies
• NIH data sharing plan does not seem to increase frequency of data sharing
• More experienced investigators are more likely to share data
next steps
PhD dissertation!
• More samples
• More variables
http://www.flickr.com/photos/krcla/2069243613/
future
Spin-off projects:
• Quantify usefulness of pubmedi h-index
• Study the patterns and prevalence of data reuse
http://www.flickr.com/photos/cogdog/123072/
thanks
Dept of Biomedical Informatics at U of Pittsburgh
NLM for training grant funding
Open science online community and those who release their articles, datasets and photos openly
Dr Wendy Chapman for her support and feedback
variables
Journal mandates
Policy strength categorization:
None: No applicable mention of data sharing
Weak: Request or unenforceable requirement
Strong: Require data deposit accession number as a condition of publication
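For analysis, the three policy categories can be encoded as an ordered variable. This is a minimal sketch; the two boolean inputs are hypothetical simplifications of what a manual policy review would record.

```python
from enum import IntEnum

class PolicyStrength(IntEnum):
    """Ordered journal policy categories from the slide."""
    NONE = 0
    WEAK = 1
    STRONG = 2

def categorize_policy(mentions_data_sharing, requires_accession_number):
    """Map two reviewer judgments (hypothetical flags) to a category."""
    if not mentions_data_sharing:
        return PolicyStrength.NONE
    if requires_accession_number:
        # Accession number required as a condition of publication.
        return PolicyStrength.STRONG
    # A request, or a requirement that cannot be enforced.
    return PolicyStrength.WEAK
```

Using an ordered type keeps the categories comparable, so `PolicyStrength.STRONG > PolicyStrength.WEAK` holds.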
open science
I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/