Measuring progress toward a cultural norm of shared (and reused!) biomedical research data Heather Piwowar Department of Biomedical Informatics University of Pittsburgh
Nov 01, 2014
Sharing research data
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif; http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
Shared data benefits science
• Verify
• Understand
• Extend
• Explore
• Combine
• Synergize
• Train
• Reduce
But... costly for authors
• Find
• Organize
• Document
• Deidentify
• Format
• Decide
• Ask
• Submit
• Answer questions
• Worry about mistakes being found
• Worry about data being misinterpreted
• Worry about being scooped
• Forgo money and IP and prestige???
As a result, policy makers have spent lots of time and money ....
http://www.flickr.com/photos/tonivc/2283676770/
http://www.flickr.com/photos/johnnyvulkan/381941233/
... on initiatives, requests, requirements, and tools
NIH data sharing plan requirement
Journal requirements
Public databases
Data sharing grids like BIRN and caBIG
Data formatting standards
Editorials, letters to the editor, discussion....
http://www.flickr.com/photos/mesh/14102209/
lots of data sharing!
http://www.genome.jp/en/db_growth.html
but how much isn’t shared?
what isn’t shared?
who isn’t sharing it?
why not?
what can we do about it?
how much does it matter?
you cannot manage what you do not measure
http://www.flickr.com/photos/archeon/2941655917/
1. Is there benefit for those who share?
2. Do journal policies increase rates of sharing?
3. What other factors are correlated with sharing and withholding data?
research questions
microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
microarray data
http://www.flickr.com/photos/sunrise/35819369/
1. Is there benefit for those who share?
currency of value?
Citations.
$50!
Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
Prior work focused on the citation advantage of an open access publishing model.
Our question: are articles that share their raw research data cited more than articles that don’t?
dataset: 85 cancer microarray trials published in 1999-2003, as identified by Ntzani and Ioannidis (2003)
citations: ISI Web of Science Citation index, citations from 2004-2005
data sharing locations: Publisher and lab websites, microarray databases, WayBack Internet Archive, Oncomine
statistics: Multivariate linear regression
Note: log scale
In multivariate regression, we found studies that had made their data publicly available received 69% more citations than similar studies that did not share their data (95% confidence interval: 18% to 143%)
Piwowar, Day and Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308
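Because the regression was fit on a log scale, a coefficient on the "shared data" indicator converts to a percent difference in citations. A minimal sketch of that arithmetic, assuming natural logs; the function name is illustrative, and the coefficient below is back-calculated from the reported 69% rather than taken from the paper:

```python
import math

def percent_difference(beta: float) -> float:
    """Percent difference in the outcome implied by a coefficient on a natural-log scale."""
    return (math.exp(beta) - 1.0) * 100.0

# Coefficient back-calculated from the reported 69% (an assumption for illustration only).
beta_shared = math.log(1.69)
print(round(percent_difference(beta_shared), 1))
```

The same conversion applied to the two ends of a coefficient's confidence interval gives an asymmetric percent interval, which is why a range like 18% to 143% is not centred on 69%.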
• collect a larger dataset for citation analysis (stay tuned)
• investigate other datatypes
• examine citation context
future work
http://www.flickr.com/photos/ryanr/142455033/
2. Do journal data sharing policies increase sharing?
“An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols available in a publicly accessible database …”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
Prior work examined data sharing policies in biomedicine, but these reviews are now dated, consider a variety of resources, and don’t correlate policy to behaviour.
McCain. Science Communication, Vol. 16, No. 4. (1 June 1995), pp. 403-431
NAS. Sharing Publication-Related Data and Materials. (2003), p. 33
Our aim: look at data sharing policies within the Instructions to Authors statements of 70 journals, as they apply to gene expression microarray data.
Very diverse policies in terms of:
• statements of policy motivation
• datatype-specific policies
• requested vs. required
• data location
• data format
• data completeness
• timeliness of sharing
• consequences for not sharing
• exceptions
content of data sharing policies
No applicable policy (43%)
Weak policy (24%)
• should, recommend, request
• must, but without database accession number
Strong policy (33%)
• must, required, condition of publication
• requires database accession number
strength of data sharing policies
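The three-level scheme above can be approximated with keyword rules. This is only a sketch: the policies in the study were read by a person, the function name and patterns are hypothetical, and it assumes "strong" means mandatory wording together with an accession-number requirement.

```python
import re

STRONG_TERMS = ("must", "required", "condition of publication")
WEAK_TERMS = ("should", "recommend", "request")
ACCESSION = re.compile(r"\baccession (number|code)", re.IGNORECASE)

def _has_any(terms, text):
    """True if any term starts at a word boundary in the text."""
    return any(re.search(r"\b" + re.escape(t), text) for t in terms)

def classify_policy(policy_text):
    """Return 'none', 'weak', or 'strong' for a journal's policy statement."""
    if not policy_text or not policy_text.strip():
        return "none"
    text = policy_text.lower()
    mandatory = _has_any(STRONG_TERMS, text)
    advisory = _has_any(WEAK_TERMS, text)
    if mandatory and ACCESSION.search(text):
        return "strong"          # mandatory wording plus accession number
    if mandatory or advisory:
        return "weak"            # advisory wording, or "must" without accession number
    return "none"

print(classify_policy("Authors must deposit data and supply a database accession number."))
```

Keyword rules like these misread negations ("is not required"), so any real use would need manual review of the output.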
[Figure: multivariate associations with whether a journal has a data sharing policy. Variables shown: impact factor; open access?; society publisher?; subject categories Biochemistry & Molecular Biology and Oncology]
strength of data sharing policies: multivariate associations
High-impact journals tend to have a strong data-sharing policy
strength of data sharing policies: associated with impact factor
For each of the 70 journals, we measured the percent of articles that were cited from within GEO and ArrayExpress.
We considered this a proxy for percent of articles with shared data.
data sharing policies: associated with amount of sharing
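The proxy described above reduces to a per-journal percentage. A minimal sketch, assuming the inputs are already collected as (journal, PMID) pairs plus the set of PMIDs cited from within the databases; the function name and the sample data are made up for illustration:

```python
from collections import defaultdict

def percent_shared_by_journal(articles, linked_pmids):
    """articles: iterable of (journal, pmid); linked_pmids: PMIDs cited from within the databases."""
    totals = defaultdict(int)
    shared = defaultdict(int)
    for journal, pmid in articles:
        totals[journal] += 1
        if pmid in linked_pmids:
            shared[journal] += 1
    return {journal: 100.0 * shared[journal] / totals[journal] for journal in totals}

# Made-up example data, not results from the study.
articles = [("Journal A", 1), ("Journal A", 2), ("Journal B", 3), ("Journal B", 4)]
linked = {1, 3, 4}
print(percent_shared_by_journal(articles, linked))
```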
[Figure: multivariate associations with the percent of articles with shared data. Variables shown: having a data-sharing policy?; impact factor; open access?; society publisher?; subject categories Genetics & Heredity and Multidisciplinary Sciences]
data sharing policies: associated with amount of sharing
• our corpus of “gene expression microarray” articles may have included some that reused data and did not themselves produce primary data
• these results should be considered preliminary, pending a more precise filter (stay tuned)
http://www.flickr.com/photos/vlastula/300102949/
• use a more precise filter to isolate data producing articles and thereby understand the absolute levels of data sharing
• investigate other datatypes
• look at associations with reviewer instructions and opinions
future work on journal policies
• are they effective? (stay tuned)
• what do people propose in data sharing plans? Do they do what they propose? Why not?
• quantify the perceived worth of data sharing plans and accomplishments in funding and promotion decisions
future work on funder policies
http://www.flickr.com/photos/cogdog/123072/
3. What other factors are correlated with sharing and withholding data?
Prior work has focused on surveys and studies of intention.
Our aim: measure associations between observed data sharing behaviour and environmental variables
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.
Blumenthal et al. Acad Med. 2006.
Ochsner et al. manually reviewed 20 journals for 2007:
400 studies
200 shared their microarray data
Ochsner et al. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods, 5(12), 991.
pilot dataset
Is research data shared after publication?
Funder mandates
Journal impact factor
Investigator “experience”
Journal mandates
pilot variables
funder mandates
NIH 2003 Data Sharing Requirement
Requires a data sharing plan for studies funded after October 2003 that receive more than $500 000 in direct funding per year
Assumed the data sharing requirement was applicable if the NIH grant numbers associated with the PubMed entry had:
• $750 000 in total funding in any year since 2004, plus
• an NIH grant number with a leading “1” or “2” since 2004
funder mandates
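The applicability rule above can be written as a small predicate. A sketch under stated assumptions: the `Award` record and its fields are hypothetical, "$750 000 in total funding" is read as "at least $750 000", and the two conditions may be met by different grants attached to the same article.

```python
from dataclasses import dataclass

@dataclass
class Award:
    grant_number: str     # full NIH grant number; the first character is the application type
    fiscal_year: int
    total_funding: float  # total funding for that fiscal year, in dollars

def sharing_plan_assumed(awards, threshold=750_000, since=2004):
    """True if the article's awards meet both proxy conditions from the slide."""
    recent = [a for a in awards if a.fiscal_year >= since]
    large_award = any(a.total_funding >= threshold for a in recent)
    new_or_renewal = any(a.grant_number[:1] in ("1", "2") for a in recent)
    return large_award and new_or_renewal

# Hypothetical grant number, for illustration only.
print(sharing_plan_assumed([Award("1R01XX000000-01", 2005, 800_000)]))
```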
Publication history and impact proxy
First and last authors:
• years since first paper
• h-index (the largest number N such that an author has N papers cited at least N times)
• a-index
author experience
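Both indices can be computed from a list of per-paper citation counts. A minimal sketch; the h-index follows the definition on the slide, while the a-index here uses the common definition (mean citations of the papers in the h-core), which is an assumption since the slide does not define it:

```python
def h_index(citations):
    """Largest N such that N papers have at least N citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def a_index(citations):
    """Mean citation count of the h most-cited papers (assumed definition)."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    core = sorted(citations, reverse=True)[:h]
    return sum(core) / h

counts = [10, 8, 5, 4, 3]   # made-up citation counts
print(h_index(counts), a_index(counts))
```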
Author publication history:
Citation counts:
Author-ity web service
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.
Author name disambiguation:
Derived h-index (pubmedi citation indices):
author experience
Univariate odds ratios
Multivariate logistic regression
stats
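For a binary variable, a univariate odds ratio comes straight from a 2x2 table of exposure against sharing. A sketch with a Wald-type 95% interval, assuming no zero cells; the function name and the counts below are invented for illustration and are not study results:

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """
    a: exposed and shared      b: exposed and not shared
    c: unexposed and shared    d: unexposed and not shared
    Returns (odds ratio, lower bound, upper bound).
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Invented counts: e.g. articles in journals with and without a mandate.
print(odds_ratio(30, 10, 20, 40))
```

Multivariate logistic regression generalises this by estimating each variable's odds ratio while holding the others fixed.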
Is research data shared after publication?
Funder mandates
Journal impact factor
Investigator “experience”
Journal mandates
Statistically significant
Not statistically significant
results of pilot
33%
results of pilot
More samples, more variables
http://www.flickr.com/photos/krcla/2069243613/
PhD dissertation
Developed and evaluated automated methods to:
• Identify studies that generate datasets that could potentially be shared
• Determine which of these have in fact been shared
More samples:
To identify studies that generate datasets,
use a query on the full text of published articles:
("gene expression" AND microarray AND cell AND rna) AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
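The boolean logic of this query can be mirrored locally for testing against article text. A rough sketch only: the function names are hypothetical, matching is case-insensitive on whole words, and a real full-text search engine may tokenize and stem differently.

```python
import re

REQUIRED = ["gene expression", "microarray", "cell", "rna"]
ANY_OF = ["rneasy", "trizol", "real-time pcr"]
EXCLUDED = ["tissue microarray*", "cpg island*"]

def _has(term, text):
    """Whole-word match; a trailing * matches any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text) is not None

def matches_query(full_text):
    """Apply the AND / OR / NOT structure of the query to one article."""
    text = full_text.lower()
    return (all(_has(t, text) for t in REQUIRED)
            and any(_has(t, text) for t in ANY_OF)
            and not any(_has(t, text) for t in EXCLUDED))

print(matches_query("Gene expression was profiled by microarray; RNA from each cell line was extracted with TRIzol."))
```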
To determine which articles have shared data,
use a query on the full text of published articles:
pubmed_gds[filter] and query ArrayExpress
More variables:
Use PubMed and a variety of other internet resources...
Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
Journal: impact factor; strength of policy; open access?; number of microarray studies published
Investigator: years since first paper; h-index; a-index; previously shared?; previously reused?; gender
Institution: sector; size; impact rank; country
Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
Univariate odds ratios
Multivariate logistic regression
Exploratory factor analysis
stats
http://www.flickr.com/photos/skrb/2427171774/
results?
1. Is there benefit for those who share?
2. Do journal policies increase rates of sharing?
3. What other factors are correlated with sharing and withholding data?
research questions
what’s next?
• citation analysis of larger cohort
• journal policies with refined filter
• beyond microarray data
• deeper into journal and funder policies
• and, finally....
future work previously mentioned...
Reuse.
http://www.flickr.com/photos/boitabulle/3668162701/
who reuses data? when? why?
who doesn’t?
which datasets are most likely to be reused?
how many datasets could be reused but aren’t?
why aren’t they?
what can we do about it?
One possible reuse research agenda
1. Inventory reuse acknowledgement patterns
2. Build full-text and metadata filters to identify instances of data reuse
3. Analyze patterns in data reuse choices
4. Survey data producers and data consumers to augment with intentions and perspectives
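One possible starting point for the full-text filters in step 2 is to look for database accession numbers in article text. A sketch, assuming GEO-style (GSE, GDS) and ArrayExpress-style (E-XXXX-n) identifiers; the function name is hypothetical, and a match only shows that a dataset was mentioned, not whether it was reused or deposited:

```python
import re

ACCESSION_PATTERN = re.compile(r"\b(?:GSE\d+|GDS\d+|E-[A-Z]{4}-\d+)\b")

def find_accessions(full_text):
    """Return accession-like identifiers in order of first appearance, without duplicates."""
    seen = []
    for match in ACCESSION_PATTERN.findall(full_text):
        if match not in seen:
            seen.append(match)
    return seen

# Made-up sentence with made-up identifiers.
print(find_accessions("We reanalyzed GSE1234 together with E-MEXP-567 and GSE1234."))
```

Distinguishing reuse from deposit would need the surrounding sentence, which is where citation context classification comes in.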
Resources
• GEO list of reuse articles (currently 618)
• Previous work in citation context classification, e.g. Teufel et al. (2006) Automatic classification of citation function. EMNLP.
• Amazon Mechanical Turk for annotation
• Experimental Philosophy for insight into cultural norms
• ...
• readers
• reusers
• authors
• editors
• reviewers
• funders
• database designers, maintainers, curators
• patients, subjects, or populations
Stakeholders
For their perspectives,
and also to design studies that have actionable results
for these groups
I post my data, code, and statistical scripts at http://www.dbmi.pitt.edu/piwowar
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/
Data sharing plan
thank you
Dept of Biomedical Informatics at U of Pittsburgh
NLM for training grant funding
Open science online community and those who release their articles, datasets and photos openly
Dr Wendy Chapman for her support and feedback
“Does anyone want your data?
That’s hard to predict […] After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.
Your data, too, may simply be awaiting an effective matchmaker.”
Got data? Nature Neuroscience 10, 931 (2007)
variables
Journal mandates
Blumenthal et al. Acad Med. 2006
[Figure: Correlates with self-reported data withholding. Factors shown: industry involvement; perceived competitiveness of field; male; sharing discouraged in training; human participants; academic productivity. Axis: 0 to 3]
Campbell et al. JAMA 2002.
[Figure: Self-reported reasons for data withholding. Reasons shown: sharing is too much effort; want student or jr faculty to publish more; they themselves want to publish more; cost; industrial sponsor; confidentiality; commercial value of results. Axis: 0% to 80%]
[Figure: Prevalence of data withholding via surveys. Measures shown: self-reported denying a request in last 3 years; trainees self-reported denying a request; been denied access to data, materials, code; authors “not able to retrieve raw data”; not willing to release data. Axis: 0% to 40%]
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.