2014 CrossRef Annual Meeting Keynote: Ways and Needs to Promote Rapid Data Sharing
Post on 10-Jul-2015
Ways and Needs to Promote Rapid Data Sharing
Laurie Goodman, PhD
Editor-in-Chief GigaScience
ORCID ID: 0000-0001-9724-5976
Scientific Communication Via Publication
• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than “regular” dead tree publication
A Tale of Two Bacteria
1. On May 2, 2011, German doctors reported the first case of an E. coli infection that was accompanied by hemolytic-uremic syndrome.
2. On May 21, 2011, the first death occurred from this bacterium (denoted E. coli O104:H4).
3. On June 3, 2011, BGI completed a draft sequence of E. coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf.
4. At this point, the leaders at BGI held a discussion about whether to release the sequence data immediately: what were the potential repercussions of doing so?
The question arose: if the data were released now, would it affect their ability to publish later?
A Tale of Two Bacteria
• In one world, the researchers, who were concerned about their ability to publish (the way to obtain recognition and grants, which are essential for their work), waited.
The first publication appeared on July 29th.
• In another world, the researchers, who decided public health was more important than obtaining a publication, released the data immediately.
The first publication appeared on July 29th, but it was not from the group that released the data (though information on that data was included).
Whether the concern about the ability to publish if data are released early is real or imagined,
researchers act on that concern.
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
These data were put on an FTP server under a CC0 waiver and also given a DOI to make access ‘permanent’.
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
Downstream consequences:
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
1. Citations (~180)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-intestinal
infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths. All
tested positive for an unusual and little-known Shiga-toxin–producing
E. coli bacterium. The strain was initially analysed by
scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.
All that aside
Can we all agree that releasing the E. coli data ahead of publication was ‘good’?
At least from a public health perspective
Here are the numbers for the E. coli 2011 Outbreak
In total, ~4000 people were infected and 53 died
From a Public Health perspective… Deaths Worldwide*
Infectious Disease
• Measles: 122,000 per year
• Hepatitis C-related liver disease: 350,000-500,000 per year
• Malaria: 627,000 per year
• HIV/AIDS: 1.4-1.7 million per year
Non-communicable, with genetic predisposition
• Prostate cancer: 307,000 per year
• Breast cancer: 522,000 per year
• Suicide: 800,000 per year
• Diabetes: 1.5 million per year
• Cancer: 8.2 million per year
• Cardiovascular Disease: 17.5 million per year
Non-genetic/Non-infectious
• Pesticide Poisoning: 250,000 per year
• Malnutrition: 2.8 million children (under 5) per year
*World Health Organization Fact Sheets http://www.who.int/en/
Sharing Data is Essential for Many Reasons
[Figure: Rice v Wheat: consequences of publicly available genome data. Chart comparing rice and wheat, vertical axis 0 to 700.]
Sharing aids fields…
Every 10 datasets collected contributes to at least 4 papers in the following 3-years.
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment Nature, 473 (7347), 285-285 DOI: 10.1038/473285a
Sharing aids authors…
Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Sharing Detailed Research Data Is Associated with Increased Citation Rate.
Lack of Sharing Impacts Reproducibility
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced
Sharing can reduce retractions
>15X increase in last decade
Strong correlation of “retraction index” with higher impact factor
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
At the current rate of increase, by 2045 as many papers will be retracted as published!
Data Sharing Hurdles
If only it were easy…
There are numerous reasons why researchers do not share data:
The majority of which are good reasons
Wiley Researcher Data Insights Survey
Our objective was to establish a baseline view of data sharing practices, attitudes, and motivations globally, with participation from researchers in every scholarly field.
In March 2014, more than 90,000 researchers around the world were invited to participate in Wiley’s Researcher Data Insights Survey. Participants were researchers who had published at least one journal article in the past year with any publisher.
We received an overwhelming 2,886 responses from around the world.
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
Wiley Researcher Data Insights Survey
Key Findings
• Most researchers are sharing their data.
• Those not sharing have a variety of reasons.
• Data that’s being shared typically is <10 GB.
• The most common type of data that is being shared is flat, tabular data (.csv, .txt, .xl)
• Data is usually saved on hard drives.
Wiley Researcher Data Insights Survey
Why Researchers Do Not Share
• Intellectual property or confidentiality issues (59%)
• Concerned research might be “scooped” (39%)
• Concerns about misinterpretation or misuse (32%)
• Concerns about attribution/citation credit (31%)
• Ethical concerns (24%)
• Insufficient time/resources (19%)
• Funder/institution does not require sharing (13%)
• Lack of funding (13%)
• Not sure where to share (5%)
• Not sure how to share (3%)
See also:
http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/
http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
How Can Publishers Promote Data Sharing
Carrots and Sticks
– Create Journal Data Release Policies
– Check Data Release Policy is followed
– Find Ways to Aid Researchers in Releasing Data
– Consider ways to support/protect researchers who do share ahead of publications
– Promote Data Citation
And why us? Researchers are never so captive as when they are publishing
But we need to help — not just harass.
Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing (Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
Genomics Data Sharing Policies…
Bermuda Accords 1996/1997/1998:
1. Automatic release of sequence assemblies within 24 hours.
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
Fort Lauderdale Agreement, 2003:
1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.
2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets, whether from proteomics, biobanking or metabolite research.
Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility (From the Fort Lauderdale Meeting 2003)
http://www.genome.gov/pages/research/wellcomereport0303.pdf
DataCite and DOIs
Aims to:
“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
Citing Data Isn’t New
The Physical Sciences have been doing this for a while
How We Envision Research Publication (Communicating Science)
• Paper in GigaScience: open-access journal
• Data sets in GigaDB: data publishing platform
• Analyses in GigaGalaxy: data analysis platform
Other journals are now doing something similar
This is most commonly done in the form of a Data Paper rather than a release of data that is citable in itself.
• A Data Paper is effectively a description of the data
• Other journals that do data publishing as a formal paper type:
– F1000 Research (launched in 2012): has Data papers as one of several types of papers
– Scientific Data (launched in 2014): solely publishes Data Descriptors
– There are more…
Making the Data Itself Citable
We provide a linked database. The data are then directly linked to the paper, but can also be cited separately through a Data DOI. We can do this because we have a collaboration between BMC (which handles the standard paper publication) and BGI (which has enormous data storage capacity).
However: there are many community-available databases, so in principle any journal can do this by taking advantage of such available resources.
These include the usual suspects: EBI, NCBI, DDBJ, etc.
Databases that take all data types and provide Data DOIs: Dryad, FigShare, etc.
There are also numerous smaller community databases specific to different fields or data types.
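As a rough illustration of what a separately citable dataset reference looks like, the sketch below assembles a data citation in the same pattern as the dataset references shown in this talk (creator, year, optional title, repository, DOI link). The function name `format_data_citation` and its parameters are hypothetical, chosen only for this example; it is not the tooling of GigaDB or of any journal.

```python
from typing import Optional


def format_data_citation(creators: str, year: int, publisher: str,
                         doi: str, title: Optional[str] = None) -> str:
    """Build a reference-list entry for a dataset (hypothetical helper)."""
    parts = [f"{creators} ({year})"]
    if title:
        # The dataset title is optional; some records cite only the project name.
        parts.append(f"{title}.")
    parts.append(f"{publisher}.")
    # The DOI link is what makes the dataset citable independently of the paper.
    parts.append(f"http://dx.doi.org/{doi}")
    return " ".join(parts)


# Values taken from the 3000 Rice Genomes dataset reference in this talk.
citation = format_data_citation(
    creators="The 3000 Rice Genomes Project",
    year=2014,
    publisher="GigaScience Database",
    doi="10.5524/200001",
)
print(citation)
```

An entry built this way can sit in the reference list next to article citations, which is the prerequisite for citation indexes to track it.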
For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
In Principle…
Back to E.coli O104:H4
• As noted: articles on these early released and citable data were published
• Also, the early releasers were not the first to publish
• Nor were the data cited
This open-source analysis work was published on August 25th
The journal did not approve of inclusion of the data citation.
Nor was there any indication of where the genome information could be found
This report was the first to be published, and it included and used information from the crowd-sourced release as well as the other early release.
Nowhere in the paper is there any indication of where to obtain this data
Nor is there an indication of where to obtain the sequence data they generated
This group made their O104:H4 sequence available in the NCBI database at the time of completion, prior to publication.
Though no link to the Accession Number is easily found in the paper.
This report DID include a reference for the data (even though they did not use it in their analysis)
For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
In Practice…
• Data submitted to NCBI databases:
- Raw data SRA:SRA046843
- Assemblies of 3 strains Genbank:AHAO00000000-AHAQ00000000
- SNPs dbSNP:1056306
- CNVs, InDels, SV dbVAR:nstd63
• Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).
In the references…
Is the DOI…
In Practice…
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
The polar bear data were released prepublication in 2011.
They were used and cited in the following studies before the main paper on the sequencing was published:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
Cell Press Journals
However, this didn’t include the citation…
One step forward — two steps back
Removing data citations from the references
One journal informed the authors that non-reviewed material could not be cited in the references of the paper
Another journal stripped the data citation from the references, and went an extra step and changed the citation in the Data Availability section to the URL that the DOI resolved to at that time.
We happened to know about this one, and were able to create a forward to the DOI’d page when the URL broke after we moved our database platform.
Note: Much of this was due to a standard operating procedure in the production department
Lesson: If you decide to include Data Citations- tell your entire team
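The broken-URL episode above is a reminder to keep the DOI, not the landing-page address, in the published text. As a minimal sketch of how a production workflow might check this, the hypothetical helper `doi_to_url` below extracts a DOI from the forms used in this talk (`doi:` prefix, `http://dx.doi.org/` link, or bare DOI) and returns a resolver link, and rejects an identifier that contains no DOI at all. The regular expression is a simplification and an assumption of this example, not a complete validator for DOI syntax.

```python
import re

# Simplified pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_to_url(identifier: str) -> str:
    """Return a persistent resolver link for a DOI (hypothetical helper)."""
    match = DOI_PATTERN.search(identifier.strip())
    if match is None:
        # A plain landing-page URL carries no DOI and may break if the site moves.
        raise ValueError(f"no DOI found in {identifier!r}")
    return "https://doi.org/" + match.group(0)


print(doi_to_url("doi:10.5524/100001"))
print(doi_to_url("http://dx.doi.org/10.5524/100092"))
```

Storing the identifier in this form leaves redirection to the DOI resolver, so a later move of the database platform does not invalidate the citation.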
For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
This is a work in progress…
Data Citation Really is a Major Incentive
On Weds this week, we released the genome sequence from 3000 Rice strains (13.4 TB of data)
• These data were also deposited in NIH SRA repository
• So why did we do it too?
1. It is linked directly to the Data Paper that provides details of data production, quality, and basic analysis
2. Authors were hesitant to release these data (a HUGE community resource) prior to the analysis paper publication (which, for 3000 strains… would take years…). The opportunity to have these data citable (and trackable) encouraged the authors and led to their releasing these data and doing so in collaboration with GigaScience’s Biocurator
The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7
The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data
No: your data is not too large to share
Beyond Data Citation
Reviewing Data
Data Release policies include the need to help authors
Data availability without metadata is practically useless
Beyond Data Citation
Reviewing Data
It’s too hard, we can’t ask our reviewers to do that!
Use Data Reviewers
Example in Neuroscience
1. Neuroscience Data are not typically shared
2. For most papers: Data AND Tools are not typically made available to the reviewers
3. Journal Editors think Reviewers will not want to review data
GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3
Example in Neuroscience
• Neuroscience Data are not typically shared
– Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to share their data is to provide some form of academic credit.”
– We hosted with a DOI: 366 recordings from 12 electrophysiology datasets
– GigaDB is included in Thomson Reuters Data Citation Index
• Data AND Tools are not typically made available to the reviewers
– We made manuscript, data and tools all available to the reviewers.
– We make sure to include reviewers who are able to properly assess the data itself and rerun the tools
– To reduce burdens, we sometimes select a reviewer who ONLY looks at the data.
• Journal Editors think Reviewers will not want to review data
– What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and colleagues is a shining example of openness in that it enables replicating the results almost as easily as by pressing a button.”
– What Reviewer Dr. Christophe Pouzat said: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”
Beyond Data Citation
Data Release policies include the need to help authors
Collaborations
With data repositories
With other journals
Consider Cross Journal Support
Competition is good…
…but sometimes we should collaborate for the community good
• PLoS’s recent data deposition policies have led to community concerns about feasibility.
• We support (and applaud) this… we have an even stricter data deposition policy
• But PLoS ONE received a submission that was a comparative study of earthworm morphology and anatomy using a 3D non-invasive imaging technique called micro-computed tomography (or microCT)… and there is no good place to put this
• These data are extremely complex: videos and multiple files, with several folders of ~10 GB
Consider Cross Journal Support
• GigaScience and PLOS ONE collaborated. They published the main article; we published a Data Note describing the data itself and hosted all the data on GigaDB under separate citation.
• With our Aspera connection, reviewers could download even the ~10 GB folders in ~1/2 hour
• Reviewer Dr. Sarah Faulwetter noted the usefulness of having these data available, saying: “Instead of having to go through the lengthy process of obtaining the physical specimen from a museum, I can now download a fairly accurate representation from the web.”
Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6; Lenihan, et al (2014): GigaScience Database. http://dx.doi.org/10.5524/100092; Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617
Beyond Data Citation
Data availability without metadata is practically useless
Engage/Employ/Interact with Curators
1. Lack of interoperability/sufficient metadata
2. Long tail of curation (“Democratization” of “big-data”)
Challenges for the future…
Think about what you do… and what you can do…
• Promote, rather than inhibit, prepublication data sharing
• Promote Data Citation in the reference section
– incentivizes data release
– Makes it easier for readers to find
• Promote Data Sharing upon publication
– Consider your data release policies
• Form collaborations with repositories to aid authors in depositing their work
– Identify community organizations with metadata standards
• Make data available for reviewers (author website, community repositories, Dryad and similar, your publisher?)
– at least do a sanity check
– Use “data reviewers”
No, this isn’t easy, but do what you can now
And work toward the rest
Evolve
It’s Time to Move Beyond Dead Trees
1812 1665 1869
Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager
Contact us:
editorial@gigasciencejournal.com
database@gigasciencejournal.com
www.gigasciencejournal.com
www.gigadb.org
Follow us:
@GigaScience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog