2014 CrossRef Annual Meeting Keynote: Ways and Needs to Promote Rapid Data Sharing

Ways and Needs to Promote Rapid Data Sharing

Laurie Goodman, PhD

Editor-in-Chief GigaScience

ORCID ID: 0000-0001-9724-5976

Scientific Communication Via Publication

• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)

• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication

A Tale of Two Bacteria

1. On May 2, 2011, German doctors reported the first case of an E. coli infection that was accompanied by hemolytic-uremic syndrome

2. On May 21, 2011, the first death occurred from this bacterium (denoted E. coli O104:H4)

3. On June 3, 2011, BGI completed a draft sequence of E. coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf

4. At this point, the leaders at BGI held a discussion about whether to release the sequence data immediately: what were the potential repercussions of doing so?

The question arose: If the data were released now, would it affect their ability to publish later?

A Tale of Two Bacteria

• In one world, the researchers, who were concerned about their ability to publish (as this is the way to obtain recognition and grants, which are essential for them to work), waited.

The first publication appeared on July 29th

• In another world, the researchers, who decided public health was more important than obtaining a publication, released the data immediately.

The first publication appeared on July 29th, but was not from the group who released the data (though information on that data was included).

Whether the concern about the ability to publish if data are released early is real or imagined

Researchers act on that concern

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

These data were put on an FTP server under a CC0 waiver and also given a DOI to make access ‘permanent’

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (~180)

2. Therapeutics (primers, antimicrobials)

3. Platform Comparisons

4. Example for faster & more open science

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

All that aside

Can we all agree that releasing the E. coli data ahead of publication was ‘good’? At least from a public health perspective

Here are the numbers for the E. coli 2011 Outbreak

In total, ~4000 people were infected and 53 died

From a Public Health perspective… Deaths Worldwide*

Infectious Disease

• Measles: 122,000 per year
• Hepatitis C-related liver disease: 350,000-500,000 per year
• Malaria: 627,000 per year
• HIV/AIDS: 1.4-1.7 million per year

Non-communicable, with genetic predisposition

• Prostate cancer: 307,000 per year
• Breast cancer: 522,000 per year
• Suicide: 800,000 per year
• Diabetes: 1.5 million per year
• Cancer: 8.2 million per year
• Cardiovascular Disease: 17.5 million per year

Non-genetic/Non-infectious

• Pesticide Poisoning: 250,000 per year
• Malnutrition: 2.8 million children (under 5) per year

*World Health Organization Fact Sheets http://www.who.int/en/

Sharing Data is Essential for Many Reasons

Rice v Wheat: consequences of publicly available genome data

Sharing aids fields…

Every 10 datasets collected contribute to at least 4 papers in the following 3 years.

Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285. DOI: 10.1038/473285a

Sharing aids authors…

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Lack of Sharing Impacts Reproducibility

1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14

2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Sharing can reduce retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html

2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

At the current rate of increase, by 2045 as many papers will be retracted as are published!

Data Sharing Hurdles

If only it were easy…

There are numerous reasons why researchers do not share data:

The majority of which are good reasons

Wiley Researcher Data Insights Survey

Our objective was to establish a baseline view of data sharing practices, attitudes, and motivations globally, with participation from researchers in every scholarly field.

In March 2014, more than 90,000 researchers around the world were invited to participate in Wiley’s Researcher Data Insights Survey. Participants were researchers who had published at least one journal article in the past year with any publisher.

We received an overwhelming 2,886 responses from around the world.

Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley

Wiley Researcher Data Insights Survey

Key Findings

• Most researchers are sharing their data.

• Those not sharing have a variety of reasons.

• Data that’s being shared typically is <10 GB.

• The most common type of data that is being shared is flat, tabular data (.csv, .txt, .xl)

• Data is usually saved on hard drives.

Wiley Researcher Data Insights Survey

Why Researchers Do Not Share

• Intellectual property or confidentiality issues (59%)
• Concerned research might be “scooped” (39%)
• Concerns about misinterpretation or misuse (32%)
• Concerns about attribution/citation credit (31%)
• Ethical concerns (24%)
• Insufficient time/resources (19%)
• Funder/institution does not require sharing (13%)
• Lack of funding (13%)
• Not sure where to share (5%)
• Not sure how to share (3%)

See also:

http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/

http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/

How Can Publishers Promote Data Sharing

Carrots and Sticks

– Create Journal Data Release Policies

– Check Data Release Policy is followed

– Find Ways to Aid Researchers in Releasing Data

– Consider ways to support/protect researchers who do share ahead of publications

– Promote Data Citation

And why us? Researchers are never so captive as when they are publishing

But we need to help — not just harass.

Incentives/credit

Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009)

Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)

Genomics Data Sharing Policies…

Bermuda Accords 1996/1997/1998:

1. Automatic release of sequence assemblies within 24 hours.

2. Immediate publication of finished annotated sequences.

3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.

Fort Lauderdale Agreement, 2003:

1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.

2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.

Toronto International data release workshop, 2009:

The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.

Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility (From the Fort Lauderdale Meeting 2003)

http://www.genome.gov/pages/research/wellcomereport0303.pdf

DataCite and DOIs

Aims to:

“increase acceptance of research data as legitimate, citable contributions to the scholarly record”.

“data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.

Citing Data Isn’t New

The Physical Sciences have been doing this for a while

How We Envision Research Publication (Communicating Science)

Data Sets in GigaDB (Data Publishing Platform)

Analyses in GigaGalaxy (Data Analysis Platform)

Paper in GigaScience (Open-access journal)

Other journals are now doing something similar

This is most commonly done in the form of a Data Paper rather than a release of data that is citable in itself.

• A Data Paper is effectively a description of the data

• Other journals that do Data Publishing as a formal paper type:

– F1000 Research (launched in 2012): has Data papers as one of several types of papers

– Scientific Data (launched in 2014): solely publishes Data Descriptors

– There are more…

Making the Data Itself Citable

We provide a linked database. The data are then directly linked to the paper, but can also be cited separately through a Data DOI. We can do this because we have a collaboration between BMC (who handles the standard paper publication) and BGI (which has enormous data storage capacity).

However: There are many community-available databases, so in principle any journal can do this by taking advantage of such available resources.

These include the usual suspects: EBI, NCBI, DDBJ etc.

Databases that take all data types and provide Data DOIs: Dryad, FigShare, etc.

There are also numerous smaller community databases specific to different fields or data types.

For data citation to work, it needs:

• Acceptance by journals.

• Data+Citation: inclusion in the references.

• Tracking by citation indexes.

• Usage of the metrics by the community…
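Inclusion in the references is easier to support when a dataset's citation can be generated in the same shape as any other reference. The following is a minimal sketch, not a standard: the `DataCitation` class and its field names are hypothetical, and the layout simply mirrors the "cite this dataset as" wording used for the E. coli release (authors, year, title, publisher, DOI). The author list is shortened to the first three names for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Hypothetical container for the pieces of a dataset reference."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    doi: str

    def reference(self) -> str:
        # Join authors with semicolons, then append year, title,
        # publisher and the DOI in both short and resolvable form.
        names = "; ".join(self.authors)
        return (f"{names} ({self.year}) {self.title}. {self.publisher}. "
                f"doi:{self.doi} https://doi.org/{self.doi}")


# Example values taken from the E. coli dataset citation in this talk
# (author list truncated for the example).
ecoli = DataCitation(
    authors=["Li, D", "Xi, F", "Zhao, M"],
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(ecoli.reference())
```

A production department could treat the resulting string like any other reference-list entry, which is the point of the second bullet above.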

In Principle…

Back to E. coli O104:H4

• As noted: articles on these early released and citable data were published

• Also, the early releasers were not the first to publish

• Nor were the data cited

This open-source analysis work was published on August 25th

The journal did not approve of inclusion of the data citation.

Nor was there any indication of where the genome information could be found

This report was the first to be published, and it included and used information from the crowd-source release as well as the other early release.

Nowhere in the paper is there any indication of where to obtain this data

Nor is there an indication of where to obtain the sequence data they generated

This group made their O104:H4 sequence available in the NCBI database at the time of completion, prior to publication.

Though no link to the Accession Number is easily found in the paper.

This report DID include a reference for the data (even though they did not use it in their analysis)

In Practice…

• Data submitted to NCBI databases:

– Raw data SRA:SRA046843

– Assemblies of 3 strains Genbank:AHAO00000000-AHAQ00000000

– SNPs dbSNP:1056306

– CNVs, InDels dbVAR:nstd63

• Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).

In the references…

Is the DOI…

In Practice…

http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/

The polar bear DATA were released prepublication in 2011. They were used and cited in the following studies before the main paper on the sequencing was published:

Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.

Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.

Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.

Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.

Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109

Cell Press Journals

However, this didn’t include the citation…

One step forward — two steps back

Removing data citations from the references

One journal informed the authors that non-reviewed material could not be cited in the references of the paper

Another journal stripped the data citation from the references, then went an extra step and changed the citation in the Data Availability section to the URL that the DOI directed to at that time.

We happened to know about this one, and were able to create a forward to the DOI’d page when the URL broke after we moved our database platform

Note: Much of this was due to a standard operating procedure in the production department

Lesson: If you decide to include Data Citations- tell your entire team
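The broken-URL story is an argument for keeping the DOI itself in the paper, in whatever written form it arrives, rather than the landing-page address it happens to resolve to. As a rough sketch of what a production workflow could do instead of swapping in the landing URL, the helper below normalizes the DOI forms seen in this talk ("doi:10.5524/100001", "http://dx.doi.org/10.5524/100001") to a single resolver link. The function name `doi_to_url` is made up for illustration, and the regular expression is a deliberately loose approximation of DOI syntax, not a full validator.

```python
import re

# Loose approximation of a DOI: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly wrap a DOI in reference lists.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(reference: str) -> str:
    """Return a doi.org resolver link for a DOI written in any common form."""
    doi = reference.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        # A plain landing-page URL ends up here: it is not a DOI and
        # should not be silently accepted as a persistent identifier.
        raise ValueError(f"not a DOI: {reference!r}")
    return "https://doi.org/" + doi


print(doi_to_url("doi:10.5524/100001"))
print(doi_to_url("http://dx.doi.org/10.5524/100012"))
```

Because the resolver link is derived from the identifier, it keeps working when the repository moves its pages, which a copied landing-page URL does not.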

This is a work in progress…

Data Citation Really is a Major Incentive

On Wednesday this week, we released the genome sequence from 3000 Rice strains (13.4 TB of data)

• These data were also deposited in NIH SRA repository

• So why did we do it too?

1. It is linked directly to the Data Paper that provides details of data production, quality, and basic analysis

2. Authors were hesitant to release these data (a HUGE community resource) prior to the analysis paper publication (which, for 3000 strains… would take years…). The opportunity to have these data citable (and trackable) encouraged the authors and led to their releasing these data and doing so in collaboration with GigaScience’s Biocurator

The 3,000 Rice Genomes Project (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7

The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001

IRRI GALAXY

Rice 3K project: 3,000 rice genomes, 13.4TB public data

No: your data is not too large to share

Beyond Data Citation

Reviewing Data

Data Release policies include the need to help authors

Data availability without metadata is practically useless

Reviewing Data

It’s too hard, we can’t ask our reviewers to do that!

Use Data Reviewers

Example in Neuroscience

1. Neuroscience Data are not typically shared

2. For most papers: Data AND Tools are not typically made available to the reviewers

3. Journal Editors think Reviewers will not want to review data

GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3

Example in Neuroscience

• Neuroscience Data are not typically shared

– Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to share their data is to provide some form of academic credit.”

– We hosted with a DOI: 366 recordings from 12 electrophysiology datasets

– GigaDB is included in Thomson Reuters Data Citation Index

• Data AND Tools are not typically made available to the reviewers

– We made manuscript, data and tools all available to the reviewers.

– We make sure to include reviewers who are able to properly assess the data itself and rerun the tools

– To reduce burdens, we sometimes select a reviewer who ONLY looks at the data

• Journal Editors think Reviewers will not want to review data

– What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and colleagues is a shining example of openness in that it enables replicating the results almost as easily as by pressing a button.”

– What Reviewer Dr. Christophe Pouzat said: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”

Data Release policies include the need to help authors

Collaborations

With data repositories

With other journals

Consider Cross Journal Support

Competition is good…

…but sometimes we should collaborate for the community good

• PLOS’s recent data deposition policies have led to community concerns about feasibility.

• We support (and applaud) this… we have an even stricter data deposition policy

• But PLOS ONE received a submission that was a comparative study of earthworm morphology and anatomy using a 3D non-invasive imaging technique called micro-computed tomography (or microCT)… And there is no good place to put this

• These data are extremely complex: videos, multiple files, with several folders of ~10 GB

Consider Cross Journal Support

• GigaScience and PLOS ONE collaborated. They published the main article; we published a Data Note describing the data itself and hosted all the data on GigaDB under separate citation.

• With our Aspera Connection, reviewers could download even the 10 GB folders in ~1/2 hour

• Reviewer Dr. Sarah Faulwetter noted the usefulness of having these data available, saying: “Instead of having to go through the lengthy process of obtaining the physical specimen from a museum, I can now download a fairly accurate representation from the web.”
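For a sense of scale, the half-hour download time can be turned into an average throughput. This is back-of-the-envelope arithmetic only, assuming a folder of ~10 GB (the size quoted for the earthworm microCT folders), decimal gigabytes, and a steady transfer; `required_mbit_per_s` is an illustrative helper and has nothing to do with Aspera's own tooling.

```python
def required_mbit_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in megabits per second needed to move
    size_gb (decimal gigabytes) in the given number of minutes."""
    bits = size_gb * 1e9 * 8      # gigabytes -> bytes -> bits
    seconds = minutes * 60
    return bits / seconds / 1e6   # bits per second -> Mbit/s


rate = required_mbit_per_s(10, 30)
print(f"{rate:.1f} Mbit/s")  # prints "44.4 Mbit/s"
```

Roughly 44 Mbit/s sustained is within reach of an ordinary institutional connection, which is what makes asking reviewers to fetch such data realistic.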

Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6; Lenihan, et al (2014): GigaScience Database. http://dx.doi.org/10.5524/100092; Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617

Data availability without metadata is practically useless

Engage/Employ/Interact with Curators

Challenges for the future…

1. Lack of interoperability/sufficient metadata

2. Long tail of curation (“Democratization” of “big-data”)
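Part of the curation burden can be reduced by checking a submission for missing metadata before a curator ever looks at it. The sketch below is only an illustration of that idea: the list of required fields is an assumption made up for this example, not the GigaDB schema or any community standard, and `missing_metadata` is a hypothetical helper.

```python
# Hypothetical minimum set of fields; a real repository or community
# standard would define its own mandatory properties.
REQUIRED_FIELDS = (
    "title",
    "creators",
    "publication_year",
    "identifier",
    "license",
    "description",
)


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Values taken from the E. coli dataset citation in this talk; the
# description is left empty on purpose to show what gets flagged.
record = {
    "title": "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    "creators": "Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium",
    "publication_year": 2011,
    "identifier": "doi:10.5524/100001",
    "license": "CC0",
    "description": "",
}
print(missing_metadata(record))  # prints ['description']
```

A check like this does not replace a curator, but it lets the curator spend time on the content of the metadata rather than on its absence.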

Think about what you do… and what you can do…

• Promote, rather than inhibit, prepublication data sharing

• Promote Data Citation in the reference section

– incentivizes data release

– Makes it easier for readers to find

• Promote Data Sharing upon publication

– Consider your data release policies

• Form collaborations with repositories to aid authors in depositing their work

– Identify community organizations with metadata standards

• Make data available for reviewers (author website, community repositories, Dryad and similar, your publisher?)

– at least do a sanity check

– Use “data reviewers”

No, this isn’t easy, but do what you can now

And work toward the rest

Evolve

It’s Time to Move Beyond Dead Trees


Thanks to:

Scott Edmunds, Executive Editor

Nicole Nogoy, Commissioning Editor

Peter Li, Lead Data Manager

Chris Hunter, Lead BioCurator

Rob Davidson, Data Scientist

Xiao (Jesse) Si Zhe, Database Developer

Amye Kenall, Journal Development Manager

Contact us:

editorial@gigasciencejournal.com

database@gigasciencejournal.com

@GigaScience

facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog

www.gigasciencejournal.com

www.gigadb.org
