A Journal’s Perspective on Data Standards and Biocuration Alexandra Basford, PhD www.gigasciencejournal.com
Jan 27, 2015
Overview
Introduction
/ The Curation Challenges of a Journal/Database
Data Publishing
Our DOI Adventures
Reproducibility/Reuse
Utility/Usability
Standards/Searchability/Sharing
How do we deal with “big data”?
vs. ?
What is GigaScience?
… “big and sharable”
GigaScience is a new open-access, open-data journal for the publication of all types of biological studies that use or create large-scale data sets
The scope spans the biomedical and life sciences, including:
- “Omics”
- Imaging
- Neuroscience
- Ecology
- Medicine
- Systems biology
Published by [logo] in partnership with [logo]
Editorial Board – International
- Stephan Beck, UK
- Alvis Brazma, UK
- Ann-Shyn Chiang, Taiwan
- Richard Durbin, UK
- Paul Flicek, UK
- Robert Hanner, Canada
- Yoshihide Hayashizaki, Japan
- Henning Hermjakob, UK
- Wolfgang Huber, Germany
- Gary King, USA
- Tin-Lap Lee, Hong Kong
- Donald Moerman, Canada
- Karen Nelson, USA
- Francis Ouellette, Canada
- Lennart Hammarström, Sweden
- Paul Horton, Japan
- Stephen O'Brien, USA
- Hanchuan Peng, USA
- Russell Poldrack, USA
- Ming Qi, China/USA
- Susanna-Assunta Sansone, UK
- Michael Schatz, USA
- David Schwartz, USA
- Fritz Sommer, USA
- Lincoln Stein, Canada
- Sumio Sugano, Japan
- Thomas Wachtler, Germany
- Jun Wang, China
- Alistair Young, New Zealand
- Zang Yufeng, China
- Marie Zins, France
Editorial Board – Multidisciplinary
- Stephan Beck, Epigenomics
- Alvis Brazma, Transcriptomics
- Ann-Shyn Chiang, Neuroscience
- Richard Durbin, Genetics/Genomics
- Paul Flicek, Genomics
- Robert Hanner, DNA Barcoding/Ecology
- Yoshihide Hayashizaki, Genomics
- Henning Hermjakob, Proteomics
- Wolfgang Huber, Functional Genomics
- Gary King, Medicine
- Tin-Lap Lee, Genomics
- Donald Moerman, Functional Genomics
- Karen Nelson, Metagenomics
- Francis Ouellette, Genomics
- Lennart Hammarström, Immuno/Genetics
- Paul Horton, Genetics/Tools
- Stephen O'Brien, Genomics
- Hanchuan Peng, Imaging/Neuro
- Russell Poldrack, Neuroscience
- Ming Qi, Genetics
- Susanna-Assunta Sansone, Standards
- Michael Schatz, Cloud Computing
- David Schwartz, Optical Mapping
- Fritz Sommer, Neuroscience
- Lincoln Stein, Cloud Computing
- Sumio Sugano, Genomics
- Thomas Wachtler, Neuroscience
- Jun Wang, Genomics
- Alistair Young, Medical Imaging
- Zang Yufeng, Neuroscience
- Marie Zins, Medicine
Now accepting submissions
What is GigaDB?
www.GigaDB.org
vs. !✕&
An Unusual Format
• GigaScience combines standard manuscript publication with an ever-expanding database
• Evolving data repository
  – Integrating tools for public access, viewing, and analysis of the stored data
  – Improvements driven by community input
• All datasets are assigned digital object identifiers (DOIs) to make them easy to access, track, and cite
Data Sharing Hurdles
• Technical
  – volumes too large
  – data too heterogeneous
  – no home for many data types
• Economic
  – too expensive
  – no long-term funding
• Cultural
  – inertia
  – no incentives to share
  – unaware of how
  – too time consuming
Curation, curation, curation
The long tail of new “big-data” producers?
Changing Trends
Growing/widening user base.
Cultural shift towards data sharing.
Use of Data = Importance + Usability
easier to assess
subjective?
Challenges for a Journal/Database
Reproducibility/Reuse
DOI®
Utility/Usability
Standards/Searchability/Sharing
Data publishing/DOI
Why DOI®s?
• Guarantee of permanency
• Clear method for data tracking and data citation, allowing:
  – Increased searchability (and hopefully use) of data
  – Credit for data production, making it clear who produced the data and when
  – Credit to original authors for their data’s use
  – The ability to track and receive feedback on data usage
  – A data citation metric potentially rivaling and complementary to the impact factor
  – The potential to make the data available and receive credit for it earlier, then later publish papers on the dataset
BGI – “Sequence it.”
Largest Sequencing Capacity in the World
Sequencers
- 137 Illumina/HiSeq 2000
- 27 LifeTech/SOLiD 4
- 16 AB/3730xl + 110 MegaBACEs
- 2 Illumina iScan
Data Production
- 5.6 Tb / day
- > 1500X of human genome / day
Multiple Supercomputing Centers
- 157 TFlops
- 20 TB Memory
- 12.6 PB Storage
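The “> 1500X of human genome / day” figure can be sanity-checked from the daily output. The sketch below is a rough check only. It assumes that “Tb” means terabases and that the haploid human genome is about 3.1 gigabases. Neither assumption is stated on the slide.

```python
# Rough check of the daily-coverage figure quoted on the slide.
# Assumptions (not from the slide): "Tb" = terabases, and a haploid
# human genome of ~3.1 gigabases.

BASES_PER_DAY = 5.6e12        # 5.6 Tb / day, from the slide
HUMAN_GENOME_BASES = 3.1e9    # assumed genome size


def fold_coverage(bases_sequenced: float, genome_size: float) -> float:
    """Return how many times the genome is covered by the sequenced bases."""
    return bases_sequenced / genome_size


coverage_per_day = fold_coverage(BASES_PER_DAY, HUMAN_GENOME_BASES)
print(f"~{coverage_per_day:.0f}X human genome per day")
```

Under these assumptions the result is comfortably above the 1500X quoted on the slide. A different assumed genome size shifts the number but not the order of magnitude.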
Early BGI DOI®s
Datasets
Plants
- Chinese cabbage
- Cucumber
- Foxtail millet
- Pigeonpea
- Potato
- Sorghum
Microbe
- E. coli O104:H4 TY-2482
Human
- Asian individual (YH)
  - DNA Methylome
  - Genome Assembly
  - Transcriptome
- Ancient DNA (coming soon)
  - Saqqaq Eskimo
  - Aboriginal Australian
Vertebrates
- Giant panda
- Macaque
  - Chinese rhesus
  - Crab-eating
- Naked mole rat
- Penguin
  - Emperor penguin
  - Adelie penguin
- Pigeon, domestic
- Polar bear
- Sheep
- Tibetan antelope
Invertebrates
- Ant
  - Florida carpenter ant
  - Jerdon’s jumping ant
  - Leaf-cutter ant
- Roundworm
- Silkworm
Cell Line
- Chinese Hamster Ovary
The Success of E. coli
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
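The citation above gives the same identifier twice, once as a bare DOI and once as a resolver URL. The sketch below shows one way the two forms relate. The helper names are hypothetical, and the resolver prefix is simply the one that appears in the citation.

```python
# Sketch: turn a DOI as written in a citation into a resolver URL.
# The resolver prefix is the one shown in the citation above; the
# helper names are hypothetical, chosen for illustration only.

RESOLVER = "http://dx.doi.org/"


def normalize_doi(raw: str) -> str:
    """Strip whitespace and an optional 'doi:' label, returning the bare DOI."""
    doi = raw.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the directory indicator "10." and has a
    # prefix/suffix pair separated by a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def doi_to_url(raw: str) -> str:
    """Build the resolver URL for a DOI."""
    return RESOLVER + normalize_doi(raw)


def doi_prefix(raw: str) -> str:
    """Return the registrant prefix, the part before the first slash."""
    return normalize_doi(raw).split("/", 1)[0]


print(doi_to_url("doi:10.5524/100001"))
print(doi_prefix("doi:10.5524/100001"))
```

Keeping the bare DOI as the stored form and deriving the URL on demand means the citation stays valid even if the preferred resolver address changes.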
Our First DOI®
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
N Engl J Med 2011; 365:718-724.
The Macaque Story
Analysis paper published
Data DOIs appear in the paper
Sorghum as the New Gold Standard
• Data also submitted to NCBI (including SV data to dbVar)
• Submission to public databases complemented by its citable form in GigaDB:
  - Raw data
  - InDels
  - SV
  - Assemblies of three strains
  - SNPs
  - CNVs
Recently published
In the paper…
In the references…
Is the DOI.
Progress!
(It’s been a busy year.)
July → August → October → November
• We begin issuing data DOIs
• Journals accept articles with data that have data DOIs
• Data DOIs listed in journal articles
• Data DOIs are properly cited in the reference section of journal articles
Challenges for GigaScience/GigaDB
Reproducibility/Reuse
DOI®
Utility/Usability
Standards/Searchability/Sharing
Data publishing/DOI ✔
Reproducibility/Reuse
• BGI Cloud Computing resources for handling and analyzing large-scale data.
• Integrated tools to promote more widespread access, viewing, and analysis of data.
• Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
Utility/Usability = ease of access
• Special series/hub for cloud-based tools
  - Technical notes: test tools in the BGI-Cloud.
  - Tools + test data (BGI or user) in one place.
  - Aids reproducibility.
  - Aids reviewers (free)
  - Aids authors: visibility (PubMed, etc.), hosting (included/free offers)
– contact us: [email protected]
Oledoe flickr cc
Tin-Lap Lee, CUHK
Utility/Usability = tools
Standards/Searchability/Sharing
• ISA-Tab compatibility to aid and promote best practice in metadata reporting.
• All supporting data must be publicly available.
• Ask for MIBBI compliance and use of reporting checklists.
• Part of the BioSharing network and the International Neuroinformatics Coordinating Facility.
Big Data
ldl.genomics.cn
• Initiated 505 plant and animal genome projects
• Completed fine or draft genome maps for over 100 species
• Finished the sequencing of about 200 species
www.gigasciencejournal.com www.gigaDB.org
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
Follow GigaScience on Twitter @GigaScience Contact: [email protected]