Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration

Jan 27, 2015

Alexandra Basford's talk in the curation session at the InCoB meeting in Kuala Lumpur, 30/11/11, on GigaScience: A Journal’s Perspective on Data Standards and Biocuration
Transcript
Page 1: Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration

A Journal’s Perspective on Data Standards and Biocuration

Alexandra Basford, PhD

www.gigasciencejournal.com

Overview

Introduction

The Curation Challenges of a Journal/Database

Data Publishing

Our DOI Adventures

Reproducibility/Reuse

Utility/Usability

Standards/Searchability/Sharing

How do we deal with “big data”?

What is GigaScience?

www.gigasciencejournal.com

GigaScience: “big and sharable”

GigaScience is a new open-access, open-data journal for the publication of all types of biological studies that use or create large-scale data sets.

The scope spans the biomedical and life sciences, including:

- “Omics”
- Imaging
- Neuroscience
- Ecology
- Medicine
- Systems biology

Published by [logo] in partnership with [logo]

Editorial Board – International

Stephan Beck, UK; Alvis Brazma, UK; Ann-Shyn Chiang, Taiwan; Richard Durbin, UK; Paul Flicek, UK; Robert Hanner, Canada; Yoshihide Hayashizaki, Japan; Henning Hermjakob, UK; Wolfgang Huber, Germany; Gary King, USA; Tin-Lap Lee, Hong Kong; Donald Moerman, Canada; Karen Nelson, USA; Francis Ouellette, Canada; Lennart Hammarström, Sweden; Paul Horton, Japan

Stephen O'Brien, USA; Hanchuan Peng, USA; Russell Poldrack, USA; Ming Qi, China/USA; Susanna-Assunta Sansone, UK; Michael Schatz, USA; David Schwartz, USA; Fritz Sommer, USA; Lincoln Stein, Canada; Sumio Sugano, Japan; Thomas Wachtler, Germany; Jun Wang, China; Alistair Young, New Zealand; Zang Yufeng, China; Marie Zins, France

Editorial Board – Multidisciplinary

Stephan Beck, Epigenomics; Alvis Brazma, Transcriptomics; Ann-Shyn Chiang, Neuroscience; Richard Durbin, Genetics/Genomics; Paul Flicek, Genomics; Robert Hanner, DNA Barcoding/Ecology; Yoshihide Hayashizaki, Genomics; Henning Hermjakob, Proteomics; Wolfgang Huber, Functional Genomics; Gary King, Medicine; Tin-Lap Lee, Genomics; Donald Moerman, Functional Genomics; Karen Nelson, Metagenomics; Francis Ouellette, Genomics; Lennart Hammarström, Immuno/Genetics; Paul Horton, Genetics/Tools

Stephen O'Brien, Genomics; Hanchuan Peng, Imaging/Neuro; Russell Poldrack, Neuroscience; Ming Qi, Genetics; Susanna-Assunta Sansone, Standards; Michael Schatz, Cloud Computing; David Schwartz, Optical Mapping; Fritz Sommer, Neuroscience; Lincoln Stein, Cloud Computing; Sumio Sugano, Genomics; Thomas Wachtler, Neuroscience; Jun Wang, Genomics; Alistair Young, Medical Imaging; Zang Yufeng, Neuroscience; Marie Zins, Medicine

Now accepting submissions

What is GigaDB?

www.GigaDB.org

An Unusual Format

• GigaScience combines standard manuscript publication with an ever-expanding database
• Evolving data repository
  – Integrating tools for public access, viewing, and analysis of the stored data
  – Improvements driven by community input
• All datasets are assigned digital object identifiers (DOIs) to make them easy to access, track, and cite

Data Sharing Hurdles

• Technical
  – too large volumes
  – too heterogeneous
  – no home for many data types
• Economic
  – too expensive
  – no long-term funding
• Cultural
  – inertia
  – no incentives to share
  – unaware of how
  – too time consuming

Changing Trends

Growing/widening user base.

Cultural shift towards data sharing.

The long tail of new “big-data” producers?

Curation, curation, curation

Use of Data = Importance + Usability

easier to assess

subjective?

Challenges for a Journal/Database

Data publishing/DOI

Reproducibility/Reuse

Utility/Usability

Standards/Searchability/Sharing

Why DOI®s?

• Guarantee of permanency
• Clear method for data tracking and data citation, allowing:
  – Increased searchability (and hopefully use) of data
  – Credit for data production, making it clear who produced the data and when
  – Credit to original authors for their data’s use
  – The ability to track and receive feedback on data usage
  – A data citation metric potentially rivaling and complementary to the impact factor
  – The potential to make the data available and receive credit for it earlier, then later publish papers on the dataset

Largest Sequencing Capacity in the World

Sequencers
• 137 Illumina/HiSeq 2000
• 27 LifeTech/SOLiD 4
• 16 AB/3730xl + 110 MegaBACEs
• 2 Illumina iScan

Data Production
• 5.6 Tb / day
• > 1500X of human genome / day

Multiple Supercomputing Centers
• 157 TFlops
• 20 TB Memory
• 12.6 PB Storage
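
The two data-production figures above can be cross-checked with a little arithmetic. The sketch below is only illustrative: it assumes that "Tb" on the slide means terabases (10^12 bases) and takes a haploid human genome of roughly 3.1 billion bases, neither of which is stated in the talk.

```python
# Rough consistency check of the slide's figures (a sketch, not BGI's own calculation).
# Assumption: "5.6 Tb / day" means 5.6 terabases (5.6e12 bases) sequenced per day.
DAILY_OUTPUT_BASES = 5.6e12

# Assumption: approximate haploid human genome size, about 3.1 billion bases.
HUMAN_GENOME_BASES = 3.1e9


def genome_equivalents(bases: float, genome_size: float) -> float:
    """Return how many times `bases` would cover a genome of `genome_size`."""
    return bases / genome_size


fold = genome_equivalents(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"About {fold:.0f}X of the human genome per day")
```

Under these assumptions the result is comfortably above the "> 1500X" quoted on the slide, so the two figures are consistent with each other.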

BGI – “Sequence it.”

Early BGI DOI®s

Datasets

Plants
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum

Microbe
• E. coli O104:H4 TY-2482

Human
• Asian individual (YH)
  - DNA Methylome
  - Genome Assembly
  - Transcriptome
• Ancient DNA (coming soon)
  - Saqqaq Eskimo
  - Aboriginal Australian

Vertebrates
• Giant panda
• Macaque
  - Chinese rhesus
  - Crab-eating
• Naked mole rat
• Penguin
  - Emperor penguin
  - Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope

Invertebrates
• Ant
  - Florida carpenter ant
  - Jerdon’s jumping ant
  - Leaf-cutter ant
• Roundworm
• Silkworm

Cell Line
• Chinese Hamster Ovary

The Success of E. coli

Our First DOI®

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
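
The citation pattern shown above (authors, year, title, publisher, DOI, resolver link) can be assembled mechanically from a dataset record. The following is a minimal sketch of that idea, not GigaDB's actual code: the `DatasetCitation` class and its field names are hypothetical, and the author list is shortened here purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Hypothetical record holding the fields of a dataset citation."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str

    def resolver_url(self) -> str:
        # The slide links the DOI through the dx.doi.org resolver.
        return f"http://dx.doi.org/{self.doi}"

    def render(self) -> str:
        # Follows the order used on the slide:
        # Authors (Year) Title. Publisher. doi:DOI resolver-URL
        return (
            f"{self.authors} ({self.year}) {self.title}. "
            f"{self.publisher}. doi:{self.doi} {self.resolver_url()}"
        )


citation = DatasetCitation(
    # Author list shortened for illustration; the full list is on the slide.
    authors="Li, D; Xi, F; Zhao, M and the Escherichia coli O104:H4 "
            "TY-2482 isolate genome sequencing consortium",
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation.render())
```

Keeping the DOI as its own field, rather than embedding it in free text, is what makes the dataset easy to track and cite consistently.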

N Engl J Med 2011; 365:718-724.

The Macaque Story

Analysis paper published

Data DOIs appear in the paper

Sorghum as the New Gold Standard

Recently published

• Data also submitted to NCBI (including SV data to dbVar)
• Submission to public databases complemented by its citable form in GigaDB:
  - Raw data
  - Assemblies of three strains
  - SNPs
  - InDels
  - SV
  - CNVs

In the paper…

In the references…

Is the DOI.

Progress! (It’s been a busy year.)

Timeline: July → August → October → November

Milestones:
• We begin issuing data DOIs
• Journals accept articles with data that have data DOIs
• Data DOIs listed in journal articles
• Data DOIs are properly cited in the reference section of journal articles

Challenges for a Journal/Database

Data publishing/DOI ✔

Reproducibility/Reuse

Utility/Usability

Standards/Searchability/Sharing

Reproducibility/Reuse

• BGI Cloud Computing resources for handling and analyzing large-scale data.
• Integrated tools to promote more widespread access, viewing, and analysis of data.
• Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).

Utility/Usability = ease of access

• Special series/hub for cloud-based tools
  - Technical notes: test tools in the BGI-Cloud.
  - Tools + test data (BGI or user) in one place.
  - Aids reproducibility.
  - Aids reviewers (free)
  - Aids authors: visibility (PubMed, etc.), hosting (included/free offers)
  – contact us: [email protected]

Utility/Usability = tools

Tin-Lap Lee, CUHK

Standards/Searchability/Sharing

• ISA-Tab compatibility to aid and promote best practice in metadata reporting.
• All supporting data must be publicly available.
• Ask for MIBBI compliance and use of reporting checklists.
• Part of the BioSharing network and the International Neuroinformatics Coordinating Facility.

Big Data

ldl.genomics.cn

• Initiated 505 plant and animal genome projects

• Completed fine or draft genome maps for over 100 species

• Finished the sequencing of about 200 species

www.gigasciencejournal.com
www.gigaDB.org

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD

Follow GigaScience on Twitter @GigaScience
Contact: [email protected]