Scott Edmunds :: a new resource for the big-data community. www.gigasciencejournal.com
GigaScience: a new resource for the big-data community.

Jan 28, 2015

Scott Edmunds's talk at BGI's Bio-IT APAC meeting in Shenzhen, introducing BGI/BMC's new big-data journal, GigaScience. 7th July 2011
Transcript
Page 1: GigaScience: a new resource for the big-data community.

Scott Edmunds

:: a new resource for the big-data community.

www.gigasciencejournal.com

Page 2: GigaScience: a new resource for the big-data community.

Data Tsunami?

Flickr cc: opensourceway

Page 3: GigaScience: a new resource for the big-data community.

~100,000X

Sequencing cost ($ per Mbp)

Moore’s Law

Sequencing

Source: E Lander/Broad

Page 4: GigaScience: a new resource for the big-data community.

Sequencing Output

Data

Moore’s/Kryder’s Law

Storage

Page 5: GigaScience: a new resource for the big-data community.

Sequencing Output

Data

Dissemination?

Publication

Page 6: GigaScience: a new resource for the big-data community.

1 Illumina HiSeq 2000 (+TruSeq upgrade)

= 600 Gb/run (12 days)

× 128 HiSeq = 6 Tb/day = >2 Pb/year

= ~2000 human genomes/day

Potential sequencing capacity
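
The capacity figures on this slide follow from simple arithmetic. Below is a minimal sketch in Python, not taken from the talk. It assumes that "Gb" means gigabases, that all instruments run continuously for a 365-day year, and that a human genome is about 3 Gb counted at 1x coverage. The genome size and coverage are assumptions; the slide does not state them.

```python
# Figures taken from the slide.
GB_PER_RUN = 600        # Gb produced by one HiSeq 2000 run
DAYS_PER_RUN = 12       # duration of one run, in days
INSTRUMENTS = 128       # number of HiSeq instruments

# Assumptions (not stated on the slide).
HUMAN_GENOME_GB = 3.0   # approximate human genome size, 1x coverage
DAYS_PER_YEAR = 365     # continuous operation, no downtime


def daily_output_gb(gb_per_run, days_per_run, instruments):
    """Return the combined output of all instruments in Gb per day."""
    return gb_per_run / days_per_run * instruments


daily_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, INSTRUMENTS)
daily_tb = daily_gb / 1000                      # Gb -> Tb
yearly_pb = daily_tb * DAYS_PER_YEAR / 1000     # Tb/day -> Pb/year
genomes_per_day = daily_gb / HUMAN_GENOME_GB    # 1x genome equivalents

print(f"{daily_tb:.1f} Tb/day")
print(f"{yearly_pb:.1f} Pb/year")
print(f"{genomes_per_day:.0f} genome equivalents/day")
```

Under these assumptions the result is about 6.4 Tb/day, about 2.3 Pb/year, and a little over 2,100 genome equivalents per day, in line with the slide's rounded figures. At a realistic sequencing depth (for example 30x) the number of complete genomes per day would be correspondingly lower.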

Page 7: GigaScience: a new resource for the big-data community.

Flickr cc: opensourceway

Can we keep up?

Page 8: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Large-Scale Data Journal/Database

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD

In conjunction with:

Now taking submissions…

Page 9: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Editorial Board: International

Stephen O'Brien, USA
Hanchuan Peng, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Marie Zins, France

Stephan Beck, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Gary King, USA
Donald Moerman, Canada
Karen Nelson, USA

Page 10: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Editorial Board: Broad Span (all “big-data”)

Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Ming Qi, Genetics/Variome
Susanna-Assunta Sansone, Standards
Michael Schatz, Genomics/Cloud
David Schwartz, Optical Mapping
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Marie Zins, Medicine

Stephan Beck, Epigenomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics/Databases
Robert Hanner, Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Gary King, Medicine/methods
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics

Page 11: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Criteria and Focus of Journal/Database

Reproducibility/Reuse
Utility/Usability
Standards/Searchability/Scale/Sharing
Data publishing/DOI

Page 12: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Use of Data = Importance + Usability

easier to assess

subjective?

Page 13: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Data publishing/DOI

Data hosting will follow standard funding agency and community guidelines.
DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.
Datasets tracked by WOS/ISI allowing additional metrics/credit for use.

Page 14: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Reproducibility/Reuse

BGI Cloud Computing resources for handling and analyzing large-scale data.
Integrated tools to promote more widespread access, viewing, and analysis of data.
Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).

Page 15: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Special Series/Hub for cloud-based tools

Technical notes: test tools in the BGI-Cloud.
Tools + Test Data (BGI or user) in one place.
Aids reproducibility.
Aids reviewers (free)
Aids authors: visibility (PubMed, etc.)

hosting (included/free offers)

–contact us: [email protected]

Oledoe flickr cc

Page 16: GigaScience: a new resource for the big-data community.

www.gigasciencejournal.com

Standards/Searchability/Sharing

ISA-Tab compatibility to aid and promote best practice in metadata reporting.
All supporting data must be publicly available.
Ask for MIBBI compliance and use of reporting checklists.
Part of the Biosharing network.

Page 17: GigaScience: a new resource for the big-data community.

Benefits of Data-sharing

Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Sharing Detailed Research Data Is Associated with Increased Citation Rate.

Every 10 datasets collected contribute to at least 4 papers in the following 3 years.

Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Page 18: GigaScience: a new resource for the big-data community.

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

Our first DOI:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Page 19: GigaScience: a new resource for the big-data community.
Page 20: GigaScience: a new resource for the big-data community.
Page 21: GigaScience: a new resource for the big-data community.

“The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”

“German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”

Page 22: GigaScience: a new resource for the big-data community.

G10K Genomes Get DOI®s

doi:10.5524/100004
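
The dataset DOIs on these slides appear in two forms: the short `doi:` form and a resolvable `http://dx.doi.org/` link. Below is a minimal sketch in Python of converting the first into the second. The function name `doi_to_url` is hypothetical, and the resolver prefix is simply the one shown in the dataset citation above; the code is an illustration, not part of the GigaScience tooling.

```python
def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn a string such as 'doi:10.5524/100001' into a resolver URL.

    The default resolver prefix is the one used on the slides; the
    function name and signature are illustrative assumptions.
    """
    doi = doi.strip()
    # Drop an optional leading "doi:" label (case-insensitive).
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # DOI names begin with the directory indicator "10." and contain a
    # "/" separating the registrant prefix from the suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return resolver + doi


print(doi_to_url("doi:10.5524/100001"))
print(doi_to_url("doi:10.5524/100004"))
```

The first call reproduces the link given in the E. coli dataset citation, and the second builds the corresponding link for the G10K dataset DOI.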