Data Consultant, Honorary Academic Editor Associate Director, Principal Investigator The rise of the data-centric research and publication enterprises Susanna-Assunta Sansone, PhD uk.linkedin.com/in/sasansone @biosharing @isatools @scientificdata 'Managing Big Data - Setting the standards for analyzing and integrating big data’, Berlin, July 9-10, 2014 http://www.slideshare.net/SusannaSansone

Managing Big Data - Berlin, July 9-10, 201.

Jan 26, 2015

Transcript
Page 1: Managing Big Data - Berlin, July 9-10, 201.

Data Consultant, Honorary Academic Editor

Associate Director, Principal Investigator

The rise of the data-centric research and publication enterprises

Susanna-Assunta Sansone, PhD

uk.linkedin.com/in/sasansone

@biosharing @isatools @scientificdata

'Managing Big Data - Setting the standards for analyzing and integrating big data’, Berlin, July 9-10, 2014

http://www.slideshare.net/SusannaSansone

Page 2: Managing Big Data - Berlin, July 9-10, 201.

Outline

•  About myself
   o  activities and interests
•  Be FAIR to your data
   o  concept
   o  my related projects
•  The Scientific Data exemplar
   o  rationale
   o  Data Descriptors

Page 3: Managing Big Data - Berlin, July 9-10, 201.

My areas of activity:
•  Data capture and curation
•  Data (nano)publication
•  Data provenance
•  Open, community ontologies and standards
•  Semantic web
•  Software development
•  Training

Communities I work with/for:

As part of:
•  UK, European and international consortia
•  Pre-competitive informatics public-private partnerships
•  Standardization initiatives
with e.g.:

Page 4: Managing Big Data - Berlin, July 9-10, 201.

Notes in Lab Books (information for humans)

Spreadsheets and Tables (the compromise)

Facts as RDF statements (information for machines)

Notes and narrative; spreadsheets and tables; linked data and nanopublications

Enabling reproducible research and open science, driving science and discoveries

Increase the level of annotation at the source, tracking provenance and using community standards
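
The step from a spreadsheet row to facts for machines can be illustrated with a minimal sketch. It assumes a hypothetical spreadsheet row and made-up example.org identifiers (none of them come from the talk) and writes each cell as a subject-predicate-object statement in N-Triples syntax, using only the Python standard library.

```python
# Hypothetical example: the row, the column names and the example.org URIs
# are illustrative assumptions, not taken from the slides.
BASE = "http://example.org/"

row = {"sample": "sample42", "organism": "Mus musculus", "tissue": "liver"}


def row_to_triples(row, id_column="sample"):
    """Turn one spreadsheet row into (subject, predicate, object) triples."""
    subject = BASE + row[id_column]
    return [
        (subject, BASE + column, value)
        for column, value in row.items()
        if column != id_column
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines with plain string literals."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes and double quotes inside the literal.
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


triples = row_to_triples(row)
print(to_ntriples(triples))
```

Each column other than the identifier becomes one statement about the sample, so the annotation that was implicit in the table layout becomes explicit and can be queried.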

Page 5: Managing Big Data - Berlin, July 9-10, 201.

Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

Page 6: Managing Big Data - Berlin, July 9-10, 201.

A great start, but not enough

image by Greg Emmerich

http://discovery.urlibraries.org/

http://www.theguardian.com/higher-education-network/blog/2014/jun/26

Page 7: Managing Big Data - Berlin, July 9-10, 201.

Findable, Accessible, Interoperable, Reusable

Worldwide movement for FAIR data

Page 8: Managing Big Data - Berlin, July 9-10, 201.

In all fairness, not much data is FAIR

Page 9: Managing Big Data - Berlin, July 9-10, 201.

But it is not just about technology…

Page 10: Managing Big Data - Berlin, July 9-10, 201.

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

…breadth and depth of the content

…is pivotal

Page 11: Managing Big Data - Berlin, July 9-10, 201.

sample characteristic(s)
experimental design
experimental variable(s)
technology(s)
measurement(s)
protocol(s)
data file(s)
......

Page 12: Managing Big Data - Berlin, July 9-10, 201.

§  To make this dataset ‘FAIR’, one must have standards, tools and best practices to:
•  report sufficient details
•  capture all salient features of the experimental workflow
•  make annotation explicit and discoverable
•  structure the descriptions for consistency
•  ensure/regulate access
•  deposit and publish
•  etc.

Page 13: Managing Big Data - Berlin, July 9-10, 201.

A community mobilization to develop standards, e.g.:

§  Structural and operational differences
•  organization types (open, closed to members, society, WG etc.)
•  standards development (how to formulate, conduct and maintain)
•  adoption, uptake, outreach (link to journals, funders and commercial sector)
•  funds (sponsors, memberships, grants, volunteering)

de jure de facto

grass-roots groups

standard organizations

Nanotechnology Working Group

Page 14: Managing Big Data - Berlin, July 9-10, 201.

Page 15: Managing Big Data - Berlin, July 9-10, 201.

Focus on reporting or content standards

Including minimum information reporting requirements, or checklists to report the same core, essential information

Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’

Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another

Community-developed standards are pivotal to structure and enrich the description of datasets and to share them, facilitating understanding and reuse
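
How a minimum information checklist and a controlled vocabulary work together can be shown with a small sketch. The required fields and allowed terms below are hypothetical placeholders, not an actual community standard; the point is only that a checklist asks "is the core information there?" while a vocabulary asks "is it expressed with an agreed term?".

```python
# Hypothetical checklist and vocabulary: the field names and terms are
# illustrative assumptions, not an actual community standard.
REQUIRED_FIELDS = ["organism", "technology", "protocol"]
CONTROLLED_TERMS = {
    "technology": {"microarray", "mass spectrometry", "NMR"},
}


def check_record(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    # Checklist: the same core, essential information must be reported.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # Vocabulary: the same word must be used for the same 'thing'.
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems


good = {"organism": "Homo sapiens", "technology": "NMR", "protocol": "P1"}
bad = {"organism": "Homo sapiens", "technology": "nmr spec"}

print(check_record(good))
print(check_record(bad))
```

The first record passes both checks; the second is flagged for a missing protocol and for a free-text technology name.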

Page 16: Managing Big Data - Berlin, July 9-10, 201.

Technologically-delineated views of the world: transcriptomics, proteomics, metabolomics

Biologically-delineated views of the world: plant biology, epidemiology, microbiology

Generic features (‘common core’)
- description of source biomaterial
- experimental design components

[Figure: technologies shared across domains; labels include Arrays, Scanning, Columns, Gels, MS, FTIR, NMR]

Fragmentation, duplications and gaps

To compare and integrate data we need interoperable standards

Page 17: Managing Big Data - Berlin, July 9-10, 201.

[Figure: technologies (Arrays, Scanning, Columns, Gels, MS, FTIR, NMR) shared across transcriptomics, proteomics and metabolomics]

Synergistic examples exist, but more are needed

Page 18: Managing Big Data - Berlin, July 9-10, 201.

Growing number of reporting standards

+130 (Source: BioSharing)
+150 (Source: BioSharing)
+303 (Source: BioPortal)

Databases, annotation and curation tools implementing standards

Minimum information checklists and reporting guidelines: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…

Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB…

Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…

Page 19: Managing Big Data - Berlin, July 9-10, 201.

Navigating the sea of standards is not trivial

The relationship among popular standard formats for pathway information
BioPAX and PSI-MI are designed for data exchange to and from databases and pathway and network data integration. SBML and CellML are designed to support mathematical simulations of biological systems and SBGN represents pathway diagrams.

CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.

Page 20: Managing Big Data - Berlin, July 9-10, 201.

Which standards and databases can we use/recommend?

I work in the field of cell migration research; which ones are applicable to me?

I use cell migration in translational research; are there specific clinical standards?

Page 21: Managing Big Data - Berlin, July 9-10, 201.
Page 22: Managing Big Data - Berlin, July 9-10, 201.

Page 23: Managing Big Data - Berlin, July 9-10, 201.

Registering and cataloging is just step one; the next ones are:
•  Develop assessment criteria for usability and popularity of standards
•  Associate standards to data policies and databases
•  Assemble journal and funder policies re data storage
•  Make fully cross-searchable
•  Intended goal: help stakeholders make informed decisions

Page 24: Managing Big Data - Berlin, July 9-10, 201.
Page 25: Managing Big Data - Berlin, July 9-10, 201.

General-purpose, configurable format, designed to support:
•  description of the experimental metadata, making the annotation explicit and discoverable
•  provenance tracking
•  use of community standards, such as minimal reporting guidelines and terminologies
•  conversion to a growing number of other metadata formats, e.g. those used by EBI repositories

analysis, method, script

Data file or record in a database
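
The design goals listed above (explicit metadata, provenance tracking, conversion to other formats) can be sketched in a few lines. This is a simplified illustration only: the records are invented and the column names are loosely modelled on tab-delimited metadata tables, an assumption rather than the actual specification of the format described on this slide.

```python
import csv
import io
import json

# Hypothetical records; the column names are illustrative assumptions and
# are NOT taken from any format specification.
COLUMNS = ["Source Name", "Protocol", "Sample Name", "Data File"]
records = [
    {"Source Name": "mouse1", "Protocol": "extraction",
     "Sample Name": "liver1", "Data File": "liver1.raw"},
    {"Source Name": "mouse1", "Protocol": "extraction",
     "Sample Name": "kidney1", "Data File": "kidney1.raw"},
]


def to_tab_delimited(rows):
    """Write rows as a tab-delimited table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def provenance(rows, data_file):
    """Trace a data file back to its sample and source material."""
    for row in rows:
        if row["Data File"] == data_file:
            return [row["Source Name"], row["Sample Name"], row["Data File"]]
    return []


table = to_tab_delimited(records)
print(table)
print(provenance(records, "kidney1.raw"))
print(json.dumps(records[0]))  # the same metadata converted to another format
```

Because each row links source material, sample and data file, the same table serves both as a description of the experiment and as a provenance trail, and it can be re-serialized without loss.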

Page 26: Managing Big Data - Berlin, July 9-10, 201.
Page 27: Managing Big Data - Berlin, July 9-10, 201.

ISA powers data collection, curation resources and repositories, e.g.:

Page 28: Managing Big Data - Berlin, July 9-10, 201.

eTRIKS – European Translational Information and Knowledge Management Services
Consortium of academic partners (Imperial College, CNRS, University of Luxembourg) and pharma companies (Janssen, Merck, AZ, Lilly, Lundbeck, Pfizer, Roche, Sanofi, Bayer, GSK) building a sustainable, open translational research informatics platform

•  Nature Publishing Group’s Scientific Data
•  BioMedCentral and BGI’s GigaScience
•  F1000 Research
•  Oxford University Press

Susanna-Assunta Sansone, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Eamonn Maguire, Milo Thurston; Oxford alumni: Annapaola Santarsiero, Pavlos Georgiou + my previous team when at EBI (2001 – 2010)

Page 29: Managing Big Data - Berlin, July 9-10, 201.

Outline

•  About myself
   o  activities and interests
•  Be FAIR to your data
   o  concept
   o  my related projects
•  The Scientific Data exemplar
   o  rationale
   o  Data Descriptors

Page 30: Managing Big Data - Berlin, July 9-10, 201.

FAIR data - roles and responsibilities

•  Data has to become an integral part of scholarly communication

•  Responsibilities lie across several stakeholder groups: researchers, data centers, librarians, funding agencies and publishers

•  But publishers occupy a “leverage point” in this process

Page 31: Managing Big Data - Berlin, July 9-10, 201.

Journal publishing - changing landscape

Human Genome, 2001: 62 pages, 150 authors, 49 figures, 27 tables

ENCODE Project, 2012: 30 papers, 3 journals

Page 32: Managing Big Data - Berlin, July 9-10, 201.

Launched on May 27th, 2014

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

Supported by:

Page 33: Managing Big Data - Berlin, July 9-10, 201.

Data Descriptor: narrative and structure

Article or narrative component (PDF and HTML)

Experimental metadata or structured component (in-house curated, machine-readable formats)

Page 34: Managing Big Data - Berlin, July 9-10, 201.

Relation with traditional articles - content and time

Scientific hypotheses:
Synthesis
Analysis
Conclusions

Methods and technical analyses supporting the quality of the measurements:
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what when?

BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy)
AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s)
AFTER: expand on your research articles, adding further information for reuse of the data

Page 35: Managing Big Data - Berlin, July 9-10, 201.

Making data discoverable

Export to various formats (ISA_tab, RDF, etc)

Linking between research papers, Data Descriptors, and data records

We currently recognize over 50 public data repositories

Page 36: Managing Big Data - Berlin, July 9-10, 201.

Peer review process focused on quality and reuse

Evaluation is not based on the perceived impact or novelty of the findings

•  Experimental Rigour and Technical Data Quality
   o  Were the data produced in a rigorous and methodologically sound manner?
   o  Was the technical quality of the data supported convincingly with technical validation experiments and statistical analyses of data quality or error, as needed?
   o  Are the depth, coverage, size, and/or completeness of these data sufficient for the types of applications or research questions outlined by the authors?

•  Completeness of the Description
   o  Are the methods and any data-processing steps described in sufficient detail to allow others to reproduce these steps?
   o  Did the authors provide all the information needed for others to reuse this dataset or integrate it with other data?
   o  Is this Data Descriptor, in combination with any repository metadata, consistent with relevant minimum information or reporting standards?

•  Integrity of the Data Files and Repository Record
   o  Have you confirmed that the data files deposited by the authors are complete and match the descriptions in the Data Descriptor?
   o  Have these data files been deposited in the most appropriate available data repository?

Page 37: Managing Big Data - Berlin, July 9-10, 201.

Current content is diverse – bimonthly releases

•  Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology
•  New datasets and previously published data sets
   o  a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
   o  additional tutorial-like information for scientists interested in reusing or integrating the data with their own
•  Datasets in figshare and domain specific databases
•  Code deposited in figshare and GitHub
•  Individual datasets, curated aggregation and citizen science
•  First dataset part of a collection
•  Academic and industry authors

Page 38: Managing Big Data - Berlin, July 9-10, 201.

Interested in collaborating and/or enabling submission?

•  Do you run a data resource we should recognize?
   o  See on our website the list of criteria databases should meet

•  Are you interested in facilitating submission to us?
   o  See our ISA-Tab specification on the website
      -  you can implement and export in this format from your authoring/curation tool, or from your database

•  Do you want to submit Data Descriptor(s)?
   o  Check suitability by sending a pre-submission enquiry; we accept:
      -  Submissions in the life, environmental and biomedical sciences, but not limited to these
      -  Experimental, observational and computational datasets
      -  Individual datasets, curated aggregations, and collections
      -  Unpublished data and follow-up, with additional information for wider reuse, e.g.:
         ✓  a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
         ✓  additional tutorial-like information for scientists interested in reusing or integrating the data with their own

Page 39: Managing Big Data - Berlin, July 9-10, 201.

Helping you publish, discover and reuse research data

Visit nature.com/scientificdata
Email [email protected]
Tweet @ScientificData

Supported by:

Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Editorial Curator: Victoria Newman
Advisory Panel and Editorial Board including senior researchers, funders, librarians and curators