
INSERM - Data Management & Reuse of Health Data - May 2017

Jan 21, 2018

Transcript
Page 1: INSERM - Data Management & Reuse of Health Data - May 2017

On community-standards, FAIR data and scholarly communication

Susanna-Assunta Sansone, PhD
ORCID: 0000-0001-5306-5690

INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017

Data Consultant, Founding Academic Editor

Associate Director, Principal Investigator

www.slideshare.net/SusannaSansone

Page 2: INSERM - Data Management & Reuse of Health Data - May 2017
Page 3: INSERM - Data Management & Reuse of Health Data - May 2017

Source: https://www.dataone.org/best-practices

Simplified research data life cycle

Page 4: INSERM - Data Management & Reuse of Health Data - May 2017

To do better science, more efficiently, we need data that are…

• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were collected

Page 5: INSERM - Data Management & Reuse of Health Data - May 2017

Key problem: low findability and understandability

• Not always well cited and stored
o True for data as well as for any other digital asset
• Poorly described for third party reuse
o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time consuming
o Often rushed and minimally done

Page 6: INSERM - Data Management & Reuse of Health Data - May 2017

We need content or reporting standards

• To harmonize the datasets with respect to the structure and level of annotation of their:
§ experimental components (e.g., design, conditions, parameters),
§ fundamental biological entities (e.g., samples, genes, cells),
§ complex concepts (such as bioprocesses, tissues, diseases),
§ analytical process and the mathematical models, and
§ their instantiation in computational simulations (from the molecular level through to whole populations of individuals)

Page 7: INSERM - Data Management & Reuse of Health Data - May 2017

Types of content standards

Guidelines: minimum information reporting requirements, checklists
o Report the same core, essential information
o e.g. MIAME guidelines

Terminologies: controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Unambiguous identification and definition of concepts
o e.g. Gene Ontology

Formats: conceptual model, schema, exchange formats etc.
o Define the structure and interrelation of information, and the transmission format
o e.g. FASTA
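As a rough illustration of how the three types of content standards complement each other, the sketch below applies one of each to a toy record. It is not taken from the slides: the checklist fields in `REQUIRED_FIELDS`, the one-entry term lookup, and all function names are hypothetical, and the FASTA reader handles only the simplest case (header lines starting with `>` followed by sequence lines).

```python
# Guideline (checklist): which core fields must be reported.
# The field names are hypothetical, not those of MIAME or any real checklist.
REQUIRED_FIELDS = {"organism", "sample_id", "protocol"}

# Terminology: map a free-text label to an unambiguous identifier.
# A one-entry stand-in for a real ontology lookup service.
TERMS = {"apoptotic process": "GO:0006915"}


def missing_fields(record):
    """Return the checklist fields that the record does not report."""
    return sorted(REQUIRED_FIELDS - set(record))


def annotate(label):
    """Return the term identifier for a label, or None if unknown."""
    return TERMS.get(label.strip().lower())


def parse_fasta(text):
    """Format: read minimal FASTA text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


if __name__ == "__main__":
    print(missing_fields({"organism": "Homo sapiens"}))
    print(annotate("Apoptotic process"))
    print(parse_fasta(">seq1\nACGT\nTTAA\n"))
```

The point of the separation is that each function answers a different question: what must be reported, what a label means, and how the data are laid out for exchange.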

Page 8: INSERM - Data Management & Reuse of Health Data - May 2017

Community-driven efforts, just a few examples

de jure (standard organizations) and de facto (grass-roots groups)

Nanotechnology Working Group

Formats Terminologies Guidelines

Page 9: INSERM - Data Management & Reuse of Health Data - May 2017

Content standards in numbers

Formats Terminologies Guidelines

224, 115, 500+ (each with a source link)

Guidelines, e.g.: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …

Formats, e.g.: SRAxml, SOFT, FASTA, DICOM, mzML, SBRML, SEDML, GELML, ISA, CML, MITAB, …

Terminologies, e.g.: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …

Page 10: INSERM - Data Management & Reuse of Health Data - May 2017
Page 11: INSERM - Data Management & Reuse of Health Data - May 2017

How to discover the ‘right’ standards for your data?

Page 12: INSERM - Data Management & Reuse of Health Data - May 2017
Page 13: INSERM - Data Management & Reuse of Health Data - May 2017

A web-based, curated and searchable portal that monitors the development and evolution of standards, their use in databases and the adoption of both in data policies, to inform and educate the user community

Page 14: INSERM - Data Management & Reuse of Health Data - May 2017

Data policies by funders, journals and other organizations

Content standards

Formats Terminologies Guidelines

Map this complex and evolving landscape

Databases

All records are manually curated in-house and verified by the community behind each resource

Page 15: INSERM - Data Management & Reuse of Health Data - May 2017

Data policies by funders, journals and other organizations

Databases

Content standards

Formats Terminologies Guidelines

Using indicators to describe ‘status’

Ready for use, implementation, or recommendation

In development

Status uncertain

Deprecated as subsumed or superseded
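One way a consumer of such a registry might use the four status indicators is to filter records before recommending them. The sketch below is an assumption for illustration only: the `Status` enum simply mirrors the four labels on the slide, and the record names (`standard-a` and so on) are placeholders, not real registry entries or a real API.

```python
from enum import Enum


class Status(Enum):
    """The four status indicators named on the slide."""
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


# Placeholder records: (name, status). The names are hypothetical.
EXAMPLE_RECORDS = [
    ("standard-a", Status.READY),
    ("standard-b", Status.IN_DEVELOPMENT),
    ("standard-c", Status.DEPRECATED),
]


def recommendable(records):
    """Keep only the names of records marked as ready."""
    return [name for name, status in records if status is Status.READY]


if __name__ == "__main__":
    print(recommendable(EXAMPLE_RECORDS))
```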

Page 16: INSERM - Data Management & Reuse of Health Data - May 2017

Understanding how standards are used

Page 17: INSERM - Data Management & Reuse of Health Data - May 2017

Understanding how standards are used

Guideline

Page 18: INSERM - Data Management & Reuse of Health Data - May 2017

Understanding how standards are used

Formats

Guideline

Page 19: INSERM - Data Management & Reuse of Health Data - May 2017

Understanding how standards are used

Formats

Guideline

Formats

Page 20: INSERM - Data Management & Reuse of Health Data - May 2017

Understanding how standards are used

Formats

Guideline

Formats

Terminology

Page 21: INSERM - Data Management & Reuse of Health Data - May 2017

Data policies by funders, journals and other organizations

Databases

Content standards

Formats Terminologies Guidelines

Using indicators to indicate ‘adoption’

Page 22: INSERM - Data Management & Reuse of Health Data - May 2017
Page 23: INSERM - Data Management & Reuse of Health Data - May 2017
Page 24: INSERM - Data Management & Reuse of Health Data - May 2017
Page 25: INSERM - Data Management & Reuse of Health Data - May 2017

Standard developing groups:

Journals, publishers:

Cross-links, data exchange:

Societies and organisations:

Institutional RDM services:

Projects, programmes:

Page 26: INSERM - Data Management & Reuse of Health Data - May 2017

Duplications & lack of interoperability among standards

Technologically-delineated views of the world: transcriptomics, proteomics, metabolomics (arrays, scanning, columns, gels, MS, FTIR, NMR)

Biologically-delineated views of the world: plant biology, epidemiology, microbiology

Generic features (‘common core’)
- description of source biomaterial
- experimental design components

Page 27: INSERM - Data Management & Reuse of Health Data - May 2017

[Figure: technology-delineated standards (arrays, scanning, columns, gels, MS, FTIR, NMR) for transcriptomics, proteomics and metabolomics, alongside biology-delineated standards for plant biology, epidemiology and microbiology]

Hard to use them in combinations, e.g. to represent:

Proteomics-based gut microbiota profiling

Proteomics and metabolomics based gut microbiota profiling

Page 28: INSERM - Data Management & Reuse of Health Data - May 2017

[Figure: technology-delineated standards (arrays, scanning, columns, gels, MS, FTIR, NMR) for transcriptomics, proteomics and metabolomics, alongside biology-delineated standards for plant biology, epidemiology and microbiology]

Enhancing modularization

Proteomics-based gut microbiota profiling

Proteomics and metabolomics based gut microbiota profiling

Page 29: INSERM - Data Management & Reuse of Health Data - May 2017

[Figure: technology-delineated standards (arrays, scanning, columns, gels, MS, FTIR, NMR) for transcriptomics, proteomics and metabolomics, alongside biology-delineated standards for plant biology, epidemiology and microbiology]

Enhancing modularization

Proteomics-based gut microbiota profiling

Proteomics and metabolomics based gut microbiota profiling

Page 30: INSERM - Data Management & Reuse of Health Data - May 2017

High-level information about the metadata standards
biosharing:ReportingGuideline
bsg-000174, bsg-000161
MINSEQE, MIMARKS

Representations of the standards elements
sample information, sample identifier, taxonomy identifier, sequence read, geo location

Template elements for
el-000001, el-000002, el-000003
provenance: MINSEQE
provenance: MINSEQE and MIMARKS
provenance: MIMARKS

Serve machine-readable content metadata standards, providing provenance for their elements, rendering standards invisible to the researchers

Inform the creation of metadata templates
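The idea of template elements that carry provenance can be sketched as a small data structure. This is a minimal illustration, not the portal's actual model: the class and function names are hypothetical, and the pairing of each element identifier with a particular label and provenance is an assumption made for the example, since the slide does not state it unambiguously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateElement:
    element_id: str        # identifiers follow the el-00000N pattern on the slide
    label: str             # assumed pairing of label and identifier
    provenance: frozenset  # standards the element is drawn from


# Assumed assignments, for illustration only.
ELEMENTS = [
    TemplateElement("el-000001", "sequence read", frozenset({"MINSEQE"})),
    TemplateElement("el-000002", "sample identifier",
                    frozenset({"MINSEQE", "MIMARKS"})),
    TemplateElement("el-000003", "geo location", frozenset({"MIMARKS"})),
]


def build_template(standards):
    """Collect every element required by at least one of the given standards."""
    wanted = set(standards)
    return [e for e in ELEMENTS if e.provenance & wanted]


def shared_elements(first, second):
    """Labels of elements that both standards have in common."""
    return [e.label for e in ELEMENTS if {first, second} <= e.provenance]


if __name__ == "__main__":
    for element in build_template(["MINSEQE", "MIMARKS"]):
        print(element.element_id, element.label, sorted(element.provenance))
    print(shared_elements("MINSEQE", "MIMARKS"))
```

Because a shared element is stored once with both standards in its provenance, a combined template asks the researcher for it only once, which is the sense in which the standards become invisible to them.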

Page 31: INSERM - Data Management & Reuse of Health Data - May 2017

How to discover the datasets relevant to your work?

Page 32: INSERM - Data Management & Reuse of Health Data - May 2017

OmicsDI: Nature Biotechnology 35, 406–409 (2017) doi:10.1038/nbt.3790

omicsdi.org

Page 33: INSERM - Data Management & Reuse of Health Data - May 2017

datamed.org

DataMed: bioRxiv 094888; https://doi.org/10.1101/094888 Nature Genetics (in press)

DATS: bioRxiv 103143; https://doi.org/10.1101/103143 Scientific Data (in press)

Page 34: INSERM - Data Management & Reuse of Health Data - May 2017

• Discoverability and reusability
o Complementing community databases
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis

Growing number of data papers and data journals, e.g.:

Page 35: INSERM - Data Management & Reuse of Health Data - May 2017

nature.com/scientificdata

Honorary Academic Editor: Susanna-Assunta Sansone, PhD

Managing Editor: Andrew L Hufton, PhD

Editorial Curator: Varsha Khodiyar

Publisher: Iain Hrynaszkiewicz

A new open-access, online-only publication for descriptions of scientifically valuable datasets

Supported by

Page 36: INSERM - Data Management & Reuse of Health Data - May 2017

New article type

• A peer-reviewed description of data, to maximize usage
• Citable publications that give credit for reusable data
• Requires data deposition to the appropriate repository(ies)
• Complementary to traditional article(s), with which it may or may not be associated

Page 37: INSERM - Data Management & Reuse of Health Data - May 2017

Research papers

Data records

Data Descriptors

Value added component – complementing articles and repositories

Page 38: INSERM - Data Management & Reuse of Health Data - May 2017

Article structure

• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
o following the Joint Declaration of Data Citation Principles

Detailed description of the methods and technical analyses supporting the quality of the measurements; no scientific hypotheses

Page 39: INSERM - Data Management & Reuse of Health Data - May 2017

Focus on data peer review

• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?

Does not focus on perceived impact, importance, size, complexity of data

Page 40: INSERM - Data Management & Reuse of Health Data - May 2017

Credit for data producers, data managers/curators etc.

Credit to: Varsha Khodiyar

Page 41: INSERM - Data Management & Reuse of Health Data - May 2017

“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”

Professor Daniele Marinazzo

Credit to: Varsha Khodiyar

Data (re)use made easier

Page 42: INSERM - Data Management & Reuse of Health Data - May 2017

• Decades old dataset
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium dataset
• Data from a single experiment
• Data that YOU find valuable and that others might find useful too
• Data associated with a high impact analysis article

What makes a good ?

Page 43: INSERM - Data Management & Reuse of Health Data - May 2017

Data Descriptors have two components

Article or narrative component (PDF and HTML)

Experimental metadata or structured component (in-house curated, machine-readable formats)

Page 44: INSERM - Data Management & Reuse of Health Data - May 2017

Curation and discoverability

The Data Curation Editor is responsible for creating and curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records

Page 45: INSERM - Data Management & Reuse of Health Data - May 2017

Created with the input of the authors, includes value-added semantic annotation of the experimental metadata

analysis method script

Data file or record in a database

Data Descriptors: structured component

Page 46: INSERM - Data Management & Reuse of Health Data - May 2017
Page 47: INSERM - Data Management & Reuse of Health Data - May 2017
Page 48: INSERM - Data Management & Reuse of Health Data - May 2017
Page 49: INSERM - Data Management & Reuse of Health Data - May 2017

Complementary roles of ISA and nanopublications

From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. https://doi.org/10.1371/journal.pone.0127612

PLoS ONE (2015)

Page 50: INSERM - Data Management & Reuse of Health Data - May 2017

The (long) road to FAIR

Page 51: INSERM - Data Management & Reuse of Health Data - May 2017

Responsibilities lie across several stakeholder groups

Understand the benefits of sharing FAIR datasets and enact them

Engage and assist researchers to enable them to share FAIR datasets

Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers

Page 52: INSERM - Data Management & Reuse of Health Data - May 2017

“As Data Science culture grows, digital research outputs (such as data, computational analysis and software) are being established as first-class citizens.

This cultural shift is required to go one step further: to recognize interoperability standards as digital objects in their own right, with their associated research, development and educational activities”.

Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016). Interoperability Standards - Digital Objects in Their Own Right. Wellcome Trust. https://dx.doi.org/10.6084/m9.figshare.4055496.v1

Page 53: INSERM - Data Management & Reuse of Health Data - May 2017

Susanna-Assunta Sansone, PhD, Principal Investigator, Associate Director and Data Consultant for Springer Nature

Philippe Rocca-Serra, PhD, Senior Research Lecturer

Alejandra Gonzalez-Beltran, PhD, Research Lecturer

Milo Thurston, DPhil, Research Software Engineer

Massimiliano Izzo, PhD, Research Software Engineer

Peter McQuilton, PhD, Knowledge Engineer

Allyson Lister, PhD, Knowledge Engineer

Eamonn Maguire, DPhil, Contractor

David Johnson, PhD, Research Software Engineer

Melanie Adekale, PhD, Biocurator Contractor

Delphine Dauga, PhD, Biocurator Contractor

We work with and for

to make data and other digital research assets

enabling open science, driving science and discoveries