INSERM - Data Management & Reuse of Health Data - May 2017

Post on 21-Jan-2018


On community-standards, FAIR data and scholarly communication

Susanna-Assunta Sansone, PhD

ORCID: 0000-0001-5306-5690

INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017

Data Consultant, Founding Academic Editor

Associate Director, Principal Investigator

www.slideshare.net/SusannaSansone

Source: https://www.dataone.org/best-practices

Simplified research data life cycle

To do better science, more efficiently, we need data that are:

• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were collected

Key problem: low findability and understandability

• Not always well cited and stored
  o True for data as well as for any other digital asset
• Poorly described for third party reuse
  o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time consuming
  o Often rushed and minimally done

We need content or reporting standards

• To harmonize the datasets with respect to the structure and level of annotation of their:
  § experimental components (e.g., design, conditions, parameters),
  § fundamental biological entities (e.g., samples, genes, cells),
  § complex concepts (such as bioprocesses, tissues, diseases),
  § analytical process and the mathematical models, and
  § their instantiation in computational simulations (from the molecular level through to whole populations of individuals)

Minimum information reporting requirements, checklists

o Report the same core, essential information

o e.g. MIAME guidelines

Controlled vocabularies, taxonomies, thesauri, ontologies etc.

o Unambiguous identification and definition of concepts

o e.g. Gene Ontology

Conceptual model, schema, exchange formats etc

o Define the structure and interrelation of information, and the transmission format

o e.g. FASTA

Formats Terminologies Guidelines

Types of content standards
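
To make the idea of an exchange format concrete, here is a minimal sketch of reading FASTA, the format named as an example above. It assumes the common convention that each record starts with a ">" header line followed by one or more sequence lines. The function name `parse_fasta` and the sample records are hypothetical, written only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from record header to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical input, invented for this sketch.
example = [">seq1 hypothetical record", "ACGT", "TTGA", ">seq2", "GGCC"]
records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

Because the format fixes where the identifier and the sequence live, a few lines of code are enough for any tool to exchange such records.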

de jure (standard organizations) and de facto (grass-roots groups)

Nanotechnology Working Group

Formats Terminologies Guidelines

Community-driven efforts, just a few examples

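Content standards in numbers

[Figure: counts of 224, 115 and 500+ shown for the three types of standards (Formats, Terminologies, Guidelines), each with its source]

Guidelines: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …

Formats: SRAxml, SOFT, FASTA, DICOM, mzML, SBRML, SEDML, GELML, ISA, CML, MITAB, …

Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …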

How to discover the ‘right’ standards for your data?

A web-based, curated and searchable portal that monitors the development and evolution of standards, their use in databases and the adoption of both in data policies, to inform and educate the user community

Data policies by funders, journals and other organizations

Content standards

Formats Terminologies Guidelines

Map this complex and evolving landscape

Databases

All records are manually curated in-house and verified by the community behind each resource

Using indicators to describe ‘status’

• Ready for use, implementation, or recommendation
• In development
• Status uncertain
• Deprecated as subsumed or superseded
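
As a rough sketch of how a registry could encode these four status indicators and filter on them, the snippet below uses a Python enumeration. The labels come from the slide; the record names and the `recommendable` helper are hypothetical and do not describe the portal's actual data model.

```python
from enum import Enum


class Status(Enum):
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


# Hypothetical registry records, invented for illustration only.
records = [
    {"name": "standard-a", "status": Status.READY},
    {"name": "standard-b", "status": Status.IN_DEVELOPMENT},
    {"name": "standard-c", "status": Status.DEPRECATED},
]


def recommendable(items):
    """Return the names of records whose status is READY."""
    return [item["name"] for item in items if item["status"] is Status.READY]


print(recommendable(records))
```

A closed set of values like this keeps the indicator unambiguous for both curators and software that consumes the records.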

Understanding how standards are used

[Figure, built up over several slides, with the labels Guideline, Formats and Terminology]

Using indicators to indicate ‘adoption’

Standard developing groups:

Journals, publishers:

Cross-links, data exchange:

Societies and organisations:

Institutional RDM services:

Projects, programmes:

Technologically-delineated views of the world: transcriptomics, proteomics, metabolomics

Biologically-delineated views of the world: plant biology, epidemiology, microbiology

Generic features (‘common core’):
- description of source biomaterial
- experimental design components

[Figure: technology components shown across the domains: arrays, scanning, columns, gels, MS, FTIR, NMR]

Duplications & lack of interoperability among standards

Hard to use them in combinations, e.g. to represent:

• Proteomics-based gut microbiota profiling
• Proteomics- and metabolomics-based gut microbiota profiling

Enhancing modularization

[Figure: from metadata standards to template elements]

High-level information about the metadata standards: MINSEQE and MIMARKS, typed as biosharing:ReportingGuideline (record identifiers shown: bsg-000174, bsg-000161)

Representations of the standards elements: sample information, sample identifier, taxonomy identifier, sequence read, geo location

Template elements: el-000001, el-000002, el-000003, carrying the provenance labels "MINSEQE", "MINSEQE and MIMARKS" and "MIMARKS"

Serve machine-readable content metadata standards, providing provenance for their elements, rendering standards invisible to the researchers

Inform the creation of metadata templates
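
The idea of template elements that carry provenance back to their source standards can be sketched as below. This is only an illustration under stated assumptions: the assignment of elements to MINSEQE or MIMARKS is invented for the example, and `build_template` is a hypothetical helper, not part of any existing tool.

```python
from collections import defaultdict

# Hypothetical element lists: which element belongs to which guideline
# is an assumption made for this sketch, not taken from the standards.
STANDARD_ELEMENTS = {
    "MINSEQE": ["sample identifier", "sequence read"],
    "MIMARKS": ["sample identifier", "taxonomy identifier", "geo location"],
}


def build_template(standards):
    """Merge elements from several standards, recording where each came from."""
    provenance = defaultdict(list)
    for standard, elements in standards.items():
        for element in elements:
            provenance[element].append(standard)
    # Sort for a stable, readable template.
    return {element: sorted(sources) for element, sources in sorted(provenance.items())}


template = build_template(STANDARD_ELEMENTS)
for element, sources in template.items():
    print(f"{element}: provenance {' and '.join(sources)}")
```

A researcher filling in such a template sees one field per element, while the provenance list records which guideline(s) each field satisfies.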

How to discover the datasets relevant to your work?

OmicsDI: Nature Biotechnology 35, 406–409 (2017) doi:10.1038/nbt.3790

omicsdi.org

datamed.org

DataMed: bioRxiv 094888; https://doi.org/10.1101/094888 Nature Genetics (in press)

DATS: bioRxiv 103143; https://doi.org/10.1101/103143 Scientific Data (in press)

• Discoverability and reusability
  o Complementing community databases
• Incentive, credit for sharing
  o Big and small data
  o Unpublished data
  o Long tail of data
  o Curated aggregation
• Peer review of data
• Value of data vs. analysis

Growing number of data papers and data journals, e.g.:

nature.com/scientificdata

Honorary Academic Editor: Susanna-Assunta Sansone, PhD

Managing Editor: Andrew L Hufton, PhD

Editorial Curator: Varsha Khodiyar

Publisher: Iain Hrynaszkiewicz

A new open-access, online-only publication for descriptions of scientifically valuable datasets

Supported by

• A peer reviewed description of data, to maximize usage
• Citable publications that give credit for reusable data
• Requires data deposition to the appropriate repository(s)
• Complementary to traditional article(s), and may or may not be associated with one

New article type

Research papers, Data records, Data Descriptors

Value added component – complementing articles and repositories

• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
  • following the Joint Declaration of Data Citation Principles

Detailed description of the methods and technical analyses supporting the quality of the measurements; no scientific hypotheses

Article structure
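
A simple way to picture this structure is a checklist that a draft can be compared against. The sketch below takes the section names from the slide; treating every one of them as expected, and the `missing_sections` helper itself, are assumptions made for illustration rather than the journal's actual submission checks.

```python
# Section names as listed on the slide.
EXPECTED_SECTIONS = [
    "Title",
    "Abstract",
    "Background & Summary",
    "Methods",
    "Data Records",
    "Technical Validation",
    "Usage Notes",
    "Figures & Tables",
    "References",
    "Data Citations",
]


def missing_sections(manuscript_sections):
    """Return expected sections absent from a draft (case-insensitive match)."""
    present = {name.strip().lower() for name in manuscript_sections}
    return [name for name in EXPECTED_SECTIONS if name.lower() not in present]


# Hypothetical draft outline, invented for this example.
draft = ["Title", "Abstract", "Methods", "data records", "References"]
print(missing_sections(draft))
```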

Focus on data peer review

• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?

Does not focus on perceived impact, importance, size, complexity of data

Credit for data producers, data managers/curators etc.

Credit to: Varsha Khodiyar

“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”

Professor Daniele Marinazzo

Data (re)use made easier

What makes a good Data Descriptor?

• Decades old dataset
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium dataset
• Data from a single experiment
• Data associated with a high impact analysis article

Data that YOU find valuable and that others might find useful too

Data Descriptors have two components

• Article or narrative component (PDF and HTML)
• Experimental metadata or structured component (in-house curated, machine-readable formats)

The Data Curation Editor is responsible for creating and curating the machine-readable structured component

• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records

Curation and discoverability

Created with the input of the authors, includes value-added semantic annotation of the experimental metadata

analysis method script

Data file or record in a database

Data Descriptors: structured component

Complementary roles of ISA and nanopublications

From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. PLoS ONE (2015). https://doi.org/10.1371/journal.pone.0127612

The (long) road to FAIR

Responsibilities lie across several stakeholder groups

Understand the benefits of sharing FAIR datasets and enact them

Engage and assist researchers to enable them to share FAIR datasets

Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers

“As Data Science culture grows, digital research outputs (such as data, computational analysis and software) are being established as first-class citizens.

This cultural shift is required to go one step further: to recognize interoperability standards as digital objects in their own right, with their associated research, development and educational activities”.

Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016). Interoperability Standards - Digital Objects in Their Own Right. Wellcome Trust. https://dx.doi.org/10.6084/m9.figshare.4055496.v1

Philippe Rocca-Serra, PhD, Senior Research Lecturer

Alejandra Gonzalez-Beltran, PhD, Research Lecturer

Milo Thurston, DPhil, Research Software Engineer

Massimiliano Izzo, PhD, Research Software Engineer

Peter McQuilton, PhD, Knowledge Engineer

Allyson Lister, PhD, Knowledge Engineer

Eamonn Maguire, DPhil, Contractor

David Johnson, PhD, Research Software Engineer

Melanie Adekale, PhD, Biocurator Contractor

Delphine Dauga, PhD, Biocurator Contractor

We work with and for

to make data and other digital research assets

Susanna-Assunta Sansone, PhD, Principal Investigator, Associate Director and Data Consultant for Springer Nature

enabling open science, driving science and discoveries
