On community-standards, data curation and scholarly communication Susanna-Assunta Sansone, PhD @SusannaASansone 13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016. Data Consultant, Founding Academic Editor Associate Director, Principal Investigator Member, Executive Committee
On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Jan 21, 2018

Transcript
Page 1: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

On community-standards, data curation and scholarly communication

Susanna-Assunta Sansone, PhD

@SusannaASansone

13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016.

Data Consultant, Founding Academic Editor

Associate Director, Principal Investigator

Member, Executive Committee

Page 2: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Better data better science – the FAIR meme

•  Publication of digital research outputs – why it matters

•  Interoperability standards – as enablers

Outline

Page 8: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Research as a Connected Digital Enterprise aka The Commons

•  Researcher X is automatically made aware of researcher Y through commonalities in their respective data located in the Commons.

•  Researcher X locates researcher Y’s data sets with their associated usage statistics, navigates to the associated publications and starts to explore various ideas to engage with researcher Y and their research network.

•  A fruitful collaboration ensues and they generate publications, data sets and software; their output is captured in PubMed and the Commons, and is indexed by the data and software catalogs.

•  Company Z identifies relevant data and software that, based on the metrics from the catalogs, have utilization above a threshold indicating that those data and software are heavily utilized by the community. An open source version remains, but the company adds services on top of the software, and revenue flows back to the labs of researchers X and Y, where it is used to develop new innovative software for open distribution.

•  Researchers X and Y provide hands-on advice in the use of their new version and their course is offered as a MOOC (Massive Open Online Course).

The vision - P. Bourne (NIH Associate Director for Data Science)

Page 9: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Research as a Connected Digital Enterprise aka The Commons

The vision - P. Bourne (NIH Associate Director for Data Science)

https://datascience.nih.gov/commons

Page 10: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

A Data Discovery Index prototype that:

•  Helps users find and access shared data

•  Interoperates in the NIH Commons

Page 11: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

[Figure: the Data Discovery Index, linked to aggregators A, B and C and to data]

Dashed lines: mapping of metadata standards, links to aggregators, data

Data: digital research objects

Pilot projects

Core development team

Designed as an element of the ecosystem

Page 12: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

medicine

agriculture

bioindustries

environment

ELIXIR connects national bioinformatics centres and EMBL-EBI into a sustainable European infrastructure for biological research data

Building a pan-European infrastructure

Page 13: On community-standards, data curation and scholarly communication - BITS, Italy, 2016
Page 14: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

to do better science more efficiently

Page 15: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ 2014

Page 16: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

“Over 50% of completed studies in biomedicine do not appear in the published literature….Often because results do not conform to author's hypotheses”

“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”

Selective reporting is still an unfortunate practice

•  Small independent efforts, yielding a rich variety of specialty data sets

o  Most of these data (such as null findings) are unpublished

o  These dark data hold a potential wealth of knowledge

Page 17: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Researchers still have no or insufficient motivation

•  Hypothesis-confirming results get prioritized

•  Agreements, disagreements and timing

•  Loose requirements and monitoring by journals and funders

But why?

Page 18: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Most researchers are sharing data, and using the data of others

•  Direct contact* between researchers (on request) is a common way of sharing data

•  Repositories are the second most common method of sharing

Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619.

Current approaches to sharing

* Data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014). Datasets not referenced in a manuscript are essentially invisible, and data producers do not get appropriate credit for their work.
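
To get a feel for what a ~17% yearly loss means over time, here is a small worked calculation. It is only a sketch: it reads the slide's figure as a constant proportional loss of the datasets still available each year, which is a simplifying assumption and not necessarily how the cited study models it. The function name is made up for this example.

```python
def fraction_remaining(annual_loss, years):
    """Fraction of datasets still available after `years`, assuming the same
    proportion of the remaining datasets is lost every year (an assumption
    made for illustration, not a result taken from the cited study)."""
    if not 0.0 <= annual_loss <= 1.0:
        raise ValueError("annual_loss must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_loss) ** years

# The slide's ~17% per year, compounded over 5 and 10 years.
after_5 = fraction_remaining(0.17, 5)
after_10 = fraction_remaining(0.17, 10)
```

Under that assumption, roughly 39% of datasets would still be available after five years and roughly 16% after ten.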

Page 19: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Outputs are multi-dimensional, not always well cited or stored

o  Software, code and workflows are hard(er) to get hold of

•  Poorly described for third party reuse

o  Different levels of detail and annotation

•  Curation activities are perceived as time consuming

o  Collection and harmonization of detailed methods and experimental steps are done, often rushed, at the publication stage

Shared data is not always understandable or reusable

Page 20: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

    A B C D E
 1              Group1  Group2
 2  Day 0
 3  Sodium      139     142
 4  Potassium   3.3     4.8
 5  Chloride    100     108
 6  BUN         18      18
 7  Creatine    1.2     1.2
 8  Uric acid   5.5*    6.2*
 9  Day 7
10  Sodium      140     146
11  Potassium   3.4     5.1
12  Chloride    97      108

S1Sh.cuo

Sharing starts with good metadata…

Credit to: Iain Hrynaszkiewicz

Page 21: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

    A B C D E
 1              Group1  Group2
 2  Day 0
 3  Sodium      139     142
 4  Potassium   3.3     4.8
 5  Chloride    100     108
 6  BUN         18      18
 7  Creatine    1.2     1.2
 8  Uric acid   5.5*    6.2*
 9  Day 7
10  Sodium      140     146
11  Potassium   3.4     5.1
12  Chloride    97      108

S1Sh.cuo

•  Meaningless column titles

•  Special characters can cause text mining errors

•  No units

•  Unhelpful document name

•  Undefined abbreviation

•  Formatting for information that should be in metadata

…but this is not!

Credit to: Iain Hrynaszkiewicz

Page 22: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

     A          B    C        D        E      F
 1   Parameter  Day  Control  Treated  Units  P
 2   Sodium     0    139      142      mEq/l  0.82
 3   Sodium     7    140      146      mEq/l  0.70
 4   Sodium     14   140      158      mEq/l  0.03
 5   Sodium     21   143      160      mEq/l  0.02
 6   Potassium  0    3.3      4.8      mEq/l  0.06
 7   Potassium  7    3.4      5.1      mEq/l  0.07
 8   Potassium  14   3.7      4.7      mEq/l  0.10
 9   Potassium  21   3.1      3.6      mEq/l  0.52
10   Chloride   0    100      108      mEq/l  0.56
11   Chloride   7    97       108      mEq/l  0.68
12   Chloride   14   101      106      mEq/l  0.79

Table_S1_Shanghai_blood.xls

…this is much clearer!

Credit to: Iain Hrynaszkiewicz
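
One practical payoff of the clearer layout is that a script can read it without guessing. The sketch below is illustrative only: it copies three of the rows shown on the slide into a CSV string and reads them with Python's standard csv module. The variable and function names are invented for this example, and the slide itself does not prescribe any particular tool.

```python
import csv
import io

# Three rows of the tidy table from the slide, written out as CSV text.
TIDY_CSV = """Parameter,Day,Control,Treated,Units,P
Sodium,0,139,142,mEq/l,0.82
Sodium,7,140,146,mEq/l,0.70
Potassium,0,3.3,4.8,mEq/l,0.06
"""

def load_measurements(text):
    """Parse the tidy table into a list of typed records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "parameter": row["Parameter"],
            "day": int(row["Day"]),
            "control": float(row["Control"]),
            "treated": float(row["Treated"]),
            "units": row["Units"],
            "p": float(row["P"]),
        })
    return records

def rows_missing_units(records):
    """Return records with an empty unit, which a reuser could not interpret."""
    return [r for r in records if not r["units"].strip()]

records = load_measurements(TIDY_CSV)
# Because day and parameter are explicit columns, selection is a simple filter.
sodium_day7 = [r for r in records if r["parameter"] == "Sodium" and r["day"] == 7]
```

The same filter on the earlier layout would need custom logic to work out which "Day" block each row belongs to and what the unlabeled numbers measure.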

Page 23: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Without context data is meaningless

Page 24: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

…breadth and depth of the context is pivotal…

…including capturing experimental design and statistical analysis

Page 25: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Among these, publishers occupy a leverage point, because of the importance of formal publications in the academic incentive structure

Stakeholder mobilization: old and new driving forces

Page 26: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Incentive, credit for sharing

o  Big and small data

o  Unpublished data

o  Long tail of data

o  Curated aggregation

•  Peer review of data

•  Value of data vs. analysis

•  Discoverability and reusability

o  Complementing community databases

Growing number of data papers and data journals

Page 27: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

nature.com/scientificdata

Honorary Academic Editor: Susanna-Assunta Sansone, PhD

Managing Editor: Andrew L Hufton, PhD

Editorial Curator: Varsha Khodiyar

Publisher: Iain Hrynaszkiewicz

A new open-access, online-only publication for descriptions of scientifically valuable datasets

Supported by

Page 28: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

A new article type

A new category of publication that provides detailed descriptors of scientifically valuable datasets

Mandates open data, without unnecessary restrictions, as a condition of submission

Page 29: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Research papers

Data records

Data Descriptors

Value added component – complementing articles and repositories

Page 30: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Scientific hypotheses:

•  Synthesis

•  Analysis

•  Conclusions

Methods and technical analyses supporting the quality of the measurements:

•  What did I do to generate the data?

•  How was the data processed?

•  Where is the data?

•  Who did what when?

Relation with traditional articles – content

Page 31: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Citation of and links to data files and databases

Page 32: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Experimental metadata or structured component

(in-house curated, machine-readable formats)

Article or narrative component

(PDF and HTML)

Data Descriptors have two components

Page 33: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

The Data Curation Editor is responsible for creating and curating the machine-readable structured component

•  Enables browsing and searching the articles

•  Facilitates links to related journal articles and repository records

Curation and discoverability

Page 34: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Created with the input of the authors, includes value-added semantic annotation of the experimental metadata

analysis method script

Data file or record in a database

Data Descriptors: structured component

Page 35: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Browse, search, view Data Descriptors

Page 36: On community-standards, data curation and scholarly communication - BITS, Italy, 2016
Page 37: On community-standards, data curation and scholarly communication - BITS, Italy, 2016
Page 38: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Why data papers? Credit for data producers!

Credit to: Varsha Khodiyar

Page 39: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”

Professor Daniele Marinazzo

Why data papers? Data reuse is easier!

Credit to: Varsha Khodiyar

Page 40: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Decades-old dataset

Aggregated or curated data resources

Computationally produced data products

Large consortium dataset

Data from a single experiment

Data associated with a high-impact analysis article

What makes a good Data Descriptor?

Credit to: Andrew Hufton

Page 41: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

•  Better data better science – the FAIR meme

•  Publication of digital research outputs – why it matters

•  Interoperability standards – as enablers

Outline

Page 42: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

de jure de facto

grass-roots groups

standard organizations

Nanotechnology Working Group

•  To structure, enrich and report the description of the datasets and the experimental context under which they were produced

•  To facilitate discovery, sharing, understanding and reuse of datasets

Community-developed content standards

Page 43: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

de jure de facto

grass-roots groups

standard organizations

Nanotechnology Working Group

Content standards as enabler for better described data

Including minimum information reporting requirements, or checklists to report the same core, essential information

Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’

Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another
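
The three kinds of content standard listed above can be pictured with a toy example: a checklist says which fields must be reported, a controlled vocabulary says which terms may be used, and an exchange format lets the record move between systems. The sketch below is a loose illustration under stated assumptions. The checklist fields and vocabulary terms are hypothetical and not taken from any real standard, and JSON merely stands in for an exchange format.

```python
import json

# Hypothetical minimum-information checklist: fields every record must report.
REQUIRED_FIELDS = ("organism", "sample_type", "assay", "units")

# Hypothetical controlled vocabulary: one agreed term per 'thing'.
SAMPLE_TYPE_TERMS = {"blood", "urine", "tissue"}

def check_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    sample_type = record.get("sample_type")
    if sample_type and sample_type not in SAMPLE_TYPE_TERMS:
        problems.append(f"sample_type not in vocabulary: {sample_type}")
    return problems

def to_exchange_format(record):
    """Serialize a record that passes the checks so another system can read it."""
    problems = check_record(record)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(record, sort_keys=True)

good = {"organism": "Homo sapiens", "sample_type": "blood",
        "assay": "sodium concentration", "units": "mEq/l"}
bad = {"organism": "Homo sapiens", "sample_type": "plasma-ish"}
```

Here `check_record(good)` returns an empty list, while `check_record(bad)` reports the missing fields and the term that is not in the vocabulary.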

Page 44: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

203 105

345

MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …

SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB, …

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …

Complex and evolving landscape

data policies databases

data/metadata standards

Page 45: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Is there a database, implementing standards, where I can deposit my metagenomics dataset?

My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?

Am I using the most up-to-date version of this terminology to annotate cell-based assays?

I understand this format has been deprecated; what has it been replaced by and who is leading the work?

Are there databases implementing this exchange format, whose development we have funded?

What are the mature standards and standards-compliant databases we should recommend to our authors?

But how do we help users to make informed decisions?

Page 46: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring development and evolution of standards, their use in databases and adoption of both in data policies

An informative and educational resource

1,400 records and growing

Page 47: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

An informative and educational resource

Page 48: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Tracking evolution, e.g. deprecations and substitutions

Page 49: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Model/format formalizing reporting guideline -->

<-- Reporting guideline used by model/format

Cross-linking standards to standards and databases

Page 50: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Standards and databases recommended by publishers in their data policies
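
The cross-links described on the last few slides (formats formalizing guidelines, databases implementing standards, policies recommending both, deprecations and substitutions) amount to a small labeled graph. The sketch below shows one way such links could be stored and queried. It is an assumption-laden illustration, not the registry's actual data model or API: every record name and relation label here is a placeholder invented for the example.

```python
from collections import defaultdict

# Hypothetical registry links; the names are placeholders, not real entries.
LINKS = [
    ("FormatA", "formalizes", "GuidelineA"),
    ("DatabaseX", "implements", "FormatA"),
    ("DatabaseY", "implements", "FormatA"),
    ("PolicyP", "recommends", "DatabaseX"),
    ("PolicyP", "recommends", "GuidelineA"),
    ("FormatOld", "deprecated_by", "FormatA"),
]

def build_index(links):
    """Index links in both directions so either end can be the query."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for source, relation, target in links:
        outgoing[(source, relation)].add(target)
        incoming[(target, relation)].add(source)
    return outgoing, incoming

outgoing, incoming = build_index(LINKS)

def databases_implementing(standard):
    """Which databases implement this standard?"""
    return sorted(incoming.get((standard, "implements"), set()))

def replacement_for(standard):
    """What has this deprecated standard been replaced by?"""
    return sorted(outgoing.get((standard, "deprecated_by"), set()))

def recommended_by(policy):
    """Which standards and databases does this policy recommend?"""
    return sorted(outgoing.get((policy, "recommends"), set()))
```

With links stored this way, several of the user questions raised earlier (where to deposit, what replaced a deprecated format, what a policy recommends) become simple lookups.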

Page 51: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Interactive graph to inform and educate, e.g. database

standard

policy

Page 52: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Interactive graph to inform and educate, e.g. database

standard

policy

Page 53: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Interactive graph to inform and educate, e.g. database

standard

policy

Page 54: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Linking standards and databases to training material

Page 55: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Advised by the ELIXIR Training Coordinators Group, including:

A collaboration between:

Page 56: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Data, Software, Standards, Databases, Workflow, Publications, Training material

Page 57: On community-standards, data curation and scholarly communication - BITS, Italy, 2016

Philippe Rocca-Serra, PhD Senior Research Lecturer

Alejandra Gonzalez-Beltran, PhD Research Lecturer

Milo Thurston, DPhil Research Software Engineer

Massimiliano Izzo, PhD Research Software Engineer

Peter McQuilton, PhD Knowledge Engineer

Allyson Lister, PhD Knowledge Engineer

Eamonn Maguire, DPhil Software Engineer contractor

David Johnson, PhD Research Software Engineer

Susanna-Assunta Sansone, PhD Principal Investigator, Associate Director

We also acknowledge our network of collaborators in the following active projects: H2020 PhenoMeNal, H2020 ELIXIR-EXCELERATE, H2020 MultiMot, NIH bioCADDIE, NIH CEDAR and IMI eTRIKS