On community-standards, data curation and scholarly communication
Susanna-Assunta Sansone, PhD
@SusannaASansone
13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016.
Data Consultant; Founding Academic Editor; Associate Director; Principal Investigator; Member, Executive Committee
• Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
Research as a Connected Digital Enterprise aka The Commons
• Researcher X is automatically made aware of researcher Y through commonalities in their respective data located in the Commons.
• Researcher X locates researcher Y’s data sets with their associated usage statistics, navigates to the associated publications and starts to explore various ideas to engage with researcher Y and their research network.
• A fruitful collaboration ensues and they generate publications, data sets and software; their output is captured in PubMed and the Commons, and is indexed by the data and software catalogs.
• Company Z identifies relevant data and software that, based on the metrics from the catalogs, have utilization above a threshold, indicating that those data and software are heavily utilized by the community. An open-source version remains, but the company adds services on top of the software, and revenue flows back to the labs of researchers X and Y, where it is used to develop innovative new software for open distribution.
• Researchers X and Y provide hands-on advice on the use of their new version, and their course is offered as a MOOC (Massive Open Online Course).
The vision - P. Bourne (NIH Associate Director for Data Science)
https://datascience.nih.gov/commons
A Data Discovery Index prototype that:
• Helps users find and access shared data
• Interoperates in the NIH Commons
[Diagram: aggregators A, B and C feed the Data Discovery Index. Dashed lines: mapping of metadata standards, links to aggregators and to data. Data: digital research objects.]
Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ 2014
“Over 50% of completed studies in biomedicine do not appear in the published literature….Often because results do not conform to author's hypotheses”
“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”
Selective reporting is still an unfortunate practice
• Small independent efforts, yielding a rich variety of specialty data sets
  o Most of these data (such as null findings) are unpublished
  o These dark data hold a potential wealth of knowledge
• Researchers still lack sufficient motivation
• Hypothesis-confirming results get prioritized
• Agreements, disagreements and timing
• Loose requirements and monitoring by journals and funders
But why?
• Most researchers are sharing data, and using the data of others
• Direct contact* between researchers (on request) is a common way of sharing data
• Repositories are second most common method of sharing
Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619.
Current approaches to sharing
* Data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014). Datasets not referenced in a manuscript are essentially invisible, and data producers do not get appropriate credit for their work.
• Outputs are multi-dimensional, and not always well cited or stored
  o Software, code and workflows are hard(er) to get hold of
• Poorly described for third-party reuse
  o Different levels of detail and annotation
• Curation activities are perceived as time-consuming
  o Collection and harmonization of detailed methods and experimental steps is done/rushed at publication stage
Shared data is not always understandable or reusable
     A          B        C
 1              Group1   Group2
 2   Day 0
 3   Sodium     139      142
 4   Potassium  3.3      4.8
 5   Chloride   100      108
 6   BUN        18       18
 7   Creatine   1.2      1.2
 8   Uric acid  5.5*     6.2*
 9   Day 7
10   Sodium     140      146
11   Potassium  3.4      5.1
12   Chloride   97       108

S1Sh.cuo
Sharing starts with good metadata…
Credit to: Iain Hrynaszkiewicz
Problems with the table above:
• Unhelpful document name (S1Sh.cuo)
• Meaningless column titles
• No units
• Undefined abbreviation
• Special characters can cause text-mining errors
• Formatting used for information that should be in metadata
…but this is not!
Credit to: Iain Hrynaszkiewicz
     A          B    C        D        E      F
 1   Parameter  Day  Control  Treated  Units  P
 2   Sodium     0    139      142      mEq/l  0.82
 3   Sodium     7    140      146      mEq/l  0.70
 4   Sodium     14   140      158      mEq/l  0.03
 5   Sodium     21   143      160      mEq/l  0.02
 6   Potassium  0    3.3      4.8      mEq/l  0.06
 7   Potassium  7    3.4      5.1      mEq/l  0.07
 8   Potassium  14   3.7      4.7      mEq/l  0.10
 9   Potassium  21   3.1      3.6      mEq/l  0.52
10   Chloride   0    100      108      mEq/l  0.56
11   Chloride   7    97       108      mEq/l  0.68
12   Chloride   14   101      106      mEq/l  0.79

Table_S1_Shanghai_blood.xls

…this is much clearer!
Credit to: Iain Hrynaszkiewicz
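Because every row in the clearer table is self-describing (parameter, day, units and p-value all live in named columns), it can be consumed programmatically without human interpretation. A minimal sketch using only the Python standard library, with the values transcribed from the example table:

```python
import csv
import io

# The tidy example table from the slide, transcribed as CSV.
TIDY = """Parameter,Day,Control,Treated,Units,P
Sodium,0,139,142,mEq/l,0.82
Sodium,7,140,146,mEq/l,0.70
Sodium,14,140,158,mEq/l,0.03
Sodium,21,143,160,mEq/l,0.02
Potassium,0,3.3,4.8,mEq/l,0.06
Potassium,7,3.4,5.1,mEq/l,0.07
Potassium,14,3.7,4.7,mEq/l,0.10
Potassium,21,3.1,3.6,mEq/l,0.52
Chloride,0,100,108,mEq/l,0.56
Chloride,7,97,108,mEq/l,0.68
Chloride,14,101,106,mEq/l,0.79
"""

rows = list(csv.DictReader(io.StringIO(TIDY)))

# One row per observation makes filtering trivial:
significant = [r for r in rows if float(r["P"]) < 0.05]
sodium_treated = [float(r["Treated"]) for r in rows if r["Parameter"] == "Sodium"]

print(significant)      # the two sodium measurements with p < 0.05
print(sodium_treated)   # [142.0, 146.0, 158.0, 160.0]
```

The same queries cannot be written reliably against the first layout, where meaning was encoded in position and formatting rather than in named columns.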
Without context, data is meaningless
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
…breadth and depth of the context is pivotal…
…including capturing experimental design and statistical analysis
Among these, publishers occupy a leverage point, because of the importance of formal publications in the academic incentive structure
Stakeholder mobilizations: old and new driving forces
• Incentive, credit for sharing
  o Big and small data
  o Unpublished data
  o Long tail of data
  o Curated aggregation
• Peer review of data
• Value of data vs. analysis
• Discoverability and reusability
  o Complementing community databases
Growing number of data papers and data journals
nature.com/scientificdata
Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what, when?
Relation with traditional articles – content
Citation of and links to data files and databases
Experimental metadata or structured component (in-house curated, machine-readable formats)
Article or narrative component (PDF and HTML)
Data Descriptors have two components
The Data Curation Editor is responsible for creating and curating the machine-readable structured component, which:
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records
Curation and discoverability
Created with the input of the authors, it includes value-added semantic annotation of the experimental metadata, and links to the analysis method script and to the data file or record in a database.
Data Descriptors: structured component
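As an illustration of what such a machine-readable structured component can look like, here is a small study-sample table in the style of ISA-Tab (one of the formats listed later in this deck). It is a sketch only: the column set is abbreviated, and all source, protocol and sample names are hypothetical.

```text
# Hypothetical ISA-Tab-style study sample table (tab-delimited)
Source Name   Characteristics[organism]   Term Source REF   Protocol REF       Sample Name
patient-01    Homo sapiens                NCBITaxon         blood collection   sample-01
patient-02    Homo sapiens                NCBITaxon         blood collection   sample-02
```

Each column has a defined meaning, and ontology-backed values (here, the organism annotated against a taxonomy) are what make the metadata searchable and linkable by machines.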
Browse, search, view Data Descriptors
Why data papers? Credit for data producers!
Credit to: Varsha Khodiyar
“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”
Professor Daniele Marinazzo
Why data papers? Data reuse is easier!
Credit to: Varsha Khodiyar
• Decades-old datasets
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium datasets
• Data from a single experiment
• Data associated with a high-impact analysis article
What makes a good Data Descriptor?
Credit to: Andrew Hufton
• Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
De jure and de facto standards, developed by grass-roots groups and standards organizations (e.g. the Nanotechnology Working Group)
• To structure, enrich and report the description of the datasets and the experimental context under which they were produced
• To facilitate discovery, sharing, understanding and reuse of datasets
Community-developed content standards
Content standards as enabler for better described data
Including minimum information reporting requirements, or checklists to report the same core, essential information
Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
Including conceptual models and schemas from which an exchange format is derived, allowing data to flow from one system to another
MIAME, MIRIAM, MIQAS, MIX, MIGen, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT, SRA XML, SOFT, FASTA, DICOM, mzML, SBRML, SED-ML…, GelML, ISA-Tab, CML, MITAB, AAO, ChEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO
Complex and evolving landscape
data policies, databases, data/metadata standards
• Is there a database, implementing standards, where I can deposit my metagenomics dataset?
• My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
• Am I using the most up-to-date version of this terminology to annotate cell-based assays?
• I understand this format has been deprecated; what has it been replaced by, and who is leading the work?
• Are there databases implementing this exchange format, whose development we have funded?
• What are the mature standards and standards-compliant databases we should recommend to our authors?
But how do we help users to make informed decisions?
A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring the development and evolution of standards, their use in databases, and the adoption of both in data policies
An informative and educational resource
1,400 records and growing
Tracking evolution, e.g. deprecations and substitutions
• Model/format formalizing a reporting guideline →
• ← Reporting guideline used by a model/format
Cross-linking standards to standards and databases
Standards and databases recommended by publishers in their data policies
Interactive graph to inform and educate, e.g. linking database, standard and policy records
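The cross-linked registry described above is, in effect, a small typed graph of standards, databases and policies. A minimal Python sketch of that idea follows; the records and links are illustrative examples (ArrayExpress and GEO do implement MIAME), not the registry's real contents or API:

```python
# Illustrative registry records and links, not the real registry data or API.
records = {
    "MIAME": {"type": "standard", "status": "active"},
    "ArrayExpress": {"type": "database", "status": "active"},
    "GEO": {"type": "database", "status": "active"},
    "Journal policy A": {"type": "policy", "status": "active"},
}

# Directed, typed edges: (subject, relation, object).
links = [
    ("ArrayExpress", "implements", "MIAME"),
    ("GEO", "implements", "MIAME"),
    ("Journal policy A", "recommends", "ArrayExpress"),
]

def related(name, relation):
    """All records linked to `name` by `relation` (sorted subjects)."""
    return sorted(s for s, r, o in links if r == relation and o == name)

# e.g. which databases implement MIAME, and which policies recommend a database?
print(related("MIAME", "implements"))          # ['ArrayExpress', 'GEO']
print(related("ArrayExpress", "recommends"))   # ['Journal policy A']
```

Typed links are what let the registry answer the stakeholder questions above (which databases implement a standard, which policies recommend a database) as simple graph traversals.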
Linking standards and databases to training material
Advised by the ELIXIR Training Coordinators Group, including:
A collaboration between:
Data, Software, Standards, Databases, Workflows, Publications, Training material
Philippe Rocca-Serra, PhD Senior Research Lecturer
Susanna-Assunta Sansone, PhD Principal Investigator, Associate Director
We also acknowledge our network of collaborators in the following active projects: H2020 PhenoMeNal, H2020 ELIXIR-EXCELERATE, H2020 MultiMot, NIH bioCADDIE, NIH CEDAR and IMI eTRIKS