On community-standards, data curation and scholarly communication
Susanna-Assunta Sansone, PhD
@SusannaASansone
13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016.
Data Consultant; Founding Academic Editor; Associate Director; Principal Investigator; Member, Executive Committee
• Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
Research as a Connected Digital Enterprise aka The Commons
• Researcher X is automatically made aware of researcher Y through commonalities in their respective data located in the Commons.
• Researcher X locates researcher Y’s data sets with their associated usage statistics, navigates to the associated publications and starts to explore various ideas to engage with researcher Y and their research network.
• A fruitful collaboration ensues and they generate publications, data sets and software; their output is captured in PubMed and the Commons, and is indexed by the data and software catalogs.
• Company Z identifies relevant data and software that, based on the metrics from the catalogs, have utilization above a threshold, indicating that those data and software are heavily utilized by the community. An open-source version remains, but the company adds services on top of the software, and revenue flows back to the labs of researchers X and Y, where it is used to develop innovative new software for open distribution.
• Researchers X and Y provide hands-on advice on the use of their new version, and their course is offered as a MOOC (Massive Open Online Course).
The vision - P. Bourne (NIH Associate Director for Data Science)
https://datascience.nih.gov/commons
A Data Discovery Index prototype that:
• Helps users find and access shared data
• Interoperates in the NIH Commons
[Diagram: aggregators A, B and C feed the Data Discovery Index. Dashed lines: mapping of metadata standards, links to aggregators and to data. Data: digital research objects.]
Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/ 2014
“Over 50% of completed studies in biomedicine do not appear in the published literature….Often because results do not conform to author's hypotheses”
“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”
Selective reporting is still an unfortunate practice
• Small independent efforts, yielding a rich variety of specialty data sets
  o Most of these data (such as null findings) are unpublished
  o These dark data hold a potential wealth of knowledge
• Researchers still lack sufficient motivation
• Hypothesis-confirming results get prioritized
• Agreements, disagreements and timing
• Loose requirements and monitoring by journals and funders
But why?
• Most researchers are sharing data, and using the data of others
• Direct contact* between researchers (on request) is a common way of sharing data
• Repositories are second most common method of sharing
Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619.
Current approaches to sharing
* Data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014). Datasets not referenced in a manuscript are essentially invisible, and data producers do not get appropriate credit for their work.
• Outputs are multi-dimensional, and not always well cited or stored
  o Software, code and workflows are hard(er) to get hold of
• Poorly described for third-party reuse
  o Different levels of detail and annotation
• Curation activities are perceived as time-consuming
  o Collection and harmonization of detailed methods and experimental steps is done/rushed at publication stage
Shared data is not always understandable or reusable
     A          B        C
 1              Group1   Group2
 2   Day 0
 3   Sodium     139      142
 4   Potassium  3.3      4.8
 5   Chloride   100      108
 6   BUN        18       18
 7   Creatine   1.2      1.2
 8   Uric acid  5.5*     6.2*
 9   Day 7
10   Sodium     140      146
11   Potassium  3.4      5.1
12   Chloride   97       108

S1Sh.cuo
Sharing starts with good metadata…
Credit to: Iain Hrynaszkiewicz
Problems with the table above:
• Unhelpful document name (S1Sh.cuo)
• Meaningless column titles
• No units
• Undefined abbreviation
• Special characters can cause text-mining errors
• Formatting used for information that should be in metadata
…but this is not!
Credit to: Iain Hrynaszkiewicz
     A          B    C        D        E      F
 1   Parameter  Day  Control  Treated  Units  P
 2   Sodium     0    139      142      mEq/l  0.82
 3   Sodium     7    140      146      mEq/l  0.70
 4   Sodium     14   140      158      mEq/l  0.03
 5   Sodium     21   143      160      mEq/l  0.02
 6   Potassium  0    3.3      4.8      mEq/l  0.06
 7   Potassium  7    3.4      5.1      mEq/l  0.07
 8   Potassium  14   3.7      4.7      mEq/l  0.10
 9   Potassium  21   3.1      3.6      mEq/l  0.52
10   Chloride   0    100      108      mEq/l  0.56
11   Chloride   7    97       108      mEq/l  0.68
12   Chloride   14   101      106      mEq/l  0.79

Table_S1_Shanghai_blood.xls

…this is much clearer!
Credit to: Iain Hrynaszkiewicz
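Because every row in the clearer table is self-describing (parameter, day, units and p-value all live in named columns), it can be consumed programmatically without human interpretation. A minimal sketch using only the Python standard library, with the values transcribed from the example table:

```python
import csv
import io

# The tidy example table from the slide, transcribed as CSV.
TIDY = """Parameter,Day,Control,Treated,Units,P
Sodium,0,139,142,mEq/l,0.82
Sodium,7,140,146,mEq/l,0.70
Sodium,14,140,158,mEq/l,0.03
Sodium,21,143,160,mEq/l,0.02
Potassium,0,3.3,4.8,mEq/l,0.06
Potassium,7,3.4,5.1,mEq/l,0.07
Potassium,14,3.7,4.7,mEq/l,0.10
Potassium,21,3.1,3.6,mEq/l,0.52
Chloride,0,100,108,mEq/l,0.56
Chloride,7,97,108,mEq/l,0.68
Chloride,14,101,106,mEq/l,0.79
"""

rows = list(csv.DictReader(io.StringIO(TIDY)))

# One row per observation makes filtering trivial:
significant = [r for r in rows if float(r["P"]) < 0.05]
sodium_treated = [float(r["Treated"]) for r in rows if r["Parameter"] == "Sodium"]

print(significant)      # the two sodium measurements with p < 0.05
print(sodium_treated)   # [142.0, 146.0, 158.0, 160.0]
```

The same queries cannot be written reliably against the first layout, where meaning was encoded in position and formatting rather than in named columns.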
Without context, data is meaningless
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
…breadth and depth of the context is pivotal…
…including capturing experimental design and statistical analysis
Among these, publishers occupy a leverage point, because of the importance of formal publications in the academic incentive structure
Stakeholder mobilizations: old and new driving forces
• Incentive, credit for sharing
  o Big and small data
  o Unpublished data
  o Long tail of data
  o Curated aggregation
• Peer review of data
• Value of data vs. analysis
• Discoverability and reusability
  o Complementing community databases
Growing number of data papers and data journals
nature.com/scientificdata
Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what, when?
Relation with traditional articles – content
Citation of and links to data files and databases
Experimental metadata or structured component (in-house curated, machine-readable formats)
Article or narrative component (PDF and HTML)
Data Descriptors have two components
The Data Curation Editor is responsible for creating and curating the machine-readable structured component, which:
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records
Curation and discoverability
Created with the input of the authors, it includes value-added semantic annotation of the experimental metadata, and links to the analysis method script and to the data file or record in a database.
Data Descriptors: structured component
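As an illustration of what such a machine-readable structured component can look like, here is a small study-sample table in the style of ISA-Tab (one of the formats listed later in this deck). It is a sketch only: the column set is abbreviated, and all source, protocol and sample names are hypothetical.

```text
# Hypothetical ISA-Tab-style study sample table (tab-delimited)
Source Name   Characteristics[organism]   Term Source REF   Protocol REF       Sample Name
patient-01    Homo sapiens                NCBITaxon         blood collection   sample-01
patient-02    Homo sapiens                NCBITaxon         blood collection   sample-02
```

Each column has a defined meaning, and ontology-backed values (here, the organism annotated against a taxonomy) are what make the metadata searchable and linkable by machines.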
Browse, search, view Data Descriptors
Why data papers? Credit for data producers!
Credit to: Varsha Khodiyar
“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”
Professor Daniele Marinazzo
Why data papers? Data reuse is easier!
Credit to: Varsha Khodiyar
• Decades-old datasets
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium datasets
• Data from a single experiment
• Data associated with a high-impact analysis article
What makes a good Data Descriptor?
Credit to: Andrew Hufton
• Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
De jure and de facto standards, developed by grass-roots groups and standards organizations (e.g. the Nanotechnology Working Group)
• To structure, enrich and report the description of the datasets and the experimental context under which they were produced
• To facilitate discovery, sharing, understanding and reuse of datasets
Community-developed content standards
Content standards as enabler for better described data
Including minimum information reporting requirements, or checklists to report the same core, essential information
Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
Including conceptual models and schemas from which an exchange format is derived, allowing data to flow from one system to another
MIAME, MIRIAM, MIQAS, MIX, MIGen, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT, SRA XML, SOFT, FASTA, DICOM, mzML, SBRML, SED-ML…, GelML, ISA-Tab, CML, MITAB, AAO, ChEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO
Complex and evolving landscape
data policies, databases, data/metadata standards
• Is there a database, implementing standards, where I can deposit my metagenomics dataset?
• My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
• Am I using the most up-to-date version of this terminology to annotate cell-based assays?
• I understand this format has been deprecated; what has it been replaced by, and who is leading the work?
• Are there databases implementing this exchange format, whose development we have funded?
• What are the mature standards and standards-compliant databases we should recommend to our authors?
But how do we help users to make informed decisions?
A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring the development and evolution of standards, their use in databases, and the adoption of both in data policies
An informative and educational resource
1,400 records and growing
Tracking evolution, e.g. deprecations and substitutions
• Model/format formalizing a reporting guideline →
• ← Reporting guideline used by a model/format
Cross-linking standards to standards and databases
Standards and databases recommended by publishers in their data policies
Interactive graph to inform and educate, e.g. linking database, standard and policy records
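The cross-linked registry described above is, in effect, a small typed graph of standards, databases and policies. A minimal Python sketch of that idea follows; the records and links are illustrative examples (ArrayExpress and GEO do implement MIAME), not the registry's real contents or API:

```python
# Illustrative registry records and links, not the real registry data or API.
records = {
    "MIAME": {"type": "standard", "status": "active"},
    "ArrayExpress": {"type": "database", "status": "active"},
    "GEO": {"type": "database", "status": "active"},
    "Journal policy A": {"type": "policy", "status": "active"},
}

# Directed, typed edges: (subject, relation, object).
links = [
    ("ArrayExpress", "implements", "MIAME"),
    ("GEO", "implements", "MIAME"),
    ("Journal policy A", "recommends", "ArrayExpress"),
]

def related(name, relation):
    """All records linked to `name` by `relation` (sorted subjects)."""
    return sorted(s for s, r, o in links if r == relation and o == name)

# e.g. which databases implement MIAME, and which policies recommend a database?
print(related("MIAME", "implements"))          # ['ArrayExpress', 'GEO']
print(related("ArrayExpress", "recommends"))   # ['Journal policy A']
```

Typed links are what let the registry answer the stakeholder questions above (which databases implement a standard, which policies recommend a database) as simple graph traversals.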
Linking standards and databases to training material
Advised by the ELIXIR Training Coordinators Group, including:
A collaboration between:
Data, Software, Standards, Databases, Workflows, Publications, Training material
Philippe Rocca-Serra, PhD Senior Research Lecturer
Susanna-Assunta Sansone, PhD Principal Investigator, Associate Director
We also acknowledge our network of collaborators in the following active projects: H2020 PhenoMeNal, H2020 ELIXIR-EXCELERATE, H2020 MultiMot, NIH bioCADDIE, NIH CEDAR and IMI eTRIKS