Reproducibility, Research Objects and Reality, Leiden 2016

Post on 17-Feb-2017


Reproducibility, Research Objects and Reality

Professor Carole Goble
The University of Manchester, UK
Software Sustainability Institute, UK
ELIXIR UK, FAIRDOM Association e.V.

carole.goble@manchester.ac.uk

University of Leiden, The Netherlands, 24 November 2016

Acknowledgements
• Dagstuhl Seminar 16041, January 2016
– http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16041
• ATI Symposium Reproducibility, Sustainability and Preservation, April 2016
– https://turing.ac.uk/events/reproducibility-sustainability-and-preservation/
– https://osf.io/bcef5/files/
• C Titus Brown
• Juliana Freire
• David De Roure
• Stian Soiland-Reyes
• Barend Mons
• Tim Clark
• Daniel Garijo
• Norman Morrison
• Katy Wolstencroft
• Phil Bourne
• Natalie Stanford
• Jacky Snoep
• Stuart Owen
• Marco Roos
• Kristina Hettne
• Alan Williams
• Sean Bechhofer
• Ian Fore
• Rafael Jimenez
• Michael Crusoe
• Paul Groth
• Niall Beard
…. And many more

Context: Computational Science

http://tpeterka.github.io/maui-project/
From: The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd

1. Observational, experimental
2. Theoretical
3. Simulation
4. Data intensive

Motivation: Knowledge Turning
research infrastructures
• Computational tools
• Sharing platforms
• Knowledge Exchange
• Reproducible research
• Software and data practices
• Policies

[Josh Sommer, for the picture]

Reproducibility Rampancy

NIH Rigor and Reproducibility
https://www.nih.gov/research-training/rigor-reproducibility

Plenty of guidelines

cos.io/top

Plenty of principles

https://wellcomeopenresearch.org/ Nature Scientific Data

Data as a first class citizen + Data Citation

Scholarly Communications Providers

Software as a first class citizen + Software Citation

Funders

http://www.acmedsci.ac.uk/policy/policy-projects/reproducibility-and-reliability-of-biomedical-research/

republic of science*

regulation of science

institution cores / libraries / public services

*Merton’s four norms of scientific behaviour (1942)

FAIR
Findable
Accessible
Interoperable
Reusable

Intelligible

Reproducible

Citable

Track & Countable

http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

Research Infrastructure for FAIR Management and Sharing of Data, Operating Procedures, Models
For Systems and Synthetic Biology Projects

Research Infrastructure for FAIR Data for Life Sciences in Europe

Data-Driven Science

design
cherry picking data, random seed reporting, non-independent bias, poor positive and negative controls, dodgy normalisation, arbitrary cut-offs, premature data triage, un-validated materials, improper statistical analysis, poor statistical power, stop when “get to the right answer”, software misconfigurations, misapplied black box software

reporting
incomplete reporting of software configurations, parameters & resource versions, missed steps, missing data, vague methods, missing software

Empirical
Statistical
Computational

V. Stodden, IMS Bulletin (2013)

Reproducibility and reliability of biomedical research: improving research practice

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.”

Carroll, Through the Looking Glass

re-compute

replicate

rerun

repeat

re-examine

repurpose

recreate

reuse

restore

reconstruct

review

regenerate

revise

recycle

redo

robustness tolerance

verification compliance validation assurance

remix

Scientific publications' goals: (i) announce a result; (ii) convince readers it's correct.

Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.

Papers in computational science should describe the results and provide the complete software development environment, data and set of instructions which generated the figures.

Virtual Witnessing*

*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.

Jill Mesirov

David Donoho

Computational Complex Assemblies

Remote Calls

“Micro” Reproducibility

“Macro” Reproducibility

Fixivity

Validate

Verify

Trust

Repeatability: “Sameness”
Same result
1 Lab
1 experiment

Reproducibility: “Similarity”
Similar result
> 1 Lab
> 1 experiment
why the differences?

https://2016-oslo-repeatability.readthedocs.org/en/latest/repeatability-discussion.html
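
The "sameness" versus "similarity" distinction can be sketched in a few lines. This is illustrative only: the toy experiment `simulate`, the helper names, the seeds and the tolerance are all assumptions, not anything taken from the talk.

```python
import math
import random


def simulate(seed: int, n: int = 10_000) -> float:
    """Toy stochastic 'experiment': the mean of n uniform draws."""
    rng = random.Random(seed)  # a reported seed is what makes the run repeatable
    return sum(rng.random() for _ in range(n)) / n


def is_repeatable(seed: int) -> bool:
    """Sameness: same lab, same experiment, same seed, identical result."""
    return simulate(seed) == simulate(seed)


def is_reproducible(seed_a: int, seed_b: int, tolerance: float = 0.05) -> bool:
    """Similarity: independent runs agree within a stated tolerance."""
    return math.isclose(simulate(seed_a), simulate(seed_b), abs_tol=tolerance)


if __name__ == "__main__":
    print("repeatable:", is_repeatable(42))
    print("reproducible:", is_reproducible(42, 2016))
```

Repeatability here is an exact comparison, while reproducibility needs an explicit, reported tolerance; choosing that tolerance is where "why the differences?" gets asked.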

Method Reproducibility
the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated.

Result Reproducibility (aka replicability)
obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible

Goodman, et al Science Translational Medicine 8 (341) 2016

Productivity
Track differences

• reviewers want additional work
• statistician wants more runs
• analysis needs to be repeated
• post-doc leaves, student arrives
• new/revised datasets
• updated/new versions of algorithms/codes
• sample was contaminated
• better kit - longer simulations
• new partners, new projects

Personal & Lab Productivity

Public Good
Reproducibility

Computational “Datascopes”

Experiment
Methods: techniques, algorithms, spec. of the steps, models
Materials: datasets, parameters, algorithm seeds

Setup
Instruments: codes, services, scripts, underlying libraries, workflows, ref datasets
Laboratory: sw and hw infrastructure, systems software, integrative platforms, computational environment

“Datascope” Practicalities

Experiment: Methods, Materials
Setup: Instruments, Laboratory

Change Dependencies
science, methods, datasets
questions stay, answers change

breakage, labs decay, services, techniques and instruments change, updated datasets, services, codes, hardware
software entropy

one offs, streams, stochastics, sensitivities, scale, non-portable data

supercomputer access
non-portable software
licensing restrictions
unreliable resources and third party codes
complexity

Blackboxes

blackbox software

hidden manual steps

Active Instrument
Byte level preservation

Reproduce by Running
Reproduce by Reading

Archived Record
Prepare to repair

ELNs

Markup Languages
Reporting Guidelines

Common Formats

Community vocabularies

Record All
Automate All
Contain All
Expose All

Findable
Accessible
Interoperable
Reusable

provenance
portability preservation
robustness
versioning
access description
standards
common APIs
licensing
standards, common metadata
change variation sensitivity
discrepancy handling
packaging, containers

FAIR RACE shades of reproducibility

dependencies
steps
ids

A robust infrastructure for biological information.

bio.tools

https://usegalaxy.org/

Workflow Description
Workflow Preservation
Workflow Portability
Workflow Interoperability

Workflow Preservation and Exchange
Experiments
Workflows & Workflow Runs
Workflow Commons

Third Party Services
Scattered resources

Rich descriptions
Prepare to Repair

Standards-based metadata framework for bundling resources with context

Citable Reproducible Packaging

Metadata for bundling resources scattered and stored somewhere else

Research Object in a nutshell

Container
Packaging content & links: Zip files, BagIt, Docker images
Catalogues & Commons Platforms: FAIRDOM, myExperiment

Manifest Construction
Aggregates: link things together
Annotations: about things & their relationships
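
A manifest that aggregates resources and annotates them can be sketched as a small data structure. The key names below are loosely modelled on the Research Object Bundle manifest, but the exact layout, the file names, the example.org URL and the helper `build_manifest` are assumptions made for illustration, not the specification itself.

```python
import json


def build_manifest(resources, annotations):
    """Build a manifest dict: aggregated resources plus annotations about them."""
    aggregated = {r["uri"] for r in resources}
    for note in annotations:
        # Every annotation must be about something the object aggregates.
        if note["about"] not in aggregated:
            raise ValueError(f"annotation targets unknown resource: {note['about']}")
    return {
        "id": "/",
        "aggregates": list(resources),
        "annotations": list(annotations),
    }


# Hypothetical content: one packaged file and one resource stored elsewhere.
resources = [
    {"uri": "workflow/analysis.cwl", "mediatype": "text/x-yaml"},
    {"uri": "https://example.org/data/run1.csv", "mediatype": "text/csv"},
]
annotations = [
    {"about": "workflow/analysis.cwl", "content": "annotations/provenance.ttl"},
]

manifest = build_manifest(resources, annotations)
print(json.dumps(manifest, indent=2))
```

The aggregation can point at content inside the package or at a link to something stored somewhere else; the annotations carry the context that travels with both.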

Manifest Description
Dependencies: what else is needed
Versioning: its evolution
Checklists: what should be there
Provenance: where it came from
Identification: locate things regardless where (id)

Research Object Profile for Workflows…

Manifest Description
Minimum information for one content type
Common properties among content types

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003

Hettne KM, et al (2014), Structuring research methods and data with the research object model: genomics workflows as a case study. J. Biomedical Semantics 5: 41

Workflow Research Object Bundles exchange, portability and maintenance

BagIt

workflows packaged into various containers for sharing

Checksum
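
File integrity in a bundle comes down to recording a checksum per file and re-checking it later. A minimal sketch follows, using a BagIt-style "checksum then path" manifest line; the helper names and the in-memory file contents are hypothetical, and a real bag would read payload files from disk.

```python
import hashlib
from typing import Dict, List


def sha256_of(payload: bytes) -> str:
    """Hex SHA-256 digest of a file's content."""
    return hashlib.sha256(payload).hexdigest()


def write_manifest(files: Dict[str, bytes]) -> str:
    """Return manifest text: one '<checksum>  <path>' line per file."""
    return "\n".join(
        f"{sha256_of(data)}  {path}" for path, data in sorted(files.items())
    )


def verify(manifest: str, files: Dict[str, bytes]) -> List[str]:
    """Return the paths whose content is missing or no longer matches."""
    changed = []
    for line in manifest.splitlines():
        checksum, path = line.split(maxsplit=1)
        if path not in files or sha256_of(files[path]) != checksum:
            changed.append(path)
    return changed


# Hypothetical bundle content.
original = {
    "data/workflow.cwl": b"cwlVersion: v1.0\n",
    "data/inputs.csv": b"a,b\n1,2\n",
}
manifest = write_manifest(original)

tampered = dict(original)
tampered["data/inputs.csv"] = b"a,b\n1,3\n"

print(verify(manifest, original))
print(verify(manifest, tampered))
```

A checksum only says that something changed; it is the manifest's versioning and provenance metadata that help explain what changed and why.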

Workflow and Workflow Management System Zoo

https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

bio.tools

A community led standard way of expressing and running workflows and command line tools using containers

Ontologies for describing tools and their inputs and outputs

Metadata framework for the manifest versioning, file integrity, more metadata about the workflow

Workflow fragment containers

Findable
Accessible
Interoperable
Reusable

Data
Operations
Models

Systems and Synthetic Biology Projects

Funder: Legacy!

Partners

Project Support

Community Actions

Platforms, Tools

Web-based Portal Public Commons

50+ projects
5 programmes
400+ people
22 independent installations

Systems Approach…
Multiple, interrelated assets
Multiple, dispersed repositories

Literature

SOPs

STANDARDS
versioning, tracking: provenance, parameters, citation

Operations
Data
Models

FAIR Data and Metadata Standards that help to improve understanding and exchange….

Nicolas Le Novère, Babraham Institute, UK.

…researchers do not always use them....

… model reuse and reproducibility tricky…

Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053

Systems Approach…
teams, processes, multi-partner, multi-discipline, legacy

P1. BaCell-SysMO: The transition from growing to non-growing Bacillus subtilis cells - A systems biology approach

P2. COSMIC: Systems Biology of Clostridium acetobutylicum - a possible answer to dwindling crude oil reserves

P3. SUMO: Systems Understanding of Microbial Oxygen Responses, Escherichia coli

P4. KOSMOBAC: Ion and solute homeostasis in enteric bacteria, Escherichia coli

P5. SysMO-LAB: Comparative Systems Biology: Lactic Acid Bacteria: Lactococcus lactis, Enterococcus faecalis, Streptococcus pyogenes

P6. PSYSMO: Systems analysis of biotech induced stresses: towards a quantum increase in process performance in the cell factory Pseudomonas putida

P7. SCaRAB: Systems Biology of a genetically engineered Pseudomonas fluorescens with inducible exo-polysaccharide production: analysis of the dynamics and robustness of metabolic networks

P8. MOSES: MicroOrganism Systems Biology: Energy and Saccharomyces cerevisiae

P9. TRANSLUCENT: Gene interaction networks and models of cation homeostasis in Saccharomyces cerevisiae

P10. STREAM: Global metabolic switching in Streptomyces coelicolor

P11. SulfoSYS: Silicon cell model for the central carbohydrate metabolism of the archaeon Sulfolobus solfataricus under temperature variation

P12. SysMO-DB: Data management group

Funders

Researchers

Publishers

Who is working with which organism?

What methods are being used to determine enzyme activity?

Under which experimental conditions are my partners working for the measurement of glucose concentration?

What is the provenance of the parameters for this version of the model?

What SOP was used for this sample?

Where is the validation data for this model?

Is there any group generating kinetic data?

Is this data available?

Track versions of my model

What's the relationship between the data and model?

Which data belong to which publications?

FAIR

A Commons

fairdomhub.org

Investigation

Study Analysis

Data

Model

SOP(Assay)

….organised in Investigation, Study, Assay/Analysis format
….registered using Just Enough Results description
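
The Investigation, Study, Assay nesting can be pictured as a simple tree with data, models and SOPs hanging off each assay. The sketch below is an assumption-laden illustration: the class and field names and the sample titles are hypothetical and do not claim to be the FAIRDOM data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    kind: str   # "data", "model" or "sop"
    title: str


@dataclass
class Assay:
    title: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def assets_of_kind(self, kind: str) -> List[Asset]:
        """Walk Investigation -> Study -> Assay and collect matching assets."""
        return [
            asset
            for study in self.studies
            for assay in study.assays
            for asset in assay.assets
            if asset.kind == kind
        ]


# Hypothetical example content.
investigation = Investigation(
    title="My Investigation",
    studies=[
        Study(
            title="Glucose uptake",
            assays=[
                Assay(
                    title="Enzyme activity measurement",
                    assets=[
                        Asset("sop", "Sampling protocol"),
                        Asset("data", "Kinetic measurements"),
                        Asset("model", "Glycolysis model v2"),
                    ],
                )
            ],
        )
    ],
)

print([a.title for a in investigation.assets_of_kind("sop")])
```

Keeping the assets inside this structure is what lets questions such as "What SOP was used for this sample?" be answered by traversal rather than by asking around.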

Just Enough Results Model
Common elements

Uploaded into the FAIRDOM Store

Linked to entry in Public Archive

Linked to entry in Project store

... aggregating catalogue
metadata across repositories, retain context -> reproduce, reuse

Local Stores

External Databases

Publishing services

Secure Stores

Model Resources

… in situ reproducible models
metadata annotation against standards

model validation, comparison and simulation

SBML Model simulation

Model comparison

Model versioning

Reproducing simulations

[Jacky Snoep, Dagmar Waltemath, Martin Peters, Martin Scharm]

…. Nested Packages

context and credit

Research Objects
• Link
• Nest
• Span
• Bundle
• Snapshot

Systematic, Standards-based metadata framework for logically and physically bundling resources with context
• Exchange
• Reproduce
• Release packages

Reproducible Exchange and Publishing
and better credit

reviewer

Author List: Joe Bloggs; Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek##

information travels with the data and models

How do we do? Pretty well. Reproducibility window. But that’s ok!

• Can’t contain everything
– Pesky Internet in a Box
• Can’t automate everything
– Pesky people
• Can’t fix everything
– Pesky science

Asthma Research e-Laboratory

Release builds of pharmacological knowledge warehouse

Exchanging large datasets

Samiul Hasan, GSK
Biocuration need in Pharma: Drivers from a Translational Bioinformatics Perspective, Poster S16, 1st EASYM Conference, Berlin 2016

Reality

Preparation pain. Goldilocks paradox.

[Norman Morrison]

replication hostility
no funding, time, recognition, place to publish
resource intensive
access to the complete environment

“Data Parasites”
“Data Flirters”

“Share Drift”
Family
Friends
Potential Friends
Acquaintances
Strangers
Rivals

Reciprocity

Using FAIRDOM my own lab colleagues saw what I was doing and called to collaborate!

Jurgen Haanstra
Vrije Universiteit Amsterdam, Netherlands

Trust …

Half of researchers make research data available so they can be used by others.

Most have experienced neither direct benefits nor many bad effects.

Caveat: shared but usable?
fake sharing

funder requirements
fear data will be misused or misinterpreted
journal requirements
good research practice
facilitate collaborations
enable validation and replication
higher citation rates
time and effort
new collaborations
extra funding for cost of data prep
enhance their academic reputation
feedback on how other researchers were using their data
taken into account in funding
taken into account in career
jeopardise future publications
it's not ready to share
scrutiny scruples
answering questions
I won’t get credited

Metadata in by side effect
Tooling for annotations and checklist templates for different types of assay data.

Embed ontologies into Excel templates

Excel spreadsheets enriched with ontology annotations

Upload, extract metadata and register

http://www.rightfield.org.uk

Spreadsheet Ramps!!

Sharing by side effect …. libertarian paternalism

[Kristian Garza]

Finding and Citing by side effect

• Schema.org
• Structured markup in web pages
• Supported by Content Management Systems
• Harvested by search engines
• Builds snippets and sidebars
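
Structured markup of this kind is commonly embedded in a page as JSON-LD. The sketch below generates schema.org `Dataset` markup; the helper name `dataset_jsonld`, the description text and the example.org URLs are hypothetical, and a real page would take these values from the repository's own metadata.

```python
import json


def dataset_jsonld(name: str, description: str, url: str, identifier: str) -> str:
    """Return a <script> element carrying schema.org Dataset markup as JSON-LD."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical values for illustration.
snippet = dataset_jsonld(
    name="My Investigation",
    description="Data, models and SOPs for an example investigation.",
    url="https://example.org/investigations/1",
    identifier="https://example.org/id/investigation-1",
)
print(snippet)
```

Because the markup is produced from metadata the repository already holds, being findable becomes a side effect of publishing the page rather than an extra task for the researcher.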

Bioschemas.org

[Diagram: data repositories and training resources carry Bioschemas markup, which is harvested by search engines and bio registries]

Search engines and bio registries: Biosharing, OLS, TeSS, bio.tools, UKCRC Tissue Directory, bioCADDIE DATAMED, EBI-Search, Google

Resources: PDBe, UniProt, Interpro, Molgenis, Pfam, Gene3D, Biosamples, Biobank websites, BRENDA, HPA, TransPlant, EGA, Beacons

Big co-operative data-driven science makes reproducibility desirable but also means dependency and change are to be expected

Words matter.
50 Shades of Reproducibility.

form vs function
Reproducibility is not an end.

Beware zealots.

Amplify Side effects
Think Research Objects!
