Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics
Simon Cockell
Bioinformatics Special Interest Group
19th July 2012

Reproducibility - The myths and truths of pipeline bioinformatics


In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses, as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the ‘executable paper’ and whether it represents the future of bioinformatics publishing.
Transcript
Page 1

Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics

Simon Cockell

Bioinformatics Special Interest Group

19th July 2012

Page 2

Repeatability and Reproducibility

• Main principle of the scientific method
• Repeatability is ‘within lab’
• Reproducibility is ‘between lab’
  • Broader concept
• This should be easy in bioinformatics, right?
  • Same data + same code = same results
  • Not many analyses have stochasticity

http://xkcd.com/242/

Page 3

Same data?

• Example:
  • Data deposited in SRA
  • Original data deleted by researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
  • Starting point for subsequent analysis not the same, regardless of whether the same code is used

Page 4

Same data?

• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn’t been corrupted by the transfer process?
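One simple safeguard is to record checksums when the data are produced and verify them after every transfer. A minimal sketch using the standard md5sum tool (the file name here is a hypothetical stand-in for your real data):

```shell
# Toy stand-in for a real FASTQ file (hypothetical name and contents)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > sample.fastq

# Record a checksum alongside the data at deposition time
md5sum sample.fastq > checksums.md5

# After any transfer, verify integrity; a non-zero exit status
# (or a 'FAILED' line in the output) means the file was corrupted
md5sum --check checksums.md5
```

sha256sum works identically if you prefer a stronger hash.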

Page 5

Same code?

• What version of a particular piece of software did you use?
• Is it still available?
• Did you write it yourself?
  • Do you use version control?
  • Did you tag a version?
• Is the software closed/proprietary?
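A low-tech way to answer these questions later is to log tool versions at run time, next to the results. A sketch assuming GNU-style `--version` flags (bash and sort stand in for whatever your analysis actually invokes):

```shell
# Capture the version of every tool the analysis used, in one file
# that is stored (and version-controlled) alongside the results
{
  bash --version | head -n1
  sort --version | head -n1
} > software-versions.txt

cat software-versions.txt
```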

Page 6

Version Control

• Good practice for software AND data

• DVCS means it doesn’t have to be in a remote repository

• All local folders can be versioned
  • Doesn’t mean they have to be; it’s a judgment call
• Check in regularly
• Tag important “releases”

https://twitter.com/sjcockell/status/202041359920676864
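The check-in-and-tag habit can be sketched with Git, one widely used DVCS (the directory, file, and tag names here are hypothetical):

```shell
# Any local folder can become a repository; no remote server required
git init analysis-project
cd analysis-project
git config user.name 'Your Name'        # identity needed for commits
git config user.email 'you@example.com'

# Check in regularly
echo 'bwa mem ref.fa reads.fastq > aln.sam' > pipeline.sh
git add pipeline.sh
git commit -m 'Add alignment step'

# Tag the exact state used for an important "release"
git tag -a paper-v1 -m 'Version used for the submitted manuscript'
```

`git push` to a remote can be added later if sharing is wanted; the history and tags are complete locally.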

Page 7

Pipelines

• Package your analysis
  • Easily repeatable
  • Also easy to distribute
• Start-to-finish task automation

• Process captured by underlying pipeline architecture

http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/

Page 8

Tools for pipelining analyses

• Huge numbers

• See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

• Only a few widely used:

• Bash
  • old school
• Taverna
  • build workflows from public web services
• Galaxy
  • sequencing focus; tools provided in ‘toolshed’
• Microbase
  • distributed computing; build workflows from ‘responders’
• e-Science Central
  • ‘Science as a Service’; cloud focus
  • not specifically a bioinformatics tool

Page 9

Bash

• Single-machine (or cluster) command-line workflows

• No fancy GUIs
• Record provenance & process

• Rudimentary parallel processing

http://www.gnu.org/software/bash/
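Reduced to a toy, such a Bash workflow might look like this: fail-fast settings, a provenance log, sequential steps, and background jobs for the rudimentary parallelism (the data and step names are made up):

```shell
#!/bin/bash
# Fail fast: a broken step aborts the pipeline instead of
# silently feeding bad data downstream
set -eu   # under bash, consider also: set -o pipefail

# Record basic provenance next to the results
{ date; uname -a; } > provenance.log

# Toy input standing in for real data
printf 'gene_b\ngene_a\ngene_a\n' > raw.txt

# Step 1 and step 2, each consuming the previous step's output
sort raw.txt > sorted.txt
uniq -c sorted.txt > counts.txt

# Rudimentary parallelism: independent steps run as background jobs
wc -l sorted.txt > line-count.txt &
md5sum sorted.txt > sorted.md5 &
wait
```

The script itself is the documentation of the process, and can be checked into version control alongside the provenance log.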

Page 10

Page 11

Taverna

• Workflows from web services

• Lack of relevant services
  • Relies on providers

• Gluing services together increasingly problematic

• Sharing workflows through myExperiment
  • http://www.myexperiment.org/

http://www.taverna.org.uk/

Page 12

Galaxy

• “open, web-based platform for data intensive biomedical research”

• Install locally or use the (limited) public server

• Can build workflows from tools in ‘toolshed’

• Command-line tools wrapped with web interface

https://main.g2.bx.psu.edu/

Page 13

Galaxy Workflow

Page 14

Microbase

• Task management framework
• Workflows emerge from interacting ‘responders’
• Notification system passes messages around
• ‘Cloud-ready’ system that scales easily
• Responders must be written for new tools

http://www.microbasecloud.com/

Page 15

e-Science Central

• ‘Blocks’ can be combined into workflows

• Blocks need to be written by an expert

• Social networking features

• Good provenance recording

http://www.esciencecentral.co.uk/

Page 16

The best approach?

• Good for individual analysis
  • Package & publish
• All datasets are different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note the lack of QC
  • Requires human interaction, so impossible to pipeline
  • Different every time
  • Subjective: a major source of variation in results
  • BUT important and necessary (GIGO: garbage in, garbage out)

Page 17

More tools for reproducibility

• iPython notebook
  • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  • Build notebooks with code embedded
  • Run code arbitrarily
  • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create ‘companion websites’ for papers
  • These websites allow readers to implement the methodology described in the paper
  • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92

Page 18

The executable paper

• The ultimate in repeatable research

• Data and code embedded in the publication

• Figures can be generated, in situ, from the actual data

• http://ged.msu.edu/papers/2012-diginorm/

Page 19

Summary

• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
• Pipelines are not the great panacea
  • Though they may help for parts of the process
• Bash is as good as many ‘fancier’ tools (for tasks on a single machine or cluster)

Page 20

Inspirations for this talk

• C. Titus Brown’s blog posts on repeatability and the executable paper
  • http://ivory.idyll.org/blog
• Michael Barton’s blog posts about organising bioinformatics projects and pipelines
  • http://bioinformaticszen.com/