Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics
Simon Cockell
Bioinformatics Special Interest Group
19th July 2012

Reproducibility - The myths and truths of pipeline bioinformatics


In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses, as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the ‘executable paper’ and whether it represents the future of bioinformatics publishing.
Transcript
Page 1

Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics

Simon Cockell

Bioinformatics Special Interest Group

19th July 2012

Page 2

Repeatability and Reproducibility

• Main principle of the scientific method
• Repeatability is ‘within lab’
• Reproducibility is ‘between lab’
  • Broader concept
• This should be easy in bioinformatics, right?
  • Same data + same code = same results
  • Not many analyses have stochasticity

http://xkcd.com/242/

Page 3

Same data?

• Example:
  • Data deposited in SRA
  • Original data deleted by researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
  • Starting point for subsequent analysis not the same, regardless of whether the same code is used

Page 4

Same data?

• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn’t been corrupted by the transfer process?
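One simple safeguard is to record checksums when the data are produced and verify them after every transfer. A minimal sketch using the standard md5sum tool (the file name here is a hypothetical stand-in for your real data):

```shell
# Toy stand-in for a real FASTQ file (hypothetical name and contents)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > sample.fastq

# Record a checksum alongside the data at deposition time
md5sum sample.fastq > checksums.md5

# After any transfer, verify integrity; a non-zero exit status
# (or a 'FAILED' line in the output) means the file was corrupted
md5sum --check checksums.md5
```

sha256sum works identically if you prefer a stronger hash.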

Page 5

Same code?

• What version of a particular piece of software did you use?
• Is it still available?
• Did you write it yourself?
  • Do you use version control?
  • Did you tag a version?
• Is the software closed/proprietary?
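A low-tech way to answer these questions later is to log tool versions at run time, next to the results. A sketch assuming GNU-style `--version` flags (bash and sort stand in for whatever your analysis actually invokes):

```shell
# Capture the version of every tool the analysis used, in one file
# that is stored (and version-controlled) alongside the results
{
  bash --version | head -n1
  sort --version | head -n1
} > software-versions.txt

cat software-versions.txt
```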

Page 6

Version Control

• Good practice for software AND data

• DVCS means it doesn’t have to be in a remote repository

• All local folders can be versioned
  • Doesn’t mean they have to be; it’s a judgment call
• Check in regularly
• Tag important “releases”

https://twitter.com/sjcockell/status/202041359920676864
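The check-in-and-tag habit can be sketched with Git, one widely used DVCS (the directory, file, and tag names here are hypothetical):

```shell
# Any local folder can become a repository; no remote server required
git init analysis-project
cd analysis-project
git config user.name 'Your Name'        # identity needed for commits
git config user.email 'you@example.com'

# Check in regularly
echo 'bwa mem ref.fa reads.fastq > aln.sam' > pipeline.sh
git add pipeline.sh
git commit -m 'Add alignment step'

# Tag the exact state used for an important "release"
git tag -a paper-v1 -m 'Version used for the submitted manuscript'
```

`git push` to a remote can be added later if sharing is wanted; the history and tags are complete locally.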

Page 7

Pipelines

• Package your analysis
  • Easily repeatable
  • Also easy to distribute
• Start-to-finish task automation

• Process captured by underlying pipeline architecture

http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/

Page 8

Tools for pipelining analyses

• Huge numbers

• See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

• Only a few widely used:

• Bash
  • old school
• Taverna
  • build workflows from public web services
• Galaxy
  • sequencing focus; tools provided in ‘toolshed’
• Microbase
  • distributed computing; build workflows from ‘responders’
• e-Science Central
  • ‘Science as a Service’; cloud focus
  • not specifically a bioinformatics tool

Page 9

Bash

• Single-machine (or cluster) command-line workflows

• No fancy GUIs
• Record provenance & process

• Rudimentary parallel processing

http://www.gnu.org/software/bash/
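Reduced to a toy, such a Bash workflow might look like this: fail-fast settings, a provenance log, sequential steps, and background jobs for the rudimentary parallelism (the data and step names are made up):

```shell
#!/bin/bash
# Fail fast: a broken step aborts the pipeline instead of
# silently feeding bad data downstream
set -eu   # under bash, consider also: set -o pipefail

# Record basic provenance next to the results
{ date; uname -a; } > provenance.log

# Toy input standing in for real data
printf 'gene_b\ngene_a\ngene_a\n' > raw.txt

# Step 1 and step 2, each consuming the previous step's output
sort raw.txt > sorted.txt
uniq -c sorted.txt > counts.txt

# Rudimentary parallelism: independent steps run as background jobs
wc -l sorted.txt > line-count.txt &
md5sum sorted.txt > sorted.md5 &
wait
```

The script itself is the documentation of the process, and can be checked into version control alongside the provenance log.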

Page 10

Page 11

Taverna

• Workflows from web services

• Lack of relevant services
  • Relies on providers

• Gluing services together increasingly problematic

• Sharing workflows through myExperiment
  • http://www.myexperiment.org/

http://www.taverna.org.uk/

Page 12

Galaxy

• “open, web-based platform for data intensive biomedical research”

• Install locally or use the (limited) public server

• Can build workflows from tools in ‘toolshed’

• Command-line tools wrapped with web interface

https://main.g2.bx.psu.edu/

Page 13

Galaxy Workflow

Page 14

Microbase

• Task management framework
• Workflows emerge from interacting ‘responders’
• Notification system passes messages around
• ‘Cloud-ready’ system that scales easily
• Responders must be written for new tools

http://www.microbasecloud.com/

Page 15

e-Science Central

• ‘Blocks’ can be combined into workflows

• Blocks need to be written by an expert

• Social networking features

• Good provenance recording

http://www.esciencecentral.co.uk/

Page 16

The best approach?

• Good for individual analysis
  • Package & publish
• All datasets are different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note the lack of QC
  • Requires human interaction, so impossible to pipeline
  • Different every time
  • Subjective: a major source of variation in results
  • BUT important and necessary (GIGO: garbage in, garbage out)

Page 17

More tools for reproducibility

• iPython notebook
  • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  • Build notebooks with code embedded
  • Run code arbitrarily
  • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create ‘companion websites’ for papers
  • These websites allow readers to implement the methodology described in the paper
  • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92

Page 18

The executable paper

• The ultimate in repeatable research

• Data and code embedded in the publication

• Figures can be generated, in situ, from the actual data

• http://ged.msu.edu/papers/2012-diginorm/

Page 19

Summary

• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
• Pipelines are not the great panacea
  • Though they may help for parts of the process
• Bash is as good as many ‘fancier’ tools (for tasks on a single machine or cluster)

Page 20

Inspirations for this talk

• C. Titus Brown’s blog posts on repeatability and the executable paper
  • http://ivory.idyll.org/blog
• Michael Barton’s blog posts about organising bioinformatics projects and pipelines
  • http://bioinformaticszen.com/