Top Banner
Introduction to Reproducible Research in Bioinformatics Nan Xiao @road2stat CRI Bioinformatics Workshop
31

Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Apr 04, 2020

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Introduction to Reproducible Research in Bioinformatics

Nan Xiao @road2stat

CRI Bioinformatics Workshop

Page 2: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

About me• http://nanx.me

• Statistics background

• Previous experience in statistical machine learning , systems pharmacology, translational bioinformatics

• R developer (7 R/Bioconductor packages; 4 web applications; 4 translated books)

Page 3: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Agenda

• Concept

• Principles

• Tools

• Challenges

Page 4: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

We love copy & paste

Page 5: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Data updated today …

Page 6: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Concept

• Allowing other researchers to replicate your (computational) analysis of the data

• Reproducibility doesn’t ensure correctness, but still helpful

• Not only required in bioinformatics research, but also required in statistical research

Page 7: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Why is RR important?

• Reduce (honest) mistakes

• Improve productivity for the long run

• More likely to be used, extended, and cited

Page 8: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

General Principles of RR

• Keep track of how every result was produced:

• Avoid manual data manipulation

• Version control all custom code

• Provide public access to data and code

Page 9: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

One Principle:Everything Automated with Code

Page 10: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Tools1. Workflow automation: GNU make + CLI tools instead of

GUI tools; Workflow & pipeline systems

2. R / Python packages instead of code snippets

3. knitr / IPython Notebook + Markdown instead of Word

4. Code version control: git / GitHub

5. Package dependency management: packrat / virtualenv

6. System dependency management: Docker

Page 11: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

1. Make & its friends

• Make can be used for computational project workflow automation

• Organises code & data dependencies naturally

• Works seamlessly with Linux/Unix CLI tools, such as sed, awk, and many others

Page 12: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Example Makefile

fig1.pdf

fig2.pdf

code1.R

code2.R

report.tex

report.pdf

Page 13: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Example Makefile# Example Makefile for a LaTeX report

report.pdf: report.tex img/fig1.pdf img/fig2.pdf

pdflatex report

# Run R code to reproduce figures

img/fig1.pdf: Rcode/code1.R

cd Rcode; R CMD BATCH code1.R code1.Rout

img/fig2.pdf: Rcode/code2.R

cd Rcode; R CMD BATCH code2.R code2.Rout

Page 14: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Workflow Systems• Galaxy

• bpipe

• systemPipeR

• Rabix

• … and many others:

• https://github.com/pditommaso/awesome-pipeline

Page 15: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Galaxy Workflow

Page 16: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Pipeline in NCI Cancer Genomics Cloud http://www.cancergenomicscloud.org

Page 17: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

If you want to go beyond CLI and Galaxy … R / Python packages can save the day.

Page 18: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

2. R Packages• readr: data loading

• httr: web scraping

• tidyr: data cleaning

• dplyr: data mangling

• stringr: string data

• lubridate: time data

• ggplot2: visualization

• devtools: package dev

• roxygen2: documentation

• testthat: unit testing

Page 19: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

knitr

Page 20: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

3. knitr• knitr report = code + text

Let’s explore our ChIP-seq’s peak length:```{r}counts = read.table("peaks.broadPeak")hist(counts$end - counts$start)```The median of the peak length is`r median(counts$end - counts$start)`.

Page 21: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Demo with RStudio

• RStudio

• knitr + RMarkdown

• Analyze ChIP-seq peak length distribution

Page 22: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

4. Code Version Control

Easy to see when and where the code was changed

Page 23: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

5. Package Dependency

• R: packrat

• Python: virtualenv

• Manages project package dependency, to make sure every newly created environment has the same versions of packages

Page 24: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

6. System Dependency

Page 25: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Challenges• Allow for some level of interactivity: test & debug

(knitr, IPython Notebook)

• Handling large-scale computation (Docker + AWS)

• Dependency deprecation (packrat)

• OS-level reproducibility (Docker)

• Data privacy concerns

Page 26: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

liftr

liftr.me

Page 27: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Idea

• A framework for dockerizing R markdown documents

• Built-in Rabix (bioinformatics pipelines) support

• Reproducible bioinformatics and statistical analysis

Page 28: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

knitrRabix

Page 29: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Rmd Documents with `liftr`

options in metadata

Generated Dockerfile (Rabixfile)

Rendered HTML/PDF/Docx Reports

lift("foo.Rmd") drender("foo.Rmd") Read and share!

---liftr: syslib: - samtools biocpkg: - Rsamtools - Gviz rabix: true rabix_json: "bwa.json"

---+ +HTML PDF

+

liftr workflow

Page 30: Introduction to Reproducible Research in Bioinformatics · 2020-03-19 · Introduction to Reproducible Research in Bioinformatics ... as sed, awk, and many others. Example Makefile

Further Readings