Better science through superior software
Michael R. Crusoe, Software Engineer & Bioinformatician
The GED Lab @ Michigan State
[email protected] @biocrusoe

May 10, 2015

Presentation given to the BEACON 2013 Congress during the "Collaborating with Industry" sandbox

Original w/ slide notes at: https://docs.google.com/presentation/d/1mmvD0R3fLIl11TmFHij5fGcMDb9qJxy_nwENO2Rt-YI/edit?usp=sharing
Transcript
Page 1: Better science through superior software

Better science through superior software

Michael R. Crusoe
Software Engineer & Bioinformatician

The GED Lab @ Michigan State
[email protected] @biocrusoe

Page 2: Better science through superior software

Open, online science

Much of the software and approaches talked about today are available:

khmer software: http://github.com/ged-lab/khmer/

Titus’s blog: http://ivory.idyll.org/blog/

Titus’s twitter: @ctitusbrown

Page 3: Better science through superior software

Overview

● Next-gen sequencing data deluge

● ♫How do you solve a problem like big data?♫

● Impact of khmer software

● Future work

● Being a good F/OSS community member and leading by example

● Acknowledgements

Page 4: Better science through superior software

Problem

“The power of next-gen. sequencing: get 180x coverage... and then watch your assemblies never finish” - Erich Schwarz

Page 5: Better science through superior software

“Three types of data scientists.”

(Bob Grossman, U. Chicago, at XLDB 2012)

1. Your data gathering rate is slower than Moore’s Law.

2. Your data gathering rate matches Moore’s Law.

3. Your data gathering rate exceeds Moore’s Law.

Page 6: Better science through superior software
Page 7: Better science through superior software

“Three types of data scientists.”

1. Your data gathering rate is slower than Moore’s Law.

=> Be lazy, all will work out.

2. Your data gathering rate matches Moore’s Law.

=> You need to write good software, but all will work out.

3. Your data gathering rate exceeds Moore’s Law.

=> You need serious help.

Page 8: Better science through superior software

A software & algorithms approach: can we develop lossy compression approaches that

1. Reduce data size & remove errors => efficient processing?

2. Retain all “information”? (think JPEG)

If so, then we can store only the compressed data for later reanalysis. Short answer is: yes, we can.

Page 9: Better science through superior software

Digital normalization approach

diginorm, a digital analog to cDNA library normalization:

● Is reference-free;

● Is single pass: looks at each read only once;

● Does not “collect” the majority of errors;

● Keeps all low-coverage reads & retains all information;

● Smooths out coverage of regions.
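
The single-pass idea above can be sketched in a few lines of Python. This is a minimal illustration, not khmer's implementation: it assumes the usual median k-mer count rule for deciding whether a read is redundant, uses an exact in-memory `Counter` where a real tool would need something far more memory-efficient, and the function names and the default values of `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def diginorm(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, among the
    reads kept so far, is still below the coverage cutoff.

    Each read is examined exactly once (single pass) and no reference
    genome is consulted. k and cutoff are illustrative defaults.
    """
    counts = Counter()  # exact counts; an assumption made for clarity
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add coverage
    return kept
```

Reads from regions that are already well covered stop being kept once the cutoff is reached, while reads from low-coverage regions always pass, which is how coverage gets smoothed out. Because only kept reads update the counts, most of the sequencing errors carried by discarded reads never enter the table.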

Page 10: Better science through superior software

GED Lab’s approach: khmer

diginorm: ejects most data while retaining the information content.

partitioning: split transcriptomic and meta{transcript,gen}omic datasets

fast k-mer counting: for better preprocessing, repeat detection, and sequencing coverage estimates

Reference-free variant calling

- More to come -
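
As a rough illustration of the k-mer counting item above, the sketch below counts k-mers exactly and builds an abundance histogram, the kind of summary used for coverage estimates and repeat detection. It is not khmer's code: the exact `Counter` is an assumption made for readability, and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers
    observed that many times."""
    return Counter(counts.values())
```

In such a histogram, k-mers seen only once or twice in a deeply sequenced sample are often errors, and k-mers seen far more often than the typical coverage are candidates for repeats.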

Page 11: Better science through superior software

The GED Lab at MSU: Theoretical => applied solutions.

Page 12: Better science through superior software

Impact

● Any biologist can run our tools cheaply on a rented cloud computer.

● Overcome complexity: Erich Schwarz assembled H. contortus, which was previously not possible.

● Overcome data excess: 5.1 billion reads from 50 different sea lamprey tissues -> diginorm removed 98.7% as redundant.
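
To put the lamprey figures in perspective, here is the arithmetic worked out. The inputs are the rounded numbers from the slide, so the result is only approximate; the variable names are illustrative.

```python
total_reads = 5_100_000_000   # 5.1 billion reads, as stated on the slide
removed_per_mille = 987       # 98.7% removed, kept as an integer to avoid float rounding

removed = total_reads * removed_per_mille // 1000
kept = total_reads - removed

print(f"removed: {removed:,}")  # removed: 5,033,700,000
print(f"kept:    {kept:,}")     # kept:    66,300,000
```

Roughly 66 million reads remain to be assembled, about 1.3% of the original data set.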

Page 13: Better science through superior software

Future work

● targeted-gene assembly from short reads (Fish et al., Ribosomal Database Project)

● rRNA search in shotgun data

● error-correction for mRNAseq & metagenomic data

● strain variation collapse, assembly, and recovery

● Goal: make most assembly easy and all evaluation easy.

Page 14: Better science through superior software

Interactions

khmer both builds upon existing Free and Open-Source Software (F/OSS) and is itself made under an open-source license.

Used in curricula: both Software Carpentry ANGUS-based courses and the MSU NGS summer course

Page 15: Better science through superior software

● BIG DATA grant reviewers specifically mentioned the GED Lab’s “[...] long and successful track-record and experience in following rigorous but open software development processes.” -> CTB received 3-year NIH R01 support

● Transparent and public software development yielded participation from others.

Page 16: Better science through superior software

Personal Acknowledgments

C. Titus Brown for slides, employment

Page 17: Better science through superior software

Acknowledgements

Lab members involved

● Adina Howe (w/Tiedje)
● Jason Pell
● Arend Hintze
● Rosangela Canino-Koning
● Qingpeng Zhang
● Elijah Lowe
● Likit Preeyanon
● Jiarong Guo
● Tim Brom
● Kanchan Pavangadkar
● Eric McDonald
● Chris Welcher

Collaborators

● Jim Tiedje, MSU
● Billie Swalla, UW
● Janet Jansson, LBNL
● Susannah Tringe, JGI

Funding

USDA NIFA; NSF IOS; BEACON.