2014 ucl

Large-scale transcriptome sequencing of non-model organisms: coping mechanisms. C. Titus Brown, Assistant Professor, MMG, CSE, BEACON, Michigan State University. May 2014.

We practice open science! Everything discussed here: Code: github.com/ged-lab/ ; BSD license Blog: http://ivory.idyll.org/blog (titus brown blog) Twitter: @ctitusbrown Grants on LabWeb site: http://ged.msu.edu/research.html Preprints available. Everything is > 80% reproducible by you.

The challenges of non-model transcriptomics: Missing or low-quality genome reference. Evolutionarily distant. Most extant computational tools focus on model organisms: they assume low polymorphism (internal variation), assume a reference genome, assume somewhat reliable functional annotation, and assume more significant compute infrastructure, so they cannot easily or directly be used on critters of interest.

Outline 1. Challenges of non-model transcriptomics. 2. Lamprey: too much data, not enough genome. 3. Digital normalization as a coping mechanism, 4. applied to Molgulid ascidians, 5. and back to lamprey. 6. More transcriptome challenges. 7. What's next? (Implications of free data + free data analysis.)

Sea lamprey in the Great Lakes: Non-native. Parasite of medium to large fishes. Caused populations of host fishes to crash. Li Lab / Y-W C-D

The problem of lamprey: Diverged at base of vertebrates; evolutionarily distant from model organisms. Large, complicated genome (~2 Gb). Relatively little existing sequence. We sequenced the liver genome.

Lamprey has incomplete genomic sequence. J. Smith et al., PNAS 2009: evidence of somatic recombination; 100s of Mb of sequence eliminated from the genome during development. More recent evidence (unpub., J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific. Liver genome is not the entire genome.

Lamprey tissues for which we have mRNAseq: embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2); metamorphosis 3 (intestine, kidney); ovulatory female head skin; adult intestine; metamorphosis 4 (intestine, kidney); preovulatory female eye; adult kidney; metamorphosis 5 (liver, intestine, kidney); preovulatory female tail skin; brain paired; metamorphosis 6 (intestine, kidney); prespermiating male gill; freshwater (gill, intestine, kidney); metamorphosis 7 (intestine, kidney); mature adult male rope tissue; larval (gill, kidney, liver, intestine); monocytes; spermiating male gill; juvenile (intestine, liver, kidney); brain (0, 3, 21 dpi); spermiating male head skin; lips; spinal cord (0, 3, 21 dpi); supraneural tissue; metamorphosis 1 (intestine, kidney); spermiating male muscle; small parasite (distal intestine, kidney, proximal intestine); metamorphosis 2 (liver, intestine, ...); salt water (gill, intestine)

Assembly. Fragments: "It was the best of times, it was the wor" / ", it was the worst of times, it was the " / "isdom, it was the age of foolishness" / "mes, it was the age of wisdom, it was th". Assembled: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness" ... but for lots and lots of fragments!
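The fragment picture on this slide can be turned into a toy program. What follows is a minimal sketch of greedy overlap merging, assuming error-free fragments; it is not the method used by real transcriptome assemblers, which generally work on De Bruijn graphs of k-mers. The names `overlap`, `greedy_assemble` and `min_len` are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    # Remove exact duplicates (keeping order) and fragments contained in others.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge: several contigs remain
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    fragments = [
        "It was the best of times, it was the wor",
        ", it was the worst of times, it was the ",
        "isdom, it was the age of foolishness",
        "mes, it was the age of wisdom, it was th",
    ]
    for contig in greedy_assemble(fragments):
        print(contig)
```

The repeated phrase "it was the" is the text equivalent of a genomic repeat: with shorter fragments, the greedy choice can join the wrong pieces, which is one reason real assemblers need far more data and memory than this sketch suggests.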

Shared low-level transcripts may not reach the threshold for assembly.

Main problem (4 years ago): We have a massive amount of data that challenges existing computers when we try to assemble it all together.

Solution: Digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you...

Digital normalization

Digital normalization approach. A digital analog to cDNA library normalization, diginorm: is single pass (looks at each read only once); does not collect the majority of errors; keeps all low-coverage reads; smooths out coverage of sequencing. => Enables analyses that are otherwise completely impossible.
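As a rough illustration of the single-pass idea, the sketch below keeps a read only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. It is a simplification under stated assumptions: exact counts in a Python `Counter` instead of the memory-efficient probabilistic counting used in khmer, no reverse-complement handling, and hypothetical names (`kmers`, `diginorm`, `cutoff`) with toy parameter values.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=4, cutoff=3):
    """Single-pass normalization: each read is examined exactly once."""
    counts = Counter()  # k-mer abundances among the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Estimated coverage of this read = median abundance of its k-mers.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the read is redundant and is dropped; its k-mers
        # (including any erroneous ones) are never counted.
    return kept


if __name__ == "__main__":
    # Toy 10:1 mixture, echoing the A (10) to B (1) dilution example.
    transcript_a = "ACGTTGCATGGA"
    transcript_b = "TTAGGCCTAACC"
    reads = [transcript_a] * 100 + [transcript_b] * 10
    kept = diginorm(reads, k=4, cutoff=3)
    print(len(reads), "reads in;", len(kept), "reads kept")
```

Because the decision uses only counts from reads already kept, low-coverage reads always pass, while high-coverage transcripts stop contributing once they reach the cutoff. Real reads also carry errors and come from both strands, which this toy ignores.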

Evaluating diginorm: how? Can't assemble lamprey w/o diginorm; are results any good, and how would we know? Need a comparative data set: ascidians!

Looking at the Molgula. Putnam et al., 2008, Nature. Modified from Swalla 2001.

Sea squirts! Molgula oculata; Molgula occulta; Molgula oculata; Ciona intestinalis. Elijah Lowe; collaboration w/ Billie Swalla

Tail loss and notochord genes: a) M. oculata; b) hybrid (occulta egg x oculata sperm); c) M. occulta. Notochord cells in orange. Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996

Diginorm applied to Molgula embryonic mRNAseq

Substantial time savings (3-5x). ... 1 kb (compare with mouse: 17331 of 29769 genes are > 1 kb). So, estimation by rule of thumb is not that far off for long transcripts.

Common vs rare genes (figure: # transcripts vs. # samples). Camille Scott

Can look at transcripts by tissue -- Camille Scott

Too many samples: presence/absence clustering. Camille Scott
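One common way to start a presence/absence clustering is to compute a pairwise distance between samples from the sets of transcripts detected in each. The sketch below uses Jaccard distance; whether the figure on this slide used that measure is not stated in the source, and the sample names and transcript IDs are hypothetical toy data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(presence):
    """Map each (sample, sample) pair to its Jaccard distance.

    `presence` maps a sample name to the set of transcripts seen in it.
    """
    return {
        (s, t): jaccard_distance(presence[s], presence[t])
        for s, t in combinations(sorted(presence), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy data: which transcripts were detected in which tissue.
    presence = {
        "liver": {"t1", "t2", "t3"},
        "kidney": {"t1", "t2", "t4"},
        "brain": {"t5", "t6"},
    }
    for pair, dist in pairwise_distances(presence).items():
        print(pair, dist)
```

The resulting distances could then be passed to any hierarchical clustering routine; samples sharing most of their transcripts end up close together regardless of expression level.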

Expression-based clustering Some known biology recapitulated; and ??? Camille Scott

Next challenges OK, we can deal with volume of data, make pretty pictures, and ... Now what?

Contamination! Both experimental and real contaminants are big problems. Camille Scott

Pathway predictions vary dramatically depending on data set and annotation. Likit Preeyanon: KEGG pathway comparison across several different gene annotation sets for chicken.

The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome". "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum - a genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS One 11, e88889. Slide courtesy Erich Schwarz

Practical implications of diginorm Data is (essentially) free; For some problems, analysis is now cheaper than data gathering (i.e. essentially free); plus, we can run most of our approaches in the cloud (per-hour rental compute resources).

1. khmer-protocols: Effort to provide standard "cheap" assembly protocols for the cloud. Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. Open, versioned, forkable, citable. (Don't bother me unless it doesn't work.) Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression

CC0; BSD; on github; in reStructuredText.

A few thoughts on our approach: Explicitly a protocol: explicit steps, copy-paste, customizable. No requirement for computational expertise or significant computational hardware. ~1-5 days to teach a bench biologist to use. $100-150 of rental compute (cloud computing) for a $1000 data set. Adding in quality control and internal validation steps.

Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let's take advantage of it!) "It's as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that's madness: who on Earth would create such an amazing resource?" - http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

2. Data availability is important for annotating distant sequences. (Figure legend: anything else; mollusc; cephalopod; no similarity.)

Can we incentivize data sharing? ~$100-$150/transcriptome in the cloud. Offer to analyze people's existing data for free, IFF they open it up within a year. See: CephSeq white paper; "Dead Sea Scrolls & Open Marine Transcriptome Project" blog post.

First results: Loligo genomic/transcriptome resources. Putting other people's sequences where my mouth is: w/ Josh Rosenthal and Benton Gravely

Research singularity: The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.

Acknowledgements. Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Billie Swalla (UW); Josh Rosenthal (UPR); Weiming Li (MSU); Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM). Funding: USDA NIFA; NSF IOS; NIH; BEACON.

Current research (khmer software): Efficient online counting of k-mers; Trimming reads on abundance; Efficient De Bruijn graph representations; Read abundance normalization; Streaming algorithms for assembly, variant calling, and error correction; Cloud assembly protocols; Efficient graph labeling & exploration; Data set partitioning approaches; Assembly-free comparison of data sets; HMM-guided assembly; Efficient search for target genes
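To give a flavor of what "efficient online counting of k-mers" can mean, here is a minimal sketch of a Count-Min Sketch, a fixed-memory probabilistic counter that may overestimate a count but never underestimates it. It illustrates the general technique only and is not khmer's implementation; the class name, the SHA-256-based hashing, and the table sizes are assumptions chosen for readability rather than speed.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: `depth` tables of `width` cells."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One (row, column) cell per table, using a row-salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._cells(item))


def count_kmers(reads, k, sketch):
    """Stream reads once, adding every k-mer to the sketch."""
    for read in reads:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])
    return sketch


if __name__ == "__main__":
    sketch = count_kmers(["ACGTACGT", "ACGTTTTT"], k=4, sketch=CountMinSketch())
    print("ACGT seen at least", sketch.count("ACGT"), "times (approximate)")
```

Memory use is set by `width` and `depth` up front and does not grow with the number of distinct k-mers, which is the property that matters when sequencing errors keep producing new ones.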
