C. Titus Brown, Assistant Professor, MMG, CSE, BEACON, Michigan State University. May 2014. [email protected]
Large-scale transcriptome sequencing of non-model organisms: coping mechanisms
We practice open science! Everything discussed here: Code:
github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog
(titus brown blog). Twitter: @ctitusbrown. Grants on lab Web site
(http://ged.msu.edu/research.html). Preprints available. Everything is
> 80% reproducible by you.
The challenges of non-model transcriptomics: Missing or low
quality genome reference. Evolutionarily distant. Most extant
computational tools focus on model organisms: they assume low
polymorphism (internal variation), assume a reference genome, assume
somewhat reliable functional annotation, and assume more significant
compute infrastructure; they cannot easily or directly be used on
critters of interest.
Outline
1. Challenges of non-model transcriptomics.
2. Lamprey: too much data, not enough genome.
3. Digital normalization as a coping mechanism...
4. ...applied to Molgulid ascidians...
5. ...and back to lamprey.
6. More transcriptome challenges.
7. What's next? (Implications of free data + free data analysis.)
Sea lamprey in the Great Lakes: Non-native. Parasite of medium to
large fishes. Caused populations of host fishes to crash. Li Lab / Y-W
C-D
The problem of lamprey: Diverged at base of vertebrates;
evolutionarily distant from model organisms. Large, complicated
genome (~2 Gb). Relatively little existing sequence. We sequenced
the liver genome.
Lamprey has incomplete genomic sequence (J. Smith et al., PNAS
2009): Evidence of somatic recombination; 100s of Mb of sequence
eliminated from genome during development. More recent evidence
(unpub., J. Smith et al.) suggests that this loss is developmentally
regulated, results in changes in gene expression (due to loss of
genes!), and is tissue specific. Liver genome is not the entire
genome.
Lamprey tissues for which we have mRNAseq: embryo stages (late
blastula, gastrula, neurula, 22b, neural-crest migration, 24c1,
24c2); metamorphosis 3 (intestine, kidney); ovulatory female head
skin; adult intestine; metamorphosis 4 (intestine, kidney);
preovulatory female eye; adult kidney; metamorphosis 5 (liver,
intestine, kidney); preovulatory female tail skin; brain paired;
metamorphosis 6 (intestine, kidney); prespermiating male gill;
freshwater (gill, intestine, kidney); metamorphosis 7 (intestine,
kidney); mature adult male rope tissue; larval (gill, kidney, liver,
intestine); monocytes; spermiating male gill; juvenile (intestine,
liver, kidney); brain (0, 3, 21 dpi); spermiating male head skin;
lips; spinal cord (0, 3, 21 dpi); supraneural tissue; metamorphosis 1
(intestine, kidney); spermiating male muscle; small parasite distal
intestine, kidney, proximal intestine; metamorphosis 2 (liver,
intestine, ...); salt water (gill, intestine)
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
=> It was the best of times, it was the worst of times, it was the age
of wisdom, it was the age of foolishness
...but for lots and lots of fragments!
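The fragment puzzle on this slide can be played out in a few lines of Python. This is only a toy sketch: it greedily merges the pair of fragments with the longest suffix/prefix overlap, which is not how the assemblers discussed in this talk work (they build De Bruijn graphs from k-mers). The function names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for illustration.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=5):
    """Repeatedly merge the two contigs with the largest overlap until none overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# The four fragments shown on the slide.
fragments = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(fragments)
print(contigs[0])
```

With only four fragments the repeated phrase "it was the" happens not to mislead the greedy choice; with real data, repeats and sequencing errors are exactly what make the problem hard.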
Shared low-level transcripts may not reach the threshold for
assembly.
Main problem (4 years ago): We have a massive amount of data
that challenges existing computers when we try to assemble it all
together.
Solution: Digital normalization (a computational version of
library normalization). Suppose you have a dilution factor of A (10)
to B (1). To get 10x of B you need to get 100x of A! Overkill!! This
100x will consume disk space and, because of errors, memory. We can
discard it for you...
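The arithmetic on this slide, written out as a small Python sketch. The helper name `coverage_at_target` is hypothetical, and the model simply assumes coverage scales linearly with relative abundance.

```python
def coverage_at_target(target_cov, abundances):
    """Coverage each transcript receives once the rarest one reaches target_cov.

    Assumes coverage is proportional to relative abundance (a simplification).
    """
    rarest = min(abundances.values())
    return {name: target_cov * ab / rarest for name, ab in abundances.items()}


# A is 10 times as abundant as B; we want 10x coverage of B.
cov = coverage_at_target(10, {"A": 10, "B": 1})
print(cov)  # {'A': 100.0, 'B': 10.0}

# Coverage of A beyond the 10x target is redundant for assembly.
excess_a = cov["A"] - 10
```

In this model 90 of the 100x coverage of A is redundant, and that redundant part is what digital normalization aims to discard.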
Digital normalization
Digital normalization approach: A digital analog to cDNA library
normalization, diginorm: Is single pass (looks at each read only
once); Does not collect the majority of errors; Keeps all
low-coverage reads; Smooths out coverage of sequencing. =>
Enables analyses that are otherwise completely impossible.
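A minimal sketch of the single-pass idea, assuming the median-k-mer-abundance rule: estimate a read's coverage as the median count of its k-mers among the reads kept so far, keep the read only if that estimate is below a cutoff, and count k-mers only from kept reads. The names `diginorm`, `k`, and `cutoff` are illustrative. An exact Python dict stands in here for the memory-efficient probabilistic counting that the khmer software uses, and reverse complements are ignored.

```python
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=4, cutoff=3):
    """Single pass: keep a read only if its estimated coverage is below cutoff."""
    counts = {}  # exact k-mer counts (an assumption; not memory-efficient)
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k
        # Median k-mer count so far serves as the read's coverage estimate.
        if median(counts.get(km, 0) for km in ks) < cutoff:
            kept.append(read)
            for km in ks:  # only kept reads update the counts
                counts[km] = counts.get(km, 0) + 1
    return kept


# Ten copies of an abundant read, one copy of a rare read.
reads = ["ACGTTGCAAG"] * 10 + ["TTGGCCAATT"]
kept = diginorm(reads, k=4, cutoff=3)
print(len(kept))  # 4: three abundant copies plus the rare read
```

Because discarded reads never update the counts, most of the errors they carry are never stored, which is where the memory saving comes from. In practice k and the cutoff are much larger than in this toy example.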
Evaluating diginorm: how? Can't assemble lamprey w/o diginorm;
are results any good & how would we know? Need comparative data
set: ascidians!
Looking at the Molgula. Putnam et al., 2008, Nature. Modified
from Swalla 2001.
Tail loss and notochord genes: a) M. oculata; b) hybrid (occulta
egg x oculata sperm); c) M. occulta. Notochord cells in orange.
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15
November 1996.
Diginorm applied to Molgula embryonic mRNAseq
Substantial time savings (3-5x).
1kb (compare with mouse: 17331 of 29769 genes are > 1kb). So,
estimation by thumb is not that far off for long transcripts.
Common vs rare genes (axes: # transcripts, # samples). Camille Scott
Can look at transcripts by tissue -- Camille Scott
Too many samples. Camille Scott
Presence/absence clustering. Expression-based clustering. Some known
biology recapitulated; and ??? Camille Scott
Next challenges OK, we can deal with volume of data, make
pretty pictures, and ... Now what?
Contamination! Both experimental and real contaminants are big
problems. Camille Scott
Pathway predictions vary dramatically depending on data set and
annotation. Likit Preeyanon: KEGG pathway comparison across several
different gene annotation sets for chicken.
The problem of lopsided gene characterization is pervasive:
e.g., the brain "ignorome". "...ignorome genes do not differ from
well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs,
paralogs, or protein domains. The major distinguishing
characteristic between these sets of genes is date of discovery,
early discovery being associated with greater research momentum -- a
genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS One 9(2),
e88889. Slide courtesy Erich Schwarz.
Practical implications of diginorm Data is (essentially) free;
For some problems, analysis is now cheaper than data gathering
(i.e. essentially free); plus, we can run most of our approaches in
the cloud (per-hour rental compute resources).
1. khmer-protocols: Effort to provide standard cheap assembly
protocols for the cloud. Entirely copy/paste; ~2-6 days from raw
reads to assembly, annotations, and differential expression
analysis. Open, versioned, forkable, citable. (Don't bother me
unless it doesn't work.) Steps: Read cleaning; Diginorm; Assembly;
Annotation; RSEM differential expression.
CC0; BSD; on github; in reStructuredText.
A few thoughts on our approach: Explicitly a protocol: explicit
steps, copy-paste, customizable. No requirement for computational
expertise or significant computational hardware. ~1-5 days to teach
a bench biologist to use. $100-150 of rental compute (cloud
computing) for a $1000 data set. Adding in quality control and
internal validation steps.
Can we crowdsource bioinformatics? We already are!
Bioinformatics is already a tremendously open and collaborative
endeavor. (Let's take advantage of it!) "It's as if somewhere, out
there, is a collection of totally free software that can do a far
better job than ours can, with open, published methods, great
support networks and fantastic tutorials. But that's madness -- who on
Earth would create such an amazing resource?" -
http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
2. Data availability is important for annotating distant
sequences. Figure legend: Anything else; Mollusc; Cephalopod; no similarity.
Can we incentivize data sharing? ~$100-$150/transcriptome in
the cloud. Offer to analyze people's existing data for free, IFF they
open it up within a year. See: CephSeq white paper; "Dead Sea
Scrolls & Open Marine Transcriptome Project" blog post.
First results: Loligo genomic/transcriptome resources. Putting
other people's sequences where my mouth is: w/Josh Rosenthal and
Benton Gravely
Research singularity: The data a researcher generates in their
lab constitutes an increasingly small component of the data used to
reach a conclusion. Corollary: The true value of the data an
individual investigator generates should be considered in the
context of aggregate data. Even if we overcome the social barriers
and incentivize sharing, we are, needless to say, not remotely
prepared for sharing all the data.
Acknowledgements
Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze,
Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom,
Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish,
Michael Crusoe, Leigh Sheneman.
Collaborators: Billie Swalla (UW); Josh Rosenthal (UPR); Weiming Li,
MSU; Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM).
Funding: USDA NIFA; NSF IOS; NIH; BEACON.
Current research (khmer software): Efficient online counting of
k-mers; Trimming reads on abundance; Efficient De Bruijn graph
representations; Read abundance normalization; Streaming algorithms
for assembly, variant calling, and error correction; Cloud assembly
protocols; Efficient graph labeling & exploration; Data set
partitioning approaches; Assembly-free comparison of data sets;
HMM-guided assembly; Efficient search for target genes.