2012 erin-crc-nih-seattle

1. Extracting genomes from community sequencing: What works, what will work, and what needs work. C. Titus Brown, [email protected] Science; Microbiology; BEACON; Michigan State University

2. Warnings. "This talk contains forward looking statements. These forward-looking statements can be identified by terminology such as 'will', 'expects', and 'believes'." -- Safe Harbor provisions of the U.S. Private Securities Litigation Act. "Making predictions is difficult, especially if they're about the future." -- Attributed to Niels Bohr

3. Thanks for the invitation! So, Linda Mansfield and I were talking one day. Her: It'd be great to be able to look at communities with sequencing. Me: Oh, yeah, we can do that now. My overall interest is in good hypothesis generation from computational data, with a focus on sequence data. For the past three years, I have been working on this specifically for soil metagenomics (and mRNAseq, too).

4. Deep connection between human gut and soil

5. Soil is full of uncultured microbes. Estimates of microbial diversity in agricultural soil: ~1m species/gram. Randy Jackson

6. SAMPLING LOCATIONS

7. Soil contains thousands to millions of species (collector's curves of ~species). [Figure: number of OTUs (0 to 2000) vs. number of sequences (100 to 8100) for Iowa corn, Iowa native prairie, Kansas corn, Kansas native prairie, Wisconsin corn, Wisconsin native prairie, Wisconsin restored prairie, and Wisconsin switchgrass.]

8. Ecology => function emphasis. What's there? Is it really that complex a community? How deep do we need to sequence to sample thoroughly and systematically? How is ecological complexity created & maintained? How does ecological complexity respond to perturbation? What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways? What kind of organismal and functional overlap is

9. The human gut is a diverse place. Dethlefsen et al., 2008

10. Ecology vs function in human gut. We can observe recovery of diversity after Cipro treatment; but what is driving recovery at a functional level? Dethlefsen and Relman, 2011

11.
Culture-independent methods. Observation that 99% of microbes cannot easily be cultured in the lab (the "great plate count anomaly"). While this is less true for host-associated microbes, culture-independent methods are still important for: syntrophic relationships; niche-specificity or unknown physiology; dormant microbes; abundance within communities. Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.

12. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. Wikipedia: Environmental shotgun sequencing.p

13. Shotgun sequencing & assembly. Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)

14. Shotgun sequencing & assembly. Why assembly? Assumption free (no reference needed). Necessary for soil and marine; useful for host-associated? Assembly can serve as reference for transcriptome interpretation. Fragment, sequence, computationally assemble. What kind of results do you get? Almost certainly chimerism between different strains; but still useful for gene content & operon structure. Specificity seems high, but sensitivity is dependent on sequencing depth. Because of sampling rate, Illumina is primary choice.

15. Shotgun metagenomics: good news. Cheap and easy to generate vast whole metagenome/metatranscriptome shotgun data sets from essentially any community you can sample. Such data can be quite interesting! Low hanging fruit: correlation with diet, etc. Still early days for observation of pan genome and functional content. Potential to illuminate or inform: dynamics and selective pressures of antibiotic resistance, virulence genes, and pathogenicity islands; phage and viral communities; community interactions.

16. Shotgun metagenomics: bad news. Computational techniques are still relatively immature. Mapping to known genomes? Discovery of unknown genomes & strain variants? Sensitivity and specificity are hard to evaluate.
Computational ecosystem is not that rich. Interpreting the data is still the bottleneck, of course. Vast majority of genes not usefully annotated. Reliance on specific reference databases, annotations. Tools for (e.g.) inferring community interactions from community dynamics & functional capacity are desperately needed.

17. The computational conundrum. More data => better. And more data => computationally more challenging.

18. 1. Assembly depends on high coverage

19. 2. Big data sets require big machines. For even relatively small data sets, metagenomic assemblers scale poorly. Memory usage ~ real variation + number of errors. Number of errors ~ size of data set. Size of data set == big!! (Estimated 6 weeks x 3 TB of RAM to do 300 gb soil sample, with a slightly modified conventional assembler.)

20. Our Grand Challenge dataset

21. Approach 1: Partitioning. Split reads into bins belonging to different source species. Can do this based almost entirely on connectivity of sequences.

22. Technical challenges met (and defeated). Novel data structure properties elucidated via percolation theory analysis (Pell, Hintze, et al., in review, PNAS). Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. Sequencing technology introduces false sequences in graph (Howe et al., in prep.). Only 20x improvement in assembly scaling.

23. (NOVEL) Approach 2: Digital normalization. Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.

24. Digital normalization discards redundant reads prior to assembly. This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci. Discarded reads can be used after assembly for quantitative analysis.

25. A read's median k-mer count is a good estimator of coverage. This gives us a reference-free measure of coverage.

26. Shotgun data is often (1) high coverage and (2) biased in coverage.
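The digital normalization idea from slides 23 to 25 can be sketched in a few lines of Python. This is a minimal illustration, not the khmer implementation: it assumes exact in-memory k-mer counts (a plain `Counter`), a toy k-mer length `K`, and a coverage cutoff `CUTOFF`, all of which are hypothetical choices made for the example. It also assumes that only retained reads update the counts, which is one plausible reading of "discards redundant reads".

```python
from collections import Counter
from statistics import median

K = 4        # toy k-mer length (assumption; real data would use a larger k)
CUTOFF = 3   # hypothetical coverage cutoff


def kmers(read, k=K):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=K, cutoff=CUTOFF):
    """Keep a read only while its estimated coverage is below the cutoff."""
    counts = Counter()   # exact counts, chosen for clarity (assumption)
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue     # read shorter than k: no coverage estimate, skip it
        # The median k-mer count seen so far is the reference-free
        # coverage estimate described on slide 25.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)   # only retained reads add to the counts
    return kept


# Ten copies of an abundant read plus one rare read, echoing the
# A (10) to B (1) dilution example on slide 23.
reads = ["ACGTACGGTC"] * 10 + ["TTGACCAGTA"]
kept = normalize(reads)
print(len(kept), "of", len(reads), "reads kept")
```

In this toy input the abundant read is capped at the cutoff while the rare read survives, which is the coverage-flattening behavior the slides describe.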
(MD amplified)

27. Digital normalization fixes all that. Normalizes coverage. Discards redundancy. Eliminates majority of errors. Scales assembly dramatically. Assembly is 98% identical.

28. Digital normalization retains information, while discarding data and errors

29. Evaluating sensitivity & specificity. E. coli @ 10x + soil: 98.5% of E. coli

30. How much? A mathematical interlude. Suppose we need 10x coverage to assemble a microbial genome, and microbial genomes average 5e6 bp of DNA. Further suppose that we want to be able to assemble a microbial species that is 1 in 100000, i.e. 1 in 1e5. Shotgun sequencing samples randomly, so we must sample deeply to be sensitive. 10x coverage x 5e6 bp x 1e5 =~ 50e11, or 5 Tbp of sequence.

31. Example: Dethlefsen shotgun data set / Relman lab. 251 m reads / 16 gb FASTQ gzipped. ~24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2 (reads => final assembly + mapping). Assembly stats: 58,224 contigs > 1000 bp (average 3 kb), summing to 190 mb genomic; ~38 microbial genomes worth of DNA; ~65% of reads mapped back to assembly.

32. What do we get for soil?

Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplb genes
2.5 bill         4.5 mill        19%                 5.3 mill                   391
3.5 bill         5.9 mill        22%                 6.8 mill                   466

(The rplb genes column estimates number of species.) Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Adina Howe

33. Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? YES. Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes.

34. Concluding thoughts on assembly. Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.
We're working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types. The community is working on dealing with data downstream of sequencing & assembly. Most pipelines were built around 454 data: long reads, and relatively few of them. With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.

35. The interpretation challenge. For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples. The vast majority of this genomic DNA contains unknown genes with largely unknown function. Most annotations of gene function & interaction are from a few phylogenetically limited model organisms. Est. 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology. Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.

36. How will we annotate soil??

Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplb genes
2.5 bill         4.5 mill        19%                 5.3 mill                   391
3.5 bill         5.9 mill        22%                 6.8 mill                   466

(The rplb genes column estimates number of species.) Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Adina Howe

37. Some lessons from C. jejuni. In vivo murine transfer experiments demonstrate substantial capacity for C. jejuni 11168 to adapt solely via modification of poly-G tracts (Jerome et al., 2011). Bell et al. (unpub) have shown substantial variability in gene content of Campylobacter strains. Gene content and gene expression are both important to understanding mechanisms of pathogenicity. In vitro serial transfer experiments demonstrate that rapid genomic adaptation to new environments occurs at multiple loci, with substantial variation in genes of unknown function (Jerome et al., in preparation).

38.
Multilocus strain variation in C. jejuni drives rapid adaptation

39. What works? Today, from deep metagenomic data, you can get the gene and operon content (including abundance of both) from communities. You can get microarray-like expression information from metatranscriptomics.

40. What needs work? Assembling ultra-deep samples is going to require more engineering, but is straightforward. ("Infinite assembly.") Building scaffolds and extracting whole genomes has been done, but I am not yet sure how feasible it is to do systematically with existing tools (cf. Armbrust lab).

41. What will work, someday? Sensitive analysis of strain variation. Both assembly and mapping approaches do a poor job detecting many kinds of biological novelty. The 1000 Genomes Project has developed some good tools that need to be evaluated on community samples. Ecological/evolutionary dynamics in vivo. Most work done on 16S, not on genomes or functional content. Here, sensitivity is really important!

42. What are future needs? High-quality, medium+ throughput annotation of genomes? Extrapolating from model organisms is both immensely important and yet lacking. Strong phylogenetic sampling bias in existing annotations. Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn't scale.)

43. Pubs, software, tutorials, etc. Metagenome assembly / HMP tutorial: http://ged.msu.edu/angus/nih-hmp-2012/ Everything I discussed is available pre-pub -- contact [email protected], or Google for: khmer software package; kmer-percolation paper (in re-review, PNAS); digital normalization paper (in review, PLoS One). A few dozen people using, one way or another.

44. Acknowledgements. Jason Pell, Qingpeng Zhang, Arend Hintze, and Adina Howe. Soil: Jim Tiedje (MSU), Janet Jansson (LBNL/JGI), Susannah Tringe (JGI). Campy: Linda Mansfield, Julia Bell, JP Jerome, Jeff Barrick. Funding: USDA NIFA; NSF IOS; BEACON.
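The back-of-the-envelope numbers on slides 30 to 32 can be checked with a short Python sketch. The function names are hypothetical, and two readings of the slides are assumed here: that "190 mb" and "bill" denote base pairs, and that the ~1200 genome figure comes from summing the two soil assemblies and dividing by the 5e6 bp average genome size.

```python
AVG_GENOME_BP = 5e6   # average microbial genome size from slide 30


def required_sequence_bp(coverage, genome_bp, one_in):
    """Sequence needed to reach `coverage` on a genome that is 1 in `one_in`."""
    return coverage * genome_bp * one_in


def genome_equivalents(assembled_bp, genome_bp=AVG_GENOME_BP):
    """Express an amount of assembled DNA as a number of average genomes."""
    return assembled_bp / genome_bp


# Slide 30: 10x coverage of a species present at 1 in 1e5.
needed = required_sequence_bp(10, AVG_GENOME_BP, 1e5)
print(f"rare-genome sequencing depth: {needed / 1e12:.0f} Tbp")

# Slide 31: 190 mb of assembled gut contigs (assumed to mean 190e6 bp).
gut = genome_equivalents(190e6)
print(f"gut assembly: ~{gut:.0f} genomes")

# Slide 32: two soil assemblies of 2.5 and 3.5 "bill" (assumed billion bp).
soil = genome_equivalents(2.5e9 + 3.5e9)
print(f"soil assemblies: ~{soil:.0f} genomes")
```

Under those assumptions the three results agree with the figures quoted on the slides (5 Tbp, ~38 genomes, ~1200 genomes).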
