Investigate the diversity of extremely complex metagenomic samples
Qingpeng Zhang
Department of Computer Science and Engineering
Michigan State University
Supervisor: Dr. Titus Brown
Outline
● Significance and background
– Metagenomics
– Microbial diversity measurement
● Preliminary results
– A novel method to investigate microbial diversity based on an efficient k-mer counting approach
● Proposed research
– Prove effectiveness using test data sets
– Tackle extremely large metagenomic data sets generated from extremely complex microbial samples
The Great Prairie Grand Challenge
● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?
● “Grand Challenge”: extremely large data sets from an extremely complex microbial community
– An estimated 50 Tbp of sequence is needed for an individual gram of soil (Jason Gans, 2005)
– In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)
– Over a terabase of sequence from Iowa cultivated and uncultivated soil
Metagenomics and Next Generation Sequencing
Diversity measurement based on different unit concepts
[Figure labels: species, individuals; OTUs (97% similarity of 16S sequences), 16S rRNA sequences; unique k-mers, total k-mers in WGS data; whole genome sequencing reads. Source: Nature Reviews Genetics 6, 805-814, etc.]
Statistics for Diversity Estimation
● Rarefaction curve
– Quite incapable of dealing with the scale of diversity of the microbial world
● Extrapolation from curves
● Parametric estimators (need relative species abundance)
● Non-parametric estimators (Chao1, etc.)
– Lower bound estimator
– Sensitive to underlying distribution
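As an illustration of the non-parametric estimators mentioned above, here is a minimal sketch of the classic Chao1 richness estimator. The function name and the input format (a list of observed counts per species or OTU) are assumptions made for this example; the slides do not specify an implementation.

```python
def chao1(abundances):
    """Chao1 lower-bound richness estimate from observed per-species counts."""
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)                      # species actually observed
    f1 = sum(1 for a in counts if a == 1)    # singletons
    f2 = sum(1 for a in counts if a == 2)    # doubletons
    if f2 > 0:
        # classic form: S_obs + F1^2 / (2 * F2)
        return s_obs + f1 * f1 / (2.0 * f2)
    # bias-corrected form for the case with no doubletons
    return s_obs + f1 * (f1 - 1) / 2.0


# hypothetical toy sample: 7 observed species, 3 singletons, 2 doubletons
estimate = chao1([5, 3, 1, 1, 1, 2, 2])
```

Because the estimate depends only on the singleton and doubleton counts, it is a lower bound and shifts noticeably when the rare tail of the abundance distribution changes.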
The Goal of this Project
● Using whole genome shotgun metagenomic data sets rather than 16S rRNA
– Measuring the microbial diversity of samples (alpha-diversity)
– Comparing microbial samples (beta-diversity)
● A novel method that is:
– Binning-free
– Assembly-free
– Annotation-free
– Reference-free
● Efficient (Memory and Time)
– extremely large shotgun metagenomic data sets (Terabytes, etc.)
– extremely diverse microbial communities (Soil, etc.)
Preliminary Results
● A novel method to investigate microbial diversity based on an efficient k-mer counting approach
– Diversity measurement of one sample
– Comparison of multiple samples
● An approach to count k-mers efficiently
An Approach to Count k-mers Efficiently
• Highly scalable: constant memory consumption, independent of k and dataset size
• Probabilistic properties well suited to next generation sequencing datasets
• Tradeoff: a certain false positive rate in the counts because of hash collisions
(Zhang, Pell, Canino-Koning, Howe, & Brown, 2013, submitted)
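The properties listed above (fixed memory, counts that can be inflated by collisions) can be illustrated with a small probabilistic counter in the style of a Count-Min Sketch. This is a sketch under stated assumptions, not khmer's actual implementation: the class name, the table sizes, and the use of `hashlib.blake2b` are all choices made for this example.

```python
import hashlib


class ApproxKmerCounter:
    """Fixed-memory k-mer counter; counts may be too high, never too low."""

    def __init__(self, table_sizes=(100003, 100019, 100043)):
        # Memory is fixed by the table sizes, whatever k or the data size is.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # One hash value, reduced modulo each (different) table size.
        return [h % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def get(self, kmer):
        # A collision can only inflate a slot, so the minimum is the best guess.
        return min(t[s] for t, s in zip(self.tables, self._slots(kmer)))

    def consume(self, seq, k):
        for i in range(len(seq) - k + 1):
            self.add(seq[i:i + k])


counter = ApproxKmerCounter()
counter.consume("ACGTACGTAC", 4)   # ACGT, CGTA and GTAC occur twice, TACG once
```

Querying is possible at any point while sequences are still being consumed, which is what makes this kind of structure usable for online counting.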
What is khmer's advantage?
● Good performance in time/memory usage
● Online counting, updating and retrieving (important for this project!!)
● With Python API – flexible and expandable
Median k-mer frequency to represent the sequencing coverage of a read
Using the median k-mer frequency rather than the average k-mer frequency can decrease the influence of sequencing errors.
Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5m reads).
(Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)
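The effect of using the median can be seen in a toy example. The sketch below uses exact counting with `collections.Counter` instead of khmer, ignores reverse complements, and uses made-up reads; all function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def median_kmer_count(read, counts, k):
    return median([counts[km] for km in kmers(read, k)])


def mean_kmer_count(read, counts, k):
    return mean([counts[km] for km in kmers(read, k)])


K = 4
good = "ACGTTGCAAGGCTTAC"
bad = "ACGTTGCTAGGCTTAC"    # the same read with one substituted base
counts = count_kmers([good] * 10 + [bad], K)

# The single error creates four k-mers seen only once. They pull the mean
# of the erroneous read down, while its median still reflects the coverage.
median_bad = median_kmer_count(bad, counts, K)
mean_bad = mean_kmer_count(bad, counts, K)
```

Here the erroneous read keeps a median k-mer count of 11 (the number of reads covering the segment), while its mean drops below 8.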
IGS (informative genomic segment)
Suppose there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X-1 other reads that cover the same DNA segment of the genome from which that read originates. So we can estimate that there are Y/X distinct DNA segments with a read coverage of X. We term these distinct DNA segments in a species' genome IGSs (informative genomic segments).
An IGS can represent the novel information of a genome.
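The Y/X estimate can be written as a few lines of code. This is a minimal sketch assuming that a per-read coverage value (for example, the median k-mer count of each read) has already been computed; the function name and the toy coverage values are hypothetical.

```python
from collections import Counter


def igs_by_coverage(read_coverages):
    """Map each coverage X to the estimated number of IGSs, Y / X."""
    reads_per_coverage = Counter(read_coverages)   # Y for each X
    return {x: y / x for x, y in reads_per_coverage.items() if x > 0}


# toy input: six reads at coverage 3, two at coverage 2, two at coverage 1
coverages = [3, 3, 3, 3, 3, 3, 2, 2, 1, 1]
igs = igs_by_coverage(coverages)
total_igs = sum(igs.values())
```

In this toy input the ten reads collapse to an estimated five IGSs: two at coverage 3, one at coverage 2, and two at coverage 1. The resulting table of IGS counts per coverage plays the role that an OTU abundance table plays in 16S-based analysis.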
N = G/(L-k+1), e.g. 1000000/(80-22+1) ≈ 16949
Borrowing statistical methods from OTU-based diversity analysis (rarefaction curve, estimators, etc.)
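The formula can be checked numerically. The slide does not define its symbols, so the reading used here is an assumption: N is the expected number of IGSs, G the genome size in bases, L the read length, and k the k-mer size.

```python
def expected_igs(genome_size, read_length, k):
    """N = G / (L - k + 1), under the assumed meaning of the symbols."""
    return genome_size / (read_length - k + 1)


# values from the slide: G = 1000000, L = 80, k = 22
n = expected_igs(1000000, 80, 22)
```

With these values a read contains 59 k-mers, and N comes out at roughly 16949.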
Compare the contents of multiple metagenomics samples
● How different are two samples?
– If the sequencing coverage of a read from sample A in sample B is > 0, the segment in sample A from which that read originates exists in sample B
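The comparison rule above can be sketched as follows: count the k-mers of sample B, then estimate the coverage of each read of sample A against those counts and keep the reads with coverage > 0. This sketch uses exact counting and the median k-mer count as the coverage estimate, ignores reverse complements, and assumes every read is at least k bases long; the function names and reads are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_shared(reads_a, reads_b, k):
    """Fraction of reads in A whose median k-mer count in B is > 0."""
    counts_b = Counter()
    for read in reads_b:
        counts_b.update(kmers(read, k))
    shared = 0
    for read in reads_a:
        coverage_in_b = median([counts_b[km] for km in kmers(read, k)])
        if coverage_in_b > 0:
            shared += 1
    return shared / len(reads_a)


sample_a = ["ACGTTGCAAGGCTTAC", "TTTTTTTTTTTTTTTT"]
sample_b = ["ACGTTGCAAGGCTTAC"]
shared_fraction = fraction_shared(sample_a, sample_b, 4)
```

Only the first read of sample A has any k-mers in sample B, so half of sample A is reported as present in B.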
Synthetic dataset A (same abundance):
– SampleA: 100 species with 80 common to B
– SampleB: 100 species with 80 common to A
– SampleC: 100 species with 20 common to A/B, and 60 common to D
– SampleD: 100 species with 20 common to A/B, and 60 common to C
Synthetic dataset B:
– Sample1A:
● species IDs: 1,2,3,4,5,6,7,8,9,10 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample1B:
● species IDs: 1,2,3,14,15,16,17,18,19,20 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample1C:
● species IDs: 21,22,3,4,5,6,7,8,9,10 relative abundance: 2:2:2:2:2:3:4:16:18:20
– A and B: high overlap on individual level, low overlap on species level
– A and C: high overlap on species level, low overlap on individual level
– B and C low overlap on species level and low overlap on individual level
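The overlap pattern in synthetic dataset B can be verified directly from the listed species IDs and relative abundances. The two measures below are assumptions chosen for illustration (Jaccard index for the species level, summed minimum abundance over the sample total for the individual level); they are not necessarily the measures used in this work.

```python
def species_overlap(a, b):
    """Jaccard index on species presence."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


def individual_overlap(a, b):
    """Share of the individuals in a that are matched by individuals in b."""
    matched = sum(min(a[s], b[s]) for s in set(a) & set(b))
    return matched / sum(a.values())


abundance = [20, 18, 16, 4, 3, 2, 2, 2, 2, 2]
sample_1a = dict(zip([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], abundance))
sample_1b = dict(zip([1, 2, 3, 14, 15, 16, 17, 18, 19, 20], abundance))
sample_1c = dict(zip([21, 22, 3, 4, 5, 6, 7, 8, 9, 10],
                     [2, 2, 2, 2, 2, 3, 4, 16, 18, 20]))

pairs = {
    "A-B": (sample_1a, sample_1b),
    "A-C": (sample_1a, sample_1c),
    "B-C": (sample_1b, sample_1c),
}
overlaps = {name: (species_overlap(x, y), individual_overlap(x, y))
            for name, (x, y) in pairs.items()}
```

Under these measures A and B share 3 of 17 species but 54 of 71 individuals, A and C share 8 of 12 species but only 16 of 71 individuals, and B and C share 1 of 19 species and 2 of 71 individuals, which matches the pattern described above.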
What's Next
● Refine the methods
– Errors are still a problem.
– More statistics of IGSs (informative genomic segments)
● Prove effectiveness using test data sets
– Simulated data sets based on real microbial genomes
– MetaHIT, 124 metagenomic samples from 99 healthy people, and 25 patients with inflammatory bowel disease (IBD) syndrome. Each sample has on average 65 ± 21 million reads.
● Integrate functions into khmer package
The Great Prairie Grand Challenge
● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?
● “Grand Challenge”: extremely large data sets from an extremely complex microbial community
– Over a terabase of sequence from Iowa cultivated and uncultivated soil
– Should be prepared to face technical challenges when dealing with such large-scale data sets (storage, computing resources, HPCC, etc.)
– A preliminary result: the majority of the prairie reads (50%) are present in the corn with a coverage of > 0
Acknowledgement
● Dr. Titus Brown
● Lab members of GED
● Dr. Jason Pell
● Dr. Adina Howe
● Eric McDonald
● Everybody in this room