Page 1: 2015 ohsu-metagenome

C. Titus Brown

[email protected]

Associate Professor

Population Health and Reproduction

School of Veterinary Medicine

University of California, Davis

Concepts and tools for exploring very large sequencing data sets.

Page 2: 2015 ohsu-metagenome

Some background & motivation:

We primarily build tools to look at large sequencing data sets.

Our interest is in enabling scientists to move quickly from data to hypotheses.

Page 3: 2015 ohsu-metagenome

My goals

Enable hypothesis-driven biology through better hypothesis generation & refinement.

Devalue “interest level” of sequence analysis and put myself out of a job.

Be a good mutualist!

Page 4: 2015 ohsu-metagenome

Narrative arc

1. Shotgun metagenomics: can we reconstruct community genomes?

2. Underlying technology-enabled approach – tools and platforms are good.

3. My larger plan for world domination through technology and training – a kinder, gentler world (?).

Page 5: 2015 ohsu-metagenome

Shotgun metagenomics

Collect samples;

Extract DNA;

Feed into sequencer;

Computationally analyze.

[Figure: Environmental shotgun sequencing (Wikipedia)]

Page 6: 2015 ohsu-metagenome

Shotgun sequencing & assembly

Image credits: http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

Page 7: 2015 ohsu-metagenome

To assemble, or not to assemble?

Goals: reconstruct the phylogenetic content and predict the functional potential of the ensemble.

Should we analyze short reads directly?

OR

Do we assemble short reads into longer contigs first, and then analyze the contigs?

Page 8: 2015 ohsu-metagenome

Howe et al., 2014

Assemblies yield much more significant similarity matches.

Assembly: good for annotation!

Page 9: 2015 ohsu-metagenome

But! Isn’t assembly problematic?

Chimeric misassemblies?

Uneven coverage?

Strain variation?

Computationally challenging?

Page 10: 2015 ohsu-metagenome

I. Benchmarking metagenome assembly

Most assembly papers analyze novel data sets and then have to argue that their result is OK (guilty!).

Very few assembly benchmarks have been done.

Even fewer (trustworthy) computational time/memory comparisons have been done.

And even fewer “assembly recipes” have been written down clearly.

Page 11: 2015 ohsu-metagenome

Shakya et al., 2013; pmid 23387867

Page 12: 2015 ohsu-metagenome

A mock community!

~60 genomes, all sequenced;

Lab mixed with 10:1 ratio of most abundant to least abundant;

2×101 bp (paired-end) reads, 107 million reads total (Illumina);

10.5 Gbp of sequence in toto.

The paper also compared 16S primer sets & 454 shotgun metagenome data for reconstruction.

Shakya et al., 2013; pmid 23387867

Page 13: 2015 ohsu-metagenome

Paper conclusions

“Metagenomic sequencing outperformed most SSU rRNA gene primer sets used in this study.”

“The Illumina short reads provided very good estimates of taxonomic distribution above the species level, with only a two- to threefold overestimation of the actual number of genera and orders.”

“For the 454 data … the use of the default parameters severely overestimated higher-level diversity (~20-fold for bacterial genera) and identified >100 spurious eukaryotes.”

Shakya et al., 2013; pmid 23387867

Page 14: 2015 ohsu-metagenome

How about assembly??

Shakya et al. did not do assembly; there was no standard analysis approach at the time, and they were not assembly experts.

But we work on assembly!

And we’ve been working on a tutorial/process for doing it!

Page 15: 2015 ohsu-metagenome

[Flowchart: the Kalamazoo Metagenomics Protocol, derived from the approach used in Howe et al., 2014]

1. Adapter trim & quality filter.

2. Diginorm to C=10.

3. Trim high-coverage reads at low-abundance k-mers.

4. Diginorm to C=5.

5. Too big to assemble? Partition the graph, split into "groups", and reinflate the groups (optional).

6. Small enough to assemble? Assemble!!!

7. Map reads to the assembly and annotate contigs with abundances.

8. Downstream analysis with MG-RAST, etc.

Page 16: 2015 ohsu-metagenome

Computational protocol for assembly

Page 17: 2015 ohsu-metagenome

[Flowchart repeated from Page 15: the Kalamazoo Metagenomics Protocol]

Kalamazoo Metagenomics Protocol => benchmarking!

Assemble with Velvet, IDBA, SPAdes

Page 18: 2015 ohsu-metagenome

Benchmarking process

Apply three filtering treatments to the data:

basic quality trimming and filtering;

+ digital normalization;

+ partitioning.

Apply three assemblers to the data for each treatment:

IDBA, SPAdes, and Velvet.

Measure compute time/memory req’d.

Compare assembly results to the “known” answer with QUAST.

Page 19: 2015 ohsu-metagenome

Recovery, by assembler

(All assemblies on the quality-filtered data.)

                            Velvet     IDBA       SPAdes
Total length (>= 0 bp)      1.6E+08    2.0E+08    2.0E+08
Total length (>= 1000 bp)   1.6E+08    1.9E+08    1.9E+08
Largest contig              561,449    979,948    1,387,918
# misassembled contigs      631        1,032      752
Genome fraction (%)         72.949     90.969     90.424
Duplication ratio           1.004      1.007      1.004

Conclusion: SPAdes and IDBA achieve similar results.

Dr. Sherine Awad

Page 20: 2015 ohsu-metagenome

Treatments do not alter results very much.

IDBA results by treatment:

                            Default      Diginorm     Partition
Total length (>= 0 bp)      2.0E+08      2.0E+08      2.0E+08
Total length (>= 1000 bp)   1.9E+08      2.0E+08      1.9E+08
Largest contig              979,948      1,469,321    551,171
# misassembled contigs      1,032        916          828
Unaligned length            10,709,716   10,637,811   10,644,357
Genome fraction (%)         90.969       91.003       90.082
Duplication ratio           1.007        1.008        1.007

Dr. Sherine Awad

Page 21: 2015 ohsu-metagenome

Treatments do save compute time.

            Velvet                   IDBA                     SPAdes
            Time (h:m:s)  RAM (GB)   Time (h:m:s)  RAM (GB)   Time (h:m:s)  RAM (GB)
Quality     60:42:52      1,594      33:53:46      129        67:02:16      400
Diginorm    6:48:46       827        6:34:24       104        15:53:10      127
Partition   4:30:36       1,156      8:30:29       93         7:54:26       129

(Run on Michigan State HPC)

Dr. Sherine Awad

Page 22: 2015 ohsu-metagenome

Need to understand:

What is not being assembled and why?

Low coverage?

Strain variation?

Something else?

Effects of strain variation: no assembly.

Additional contigs being assembled – contamination? Spurious assembly?

Page 23: 2015 ohsu-metagenome

Assembly conclusions

90% recovery is not bad; relatively few misassemblies, too.

This was not a highly polymorphic community BUT it did have several closely related strains; more generally, we see that strains do generate chimeras, but not between different species.

…challenging to execute even with a tutorial/protocol.

Page 24: 2015 ohsu-metagenome

We need much deeper sampling!

Sharon et al., 2015 (Genome Res)

Overlap between synthetic long reads and short reads.

Page 25: 2015 ohsu-metagenome

Benchmarking & protocols

Our work is completely reproducible and open.

You can re-run our benchmarks yourself if you want!

We will be adding new assemblers as time permits.

Protocol is open, versioned, citable… but also still a work in progress :)

Page 26: 2015 ohsu-metagenome

II: Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
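The coverage arithmetic is simple enough to show directly. Below is a minimal sketch, assuming idealized uniform sampling (Lander-Waterman style); the function name and numbers are hypothetical, not from the talk.

```python
# A minimal coverage sketch: C = N * L / G under uniform sampling.
# Hypothetical numbers, for illustration only.

def expected_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average number of reads overlapping each true base."""
    return n_reads * read_len / genome_size

# e.g. one million 100 bp reads over a 10 Mbp genome:
print(expected_coverage(1_000_000, 100, 10_000_000))  # -> 10.0
```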

Page 27: 2015 ohsu-metagenome

Assembly depends on high coverage

HMP mock community

Page 28: 2015 ohsu-metagenome

Main questions --

I. How do we know if we’ve sequenced enough?

II. Can we predict how much more we need to sequence to see <insert some feature here>?

Note: the necessary sequencing depth cannot be accurately predicted solely from SSU/amplicon data.

Page 29: 2015 ohsu-metagenome

Method 1: looking for WGS saturation

We can track how many sequences we keep, out of the sequences we've seen, to detect saturation.
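One way to track this, in the spirit of digital normalization: keep a read only if its median k-mer abundance so far is below a cutoff, and watch the fraction of reads kept fall as the data saturate. This toy sketch uses an exact dict-based counter rather than khmer's low-memory probabilistic one; K, the cutoff, and the function name are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median

K, CUTOFF = 20, 10
counts = defaultdict(int)  # toy exact k-mer counter (khmer uses a sketch)

def keep_read(seq: str) -> bool:
    """Keep a read only if its median k-mer count so far is below CUTOFF."""
    kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
    if median(counts[km] for km in kmers) >= CUTOFF:
        return False              # region already saturated; discard
    for km in kmers:
        counts[km] += 1           # only count k-mers from kept reads
    return True

# Streaming over reads, the running fraction of kept reads approaches
# zero as the data set saturates.
```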

Page 30: 2015 ohsu-metagenome

Data from Shakya et al., 2013 (pmid: 23387867)

We can detect saturation of shotgun sequencing

Page 31: 2015 ohsu-metagenome

Data from Shakya et al., 2013 (pmid: 23387867)

We can detect saturation of shotgun sequencing

C=10, for assembly

Page 32: 2015 ohsu-metagenome

Estimating metagenome nt richness:

# bp at saturation / coverage

MM5 deep carbon: 60 Mbp

Iowa prairie soil: 12 Gbp

Amazon Rain Forest Microbial Observatory soil: 26 Gbp

Assumes: few entirely erroneous reads (upper bound); at saturation (lower bound).
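The estimate itself is a single division; here is a hedged worked example. Only the formula comes from the slide; the input numbers below are made up for illustration.

```python
# Nucleotide richness ~= bp of sequence at saturation / coverage.
bp_at_saturation = 120e9   # hypothetical: saturation at 120 Gbp of reads
coverage = 10              # the diginorm cutoff used above, C=10
richness = bp_at_saturation / coverage
print(f"~{richness / 1e9:.0f} Gbp of distinct sequence")  # ~12 Gbp
```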

Page 33: 2015 ohsu-metagenome

WGS saturation approach:

Tells us when we have enough sequence.

Can’t be predictive… if you haven’t sampled something, you can’t say anything about it.

Can we correlate deep amplicon sequencing with shallower WGS?

Page 34: 2015 ohsu-metagenome

Correlating 16s and shotgun WGS

How much of the 16S content do you see… with how much shotgun sequencing?

Page 35: 2015 ohsu-metagenome

Data from Shakya et al., 2013 (pmid: 23387867)

WGS saturation ~matches 16s saturation

<rRNA copy number>

Page 36: 2015 ohsu-metagenome

Method is robust to organisms unsampled by amplicon sequencing.

Insensitive to amplicon primer bias.

Robust to genome size differences, eukaryotes, phage.

Data from Shakya et al., 2013 (pmid: 23387867)

Page 37: 2015 ohsu-metagenome

Can examine specific OTUs

Data from Shakya et al., 2013 (pmid: 23387867)

Page 38: 2015 ohsu-metagenome

OTU abundance is ~correct.

Data from Shakya et al., 2013 (pmid: 23387867)

Page 39: 2015 ohsu-metagenome

Running on real communities --

Page 40: 2015 ohsu-metagenome

Running on real communities --

Page 41: 2015 ohsu-metagenome

Concluding thoughts on metagenomes -

The main obstacle to recovering genomic details of communities is shallow sampling.

Considerably deeper sampling is needed – 1000× deeper (petabase-pair scale sampling).

This will inevitably happen!

…I would like to make sure the compute technology is there, when it does.

Page 42: 2015 ohsu-metagenome

More generally: computation needs to scale!

Navin et al., 2011

Page 43: 2015 ohsu-metagenome

Cancer investigation ~ metagenome investigation

Some basic math: 1,000 single cells from a tumor, each sequenced to 40× haploid coverage with Illumina, yields 120 Gbp per cell – or 120 Tbp of data in total.

HiSeq X10 can do the sequencing in ~3 weeks.

The variant calling will require 2,000 CPU weeks…

…so, given ~2,000 computers, can do this all in one month.

…but this will soon be done ~100s-1000s of times a month.
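A back-of-envelope check of the slide's arithmetic, as a sketch; the only inputs are the slide's figures and the ~3 Gbp human haploid genome size.

```python
# Back-of-envelope check of the numbers above.
cells = 1000
genome_bp = 3e9                      # human haploid genome, ~3 Gbp
coverage = 40
per_cell_bp = genome_bp * coverage   # 1.2e11 bp = 120 Gbp per cell
total_bp = per_cell_bp * cells       # 1.2e14 bp = 120 Tbp in total
print(per_cell_bp / 1e9, "Gbp/cell;", total_bp / 1e12, "Tbp total")

cpu_weeks, machines = 2000, 2000
print(cpu_weeks / machines, "weeks of wall-clock variant calling")
# ~1 week of compute after ~3 weeks of sequencing: about a month.
```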

Page 44: 2015 ohsu-metagenome

Similar math applies:

Pathogen detection in blood;

Environmental sequencing;

Sequencing rare DNA from circulating blood.

Two issues:

Volume of data & compute infrastructure;

Latency in turnaround.

Page 45: 2015 ohsu-metagenome

Streaming algorithms are good for biggish data…

[Diagram: Data → 1-pass → Answer]

Page 46: 2015 ohsu-metagenome

…as is lossy compression.

[Diagram: raw data (~10-100 GB) is analyzed into "information" (~1 GB) feeding a database & integration layer; lossy compression shrinks the raw data to ~2 GB]

Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.

Page 47: 2015 ohsu-metagenome

Moving all sequence analysis generically to semi-streaming:

~1.2 pass, sublinear memory

Paper at: https://github.com/ged-lab/2014-streaming

Page 48: 2015 ohsu-metagenome

Moving some sequence analysis to streaming.

~1.2 pass, sublinear memory

Paper at: https://github.com/ged-lab/2014-streaming

[Figure: three variants]

(a) Two-pass; reduced memory. First pass: digital normalization produces a reduced set of k-mers. Second pass: spectral analysis of the data with the reduced k-mer set.

(b) Few-pass; reduced memory. First pass: collection of low-abundance reads + analysis of saturated reads. Second pass: analysis of the collected low-abundance reads.

(c) Online; streaming. Single pass: collection of low-abundance reads + analysis of saturated reads.
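A toy sketch of the few-pass idea in panel (b), reusing the dict-based counting from earlier (khmer's real implementation differs; the names pass_one, analyze, and deferred are illustrative): reads from already-saturated regions are analyzed immediately in pass one, while low-coverage reads are set aside for a short second pass.

```python
from collections import defaultdict
from statistics import median

K, SATURATED = 20, 20
counts = defaultdict(int)
deferred = []                       # low-coverage reads for pass two

def analyze(read: str) -> None:
    """Placeholder for per-read spectral analysis (e.g. error trimming)."""

def pass_one(read: str) -> None:
    kmers = [read[i:i + K] for i in range(len(read) - K + 1)]
    for km in kmers:
        counts[km] += 1
    if median(counts[km] for km in kmers) >= SATURATED:
        analyze(read)               # region saturated: handle it now
    else:
        deferred.append(read)       # revisit once counts stabilize

def pass_two() -> None:
    for read in deferred:           # only the low-coverage minority
        analyze(read)

# Because pass two touches only the deferred minority of reads, the
# whole procedure makes ~1.2 passes over the data in aggregate.
```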

Page 49: 2015 ohsu-metagenome

Five super-awesome technologies…

1. Low-memory k-mer counting (Zhang et al., PLoS One, 2014) – see the toy sketch after this list.

2. Compressible assembly graphs (Pell et al., PNAS, 2012).

3. Streaming lossy compression of sequence data (Brown et al., arXiv, 2012).

4. A semi-streaming framework for sequence analysis

5. Graph-alignment approaches for fun and profit.
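For item 1, here is a toy count-min-sketch-style k-mer counter. khmer's actual data structure, hash functions, and parameters differ; ToyCountMin is a made-up name, and this is purely an illustration of low-memory probabilistic counting, not the library's API.

```python
import hashlib

class ToyCountMin:
    """Fixed-memory k-mer counting; may overestimate, never underestimates."""

    def __init__(self, n_tables: int = 4, table_size: int = 1_000_003):
        self.tables = [[0] * table_size for _ in range(n_tables)]

    def _slots(self, kmer: str):
        # One hashed slot per table; salting with the table index
        # makes the hash functions (approximately) independent.
        for i, table in enumerate(self.tables):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield table, int.from_bytes(digest[:8], "big") % len(table)

    def add(self, kmer: str) -> None:
        for table, idx in self._slots(kmer):
            table[idx] += 1

    def get(self, kmer: str) -> int:
        # The minimum across tables bounds the true count from above.
        return min(table[idx] for table, idx in self._slots(kmer))

cm = ToyCountMin()
cm.add("ACGGTCAGTCAAGTACGTA")
print(cm.get("ACGGTCAGTCAAGTACGTA"))  # 1 (or slightly more on collisions)
```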

Page 50: 2015 ohsu-metagenome

…implemented in one super-awesome software package…

github.com/ged-lab/khmer/

BSD licensed

Openly developed using good practice.

> 30 external contributors.

Thousands of downloads/month.

100+ citations in 4 years.

We think >5,000 people are using it; we have heard from hundreds. Bundled with software that ~100k people are using.

Page 51: 2015 ohsu-metagenome

What’s next?

In transition! MSU to UC Davis.

So, uh, I joined a Vet Med school -

“Companion animals have genomes too!”

Expanding my work more to genomics…

Coincident with moving to Davis, I also became a Moore Foundation Data-Driven Discovery Investigator.

Page 52: 2015 ohsu-metagenome

Tackling data availability…

In 5-10 years, we will have nigh-infinite data (genomic, transcriptomic, proteomic, metabolomic, …?).

We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.

Moreover, most data is unavailable until after publication…

…which, in practice, means it will be lost.

Page 53: 2015 ohsu-metagenome

…and data integration.

Once you have all the data, what do you do?

"Business as usual simply cannot work."

Looking at millions to billions of genomes.

(David Haussler, 2014)

Page 54: 2015 ohsu-metagenome

Funded: distributed graph database server

[Architecture diagram: raw data sets reach public servers, a "walled garden" server, and private servers via upload/submit (NCBI, KBase) and import (MG-RAST, SRA, EBI); a graph query layer connects these to a compute server (Galaxy? Arvados?) exposing a web interface + API for data/info]

ivory.idyll.org/blog/2014-moore-ddd-award.html

Page 55: 2015 ohsu-metagenome

The larger research vision: 100% buzzword-compliant™

Enable and incentivize sharing by providing immediate utility; frictionless sharing.

Permissionless innovation for e.g. new data mining approaches.

Plan for poverty with federated infrastructure built on open & cloud.

Solve people’s current problems, while remaining agile for the future.

ivory.idyll.org/blog/2014-moore-ddd-award.html

Page 56: 2015 ohsu-metagenome

Education and training

Biology is underprepared for data-intensive investigation.

We must teach and train the next generations.

~10-20 workshops / year, novice -> masterclass; open materials.

Deeply self-interested:

What problems does everyone have, now? (Assembly)

What problems do leading-edge researchers have? (Data integration)

dib-training.rtfd.org/

Page 57: 2015 ohsu-metagenome

Thanks!

Please contact me at [email protected]!