Probabilistic Data Structures and Breaking Down Big Sequence Data C. Titus Brown [email protected]

Probabilistic breakdown of assembly graphs

May 10, 2015


c.titus.brown
Transcript
Page 1: Probabilistic breakdown of assembly graphs

Probabilistic Data Structures and Breaking Down Big Sequence Data

C. Titus Brown [email protected]

Page 2: Probabilistic breakdown of assembly graphs

Assistant Professor (2008), Computer Science & Engineering / Microbiology and Molecular Genetics, Michigan State University

BA Reed College / Math

PhD Caltech / Developmental Biology

Member of the Python Software Foundation (a.k.a. awesomest programming language)

Personal Intro

Page 3: Probabilistic breakdown of assembly graphs

I’m a bit sick, so I may cough loudly and obnoxiously at times.

Apologies

Page 4: Probabilistic breakdown of assembly graphs

1. O’Reilly folk asked if I had anything to talk about.

2. Professors love talking.

3. Nifty techniques, applied to a new problem.

1. Can they be applied to your problem?
2. Do you have any ideas for me?

Goals, or, why am I talking?

Page 5: Probabilistic breakdown of assembly graphs

[email protected]

http://ged.msu.edu/

http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.

Feedback!

Page 6: Probabilistic breakdown of assembly graphs

Slide courtesy of Lincoln Stein

My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ; cloud computing will not save us!

Page 7: Probabilistic breakdown of assembly graphs

“Quantity has a quality all its own”

J. Stalin

Guiding motto of genomics (and Big Data in general?)

Page 8: Probabilistic breakdown of assembly graphs

“Quantity has a quality all its own”

J. Stalin

“Ours is a just cause; victory will be ours!”

V. Molotov

Guiding motto of my lab

Page 9: Probabilistic breakdown of assembly graphs

SAMPLING LOCATIONS

Page 10: Probabilistic breakdown of assembly graphs

Sampling sites

Wisconsin
◦ Native prairie (Goose Pond, Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously corn)
◦ Restored prairie (from 1998)

Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)

Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)

Iowa Native Prairie

Switchgrass (Wisconsin)

Iowa >100 yr tilled

Page 11: Probabilistic breakdown of assembly graphs

Metagenomic data sets: 30 Gb of sequence from Iowa corn

50 Gb of sequence from Iowa prairie

200 Gb of sequence from Wisconsin corn, prairie

http://ivory.idyll.org/blog/aug-10/assembly-part-i

http://ivory.idyll.org/blog/jul-10/kmer-filtering

http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology

Page 12: Probabilistic breakdown of assembly graphs

Whole (meta)genome shotgun sequencing involves fragmenting and sequencing, followed by re-assembly.

The shorter the reads, the more difficult this is to do reliably.

Assembly scales poorly.

Shotgun sequencing and assembly.

Page 13: Probabilistic breakdown of assembly graphs

Whole genome shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 14: Probabilistic breakdown of assembly graphs

Assembly – essential point

Assembly is inherently an all by all process. There is no good way to subdivide the short sequences without potentially missing a key connection:

Page 15: Probabilistic breakdown of assembly graphs

K-mers and k-mer (de Bruijn) graphs

Essentially, break reads (of any length) down into multiple overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
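The decomposition on this slide can be sketched in a few lines of Python. The function name `kmers` is made up for illustration and is not taken from the khmer package; the sketch looks at one strand only.

```python
def kmers(read, k):
    """Return every overlapping word of length k in read, left to right."""
    # A read of length n contains n - k + 1 words; shorter reads yield none.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


words = kmers("ATGGACCAGATGACAC", 12)
for word in words:
    print(word)
```

Run on the read above with k=12, this prints the same five words listed on the slide.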

Page 16: Probabilistic breakdown of assembly graphs

K-mer graphs

J.R. Miller et al. / Genomics (2010)

Page 17: Probabilistic breakdown of assembly graphs

K-mer graphs - overlaps

J.R. Miller et al. / Genomics (2010)

Page 18: Probabilistic breakdown of assembly graphs

K-mer graphs - branching

For decisions about which paths to follow, etc., biology-based heuristics come into play as well.

Page 19: Probabilistic breakdown of assembly graphs

Fixed-length words => great CS techniques (hashing, trie structures, etc.)

Data loading/comparison scales with size of your data, N.

Memory usage scales with # of unique words.

This is an advantage over other techniques
◦ NxN comparisons…

Some disadvantages, too; see review, J.R. Miller et al. / Genomics (2010)

Why k-mer graphs?

Page 20: Probabilistic breakdown of assembly graphs

Unlike some other common computational science problems in physics and chemistry, which are combinatorial in nature, graph analysis requires a lot of RAM (to store the graph).

This leads to the mildly unusual HPC scaling issue of RAM as a limiting factor.

…and RAM is expensive.

Digression: scaling

Page 21: Probabilistic breakdown of assembly graphs

Wouldn’t it be nice…

If we knew which original genomes our short sequences came from?

Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!

Page 22: Probabilistic breakdown of assembly graphs

Which nodes do not connect to each other?

Partitioning graphs as a goal…

Page 23: Probabilistic breakdown of assembly graphs

Wouldn’t it be nice…

If we knew which original genomes our short sequences came from?

Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!

Unfortunately this is already equivalent to solving the hard component of the assembly problem…

Page 24: Probabilistic breakdown of assembly graphs

The k-mer oracle

Q: is this k-mer present in the data set?

A: no => then it is not.

A: yes => it may or may not be present.

This lets us store k-mers efficiently.

Page 25: Probabilistic breakdown of assembly graphs

Building on the k-mer oracle:

Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:

Page 26: Probabilistic breakdown of assembly graphs

The k-mer graph oracle

Q: does this k-mer overlap with this other k-mer?

A: no => then it does not, guaranteed.

A: yes => it may or may not.

This lets us traverse k-mer graphs efficiently.

Page 27: Probabilistic breakdown of assembly graphs

Implementing a basic k-mer oracle

Conveniently, perhaps the simplest data structure in computer science is what we need…

…a hash table that ignores collisions.

Note, P(false positive) = fractional occupancy.
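A minimal sketch of such a table, assuming a single fixed-size array and a general-purpose hash. The class name `KmerOracle`, the table size, and the choice of BLAKE2b are illustrative assumptions, not details from the talk or from khmer.

```python
import hashlib


class KmerOracle:
    """Fixed-size presence table; colliding k-mers simply share a slot."""

    def __init__(self, size):
        self.size = size
        # One byte per slot for readability; a real table would pack bits.
        self.slots = bytearray(size)

    def _slot(self, kmer):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        self.slots[self._slot(kmer)] = 1

    def __contains__(self, kmer):
        # "No" is always correct; "yes" may be a collision.
        return self.slots[self._slot(kmer)] == 1

    def occupancy(self):
        """Fraction of slots set, i.e. the chance a random query says yes."""
        return sum(self.slots) / self.size


oracle = KmerOracle(size=1009)
read = "ATGGACCAGATGACAC"
inserted = [read[i:i + 12] for i in range(len(read) - 12 + 1)]
for kmer in inserted:
    oracle.add(kmer)

print(all(kmer in oracle for kmer in inserted))  # True: no false negatives
print(oracle.occupancy())
```

Because nothing is ever cleared, an inserted k-mer can never be reported absent; the only error is a "yes" for a k-mer that happens to hash to an occupied slot.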

Page 28: Probabilistic breakdown of assembly graphs

If you ignore collisions…

O(1) query, insertion, update

Fixed memory usage

Ridiculously simple to implement (although developing a good hash function can take some effort)

Digression: hash tables are great

Page 29: Probabilistic breakdown of assembly graphs

Implementing a basic k-mer oracle

Conveniently, perhaps the simplest data structure in computer science is what we need…

…a hash table that ignores collisions.

Note, P(false positive) = fractional occupancy.

Page 30: Probabilistic breakdown of assembly graphs

A more reliable k-mer oracle

Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.

http://en.wikipedia.org/wiki/Bloom_filter
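One way to chain several oracles, sketched under assumptions: each table gets its own hash by prefixing the table index to the k-mer, and a k-mer counts as present only if every table says yes. The class name `BloomOracle` and the sizes are hypothetical; the talk does not say how its tables are hashed.

```python
import hashlib


class BloomOracle:
    """Several collision-ignoring tables queried one after another."""

    def __init__(self, size, n_tables):
        self.size = size
        self.tables = [bytearray(size) for _ in range(n_tables)]

    def _slot(self, index, kmer):
        # Prefixing the table index gives each table a different hash.
        data = f"{index}:{kmer}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        for i, table in enumerate(self.tables):
            table[self._slot(i, kmer)] = 1

    def __contains__(self, kmer):
        # A single "no" from any table is a guaranteed "no".
        return all(table[self._slot(i, kmer)]
                   for i, table in enumerate(self.tables))


bloom = BloomOracle(size=10007, n_tables=4)
read = "ATGGACCAGATGACAC"
inserted = [read[i:i + 12] for i in range(len(read) - 12 + 1)]
for kmer in inserted:
    bloom.add(kmer)

print(all(kmer in bloom for kmer in inserted))  # True: no false negatives
```

A false positive now needs a collision in every table at once, which is where the multiplicative gain comes from.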

Page 31: Probabilistic breakdown of assembly graphs

Scaling the k-mer oracle

Adding additional filters increases discrimination at the cost of speed.

This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
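The tradeoff can be written out as a short worked example. The symbols are assumptions introduced here: n is the number of tables, f_i is the fractional occupancy of table i, and the hashes are taken to be independent.

```latex
P(\text{false positive}) \;=\; \prod_{i=1}^{n} f_i \;\approx\; f^{\,n}
\qquad \text{e.g. } f = 0.2,\; n = 4:\quad 0.2^{4} = 0.0016
```

Making each table larger lowers f (more memory); adding tables raises n (more hash computations per lookup). Either route drives the product down.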

Page 32: Probabilistic breakdown of assembly graphs
Page 33: Probabilistic breakdown of assembly graphs
Page 34: Probabilistic breakdown of assembly graphs

Memory usage, Bloom filter vs trie (theoretical minimum)

Page 35: Probabilistic breakdown of assembly graphs

The k-mer oracle, revisited

We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.

This implicitly lets us store the graph structure, too!

Page 36: Probabilistic breakdown of assembly graphs

Traversing the k-mer graph

Once you can look up k-mers quickly, traversal is easy: there are only 8 possible overlapping k-mers: 4 before, and 4 after.
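The eight candidates can be enumerated directly. In this sketch a plain Python set stands in for the oracle (anything supporting `in` would do), and the function name `neighbors` is hypothetical. Only one strand is considered.

```python
def neighbors(kmer, present):
    """Return the k-mers overlapping kmer by k-1 bases that are reported present."""
    candidates = set()
    for base in "ACGT":
        candidates.add(base + kmer[:-1])  # the 4 possible k-mers before
        candidates.add(kmer[1:] + base)   # the 4 possible k-mers after
    candidates.discard(kmer)  # ignore self-loops such as AAAA -> AAAA
    return {c for c in candidates if c in present}


read = "ATGGACCAGATGACAC"
present = {read[i:i + 12] for i in range(len(read) - 12 + 1)}
print(sorted(neighbors("TGGACCAGATGA", present)))
```

For the second k-mer of the example read, this finds exactly its predecessor and successor in the read.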

Page 37: Probabilistic breakdown of assembly graphs

The k-mer oracle, revisited

We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.

This implicitly lets us store the graph structure, too, because there are only 8 possible connected nodes.

We can now traverse this graph structure and ask several types of questions:

Page 38: Probabilistic breakdown of assembly graphs

A. Graph size filtering

Which of these graphs has more than 3 nodes?

Page 39: Probabilistic breakdown of assembly graphs

Graph size filtering – easy.

Which of these graphs has more than 3 nodes?
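A size filter only needs a bounded traversal: walk outward from a k-mer and stop as soon as the node count passes the threshold. This is a sketch with a set standing in for the oracle; `component_exceeds` and `neighbors` are illustrative names, not the talk's implementation.

```python
from collections import deque


def neighbors(kmer, present):
    """K-mers overlapping kmer by k-1 bases that are reported present."""
    candidates = {base + kmer[:-1] for base in "ACGT"}
    candidates |= {kmer[1:] + base for base in "ACGT"}
    candidates.discard(kmer)
    return {c for c in candidates if c in present}


def component_exceeds(start, present, limit):
    """True if the graph containing start has more than limit nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        if len(seen) > limit:
            return True  # stop early; no need to walk the whole graph
        node = queue.popleft()
        for nxt in neighbors(node, present):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) > limit


def kmer_set(read, k):
    return {read[i:i + k] for i in range(len(read) - k + 1)}


# Two unrelated reads: one gives a 5-node chain, the other a 3-node chain.
present = kmer_set("ATGGACCAGATGACAC", 12) | kmer_set("TTTTCCCCGGGGAA", 12)
print(component_exceeds("ATGGACCAGATG", present, 3))  # True
print(component_exceeds("TTTTCCCCGGGG", present, 3))  # False
```

The work per query is bounded by the threshold, not by the size of the graph, which is what makes the filter cheap.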

Page 40: Probabilistic breakdown of assembly graphs

B. Partitioning graphs into disconnected subgraphs

Which nodes do not connect to each other?

Page 41: Probabilistic breakdown of assembly graphs

Partitioning graphs – it looks easy

Which nodes do not connect to each other?
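On a small graph, partitioning is plain connected-component labeling. The sketch below assumes the k-mers can be enumerated from a Python set; a probabilistic oracle cannot list its contents, so a real run would have to take its starting k-mers from the reads. The names `partition` and `neighbors` are hypothetical.

```python
def neighbors(kmer, present):
    """K-mers overlapping kmer by k-1 bases that are reported present."""
    candidates = {base + kmer[:-1] for base in "ACGT"}
    candidates |= {kmer[1:] + base for base in "ACGT"}
    candidates.discard(kmer)
    return {c for c in candidates if c in present}


def partition(present):
    """Split a set of k-mers into disconnected subgraphs (a list of sets)."""
    unassigned = set(present)
    parts = []
    while unassigned:
        start = unassigned.pop()
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbors(node, present):
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        unassigned -= component
        parts.append(component)
    return parts


def kmer_set(read, k):
    return {read[i:i + k] for i in range(len(read) - k + 1)}


present = kmer_set("ATGGACCAGATGACAC", 12) | kmer_set("TTTTCCCCGGGGAA", 12)
parts = partition(present)
print(sorted(len(p) for p in parts))  # [3, 5]
```

Each k-mer is expanded once, so the cost grows with the size of the graph, and the visited sets must fit in memory.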

Page 42: Probabilistic breakdown of assembly graphs

But partitioning big graphs is expensive

Page 43: Probabilistic breakdown of assembly graphs

Tabu search – avoid global searches

Page 44: Probabilistic breakdown of assembly graphs

Tabu search – systematic local exploration

Page 45: Probabilistic breakdown of assembly graphs

Tabu search – systematic local exploration

Page 46: Probabilistic breakdown of assembly graphs

Tabu search – systematic local exploration

Page 47: Probabilistic breakdown of assembly graphs

Tabu search – systematic local exploration (parallelizable)

Page 48: Probabilistic breakdown of assembly graphs

Hard-to-traverse graphs are well-connected

Page 49: Probabilistic breakdown of assembly graphs

Add neighborhood-exclusion to tabu search

Page 50: Probabilistic breakdown of assembly graphs

Exclusion strategy lets you systematically explore big graphs with a local algorithm

Page 51: Probabilistic breakdown of assembly graphs

Potential problems

Our oracle can mistakenly connect clusters.

Page 52: Probabilistic breakdown of assembly graphs

Potential problems

This is a problem if the rate is sufficiently high!

Page 53: Probabilistic breakdown of assembly graphs

However, the error is one-sided:

Graphs will never be erroneously disconnected

Page 54: Probabilistic breakdown of assembly graphs

The error is one-sided:

Nodes will never be erroneously disconnected

Page 55: Probabilistic breakdown of assembly graphs

The error is one-sided:

Nodes will never be erroneously disconnected.

This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers.

This, in turn, lets us reliably partition graphs into smaller graphs…

…and we can do so iteratively.
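The one-sided error can be illustrated with a toy case. Here a set with one extra member simulates an oracle false positive; the names `count_partitions`, `exact` and `leaky`, and the use of k=4, are assumptions made for the example.

```python
def neighbors(kmer, oracle):
    """K-mers overlapping kmer by k-1 bases that the oracle reports present."""
    candidates = {base + kmer[:-1] for base in "ACGT"}
    candidates |= {kmer[1:] + base for base in "ACGT"}
    candidates.discard(kmer)
    return {c for c in candidates if c in oracle}


def count_partitions(true_kmers, oracle):
    """Count disconnected groups of true_kmers, walking via oracle answers."""
    unassigned = set(true_kmers)
    count = 0
    while unassigned:
        stack = [unassigned.pop()]
        seen = set(stack)
        while stack:
            for nxt in neighbors(stack.pop(), oracle):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        unassigned -= seen
        count += 1
    return count


true_kmers = {"AAAC", "ACGT"}   # two k-mers that do not overlap by 3 bases
exact = set(true_kmers)         # an oracle with no errors
leaky = exact | {"AACG"}        # one simulated false positive

print(count_partitions(true_kmers, exact))  # 2
print(count_partitions(true_kmers, leaky))  # 1: the false positive bridges them
```

The false positive merges two partitions, so a partition may be larger than necessary, but no true connection is ever lost. Rerunning with a lower false positive rate can only split partitions further, which is what makes iterative refinement safe.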

Page 56: Probabilistic breakdown of assembly graphs

…we can do serial partitioning

Page 57: Probabilistic breakdown of assembly graphs

1. Built lightweight probabilistic data structure/algorithm for k-mer storage.

- Constant memory, constant lookup
- Linear time to create structure

2. Implemented systematic graph traversal of arbitrarily large graphs (> ~3 billion connected k-mers, so far)

- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for memory

3. Built partitioning system to eliminate small graphs and extract disconnected graphs.

Strategy and implementation

Page 58: Probabilistic breakdown of assembly graphs

Actual implementation

Page 59: Probabilistic breakdown of assembly graphs

Pre-filter/partition for somebody else’s assembler

N.B. This results in identical assembly.

Page 60: Probabilistic breakdown of assembly graphs

Python wrapping C++, ~5000 LoC. (Python handles parallelization; go free, GIL!)

Partitioning & assembling a 2 Gb data set can be done in ~8 GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature optimization)

Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM, single chassis, 8 CPU.

Not yet clear how well it scales to 200 Gb, but should…

…all of this is running on Amazon cloud rentals.

Actual implementation

Page 61: Probabilistic breakdown of assembly graphs

Lightweight probabilistic storage system for k-mers, ~1 byte / k-mer.

Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion

Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover (connectivity increases).

In conclusion

Page 62: Probabilistic breakdown of assembly graphs

More general assembly graph analysis

Breaking graphs in good places

Clustering of large protein similarity graphs/matrices

Caveats: Preferential attachment with false positives?

First publication -- Bloom counting hash (see kmer-filtering blog post)

Future directions

Page 63: Probabilistic breakdown of assembly graphs

We were lucky & could turn our graph traversal problem into a set membership query.

Tabu search / neighborhood exclusion for exhaustive graph traversal isn’t novel, but might be useful. Requires systematic tagging.

But… random and probabilistic approaches (skip lists, Bloom filters, etc.) can be surprisingly useful.
◦ One-sided errors are awesome for Big Data.

Solving more general problems?

http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures

Page 64: Probabilistic breakdown of assembly graphs

Acknowledgements

GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning, grad
Qingpeng Zhang, grad

Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje (w/Janet Jansson, Rachel Mackelprang (JGI))

Funding
USDA NIFA, NSF, DOE, Michigan State U.

Page 65: Probabilistic breakdown of assembly graphs

ABySS assembler – multi-node assembly in RAM

On-disk assembly:

SOAP assembler (BGI) – not open source

Cortex assembler (EBI) – unpub/not released

Contrail assembler (Michael Schatz) – unpub/not released

It’s hard for me to tell how these last three compare ;)

BUT our current approach is orthogonal and can be used in conjunction (as a pre-filter) with these assemblers.

Other k-mer graph work