Probabilistic breakdown of assembly graphs

Probabilistic Data Structures and Breaking Down Big Sequence Data

C. Titus [email protected]

Assistant Professor (2008), Computer Science & Engineering / Microbiology and Molecular Genetics, Michigan State University

BA Reed College / Math

PhD Caltech / Developmental Biology

Member of the Python Software Foundation (a.k.a. awesomest programming language)

Personal Intro

I’m a bit sick, so I may cough loudly and obnoxiously at times.

Apologies

1. O’Reilly folk asked if I had anything to talk about.

2. Professors love talking.

3. Nifty techniques, applied to a new problem.

1. Can they be applied to your problem?

2. Do you have any ideas for me?

Goals, or, why am I talking?

[email protected]

http://ged.msu.edu/

http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.

Feedback!

Slide courtesy of Lincoln Stein

My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ; cloud computing will not save us!

“Quantity has a quality all its own”

J. Stalin

Guiding motto of genomics (and Big Data in general?)

“Ours is a just cause; victory will be ours!”

V. Molotov

Guiding motto of my lab

SAMPLING LOCATIONS

Sampling sites

Wisconsin
◦ Native prairie (Goose Pond, Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously corn)
◦ Restored prairie (from 1998)

Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)

Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)

Iowa native prairie

Switchgrass (Wisconsin)

Iowa >100 yr tilled

Metagenomic data sets: 30 Gb of sequence from Iowa corn

50 Gb of sequence from Iowa prairie

200 Gb of sequence from Wisconsin corn, prairie

http://ivory.idyll.org/blog/aug-10/assembly-part-i

http://ivory.idyll.org/blog/jul-10/kmer-filtering

http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology

Whole (meta)genome shotgun sequencing involves fragmenting and sequencing, followed by re-assembly.

The shorter the reads, the more difficult this is to do reliably.

Assembly scales poorly.

Shotgun sequencing and assembly.

Whole genome shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Assembly – essential point

Assembly is inherently an all-by-all process. There is no good way to subdivide the short sequences without potentially missing a key connection.

K-mers and k-mer (de Bruijn) graphs

Essentially, break reads (of any length) down into multiple overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
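The decomposition above can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `kmers` is made up here and is not taken from the khmer package.

```python
def kmers(read, k):
    """Yield every overlapping word of length k in `read`, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


# The 16-base read from the slide, broken into 12-mers.
words = list(kmers("ATGGACCAGATGACAC", 12))
for word in words:
    print(word)
```

A read of length L yields L - k + 1 k-mers, so the 16-base read gives the five 12-mers listed above.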

K-mer graphs

J.R. Miller et al. / Genomics (2010)

K-mer graphs - overlaps

J.R. Miller et al. / Genomics (2010)

K-mer graphs - branching

For decisions about which paths etc, biology-based heuristics come into play as well.

Fixed-length words => great CS techniques (hashing, trie structures, etc.)

Data loading/comparison scales with size of your data, N.

Memory usage scales with # of unique words.

This is an advantage over other techniques
◦ NxN comparisons…

Some disadvantages, too; see review, J.R. Miller et al. / Genomics (2010)

Why k-mer graphs?

Unlike some other common computational science problems in physics and chemistry, which are combinatorial in nature, graph analysis requires a lot of RAM (to store the graph).

This leads to the mildly unusual HPC scaling issue of RAM as a limiting factor.

…and RAM is expensive.

Digression: scaling

Wouldn’t it be nice…

If we knew which original genomes our short sequences came from?

Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!

Which nodes do not connect to each other?

Partitioning graphs as a goal…

Unfortunately, knowing which genome each short sequence came from is already equivalent to solving the hard component of the assembly problem…

The k-mer oracle

Q: is this k-mer present in the data set?

A: no => then it is not.

A: yes => it may or may not be present.

This lets us store k-mers efficiently.

Building on the k-mer oracle:

Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:

The k-mer graph oracle

Q: does this k-mer overlap with this other k-mer?

A: no => then it does not, guaranteed.

A: yes => it may or may not.

This lets us traverse k-mer graphs efficiently.

Implementing a basic k-mer oracle

Conveniently, perhaps the simplest data structure in computer science is what we need…

…a hash table that ignores collisions.

Note, P(false positive) = fractional occupancy.
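A minimal sketch of such a collision-ignoring hash table, assuming one flag per slot and a general-purpose hash from Python's standard library. The class name `KmerOracle`, the table size and the choice of hash are illustrative assumptions, not the talk's actual implementation.

```python
import hashlib


class KmerOracle:
    """A hash table that ignores collisions: one flag per slot, no keys stored."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.slots = bytearray(n_slots)  # one byte per slot, for readability

    def _slot(self, kmer):
        # Any well-mixed hash would do; blake2b is simply what the stdlib offers.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.n_slots

    def add(self, kmer):
        self.slots[self._slot(kmer)] = 1

    def __contains__(self, kmer):
        # "no" is always correct; "yes" may be a collision with another k-mer.
        return self.slots[self._slot(kmer)] == 1

    def occupancy(self):
        """Fraction of slots in use, i.e. the chance a random absent k-mer says 'yes'."""
        return sum(self.slots) / self.n_slots


oracle = KmerOracle(1000)
inserted = ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC",
            "GACCAGATGACA", "ACCAGATGACAC"]
for kmer in inserted:
    oracle.add(kmer)
print(all(kmer in oracle for kmer in inserted), oracle.occupancy())
```

Memory is fixed by `n_slots` no matter how many k-mers are added; only the occupancy, and with it the false positive rate, grows.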

If you ignore collisions…

O(1) query, insertion, update

Fixed memory usage

Ridiculously simple to implement (although developing a good hash function can take some effort)

Digression: hash tables are great

A more reliable k-mer oracle

Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.

http://en.wikipedia.org/wiki/Bloom_filter
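One way to picture the Bloom filter approach is as several independent collision-ignoring tables, where a k-mer counts as present only if every table says "yes". The sketch below assumes equal-sized tables and a salted standard-library hash per table; `BloomOracle` and its parameters are hypothetical names chosen for illustration.

```python
import hashlib


class BloomOracle:
    """Several collision-ignoring tables; a k-mer must be flagged in all of them."""

    def __init__(self, n_slots, n_tables):
        self.n_slots = n_slots
        self.tables = [bytearray(n_slots) for _ in range(n_tables)]

    def _slot(self, kmer, table_index):
        # A different salt per table gives each table its own hash function.
        salt = table_index.to_bytes(4, "big")
        digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "big") % self.n_slots

    def add(self, kmer):
        for i, table in enumerate(self.tables):
            table[self._slot(kmer, i)] = 1

    def __contains__(self, kmer):
        return all(table[self._slot(kmer, i)]
                   for i, table in enumerate(self.tables))

    def false_positive_estimate(self):
        """Product of per-table occupancies, treating the tables as independent."""
        rate = 1.0
        for table in self.tables:
            rate *= sum(table) / self.n_slots
        return rate


bloom = BloomOracle(n_slots=1000, n_tables=4)
inserted = ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC",
            "GACCAGATGACA", "ACCAGATGACAC"]
for kmer in inserted:
    bloom.add(kmer)
print(all(kmer in bloom for kmer in inserted), bloom.false_positive_estimate())
```

Each lookup now costs one hash per table, which is the speed price paid for the lower false positive rate.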

Scaling the k-mer oracle

Adding additional filters increases discrimination at the cost of speed.

This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
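The tradeoff can be estimated with a back-of-the-envelope calculation. The sketch below assumes uniform hashing and independent tables, and uses the standard approximation that a table with m slots holding n distinct k-mers has occupancy 1 - exp(-n/m). The function name and the example numbers are illustrative only.

```python
import math


def expected_false_positive_rate(n_kmers, slots_per_table, n_tables):
    """Approximate P(false positive) for n_tables independent tables.

    Assumes uniform hashing: occupancy of one table is about 1 - exp(-n/m),
    and a false positive needs a collision in every table.
    """
    occupancy = 1.0 - math.exp(-n_kmers / slots_per_table)
    return occupancy ** n_tables


n = 1_000_000
one_small_table = expected_false_positive_rate(n, 4_000_000, 1)
one_big_table = expected_false_positive_rate(n, 8_000_000, 1)    # more memory
four_small_tables = expected_false_positive_rate(n, 4_000_000, 4)  # more filters
print(one_small_table, one_big_table, four_small_tables)
```

Doubling the size of a single table lowers the rate roughly linearly, while adding tables lowers it multiplicatively at the cost of extra hashing per lookup.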

Memory usage, Bloom filter vs trie (theoretical minimum)

The k-mer oracle, revisited

We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.

This implicitly lets us store the graph structure, too!

Traversing the k-mer graph

Once you can look up k-mers quickly, traversal is easy: there are only 8 possible overlapping k-mers: 4 before, and 4 after.
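A short sketch of that neighbor lookup, using a plain Python set as a stand-in for the oracle (anything supporting `in` would work). The function name `neighbors` is made up for illustration, and reverse complements are ignored to keep the example small.

```python
def neighbors(kmer, oracle):
    """Return the k-mers overlapping `kmer` by k-1 bases that the oracle reports."""
    found = set()
    for base in "ACGT":
        after = kmer[1:] + base    # shift one base to the right
        before = base + kmer[:-1]  # shift one base to the left
        for candidate in (after, before):
            if candidate != kmer and candidate in oracle:
                found.add(candidate)
    return found


# Stand-in oracle: the exact set of 12-mers from the read ATGGACCAGATGACAC.
read = "ATGGACCAGATGACAC"
oracle = {read[i:i + 12] for i in range(len(read) - 12 + 1)}
print(sorted(neighbors("TGGACCAGATGA", oracle)))
```

No edge list is stored anywhere: the graph is recovered on demand from at most 8 membership queries per node.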

We can now traverse this graph structure and ask several types of questions:

A. Graph size filtering

Which of these graphs has more than 3 nodes?

Graph size filtering – easy.
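Size filtering only needs a traversal that stops as soon as the threshold is passed, so small graphs are cheap to identify and large ones are never fully explored. The sketch below uses a plain set as a stand-in oracle and ignores reverse complements; `component_exceeds` is a hypothetical name.

```python
from collections import deque


def component_exceeds(start, oracle, threshold):
    """True if the graph containing `start` has more than `threshold` nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        if len(seen) > threshold:
            return True  # stop early; no need to explore the rest
        kmer = queue.popleft()
        for base in "ACGT":
            for candidate in (kmer[1:] + base, base + kmer[:-1]):
                if candidate in oracle and candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
    return len(seen) > threshold


def kmer_set(read, k):
    return {read[i:i + k] for i in range(len(read) - k + 1)}


# Two unconnected graphs: one with 5 nodes, one with 2.
oracle = kmer_set("ATGGACCAGATGACAC", 12) | kmer_set("CCGGTTAACCGGA", 12)
print(component_exceeds("ATGGACCAGATG", oracle, 3))
print(component_exceeds("CCGGTTAACCGG", oracle, 3))
```

Because the exploration is bounded by the threshold, the cost per starting k-mer does not depend on how large the biggest graph is.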

B. Partitioning graphs into disconnected subgraphs


Partitioning graphs – it looks easy


But partitioning big graphs is expensive

Tabu search – avoid global searches


Tabu search – systematic local exploration (parallelizable)

Hard-to-traverse graphs are well-connected

Add neighborhood-exclusion to tabu search

Exclusion strategy lets you systematically explore big graphs with a local algorithm
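The slides do not spell out the traversal algorithm, so the following is only one plausible reading of "local exploration with an exclusion list": label each k-mer with a partition id, and treat already-labeled k-mers as off limits so no part of the graph is explored twice. The function name `partition`, the exact set used as the oracle, and the omission of reverse complements are all assumptions made for this sketch.

```python
from collections import deque


def partition(kmers):
    """Map each k-mer to a partition id; connected k-mers share an id."""
    present = set(kmers)
    label = {}  # doubles as the exclusion list: labeled k-mers are never revisited
    next_id = 0
    for seed in kmers:
        if seed in label:
            continue  # already claimed by an earlier local exploration
        label[seed] = next_id
        queue = deque([seed])
        while queue:
            kmer = queue.popleft()
            for base in "ACGT":
                for candidate in (kmer[1:] + base, base + kmer[:-1]):
                    if candidate in present and candidate not in label:
                        label[candidate] = next_id
                        queue.append(candidate)
        next_id += 1
    return label


def kmer_list(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


all_kmers = kmer_list("ATGGACCAGATGACAC", 12) + kmer_list("CCGGTTAACCGGA", 12)
labels = partition(all_kmers)
print(labels)
```

Each k-mer is labeled exactly once, so total work is linear in the number of k-mers even though every exploration step is purely local.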

Potential problems

Our oracle can mistakenly connect clusters.

This is a problem if the rate is sufficiently high!

However, the error is one-sided:

Graphs will never be erroneously disconnected

This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers.

This, in turn, lets us reliably partition graphs into smaller graphs…

…and we can do so iteratively.

…we can do serial partitioning

1. Built lightweight probabilistic data structure/algorithm for k-mer storage.

- Constant memory, constant lookup
- Linear time to create structure

2. Implemented systematic graph traversal of arbitrarily large graphs (> ~3 billion connected k-mers, so far)

- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for memory

3. Built partitioning system to eliminate small graphs and extract disconnected graphs.

Strategy and implementation

Actual implementation

Pre-filter/partition for somebody else’s assembler

N.B. This results in identical assembly.

Python wrapping C++, ~5000 LoC. (Python handles parallelization; go free, GIL!)

Partitioning & assembling a 2 Gb data set can be done in ~8 GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature opt)

Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM, single chassis, 8 CPU.

Not yet clear how well it scales to 200 Gb, but should…

…all of this is running on Amazon cloud rentals.

Actual implementation

Lightweight probabilistic storage system for k-mers, ~1 byte / k-mer.

Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion

Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover (connectivity increases).

In conclusion

More general assembly graph analysis

Breaking graphs in good places

Clustering of large protein similarity graphs/matrices

Caveats: Preferential attachment with false positives?

First publication -- Bloom counting hash (see kmer-filtering blog post)

Future directions

We were lucky & could turn our graph traversal problem into a set membership query.

Tabu search / neighborhood exclusion for exhaustive graph traversal isn’t novel, but might be useful. Requires systematic tagging.

But… random and probabilistic approaches (skip lists, Bloom filters, etc.) can be surprisingly useful.
◦ One-sided errors are awesome for Big Data.

Solving more general problems?

http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures

Acknowledgements

GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning, grad
Qingpeng Zhang, grad

Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje (w/Janet Jansson, Rachel Mackelprang (JGI))

Funding
USDA NIFA, NSF, DOE, Michigan State U.

ABySS assembler – multi-node assembly in RAM

On-disk assembly:

SOAP assembler (BGI) – not open source

Cortex assembler (EBI) – unpub/not released

Contrail assembler (Michael Schatz) – unpub/not released

It’s hard for me to tell how these last three compare ;)

BUT our current approach is orthogonal and can be used in conjunction (as a pre-filter) with these assemblers.

Other k-mer graph work
