May 10, 2015
Probabilistic Data Structures and Breaking Down Big Sequence Data
C. Titus [email protected]
Assistant Professor (2008), Computer Science & Engineering / Microbiology and Molecular Genetics, Michigan State University
BA Reed College / Math
PhD Caltech / Developmental Biology
Member of the Python Software Foundation (a.k.a. awesomest programming language)
Personal Intro
I’m a bit sick, so I may cough loudly and obnoxiously at times.
Apologies
1. O’Reilly folk asked if I had anything to talk about.
2. Professors love talking.
3. Nifty techniques, applied to a new problem.
   1. Can they be applied to your problem?
   2. Do you have any ideas for me?
Goals, or, why am I talking?
http://ged.msu.edu/
http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.
Feedback!
Slide courtesy of Lincoln Stein
My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ; cloud computing will not save us!
“Quantity has a quality all its own”
J. Stalin
Guiding motto of genomics (and Big Data in general?)
“Ours is a just cause; victory will be ours!”
V. Molotov
Guiding motto of my lab
SAMPLING LOCATIONS
Sampling sites
Wisconsin
◦ Native prairie (Goose Pond, Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously corn)
◦ Restored prairie (from 1998)
Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)
Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)
Iowa Native Prairie
Switchgrass (Wisconsin)
Iowa >100 yr tilled
Metagenomic data sets: 30 Gb of sequence from Iowa corn
50 Gb of sequence from Iowa prairie
200 Gb of sequence from Wisconsin corn, prairie
http://ivory.idyll.org/blog/aug-10/assembly-part-i
http://ivory.idyll.org/blog/jul-10/kmer-filtering
http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
Whole (meta)genome shotgun sequencing involves fragmenting and sequencing, followed by re-assembly.
The shorter the reads, the more difficult this is to do reliably.
Assembly scales poorly.
Shotgun sequencing and assembly.
Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA; reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Assembly – essential point
Assembly is inherently an all-by-all process. There is no good way to subdivide the short sequences without potentially missing a key connection:
K-mers and k-mer (de Bruijn) graphs
Essentially, break reads (of any length) down into multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
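The decomposition above can be reproduced in a few lines of Python. This is a minimal sketch rather than the khmer implementation; the function name `kmers` is made up for illustration.

```python
def kmers(read, k):
    """Return every overlapping word of length k in read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


words = kmers("ATGGACCAGATGACAC", 12)
for word in words:
    print(word)
```

A read of length L yields L - k + 1 k-mers, so the 16-base read above gives 5 words at k=12.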
K-mer graphs
J.R. Miller et al. / Genomics (2010)
K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
K-mer graphs - branching
For decisions about which paths to follow, etc., biology-based heuristics come into play as well.
Fixed-length words => great CS techniques (hashing, trie structures, etc.)
Data loading/comparison scales with size of your data, N.
Memory usage scales with # of unique words.
This is an advantage over other techniques
◦ NxN comparisons…
Some disadvantages, too; see review, J.R. Miller et al. / Genomics (2010)
Why k-mer graphs?
Unlike some other common computational science problems in physics and chemistry, which are combinatorial in nature, graph analysis requires a lot of RAM (to store the graph).
This leads to the mildly unusual HPC scaling issue of RAM as a limiting factor.
…and RAM is expensive.
Digression: scaling
Wouldn’t it be nice…
If we knew which original genomes our short sequences came from?
Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!
Which nodes do not connect to each other?
Partitioning graphs as a goal…
Unfortunately this is already equivalent to solving the hard component of the assembly problem…
The k-mer oracle
Q: is this k-mer present in the data set?
A: no => then it is not.
A: yes => it may or may not be present.
This lets us store k-mers efficiently.
Building on the k-mer oracle
Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
The k-mer graph oracle
Q: does this k-mer overlap with this other k-mer?
A: no => then it does not, guaranteed.
A: yes => it may or may not.
This lets us traverse k-mer graphs efficiently.
Implementing a basic k-mer oracle
Conveniently, perhaps the simplest data structure in computer science is what we need…
…a hash table that ignores collisions.
Note, P(false positive) = fractional occupancy.
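A hash table that ignores collisions might look like the sketch below. It is an illustration only, not the khmer code: the class name `KmerOracle`, the table size and the use of `zlib.crc32` as the hash function are all assumptions made here.

```python
import zlib


class KmerOracle:
    """Fixed-size table in which colliding k-mers simply share a slot."""

    def __init__(self, size):
        self.size = size
        self.slots = bytearray(size)  # one byte per slot keeps the sketch simple

    def _index(self, kmer):
        # zlib.crc32 stands in for a proper k-mer hash function (assumption).
        return zlib.crc32(kmer.encode("ascii")) % self.size

    def add(self, kmer):
        self.slots[self._index(kmer)] = 1

    def __contains__(self, kmer):
        # "no" is always correct; "yes" may be a collision with another k-mer.
        return self.slots[self._index(kmer)] == 1

    def occupancy(self):
        """Fraction of slots in use: the chance an absent k-mer hits a used slot."""
        return sum(self.slots) / self.size


oracle = KmerOracle(1009)
for word in ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC"]:
    oracle.add(word)

print("ATGGACCAGATG" in oracle)   # a stored k-mer is always reported present
print(oracle.occupancy())         # approximate false positive rate
```

The occupancy figure equals the false positive rate only if the hash spreads absent k-mers evenly over the slots.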
If you ignore collisions…
O(1) query, insertion, update
Fixed memory usage
Ridiculously simple to implement (although developing a good hash function can take some effort)
Digression: hash tables are great
A more reliable k-mer oracle
Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.
http://en.wikipedia.org/wiki/Bloom_filter
Scaling the k-mer oracle
Adding additional filters increases discrimination at the cost of speed.
This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
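The tradeoff can be sketched by chaining several collision-ignoring tables and reporting a k-mer as present only if every table agrees. The design below (one `zlib.crc32` hash reduced modulo several distinct prime table sizes, and the class name `MultiTableOracle`) is an assumption made for illustration, not necessarily how the talk's implementation works.

```python
import zlib


class MultiTableOracle:
    """Several collision-ignoring tables; a k-mer must be present in all of them."""

    def __init__(self, sizes):
        self.sizes = list(sizes)
        self.tables = [bytearray(n) for n in self.sizes]

    def _indexes(self, kmer):
        # One hash value, reduced modulo each table size (assumed scheme).
        h = zlib.crc32(kmer.encode("ascii"))
        return [h % n for n in self.sizes]

    def add(self, kmer):
        for table, i in zip(self.tables, self._indexes(kmer)):
            table[i] = 1

    def __contains__(self, kmer):
        # Each extra table costs one more lookup per query.
        return all(t[i] for t, i in zip(self.tables, self._indexes(kmer)))

    def false_positive_rate(self):
        """Product of per-table occupancies, assuming the tables act independently."""
        rate = 1.0
        for table in self.tables:
            rate *= sum(table) / len(table)
        return rate


bloom = MultiTableOracle([1009, 1013, 1019])
for word in ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC"]:
    bloom.add(word)

print(bloom.false_positive_rate())
```

With three tables at roughly 0.3% occupancy each, the combined estimate is the product of the three occupancies, which is where the "multiplicatively more reliable" behaviour comes from.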
Memory usage, Bloom filter vs trie (theoretical minimum)
The k-mer oracle, revisited
We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.
This implicitly lets us store the graph structure, too!
Traversing the k-mer graph
Once you can look up k-mers quickly, traversal is easy: there are only 8 possible overlapping k-mers: 4 before, and 4 after.
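The 8 candidates can be generated directly from the k-mer and checked against the oracle. In this sketch a plain Python `set` plays the role of the oracle, the function names are made up for illustration, and reverse complements are ignored (an assumption made to keep the example short).

```python
def neighbors(kmer):
    """The 8 candidate k-mers overlapping kmer by k-1 bases: 4 before, 4 after."""
    before = [base + kmer[:-1] for base in "ACGT"]
    after = [kmer[1:] + base for base in "ACGT"]
    return before + after


def present_neighbors(kmer, oracle):
    """Keep only the candidates that the oracle reports as present."""
    return [n for n in neighbors(kmer) if n in oracle]


read = "ATGGACCAGATGACAC"
stored = {read[i:i + 12] for i in range(len(read) - 12 + 1)}

print(present_neighbors("TGGACCAGATGA", stored))
```

No edge list is stored anywhere: the graph structure is recovered from membership queries alone.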
The k-mer oracle, revisited
We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.
This implicitly lets us store the graph structure, too, because there are only 8 possible connected nodes.
We can now traverse this graph structure and ask several types of questions:
A. Graph size filtering
Graph size filtering – easy.
Which of these graphs has more than 3 nodes?
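Size filtering only needs a traversal that stops as soon as the threshold is passed, so small graphs are cheap to reject and big graphs are never walked in full. The sketch below assumes a plain `set` as the oracle and ignores reverse complements; `has_more_than` is a hypothetical helper name.

```python
def neighbors(kmer):
    """The 8 candidate k-mers overlapping kmer by k-1 bases."""
    return [b + kmer[:-1] for b in "ACGT"] + [kmer[1:] + b for b in "ACGT"]


def has_more_than(start, oracle, limit):
    """True if the graph containing start has more than limit nodes."""
    seen = {start}
    stack = [start]
    while stack:
        if len(seen) > limit:
            return True  # stop early instead of exploring the whole graph
        node = stack.pop()
        for nxt in neighbors(node):
            if nxt in oracle and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


read = "ATGGACCAGATGACAC"
big = {read[i:i + 12] for i in range(len(read) - 12 + 1)}   # chain of 5 nodes
small = {"CCCCGGGGTTTT", "CCCGGGGTTTTA"}                     # chain of 2 nodes
stored = big | small

print(has_more_than("ATGGACCAGATG", stored, 3))
print(has_more_than("CCCCGGGGTTTT", stored, 3))
```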
B. Partitioning graphs into disconnected subgraphs
Partitioning graphs – it looks easy
Which nodes do not connect to each other?
But partitioning big graphs is expensive
Tabu search – avoid global searches
Tabu search – systematic local exploration
Tabu search – systematic local exploration (parallelizable)
Hard-to-traverse graphs are well-connected
Add neighborhood-exclusion to tabu search
Exclusion strategy lets you systematically explore big graphs with a local algorithm
Potential problems
Our oracle can mistakenly connect clusters.
This is a problem if the rate is sufficiently high!
However, the error is one-sided:
Graphs will never be erroneously disconnected
This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers.
This, in turn, lets us reliably partition graphs into smaller graphs…
…and we can do so iteratively.
…we can do serial partitioning
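The one-sided error described above can be illustrated with a toy partitioning run. In this sketch the false positive is simulated by hand, by adding one spurious k-mer to a plain `set`; the k-mers, the value k=4 and the function names are all made up for illustration, and reverse complements are ignored.

```python
def neighbors(kmer):
    """The 8 candidate k-mers overlapping kmer by k-1 bases."""
    return [b + kmer[:-1] for b in "ACGT"] + [kmer[1:] + b for b in "ACGT"]


def label_components(nodes, oracle):
    """Give each real k-mer a partition label, following edges via the oracle."""
    label = {}
    for start in nodes:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbors(node):
                if nxt in oracle and nxt not in label:
                    label[nxt] = start
                    stack.append(nxt)
    return {n: label[n] for n in nodes}


real = ["AACC", "ACCG", "CGTT", "GTTA"]   # two true clusters at k=4
exact = set(real)
lossy = exact | {"CCGT"}                  # hand-made false positive bridging them

exact_labels = label_components(real, exact)
lossy_labels = label_components(real, lossy)

print(len(set(exact_labels.values())))   # partitions with a perfect oracle
print(len(set(lossy_labels.values())))   # partitions with the false positive
```

The false positive merges the two clusters into one partition, but k-mers that are truly connected always end up together. A partition produced this way may be too big, never too small, so it can safely be partitioned again with a more accurate oracle.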
1. Built lightweight probabilistic data structure/algorithm for k-mer storage.
- Constant memory, constant lookup
- Linear time to create structure
2. Implemented systematic graph traversal of arbitrarily large graphs (> ~3 billion connected k-mers, so far)
- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for memory
3. Built partitioning system to eliminate small graphs and extract disconnected graphs.
Strategy and implementation
Actual implementation
Pre-filter/partition for somebody else’s assembler
N.B. This results in identical assembly.
Python wrapping C++, ~5000 LoC. (Python handles parallelization; go free, GIL!)
Partitioning & assembling a 2 Gb data set can be done in ~8 GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature optimization)
Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM, single chassis, 8 CPU.
Not yet clear how well it scales to 200 Gb, but should…
…all of this is running on Amazon cloud rentals.
Actual implementation
Lightweight probabilistic storage system for k-mers, ~1 byte / k-mer.
Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion
Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover (connectivity increases).
In conclusion
More general assembly graph analysis
Breaking graphs in good places
Clustering of large protein similarity graphs/matrices
Caveats: Preferential attachment with false positives?
First publication -- Bloom counting hash (see kmer-filtering blog post)
Future directions
We were lucky & could turn our graph traversal problem into a set membership query.
Tabu search / neighborhood exclusion for exhaustive graph traversal isn’t novel, but might be useful. Requires systematic tagging.
But… random and probabilistic approaches (skip lists, Bloom filters, etc.) can be surprisingly useful.
◦ One-sided errors are awesome for Big Data.
Solving more general problems?
http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
Acknowledgements
GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning, grad
Qingpeng Zhang, grad
Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje (w/Janet Jansson, Rachel Mackelprang (JGI))
Funding
USDA NIFA, NSF, DOE, Michigan State U.
ABySS assembler – multi-node assembly in RAM
On-disk assembly:
SOAP assembler (BGI) – not open source
Cortex assembler (EBI) – unpub/not released
Contrail assembler (Michael Schatz) – unpub/not released
It’s hard for me to tell how these last three compare ;)
BUT our current approach is orthogonal and can be used in conjunction (as a pre-filter) with these assemblers.
Other k-mer graph work