Sequencing run grief counseling: counting kmers at MG-RAST
Will Trimble, metagenomic annotation group, Argonne National Laboratory
April 29, 2014, UIC
Dec 02, 2014
Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to use ambiguous data to answer life’s persistent questions.
• Shoveling data from the data-producing machine into the data-consuming furnace.
Outline
• Sequences are different (math)
• Sequencing is like photography (pictures)
• Sequencing is beautiful: thumbnailpolish (micrographs)
• How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)
Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
Low-throughput categorical data: categories are sound.
Instrument readings, spectra, micrographs: not categorical.
@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD
High-throughput sequence data: categories uncertain.
10^0-10^2   10^2-10^7   10^12-10^80
Experiment design, sequencing run, sequence data
Assembly, annotation
SEED, M5NR
489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B
So we reduce sequence data to categorical data.
Forward-backward problem
10^12
10^3-10^5   10^0-10^1
Sequences are different
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”
Searching
Same answer for both puzzles: you go to this website…
How long do reads need to be to recognize them?
How long do phrases need to be to recognize them?
How long do reads need to be?
Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data. It has profound applicability in machine learning and probabilistic modeling.
$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
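As a minimal sketch of the entropy formula above, the following Python computes H in bits for a probability distribution and estimates it from observed symbol frequencies. The function names and the example distributions are illustrative assumptions, not something from the talk.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


def empirical_entropy(symbols):
    """Estimate entropy per symbol from the observed frequencies of the symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(count / total for count in counts.values())


# Four equally likely bases carry the maximum of 2 bits per base.
uniform_bases = shannon_entropy([0.25, 0.25, 0.25, 0.25])
# A skewed base composition (made-up numbers) carries less than 2 bits per base.
skewed_bases = shannon_entropy([0.7, 0.1, 0.1, 0.1])
print(uniform_bases, skewed_bases)
```

The uniform case is the upper bound used later in the talk: no base composition can carry more than 2 bits per base.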
How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.
for n in 1..10
    type the first n words into Google Books, quoted.
    break if Google identifies your book.
• Information content of English words: H_word ≈ 12 bits per word.
• Size of Google Books? Big libraries have a few 10^7 books, and each one has about 10^5 indexed words, so the database size is about 10^12 words. log2(database size) = log2(10^12) ≈ 39.9, so about 40 bits.
• So we expect on average 40 / 12 ≈ 3.3, rounded up to 4, words to be enough to find a phrase in Google’s index.
Try it.
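The back-of-envelope estimate for phrases can be checked in a few lines. This is only a sketch of the arithmetic on the slide; the constants are the slide's rough figures and the variable names are made up for illustration.

```python
import math

BITS_PER_WORD = 12        # the slide's rough estimate for English words
BOOKS = 10**7             # order of magnitude for a big library
WORDS_PER_BOOK = 10**5    # indexed words per book

database_words = BOOKS * WORDS_PER_BOOK
bits_needed = math.log2(database_words)          # bits to address one position in the index
words_needed = math.ceil(bits_needed / BITS_PER_WORD)

print(f"database size: {database_words:.0e} words")
print(f"bits needed:   {bits_needed:.1f}")
print(f"words needed:  {words_needed}")
```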
Usually nails your source in four words.
• Maximum information content of base pairs: H_read is at most 2ℓ bits per length-ℓ sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) ≈ 33.2, so about 34 bits.
• So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
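The same arithmetic for reads, as a small sketch: the function name is hypothetical, and the 2 bits per base is the upper bound for uniformly random bases, so real genomes with repeats need longer sequences than this suggests.

```python
import math


def min_kmer_length(genome_size_bp, bits_per_base=2.0):
    """Smallest length whose maximum information content covers log2(genome size) bits."""
    bits_needed = math.log2(genome_size_bp)
    return math.ceil(bits_needed / bits_per_base)


for genome_size in (5 * 10**6, 3 * 10**9, 10**10):
    print(f"G = {genome_size:.0e} bp -> at least {min_kmer_length(genome_size)} bp")
```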
Short sequences end up being very distinctive, even fingerprint-like.
Check: Human reference genome
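The human reference genome is too large for a quick example, so the sketch below runs the same kind of check on a toy stand-in: a uniformly random 10,000 bp sequence (an assumption, not the data behind the slide). It reports, for several k, the fraction of kmer positions whose kmer occurs exactly once. The function name is hypothetical.

```python
import random
from collections import Counter


def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs exactly once in the sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


rng = random.Random(0)  # fixed seed so the toy genome is reproducible
toy_genome = "".join(rng.choice("ACGT") for _ in range(10_000))

for k in (2, 4, 6, 8, 10, 12, 20):
    print(k, round(unique_kmer_fraction(toy_genome, k), 3))
```

For a random sequence of this size the information argument predicts the transition near log2(10^4) / 2 ≈ 6.6 bases. A real genome contains repeats, so its curve approaches 1 more slowly.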
The data deluge
• There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of tens of Gbytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).
http://www.mcs.anl.gov/~trimble/flowcell/
thumbnailpolish
Rarefaction of a photograph
A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of sequences that land in each possible sequence.
I actually think of a sequencer like a multichannel genetic spectrometer.
The genetic spectrometer
With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.
• Species diversity
• Gene diversity
• Sequence diversity
ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6
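A table of 15-mers and counts like the one above can be produced, for small inputs, with a plain dictionary counter. This is a minimal in-memory sketch, not the tool used for the talk: the function name is hypothetical, it does not merge reverse complements, and real datasets need a dedicated kmer counter.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(reads, k):
    """Count every length-k substring of every read, skipping k-mers with ambiguous bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


# Two made-up reads dominated by poly-A, like the top rows of the table.
reads = ["A" * 16 + "C", "A" * 15 + "G"]
table = count_kmers(reads, 15)
for kmer, count in table.most_common():
    print(kmer, count)
```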
Rarefaction of a photograph
Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.
Some parts seem to be dark.
This looks like a portrait.
Start to see the mood.
A tiny bit of graininess left.
“Shot noise” in electrical engineering.
A studio portrait of Jane Goodall.
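The exposure analogy can be simulated directly: treat the scene as a list of pixel intensities and draw a fixed number of photons with probability proportional to intensity. This is a sketch under that assumption; the one-dimensional `scene` and the function name are made up for illustration.

```python
import random
from collections import Counter


def expose(intensities, n_photons, rng):
    """Draw n_photons pixel indices with probability proportional to pixel intensity."""
    pixels = range(len(intensities))
    hits = Counter(rng.choices(pixels, weights=intensities, k=n_photons))
    return [hits[i] for i in pixels]


# A toy one-dimensional "scene": one bright object, dim surroundings, dark edges.
scene = [0, 1, 1, 5, 20, 5, 1, 1, 0, 0]
rng = random.Random(42)  # fixed seed for reproducibility

for n_photons in (10, 100, 10_000):
    image = expose(scene, n_photons, rng)
    lit = sum(1 for count in image if count > 0)
    print(f"{n_photons:>6} photons: {lit} of {len(scene)} pixels lit  {image}")
```

At low photon counts the dim pixels are indistinguishable from the truly dark ones. The same happens to rare sequences in a shallow sequencing run.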
A scientific image
This is a famous scientific image. Anybody recognize it?
Does this help?
There are small patches of brightness.
Were you expecting x-ray diffraction?
At longer exposures, more objects, smaller and dimmer, appear.
This is a part of the Hubble Deep Field image.
Image / sequencing analogy
Analogy to sequencing:
• Most of field is black
• Bright objects have halos
• Contains camera artifacts
• We can’t know what we didn’t see without longer exposures.
Opportunity cost of deep sequencing
This took two weeks to acquire on a one-of-a-kind telescope. Consider the opportunity cost of studying a single sample for two weeks.
STScI did only four long exposures like this in 23 years.
Sampling effort interacts with sequence diversity to produce a “horizon”.
Inferences are supported on the bright parts first, on the dim parts only at higher depth.
Not all the sequences, abundant or rare, are real.
Dim targets come at great cost in sample number.
How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?
Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
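One way to make the question concrete is to stream observations past a growing catalog and report when, for the first time, more than half of the most recent observations were already in it. This is an illustrative sketch, not Nonpareil or Nonpareil-k; the function name, the sliding window, and the toy community are assumptions.

```python
import random
from collections import deque


def novelty_halfway_point(observations, window=100):
    """Return how many observations had been seen when, for the first time, more than
    half of the most recent `window` observations were already in the catalog.
    Returns None if that point is never reached."""
    catalog = set()
    recent = deque(maxlen=window)  # True where the observation was already known
    for n, item in enumerate(observations, start=1):
        recent.append(item in catalog)
        catalog.add(item)
        if len(recent) == window and sum(recent) > window / 2:
            return n
    return None


# Toy community of 500 equally abundant "sequences", sampled with replacement.
rng = random.Random(0)
sampled = [rng.randrange(500) for _ in range(5000)]
print(novelty_halfway_point(sampled))
```

A more diverse or more uneven community pushes this point later, which is the effect the rarefaction tools in the talk quantify.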
Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”: Nonpareil.
I found a way to answer almost the same question 300x faster: Nonpareil-k.
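$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$$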
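The formula on this slide can be sketched in a few lines of Python. Several things here are assumptions rather than statements from the talk: that r_i is a kmer abundance and n_i the number of distinct kmers with that abundance, that ε is the fraction of the data sampled, and that the two Poisson terms form a ratio (seen at least twice, given seen at least once). The function names are hypothetical and this is not the Nonpareil-k implementation.

```python
import math


def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam**i / math.factorial(i) for i in range(k + 1))


def nonunique_fraction(eps, abundances, counts):
    """Abundance-weighted probability that a sampled k-mer is seen more than once,
    given that it is seen at all, when a fraction eps of the data is sampled.

    abundances: r_i, assumed to be k-mer abundances in the full dataset
    counts:     n_i, assumed to be the number of distinct k-mers at abundance r_i
    """
    total = sum(n * r for r, n in zip(abundances, counts))
    result = 0.0
    for r, n in zip(abundances, counts):
        lam = eps * r
        p_seen = 1.0 - poisson_cdf(lam, 0)       # observed at least once
        if p_seen == 0.0:
            continue                             # nothing sampled from this bin
        p_repeated = 1.0 - poisson_cdf(lam, 1)   # observed at least twice
        result += (n * r / total) * (p_repeated / p_seen)
    return result


# Made-up spectrum: many rare k-mers plus a smaller set of abundant ones.
abundances = [1, 10, 100]
counts = [100_000, 10_000, 1_000]
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, abundances, counts), 3))
```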
Nonpareil: model of sequence coverage Georgia Tech
Nonpareil-k: kmer rarefaction Argonne + Georgia Tech
summary of sequence diversity
Nonpareil-k: stratify datasets by coverage distribution
Most of dataset likely contained in assembly.
Assembly is likely to miss or attenuate the large unique fraction of dataset.
Looking for abundance patterns
Let’s look at the greyscale histogram.
Shadows, background, jacket, face and hands
We can even tease out a few patterns in the histogram.
Kmers can tell you genome size and coverage depth
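For an isolate genome with a clear abundance peak, a common heuristic is to take the modal kmer abundance as the kmer coverage depth and divide the total number of kmer observations by it to get genome size. The sketch below uses that heuristic with made-up numbers; it is not the fitting procedure of kmerspectrumanalyzer, and the function names and the `min_abundance` cutoff are assumptions.

```python
from collections import Counter


def kmer_spectrum(kmer_counts):
    """Map abundance -> number of distinct k-mers observed that many times."""
    return Counter(kmer_counts.values())


def estimate_depth_and_size(spectrum, min_abundance=2):
    """Return (k-mer coverage depth, genome size in k-mers) from an abundance histogram.

    K-mers below min_abundance are treated as likely errors and ignored.
    """
    solid = {a: n for a, n in spectrum.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    depth = max(solid, key=lambda a: solid[a])          # modal abundance
    total_observations = sum(a * n for a, n in solid.items())
    return depth, total_observations / depth


# Toy spectrum: 5000 singleton (error-like) k-mers and a peak around abundance 30.
toy_spectrum = {1: 5000, 29: 100, 30: 1000, 31: 100}
depth, size = estimate_depth_and_size(toy_spectrum)
print(depth, size)
```

Note that kmer coverage depth is somewhat lower than base coverage depth, because a read of length L contributes only L - k + 1 kmers.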
Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction / clustering / assembly works on subsets of the data with high sequence depth.
Abundance-based inferences are better in the high-abundance part of the data.
But I want to sequence everything! OK, we can count kmers in everything too.
kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth, … but what it’s really good at
Kmers show problems in datasets
• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
• Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis
• Many datasets have as much as 5-45% of the sequence yield in adapters.
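A few of these symptoms can be screened for with very simple summaries. The sketch below is illustrative only: the function names, the abundance cutoff for "solid" kmers, and the choice of adapter string passed in by the caller are assumptions, and it counts reads containing an adapter rather than the share of sequence yield the slide refers to.

```python
from collections import Counter


def duplicate_read_fraction(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(reads)
    return sum(count - 1 for count in counts.values()) / len(reads)


def solid_kmer_fraction(kmer_counts, min_abundance=2):
    """Fraction of distinct k-mers observed at least min_abundance times."""
    solid = sum(1 for count in kmer_counts.values() if count >= min_abundance)
    return solid / len(kmer_counts)


def adapter_read_fraction(reads, adapter):
    """Fraction of reads containing the given adapter sequence as an exact substring."""
    return sum(1 for read in reads if adapter in read) / len(reads)


# Made-up reads; the adapter prefix is supplied by the caller.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTGACCAGATCGGAAGAGC"]
print(duplicate_read_fraction(reads))
print(adapter_read_fraction(reads, "AGATCGGAAGAGC"))
print(solid_kmer_fraction({"AAA": 5, "AAC": 1, "ACG": 1, "CGT": 2}))
```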
Generalities from the kmer counting mines
• FEW DATASETS have well-separated abundance peaks (of the sort MetaVelvet was engineered to find).
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions).
Hey kid, you want some unlabeled data? (Kevin Keegan, Argonne National Laboratory)
[Figure 1c: PCO2 vs alpha diversity. x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7]
[Figure 1d: HMP / quantile norm / euclidean / colored by alpha]
MG-RAST API, R package matR
I’m not sure how to do science with an unlabeled pile of datasets.
Hey kid, you want some pretty ordinations? (Kevin Keegan, Argonne National Laboratory)
[Figure 2a]
[Figure 2b]
Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Rachel and I volunteer with
We teach scientists how to get more done
Woods Hole
Tufts
U. Chicago
U. Chicago
UIC
Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens
Formerly of Yale: Howard Ochman, David Williams
Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas