Sequencing run grief counseling: counting kmers at MG-RAST


Will Trimble, metagenomic annotation group, Argonne National Laboratory

April 29, 2014 UIC

Apology: I speak biology with an accent

•  I spent six years in dark rooms with lasers.

•  Now I use computers to analyze high-throughput sequence data.

•  I introduce myself as an applied mathematician.

•  Finding scoring functions to use ambiguous data to answer life’s persistent questions.

•  Shoveling data from the data-producing machine into the data-consuming furnace.


Outline

•  Sequences are different (math)
•  Sequencing is like photography (pictures)
•  Sequencing is beautiful: thumbnailpolish (micrographs)
•  How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)

Sequences are different

•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.

Low-throughput categorical data: categories are sound.

Instrument readings, spectra, micrographs: not categorical.




@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD



High-throughput sequence data: categories uncertain.




10^0-10^2    10^2-10^7    10^12-10^80

Experiment design    Sequencing run    Sequence data

Assembly, annotation

SEED, M5NR

489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B

So we reduce sequence data to categorical data.

Forward-backward problem

[Same diagram as the previous slide (experiment design, sequencing run, sequence data, assembly and annotation, SEED and M5NR hit counts), relabeled 10^12, 10^3-10^5, 10^0-10^1.]



•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

Searching




Same answer for both puzzles: you go to this website…




How long do reads need to be to recognize them?

How long do phrases need to be to recognize them?

How long do reads need to be?

Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution (a model of the data).
Profound applicability in machine learning and probabilistic modeling.

$$ H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right) $$
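A minimal Python sketch of the entropy sum, H = sum_i p_i log2(1/p_i), for readers who want to try it on their own distributions. The function names are illustrative and not from the talk.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def empirical_entropy(symbols):
    """Entropy of the observed symbol frequencies in a string or list."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())

# Four equally likely bases carry the maximum of 2 bits per base.
uniform_bases = shannon_entropy([0.25] * 4)
# A skewed base composition carries less.
skewed_bases = shannon_entropy([0.7, 0.1, 0.1, 0.1])
```

The uniform case is where the "2 bits per base" ceiling comes from; any compositional bias lowers it.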

How long do phrases need to be?

Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10
    type the first n words into Google Books, quoted.
    break if Google identifies your book.

•  Information content of English words: H_word is ca. 12 bits per word.
•  Size of Google Books? Big libraries have a few 10^7 books, and each one has 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = log2(2^39.9), or about 40 bits.
•  So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in Google’s index.
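The back-of-envelope arithmetic for phrases can be checked in a few lines of Python. The numbers are the slide's order-of-magnitude estimates; the variable names are only illustrative.

```python
import math

bits_per_word = 12            # slide's estimate of information per English word
books = 10**7                 # order of magnitude for a big library
words_per_book = 10**5        # indexed words per book
database_words = books * words_per_book

# Bits needed to single out one position in the database.
bits_needed = math.log2(database_words)
# Whole words needed to supply that many bits.
words_needed = math.ceil(bits_needed / bits_per_word)
```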

Try it.






Usually nails your source in four words.

•  Maximum information content of base pairs: H_read ≤ 2ℓ bits per length-ℓ sequence.
•  Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2), or about 34 bits.
•  So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.

•  That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
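The same estimate for reads, as a small Python sketch. It assumes the idealized ceiling of 2 bits per base and ignores repeats and sequencing errors, so it gives a lower bound rather than a practical read length; the function name is illustrative.

```python
import math

def min_read_length(genome_size_bp, bits_per_base=2.0):
    """Shortest length whose maximum information content covers log2(genome size)."""
    bits_needed = math.log2(genome_size_bp)
    return math.ceil(bits_needed / bits_per_base)

# Genome size from the slide, ca. 10^10 bp.
length_for_slide_genome = min_read_length(10**10)
```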



Short sequences end up being very distinctive, even fingerprint-like.


Check: Human reference genome

The data deluge

•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of tens of Gbytes of sequence data at once.

•  The data has outgrown some favorite algorithms from the 1990s (BLAST).

http://www.mcs.anl.gov/~trimble/flowcell/

thumbnailpolish

Rarefaction of a photograph

A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of sequences that land in each possible sequence.

I actually think of a sequencer like a multichannel genetic spectrometer.


The genetic spectrometer

With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.

Species diversity

ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6
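A kmer count table like the one on this slide can be sketched in plain Python. This is a toy illustration, not the counter used at MG-RAST: it holds everything in memory, does not merge reverse complements, and the function names are made up here.

```python
from collections import Counter

def count_kmers(reads, k=15):
    """Count every length-k substring of each read, skipping kmers with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def kmer_spectrum(counts):
    """Map each abundance to the number of distinct kmers seen that many times."""
    return Counter(counts.values())

toy_counts = count_kmers(["AAAAAA", "AAAAAC"], k=5)
toy_spectrum = kmer_spectrum(toy_counts)
```

The spectrum (abundance versus number of distinct kmers) is the "histogram" view of the same table.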


Gene diversity

Sequence diversity


Rarefaction of a photograph

Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.


some parts seem to be dark.



This looks like a portrait



Start to see the mood



A tiny bit of graininess left


“shot noise” in electrical engineering


A studio portrait of Jane Goodall

A scientific image

This is a famous scientific image. Anybody recognize it?

Does this help?

There are small patches of brightness.

Were you expecting x-ray diffraction?

At longer exposures more objects, smaller and dimmer, appear.

This is a part of the Hubble Deep Field image.

Image / sequencing analogy

Analogy to sequencing:
•  Most of the field is black
•  Bright objects have halos
•  Contains camera artifacts
•  We can’t know what we didn’t see without longer exposures.

Opportunity cost of deep sequencing

This took two weeks to acquire on a one-of-a-kind telescope. Consider the opportunity cost of studying a single sample for two weeks.

STScI did only four long exposures like this in 23 years.


Sampling effort interacts with sequence diversity to produce a “horizon”.
Inferences are supported on the bright parts first, on the dim parts only at higher depth.
Not all the sequences, abundant or rare, are real.
Dim targets come at great cost in sample number.

How much novelty is in my dataset?



How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.

Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”: Nonpareil.

I found a way to answer almost the same question 300x faster: Nonpareil-k.

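$$ \mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)} $$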
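A hedged Python sketch of the non-unique fraction. It reads the slide's formula as a weighted sum over abundance classes of P(X >= 2) / P(X >= 1), with X Poisson-distributed with mean eps * r_i and weights n_i * r_i / sum_j n_j * r_j. That reading is an assumption, as is the interpretation of r_i as a kmer abundance, n_i as the number of distinct kmers at that abundance, and eps as the subsampling fraction. This is not the Nonpareil-k source code.

```python
import math

def poisson_cdf(mean, k):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def nonunique_fraction(eps, r, n):
    """Weighted average over abundance classes of P(seen >= 2 times | seen >= 1 time)."""
    weights = [n_i * r_i for n_i, r_i in zip(n, r)]
    total_weight = sum(weights)
    result = 0.0
    for w, r_i in zip(weights, r):
        lam = eps * r_i
        seen_once_or_more = 1.0 - poisson_cdf(lam, 0)
        seen_twice_or_more = 1.0 - poisson_cdf(lam, 1)
        if seen_once_or_more > 0:
            result += (w / total_weight) * seen_twice_or_more / seen_once_or_more
    return result

# Hypothetical spectrum: many rare kmers, few abundant ones.
abundances = [1, 10, 100]
distinct_kmers = [1000, 100, 10]
curve = [nonunique_fraction(eps, abundances, distinct_kmers) for eps in (0.01, 0.1, 1.0)]
```

Evaluating the function over a range of eps values traces a rarefaction-style curve that rises from 0 toward 1 as sampling effort grows.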


Nonpareil-k

Nonpareil: model of sequence coverage (Georgia Tech)

Nonpareil-k: kmer rarefaction (Argonne + Georgia Tech)

summary of sequence diversity

Nonpareil-k: stratify datasets by coverage distribution

Most of the dataset is likely contained in the assembly.

Assembly is likely to miss or attenuate the large unique fraction of the dataset.

Looking for abundance patterns


Let’s look at the greyscale histogram



Shadows

Background Jacket Face and hands

We can even tease out a few patterns in the histogram

Kmers can tell you genome size and coverage depth
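One common heuristic for reading genome size off a kmer spectrum, sketched in Python: take the abundance peak as the coverage depth, then divide the total number of kmer observations by it. This is a textbook-style estimate under the assumption of a single isolate genome with one clear peak; it is not claimed to be what kmerspectrumanalyzer does, and the `min_abundance` cutoff for discarding likely error kmers is an arbitrary illustrative choice.

```python
def estimate_genome_size(spectrum, min_abundance=3):
    """spectrum maps kmer abundance -> number of distinct kmers with that abundance.

    Returns (estimated genome size in kmers, estimated coverage depth).
    """
    # Drop the low-abundance tail, which is dominated by sequencing errors.
    trusted = {a: c for a, c in spectrum.items() if a >= min_abundance}
    if not trusted:
        raise ValueError("no kmers at or above min_abundance")
    # Coverage depth: the abundance shared by the most distinct kmers.
    coverage = max(trusted, key=trusted.get)
    total_kmers = sum(a * c for a, c in trusted.items())
    return total_kmers / coverage, coverage

# Hypothetical spectrum: an error tail at abundance 1-2 and a peak near 30x.
toy = {1: 50000, 2: 1000, 28: 800, 29: 900, 30: 1000, 31: 900, 32: 800}
size, depth = estimate_genome_size(toy)
```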


Redundancy is good

•  OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.

•  OMG! I see this sequence 10 million times.

•  OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbases of memory.

•  Error correction / clustering / assembly works on subsets of the data with high sequence depth.


Abundance-based inferences are better in the high-abundance part of the data.

But I want to sequence everything! OK, we can count kmers in everything too.

kmerspectrumanalyzer summarizes the distribution, estimates genome size, coverage depth, … but what it’s really good at

Kmers show problems in datasets

•  Amok PCR: seemingly random sequences
•  Amok MDA: 10 Gbases of sequence, one gene
•  PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
•  Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
•  Contaminated samples: 95% E. coli, 5% E. faecalis
•  Many datasets have as much as 5-45% of the sequence yield in adapters.
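Two of these warning signs are easy to compute in a rough way. The sketch below measures the exact-duplicate read fraction and the fraction of kmer observations that are "solid". Defining "solid" as abundance at or above a fixed threshold is an assumption made here for illustration; neither function is taken from kmerspectrumanalyzer.

```python
from collections import Counter

def duplicate_read_fraction(reads):
    """Fraction of reads that are exact copies of another read already counted."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)

def solid_kmer_fraction(kmer_counts, threshold=2):
    """Fraction of kmer observations belonging to kmers seen at least `threshold` times."""
    total = sum(kmer_counts.values())
    if total == 0:
        return 0.0
    solid = sum(c for c in kmer_counts.values() if c >= threshold)
    return solid / total

dup = duplicate_read_fraction(["ACGT", "ACGT", "ACGT", "TTTT"])
solid = solid_kmer_fraction(Counter({"AAA": 5, "AAC": 1, "ACG": 4}))
```

For an isolate genome sequenced deeply, a low solid fraction hints at a high error rate; a very high duplicate fraction hints at PCR duplication.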

Generalities from the kmer counting mines

•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find)

•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions)
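If "geometric" is read as a roughly straight line on log-log axes (an assumption about what the slide means), the rank-abundance relationship can be summarized by a single slope. The Python sketch below fits that slope by ordinary least squares; it is a descriptive summary only, not a fit to any of the named distributions.

```python
import math

def rank_abundance_slope(abundances):
    """Least-squares slope of log(abundance) versus log(rank), most abundant kmer first."""
    ranked = sorted(abundances, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two abundances")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(a) for a in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical abundances that fall off exactly as 1/rank.
toy_slope = rank_abundance_slope([1000.0 / rank for rank in range(1, 101)])
```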

Hey kid, you want some unlabeled data?
Kevin Keegan, Argonne National Laboratory

[Figure 1c: PCO2 vs. alpha diversity. x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7]

[Figure 1d: HMP / quantile norm / euclidean / colored by alpha]

MG-RAST API, R package matR


I’m not sure how to do science with an unlabeled pile of datasets.

Hey kid, you want some pretty ordinations?
Kevin Keegan, Argonne National Laboratory

[Figure 2a]

[Figure 2b]

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Rachel and I volunteer with

We teach scientists how to get more done

Woods Hole

Tufts

U. Chicago

U. Chicago

UIC

Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens

Formerly of Yale: Howard Ochman, David Williams
Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas
