Page 1
MEWE Workshop
Principles, potential, and limitations of novel molecular methods in water engineering; from amplicon sequencing to omics methods
Programme
9:00 Introduction, Per Halkjær Nielsen, Aalborg University
9:10 Amplicon sequencing, Trina McMahon, University of Wisconsin-Madison
10:10 Importance of a curated 16S database, Aaron Saunders, Aalborg University
10:40 Break
11:00 DNA extraction and primer selection, Søren Karst, Aalborg University
11:30 Discussion in groups/questions
12:15 Lunch
Page 2
12:15 Lunch
13:15 Metagenomics, principles, potential and problems, Mads Albertsen, Aalborg University
14:30 Metatranscriptomics, principles, potential and problems, Rohan Williams, SCELSE, Singapore
15:30 Break
15:45 Informatics and data management, Trina McMahon, University of Wisconsin-Madison
16:15 Discussion in groups/questions
17:00 Closing, Per Halkjær Nielsen
Page 3
Amplicon Sequencing
Trina McMahon
University of Wisconsin – Madison
(standing in for Pat Schloss)
Page 4
What is amplicon sequencing?
Anything that requires PCR-based amplification of a specific target gene (locus)
Page 5
First things first
• What is your question or hypothesis?
• How can you answer your question or test your hypothesis using the smallest amount of resources?
  – Replication
  – Treatments/controls
  – Time series
  – Collection effort (depth of sampling)
Page 7
Principles
• Choice of locus
  – SSU/16S rRNA gene
  – “Functional” genes (amoA, ppk1, narG, napA, nifH)
• Choice of sequencing approach
  – Clone libraries and Sanger sequencing
  – Barcoded/multiplexed 454 pyrosequencing
  – Barcoded/multiplexed Illumina
• Choice of primers
  – Depends on the above two choices!
• Choice of data analysis pipeline
  – Software
  – Taxonomy training set
Page 8
~ 1400 bases of SSU rDNA from EBPR reactor
Page 9
Seq 1 ..AGCCCUGGUCGCA..
Seq 2 ..ACCCCUGGACUGUCGGA..
Page 10
Seq 1 ..AGCCCUG----GUCGCA..
Seq 2 ..ACCCCUGGACUGUCGGA..
Page 11
Seq 1 ..AGCCCUG----GUCGCA..
      ..|x|||||----||||x|..
Seq 2 ..ACCCCUGGACUGUCGGA..
Page 13
Distance (or “difference”) matrix
Fractional identity
Fractional difference
Note: difference = 1 - identity
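The identity/difference relationship can be illustrated with the aligned pair from the earlier slide. This is a minimal sketch, not any particular tool's method: the function name `fractional_identity` and the `count_gaps` switch are assumptions made for illustration, and tools differ in how they treat gap positions.

```python
def fractional_identity(a, b, count_gaps=False):
    """Fraction of aligned positions at which two sequences match.

    Assumes `a` and `b` are already aligned (equal length, '-' for gaps).
    With count_gaps=False, columns containing a gap are ignored;
    with count_gaps=True, each gap column counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        has_gap = x == "-" or y == "-"
        if has_gap and not count_gaps:
            continue
        compared += 1
        if not has_gap and x == y:
            matches += 1
    return matches / compared


# The aligned pair shown on the alignment slide.
seq1 = "AGCCCUG----GUCGCA"
seq2 = "ACCCCUGGACUGUCGGA"

identity = fractional_identity(seq1, seq2)
difference = 1 - identity
print(f"identity = {identity:.3f}, difference = {difference:.3f}")
```

Ignoring gap columns, 11 of the 13 compared positions match; counting the four gap columns as mismatches gives 11 of 17. Which convention is used changes the distance matrix, so it is worth checking in whatever pipeline you run.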
Page 15
The Big Tree
Pace, 1997, Science, 276:734
Page 16
Ashelford K E et al. Appl. Environ. Microbiol. 2005;71:7724-7736
PMID: 12692101
Certain regions of the 16S rRNA vary more in sequence than others
So-called “hyper-variable regions” are targeted by tag sequencing primer sets
Page 17
Regions of interest within 16S rRNA gene
V3 V4 V5
Amplicon lengths:
V4: 253 bp
V34: 429 bp
V45: 375 bp
Amount of overlap for 2x250 bp reads:
V4: 247 bp
V34: 71 bp
V45: 125 bp
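The overlap figures follow from simple arithmetic: two reads of 250 bp cover 500 bp in total, so whatever exceeds the amplicon length is read twice. A short sketch, where the function name `read_overlap` is hypothetical and the assignment of the three amplicon lengths to V4, V34 and V45 is inferred from the overlap values on the slide:

```python
def read_overlap(amplicon_len, read_len=250):
    """Bases covered by both reads of a pair (negative means a gap)."""
    return 2 * read_len - amplicon_len


# Amplicon lengths in bp; region labels inferred from the slide's overlaps.
amplicons = {"V4": 253, "V34": 429, "V45": 375}

overlaps = {region: read_overlap(length) for region, length in amplicons.items()}
for region, overlap in overlaps.items():
    print(f"{region}: {overlap} bp overlap")
```

A large overlap, as for V4, means almost every base is sequenced twice, which helps when resolving disagreements between the forward and reverse reads.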
Page 18
sample gDNA → amplified PCR product with barcode → sequencer → ~10^6 – 10^9 barcoded reads → sequences sorted by sample of origin
Page 19
Overview workflow (generic)
Page 20
Example 454 data
>GQY1XT001A6MUA
AATGGTACCCGTCAATTCATTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGTGGATCACTTACTGCGTTTGCTGCGGCACCGGAGGTTCTTGAACCCCCGACACCTAGTGATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAACGTCAGTTACAGTCCAGTAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCACTTACCTCTCCTGCACTCCAGTCATACAGTTTCCAATG
>GQY1XT001BTRWS
AATGGTACCCGTCAATTCCTTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGTGGATTACTTAATGCGTTTGCGGCGGCACCGGAGGGCCTTGGCCCCCCGACACCTAGTAATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAACGTCAGTTACAGTCCAGTAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTACCTCTCCTGCACTCGAGCTGCACAGTTTCCAAAGCAGTTCCGGGGTTGGG
>GQY1XT001BBPBR
AATGGTACCCGTCAATTCATTTGAGTTTCACCGTTGCCGGCGTACTCCCCAGGTGGGATGCTTAACGCTTTCGCTTTGCCACCCAGGCCCCATTCGGCCCGGACAGCTGGCATCCATCGTTTACTGTGCGGACTACCAGGGTATCTAATCCTGTTCGATCCCCGCACTTTCGTGCCTCAGCGTCAGTAGGGCGCCGGAAGGCTGCCTTCGCAATCGGGGTTCTGCGTGATATCTATGCATTTCACCGCTACACCACGCATTCCGCCTTCTTCTCGCCCACTCAAGGCCCCCAGTTTCAACGG
>GQY1XT001BDDE9
AATGGTACCCGTCAATTCCTTTAAGTTTCATTCTTGCGAACGTACTCCCCAGGTGGATCACTTACTGCGTTTGCTGCGGCACCGATGGGTCCATACCCACCCACACCTAGTAATCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAACGTCAGTTACAGTCCAGCAGGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCCTGCCTCTCCTGCACTCCAGTTACACAGTTTCCAGAG
>GQY1XT001CIUF3
AATGGTACCCGTCAATTCCTTTGAGTTTCATTCTTGCGAACGTACTCCCCAGGCGGAATACTTACTGCGTTTGCTGCGGCACCGGCGGGCCGTGCCCGCCGACACCTGGTATTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTCGTCGTCCAGCAGGCCGCCTTCGCCACCGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCCTGCCCCTCCGACACTCCAGCCCGGCAGTTTCCAGTGCAGTCCCGGGGTT
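Reads like these arrive as FASTA records: a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such records in plain Python. The function name `parse_fasta`, the read IDs and the short sequences are all made up for illustration; in practice an established parser from your pipeline of choice is the safer option.

```python
def parse_fasta(text):
    """Return a dict mapping each read ID to its sequence."""
    records = {}
    read_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after ">".
            read_id = line[1:].split()[0]
            records[read_id] = ""
        elif read_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[read_id] += line
    return records


# Toy input with hypothetical read names and sequences.
example = """>read_1
AATGGTACCC
GTCAATTC
>read_2 optional description
AATGGTAC
"""

reads = parse_fasta(example)
lengths = {read_id: len(seq) for read_id, seq in reads.items()}
print(lengths)
```

Read-length summaries like `lengths` are a common first sanity check, since 454 reads vary in length from one read to the next.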
Page 21
Clustering (and picking OTUs)
singletons
Page 24
Assigning taxonomies
>378462
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAACAGATAAGGAGCTTGCTCCTTTGACGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTACCTATAAGACTGGA...
>186233
AGAGTTTGATCCTGGCTCAGGATGAACACTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCATCAGTTTGGTTTGCTTGCAAACCAAAGCTGGCGACCGGCGCACGGGTGAGTAACAC...
>260529
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGAAGCATAAGGGAAGGAAGATTCGTCTGACGGAACTTATGACTGAGTGGCGGACGGGTGA...
>256122
CCTGGCTCACAATCACGAAGGAGAGGCGTGCGTAACACATGCAAGTCGACACGGGAGAGCGTGAGGCAACTCCGCAAGTATAGTGGCAGACGGGTGAGTAACACGTGAACAACCTACCCT...
>312796
AGTGGCGAACGGGTGAGTAACGCGTGAGGAACCTGCCTTTCAGAGGGGGACAACAGTTGGAAACGACTGCTAATACCGCATAATACGGTCTGACCGCATGATCGGATCGTCAAAGATTTA...
>574086
CCGCAAGGGGAGTGGCAGACGGGTGAGTAACGCGTGGGAACCTTCCCAGTGGTACGGAATAACCCAGGGAAACCTGAGCTAATACCGTATACGCCCGAAAGGGGAAAGATTTATCGCCAT...
Page 25
Assigning taxonomies
378462 k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__;
186233 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides;s__Parabacteroides distasonis;
260529 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Clostridium;s__;
256122 k__Bacteria;p__Acidobacteria;c__MVS-40;o__;f__;g__;s__;
312796 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__;
574086 k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__;s__;
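Each taxonomy string pairs a sequence ID with a semicolon-separated lineage, where a one-letter prefix (k, p, c, o, f, g, s) marks the rank and an empty name means the sequence could not be classified at that level. A minimal parsing sketch, in which `parse_taxonomy` and `deepest_rank` are hypothetical helper names rather than functions from any pipeline:

```python
# Rank prefixes as they appear in the lineage strings, shallowest first.
RANK_NAMES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(line):
    """Split 'ID k__...;p__...;' into (ID, {rank: name or None})."""
    seq_id, lineage = line.split(None, 1)
    ranks = {}
    for field in lineage.split(";"):
        field = field.strip()
        if not field:
            continue
        prefix, _, name = field.partition("__")
        # An empty name (e.g. 'g__') means unclassified at that rank.
        ranks[RANK_NAMES[prefix]] = name or None
    return seq_id, ranks


def deepest_rank(ranks):
    """Name of the most specific rank that has an assignment."""
    deepest = None
    for rank in RANK_NAMES.values():
        if ranks.get(rank):
            deepest = rank
    return deepest


line = "256122 k__Bacteria;p__Acidobacteria;c__MVS-40;o__;f__;g__;s__;"
seq_id, ranks = parse_taxonomy(line)
print(seq_id, deepest_rank(ranks), ranks["class"])
```

Tallying `deepest_rank` over all sequences gives a quick picture of how far down the hierarchy a given reference taxonomy can actually classify your community.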
Page 27
Pyrosequencing
• Next generation sequencing technology
• Ability to generate ~500,000 sequences in an afternoon
• Can barcode sequences to sequence many samples in a single run
• Reads are getting longer
• $10,000-15,000 per run
Schloss et al. (2011) PLoS ONE 6:e27310
Page 30
Caporaso et al 2012 ISMEJ 6:1621-1624
Page 31
Other methods…
• IonTorrent
  – Tons of short crappy reads
  – Not worth the effort
• PacBio
  – Modest number of long reads
  – Not worth the effort
• Stick with 454 or MiSeq (preferred)
Page 32
Costs are falling
• Very cheap
  – Schloss lab sequenced ~30 plates by 454 for $4000 per plate ~ $120,000
  – Could re-do everything on MiSeq in 8 runs for $1500 per run ~ $12,000
• Cost is in DNA extraction and analysis
  – ~$8.00 per sample to get DNA
  – ~$5.00 per sample to sequence
Page 33
Data analysis pipelines
Page 34
The Major Players (for 16S-tag amplicons)
• Pat Schloss, UMichigan – mothur
  – Command line
  – Coded in C++ but distributed as compiled
  – Excellent documentation
• Rob Knight and friends, UColorado – QIIME
  – Command line
  – Coded in Python
  – Can run as a “Virtual Box”
  – Pretty good documentation
• Ribosomal Database Project, MSU – RDP
  – Web interface
  – Pretty good documentation
Page 35
Others
• Victor Kunin and Phil Hugenholtz, JGI – Pyrotagger
• Eric Triplett and friends, UFlorida – PANGEA
• Kumar and friends, UOslo – CLOTU
• Fricke and friends, UMaryland – CloVR
• Schloetterer, Austria – CANGS
• Sogin and friends, MBL – VAMPS
• Quince/Curtis/Sloan, UGlasgow – AmpliconNoise/Pyronoise
• Greg Hannon, CSHL – FASTX-Toolkit
• Claros and friends, Malaga Spain – SeqTrim
Page 36
Discussion questions
1. How do you think the choice of sequencing technology affects the results?
2. How do you think the choice of primers affects the results?
3. Which data analysis tools do you use and why? What differences do you perceive between mothur, QIIME, RDP, etc.?
4. Which kinds of questions can you answer using amplicon sequencing, and which can you not?
5. Which part of the amplicon sequencing process intimidates you the most and why?
Page 39
Which microbial organisms are represented by the rRNA gene sequences in each sample?
>PC.634_1 FLP3FBN01ELBSX
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATCCCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGT
>PC.354_3 FLP3FBN01EEWKD
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGGCTATGCATCATTGCCTTGGTAAGCCGTTACCTTACCAACTAGCTAATGCACCGCAGGTCCATCCAAGAGTGATAGCAGAACCATCTTTCAAACTCTAGACATGCGTCTAGTGTTGTTATCCGGTATTAGCATCTGTTTCCAGGTGTTATCCCAGTCTCTTGGG
rRNA reference database (sequences are available for each ‘tip’ in the tree)
Search against reference sequences
Page 41
Assign millions of sequences from thousands of samples to reference
Compare samples statistically and visually
www.qiime.org
Assign reads to samples
>GCACCTGAGGACAGGCATGAGGAA…
>GCACCTGAGGACAGGGGAGGAGGA…
>TCACATGAACCTAGGCAGGACGAA…
>CTACCGGAGGACAGGCATGAGGAT…
>TCACATGAACCTAGGCAGGAGGAA…
>GCACCTGAGGACACGCAGGACGAC…
>CTACCGGAGGACAGGCAGGAGGAA…
>CTACCGGAGGACACACAGGAGGAA…
>GAACCTTCACATAGGCAGGAGGAT…
>TCACATGAACCTAGGGGCAAGGAA…
>GCACCTGAGGACAGGCAGGAGGAA…
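Assigning reads to samples (demultiplexing) amounts to matching the barcode at the start of each read against a table of known barcodes. The sketch below shows the idea only: the function name `demultiplex`, the 4-base barcodes, the sample names and the reads are all invented, and it requires exact matches, whereas real pipelines typically also tolerate or correct barcode errors.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact match on the leading barcode.

    Returns (reads_by_sample, unassigned). The barcode is trimmed from
    each assigned read.
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No known barcode at the start of this read.
            unassigned.append(read)
    return dict(by_sample), unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGG", "TGCACCCC", "ACGTTTTT", "NNNNAAAA"]

by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample)
print(unassigned)
```

The fraction of reads that end up in `unassigned` is worth tracking, since a high value usually points to a mapping-file or barcode-design problem rather than to the samples themselves.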
Page 42
OTU picking
• De novo
  – Reads are clustered based on similarity to one another.
• Reference-based
  – Closed reference: any reads which don’t hit a reference sequence are discarded
  – Open reference: any reads which don’t hit a reference sequence are clustered de novo
http://qiime.org/tutorials/otu_picking.html
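The three strategies differ only in what happens to reads that fail to hit the reference. The toy sketch below makes that explicit. It is not how QIIME implements OTU picking: the function names are hypothetical, reads are assumed to be pre-aligned and of equal length, identity is a plain per-position comparison, and the de novo step is a simple greedy seed-based clustering.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best reference hit; return (otus, failures)."""
    otus, failures = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_score >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            failures.append(read)  # discarded in closed-reference picking
    return otus, failures


def de_novo(reads, threshold=0.97):
    """Greedy clustering: each read joins the first seed it matches."""
    seeds, otus = [], {}
    for read in reads:
        for i, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                otus[f"denovo{i}"].append(read)
                break
        else:
            otus[f"denovo{len(seeds)}"] = [read]
            seeds.append(read)
    return otus


def open_reference(reads, reference, threshold=0.97):
    """Closed-reference first, then cluster the failures de novo."""
    otus, failures = closed_reference(reads, reference, threshold)
    otus.update(de_novo(failures, threshold))
    return otus


# Toy data: one reference sequence and four 20-base reads.
reference = {"ref1": "ACGTACGTACGTACGTACGT"}
reads = [
    "ACGTACGTACGTACGTACGT",  # identical to ref1
    "ACGTACGTACGTACGTACGA",  # one mismatch (95% identity)
    "TTTTGGGGCCCCAAAATTTT",  # no close reference
    "TTTTGGGGCCCCAAAATTTT",
]

closed, failed = closed_reference(reads, reference, threshold=0.9)
opened = open_reference(reads, reference, threshold=0.9)
print({k: len(v) for k, v in closed.items()}, len(failed))
print({k: len(v) for k, v in opened.items()})
```

In this toy run the closed-reference result loses the two novel reads, while the open-reference result keeps them as a new OTU whose seed is a sequencing read rather than a curated reference sequence.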
Page 43
De novo OTU picking
• Pros
  – All reads are clustered
• Cons
  – Not parallelizable
  – OTUs may be defined by erroneous reads
pick_de_novo_otus.py
http://qiime.org/tutorials/tutorial.html
Page 44
De novo OTU picking
• You must use it if:
  – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene.
• You cannot use it if:
  – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA.
  – You are working with very large data sets, like a full HiSeq 2000 run. (Technically you can, but it will be really slow.)
pick_de_novo_otus.py
http://qiime.org/tutorials/tutorial.html
Page 45
Closed-reference OTU picking
• Pros
  – Built-in quality filter
  – Easily parallelizable
  – OTUs are defined by high-quality, trusted sequences
• Cons
  – Reads that don’t hit the reference dataset are excluded, so you can never observe new OTUs
pick_closed_reference_otus.py
Page 46
Closed-reference OTU picking
• You must use it if:
  – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA. Your reference sequences must span both of the regions being sequenced.
• You cannot use it if:
  – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene.
pick_closed_reference_otus.py
Page 47
Percentage of reads that do not hit the reference collection, by environment type.
Page 48
Open-reference OTU picking
• Pros
  – All reads are clustered
  – Partially parallelizable
• Cons
  – Only partially parallelizable
  – Mix of high quality sequences defining OTUs (i.e., the database sequences) and possible low quality sequences defining OTUs (i.e., the sequencing reads)
pick_open_reference_otus.py
http://qiime.org/tutorials/illumina_overview_tutorial.html
http://qiime.org/tutorials/open_reference_illumina_processing.html
http://qiime.org/tutorials/fungal_its_analysis.html
Page 49
Open-reference OTU picking
• You cannot use it if:
  – You are comparing non-overlapping amplicons, such as the V2 and the V4 regions of the 16S rRNA.
  – You do not have a reference sequence collection to cluster against, for example because you're working with an infrequently used marker gene.
pick_open_reference_otus.py
http://qiime.org/tutorials/illumina_overview_tutorial.html
http://qiime.org/tutorials/open_reference_illumina_processing.html
http://qiime.org/tutorials/fungal_its_analysis.html
Page 50
pick_open_reference_otus.py
http://qiime.org/tutorials/open_reference_illumina_processing.html
Subsampled open reference OTU picking scales to billions of reads
Page 51
Read assignment is different for shotgun data, but not that different. In general, the bottleneck is identifying/compiling a reference database.
map_reads_to_reference.py
parallel_map_reads_to_reference.py
http://qiime.org/tutorials/shotgun_analysis.html
http://qiime.org/scripts/map_reads_to_reference.html