36626 - Next Generation Sequencing Analysis Species level quantitative metagenomics Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics

Sep 09, 2019


36626 - Next Generation Sequencing Analysis

Preprocessing and SNP calling Natasja S. Ehlers, PhD student Center for Biological Sequence Analysis Functional Human Variation Group

Species level quantitative metagenomics

Simon Rasmussen

36626: Next Generation Sequencing analysis

DTU Bioinformatics


Nothing really new here

Nothing new - but the technology


Classical measures

• Abundance

• Richness

• Rarefaction

• Diversity

Global terrestrial diversity


Abundance (counts)


Species richness

• The number of different species in a system

13/06/2016 Quantitative Metagenomics, DTU Systems Biology, Technical University of Denmark

Species        Abundance (count)
Lion           64
Zebra          128
Giraffe        64
Leopard        64
Rhinoceros     64
Hippopotamus   128
Gazelle        128
Elephant       64
Monkey         9

9 observed species
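As a minimal sketch, richness and total abundance can be read directly from the counts listed above. The dictionary layout and the function name `richness` are illustrative choices, not part of the slides.

```python
# Counts from the abundance listing above (species -> number of animals seen).
counts = {
    "Lion": 64, "Zebra": 128, "Giraffe": 64, "Leopard": 64, "Rhinoceros": 64,
    "Hippopotamus": 128, "Gazelle": 128, "Elephant": 64, "Monkey": 9,
}


def richness(counts):
    """Number of species observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


observed_richness = richness(counts)
total_abundance = sum(counts.values())
print(observed_richness, total_abundance)  # 9 713
```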


Rarefaction

• Species richness is a function of the number of observations

• When have we sampled enough?

Rarefaction curve
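One way to draw such a curve without simulation is to compute the expected richness of a random subsample of n observations. The sketch below assumes sampling without replacement (the hypergeometric form of rarefaction) and reuses the animal counts from the richness example. The function name `expected_richness` and the chosen depths are illustrative.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of species in a random subsample of n observations
    drawn without replacement (hypergeometric rarefaction)."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total count")
    all_draws = comb(total, n)
    # For each species: 1 - P(species is absent from the subsample).
    return sum(1 - comb(total - c, n) / all_draws for c in counts if c > 0)


counts = [64, 128, 64, 64, 64, 128, 128, 64, 9]
curve = [(n, expected_richness(counts, n)) for n in (0, 10, 50, 100, 300, 713)]
for n, s in curve:
    print(n, round(s, 2))
```

When the curve flattens, additional observations add few new species, which is one practical answer to "when have we sampled enough?".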


Species diversity

• “Effective number of species” represented in a dataset

• Richness & Evenness

• Alpha diversity (within sample)

• Beta diversity (between samples)


Alpha diversity

• Shannon index

• Quantifies the entropy (information content)

• Quantifies the uncertainty (degree of surprise) associated with a prediction

$H' = -\sum_{i=1}^{R} p_i \ln p_i$

$p_i$ = species proportion

$R$ = observed species


Alpha diversity

Richness

Species        Count
Lion           1
Zebra          2
Giraffe        1
Leopard        1
Rhinoceros     1
Hippopotamus   2
Gazelle        2
Elephant       1
Monkey         0

8 observed species

Species richness estimators:

$\text{Chao1} = S_{obs} + \frac{f_1^2}{2 f_2}$

$S_{obs}$ = observed species
$f_1$ = species observed once
$f_2$ = species observed twice

$\text{Chao1} = 8 + \frac{5^2}{2 \cdot 3} = 12.17$
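The Chao1 calculation above can be checked with a short sketch. The function name `chao1` is illustrative. The fallback for f2 = 0 uses the commonly cited bias-corrected form, which is an assumption added here and does not come from the slides.

```python
def chao1(counts):
    """Chao1 richness estimate: S_obs + f1^2 / (2 * f2)."""
    s_obs = sum(1 for n in counts if n > 0)
    f1 = sum(1 for n in counts if n == 1)  # species observed once
    f2 = sum(1 for n in counts if n == 2)  # species observed twice
    if f2 == 0:
        # Classic formula is undefined here; bias-corrected form
        # S_obs + f1*(f1-1) / (2*(f2+1)) with f2 = 0 (assumption, not in slides).
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 ** 2 / (2 * f2)


# Lion, Zebra, Giraffe, Leopard, Rhinoceros, Hippopotamus, Gazelle, Elephant, Monkey
sample = [1, 2, 1, 1, 1, 2, 2, 1, 0]
print(round(chao1(sample), 2))  # 12.17
```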


$H' = -\left(\ln(0.09^{0.09}) + \ln(0.18^{0.18}) + \dots\right) = 2.0$
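As a hedged sketch, the Shannon index for the same sample can be computed from the counts. It uses the natural logarithm, matching the worked value above. The function name `shannon` is illustrative, and reporting exp(H') as an "effective number of species" is a common convention assumed here rather than something the slides spell out.

```python
from math import exp, log


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over observed species."""
    total = sum(counts)
    props = [n / total for n in counts if n > 0]  # unobserved species contribute 0
    return -sum(p * log(p) for p in props)


# Lion, Zebra, Giraffe, Leopard, Rhinoceros, Hippopotamus, Gazelle, Elephant, Monkey
sample = [1, 2, 1, 1, 1, 2, 2, 1, 0]
h = shannon(sample)
print(round(h, 2))  # 2.02
print(exp(h))       # effective number of species under the exp(H') convention
```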


Beta diversity

• Between samples (communities)

Beta-Diversity

Species        Sample 1   Sample 2
Lion           0          2
Zebra          3          2
Giraffe        0          4
Leopard        0          2
Rhinoceros     1          2
Hippopotamus   4          0
Gazelle        0          1
Elephant       1          0
Total          9          13


Bray-Curtis dissimilarity

Bray-Curtis dissimilarity metric

$B_{ij} = 1 - \frac{2 C_{ij}}{S_i + S_j}$, with $0 \le B \le 1$

$C$ = sum of the lowest count of common species

$S$ = total count of the sample

$C_{s1s2} = 2 + 1 = 3$ and $S_{s1} + S_{s2} = 9 + 13 = 22$

$B_{s1s2} = 1 - \frac{2 \cdot 3}{22} = 0.73$ (dissimilar)
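The Bray-Curtis calculation can be sketched as follows, assuming both samples are given as count lists in the same species order. The function name `bray_curtis` is illustrative.

```python
def bray_curtis(sample_i, sample_j):
    """B_ij = 1 - 2*C_ij / (S_i + S_j) for two equally ordered count lists."""
    if len(sample_i) != len(sample_j):
        raise ValueError("samples must list the same species in the same order")
    # C: sum of the lower count per species (zero when a species is not shared).
    c = sum(min(a, b) for a, b in zip(sample_i, sample_j))
    s = sum(sample_i) + sum(sample_j)
    if s == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * c / s


# Lion, Zebra, Giraffe, Leopard, Rhinoceros, Hippopotamus, Gazelle, Elephant
s1 = [0, 3, 0, 0, 1, 4, 0, 1]
s2 = [2, 2, 4, 2, 2, 0, 1, 0]
print(round(bray_curtis(s1, s2), 2))  # 0.73
```

Identical samples give 0 and samples with no species in common give 1.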



Sampling effect

• To be fair we should sample equally carefully in the systems we investigate

Downsizing

Did you find any bacteria?

No - not here


Sample sizes


• Accounting for different sample sizes:

• Normalise to sample size

• Rarefy (downsize) samples

• Statistically model the variance


Normalising

Sample Sizes

Species        Sample 1   Sample 2
Lion           64         1
Zebra          128        2
Giraffe        64         1
Leopard        64         1
Rhinoceros     64         1
Hippopotamus   128        2
Gazelle        128        2
Elephant       64         1
Monkey         9          0
Total          713        11

Normalise to library size: $N_i = n_i / n_{tot}$ (shown here as percentages)

Species        Sample 1   Sample 2
Lion           8.98       9.09
Zebra          17.95      18.18
Giraffe        8.98       9.09
Leopard        8.98       9.09
Rhinoceros     8.98       9.09
Hippopotamus   17.95      18.18
Gazelle        17.95      18.18
Elephant       8.98       9.09
Monkey         1.26       0
Total          100        100

Issue: the samples still differ in sampling power (the larger sample has a higher chance of observing rare species)
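A minimal sketch of the normalisation step, expressing each count as a percentage of its sample total. The names `normalise`, `deep` and `shallow` are illustrative.

```python
def normalise(counts, scale=100.0):
    """Divide each count by the sample total; scale=100 gives percentages."""
    total = sum(counts.values())
    return {species: scale * n / total for species, n in counts.items()}


species = ["Lion", "Zebra", "Giraffe", "Leopard", "Rhinoceros",
           "Hippopotamus", "Gazelle", "Elephant", "Monkey"]
deep = dict(zip(species, [64, 128, 64, 64, 64, 128, 128, 64, 9]))   # total 713
shallow = dict(zip(species, [1, 2, 1, 1, 1, 2, 2, 1, 0]))           # total 11

deep_pct = normalise(deep)
shallow_pct = normalise(shallow)
for name in species:
    print(name, round(deep_pct[name], 2), round(shallow_pct[name], 2))
```

The Monkey row illustrates the issue noted above: it is 1.26% in the larger sample but 0 in the smaller one, so normalising alone does not equalise the chance of detecting rare species.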


Downsize / rarefy

• Resample an equal number of observations (reads) from each sample

• Select the target depth carefully

• The more reads we keep, the more sensitive the analysis

• We may have to remove samples with few counts


Downsize / rarefy

Sample Sizes

Rarefying to smaller library size:

Original counts

Species        Sample 1   Sample 2
Lion           64         1
Zebra          128        2
Giraffe        64         1
Leopard        64         1
Rhinoceros     64         1
Hippopotamus   128        2
Gazelle        128        2
Elephant       64         1
Monkey         9          0
Total          713        11

After resampling x observations from each sample (here x = 10)

Species        Sample 1   Sample 2
Lion           2          1
Zebra          3          2
Giraffe        0          1
Leopard        1          1
Rhinoceros     0          1
Hippopotamus   3          2
Gazelle        1          2
Elephant       0          0
Monkey         0          0
Total          10         10
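Rarefying can be sketched as drawing a fixed number of observations without replacement from each sample. The function name `rarefy` and the `seed` parameter are illustrative. Because the draw is random, the resulting counts will generally differ from the example above.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Randomly keep `depth` observations, drawn without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the sample size; drop this sample instead")
    # Expand the counts into one entry per observation, then subsample.
    pool = [species for species, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    drawn = Counter(rng.sample(pool, depth))
    return {species: drawn.get(species, 0) for species in counts}


species = ["Lion", "Zebra", "Giraffe", "Leopard", "Rhinoceros",
           "Hippopotamus", "Gazelle", "Elephant", "Monkey"]
deep = dict(zip(species, [64, 128, 64, 64, 64, 128, 128, 64, 9]))
shallow = dict(zip(species, [1, 2, 1, 1, 1, 2, 2, 1, 0]))

deep_rarefied = rarefy(deep, 10, seed=1)
shallow_rarefied = rarefy(shallow, 10, seed=1)
print(deep_rarefied)
print(shallow_rarefied)
```

Expanding counts into a pool is simple but memory-hungry for real read counts; it is used here only to keep the sketch short.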


Sample sizes

• Statistically model the variance & heteroscedasticity

• Use packages developed for RNA-seq such as DESeq2 and edgeR (negative binomial)

• Wilcoxon rank-sum statistics (ok for downsized data)
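To illustrate the rank-sum option, here is a rough sketch of the Wilcoxon rank-sum (Mann-Whitney U) statistic with a two-sided normal approximation. It applies no tie or continuity correction, the function name `rank_sum_test` is illustrative, and the abundance values are made up for the example. For real analyses an established implementation such as `scipy.stats.mannwhitneyu` would be the safer choice.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Return (U for x, two-sided p-value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one value")
    # U counts the pairs where x beats y; ties count as one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mean_u) / sd_u
    p = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return u, p


# Hypothetical rarefied counts of one species in two groups of samples.
group_a = [12, 15, 9, 20, 17, 14]
group_b = [3, 7, 5, 10, 2, 6]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(u_stat, p_value)
```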


Let's try it

Take a sample and count the animals!

Determine richness and species abundance, and calculate the Shannon diversity index

Calculate Bray-Curtis dissimilarity with your neighbour


Computational exercise

• Continuation from yesterday

• 124 gut microbiome samples (MetaHIT)

• Sequenced using Illumina (~5Gbp)

• De novo assembly, gene prediction, gene count matrix