This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. RNA-seq for DE analysis training Detecting differentially expressed genes Joachim Jacob 22 and 24 April 2014
Part 5 of RNA-seq for DE analysis: Detecting differential expression
Fifth part of the training session 'RNA-seq for Differential expression analysis'. We explain the most important concepts of detecting differential expression based on a count table, and explain the DESeq2 algorithm. Interested in following this session? Please contact http://www.jakonix.be/contact.html
Transcript
Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads and read parts, to help our goal of differential detection.
QC of preprocessing
Mapping to a reference genome (alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
3 of 44
Goal: get me some DE genes!
Based on a raw count table, we want to detect differentially expressed genes between conditions of interest.
We will assign to each gene a p-value (0-1), which shows us 'how surprised we should be' to see this difference, when we assume there is no difference.
p-value scale (0 to 1): close to 0, very big chance there is a difference; close to 1, very small chance there is a real difference.
4 of 44
Goal
Every single decision we have taken in previous analysis steps was made to improve this outcome: detecting DE genes.
The read count of a gene in different conditions depends on (see first part):
1. Chance (NB model)
2. Expression level
3. Library size (number of reads in that library)
4. Length of transcript
5. GC content of the genes
13 of 44
Normalize for library size
Assumption: most genes are not DE between samples. DESeq calculates for every sample the 'effective library size' by a scale factor.
DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE and uses this median of ratios to obtain the scaling factor associated with this sample.
Original library size * scale factor = effective library size
DESeq will divide the original counts of each sample by that sample's scaling factor to obtain normalized counts.
DESeq: This normalization method [14] is included in the DESeq Bioconductor package (version 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis.
In the end, the algorithms perform the normalization internally and simply continue.
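The median-of-ratios idea described above can be sketched in a few lines of Python. This is an illustration only, not the DESeq code: the function name `size_factors`, the toy count table and the choice to skip genes with a zero count in any sample are assumptions made for this example.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios scaling factor per sample.

    counts: array of raw counts, genes in rows and samples in columns.
    """
    counts = np.asarray(counts, dtype=float)
    # Assumption for this sketch: use only genes with a count > 0 in every sample,
    # so that the geometric mean is well defined.
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    # Log of the geometric mean of each gene across all samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median, per sample, of the ratio count / geometric mean (computed on the log scale).
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy table (made-up numbers): sample 2 is sequenced about twice as deep as sample 1.
counts = np.array([[10, 20],
                   [100, 200],
                   [50, 100],
                   [0, 5],
                   [30, 90]])
sf = size_factors(counts)
normalized = counts / sf  # raw counts divided by the sample's scaling factor
print(sf)
print(normalized)
```

In this toy table most genes differ by a factor of two between the samples, so the two scaling factors differ by a factor of two as well, and the normalized counts of those genes become equal.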
17 of 44
Dispersion estimation
● For every gene, an NB is fitted based on the counts. The most important factor in that model to be estimated is the dispersion.
● DESeq2 applies three steps:
● Estimates the dispersion parameter for each gene
● Plots and fits a curve
● Adjusts the dispersion parameter towards the curve ('shrinking')
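To make the role of the dispersion concrete: in the NB model the variance of a gene's counts is mean + dispersion × mean². The sketch below estimates a per-gene dispersion from that relation with a simple method-of-moments formula. This is a simplified stand-in, not the DESeq2 procedure (which uses likelihood-based estimates and shrinkage towards the fitted curve); the function name and the toy numbers are assumptions.

```python
import numpy as np

def moment_dispersion(norm_counts):
    """Rough per-gene dispersion from variance = mean + dispersion * mean**2.

    norm_counts: normalized counts, genes in rows, replicates of one condition in columns.
    """
    norm_counts = np.asarray(norm_counts, dtype=float)
    mean = norm_counts.mean(axis=1)
    var = norm_counts.var(axis=1, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        alpha = (var - mean) / mean ** 2
    # Negative estimates (variance below the mean) are set to 0;
    # genes without any counts get no estimate.
    return np.where(mean > 0, np.clip(alpha, 0, None), np.nan)

# Toy data (made-up): gene 1 has no spread, gene 2 varies between replicates.
toy = np.array([[10, 10, 10, 10],
                [5, 15, 5, 15]])
dispersions = moment_dispersion(toy)
print(dispersions)
```

With only a few replicates such per-gene estimates are noisy, which is why DESeq2 fits a curve over all genes and shrinks the individual estimates towards it.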
18 of 44
Dispersion estimation
1. Black dots: estimated from normalized data.
2. Red line: curve fitted.
3. Blue dots: final assigned dispersion parameter for that gene.
Model is fit!
19 of 44
Test is run between conditions
If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot).
Legend: significant (p-value < 0.01); not significant.
MA-plot: mean of counts versus the log2 fold change between 2 conditions.
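As an illustration of what a Wald test does with the fitted model: the estimated log2 fold change is divided by its standard error, and the result is compared with a standard normal distribution. The sketch below uses only the Python standard library; the function name and the example numbers are assumptions, and the estimation of the fold change and its standard error (done by DESeq2) is not shown.

```python
import math

def wald_p_value(log2_fold_change, standard_error):
    """Two-sided p-value of a Wald test for 'log2 fold change = 0'."""
    z = log2_fold_change / standard_error
    # For a standard normal Z: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Same fold change, different precision (made-up numbers).
p_precise = wald_p_value(1.0, 0.2)  # small standard error
p_noisy = wald_p_value(1.0, 0.8)    # large standard error
print(p_precise, p_noisy)
```

The same fold change yields a much smaller p-value when its standard error is small, which is why genes with noisy counts are harder to call DE.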
20 of 44
Test is run between conditions
This means that we are going to perform 1000's of tests.
If we set a cut-off on the p-value of 0.01 and we have performed 20000 tests (= genes), we expect about 200 genes that do not differ to turn up significant by chance alone.
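The number 200 is simply the number of tests multiplied by the cut-off. A short simulation shows the same effect, under the assumption that no gene is truly DE, in which case p-values are uniformly distributed between 0 and 1. The seed and variable names are arbitrary choices for this sketch.

```python
import numpy as np

n_genes = 20000  # number of tests, one per gene (from the slide)
cutoff = 0.01    # p-value cut-off (from the slide)

# Expected number of false positives if no gene is truly DE.
expected_false_positives = n_genes * cutoff

# Simulated p-values under the null hypothesis: uniform on [0, 1].
rng = np.random.default_rng(0)  # fixed seed, arbitrary
null_p_values = rng.uniform(0.0, 1.0, size=n_genes)
observed_false_positives = int((null_p_values < cutoff).sum())

print(expected_false_positives, observed_false_positives)
```

The simulated count fluctuates around the expected value from run to run, but it never comes close to zero: with this many tests, false positives are guaranteed.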
21 of 44
Check the distribution of p-values
An enrichment (smaller or bigger) should be seen at low p-values.
Other p-values should not show a trend.
The histogram of the p-values must look like the one below. If not, the test is not reliable. Perhaps the NB fitting step did not succeed, or confounding variables are present.
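A quick way to inspect this is to bin the p-values and look at the counts per bin. The sketch below builds a made-up set of p-values (mostly uniform, plus a group concentrated near 0 that stands in for the DE genes) and prints a text histogram. The mixture, the Beta parameters and the seed are assumptions chosen only to produce the expected shape.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, arbitrary
# Hypothetical data: 18000 non-DE genes (uniform p-values) and 2000 DE genes (p-values near 0).
null_p = rng.uniform(0.0, 1.0, size=18000)
de_p = rng.beta(0.5, 20.0, size=2000)
p_values = np.concatenate([null_p, de_p])

# 20 bins of width 0.05 between 0 and 1.
bin_counts, edges = np.histogram(p_values, bins=20, range=(0.0, 1.0))
for left, n in zip(edges[:-1], bin_counts):
    print(f"{left:4.2f} {'#' * (n // 100)}")
```

The first bin should stand out and the remaining bins should be roughly flat. A hump in the middle or a rise towards 1 would be a reason to revisit the model.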
22 of 44
Confounded distribution of p-values
23 of 44
Improve test results
You set a cut-off of 0.05. Of the genes below the cut-off, a fraction is false positive and a fraction is correctly identified as DE.
24 of 44
Improve test results
We can improve testing by 2 measures:
● avoid testing: apply a filtering before testing, an independent filtering.
● apply a multiple testing correction
25 of 44
Avoid testing by independent filtering
Some scientists just remove genes with mean counts in the samples <10. But there is a more formal method to remove genes, in order to reduce the testing.
http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/
From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
Left: a scatter plot of mean counts versus transformed p-values. The red line depicts a cut-off of 0.1. Note that genes with lower counts do not reach the p-value threshold. Some of them are safe to exclude from testing.
27 of 44
Avoid testing by independent filtering
If we filter out increasingly large portions of genes based on their mean counts, the number of significant genes increases.
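The filtering step itself is simple: rank the genes by mean count and drop the lowest fraction before testing, so that the multiple testing correction is applied to fewer genes. The sketch below shows only this step, with a hypothetical function name, a made-up count table and a fixed fraction. Choosing the fraction (the tool determines it for you) is not shown.

```python
import numpy as np

def filter_by_mean_count(counts, fraction_removed):
    """Flag the genes that remain after dropping the lowest-count fraction.

    counts: normalized counts, genes in rows and samples in columns.
    fraction_removed: fraction of genes (0 to 1) to exclude from testing.
    """
    counts = np.asarray(counts, dtype=float)
    mean_counts = counts.mean(axis=1)
    threshold = np.quantile(mean_counts, fraction_removed)
    return mean_counts >= threshold, threshold

# Made-up table of 5 genes and 2 samples.
toy_counts = np.array([[1, 1],
                       [2, 2],
                       [40, 60],
                       [90, 110],
                       [380, 420]])
keep, threshold = filter_by_mean_count(toy_counts, fraction_removed=0.4)
print(keep, threshold)
```

The mean count is used as filter criterion because it does not depend on which samples belong to which condition, so it does not bias the test that follows.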
28 of 44
Avoid testing by independent filtering
See later (slide 30)
Choose the variable of interest. You can run it once on all to check the outcome.
Automatically performed and reported in results: Benjamini/Hochberg correction, to control false discovery rate (FDR).
FDR is the fraction of false positives in the genes that are classified as DE.
If we set a threshold α of 0.05 on the unadjusted p-values, 20% of the genes will be false positives. If we apply an FDR cut-off of 0.05, we expect 5% of the genes in the final list to be false positives.
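The Benjamini/Hochberg correction is short enough to write out. The sketch below computes adjusted p-values from a list of raw p-values. It is a generic implementation of the procedure for illustration, not the code the tool runs, and the function name and example values are assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini/Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by (number of tests / rank).
    scaled = p[order] * n / np.arange(1, n + 1)
    # An adjusted p-value may not exceed that of any larger raw p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # made-up raw p-values
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Genes whose adjusted p-value is below the chosen FDR cut-off form the final list.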
32 of 44
Including influencing factors
Through a generalized linear model (GLM), the influencing factors are modeled to predict the counts. The factors come from the sample descriptions file.
Sample groups: Yeast (=WT), GDA (=G), Yeast mutant (=UPC), GDA + vit C (=AG).
Additional metadata (batch factor): Day 1, Day 2.
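A sketch of how factors from a sample descriptions file end up in a model: each factor becomes a column of a design matrix, and the expected count of a gene is predicted from the sum of the factor effects on the log scale. The sample table, the column names and the coefficient values below are hypothetical, chosen only to illustrate a condition factor plus a batch factor (day).

```python
import numpy as np

# Hypothetical sample descriptions: one condition of interest and one batch factor.
samples = [
    {"name": "s1", "condition": "WT", "day": "day1"},
    {"name": "s2", "condition": "WT", "day": "day2"},
    {"name": "s3", "condition": "UPC", "day": "day1"},
    {"name": "s4", "condition": "UPC", "day": "day2"},
]

def design_matrix(samples, base_condition="WT", base_day="day1"):
    """Columns: intercept, condition differs from base, day differs from base.

    Only valid for factors with two levels, as in this toy table.
    """
    return np.array([
        [1.0, float(s["condition"] != base_condition), float(s["day"] != base_day)]
        for s in samples
    ])

X = design_matrix(samples)
# Made-up log2 coefficients for one gene: base level, condition effect, day effect.
beta = np.array([5.0, 1.0, 0.5])
expected_counts = 2 ** (X @ beta)
print(X)
print(expected_counts)
```

Including the day in the model lets the test attribute part of the count differences to the batch, instead of mistaking them for a condition effect or for noise.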
33 of 44
DESeq2 to detect DE genes
We provide a combination of factors (the model, GLM) which influence the counts. Every factor should match the column name in the sample descriptions.
The levels of the factors corresponding to the 'base' or 'no perturbation'.
The fraction filtered out, determined by the independent filter tool.
Adjusted p-value cut-off.
34 of 44
The output of DESeq2
The 'detect differential expression' tool gives you four results: the first is the report including graphs.
Only genes below the cut-off, with independent filtering applied.
All genes, with independent filtering applied.
Complete DESeq results, without independent filtering applied.
35 of 44
Effect of variance on DE detection
Two plots of log2(FC) (x-axis) against the standard error (SE) of the logFC (y-axis): all genes, with their logFC; only the DE genes.
36 of 44
Volcano plot is often asymmetric
Volcano plot: shows the DE genes with our given cut-off.
Axes: log10(FC) on the x-axis (marked at -0.3 and 0.3), -log10(pvalue) on the y-axis.
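For reference, the coordinates of a gene in such a plot follow directly from its fold change and p-value. In the sketch below the function name and the example gene are assumptions. Note that a log10(FC) of 0.3 corresponds to roughly a twofold change.

```python
import math

def volcano_point(fold_change, p_value):
    """Return (x, y) for a volcano plot: log10 fold change and -log10 p-value."""
    return math.log10(fold_change), -math.log10(p_value)

# Made-up gene: twofold up, p-value 0.001.
x, y = volcano_point(2.0, 0.001)
print(x, y)
```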
37 of 44
Comparing different conditions
Sample groups: Yeast (=WT), GDA (=G), Yeast mutant (=UPC), GDA + vit C (=AG); Day 1, Day 2.
Which genes are DE between UPC and WT?
Which genes are DE between G and AG?
Which genes are DE in WT between G and AG?
38 of 44
Comparing different conditions
Adjust the sample descriptions file and the model:
(Screenshot annotations: 'Remove these' marks the rows to delete from the sample descriptions file.)
1. Which genes are DE between UPC and WT?
2. Which genes are DE between G and AG?
3. Which genes are DE in WT between G and AG?
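Adjusting the sample descriptions for a question like the third one amounts to keeping only the rows that match. The sketch below does this on a hypothetical sample table: the column names `strain`, `treatment` and `day`, and the fully crossed layout of 8 samples, are assumptions for illustration and may differ from the actual sample descriptions file.

```python
# Hypothetical sample descriptions: 2 strains x 2 treatments x 2 days.
samples = [
    {"sample": f"{strain}_{treatment}_{day}",
     "strain": strain, "treatment": treatment, "day": day}
    for strain in ("WT", "UPC")
    for treatment in ("G", "AG")
    for day in ("day1", "day2")
]

def select(samples, **required):
    """Keep only the samples whose columns match all required values."""
    return [s for s in samples
            if all(s[column] == value for column, value in required.items())]

# Question 3: DE in WT between G and AG -> remove all UPC rows,
# then test the treatment factor on the remaining samples.
wt_only = select(samples, strain="WT")
for s in wt_only:
    print(s["sample"], s["treatment"], s["day"])
```

After removing rows, the model should only contain factors that still vary in the remaining samples (here the strain column is constant and drops out).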