Statistical analysis of RNA-Seq data
Samuel Blanck, Univ. Lille
Guillemette Marot-Briend, Univ. Lille, Inria
Sources: J. Aubert and C. Hennequet-Antier (INRA), M.A. Dillies and H. Varet (Institut Pasteur Paris)
Technical assistant: Pierre Péricard
16-17 September 2019
State the null and the alternative hypotheses
H0 = {the mean expression (or proportion) of the gene is identical between the two conditions}
H1 = {the mean expression (or proportion) of the gene is different between the two conditions}
A gene is declared differentially expressed if the observed difference between two conditions is statistically significant, that is to say greater than the natural random variation.
To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of (Ronald A. Fisher, Indian statistical congress, 1938, vol. 4, p 17).
While a good design does not guarantee a successful experiment, a suitably bad design guarantees a failed experiment (Kathleen Kerr, Inserm workshop 145, 2003)
A good design is a list of experiments to conduct in order to answer the question asked, which maximizes the information collected and minimizes the cost of the experiments under the given constraints.
Rule 1 : Define the biological question well : make a choice
Identify differentially expressed genes,
Detect and estimate isoforms,
Construct a de novo transcriptome.
Rule 2 : adapt Fisher’s principles : randomization and blocking
AVOID CONFOUNDING the biological variability of interest with a biological or technical source of variation
Experimental design Exploratory data analysis Normalization Differential analysis Multiple testing Gene Set Enrichment Analysis Meta-analysis
Experimental design, several strategies : DGE, DTE, DTU
DGE : Differential Gene Expression
DTE : Differential Transcript Expression
DTU : Differential Transcript Usage
Gene-level quantification
Mapping on the genome
Counting using featureCounts or htseq-count
Transcript-level quantification
Transcripts Per Million (TPM) estimation using salmon
Possibility to aggregate at the gene level
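As an illustration of the quantities involved, here is a minimal sketch of the TPM definition and of gene-level aggregation. It assumes plain transcript lengths and made-up counts; salmon itself relies on a more elaborate statistical model, so the function names and numbers below are hypothetical and only show the arithmetic.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts of one sample to Transcripts Per Million."""
    # Reads per base: at equal abundance, longer transcripts collect more reads.
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so that the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


def aggregate_to_genes(values, gene_of_transcript):
    """Sum transcript-level values into gene-level values."""
    genes = {}
    for value, gene in zip(values, gene_of_transcript):
        genes[gene] = genes.get(gene, 0.0) + value
    return genes


# Hypothetical toy data: three transcripts belonging to two genes.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]
gene_ids = ["geneA", "geneA", "geneB"]

tpm_values = tpm(counts, lengths)
gene_tpm = aggregate_to_genes(tpm_values, gene_ids)
```

Because TPM values sum to one million within a sample, summing the transcripts of a gene gives that gene's share of the same total.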
Biological replicate : Repetition of the same experimental protocol but independent data acquisition (several samples).
Technical replicate : Same biological material but independent replications of the technical steps (several extracts from the same sample).
Sequencing technology does not eliminate biological variability. (Nature Biotechnology Correspondence, 2011)
lane effect < run effect < library prep effect << biological effect [Marioni et al., 2008], [Bullard et al., 2010]
Include at least three biological replicates in your experiments ! Technical replicates are not necessary.
Find genes that are differentially expressed between normal skin and damaged skin in mice

Sample  Condition  RNA extraction date  Mouse
S1      control    July 12th, 2016      m1
S2      control    July 20th, 2016      m2
S3      control    July 25th, 2016      m3
S4      wound      July 12th, 2016      m1
S5      wound      July 20th, 2016      m2
S6      wound      July 25th, 2016      m3
One solution : the day effect is evenly distributed across conditions.
In case of paired data the pairing may be confounded with the batch effect. These effects are NOT confounded with the biological effect of interest.
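A design like the one in the table can be checked mechanically before any sample is processed. The sketch below is one possible check, not a standard tool: the function name `is_balanced` and the "confounded" counter-example are hypothetical. It tests whether every level of a nuisance factor (extraction date, mouse) contains each condition equally often.

```python
from collections import Counter

# Design table from the slide: (sample, condition, RNA extraction date, mouse).
design = [
    ("S1", "control", "July 12th, 2016", "m1"),
    ("S2", "control", "July 20th, 2016", "m2"),
    ("S3", "control", "July 25th, 2016", "m3"),
    ("S4", "wound", "July 12th, 2016", "m1"),
    ("S5", "wound", "July 20th, 2016", "m2"),
    ("S6", "wound", "July 25th, 2016", "m3"),
]


def is_balanced(rows, factor_idx, condition_idx=1):
    """True if each level of the nuisance factor holds every condition equally often."""
    table = Counter((r[factor_idx], r[condition_idx]) for r in rows)
    conditions = sorted({r[condition_idx] for r in rows})
    levels = sorted({r[factor_idx] for r in rows})
    for level in levels:
        per_condition = [table[(level, c)] for c in conditions]
        if min(per_condition) == 0 or len(set(per_condition)) != 1:
            return False
    return True


# Hypothetical bad design: controls extracted on one day, wounds on another.
confounded = [
    ("S1", "control", "day 1", "m1"),
    ("S2", "control", "day 1", "m2"),
    ("S3", "wound", "day 2", "m3"),
    ("S4", "wound", "day 2", "m4"),
]

date_ok = is_balanced(design, factor_idx=2)
mouse_ok = is_balanced(design, factor_idx=3)
confounded_ok = is_balanced(confounded, factor_idx=2)
```

In the slide's design both checks pass, whereas in the counter-example the day effect cannot be separated from the condition effect.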
Ask a precise and well defined biological question
List all possible biological confounding effects (sex, age, ...)
Collect samples while taking care of the distribution of unwanted sources of variation across samples
Include at least three biological replicates per condition. Technical replicates are not necessary
Distribute samples on lanes and flow cells ...
according to the comparisons to be made
without confounding technical effects with the biological effects of interest
applying the same multiplexing rate on all samples
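The last item of this checklist can be sketched as a small randomized allocation. The code below is only an illustration under simple assumptions (two lanes, the six samples of the skin example, a hypothetical helper `assign_to_lanes`): samples are shuffled within each condition, then dealt to lanes in turn so that every lane receives a mix of conditions and the same number of samples.

```python
import random


def assign_to_lanes(samples_by_condition, n_lanes, seed=0):
    """Shuffle samples within each condition, then deal them to lanes in turn."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    lanes = [[] for _ in range(n_lanes)]
    position = 0
    for condition in sorted(samples_by_condition):
        shuffled = list(samples_by_condition[condition])
        rng.shuffle(shuffled)  # randomization inside the condition
        for sample in shuffled:
            lanes[position % n_lanes].append((sample, condition))
            position += 1
    return lanes


samples = {"control": ["S1", "S2", "S3"], "wound": ["S4", "S5", "S6"]}
lanes = assign_to_lanes(samples, n_lanes=2)
```

With three samples per condition and two lanes, each lane carries three samples (the same multiplexing rate) and both conditions, so a lane effect cannot mimic the condition effect.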
European Conference on Computational Biology 2014 http://f1000.com/posters/browse/summary/1096840
How to Design a good RNA-Seq experiment in an interdisciplinary context?
Pôle Planification Expérimentale, PEPI IBIS (1)
(1) INRA, France
RNA-seq technology is a powerful tool for characterizing and quantifying the transcriptome. Careful upstream experimental planning is necessary to extract the maximum of relevant information and to make the best use of these experiments.
An RNA-seq experimental design using Fisher’s principles
Rule 1: Share a minimal common language
Rule 2: Well define the biological question
From Alon, 2009
• Choose scientific problems on feasibility and interest
• Order your objectives (primary and secondary)
• Ask yourself if RNA-seq is better than microarray regarding the biological question
Make a choice
• Identify differentially expressed (DE) genes?
• Detect and estimate isoforms?
• Construct a de novo transcriptome?
Rule 3: Anticipate difficulties with a well designed experiment
1 Prepare a checklist with all the needed elements to be collected,
2 Collect data and determine all factors of variation,
3 Choose bioinformatics and statistical models,
4 Draw conclusions on results.
Be aware of different types of bias
Keep in mind the influence of effects on results: lane ≤ run ≤ RNA library preparation ≤ biological (Marioni, 2008), (Bullard, 2010)
RNA-seq experiment analysis: from A to Z
Adapted from Mutz, 2013
Rule 4: Make good choices
How many reads?
• 100M to detect 90% of the transcripts of 81% of human genes (Toung, 2011)
• 20M reads of 75bp can detect transcripts of medium and low abundance in chicken (Wang, 2011)
• 10M to cover by at least 10 reads 90% of all (human and zebrafish) genes (Hart, 2013)...
Why increase the number of biological replicates?
• To generalize to the population level
• To estimate variation in individual transcripts with a higher degree of accuracy (Hart, 2013)
• To improve detection of DE transcripts and control of the false positive rate: TRUE with at least 3 (Soneson 2013, Robles 2012)
More biological replicates or increased sequencing depth?
It depends! (Haas, 2012), (Liu, 2014)
• DE transcript detection: (+) biological replicates
• Construction and annotation of transcriptome: (+) depth and (+) sampling conditions
• Transcriptomic variants search: (+) biological replicates and (+) depth
A solution: multiplexing.
Decision tools available: Scotty (Busby, 2013), RNAseqPower (Hart, 2013)
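To give a feel for the trade-off between replicates and depth, here is a rough sketch of a normal-approximation sample-size calculation of the kind described in Hart et al. (2013). The formula as written here, the function name and the example values (coverage 20, coefficient of variation 0.4, two-fold effect) are assumptions made for illustration; the dedicated tools cited above should be used for real planning.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Approximate number of biological replicates per group.

    depth  : average number of reads covering the gene (assumed value)
    cv     : biological coefficient of variation (assumed value)
    effect : fold change to detect
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = z.inv_cdf(power)
    # 1/depth is the counting (Poisson) noise, cv^2 the biological noise.
    noise = 1 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * noise / math.log(effect) ** 2


n_twofold = samples_per_group(depth=20, cv=0.4, effect=2.0)
n_small_effect = samples_per_group(depth=20, cv=0.4, effect=1.5)
n_deeper = samples_per_group(depth=200, cv=0.4, effect=2.0)
```

In this approximation, extra depth only shrinks the `1/depth` term, so once the biological variability dominates, additional replicates help more than additional reads.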
Some definitions
Biological and technical replicates:
Sequencing depth: Average number of reads covering a given position in a genome or a transcriptome in a sequencing run
Multiplexing: Tags or barcodes with specific sequences added during library construction, which allow multiple samples to be included in the same sequencing reaction (lane)
Blocking: Isolating variation attributable to a nuisance variable (e.g. lane)
Conclusions
• Clarify the biological question
• All skills are needed in the discussions, right from project construction
• Prefer biological replicates to technical replicates
• Use multiplexing
• Optimum compromise between replication number and sequencing depth depends on the question
• Wherever possible apply Fisher’s three principles of randomization, replication and local control (blocking)
And do not forget: budget also includes cost of biological data acquisition, sequencing data backup, bioinformatics and statistical analysis.
Main goal : explore the structure of the dataset to better understand the proximity between samples and detect possible problems. This is a quality control step.
Two main tools
Principal Component Analysis (PCA) or MultiDimensional Scaling (MDS)
Clustering
Pre-requisite
To apply these methods, make the data homoscedastic : the variance must be independent of the intensity
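A small numerical illustration of this pre-requisite, using the simplest possible transformation, log2(count + 1). The two genes and their counts are made up, and dedicated variance-stabilizing transformations exist that handle low counts better; the sketch only shows why raw counts are not suitable for PCA or clustering.

```python
import math
from statistics import variance


def log2_transform(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]


# Hypothetical genes with the same relative variation (+/- 20%)
# but very different expression levels.
low_gene = [8, 12]
high_gene = [800, 1200]

# On the raw scale the highly expressed gene dominates the variance.
raw_ratio = variance(high_gene) / variance(low_gene)

# On the log scale both genes contribute comparably.
log_ratio = variance(log2_transform(high_gene)) / variance(log2_transform(low_gene))
```

On raw counts a distance-based method would be driven almost entirely by the most expressed genes; after the transformation the variance no longer grows with the intensity to the same extent.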
Normalization is a process designed to identify and correct technical biases while removing as little biological signal as possible. This step is technology- and platform-dependent.
Within-sample normalization
Normalization enabling comparisons of fragments (genes) from the same sample.
Not needed in a differential analysis context.
Between-sample normalization
Normalization enabling comparisons of fragments (genes) from different samples.
Some are part of models for DE, others are ’stand-alone’
They do not rely on similar hypotheses
But all of them claim to remove technical bias associated withRNA-seq data
Which one is the best ?
[Dillies et al., 2013], on behalf of StatOmique Group
Evaluation of normalization methods for RNA-Seq differential analysis at the gene level
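To make between-sample normalization concrete, here is a minimal sketch of one family of methods, the median-of-ratios scaling factor used by DESeq. It is a simplified re-implementation for illustration, written from the published description rather than taken from the package; the function name and the toy count matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factors; counts[g][s] is gene g in sample s."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # a zero count makes the geometric mean zero: skip the gene
        # Log of the geometric mean of the gene across samples (pseudo-reference).
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # One factor per sample: median ratio to the pseudo-reference.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

Dividing each column by its factor removes the depth difference, and using a median rather than the total count keeps a few very highly expressed genes from driving the factor.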
A general method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample.
Four ingredients
1 Experimental data x1, x2, . . . , xn
2 Statistical model : assumptions about the independence or distributions of the observations with parameter θ
3 Hypothesis to test : assumption about one parameter of the distribution
4 Region of rejection (or critical region) : the set of values of the test statistic T for which the null hypothesis H0 is rejected. T = f(X1, X2, . . . , Xn) is a function which summarizes the data without any loss of information about θ. The distribution of T under H0 is known.
For a realisation t of the test statistic T, the p-value p(t) is the probability (calculated under H0) of obtaining a test statistic at least as extreme as the one that was actually observed.
In the two-sided (bilateral) case : $p(t) = P_{H_0}\{|T| \geq |t|\}$
The p-value measures the agreement between H0 and the observed result.
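As a worked example of the two-sided definition, the sketch below assumes that T follows a standard normal distribution under H0. That distributional assumption and the function name are made only for illustration; count-based tests use other reference distributions.

```python
from statistics import NormalDist


def two_sided_p_value(t):
    """P(|T| >= |t|) when T is standard normal under H0."""
    # By symmetry, both tails have the same probability.
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


p_null_like = two_sided_p_value(0.0)   # observation exactly at the H0 centre
p_borderline = two_sided_p_value(1.96)
p_extreme = two_sided_p_value(-4.0)    # the sign does not matter in the two-sided case
```

The further the observed statistic lies from what H0 predicts, the smaller the p-value, which is the sense in which it measures agreement with H0.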
Random experiment with exactly two possible outcomes : success (S) or failure (F)
p : probability of success
Negative Binomial distribution
Repeat Bernoulli trials with probability p of success. NB describes the distribution of the number of failures k before getting n successes
From Poisson to NB
A Negative Binomial distribution is a mixture of Poisson distributions whose rate parameter itself varies (following a Gamma distribution). It is a robust alternative to Poisson in the case of over-dispersed data (the variance is higher than the mean)
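The mixture view can be checked by simulation. The sketch below draws a Poisson count whose rate is itself drawn from a Gamma distribution, using only the Python standard library; the parameter values (mean 20, dispersion 0.5), the seed and the helper names are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, variance


def poisson_draw(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for moderate rates)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def negative_binomial_draw(rng, mu, phi):
    """Gamma-Poisson mixture with mean mu and dispersion phi."""
    # Gamma with shape 1/phi and scale mu*phi: mean mu, variance phi * mu^2.
    rate = rng.gammavariate(1.0 / phi, mu * phi)
    return poisson_draw(rng, rate)


rng = random.Random(42)
mu, phi = 20.0, 0.5
draws = [negative_binomial_draw(rng, mu, phi) for _ in range(20000)]

sample_mean = mean(draws)
sample_var = variance(draws)
expected_var = mu * (1 + phi * mu)  # Poisson part (mu) plus extra part (phi * mu^2)
```

A pure Poisson model would give a variance equal to the mean; in the mixture the variance is inflated by the term phi * mu^2, which is the over-dispersion the NB model is meant to capture.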
1 Compute gene-wise dispersion estimates by maximum likelihood (ML)
2 Estimate a common dispersion parameter by ML
3 Moderate gene-wise dispersion estimates toward a common estimate or toward a local estimate from genes with similar expression strength using a weighted conditional likelihood.
Differences :
DESeq2 estimates the width of the prior distribution from the data and therefore automatically controls the amount of shrinkage based on the observed properties of the data.
edgeR requires a user-adjustable parameter, the prior degrees of freedom, which weights the contribution of the individual gene estimate and edgeR’s dispersion fit.
Package         Variance                   Reference
DESeq, DESeq2   $\mu(1 + \phi_\mu \mu)$    [Anders and Huber, 2010], [Love et al., 2014]
edgeR           $\mu(1 + \phi \mu)$        [Robinson et al., 2009]
edgeR : borrow information across genes for stable estimates of φ
3 ways to estimate φ (common, trend, moderated)
DESeq : data-driven relationship of variance and mean estimated using parametric or local regression for robust fit across genes
DESeq2 : relationship of variance and mean (as in DESeq) + dispersion and fold change shrinkage (for PCA and Gene Set Enrichment Analysis) + detection of outliers
DESeq will stop being maintained in the near future, use DESeq2 instead
Robust edgeR (not the default in R) loses a tiny bit of power when there are no outliers, but has a good capacity to dampen their effect if present (be careful with reviews that run edgeR with its default settings)
DESeq’s policy on outliers has a global effect, resulting in a (sometimes drastic) drop in power
DESeq2 is very powerful in the absence of outliers, but its policy of filtering outliers results in a loss of power
edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or 7%)
Probability of having at least one Type I error (false positive), i.e. of declaring at least one non-DE gene as DE.
$FWER = P(FP \geq 1)$
The Bonferroni procedure
Either each test is performed at level $\alpha = \alpha^* / G$,
or adjusted p-values $p_i^{Bonf} = \min(1, p_i \times G)$ are used; in both cases $FWER \leq \alpha^*$.
For $G = 2000$ and $\alpha^* = 0.05$ : $\alpha = 2.5 \times 10^{-5}$.
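The two equivalent forms of the Bonferroni procedure can be written in a few lines. This is a minimal sketch with hypothetical function names and made-up p-values; the numerical example with G = 2000 is the one from the slide.

```python
def bonferroni_threshold(alpha_star, n_tests):
    """Per-test level guaranteeing FWER <= alpha_star."""
    return alpha_star / n_tests


def bonferroni_adjust(p_values):
    """Adjusted p-values min(1, p * G), to be compared directly with alpha_star."""
    g = len(p_values)
    return [min(1.0, p * g) for p in p_values]


# Example from the slide: G = 2000 genes, target FWER of 5%.
per_test_level = bonferroni_threshold(0.05, 2000)

# Hypothetical raw p-values for three genes.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)

# Both forms lead to the same decisions at alpha_star = 0.05.
declared_de = [p_adj <= 0.05 for p_adj in adjusted]
```

Working with adjusted p-values is often more convenient in practice, because the same list can be compared with any target level without recomputing a threshold.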
Gene sets [Subramanian et al., 2005] : groups of genes that sharecommon biological function, chromosomal location, or regulation.
Motivation :
GSEA can reveal many biological pathways in common where single-gene analyses find little similarity between independent studies [Subramanian et al., 2005]
Molecular Signatures Database available at : http://software.broadinstitute.org/gsea/msigdb/index.jsp
Use of the hypergeometric distribution, which describes the probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, wherein each draw is either a success or a failure.
The hypergeometric test uses the hypergeometric distribution to identify which gene sets are over-represented in the list of differentially expressed genes.
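The over-representation p-value is the upper tail of the hypergeometric distribution just described. The sketch below uses the same symbols (k, n, N, K) with exact integer arithmetic; the function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N : number of genes in the background
    K : number of background genes belonging to the gene set
    n : number of differentially expressed (DE) genes
    k : number of DE genes belonging to the gene set
    """
    total = comb(N, n)  # number of ways to draw n genes among N
    upper = min(n, K)   # at most min(n, K) DE genes can fall in the set
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy example: 10 genes, 4 of them in the gene set, 3 DE genes, 2 of which are in the set.
p_toy = enrichment_p_value(k=2, n=3, K=4, N=10)

# Observing "at least zero" genes of the set is certain.
p_certain = enrichment_p_value(k=0, n=3, K=4, N=10)
```

In a real analysis one such p-value is computed per gene set, so the resulting list must itself be corrected for multiple testing.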
source : Haas and Zody, Nature Biotechnology (2010)

             Microarrays          RNA-Seq
Information  Intensity measures   Counts of reads
Modelling    Normal distribution  Poisson, Negative binomial
Tests        Moderated t-tests    Likelihood ratio tests

Adapt the method to your data
Specific methods have been developed for few replicates.
The need for ’sophisticated’ methods decreases when the number of replicates increases.
GSEA or meta-analysis with other studies can help find differentially expressed genes when not enough replicates were present in the initial study. Avoid merging the data when a high study effect is expected, prefer an appropriate statistical analysis !
To go further
Galaxy drop-in help sessions (permanences)
https://wikis.univ-lille.fr/bilille/permanences
To obtain help in statistical analysis of omic data
bilille call for projects (around December each year, to plan the engineers’ calendar)
Detecting differential usage of exons from RNA-seq data
Genome Research 2012, 22
Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W and Robinson MD
Count-based differential expression analysis of RNA sequencing data using R and Bioconductor
Nature Protocols 2013, 8, 1765-1786
Anders S
Comparative analysis of RNA-seq data with DESeq and DEXSeq
http://www.bioconductor.org/help/course-materials/2013/CSAMA2013/tuesday/morning/Anders DESeq DEXSeq.pdf
Benjamini Y and Hochberg Y
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Journal of the Royal Statistical Society 1995, 57(1), 289-300
Benjamini Y and Speed TP
Summarizing and correcting the GC content bias in high throughput sequencing
Nucleic Acids Research 2012, 1-14
Bolstad BM, Irizarry RA, Astrand M and Speed TP
A comparison of normalization methods for high density oligonucleotide array data based on bias and variance
Bioinformatics 2003, 19, 185-193
Bullard JH, Purdom E, Hansen KD and Dudoit S
Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments
BMC Bioinformatics 2010, 11, 94
Busby MA, Stewart C, Miller CA, Grzeda KR and Marth GT
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression
Bioinformatics 2013, 29(5), 656-657
Dillies MA, Rau A, Aubert J, Hennequet-Antier C et al
A comprehensive comparison of normalization methods for Illumina high-throughput RNA-sequencing data analysis
Briefings in Bioinformatics 2013, 14(6), 671-683
Dudoit S, Maya O and Jacob L
Short course on RNA seq and CHiP seq data analysis
Valencia, Nov. 2010
Eisenberg EE and Levanon EY
Human housekeeping genes are compact
Trends Genet, 19(7), 362-365
Fisher RA
The Design of Experiments
Oliver and Boyd 1935, 1-252
Haas BJ, Chin M, Nusbaum C, Birren BW and Livny J
How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes?
BMC Genomics 2012, 13, 734
Hansen KD, Brenner SE and Dudoit S
Biases in Illumina transcriptome sequencing caused by random hexamer priming
Nucleic Acids Research 2010, 1-7
Hansen KD, Irizarry RA and Wu Z
Removing technical variability in RNA-seq data using Conditional Quantile Normalization
Biostatistics 2011, 13(2), 204-216
Hart SN, Therneau TM, Zhang Y, Poland GA and Kocher J-P
Calculating Sample Size Estimates for RNA Sequencing Data
Journal of Computational Biology 2013, 20(12), 970-978