Analysis pipeline
Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological knowledge extraction
• Why do we filter the data?
– To reduce the number of statistical tests we will have to perform!
• When performing multiple statistical tests, two types of errors can occur:
– Type I error (false positive)
– Type II error (false negative)
• Reducing type I errors increases the number of type II errors.
• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).
The multiple tests problem
• As the number of samples increases, the tails of a distribution become more populated.
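The effect of running many tests can be put in numbers with a minimal sketch. It assumes that every null hypothesis is true and that the tests are independent, which is a simplification and not a claim about real microarray data; the function names are illustrative.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of tests that fall below alpha by chance alone."""
    return alpha * n_tests


if __name__ == "__main__":
    for g in (1, 10, 100, 22300):
        print(g,
              round(family_wise_error_rate(0.05, g), 4),
              expected_false_positives(0.05, g))
```

Under these assumptions, every extra test adds to the expected number of chance findings, which is why reducing the number of tests before the statistical analysis matters.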
Filtering
• Filtering affects the false discovery rate (FDR).
• The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset.
• If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered by the filtering.
Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004
Filtering can be performed at various levels:
• Annotation features:
– Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.)
• Signal features:
– Percentage of intensities greater than a user-defined value
– Interquartile range (IQR) greater than a defined value
How to define the efficacy of a filtering procedure?
$$\mathrm{enrichment} = \frac{N_{\text{spike-in, AfterFiltering}} \times N_{\text{probesets}}}{N_{\text{probesets, AfterFiltering}} \times N_{\text{spike-in}}} \times 100$$
where $N_{\text{probesets}}$ and $N_{\text{spike-in}}$ are the counts before filtering.
• This enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step.
$$\mathrm{enrichment} = \frac{E_{\text{AfterChrom}} \times P[\mu g]_{\text{BeforeChrom}}}{P[\mu g]_{\text{AfterChrom}} \times E_{\text{BeforeChrom}}} \times 100$$
Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of its intensities are ≥ log2(100))
• Before filtering: 22300 probe sets, 42/42 spike-ins. Enrichment: 100%
• After filtering: 5553 probe sets, 42/42 spike-ins. Enrichment: 401%
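The filter rule and the enrichment figures above can be reproduced with a short sketch. It is a plain-Python stand-in written for illustration, not the genefilter code itself; the function names and the exact comparison (≥, as stated on the slide) are assumptions.

```python
import math


def p_over_a(intensities, p=0.25, a=math.log2(100)):
    """Keep a probe set if at least a fraction p of its
    log2 intensities are >= a (rule as stated on the slide)."""
    n_over = sum(1 for x in intensities if x >= a)
    return n_over / len(intensities) >= p


def enrichment(spike_after, probesets_after, spike_before, probesets_before):
    """Percent enrichment of spike-ins relative to the unfiltered data."""
    return 100.0 * (spike_after * probesets_before) / (probesets_after * spike_before)


if __name__ == "__main__":
    # Figures from the slide: 42/42 spike-ins kept, 22300 -> 5553 probe sets.
    print(round(enrichment(42, 5553, 42, 22300), 1))   # about 401.6
    print(enrichment(42, 22300, 42, 22300))            # 100.0 (no filtering)
```

Because all 42 spike-ins survive while three quarters of the probe sets are removed, the spike-ins become about four times more concentrated in the filtered data.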
Filtering by interquartile range (IQR)
[Figure: intensity distribution with the 25% and 75% quantiles delimiting the IQR]
How does filtering by genefilter IQR work?
• The distribution of all intensity values of a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.
• The filter removes genes that show little change across the experimental points.
Filtering by genefilter IQR
(remove a probe set if its intensity IQR ≤ 0.25 or ≤ 0.5)
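A minimal sketch of an IQR filter, assuming log2 intensities stored per gene in a dictionary. The gene names and values are invented for illustration, and the quantile interpolation used by `statistics.quantiles` may differ slightly from the one genefilter applies.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: distance between the 75% and 25% quantiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expression, threshold=0.25):
    """Keep only genes whose intensity IQR exceeds the threshold."""
    return {gene: vals for gene, vals in expression.items() if iqr(vals) > threshold}


if __name__ == "__main__":
    expression = {
        "geneA": [8.0, 8.1, 8.0, 8.1],   # almost flat across conditions
        "geneB": [6.0, 6.2, 9.0, 9.4],   # changes between conditions
    }
    print(sorted(iqr_filter(expression, threshold=0.25)))
```

A gene that barely moves across the experimental points has a narrow IQR and is dropped before any test is run.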
• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be:
– Technical
– Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!
How important are replicates?
(Yang YH and Speed T, 2002)
Sample size
• Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates.
• The issue of how many replicates are required in a typical experimental system needs to be addressed.
• Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).
• Assessment of sample sizes for microarray data is a tricky exercise.
• The reason for performing such an analysis is to get a general feeling for the ability of our experimental data to robustly detect differential expression.
Assumptions
• A microarray experiment is set up to compare gene expression between one treatment group and one control group.
• Microarray data have been normalized and transformed so that the data for each gene are sufficiently close to a normal distribution that a standard 2-sample pooled-variance t-test will reliably detect differentially expressed genes.
• The tested hypothesis for each gene is:
$$H_0: \mu_T = \mu_C \quad \text{versus} \quad H_1: \mu_T \neq \mu_C$$
where μT and μC are the means of gene expression for the treatment and control groups, respectively.
• The analysis is done using the common variance described in:
– Wei et al. BMC Genomics. 2004, 5:87
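To get a feeling for how the required number of replicates depends on variability, here is a rough sketch based on the textbook normal approximation for a two-sample comparison of means. It is not the procedure of Wei et al.; the function name and the example standard deviations are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def replicates_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate replicates per group needed to detect a mean difference
    delta (log2 scale) with common standard deviation sigma, using
    n = 2 * (sigma * (z_{1-alpha/2} + z_{power}) / delta)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * (sigma * (z_alpha + z_beta) / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical values: low variability vs. high variability.
    print(replicates_per_group(delta=1.0, sigma=0.5))
    print(replicates_per_group(delta=1.0, sigma=1.0))
```

Doubling the standard deviation roughly quadruples the required group size, which is the kind of difference one would expect between homogeneous and heterogeneous populations. The normal approximation tends to understate the requirement when the resulting n is very small.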
Log2(T/C) is frequently used to evaluate fold change variation
• The intensity changes between experimental groups (i.e. control versus treated) are known as:
– Fold change.
• Frequently an arbitrary threshold
$$\left|\log_2 \frac{\mathrm{Trtd}}{\mathrm{Ctrl}}\right| = 1$$
is used to define a significant differential expression.
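As a small illustration of the threshold, the sketch below computes the log2 fold change from raw-scale intensities and applies a cutoff of 1 on its absolute value, so that both a doubling and a halving pass. The function names and the use of the absolute value are assumptions.

```python
import math
from statistics import mean


def log2_fold_change(treated, control):
    """log2 of the ratio between mean treated and mean control intensity
    (intensities on the raw, non-log scale)."""
    return math.log2(mean(treated) / mean(control))


def passes_threshold(lfc, threshold=1.0):
    """True when the change is at least two-fold in either direction."""
    return abs(lfc) >= threshold


if __name__ == "__main__":
    print(log2_fold_change([400.0, 400.0], [100.0, 100.0]))  # 2.0
    print(passes_threshold(log2_fold_change([50.0], [100.0])))
```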
Statistical analysis
• Intensity changes between experimental groups (i.e. control versus treated) are known as:
– Fold change.
– Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.
• Statistical filtering can be performed using parametric and non-parametric tests.
• Parametric tests:
– The populations under analysis are assumed to be normally distributed.
• Non-parametric tests:
– There is no assumption on the distribution of the samples.
• Non-parametric tests are less sensitive than parametric ones.
• Each method grasps some true signals, but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing false ones.
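The lower sensitivity of rank-based tests with few replicates can be seen from an exact Wilcoxon rank-sum test. The sketch below enumerates all possible rank assignments; it assumes no tied values and is meant as an illustration, not as a replacement for a statistics library.

```python
from itertools import combinations


def rank_sum_exact_p(group_a, group_b):
    """Two-sided exact p-value of the Wilcoxon rank-sum statistic
    (assumes no ties between the pooled values)."""
    pooled = sorted(group_a + group_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n = len(group_a), len(pooled)
    observed = sum(rank[v] for v in group_a)
    expected = n_a * (n + 1) / 2
    # Rank sums of every possible way to pick the ranks of group_a.
    sums = [sum(c) for c in combinations(range(1, n + 1), n_a)]
    extreme = sum(1 for s in sums
                  if abs(s - expected) >= abs(observed - expected))
    return extreme / len(sums)


if __name__ == "__main__":
    # Perfectly separated groups, 3 vs 3 replicates.
    print(rank_sum_exact_p([9.1, 9.3, 9.2], [7.0, 7.2, 7.1]))
```

Even with perfectly separated groups, three replicates per group cannot give a two-sided p-value below 0.1 with this test, whereas a t-test on the same data could.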
[Figure: distributions of Population Ctrl (mean y1) and Population Trtd (mean y2), with the sample mean "s" marked]
Less than a 5% chance that the sample with mean s came from population y1, i.e., s is significantly different from “mean y1” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population y2.
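t-statistic
$$t = \frac{\bar{x}_T - \bar{x}_C}{SE}$$
where
$$SE = s_p \sqrt{\frac{1}{n_T} + \frac{1}{n_C}}$$
using the pooled variance
$$s_p^2 = \frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}$$
Here $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and number of replicates of the treatment (T) and control (C) groups.
In the case of unequal variance
Welch statistic
$$t_W = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{SE_W^2}}$$
with the unpooled (squared) standard error
$$SE_W^2 = \frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}$$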
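The two statistics named above can be sketched in a few lines, using the standard textbook definitions of the pooled-variance and Welch statistics. The function names are illustrative, and the example values are invented.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(treated, control):
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * variance(treated)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var * (1 / n_t + 1 / n_c))


def welch_t(treated, control):
    """Welch statistic: unpooled squared standard error in the denominator."""
    se2 = variance(treated) / len(treated) + variance(control) / len(control)
    return (mean(treated) - mean(control)) / sqrt(se2)


if __name__ == "__main__":
    treated = [9.0, 10.0, 11.0]
    control = [7.9, 8.0, 8.1, 8.0, 8.0]
    print(pooled_t(treated, control), welch_t(treated, control))
```

With equal group sizes the two statistics coincide; they diverge when both the group sizes and the variances differ, as in the example.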
• The t-statistic is widely used in assessing differential expression.
• Unstable variance estimates that arise when sample size is small can be corrected using:
– Bayesian methods (Limma)
Bayesian regularized t-test
(Baldi & Long 2001)
The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.
The main goal of this approach is to stabilize the variance estimates that arise when sample size is small, to make the t-test results more robust.
The regularized t-test makes the presence of significant differential expression more evident.
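The stabilizing effect can be illustrated with a variance-shrinkage sketch in the spirit of this approach. The weighting formula, the neighbour window used as background and the `prior_df` value are assumptions made for illustration and should be checked against the original paper before use.

```python
from statistics import mean, variance


def background_variance(index, genes_sorted_by_mean, half_window=1):
    """Average sample variance of the genes whose mean expression is
    closest to the gene at `index` (genes must be sorted by mean)."""
    lo = max(0, index - half_window)
    hi = min(len(genes_sorted_by_mean), index + half_window + 1)
    return mean([variance(g) for g in genes_sorted_by_mean[lo:hi]])


def regularized_variance(values, background, prior_df=10):
    """Shrink the gene's own variance towards the background variance.
    Assumed form: (prior_df * background + (n - 1) * s^2) / (prior_df + n - 2)."""
    n = len(values)
    return (prior_df * background + (n - 1) * variance(values)) / (prior_df + n - 2)


if __name__ == "__main__":
    gene = [8.0, 8.0, 8.1]          # tiny variance by chance, 3 replicates
    print(variance(gene), regularized_variance(gene, background=0.5))
```

A gene with an accidentally tiny variance no longer produces an inflated t-statistic once its variance is pulled towards that of genes with similar mean expression.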
Type I error correction
• Null hypothesis (H0): the mean of treated and the mean of control for a gene i belong to the same distribution.
• Type I error: H0 is rejected although it is true.
• H0 is rejected for the p-values lower than K(g, α); all the remaining H0 are considered true.
α = acceptance level (e.g. 0.05)
g = number of independent tests
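The slide does not spell out K(g, α). One common choice is the Bonferroni threshold α/g, and the sketch below assumes that choice; the function names are illustrative.

```python
def bonferroni_threshold(alpha: float, g: int) -> float:
    """Assumed form of K(g, alpha): the Bonferroni cutoff alpha / g."""
    return alpha / g


def bonferroni_reject(p_values, alpha=0.05):
    """True for each test whose p-value falls below the cutoff."""
    k = bonferroni_threshold(alpha, len(p_values))
    return [p < k for p in p_values]


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 22300))
    print(bonferroni_reject([0.001, 0.02, 0.04, 0.5]))
```

With tens of thousands of probe sets the cutoff becomes very small, which is why this kind of correction costs many false negatives.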
Type I error correction (FWER)
• BH is the most used method for the correction of type I errors in microarray analysis.
• However, it has some limitations due to its initial hypotheses:
– The gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.
[Figure: distribution of raw p-values] The application of BH correction to these p-values will not produce any differentially expressed genes!
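A minimal sketch of the Benjamini-Hochberg step-up procedure, which controls the false discovery rate: the sorted p-values are compared with the increasing thresholds (rank / m) × α, and everything up to the largest p-value that passes is rejected. The function name and example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected by the
    Benjamini-Hochberg step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p))
    # Evenly spread p-values with no excess of small ones: nothing is rejected.
    print(benjamini_hochberg([0.2, 0.4, 0.6, 0.8, 1.0]))
```

When the raw p-values show no excess of small values, no p-value falls under its threshold and the corrected list is empty.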
SAM (Significance analysis of microarrays)
(Tusher et al. 2001)
• The fudge factor regularizes the t-statistic by inflating the denominator.
• s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
• SAM uses data permutations to define the set of significantly differentially expressed genes.
[Figure: permutations of the N and T sample labels across the arrays]
How does SAM calculate the False Discovery Rate for a specific delta?
• FDR is given by p0 × False / Called
• p0 is the prior probability (π0) that a gene is not differentially expressed
[Figure: falsely called genes in permutations 1, 2, 3, 4, …, 720, and their mean ("Mean false")]
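The permutation logic can be sketched as follows. This is a simplified stand-in and not the SAM algorithm itself: it uses a fixed cutoff on |d| in place of SAM's delta, computes s(i) as the pooled standard error of the mean difference, and takes the fudge factor `s0` and `p0` as given. All of these, and the function names, are assumptions.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def d_statistic(treated, control, s0):
    """t-like statistic whose denominator is inflated by the fudge factor s0."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * variance(treated)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    s = sqrt(pooled_var * (1 / n_t + 1 / n_c))
    return (mean(treated) - mean(control)) / (s + s0)


def permutation_fdr(data, treated_idx, cutoff, s0=0.1, p0=1.0):
    """data: one list of intensities per gene; treated_idx: columns of the
    treated samples. Returns (called, mean_false, fdr)."""
    n_samples = len(data[0])

    def n_called(t_idx):
        c_idx = [j for j in range(n_samples) if j not in t_idx]
        return sum(
            1 for row in data
            if abs(d_statistic([row[j] for j in t_idx],
                               [row[j] for j in c_idx], s0)) >= cutoff
        )

    called = n_called(treated_idx)
    # Every relabelling of the samples into groups of the same sizes.
    counts = [n_called(t) for t in combinations(range(n_samples), len(treated_idx))]
    mean_false = mean(counts)
    fdr = p0 * mean_false / called if called else 0.0
    return called, mean_false, fdr


if __name__ == "__main__":
    data = [
        [10.0, 10.2, 9.8, 5.0, 5.2, 4.8],   # clearly different between groups
        [7.0, 7.1, 6.9, 7.0, 7.1, 6.9],     # no difference
    ]
    print(permutation_fdr(data, treated_idx=(0, 1, 2), cutoff=10.0))
```

The number of genes called in the relabelled data estimates how many calls arise by chance; dividing its mean by the number called with the true labels gives the FDR estimate.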
Rank Product
• Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.
• It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.
• Multiplying these probabilities leads to the definition of the rank product:
$$RP = \prod_i \frac{r_i}{n_i}$$
where $r_i$ is the rank of the item in the i-th list and $n_i$ is the total number of items in the i-th list.
• The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.
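The definition translates directly into code. In this sketch each replicate is a list of log fold changes in the same gene order, rank 1 goes to the most strongly upregulated gene, and ties are not handled; the function names and example values are illustrative.

```python
from math import prod


def ranks_descending(values):
    """Rank 1 for the largest value, rank n for the smallest (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def rank_products(replicates):
    """replicates: one list of log fold changes per replicate experiment.
    Returns RP = prod_i(r_i / n_i) for every gene."""
    n = len(replicates[0])
    per_replicate = [ranks_descending(rep) for rep in replicates]
    return [prod(ranks[g] / n for ranks in per_replicate) for g in range(n)]


if __name__ == "__main__":
    replicates = [
        [3.0, 0.1, -0.5, 1.0],   # replicate 1
        [2.5, 0.0, 0.2, 1.2],    # replicate 2
    ]
    print(rank_products(replicates))   # gene 0 has the smallest RP
```

The gene ranked first in both replicates gets the smallest RP. Judging how small is small enough still requires a reference distribution, which this sketch does not compute.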