Analysis pipeline - unito.it

Analysis pipeline

Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological knowledge extraction

Filtering

• Why filter the data?
– To reduce the number of statistical tests we will have to perform!

Multiple testing errors

• When performing multiple statistical tests, two types of errors can occur:
– Type I error (false positive)
– Type II error (false negative)
• Reducing type I errors increases the number of type II errors.
• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

The multiple tests problem

• As the number of samples increases, the tails of a distribution become more populated.
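The effect can be illustrated with a small calculation. This is a minimal sketch, assuming independent tests in which every null hypothesis is true; the function names are illustrative, and 22300 is the number of probe sets reported elsewhere in these slides.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of p-values below alpha when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 100, 22300):
        print(n, expected_false_positives(n), prob_at_least_one_false_positive(n))
```

With a single test the chance of a false positive is the nominal 5%, but with thousands of tests about 5% of all tested genes are expected to fall in the tails by chance alone.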

Filtering

• Filtering affects the false discovery rate.
• The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset.
• If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered by the filtering.

Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004

Filtering can be performed at various levels:

• Annotation features:
– Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.)
• Signal features:
– Percentage of intensities greater than a user-defined value
– Interquartile range (IQR) greater than a defined value

Intensity distributions

[Figure: intensity distributions for RMA and GCRMA, with the background-level probe sets marked]

How to define the efficacy of a filtering procedure?

$$
\text{enrichment} = 100 \times \frac{N_{\text{spike-in after filtering}} \times N_{\text{probe sets}}}{N_{\text{probe sets after filtering}} \times N_{\text{spike-in}}}
$$

• This enrichment is very similar to the one used to evaluate the purification fold of a protein after a chromatographic step:

$$
\text{enrichment} = 100 \times \frac{E_{\text{AfterChrom}} \times [\mu g\,P]_{\text{BeforeChrom}}}{E_{\text{BeforeChrom}} \times [\mu g\,P]_{\text{AfterChrom}}}
$$

where $E$ is the enzymatic activity and $[\mu g\,P]$ is the amount of protein.
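The enrichment can be computed directly from the four counts. The sketch below is illustrative (the function and argument names are not from the slides) and uses the counts reported for the pOverA example in these slides: 22300 probe sets and 42 spike-ins before filtering, 5553 probe sets and 42 spike-ins after.

```python
def enrichment(n_spike_after: int, n_probesets_after: int,
               n_spike: int, n_probesets: int) -> float:
    """Percent enrichment of spike-in probe sets produced by a filtering step."""
    fraction_after = n_spike_after / n_probesets_after
    fraction_before = n_spike / n_probesets
    return 100.0 * fraction_after / fraction_before


if __name__ == "__main__":
    # Counts taken from the pOverA slide: all 42 spike-ins survive the filter.
    print(round(enrichment(42, 5553, 42, 22300), 1))
```

A value of 100% means the filter did not change the proportion of spike-ins; values above 100% mean the spike-ins are overrepresented after filtering.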

Filtering by genefilter pOverA
(keep a probe set if ≥ 25% of its intensities are ≥ log2(100))

• Before filtering: 22300 probe sets, 42/42 spike-in
• After filtering: 5553 probe sets, 42/42 spike-in
• Enrichment: 401%
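The filtering rule can be sketched in a few lines. This is a plain Python analogue written for illustration, not the genefilter implementation; the probe set names and intensities are made up, and the comparison follows the "≥" wording of the slide.

```python
import math


def p_over_a(p: float, a: float):
    """Return a filter that keeps a probe set when at least a proportion p
    of its intensities are >= a."""
    def keep(values):
        above = sum(1 for v in values if v >= a)
        return above / len(values) >= p
    return keep


if __name__ == "__main__":
    # Hypothetical log2 intensities of two probe sets over four arrays.
    data = {
        "probe_expressed": [7.2, 5.1, 5.3, 5.0],
        "probe_background": [5.2, 5.1, 5.3, 5.0],
    }
    keep = p_over_a(0.25, math.log2(100))
    kept = [name for name, values in data.items() if keep(values)]
    print(kept)
```

Building the filter as a function of `p` and `a` makes it easy to try several thresholds on the same data.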


Filtering by interquartile range (IQR)

[Figure: the IQR is the range between the 25% and 75% quantiles of a distribution]

How does filtering by genefilter IQR work?

• The distribution of all intensity values in a differential expression experiment summarizes the distributions of each gene's expression over the experimental conditions.
• The filter removes genes that show little change across the experimental points.

Filtering by genefilter IQR
(remove a probe set if its intensity IQR is ≤ 0.25 or ≤ 0.5)

• Before filtering: 22300 probe sets, 42/42 spike-in
• After removing IQR ≤ 0.25: 244 probe sets, 42/42 spike-in
• After removing IQR ≤ 0.5: 68 probe sets, 42/42 spike-in
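An IQR filter of this kind can be sketched as follows. This is an illustrative Python version, not the genefilter code; it assumes log2 intensities, uses the "inclusive" quantile method of the standard library, and the gene names and values are hypothetical.

```python
from statistics import quantiles


def iqr(values) -> float:
    """Interquartile range: 75% quantile minus 25% quantile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expression: dict, threshold: float) -> dict:
    """Keep only the genes whose IQR across arrays exceeds the threshold."""
    return {g: v for g, v in expression.items() if iqr(v) > threshold}


if __name__ == "__main__":
    # Hypothetical log2 intensities over four experimental points.
    expression = {
        "flat_gene": [7.0, 7.1, 7.0, 7.1],
        "changing_gene": [6.0, 6.2, 8.0, 8.3],
    }
    print(list(iqr_filter(expression, 0.5)))
```

Raising the threshold from 0.25 to 0.5 can only remove more genes, which is why the second filter in the slide leaves fewer probe sets.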





Statistical analysis

• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be:
– Technical
– Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!

How important are replicates?

[Figure from Yang YH and Speed T, 2002]

Sample size

• Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates.
• The issue of how many replicates are required in a typical experimental system needs to be addressed.
• Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

Assessing sample sizes in microarray experiments

• Assessing sample sizes for microarray data is a tricky exercise.
• The reason for performing such an analysis is to get a general sense of the ability of our experimental data to robustly detect differential expression.

Assumptions

• A microarray experiment is set up to compare gene expression between one treatment group and one control group.
• Microarray data have been normalized and transformed so that the data for each gene are sufficiently close to a normal distribution that a standard 2-sample pooled-variance t-test will reliably detect differentially expressed genes.
• The tested hypothesis for each gene is:

$$
H_0: \mu_T = \mu_C \quad \text{versus} \quad H_1: \mu_T \neq \mu_C
$$

where $\mu_T$ and $\mu_C$ are the means of gene expression for the treatment and control group respectively.
• The analysis is done using the common variance described in:
– Wei et al. BMC Genomics. 2004, 5:87
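To get a rough feeling for the numbers involved, the per-group sample size for a two-sample comparison can be approximated with the textbook normal-approximation formula n = 2 (z_{1-α/2} + z_{1-β})² (σ/δ)². The sketch below is that generic approximation, not the procedure of Wei et al.; the function name and the example values for the effect size δ and standard deviation σ are assumptions for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(delta: float, sigma: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate replicates per group for a two-sided two-sample test.

    delta: difference in means to detect (e.g. on the log2 scale)
    sigma: common standard deviation of a gene within each group
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    n = 2.0 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Detecting a difference of one standard deviation at alpha = 0.05.
    print(samples_per_group(delta=1.0, sigma=1.0))
    # A stricter alpha, as needed after multiple testing correction, costs replicates.
    print(samples_per_group(delta=1.0, sigma=1.0, alpha=0.001))
```

Larger within-group variability, as expected in outbred populations, increases σ and therefore the number of replicates required for the same δ.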

Log2(T/C) is frequently used to evaluate fold change variation

[Figure: t/c and log2(t/c) plotted for treated intensities doubling from 200 against a control intensity of 100. On the t/c scale down-regulation is compressed, while log2(t/c) is symmetric around 0.]

Fold change filtering

• The intensity changes between experimental groups (i.e. control versus treated) are known as:
– Fold change.
• Frequently an arbitrary threshold

$$
\log_2 \frac{Trtd}{Ctrl} = 1
$$

is used to define a significant differential expression.
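The log2 fold change and the threshold rule are simple to compute. In this illustrative sketch the function names and intensities are hypothetical, and the absolute value is taken so that up- and down-regulation are treated symmetrically, which is an assumption about how the threshold is applied.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """log2 ratio of treated over control intensity."""
    return math.log2(treated / control)


def passes_fold_change(treated: float, control: float, threshold: float = 1.0) -> bool:
    """True when the absolute log2 fold change reaches the threshold."""
    return abs(log2_fold_change(treated, control)) >= threshold


if __name__ == "__main__":
    # A doubling and a halving are symmetric on the log2 scale.
    print(log2_fold_change(200, 100), log2_fold_change(50, 100))
    print(passes_fold_change(150, 100))
```

On the raw ratio scale a halving is 0.5 and a doubling is 2, whereas on the log2 scale they are -1 and +1.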

Statistical analysis

• Intensity changes between experimental groups (i.e. control versus treated) are known as:
– Fold change.
• Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.

Statistical filtering

• Statistical filtering can be performed using parametric and non-parametric tests.
• Parametric tests:
– The populations under analysis are assumed to be normally distributed.
• Non-parametric tests:
– There is no assumption on the distribution of the samples.
• Non-parametric tests are less sensitive than parametric ones.

Selecting differentially expressed genes

[Figure: overlap of the gene sets selected by statistical validation methods I, II and III, and the differential expression linked to a specific biological event]

Selecting differentially expressed genes

• Each method captures some true signals, but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing false ones.

[Figure: distributions of population Ctrl (mean y1) and population Trtd (mean y2), with the sample mean "s"]

There is less than a 5% chance that the sample with mean s came from population y1, i.e., s is significantly different from "mean y1" at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population y2.

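t-statistics

$$
t = \frac{\bar{x}_T - \bar{x}_C}{s_p \sqrt{\frac{1}{n_T} + \frac{1}{n_C}}}
$$

where

$$
s_p^2 = \frac{(n_T - 1)\, s_T^2 + (n_C - 1)\, s_C^2}{n_T + n_C - 2}
$$

using the pooled variance. Here $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and number of replicates of the treated (T) and control (C) groups.

In the case of unequal variance: Welch statistics

$$
t' = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}}
$$

with the unpooled (squared) standard error $s_T^2/n_T + s_C^2/n_C$.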
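Both statistics can be computed with the standard library. The sketch below uses the usual textbook definitions of the pooled-variance and Welch statistics; the function names and the example log2 intensities are illustrative.

```python
import math
from statistics import mean, variance


def pooled_t(treated, control) -> float:
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * variance(treated) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var * (1 / n_t + 1 / n_c))


def welch_t(treated, control) -> float:
    """Welch statistic: unpooled squared standard error in the denominator."""
    se_sq = variance(treated) / len(treated) + variance(control) / len(control)
    return (mean(treated) - mean(control)) / math.sqrt(se_sq)


if __name__ == "__main__":
    treated = [8.1, 8.4, 8.3]   # hypothetical log2 intensities
    control = [7.0, 7.2, 7.1]
    print(pooled_t(treated, control), welch_t(treated, control))
```

With equal group sizes the two statistics coincide; they differ when the groups have different sizes and different variances (the degrees of freedom used for the p-value differ in either case).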

• The t-statistic is widely used for assessing differential expression.
• Unstable variance estimates that arise when the sample size is small can be corrected using:
– Bayesian methods (Limma)
– Error fudge factors (SAM)

Bayesian regularized t-test
(Baldi & Long 2001)

• The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.

[Figure: My gene]

• The main goal of this approach is to stabilize the variance estimates that arise when the sample size is small, to make the t-test results more robust.
• The regularized t-test makes the presence of significant differential expression more evident.
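The idea of stabilizing a variance estimate can be sketched as a weighted combination of the gene's own variance and a background variance. The form below, (ν0·σ0² + (n−1)·s²) / (ν0 + n − 2), is the one commonly given for the Baldi & Long regularized estimate; treat it as an assumption here, since the formula itself is not legible in these slides. The function name, σ0² (background variance from genes with similar mean expression) and ν0 (weight of the prior) are illustrative parameters.

```python
def regularized_variance(sample_var: float, n: int,
                         background_var: float, nu0: float) -> float:
    """Shrink a gene's sample variance toward a background variance.

    sample_var:     s^2, the gene's own variance from n replicates
    background_var: sigma0^2, assumed to come from genes of similar mean expression
    nu0:            pseudo-count giving the weight of the background variance
    """
    return (nu0 * background_var + (n - 1) * sample_var) / (nu0 + n - 2)


if __name__ == "__main__":
    # With 3 replicates, an unusually small variance is pulled toward the background.
    print(regularized_variance(sample_var=0.001, n=3, background_var=0.1, nu0=10))
```

With few replicates the background term dominates, so a gene cannot obtain a huge t value merely because its handful of replicates happened to agree closely.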

Type I error correction

• Null hypothesis (H0): the mean of treated and the mean of control for a gene i belong to the same distribution.
• Type I error: H0 is rejected although it is true.
• Sidak significance point:

$$
K(g, \alpha) = 1 - (1 - \alpha)^{1/g}
$$

α = acceptance level (e.g. 0.05)
g = number of independent tests

• H0 is rejected for the genes whose p-values are lower than K(g, α); all the remaining H0 are considered true.

Type I error correction (FWER)

$$
K(g, \alpha) = 1 - (1 - \alpha)^{1/g}
$$

P of diff. exprs. genes    α'
< 10^-6                    1 − (1 − 0.05)^(1/5) = 0.0102
< 10^-6                    1 − (1 − 0.05)^(1/4) = 0.0127
2 × 10^-5                  1 − (1 − 0.05)^(1/3) = 0.0170
0.047                      1 − (1 − 0.05)^(1/2) = 0.0253
…
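The thresholds listed above can be reproduced numerically; the g = 5 entry evaluates to about 0.0102. The sketch below also applies them in the step-down fashion suggested by the list (smallest p-value against g, the next against g − 1, and so on). The function names are illustrative, and the p-values in the example are stand-ins: 1e-7 replaces the "< 10^-6" entries and 0.2 is a made-up fifth value.

```python
def sidak_threshold(alpha: float, g: int) -> float:
    """Sidak significance point K(g, alpha) = 1 - (1 - alpha)^(1/g)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / g)


def sidak_step_down(p_values, alpha: float = 0.05) -> int:
    """Number of hypotheses rejected when sorted p-values are compared with
    K(g, alpha), K(g - 1, alpha), ... until the first failure."""
    g = len(p_values)
    rejected = 0
    for i, p in enumerate(sorted(p_values)):
        if p < sidak_threshold(alpha, g - i):
            rejected += 1
        else:
            break
    return rejected


if __name__ == "__main__":
    for g in (5, 4, 3, 2):
        print(g, round(sidak_threshold(0.05, g), 4))
    print(sidak_step_down([1e-7, 1e-7, 2e-5, 0.047, 0.2]))
```

In the example the fourth p-value (0.047) exceeds its threshold of 0.0253, so only the first three hypotheses are rejected even though 0.047 is below the uncorrected 0.05 level.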

BH correction

• BH (Benjamini–Hochberg) is the most widely used method for the correction of type I errors in microarray analysis.
• However, it has some limitations due to its initial hypotheses:
– The gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.

[Figure: distribution of raw p-values. The application of BH correction to these p-values will not produce any differentially expressed genes!]
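The standard Benjamini–Hochberg step-up procedure is short enough to write out. This is a minimal sketch of the textbook procedure with illustrative names and made-up p-values; the second example uses evenly spread p-values to show a case in which no gene is called.

```python
def benjamini_hochberg(p_values, q: float = 0.05):
    """Return a list of booleans (in input order): True where H0 is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k / m * q and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(benjamini_hochberg(p)))
    # p-values spread evenly over (0, 1]: nothing survives the correction.
    flat = [k / 10 for k in range(1, 11)]
    print(sum(benjamini_hochberg(flat)))
```

When the raw p-values show no excess of small values, every p_(k) stays above its k/m · q line and the corrected list is empty.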

SAM (Significance Analysis of Microarrays)
(Tusher et al. 2001)

• A fudge factor regularizes the t-statistic by inflating the denominator:

$$
d(i) = \frac{\bar{x}_T(i) - \bar{x}_N(i)}{s(i) + s_0}
$$

where s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays, and $s_0$ is the fudge factor.

• SAM uses data permutations to define a set of significant differential expressions.
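The effect of the fudge factor can be seen by adding a constant s0 to the denominator of an ordinary pooled t-statistic. This is an illustrative sketch, not the SAM implementation: here s0 is passed in by hand, whereas SAM chooses it from the data, and the group labels T and N and the intensities are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_standard_error(group_t, group_n) -> float:
    """Pooled standard error of the difference between two group means."""
    n_t, n_n = len(group_t), len(group_n)
    pooled_var = ((n_t - 1) * variance(group_t) + (n_n - 1) * variance(group_n)) / (n_t + n_n - 2)
    return math.sqrt(pooled_var * (1 / n_t + 1 / n_n))


def sam_d(group_t, group_n, s0: float) -> float:
    """Regularized statistic: difference of means over (s + s0)."""
    s = pooled_standard_error(group_t, group_n)
    return (mean(group_t) - mean(group_n)) / (s + s0)


if __name__ == "__main__":
    # A gene with a tiny difference but almost no variance gets a large t ...
    t_group = [7.11, 7.12, 7.11]
    n_group = [7.01, 7.02, 7.01]
    print(sam_d(t_group, n_group, s0=0.0))
    # ... which the fudge factor brings back down.
    print(sam_d(t_group, n_group, s0=0.1))
```

With s0 = 0 the statistic is the ordinary pooled t; a positive s0 mostly penalizes genes whose own standard error is very small.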

[Figure: permutations of the sample labels N and T across the arrays]

How does SAM calculate the false discovery rate for a specific delta?

[Figure: number of false calls in each permutation (1, 2, 3, 4, ..., 720) and their mean (mean false)]

• FDR is given by π0 × False / Called, where π0 is the prior probability that a gene is not differentially expressed, False is the mean number of genes called in the permuted data, and Called is the number of genes called in the real data.
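Once the false calls have been counted in each permutation, the estimate is a one-line calculation. In this sketch the function name and the counts are made up, and capping the result at 1 is an added safeguard rather than something stated in the slides.

```python
from statistics import mean


def sam_fdr(false_calls_per_permutation, n_called: int, pi0: float = 1.0) -> float:
    """FDR estimate: pi0 * (mean false calls over permutations) / called."""
    if n_called == 0:
        return 0.0
    return min(1.0, pi0 * mean(false_calls_per_permutation) / n_called)


if __name__ == "__main__":
    # Hypothetical counts: 4 permutations, 30 genes called in the real data.
    print(sam_fdr([2, 4, 3, 3], n_called=30, pi0=0.9))
```

Setting pi0 to 1 gives the most conservative estimate, since it assumes that no gene is differentially expressed.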

Rank Product

• Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.
• It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.
• Multiplying these probabilities leads to the definition of the rank product:

$$
RP = \prod_i \frac{r_i}{n_i}
$$

where $r_i$ is the rank of the item in the i-th list and $n_i$ is the total number of items in the i-th list.
• The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.
• For a gene g:

$$
RP_g = \prod_i \frac{r_{g,i}}{n_i}
$$
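The rank product is straightforward to compute from per-replicate fold changes. In this illustrative sketch, rank 1 is assigned to the most strongly upregulated gene in each replicate, ties are not handled, and the gene names and values are made up.

```python
def rank_product(fold_changes: dict) -> dict:
    """Rank product per gene.

    fold_changes maps gene -> list of log fold changes, one per replicate.
    In each replicate the largest fold change gets rank 1.
    """
    genes = list(fold_changes)
    n_genes = len(genes)
    n_replicates = len(next(iter(fold_changes.values())))
    rp = {g: 1.0 for g in genes}
    for i in range(n_replicates):
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            rp[g] *= rank / n_genes
    return rp


if __name__ == "__main__":
    data = {
        "A": [3.0, 2.5, 2.8],    # consistently at the top
        "B": [0.1, 1.0, -0.2],
        "C": [1.2, 0.3, 0.9],
        "D": [-0.5, -0.1, 0.2],
    }
    for gene, value in sorted(rank_product(data).items(), key=lambda kv: kv[1]):
        print(gene, value)
```

A gene ranked first in every one of k replicates of n genes gets RP = (1/n)^k, the smallest value possible.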

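Rank Product

The significance of $RP_g$ is assessed by permutation. With G genes, L permutations and $RP^{*(l)}_{g'}$ the rank product of gene g' in permutation l:

$$
P_g = \frac{1}{G\,L} \sum_{l} \sum_{g'} I\left(RP^{*(l)}_{g'} \le RP_g\right)
$$

$$
FDR_g = \frac{\frac{1}{L} \sum_{l} \sum_{g'} I\left(RP^{*(l)}_{g'} \le RP_g\right)}{\sum_{g'} I\left(RP_{g'} \le RP_g\right)}
$$

where $I(\cdot)$ is the indicator function.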
