1-9-2006
Identifying Differentially Expressed Genes
• First approach: repeat a simple analysis for each gene separately, about 30,000 times
• Assume we have two experimental conditions ($j = 1, 2$)
• We measure the expression of all genes $n$ times under both experimental conditions ($n$ two-channel microarrays)
• Focusing on a single gene, $x_{ij}$ = the $i$-th measurement under condition $j$
• Statistical models for the expression measurements under the two conditions:
$$x_{i1} \sim N(\mu_1, \sigma^2), \qquad x_{i2} \sim N(\mu_2, \sigma^2)$$
• $\mu_1$, $\mu_2$, and $\sigma$ are unknown model parameters: $\mu_j$ represents the average expression measurement over a large number of replicated experiments, and $\sigma$ represents the variability of the measurements
• The question of whether the gene is differentially expressed corresponds to assessing whether $\mu_1 \neq \mu_2$
• The strength of evidence in the observed data that this is the case is expressed in terms of a p-value
P-value
• Estimate the model parameters based on the data:
$$\hat{\mu}_j = \bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij}, \qquad s_j^2 = \frac{\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{n_j - 1}, \qquad \hat{\sigma}^2 = s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
• Calculate the t-statistic, which summarizes the information about our hypothesis of interest ($\mu_1 \neq \mu_2$):
$$t^* = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
• Establish the null distribution of the t-statistic (its distribution assuming the “null hypothesis” that $\mu_1 = \mu_2$)
• The null distribution in this case turns out to be the t-distribution with $n_1 + n_2 - 2$ degrees of freedom
• The p-value is the probability, under the null distribution, of observing a value as extreme as or more extreme than the one calculated from the data ($t^*$)
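The steps on this slide can be sketched for a single gene in plain Python. This is a minimal illustration and not part of the original slides: the function names (`t_density`, `two_sided_p`, `pooled_t_test`) and the example measurements are made up, and the p-value is obtained by numerically integrating the t-density with Simpson's rule, which is adequate for illustration but loses relative accuracy for extremely small p-values.

```python
import math


def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom at x."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_star, df, steps=2000):
    """P(|T| >= |t_star|) under the null t-distribution.

    Integrates the density over [0, |t_star|] with Simpson's rule
    (steps must be even) and subtracts twice that area from 1.
    """
    b = abs(t_star)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def pooled_t_test(x1, x2):
    """Return (t_star, df, p_value) for the equal-variance two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    var1 = sum((x - mean1) ** 2 for x in x1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in x2) / (n2 - 1)
    df = n1 + n2 - 2
    # Pooled standard deviation s (assumes the data are not all identical).
    s = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / df)
    t_star = (mean1 - mean2) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t_star, df, two_sided_p(t_star, df)


# Hypothetical log-expression measurements for one gene (made-up numbers).
condition1 = [1.0, 2.0, 3.0, 4.0, 5.0]
condition2 = [2.0, 3.0, 4.0, 5.0, 6.0]
t_star, df, p_value = pooled_t_test(condition1, condition2)
print(f"t* = {t_star:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The p-value here is two-sided, so the sign of t* (which condition is subtracted from which) does not affect it.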
t-distribution
• The number of experimental replicates affects the precision at two levels:
1. Everything else being equal, an increase in sample size increases $t^*$
2. Everything else being equal, an increase in sample size “shrinks” the null distribution
• Suppose that $t^* = 3$. How does the p-value differ depending on the sample size alone?
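To make the last question concrete, the sketch below evaluates the two-sided p-value for a fixed t* = 3 at several sample sizes. It assumes n replicates per condition, so the null distribution has 2n - 2 degrees of freedom. The helper names and the chosen values of n are illustrative only, and the p-value comes from a simple numerical integration of the t-density rather than from a statistics library.

```python
import math


def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom at x."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_star, df, steps=2000):
    """P(|T| >= |t_star|), via Simpson's rule on [0, |t_star|] (steps even)."""
    b = abs(t_star)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def p_values_for_fixed_t(t_star, replicates_per_condition):
    """Map each n (replicates per condition) to the p-value with 2n - 2 df."""
    return {n: two_sided_p(t_star, 2 * n - 2) for n in replicates_per_condition}


# Illustrative sample sizes; the same t* = 3 is used throughout.
for n, p in p_values_for_fixed_t(3.0, [2, 3, 6, 12, 50]).items():
    print(f"n = {n:2d} per condition, df = {2 * n - 2:2d}, p-value = {p:.4f}")
```

For the same t*, the p-value falls as n grows, because the t-distribution has lighter tails with more degrees of freedom.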
Statistical Inference and Statistical Significance – P-value
• Statistical inference consists of drawing conclusions about the measured phenomenon (e.g. gene expression) in terms of probabilistic statements based on observed data. The p-value is one way of doing this.
• The p-value is NOT the probability of the null hypothesis being true.
• Rigorous interpretation of the p-value is tricky.
• It was introduced to measure the level of evidence against the “null hypothesis” or, better said, in favor of a “positive experimental finding”
• In this context a p-value of 0.0001 can be interpreted as stronger evidence than a p-value of 0.01
• Establishing statistical significance (is a difference in expression level statistically significant or not) requires that we establish “cut-off” points for our “measure of significance” (the p-value)
• For various historical reasons the cut-off 0.05 is generally used to establish “statistical significance”.
• It is a rather arbitrary cut-off, but it is taken as a gold standard
• Originally the p-value was introduced as a descriptive measure to be used in conjunction with other criteria to judge the strength of evidence one way or another
Statistical Inference and Statistical Significance – Hypothesis Testing
• The 5% cut-off point comes from the hypothesis testing world
• In this world the exact magnitude of the p-value does not matter. It only matters whether it is smaller than the pre-specified statistical significance cut-off ($\alpha$).
• The null hypothesis is rejected in favor of the alternative hypothesis at a significance level of $\alpha = 0.05$ if the p-value < 0.05
• A Type I error is committed when the null hypothesis is falsely rejected
• A Type II error is committed when the null hypothesis is not rejected but it is false
• By following this “decision making scheme” you will on average falsely reject 5% of the null hypotheses that are true
• If such a “decision making scheme” is adopted to identify differentially expressed genes on a microarray, on average 5% of non-differentially expressed genes will be falsely implicated as differentially expressed.
• A family-wise Type I error is committed if any of a set of null hypotheses is falsely rejected
• Establishing statistical significance is a necessary but not sufficient step in assuring the “reproducibility” of a scientific finding. This is an important point that will be further discussed when we start talking about issues in experimental design
• The other essential ingredient is a “representative sample” from the “population of interest”
• This is still a murky point in molecular biology experimentation
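The consequences of the 5% rule for a whole array can be illustrated with a little arithmetic. The sketch below rests on stated assumptions: the function names are made up for illustration, and the family-wise error formula 1 - (1 - alpha)^m holds only if the m tests are independent, which is not guaranteed for genes on a microarray.

```python
def expected_false_positives(num_true_nulls, alpha):
    """Average number of true null hypotheses rejected at level alpha."""
    return num_true_nulls * alpha


def family_wise_error_rate(num_true_nulls, alpha):
    """Probability of at least one false rejection among the true nulls.

    Assumes the tests are independent (an assumption made for this sketch).
    """
    return 1.0 - (1.0 - alpha) ** num_true_nulls


alpha = 0.05
# 30,000 matches the number of genes mentioned in the slides; the smaller
# values are included only to show how quickly the family-wise rate grows.
for m in (1, 10, 100, 30000):
    print(f"m = {m:5d}: expected false positives = "
          f"{expected_false_positives(m, alpha):7.1f}, "
          f"family-wise error rate = {family_wise_error_rate(m, alpha):.4f}")
```

If none of 30,000 genes were differentially expressed, testing each at the 0.05 level would implicate about 1,500 of them on average.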
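Is a Specific Gene Differentially Expressed
• For a specific gene, $x_{ij}$ = the $i$-th measurement under condition $j$, $i = 1, \ldots, 6$; $j = 1, 2$
• Statistical model of the observed data:
$$x_{i1} \sim N(\mu_1, \sigma^2), \qquad x_{i2} \sim N(\mu_2, \sigma^2)$$
• Differential expression: $\mu_1 \neq \mu_2$
• Estimate the model parameters based on the data:
$$\hat{\mu}_j = \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1}, \qquad \hat{\sigma}^2 = s^2 = \frac{(n - 1)s_1^2 + (n - 1)s_2^2}{2n - 2}$$
• Calculating the t-statistic:
$$t^* = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{2}{n}}}$$
• Calculating the p-value based on the “null distribution” of the t-statistic, assuming $\mu_1 = \mu_2$
[Figure: two plots over t-statistics (horizontal axis, -4 to 4; vertical axis, 0.0 to 0.4), with $-t^*$ and $t^*$ marked]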
• How do we perform t-tests for 30,000 genes at once?
• How do we handle the results, and how do we present the data and results?
• What is significant?
• How do we compare different approaches to normalizing the data and to the statistical analysis of the results?
• Ideally, we would like to maximize our ability to identify truly differentially expressed genes and minimize the number of falsely implicated genes.