q-value
Tiffany Chao, Beth Johnson, Steven Lee
Hypothesis testing
● Test for each gene
○ null hypothesis: no differential expression
● Two kinds of errors
○ type I error (false positive): say that a gene is differentially expressed when it actually isn't; wrongly reject a true null hypothesis
○ type II error (false negative): say that a gene isn't differentially expressed when it actually is; fail to reject a false null hypothesis
Thinking about p-values
● Probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming the null hypothesis is true
● Minimum false positive rate at which an observed statistic can be called significant
● If the null hypothesis is simple, then a null p-value is uniformly distributed on [0, 1]
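The uniformity of null p-values can be checked with a small simulation. The sketch below assumes a two-sided z-test with a standard normal null (an illustrative choice, not something the slides specify); the function names, seed, and sample size are hypothetical.

```python
import math
import random

def two_sided_z_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def fraction_at_or_below(p_values, t):
    """Fraction of p-values that are <= t."""
    return sum(1 for p in p_values if p <= t) / len(p_values)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Every test statistic is drawn from the null, so every p-value is a null p-value.
null_p_values = [two_sided_z_p_value(rng.gauss(0.0, 1.0)) for _ in range(20000)]

# If null p-values are uniform, the fraction <= t should be close to t.
for t in (0.01, 0.05, 0.5):
    print(t, round(fraction_at_or_below(null_p_values, t), 3))
```

Each printed fraction should land close to its threshold t, which is the property the later FDR estimate relies on.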
Multiple comparison problem
● Even if we have useful approximations for our p-values, we still face the multiple comparison problem
● When performing many independent tests, p-values no longer have the same interpretation
Not only in genomics!
● "Statistical Comparisons of Classifiers over Multiple Datasets", Demsar, JMLR 2006
● "Permutation Tests for Studying Classifier Performance", Ojala, JMLR 2010
● "On Comparing Classifiers: Pitfalls to avoid and a recommended approach", Salzberg, 1997, Data Mining and Knowledge Discovery
Suppose we care about p-values ≤ 0.05?
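A rough way to see the problem: if all m nulls are true and the tests are independent (both assumptions made only for this sketch), a cutoff of 0.05 yields 0.05 × m expected false positives, and the chance of at least one false positive grows quickly with m. The function names are hypothetical.

```python
def expected_false_positives(m, alpha):
    """Expected number of true nulls with p <= alpha when all m nulls are true."""
    return m * alpha

def prob_at_least_one_false_positive(m, alpha):
    """Pr(F >= 1) for m independent tests of true nulls, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 10000):
    print(m,
          expected_false_positives(m, 0.05),
          round(prob_at_least_one_false_positive(m, 0.05), 4))
```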
Multiple hypothesis testing
                   Called significant   Called not significant   Total
Null true          F                    m0 – F                   m0
Alternative true   T                    m1 – T                   m1
Total              S                    m – S                    m
Error rates (more on this later)
● Per comparison error rate (PCER)
○ E[F] / m
● Per family error rate (PFER)
○ E[F]
● Family-wise error rate (FWER)
○ Pr(F ≥ 1)
● False discovery rate (FDR)*
○ E[F/S] (with F/S set to 0 when S = 0) = E[F/S | S > 0] Pr(S > 0)
● Positive false discovery rate (pFDR)*
○ E[F/S | S > 0]
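To make the definitions concrete, the sketch below replaces each expectation with an average over repeated experiments. The (F, S) outcomes are made-up numbers for illustration only, and `error_rates` is a hypothetical helper.

```python
def error_rates(outcomes, m):
    """Empirical error rates from repeated experiments.

    outcomes: list of (F, S) pairs, one per experiment, where F is the number
    of false positives and S the number of features called significant.
    m: number of hypotheses tested in each experiment.
    """
    n = len(outcomes)
    mean_f = sum(f for f, _ in outcomes) / n
    # F/S is set to 0 when S = 0, as in the FDR definition.
    proportions = [f / s if s > 0 else 0.0 for f, s in outcomes]
    with_discoveries = [f / s for f, s in outcomes if s > 0]
    return {
        "PCER": mean_f / m,
        "PFER": mean_f,
        "FWER": sum(1 for f, _ in outcomes if f >= 1) / n,
        "FDR": sum(proportions) / n,
        "pFDR": (sum(with_discoveries) / len(with_discoveries)
                 if with_discoveries else float("nan")),
        "Pr(S>0)": len(with_discoveries) / n,
    }

# Made-up outcomes of four experiments, each testing m = 100 hypotheses.
outcomes = [(0, 0), (1, 4), (0, 5), (2, 10)]
rates = error_rates(outcomes, m=100)
for name, value in rates.items():
    print(name, round(value, 4))
```

In this toy data the FDR equals the pFDR times Pr(S > 0), matching the identity in the list above.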
MHT error controlling procedure
● Suppose you test m hypotheses and get m p-values: p1 , p2 , p3 , ... pm
● A multiple hypothesis test error controlling procedure is a function T(p; α) such that rejecting all nulls with pi ≤ T(p; α) implies that Error ≤ α
● Error is a population quantity (not random)
Weak and strong control
● Weak: T(p; α) is such that Error ≤ α only when m0 = m
● Strong: T(p; α) is such that Error ≤ α for any value of m0
○ note that m0 is not an argument for T(p; α)!
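Two well-known procedures fit this T(p; α) form and are sketched below as examples the slides do not themselves name: the Bonferroni threshold, which controls the FWER, and the Benjamini-Hochberg step-up threshold, which controls the FDR when the tests are independent. Both take only the p-values and α; neither needs m0. Function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(p_values, alpha):
    """T(p; alpha) = alpha / m."""
    return alpha / len(p_values)

def benjamini_hochberg_threshold(p_values, alpha):
    """Largest sorted p-value p_(k) with p_(k) <= k * alpha / m, else 0.0."""
    m = len(p_values)
    threshold = 0.0  # rejects nothing unless some p_(k) passes its bound
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k * alpha / m:
            threshold = p
    return threshold

def rejected(p_values, threshold):
    """Indices of hypotheses with p_i <= threshold."""
    return [i for i, p in enumerate(p_values) if p <= threshold]

# Made-up p-values for ten hypotheses.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(rejected(p, bonferroni_threshold(p, 0.05)))
print(rejected(p, benjamini_hochberg_threshold(p, 0.05)))
```

On these values the FDR-controlling threshold rejects more hypotheses than the FWER-controlling one, which is the usual trade-off between the two error measures.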
Why FDR and q-value?
● To help us interpret these values, two pieces of information would be useful
○ Estimate of the overall proportion of features that are truly alternative (even if they cannot be precisely identified)
○ Measure of significance that can be associated with each feature so that thresholding the numbers at a particular value has an easy interpretation
FDR
● Would like an error measure that provides a balance between
○ Number of false positive features (F)
○ Number of true positive features (T)
FDR
● The false discovery rate is the expected value of the proportion of false positive features among all those called significant
*S = 0 is possible, so the definition of FDR needs an adjustment for that case (F/S is set to 0 when S = 0)
Estimating FDR
● The FDR therefore depends on the threshold t used to determine significance: FDR(t) ≈ E[F(t)] / E[S(t)]
Estimating FDR
● We now need to approximate E[S(t)] and E[F(t)]
● To illustrate how FDR is determined: for m genes we have m p-values, denoted p1, p2, ..., pm
● Define F(t) = #{null pi ≤ t} and S(t) = #{pi ≤ t}
○ S(t) is observed: we can count the p-values ≤ t for a given t
Estimating FDR
● Approximating F(t) is more difficult because we do not know how many of the values called significant were truly null
● Assuming null p-values are uniformly distributed, Pr(null p ≤ t) = t
● So E[F(t)] = m0 × t (# of null features × probability that a null feature is called significant)
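Putting the two pieces together: S(t) is counted from the data, the expected number of null features called significant is m0 × t, and their ratio gives an estimate of the FDR at threshold t. The sketch below treats m0 as known, which it is not in practice; setting m0 = m gives a conservative value. Function names and the p-values are hypothetical.

```python
def count_significant(p_values, t):
    """S(t): number of p-values <= t."""
    return sum(1 for p in p_values if p <= t)

def expected_null_hits(m0, t):
    """Approximate E[F(t)] = m0 * t under uniform null p-values."""
    return m0 * t

def fdr_estimate(p_values, t, m0):
    """Estimated FDR(t) = m0 * t / S(t), capped at 1 and set to 0 when S(t) = 0."""
    s = count_significant(p_values, t)
    if s == 0:
        return 0.0
    return min(1.0, expected_null_hits(m0, t) / s)

# Made-up p-values: four small ones and four spread over the rest of [0, 1].
p = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.6, 0.8]
print(fdr_estimate(p, 0.01, m0=4))        # m0 assumed known for illustration
print(fdr_estimate(p, 0.01, m0=len(p)))   # conservative choice m0 = m
```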
Estimating FDR
● We do not know the true value of m0 (# of null features), so we must estimate it
● Equivalently, we can estimate the proportion of features that are truly null (denoted by π0)
● Assuming a uniform distribution for null p-values, we can estimate this quantity using a histogram
Estimating π0
Can also fit a cubic function to the π0 vs. λ data to determine π0(1) (because "most" of the p-values near 1 would be expected to be null)
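The fixed-λ version of the estimate can be sketched as follows: p-values above λ are assumed to be mostly null, and if all m features were null about (1 − λ)m of them would fall there, so π0(λ) = #{pi > λ} / ((1 − λ)m). The cubic fit over a range of λ values is not implemented here; the function name and the constructed p-values are hypothetical.

```python
def pi0_estimate(p_values, lam):
    """Estimate pi0 as #{p > lam} / ((1 - lam) * m), capped at 1."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / ((1.0 - lam) * m))

# Constructed example: 20 very small p-values (alternatives) plus
# 80 p-values spread evenly over (0, 1) (nulls), so the true pi0 is 0.8.
p = [0.001] * 20 + [(i + 0.5) / 80 for i in range(80)]
print(pi0_estimate(p, 0.5))
```

Because the null p-values here are spread perfectly evenly, the estimate recovers 0.8; with real data it varies with λ, which is what the fit over λ is meant to smooth out.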
q-value definition
● for a given feature, the q-value is the minimum FDR incurred when it is called significant
○ (every other feature with p_j <= p_i is also called significant)
● in practical terms: a q-value threshold is the "proportion of significant features that turn out to be false leads"
q-value
● a measure of each feature's significance
○ p-value is in terms of the false positive rate
○ q-value is in terms of the FDR
○ this takes into account that thousands of features are simultaneously being tested (via FDR)
■ uses a better model of where the significant features are likely to be
p vs q
● Example:
○ m = 10000
● p-values:
○ with a cutoff of .01, about 100 false positives are expected (m × .01, if nearly all features are null)
○ with a cutoff of .0001, only about 1 false positive is expected, but at what cost?
● q-values:
○ set the q-value cutoff at .05 and expect about 5% of the genes called significant to be false positives
Algorithm for Determining q-Values
● Compute a test statistic and its p-value for each of the m genes
● Estimate π0
○ Using histogram
■ Find region where p-values are uniform + set λ
■ Count p-values > λ and divide by (1-λ)m to get the π0 estimate
○ Using cubic spline
● For each p-value
○ calculate the estimated FDR for each threshold t >= p
■ only need t values equal to the unique p-values in the gene set
○ choose the minimum FDR as the q-value
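The steps above can be sketched in a few lines, assuming π0 has already been estimated and is passed in (the default π0 = 1 is the conservative choice, and with it the result matches Benjamini-Hochberg adjusted p-values). The estimated FDR at threshold t = pi is taken to be π0 × m × pi / #{pj ≤ pi}. The function name and the example p-values are hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """q-value of each feature: minimum estimated FDR over thresholds t >= p_i."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that running_min always holds
    # the minimum estimated FDR over all thresholds t >= the current p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # rank plays the role of S(t) at t = p_i (for tied p-values, the
        # highest tied rank is visited first and carried down by the minimum).
        fdr_at_p = pi0 * m * p_values[i] / rank
        running_min = min(running_min, fdr_at_p)
        q[i] = running_min
    return q

# Made-up p-values, deliberately unsorted.
p = [0.5, 0.001, 0.03, 0.01]
print(q_values(p))
print(q_values(p, pi0=0.5))
```

The q-values come back in the same order as the input p-values, so calling all features with q <= alpha significant is a single comparison per feature.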
q-value accuracy
● assumes that the dependence between features is generally weak
○ genes are actually dependent in pathways, which can be modeled as blocks
● if so, when m is large, calling all features with q <= alpha significant implies FDR <= alpha
● the estimated q-value of each feature is greater than or equal to its true q-value
○ conservative is desirable
q-value summary
● A standard measure of significance that can be universally interpreted between studies
● better than using just p-values
○ an arbitrary choice of alpha, e.g. one chosen so that the expected number of false positives is < 1, throws away too many features that are likely truly significant