Dec 23, 2015
Modern Statistical Data Analysis:
February 27, 2015
Contents

I Hypothesis Testing:

1 Data and Data Preprocessing:
1.1 Visualization of Data:
1.2 Data Transformations:
1.3 Completion of Missing Values:

2 Classical Hypothesis Testing:
2.1 Some Notation and Reminders:
2.1.1 Classical Problems in Parametric Statistics:
2.2 Non-Parametric Tests:
2.2.1 Permutation Tests:

3 Tests for Independence:
3.1 Classical tests for independence of scalar RVs:
3.2 Testing for independence using permutations:
3.2.1 Testing for independence using a Kernel function:
3.2.2 Testing for independence using the Distance Correlation (dCor) method:
3.2.3 Another distance based method for continuous univariate RVs:
3.2.4 An extension of Hoeffding's method:
3.2.5 An information theoretic test for independence:
3.3 Comparison of different tests for independence:
3.3.1 Equitable tests for independence:

4 Multiple Hypothesis Testing (MHT):
4.1 Family Wise Error Rate (FWER):
4.1.1 The Bonferroni Correction:
4.2 False Discovery Rate (FDR):
4.2.1 The BH procedure under more general dependence:
4.2.2 What is the true meaning of FDR control:
4.2.3 Adaptive Procedures (Modified BH procedures):
4.2.4 What to do if the Pvalues are not independent:
4.2.5 Estimation versus control of the FDR:
4.3 An alternative method for MHT using Qvalues:
4.4 Other variations of FDR:
4.5 Empirical Bayes View:

References
Part I
Hypothesis Testing:
1 Data and Data Preprocessing:
During this course, when we discuss "data" we will usually be referring to a collection of vectors $x_1, \ldots, x_n \in \mathbb{R}^p$, which we usually assume are realizations of an i.i.d. sequence $X_i \overset{i.i.d.}{\sim} F$, with F being some known or unknown distribution.
1.1 Visualization of Data:
One of the first things one wants to do when dealing with real data, before beginning to perform any analysis or employing any statistical methods, is to look at the data and try to get some initial idea about its nature. There are several ways to do this:
• Examination of scatter-plots of the various variables in relation to one another can help detect trends and relations as well as outliers. For example, consider these two images:
Figure 1.1: Passing a linear trend-line through a scatter-plot for the collection of (x, y) values (blue dots). We would like to model the relation between X and Y and possibly predict, given an X value, what corresponding Y value would match it. The linear regression line (in black) is one way to do this.
Figure 1.2: In this scatter-plot we see an outlier. This outlier is clearly visible in this representation of the data, while we would not be able to detect it using a histogram of the X or Y values separately.
• Examination of density-plots might be more informative when the number of observations and their location
makes a regular scatter-plot uninformative.
Figure 1.3: On the left side we see a scatter-plot of x, y values; it can be seen that the sheer amount of observations and the way they are scattered does not allow us to notice any trend, as all areas look equally dense (equally covered with black dots). On the other hand, the right side image shows us a density plot where different colors represent the point density in each area. In this particular example the density in the red area is roughly 1000 times higher than in the purple area - something that could not be seen in the left plot.
• Examination of box-plots gives a compact and convenient representation of the averages and standard deviations of the data in various subgroups. This is mostly useful when there is a large number of variables that need to be compared or examined together.
Figure 1.4: It can be seen that a box-plot can represent the distribution of the observations around the mean across several groups. This allows comparison of both means and standard deviations of the various groups and allows detection of differences between the groups.
1.2 Data Transformations:
In many cases there are non-linear relationships between variables in a dataset and these relationships are harder to
detect and understand. Using a transformation (for example logarithmic or polynomial) of one or several variables
it is often possible to get a clearer representation of the relationships in the data or compensate for di�erences in
order of magnitude in the data.
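As a small illustration of this idea, a log transformation can turn a multiplicative (power-law) relationship into a linear one. A minimal sketch in Python, on synthetic data invented only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a multiplicative relationship: y = x^3 * noise.
x = rng.uniform(1, 100, size=500)
y = x ** 3 * rng.lognormal(sigma=0.2, size=500)

# On the original scale the linear correlation understates the relationship;
# after a log-log transformation the relationship is linear.
print("corr(x, y)         =", np.corrcoef(x, y)[0, 1])
print("corr(log x, log y) =", np.corrcoef(np.log(x), np.log(y))[0, 1])
```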
1.3 Completion of Missing Values:
In many cases some of the values of some variables in the dataset are missing, and there are methods to complete these values using various estimation techniques. This is helpful in cases where one wants to analyze the data with a method which does not tolerate missing values but does not want to simply omit observations.
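As a simple illustration, a minimal sketch of mean imputation (the data matrix here is invented for the example; mean imputation is only the crudest of the estimation methods alluded to above, and model-based methods are usually preferable):

```python
import numpy as np

# A small data matrix with missing entries encoded as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Replace each NaN with the column (variable) mean of the observed values.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)
```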
2 Classical Hypothesis Testing:
2.1 Some Notation and Reminders:
Remark 1. From this point on we will abbreviate both a scalar random variable and random vector by writing
RV. The dimension will be left to be inferred from the context.
Definition 2. Given a RV X with distribution F (usually described by a cumulative distribution function), a distribution parameter θ is some (often unknown) value that characterizes the distribution (such as the expectation, variance, etc).
Definition 3. An independent sample or realization $x_1, \ldots, x_n \in \mathbb{R}^d$ from the distribution F is a realization from a sequence of random variables $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F$ (here $X_i$ can be either scalar or vector valued).
Definition 4. Given a distribution F, a statistic of the distribution is a RV $T(X_1, \ldots, X_n)$ which is a function of a sequence of RVs $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F$. The value of the statistic $T(x_1, \ldots, x_n)$ is calculated based on realizations.
Remark 5. In many cases the statistics we will discuss will be estimators for the parameters of some distribution.
For example the sample average is an estimator for the parameter which is the distribution expectation.
Definition 6. Given a family of distributions F(x; θ) (a statistical model) parameterized by θ, an estimator of θ is any function $\hat{\theta}(x_1, \ldots, x_n)$ that tries to approximate θ based on a sample from the distribution F.
Definition 7. The bias of an estimator $\hat{\theta}(X_1, \ldots, X_n)$ for the parameter θ is defined by $\mathrm{Bias}(\hat{\theta}) := E_\theta[\hat{\theta} - \theta]$. An estimator is said to be unbiased if its bias equals zero.
Definition 8. A loss function for an estimator $\hat{\theta}$ of the parameter θ is any function $L(\hat{\theta}, \theta)$ that quantifies the "nearness" of the estimator to the real parameter value. One example is the squared-error loss $L(\hat{\theta}, \theta) = \|\hat{\theta} - \theta\|_2^2$.
Definition 9. A risk function for an estimator $\hat{\theta}$ of the parameter θ is the expected value of some loss function, $R(\hat{\theta}, \theta) = E_\theta[L(\hat{\theta}, \theta)]$; one common example is the mean-squared-error risk $\mathrm{MSE}(\hat{\theta}) := E_\theta[\|\hat{\theta} - \theta\|_2^2]$.
2.1.1 Classical Problems in Parametric Statistics:
In classical statistics the assumption usually is that given observations $x_1, \ldots, x_n$ there is a family of distributions F(x; θ) parameterized by θ (both x and θ can be scalars or vectors) such that $x_1, \ldots, x_n$ is an i.i.d. sample from F. In this context there are several classical questions that arise:
1. Point Estimation: Given the observations $x_1, \ldots, x_n$ one tries to find an estimator $\hat{\theta}(x_1, \ldots, x_n)$ which approximates the unknown parameter θ. There are a few desirable qualities in such an estimator:
(a) Consistency: we would like the estimator to converge in probability to the real value of θ.
(b) Lack of Bias: we would ideally like the estimator to have little or no bias, in the sense that $E_\theta[\hat{\theta} - \theta] \approx 0$.
(c) Low Risk: given some risk function R we would like an estimator with low risk.
2. Confidence Interval Estimation: a confidence interval [a, b] for the parameter θ with confidence level α is any interval such that $P_\theta[\theta \in [a, b]] = 1 - \alpha$. We often want to estimate such intervals based on observations, and the values of the interval edges (usually taken to be symmetric) are then functions of the observations.
3. Hypothesis Testing: hypothesis testing deals with decision problems of the type θ = θ0 or θ ≤ θ0 (or θ ≥ θ0). There is a null hypothesis regarding the true value of the parameter, denoted by θ0, and the objective is to decide with a given level of certainty whether to accept or reject this null hypothesis.
Several key elements of all of these problems are the number of parameters that need to be estimated, the amount of data available and the number of hypotheses we wish to test.
Definition 10. Suppose $x_1, \ldots, x_n$ is a realization of $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F(x; \theta)$. Given a null hypothesis H0 regarding the value of θ and the complementary alternative hypothesis H1, there are several stages to a statistical test of H0:
1. Definition and calculation of a test statistic $T(X_1, \ldots, X_n)$ based on the observations.
2. Definition of a rejection area R(α) that contains the test statistic with probability α under H0.
3. Rejection of H0 iff $T(x_1, \ldots, x_n) \in R(\alpha)$ (equivalently, acceptance of H0 iff $T \in R^c$).
There are two kinds of errors that arise in this context:
1. Type 1 Error (false positive): α = P (T ∈ R |H0 = true)
2. Type 2 Error (false negative or miss): β = P (T ∈ Rc |H1 = true)
Definition 11. Given a statistical test with test statistic $T(x_1, \ldots, x_n)$, the Pvalue of the test is defined as the probability of observing a value at least as extreme as $T(x_1, \ldots, x_n)$ given the underlying distribution of the observations under H0:

$$P_{value} = P(T(X_1, \ldots, X_n) \geq T(x_1, \ldots, x_n) \,|\, T(X_1, \ldots, X_n) \text{ is distributed according to } H_0)$$

It is important to remember that, just like $T(X_1, \ldots, X_n)$, the Pvalue is a random variable which is a function of $X_1, \ldots, X_n$. Furthermore, if H0 is true then the distribution of the Pvalue is Uniform[0, 1].
Figure 2.1: Here we see the empirical density (left) and empirical cumulative distribution (right) of 500 Pvalues calculated by a simulation in which the data was sampled according to H0. It can be seen that the simulation result indeed shows that the approximated density of the Pvalues in such a case is roughly Uniform[0, 1].
Remark 12. A few remarks:
• It is equivalent to reject H0 when T (x1, ..., xn) ∈ R (α) and to reject H0 when Pval < α.
• The Pvalue is a good measure of our confidence in the "correctness" of H0 when testing a single hypothesis.
• In the parametric case there is often an analytic way to calculate the Pvalue based on the distribution of the
test statistic under H0 (for example this is possible in the classic t-test).
• It is a common misconception that the following relation exists:

$$P_{val} = P(H_0 = \text{True} \,|\, \text{Observed data}) = P(T(X_1, \ldots, X_n) \text{ is distributed according to } H_0 \,|\, T(X_1, \ldots, X_n) = T(x_1, \ldots, x_n))$$

This is however not true, and in order to calculate $P(H_0 = \text{True} \,|\, \text{Observed data})$ one would need to rely on Bayesian statistics and assume some prior distribution on H0.
• In order to calculate the type-2 error β one would need an assumption of the distribution of θ under H1.
Definition 13. Given a statistical test with test statistic T and rejection area R(α), the power of the test is

$$\mathrm{Power} = 1 - \beta = P(T \in R(\alpha) \,|\, H_0 = \text{false})$$

This value of course depends on the choice of α. It is generally desired that a test have both a low α and high power simultaneously, but in reality there is a trade-off between the two and that is not always possible.
Figure 2.2: Here we see the density of the test statistic under H0 (in blue) and under H1 (in red). Given some rejection threshold (denoted by the black line) we would reject H0 if the test statistic is to the right of the line. The confidence level α would then be the area to the right of the line under the density given H0; here that area is marked in red. The power of the test is the area to the right of the line under the density given H1, here denoted in green. It can be seen that moving the rejection threshold impacts both α and the power.
Example 14. One of the simplest examples of a parametric test is the independent two-sample test for equality of means. Assume that we are given observations $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{0, 1\}$ is a categorical variable, and we assume that the $x_i$ values are realizations from the distribution $X_i | y_i \sim N(\mu_{y_i}, \sigma^2)$. We would like to test the hypothesis $H_0: \mu_0 = \mu_1$, that is, that the means in both groups are equal. If we assume that σ is known then we can conduct a simple Z-test by calculating the z-score of the mean difference:

$$n_j = \sum_{i=1}^n 1_{\{y_i = j\}} \qquad \hat{\mu}_j = \frac{1}{n_j} \sum_{i=1}^n x_i 1_{\{y_i = j\}} \qquad Z = \frac{\hat{\mu}_1 - \hat{\mu}_0}{\sqrt{\left(\frac{1}{n_0} + \frac{1}{n_1}\right)\sigma^2}}$$
Under the assumption of H0 being true, $Z \sim N(0, 1)$, and thus we will reject H0 iff $Z > Z_{\alpha/2}$ or $Z < -Z_{\alpha/2}$ for $Z_{\alpha/2} = \Phi^{-1}(1 - \frac{\alpha}{2})$. This is a two-sided test in which we have equally divided the rejection area between the two tails of the distribution. We can also calculate the Pval of the test, given by $P_{val} = 2\min\{\Phi(Z), 1 - \Phi(Z)\}$. Similarly, if the variance of the distribution is unknown we would use a t-test with n − 2 degrees of freedom by calculating the following statistic:

$$t_{n-2} = \frac{\hat{\mu}_1 - \hat{\mu}_0}{\sqrt{\frac{\hat{\sigma}_0^2}{n_0} + \frac{\hat{\sigma}_1^2}{n_1}}} \qquad ; \qquad \hat{\sigma}_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^n (x_i - \hat{\mu}_j)^2 1_{\{y_i = j\}}$$
The quantity in the denominator is simply an estimator for σ². In this case the distribution of $t_{n-2}$ is no longer normal, but it is a known distribution named the T-distribution with n − 2 degrees of freedom. This distribution can be used to calculate the rejection area and the Pval in the same way as with the Z-test.
One problem with the t-test is that it assumes an underlying normal distribution of the observations, an assumption which cannot be omitted and thus limits this test to specific cases. Additionally, the t-test is not guaranteed to control the type-I error α, especially for skewed distributions.
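As a quick illustration, the two-sample Z-test above takes only a few lines. A minimal sketch on simulated data (the group sizes, means and the known σ are invented for the example; for the unknown-variance case, scipy's ttest_ind implements a t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma = 1.0                                   # assumed known for the Z-test
x0 = rng.normal(0.0, sigma, size=50)          # group y = 0
x1 = rng.normal(0.3, sigma, size=60)          # group y = 1

# Two-sided Z-test for equality of means with known sigma.
z = (x1.mean() - x0.mean()) / np.sqrt((1 / len(x0) + 1 / len(x1)) * sigma ** 2)
pval_z = 2 * min(stats.norm.cdf(z), 1 - stats.norm.cdf(z))
print("Z =", z, "Pval =", pval_z)

# If sigma is unknown, the t-test (here Welch's unequal-variance variant) applies.
t, pval_t = stats.ttest_ind(x1, x0, equal_var=False)
print("t =", t, "Pval =", pval_t)
```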
2.2 Non-Parametric Tests:
Definition 15. Given $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F$, a statistic $T(X_1, \ldots, X_n)$ will be called Distribution Free if the distribution of T does not depend on F. Furthermore, a hypothesis test for H0 will be called a distribution free test if the test statistic is distribution free, that is, its distribution does not depend on the distribution of the observations under H0.
Remark 16. While a distribution free test statistic must be independent of the distribution of the data under H0 it
is possible that the distribution of the statistic does depend on the alternative hypothesis.
Example 17. We will now give an example of a distribution free non-parametric test for the two independent sample decision problem. Given the observations $\{(x_i, y_i)\}_{i=1}^n$, where again $y_i \in \{0, 1\}$, we would like to determine whether $X_i | y_i = j$ has the same distribution for both j = 0 and j = 1. The Mann-Whitney-Wilcoxon Rank-Sum Test allows us to do this and works as follows:
1. The rank of each observation is calculated as $r_i = \sum_{j=1}^n 1_{\{x_j \leq x_i\}}$.
2. For j = 0, 1 the value $R_j = \sum_{i=1}^n r_i 1_{\{y_i = j\}}$ is calculated.
3. Given $n_1 = \sum_{i=1}^n 1_{\{y_i = 1\}}$, the test statistic $U = R_1 - \frac{n_1(n_1 + 1)}{2}$ is calculated.
4. The distribution of U is calculated and a rejection area is chosen accordingly given confidence α.
Although it may not appear so, the null distribution of U is in fact symmetric, since $R_0 + R_1 = \frac{n(n+1)}{2}$ is constant. The main advantage of this test is the fact that it is distribution free and thus can be used regardless of the underlying distribution of the observations. The main question that remains is how to calculate the distribution of U in order to obtain the rejection area. In practice, for relatively small samples (up to 20 observations) there are tables that directly tabulate said distribution, while for large samples there is a normal approximation to the distribution of U. An alternative, more modern method is to use numerical methods, as we will now discuss.
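A minimal sketch of the rank-sum statistic as defined above (no ties assumed, as in the notes; scipy's mannwhitneyu provides the same test with tie handling and Pvalues, and is used here only as a cross-check):

```python
import numpy as np
from scipy import stats

def rank_sum_u(x, y):
    """U statistic of the Mann-Whitney-Wilcoxon test for labels y in {0, 1}."""
    x, y = np.asarray(x), np.asarray(y)
    # r_i = number of observations <= x_i (no ties assumed, as in the notes).
    r = np.array([(x <= xi).sum() for xi in x])
    n1 = (y == 1).sum()
    R1 = r[y == 1].sum()
    return R1 - n1 * (n1 + 1) / 2

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 30), rng.normal(0.5, 1, 30)])
y = np.array([0] * 30 + [1] * 30)
print("U =", rank_sum_u(x, y))
# Reference implementation with a Pvalue:
print(stats.mannwhitneyu(x[y == 1], x[y == 0], alternative="two-sided"))
```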
2.2.1 Permutation Tests:
Given the data $\{(x_i, y_i)\}_{i=1}^n$ where the $y_i$ are again binary labels, we want to test the hypothesis H0: X, Y are independent (equivalently, if $x_1, \ldots, x_{n_1} \sim F$ and $x'_1, \ldots, x'_{n_2} \sim G$, the hypothesis would be H0: F = G). Assume that $T_{obs} := T((x_1, y_1), \ldots, (x_n, y_n))$ is some statistic (essentially any function of the data). A permutation test based on the statistic T would be carried out as follows:
1. Draw N random permutations $s_1, \ldots, s_N \in S_n$.
2. For each permutation $s_j$ calculate the test statistic on the permuted data $T_j := T(\{(x_i, y_{s_j(i)})\}_{i=1}^n)$.
3. Compute the empirical Pvalue $\hat{P} = \frac{1}{N} \sum_{j=1}^N 1_{\{T_j > T_{obs}\}}$.
4. Reject H0 if $\hat{P} < \alpha$ for given confidence α.
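A minimal sketch of this permutation test, using the absolute difference of group means as the statistic T (any other statistic of the data could be plugged in):

```python
import numpy as np

def permutation_test(x, y, stat, n_perm=2000, seed=0):
    """Empirical Pvalue of stat under random permutations of the labels y."""
    rng = np.random.default_rng(seed)
    t_obs = stat(x, y)
    t_perm = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    return (t_perm > t_obs).mean()

def mean_diff(x, y):
    return abs(x[y == 1].mean() - x[y == 0].mean())

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 40), rng.normal(0.7, 1, 40)])
y = np.array([0] * 40 + [1] * 40)
print("empirical Pvalue:", permutation_test(x, y, mean_diff))
```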
Claim 18. For any joint distribution $(X_i, Y_i) \overset{i.i.d.}{\sim} F$ and any test statistic $T(\{(x_i, y_i)\}_{i=1}^n)$, the distribution of the empirical Pval under H0 is approximately Uniform[0, 1].
Proof. Sketch of proof: Under H0 the RVs $T_1, \ldots, T_N, T_{obs}$ are i.i.d., and thus $\#\{T_j > T_{obs}\}$ is uniformly distributed on $\{0, \ldots, N\}$, which immediately gives us that $\hat{P} \sim \mathrm{Uniform}\{0, \frac{1}{N}, \frac{2}{N}, \ldots, 1 - \frac{1}{N}, 1\}$, and thus when $N \to \infty$ the distribution of $\hat{P}$ converges to Uniform[0, 1].
Remark 19. All of this is correct assuming that T is continuous and there are no ties. However, even if T were discrete and ties were possible, the distribution of $\hat{P}$ would still be approximately uniform.
Fact 20. The above procedure guarantees that P (reject |H0 = true) ≤ α.
The advantage of permutation tests over normal approximations is that the permutation test better captures the tails of the distribution. Even though for very large samples the normal approximation converges to the true distribution, in practice it still might not be satisfactory, especially when one wants to conduct a test with very small α values.
Advantages of permutation tests: accuracy, no underlying assumptions about the data, flexibility (can be used with complicated null models).
Disadvantages of permutation tests: computationally intensive, and they do not provide any insight regarding the analytic characteristics of the distribution (for example, the relation between the power of the test and the sample size is not easily understood).
2.2.1.1 Using permutations to calculate test power:
Suppose we have an assumption regarding the underlying distribution of the data under both H0 and H1 and we have a test statistic T meant for testing H0. We would often like to know the power of the test and how the power changes with the sample size. In the parametric approach we can often analyze the power by direct analytic computation or via an asymptotic approximation. In the non-parametric computational approach we can compute the power using permutations, via the following algorithm:
• Input: confidence level α, an assumed distribution F of the data under H1, sample size n.
• Parameters: N - number of permutations, K - number of simulations.
• for k = 1, ..., K do:
  - Sample by simulation $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} F$.
  - Perform a permutation test with N permutations with confidence α based on the sampled data.
  - Denote by $R_k \in \{0, 1\}$ the result of the test ($R_k = 1$ if the k'th simulation rejected H0).
• The estimator for the power is then defined by $\widehat{1 - \beta} = \frac{1}{K} \sum_{k=1}^K R_k$.
• Computational cost: O(N · K), which could be quite costly.
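A minimal sketch of this power simulation, reusing the permutation_test and mean_diff sketches from above (the alternative F here, a simple mean shift between the groups, is invented for the example):

```python
import numpy as np

def estimate_power(alpha=0.05, n=40, shift=0.7, K=200, n_perm=500, seed=4):
    """Fraction of simulated H1 datasets on which the permutation test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(K):
        # Sample data under the assumed alternative: group 1 is shifted.
        x = np.concatenate([rng.normal(0, 1, n), rng.normal(shift, 1, n)])
        y = np.array([0] * n + [1] * n)
        p = permutation_test(x, y, mean_diff, n_perm=n_perm,
                             seed=int(rng.integers(1 << 31)))
        rejections += (p < alpha)
    return rejections / K

print("estimated power:", estimate_power())
```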
3 Tests for Independence:
Suppose we are given two RVs X,Y and we would like to test whether they are independent. Ideally we would like
a test that would be able to detect any kind of probabilistic dependence between the two variables (not necessarily
linear or monotonic for example).
Definition 21. Given $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F(x; \theta)$, an estimator $\hat{\theta}(X_1, \ldots, X_n)$ for θ is said to be consistent if $\hat{\theta}_n \overset{P}{\longrightarrow} \theta$.
Definition 22. A sequence of tests $T_n = (g_n, R_n)$ with sample size n, test statistic $g_n(X_1, \ldots, X_n)$ and rejection area $R_n$ for the rejection of H0 is said to be consistent if the two following properties hold:
1. $P(\text{reject} \,|\, H_0 = \text{true}) = P(g_n \in R_n \,|\, H_0 = \text{true}) \overset{n \to \infty}{\longrightarrow} 0$.
2. $P(\text{reject} \,|\, H_1 = \text{true}) = P(g_n \in R_n \,|\, H_1 = \text{true}) \overset{n \to \infty}{\longrightarrow} 1$ for any $H_1 \neq H_0$.
The general method for constructing a consistent test for independence is as follows:
1. Find a distance measure $D(X, Y) \geq 0$ such that $D(X, Y) = 0$ iff X, Y are independent.
2. Find a consistent sequence of estimators $\hat{D}_n$ for D.
3. Find a sequence of thresholds $\varepsilon_n$ and define rejection areas $R_n = \{\hat{D}_n > \varepsilon_n\}$.
Remark 23. The final step is often tricky, as it depends on the rate of convergence of $\hat{D}_n$.
Example 24. Testing for independence using a kernel method:
Suppose we again have the data $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} F$ where $y_i$ is binary, and we would like to test H0: X, Y are independent. We define a kernel $K_h(x_i, x_j)$ and compare the kernel values for pairs of the same and different groups by calculating the following statistic:

$$T = \sum_{i,j=1}^n K_h(x_i, x_j) \left[ \frac{1_{\{y_i = y_j = 0\}}}{n_0^2} + \frac{1_{\{y_i = y_j = 1\}}}{n_1^2} - \frac{2 \cdot 1_{\{y_i \neq y_j\}}}{2 n_0 n_1} \right]$$
The kernel function could for example be a Gaussian kernel $K_h(x_i, x_j) = e^{-\frac{1}{h^2}\|x_i - x_j\|^2}$. Here h is a width parameter which determines how fast the kernel goes to zero and in turn impacts the rejection area (it is possible to pick an optimal h value using a more sophisticated technique). When the X values do not depend on Y we would obviously have that E[T] = 0, and thus it suffices to test the hypothesis that E[T] = 0 by comparing the value of |T| to an appropriate critical value. This can be done either by a normal approximation of the distribution of T or in a direct method using permutations.
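A minimal sketch of this kernel statistic for univariate x, with significance assessed by the permutation_test sketch from section 2.2.1 (the bandwidth h = 1 is an arbitrary choice):

```python
import numpy as np

def kernel_stat(x, y, h=1.0):
    """|T| from Example 24 with a Gaussian kernel (univariate x for simplicity)."""
    x = np.asarray(x, dtype=float)
    K = np.exp(-np.subtract.outer(x, x) ** 2 / h ** 2)
    m0 = (y == 0).astype(float)
    m1 = (y == 1).astype(float)
    n0, n1 = m0.sum(), m1.sum()
    # Weights: +1/n_j^2 for same-group pairs, -1/(n0*n1) for cross-group pairs.
    w = (np.outer(m0, m0) / n0**2 + np.outer(m1, m1) / n1**2
         - (np.outer(m0, m1) + np.outer(m1, m0)) / (n0 * n1))
    return abs((K * w).sum())

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 40), rng.normal(0, 2, 40)])  # same mean, different spread
y = np.array([0] * 40 + [1] * 40)
print("empirical Pvalue:", permutation_test(x, y, kernel_stat))
```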
3.1 Classical tests for independence of scalar RVs:
There are several measures for dependence of scalar random variables, the best known of which is the Pearson correlation-coefficient, which is given by:

$$R(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \cdot \sum_{i=1}^n (y_i - \bar{y})^2}}$$
This is an estimator of the correlation coefficient given by

$$\rho(X, Y) = \frac{E[(X - E(X))(Y - E(Y))]}{\sqrt{E[(X - E(X))^2] \cdot E[(Y - E(Y))^2]}} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}$$
The correlation coefficient represents the strength of the linear dependence between X and Y but is not informative regarding other types of dependence. If one assumes a normal distribution of X and Y then there is a closed form for the distribution of R, and more generally the distribution can always be approximated using permutations.
Fact 25. When ρ(X, Y) = 0 we say that X, Y are uncorrelated. It is known that independent RVs are always uncorrelated, but the converse is not true.
Figure 3.1: Here we see the values of the Pearson correlation-coefficient for various types of dependence and various levels of noise. It can be seen in the first and second lines that the value of the coefficient depends on the amount of noise but does not depend on the slope, as long as the slope is not zero. The third line shows that for non-linear dependence the coefficient can very well be zero despite the fact that the variables have a functional relationship and are not independent.
Another measure of dependence is the Kendall-Tau rank correlation coefficient, defined by:

$$\tau(x, y) = \frac{\sum_{i < j} \mathrm{sgn}((x_i - x_j)(y_i - y_j))}{\frac{1}{2} n(n - 1)}$$

Using permutations it is again possible to perform a hypothesis test for the significance of this coefficient. The main disadvantages of τ are that it is only defined for scalar RVs and that it is only capable of detecting monotonic dependencies. On the other hand, it has the advantage of being a distribution free statistic.
3.2 Testing for independence using permutations:
Fact 26. Given two RVs $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$ with distributions $P_X, P_Y$ and joint distribution $P_{XY}$, we know that X, Y are independent iff $P_X(\vec{x}) P_Y(\vec{y}) = P_{XY}(\vec{x}, \vec{y})$ for all $(\vec{x}, \vec{y}) \in \mathbb{R}^p \times \mathbb{R}^q$. If X, Y are continuous with densities $f_X, f_Y$ and joint density $f_{X,Y}$, then an equivalent condition is that $f_X(\vec{x}) \cdot f_Y(\vec{y}) = f_{X,Y}(\vec{x}, \vec{y})$ for all $(\vec{x}, \vec{y}) \in \mathbb{R}^p \times \mathbb{R}^q$.
Assume we are given two RVs $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$ with distributions $P_X, P_Y$ and joint distribution $P_{XY}$; we would like a statistical test which will determine whether the probability distributions $P_X P_Y$ and $P_{XY}$ are identical. When the distributions are both univariate the Kolmogorov-Smirnov two-sample test provides an analytic solution to this problem, but it does not generalize to the multivariate case. We would thus want to approach the problem by defining a statistic whose distribution is different when X, Y are dependent or independent, and then use a permutation test in order to test for significance of this statistic.
Remark 27. The problem of testing for independence is similar (but not identical) to testing for equality of distribution in a two sample setting. When testing for equality of distribution we are given a set of data sampled from distribution P and a different set of data sampled from a distribution Q, and we would like to determine whether P ≡ Q based on these two independent samples. In comparison, when testing for independence we would like to determine whether $P_{XY} \equiv P_X P_Y$, but in this case we evaluate these distributions based on a single sample.
There are several types of methods for testing for independence using permutations:
• Kernel methods: these methods rely on the definition of a kernel K(x, y) which measures similarity between x and y. The values of the kernel are then computed and compared for pairs from the same group and from the two different groups.
• Geometric methods: these methods rely on defining a distance-between-distributions measure D(X, Y).
• Information based methods: it can be shown that two RVs X, Y are independent iff their mutual information I(X, Y) equals zero. This provides a way of testing for independence by computing or estimating the mutual information.
3.2.1 Testing for independence using a Kernel function:
In Example 24 we described a two-sample method for testing independence given a sample $\{(x_i, y_i)\}_{i=1}^n$ where $y_i$ was a binary value, based on the statistic

$$T = \sum_{i,j=1}^n K_h(x_i, x_j) \left[ \frac{1_{\{y_i = y_j = 0\}}}{n_0^2} + \frac{1_{\{y_i = y_j = 1\}}}{n_1^2} - \frac{2 \cdot 1_{\{y_i \neq y_j\}}}{2 n_0 n_1} \right] \tag{3.1}$$
It can be seen that for $y_i = y_j$ we get a value with a positive sign and for $y_i \neq y_j$ we get a value with a negative sign; all these values are summed up, and a permutation test is used to check whether the computed value is significantly different from zero. In the more general case where we are given a single sample $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$ we need to adapt this method to suit our needs (see [5]). One way to do this is to treat $(x_i, y_i)$ as sampled from $P_{XY}$ and $(x_i, y_j)$, $i \neq j$, as samples from $P_X P_Y$. In order to do this we define $A := \{(x_i, y_i)\}_{i=1}^n$ and $B :=$
$\{(x_i, y_j) \,|\, 1 \leq i \neq j \leq n\}$, and notice that under H0, A is an i.i.d. sample from $P_{XY}$ and B is an i.i.d. sample out of $P_X P_Y$. We thus want to determine whether these two samples are identically distributed, since this is equivalent to independence of X, Y. In order to do this we will compute the following values:

$$\underbrace{+K_h((x_i, y_i), (x_j, y_j))}_{\text{two points from the same group } (A)} \;\forall\, i \neq j \qquad \underbrace{-K_h((x_i, y_i), (x_l, y_k))}_{\text{one point from } A \text{ and one from } B} \;\forall\, i \text{ and } l \neq k$$

$$\underbrace{-K_h((x_l, y_k), (x_i, y_i))}_{\text{one point from } A \text{ and one from } B} \;\forall\, l \neq k \qquad \underbrace{+K_h((x_i, y_j), (x_l, y_k))}_{\text{two points from the same group } (B)} \;\forall\, i \neq j,\; l \neq k$$

where we note that the pairs $(x_i, y_j) \in \mathbb{R}^p \times \mathbb{R}^q$ are the "points" in our sample. We then calculate the test statistic given in formula 3.1 and perform a standard permutation test in order to test for significance.
Remark 28. One small problem with this method is that there are usually many more points in group B than in A.
3.2.2 Testing for independence using the Distance Correlation (dCor) method:
Definition 29. The characteristic function of a RV X is defined by $\varphi_X(t) = E[e^{i t^\top X}]$.
Remark 30. If X has density fX then ϕX is the Fourier transform of fX .
Definition 31. Given two vectors $x, y \in \mathbb{R}^p$ we define a relation x ≤ y iff $x_i \leq y_i$ for all i.
Definition 32. Given a realization $x_1, \ldots, x_n \in \mathbb{R}^p$ from the random sequence $X_1, \ldots, X_n \overset{i.i.d.}{\sim} F$, we define the empirical cumulative distribution of F as follows:

$$\hat{F}^n_X(\vec{t}) = \frac{1}{n} \sum_{i=1}^n 1_{\{x_i \leq \vec{t}\}} \tag{3.2}$$
Definition 33. Given any sequence of values $a_{ij}$ indexed by $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, m\}$ we denote:

$$a_{i\cdot} := \frac{1}{m} \sum_{j=1}^m a_{ij}, \qquad a_{\cdot j} := \frac{1}{n} \sum_{i=1}^n a_{ij}, \qquad a_{\cdot\cdot} := \frac{1}{n \cdot m} \sum_{i=1}^n \sum_{j=1}^m a_{ij}$$

If the values $a_{ij}$ were to be laid out in an n × m table, these would be the row/column/total averages respectively.
We mentioned previously that the Pearson correlation-coefficient measures dependence, but only of a linear nature. More general dependence can be measured by looking at the correlations between the distances of the observations from one another, rather than looking at the original observations. The general intuition is that if X and Y are dependent then small distances in X values should correspond to small distances in Y values. The dCov method relies on this intuition in order to test for dependence and is performed as follows:
1. Sample $\{(x_i, y_i)\}_{i=1}^n$ where $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}^q$, and denote $x := (x_1, \ldots, x_n)$, $y := (y_1, \ldots, y_n)$.
2. Compute all the pairwise distances aij = ‖xi − xj‖2 and bij = ‖yi − yj‖2.
3. Normalize the distances by subtracting the row, column and total averages, and define:

$$A_{ij} = a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}, \qquad B_{ij} = b_{ij} - b_{i\cdot} - b_{\cdot j} + b_{\cdot\cdot}$$
4. Define the dCov statistic and dCor statistic as follows:

$$\mathrm{dCov}(x, y) := V^2(x, y) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij}$$

$$\mathrm{dCor}(x, y) := R^2(x, y) = \begin{cases} \frac{V^2(x, y)}{\sqrt{V^2(x)}\sqrt{V^2(y)}} & V^2(x) V^2(y) > 0 \\ 0 & V^2(x) V^2(y) = 0 \end{cases}$$

where $V^2(x) = V^2(x, x)$ and $V^2(y) = V^2(y, y)$.
5. Use permutations to compute the distribution of $R^2$ and perform a hypothesis test for independence.
Remark 34. Note that the computational complexity here is $O(n^2)$ for the computation of all distance pairs.
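A minimal sketch of steps 1-4 above (the sample dCov/dCor statistics; step 5 would wrap dcor_stat in a permutation test exactly as in the earlier sketches):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dcov_stat(x, y):
    """Sample distance covariance V^2(x, y); x, y are (n, p) and (n, q) arrays."""
    a = cdist(x, x)                       # pairwise Euclidean distances a_ij
    b = cdist(y, y)
    # Double centering: subtract row/column means, add back the grand mean.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()                 # (1/n^2) * sum_ij A_ij B_ij

def dcor_stat(x, y):
    v_xy = dcov_stat(x, y)
    v_x, v_y = dcov_stat(x, x), dcov_stat(y, y)
    return v_xy / np.sqrt(v_x * v_y) if v_x * v_y > 0 else 0.0

rng = np.random.default_rng(6)
x = rng.normal(size=(200, 1))
y = x ** 2 + 0.1 * rng.normal(size=(200, 1))   # non-linear, non-monotone dependence
print("dCor =", dcor_stat(x, y))
```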
Why this method works: It can be shown that $V^2(x, y)$ is an estimator for the parameter $\mathcal{V}^2(X, Y)$ defined by:

$$\mathcal{V}^2(X, Y) := \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|\varphi_{X,Y}(x, y) - \varphi_X(x) \varphi_Y(y)|^2}{\|x\|_2^{1+p} \|y\|_2^{1+q}} \, dx \, dy$$

where $c_d := \frac{\pi^{\frac{1}{2}(1+d)}}{\Gamma(\frac{1}{2}(1+d))}$ is a normalizing constant. It then follows that dCor(x, y) is an estimator for

$$\mathrm{dCor}(X, Y) = \mathcal{R}^2(X, Y) = \begin{cases} \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X)}\sqrt{\mathcal{V}^2(Y)}} & \mathcal{V}^2(X) \mathcal{V}^2(Y) > 0 \\ 0 & \mathcal{V}^2(X) \mathcal{V}^2(Y) = 0 \end{cases}$$

It can be shown that $\mathcal{V}^2(X, Y) \geq 0$ and that X, Y are independent iff $\mathcal{V}^2(X, Y) = 0$.
Another important property: It can be shown that the sample version $V^2$ is equal to the population version $\mathcal{V}^2$ if one uses the empirical characteristic functions given by:

$$\varphi^n_X(x) = \frac{1}{n} \sum_{k=1}^n e^{i\langle x, x_k \rangle}, \qquad \varphi^n_Y(y) = \frac{1}{n} \sum_{k=1}^n e^{i\langle y, y_k \rangle}, \qquad \varphi^n_{X,Y}(x, y) = \frac{1}{n} \sum_{k=1}^n e^{i(\langle x, x_k \rangle + \langle y, y_k \rangle)}$$

That is, if we sample $x := (x_1, \ldots, x_n)$ and $y := (y_1, \ldots, y_n)$, calculate these empirical characteristic functions and define $\tilde{X}_n, \tilde{Y}_n$ whose distribution is derived from these characteristic functions, then we will get that $\mathcal{V}^2(\tilde{X}_n, \tilde{Y}_n)$ equals $V^2(x, y)$. Based on this relationship one can prove some interesting properties of the statistic $R^2(x, y)$, such as:
1. $0 \leq R^2(x, y) \leq 1$, and $V^2(x) = 0$ iff all the sample observations $x_1, \ldots, x_n$ are identical.
2. The statistics $V^2(x, y)$ and $R^2(x, y)$ converge almost surely to $\mathcal{V}^2(X, Y)$ and $\mathcal{R}^2(X, Y)$ as $n \to \infty$.
3. There exists a sequence of thresholds $\varepsilon_n$ such that the sequence of tests that rejects H0: X, Y are independent when $R^2((x_1, \ldots, x_n), (y_1, \ldots, y_n)) > \varepsilon_n$ is a consistent sequence of tests.
References: [11, 12]
3.2.3 Another distance based method for continuous univariate RVs:
Fact 35. If X, Y are two dependent univariate RVs then there exists $(x, y) \in \mathbb{R}^2$ such that $F_{X,Y}(x, y) \neq F_X(x) F_Y(y)$. Furthermore, if X, Y are continuous then there exists (x, y) such that $f_{X,Y}(x, y) \neq f_X(x) f_Y(y)$, and from continuity there is a ball $B := B((x, y), \varepsilon)$ such that $f_{X,Y}(u, v) \neq f_X(u) f_Y(v)$ for all $(u, v) \in B$.
Motivation: based on the aforementioned fact, it can be shown that if X, Y are continuous and dependent univariate RVs then there exists $p := (x_p, y_p) \in \mathbb{R}^2$ such that if the plane is divided into four quadrants around p, denoted $Q^p_{1,1}, Q^p_{1,2}, Q^p_{2,1}, Q^p_{2,2}$, then given $O^p_{j,k} = \iint_{Q^p_{j,k}} f_{XY}(u, v) \, du \, dv$ it holds that $O^p_{1,1} \cdot O^p_{2,2} \neq O^p_{1,2} \cdot O^p_{2,1}$. This is also equivalent to the indicator RVs $1_{\{X > x_p\}}$ and $1_{\{Y > y_p\}}$ being dependent.
Conclusion: Dependence of two univariate RVs can be identified by inspecting divisions of the plane into quadrants. The main question is how to choose the center of the division which will reveal such dependence. The idea suggested by Hoeffding [6] is to simply let the data itself define the centers.
Figure 3.2: A division of the plane into four quadrants where one of the observations is selected as the center of the division. If we count the number of observations in each quadrant then, as the sample size grows, this approximates the chance of a random sample belonging to that quadrant.
The procedure is carried out as follows, given a sample $\{(x_i, y_i)\}_{i=1}^n$:
• Perform n divisions of the plane into quadrants, where each time $p_i = (x_i, y_i)$ is taken to be the center point. For each division compute a 2 × 2 table of the values $o^{p_i}_{j,k}$, $j, k \in \{1, 2\}$; these values are, up to normalization, estimators for the values $O^{p_i}_{j,k}$ previously defined.
• Compute the following test statistic:

$$H_n = \frac{1}{n^4} \sum_{i=1}^n \left( o^{p_i}_{1,1} o^{p_i}_{2,2} - o^{p_i}_{1,2} o^{p_i}_{2,1} \right)^2$$

• Perform a permutation test to check for significance of the test statistic.
Remark 36. This test statistic is asymptotically equivalent to the following functional of the empirical CDF:

$$\tilde{H}_n := \iint_{\mathbb{R}^2} \left( \hat{F}^n_{X,Y}(x, y) - \hat{F}^n_X(x) \hat{F}^n_Y(y) \right)^2 d\hat{F}^n_{X,Y}(x, y) \tag{3.3}$$

where the empirical CDF functions are defined as in Definition 32. This fact can be used to prove consistency of the sequence of tests defined by $H_n$ (with appropriate rejection areas). Since X, Y are independent iff $F_{XY} \equiv F_X F_Y$, it can be seen why $\tilde{H}_n$, and thus $H_n$, may be good measures of independence, and we would expect $H_n$ to converge to zero in case of independence. Note that the integration in the definition is taken according to the empirical measure $d\hat{F}_{X,Y}$, and since this is in fact a discrete distribution this actually means we sum up the values of the function under the integral sign at the atoms of the distribution.
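A minimal sketch of the $H_n$ statistic (quadrant counts around each observation; significance would again come from a permutation test):

```python
import numpy as np

def hoeffding_stat(x, y):
    """H_n: sum over data-centered quadrant divisions of (o11*o22 - o12*o21)^2 / n^4."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    total = 0.0
    for xc, yc in zip(x, y):
        right, top = x > xc, y > yc
        o11 = ( right &  top).sum()   # counts in the four quadrants around (xc, yc)
        o12 = (~right &  top).sum()
        o21 = ( right & ~top).sum()
        o22 = (~right & ~top).sum()
        total += (o11 * o22 - o12 * o21) ** 2
    return total / n ** 4

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = np.sin(3 * x) + 0.2 * rng.normal(size=100)   # non-monotone dependence
print("H_n =", hoeffding_stat(x, y))
```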
3.2.4 An extension of Hoeffding's method:
Kaufman et al. [7] suggested a variation on Hoeffding's method, where the main difference is that instead of using divisions of the plane into quadrants based on a single center, one should look at k different points $p_1, \ldots, p_k$ and divide the plane into $(k + 1)^2$ areas (some of which are infinite) by plotting the vertical and horizontal lines that pass through the selected points. The test statistic is then defined as follows:
$$S_{n,k} = \sum_{1 \leq i_1 \leq \ldots \leq i_{k-1} \leq n} \;\sum_{j=1}^{(k+1)^2} \frac{\left( o^{i_1, \ldots, i_{k-1}}_j - e^{i_1, \ldots, i_{k-1}}_j \right)^2}{e^{i_1, \ldots, i_{k-1}}_j}$$

The values $o^{i_1, \ldots, i_{k-1}}_j$ are the observed values, which are simply the numbers of observations found in the areas of the division defined by the points $p_{i_1}, \ldots, p_{i_{k-1}}$, and the values $e^{i_1, \ldots, i_{k-1}}_j$ are the expected values under H0 (independence), which are obtained by multiplying the marginal CDF values $F_X, F_Y$ for the area of the division defined by the points $p_{i_1}, \ldots, p_{i_{k-1}}$.
Remark 37. The value $\sum_{j=1}^{(k+1)^2} \frac{(o^{i_1, \ldots, i_{k-1}}_j - e^{i_1, \ldots, i_{k-1}}_j)^2}{e^{i_1, \ldots, i_{k-1}}_j}$ is actually the $\chi^2$ value computed for a $(k+1) \times (k+1)$ contingency table with the aforementioned observed and expected values.
The computational complexity of this method is $O(n^{\max(2, \min(k-1, 4))})$.
• Advantage: Both this method and Hoeffding's method can be shown to be distribution free.
• Disadvantage: Both methods are only applicable to two univariate RVs.
It remains an open question to find a distribution free test for multivariate RVs.
Remark 38. The dCor test statistic is not distribution free.
3.2.5 An information theoretic test for independence:
Definition 39. Given two continuous RVs $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$ with densities $f_X, f_Y$, the mutual information of X and Y is:

$$I(X, Y) := \int_{\mathbb{R}^{p+q}} f_{X,Y}(\vec{u}, \vec{v}) \log\left( \frac{f_{X,Y}(\vec{u}, \vec{v})}{f_X(\vec{u}) f_Y(\vec{v})} \right) d\vec{u} \, d\vec{v}$$
Remark 40. The mutual information can similarly be defined for discrete or mixed RVs.
Fact 41. Given two RVs $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $I(X, Y) \geq 0$, and $I(X, Y) = 0$ iff the two are independent.
Given this fact, the problem of determining dependence is reduced to estimation of the mutual information. The problem is that this estimation is usually difficult, and tests that are based on this method usually have low power. One such test is the MIC test [8].
Remark 42. There is some evidence that better estimators for the mutual information can yield efficient tests, see [7].
3.3 Comparison of different tests for independence:
For any type of hypothesis test there is always the question of whether there is a test (a test statistic) which is better than all others in some sense. Ideally we would want a "universal" test which is best under any (or most) alternatives, for example in a minimax sense.
Definition 43. Suppose we are given a parameterized family of distributions f(x; θ) depending on the unknown scalar parameter θ ∈ Θ, and a partition of the parameter space $\Theta = \Theta_0 \,\dot{\cup}\, \Theta_1$; we denote by $H_i$ the hypothesis $\theta \in \Theta_i$. Given a test statistic $T(\vec{X}_n) := T(X_1, \ldots, X_n)$ and a corresponding binary test function

$$\varphi(\vec{X}_n) = \begin{cases} 1 & T(\vec{X}_n) \in R_n(\alpha) \\ 0 & \text{otherwise} \end{cases}$$

for testing a one-sided hypothesis with confidence α, we say that T is uniformly most powerful (UMP) if for any other test statistic $T'(\vec{X}_n) := T'(X_1, \ldots, X_n)$ and corresponding binary test function $\varphi'(\vec{X}_n)$ for which

$$\sup_{\theta \in \Theta_0} E_\theta(\varphi'(\vec{X}_n)) = \alpha' \leq \alpha = \sup_{\theta \in \Theta_0} E_\theta(\varphi(\vec{X}_n))$$

it holds that:

$$E_\theta(\varphi'(\vec{X}_n)) = 1 - \beta' \leq 1 - \beta = E_\theta(\varphi(\vec{X}_n)) \qquad \forall\, \theta \in \Theta_1$$
Fact 44. It can be shown that there are no UMP tests for two-sided hypothesis testing and no UMP tests for vector valued parameterized families. Furthermore, if the null is of the form H0: θ = θ0 and the alternative is one sided, H1: θ ≥ θ0, then the Neyman-Pearson lemma guarantees that the likelihood ratio test is a UMP test.
Remark 45. In particular, the likelihood ratio test for testing for independence (which is not UMP since this is not a simple one sided parameter hypothesis) rejects H0 if

$$\sum_{i=1}^n \log\left( \frac{f_{X,Y}(x_i, y_i)}{f_X(x_i) f_Y(y_i)} \right) > C$$
In the context of testing for independence it is almost certain that no UMP test exists; one way of reaching this conclusion is the fact that for different types of alternatives (different types of dependence) there are better suited tests. Thus, instead of looking for a test which is optimal under any alternative, we would like to be able to compare between different tests under a specific set of alternative hypotheses in which we are interested, for example specific alternatives we believe are likely to occur in real data. Given such a set of alternative hypotheses we can perform a comparison of test power using simulations as follows:
1. Generate data from various kinds of dependency models using simulations.
2. Perform the various hypothesis tests one wishes to compare on the simulated data with given confidence α.
3. Estimate the power of each test using permutations.
4. Evaluate the power of each test as a function of the dependency model, the sample size and α.
Figure 3.3: A list of tests for Independence and their performance under various dependence models.
3.3.1 Equitable tests for independence:
When testing for independence of two RVs X, Y there are two possible questions one can ask:
1. Is there dependence between the RVs?
2. How strong is the dependence?
In this context the following definition arises:
Definition 46. A dependency measure D(X, Y) will be called equitable if, given RVs $X_1, X_2, Y$, a noise factor $Z \sim N(0, \sigma^2)$ and two smooth functions $g_1, g_2$ such that $X_i = g_i(Y) + Z$, it holds that $D(X_1, Y) = D(X_2, Y)$.
Remark 47. This basically means that the measure D(X, Y) evaluates the strength of the dependence equally, regardless of the type of functional dependence.
Figure 3.4: dCor values computed for different types of dependence and noise levels. This illustration shows that while the dCor measure is capable of detecting all types of dependence (the value is positive in all cases where dependence exists), it is not an equitable measure. For example, in the last row the noise level for the circle and semi-circle is almost identical but the dCor values are quite different (0.2 vs 0.5).
It has also been shown [8] that the mutual information based MIC statistic is not equitable. In fact, it remains an open question to find a dependency measure which is equitable, even under restrictions on the functional dependence.
4 Multiple Hypothesis Testing (MHT):
In many cases when we are faced with performing some statistical test we will be required to conduct more than a
single hypothesis test. At the same time, we will want to control not only the probability for false rejection in each
single hypothesis test but the probability for false rejection across all tests performed.
Remark 48. There is no assumption on any relation between the hypothesis tests, each hypothesis can have its own
set of data, test statistic, rejection area and Pval.
Example 49. Suppose we want to test the connection between m = 20,000 genes and a disease with confidence α = 0.05 in every hypothesis. Without any further restrictions, if we conduct each test separately with α = 0.05, we will get that even if none of the genes are connected to the disease, the expected number of rejections is mα = 1000. That is, we will have about 1000 false rejections. This problem is known as the multiple-hypothesis problem.
Definition 50. Given a MHT with m hypotheses we will denote the following:

                     True Null   True Alternative   Total
  Not Significant        U              T            m − R
  Significant            V              S              R
  Total                 m0           m − m0            m

For example, U is the number of hypotheses where H0 should have been accepted and that was indeed the result, and V is the number of hypotheses where H0 should have been accepted but was rejected (false rejections).
Remark 51. For simplicity we will be assuming that the test statistics we are discussing are all continuous, and in particular the distribution of the Pvals under H0 is Uniform[0, 1]. Most of the results we will discuss remain approximately true even if the test statistics were discrete.
Remark 52. Recall that when conducting a hypothesis test based on a statistic T (X1, ..., Xn) the Pval of the test
is itself a statistic (a RV) which is a function of X1, ..., Xn. We will now denote this RV as P := P (X1, ..., Xn) and
denote its realized value by p := P (x1, ..., xn).
4.1 Family Wise Error Rate (FWER):
When testing a single hypothesis we wanted to bound the type-1 error such that P(reject | H0 = true) ≤ α. Following this logic, the Family-Wise Error Rate (FWER) of a multiple hypothesis test is defined as:

$$\mathrm{FWER} = P(\text{there is at least one false rejection}) = P(V > 0)$$

where V is, as previously denoted, the number of false rejections. Similarly to a single hypothesis test where we wanted to upper bound the type-1 error, we would want to upper bound the FWER and ensure our testing method guarantees FWER ≤ α. As we have already seen, in order to do this it is not sufficient to simply set the confidence level of each test to be α. There are however corrections that do allow controlling the FWER simply by selecting appropriate rejection thresholds (or confidence levels) for all singular tests.
4.1.1 The Bonferroni Correction:
A naive solution for FWER control when conducting m hypothesis tests with Pvalues $P_i$ is to simply set the confidence level of each test to be $\frac{\alpha}{m}$. Using a union bound it can be seen that this achieves the required result:

$$\mathrm{FWER} \leq P\left( \bigcup_{i=1}^m \left\{ P_i < \frac{\alpha}{m} \right\} \right) \leq \sum_{i=1}^m P\left( \left\{ P_i < \frac{\alpha}{m} \right\} \right) = m \cdot \frac{\alpha}{m} = \alpha$$

where $P(\{P_i < \frac{\alpha}{m}\}) = \frac{\alpha}{m}$ since under H0 the assumption is that $P_i \sim \mathrm{Uniform}[0, 1]$. The main problem with this approach is that it is extremely conservative, and it would be extremely difficult to reject any hypothesis even in cases where the alternative is true and even with powerful tests. That is, there is a significant loss of overall power.
As is evident, control of the FWER is simply too restrictive to allow for an efficient testing method; we thus turn to an alternative approach for controlling the error when conducting multiple hypothesis tests:
4.2 False Discovery Rate (FDR):
The False Discovery Rate is a less conservative measure of the quality of a MHT. The idea is that we will allow false rejections, but we will want to ensure the proportion of false rejections out of all rejections is low.
Definition 53. Given a MHT with m hypotheses, we denote $R^+ = \max\{R, 1\}$ and $Q = \frac{V}{R^+}$. Q is thus the proportion of false rejections out of the total number of rejections. We then define the FDR of the MHT to be FDR = E[Q], that is, the expected false rejection proportion.
Remark 54. The expectation here is taken with regard to the joint distribution of the Pvals of all the hypothesis tests conducted. This is some distribution defined on $[0, 1]^m$ such that the marginal distribution of every Pval is Uniform[0, 1] under H0.
We would obviously want our testing procedure (note that in this context the procedure is something that we would in theory want to repeat for different data sets but with the same underlying hypotheses) to ensure a low FDR value. The following procedure, suggested by Benjamini and Hochberg [1], determines, given a confidence value α, which hypotheses should be rejected while ensuring FDR ≤ α:
The Benjamini-Hochberg (BH) Procedure:
1. Conduct all the singular hypothesis tests and compute the realized Pvals $p_1, \ldots, p_m$.
2. Order the Pvals such that $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$.
3. Compute $i^* = \max\{i \in \{1, \ldots, m\} \,|\, p_{(i)} \leq \alpha \frac{i}{m}\}$ (if it exists).
4. Reject all the hypotheses that match the Pvals $p_{(1)}, \ldots, p_{(i^*)}$ (if $i^*$ doesn't exist, reject none).
The rejected set of hypotheses is thus $\mathrm{Rejected}_{BH}(p_1, \ldots, p_m; \alpha) := \{i \in \{1, \ldots, m\} \,|\, p_i \leq \alpha \frac{i^*}{m}\}$.
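A minimal sketch of the BH procedure (for real analyses, the multipletests function in statsmodels.stats.multitest with method="fdr_bh" yields the same rejections):

```python
import numpy as np

def bh_procedure(pvals, alpha=0.05):
    """Return a boolean mask of the hypotheses rejected by the BH procedure."""
    p = np.asarray(pvals)
    m = len(p)
    p_sorted = np.sort(p)
    below = p_sorted <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.zeros(m, dtype=bool)          # i* does not exist: reject nothing
    i_star = np.max(np.nonzero(below)[0]) + 1   # largest i with p_(i) <= alpha*i/m
    return p <= alpha * i_star / m

# Example: 90 null Pvalues (uniform) and 10 strong signals.
rng = np.random.default_rng(8)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 1e-4, size=10)])
print("rejections:", bh_procedure(pvals, alpha=0.1).sum())
```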
Claim 55. Assume we conduct m hypothesis tests based on continuous independent test statistics $T_1, \ldots, T_m$ using the BH procedure with parameter α. Then $\mathrm{FDR} = \frac{m_0}{m}\alpha \leq \alpha$.
Remark 56. If the statistics are discrete it is still guaranteed that $\mathrm{FDR} \leq \frac{m_0}{m}\alpha$.
Proof. We note that the result of the BH procedure given parameter α is a function only of the vector $(p_1, \ldots, p_m)$ of Pvals obtained in the tests performed. Define the following events, informally:

$$C^{(i)}_k = \{\text{the } i\text{'th hypothesis was rejected} \implies k \text{ hypotheses were rejected}\}$$

The idea is that we fix all the Pvalues $p_j$ for $j \neq i$; if replacing the i'th Pvalue with any value $q_i$ that leads to the rejection of the i'th hypothesis results in exactly k hypotheses being rejected in total, the event occurred. More formally, $(p_1, \ldots, p_m) \in C^{(i)}_k$ iff for all $q_i \in [0, 1]$ exactly one of the following holds:
1. $i \notin \mathrm{Reject}_{BH}(p_1, \ldots, p_{i-1}, q_i, p_{i+1}, \ldots, p_m; \alpha)$ - the Pval $q_i$ does not lead to rejection of the i'th hypothesis.
2. $i \in \mathrm{Reject}_{BH}(p_1, \ldots, p_{i-1}, q_i, p_{i+1}, \ldots, p_m; \alpha)$ and also $R := |\mathrm{Reject}_{BH}(p_1, \ldots, p_{i-1}, q_i, p_{i+1}, \ldots, p_m; \alpha)| = k$.
That is, for any value of $q_i$, if the i'th hypothesis was rejected then k hypotheses were rejected.
Two important properties of the events $C^{(i)}_k$ (which we will not prove) are:
1. The event $C^{(i)}_k$ does not depend on the value of the RV $P_i$ and thus is independent of the event $\{P_i \leq \frac{k\alpha}{m}\}$.
2. For any i, the collection $\{C^{(i)}_k \,|\, k \in \{1, \ldots, m\}\}$ is a partition of the sample space, and thus $\sum_{k=1}^m P(C^{(i)}_k) = 1$.
We note that the event $\{P_i \leq \frac{k\alpha}{m}\} \cap C^{(i)}_k$ is identical by definition to the event $\{P_i \leq \frac{k\alpha}{m}\} \cap \{R = k\}$, and thus:

$$P(\{R = k\}) \cdot P\left(P_i \leq \frac{k\alpha}{m} \,\Big|\, R = k\right) = P\left(\left\{P_i \leq \frac{k\alpha}{m}\right\} \cap \{R = k\}\right) = P\left(\left\{P_i \leq \frac{k\alpha}{m}\right\} \cap C^{(i)}_k\right)$$
where R is again the number of rejections (note that given R = k, the i'th hypothesis is rejected iff $P_i \leq \frac{k\alpha}{m}$). Thus, by independence and linearity of expectation, we get:

$$\mathrm{FDR} = E\left[\frac{V}{R^+}\right] = \sum_{k=1}^m P(\{R = k\}) \cdot E\left[\frac{V}{k} \,\Big|\, R = k\right] = \sum_{k=1}^m \sum_{i=1}^{m_0} \frac{1}{k} P(\{R = k\}) \, P\left(P_i \leq \frac{k\alpha}{m} \,\Big|\, R = k\right)$$

$$\overset{\text{same events}}{=} \sum_{k=1}^m \sum_{i=1}^{m_0} \frac{1}{k} P\left(\left\{P_i \leq \frac{k\alpha}{m}\right\} \cap C^{(i)}_k\right) \overset{\text{independence}}{=} \sum_{k=1}^m \sum_{i=1}^{m_0} \frac{1}{k} P\left(\left\{P_i \leq \frac{k\alpha}{m}\right\}\right) \cdot P\left(C^{(i)}_k\right)$$

$$\overset{P_i | H_0 \sim U[0,1]}{=} \sum_{k=1}^m \sum_{i=1}^{m_0} \frac{1}{k} \cdot \frac{k\alpha}{m} \cdot P\left(C^{(i)}_k\right) \overset{\text{disjoint union}}{=} \frac{\alpha}{m} \sum_{i=1}^{m_0} \underbrace{\left[\sum_{k=1}^m P\left(C^{(i)}_k\right)\right]}_{=1} = \frac{\alpha}{m} \cdot m_0$$

as required.
We have shown that the BH procedure controls the FDR given independent test statistics [3]; the question remains what happens when independence cannot be assumed.
Definition 57. Recall that given $x, y \in \mathbb{R}^m$ we denote x ≤ y iff $x_i \leq y_i \; \forall i$. Furthermore, we will say that a subset $D \subseteq \mathbb{R}^m$ is ascending if for all $x \in D$ and all $y \in \mathbb{R}^m$, $x \leq y \implies y \in D$.
Definition 58. A RV $X := (X_1, \ldots, X_m)$ will be said to be Positively-Regression-Dependent on a Subset (PRDS) if for any ascending $D \subseteq \mathbb{R}^m$ and for all $1 \leq i \leq m$ the function $\psi_{i,D}(x) := P(X \in D \,|\, X_i = x)$ is non-decreasing in $x \in \mathbb{R}$. This means that the probability that $X \in D$ does not decrease when $X_i$ increases.
Claim 59. Assume we conduct m hypothesis tests based on continuous test statistics $T_1, \ldots, T_m$ that have the PRDS property. Then the BH procedure with parameter α guarantees $\mathrm{FDR} \leq \frac{m_0}{m}\alpha$.
Proof. A generalization of the proof for independent statistics, see [3].
Remark 60. Even though in the case of PRDS dependency (and more generally in other models of positive dependence) the BH procedure controls the FDR, which is the expectation of $\frac{V}{R^+}$, the variance of $\frac{V}{R^+}$ generally tends to grow as the level of positive dependence grows.
4.2.1 The BH procedure under more general dependence:
It turns out that without any assumption regarding the nature of dependence of the test statistics it is possible for the BH procedure to fail, and the FDR can exceed the $\frac{m_0}{m}\alpha$ bound. However, when such a deviation does occur, the degree to which the FDR exceeds the bound is not large, as is emphasized by the following claim:
Claim 61. Assume we conduct m hypothesis tests based on continuous test statistics $T_1, \ldots, T_m$. Then the BH procedure with parameter α guarantees the following bound:

$$\mathrm{FDR} \leq \frac{m_0}{m} \left( \sum_{i=1}^m \frac{1}{i} \right) \alpha \approx \frac{m_0 \log(m)}{m} \alpha$$
Proof. See [3].
Remark 62. Simulation studies have shown that in most cases which are not pathological examples, the method usually achieves the $\frac{m_0}{m}\alpha$ bound, and when it does not, the deviation from the bound is relatively small and does not reach the level of multiplication by the logarithmic factor.
Conclusion: The BH procedure is not particularly sensitive to existence of dependence of the test statistics.
4.2.2 What is the true meaning of FDR control:
We are reminded that the FDR is the expected value of the false rejection ratio, and thus control of the FDR does not guarantee a low false rejection ratio; specifically, $\mathrm{Var}\left(\frac{V}{R^+}\right)$ can be large even when $E\left[\frac{V}{R^+}\right]$ is kept small.
Example 63. Suppose we conduct a MHT for testing m = 10000 hypotheses using the BH procedure with parameter α = 0.1 and 1000 rejections were obtained. We would like to deduce that approximately 100 of these rejections are false and thus approximately 900 of our findings are true. However, an alternative explanation is that there is a strong positive dependence between the various test statistics, and thus in 10% of the cases in which we implement the procedure there would be 1000 rejections while in the other 90% there would be none (since we chose α = 0.1). In both scenarios the FDR is at most 0.1, but the actual circumstances are very different.
The problem is that the BH procedure is still quite conservative; especially in cases where $m_0 \ll m$ we would get a situation in which $\mathrm{FDR} \leq \frac{m_0}{m}\alpha \ll \alpha$. In actuality we would want a procedure that has enough power to detect effects that are not significant under the BH procedure and still guarantees FDR ≤ α. Notice that if we knew $m_0$ before using the procedure (which is not possible) we could have used the BH procedure with parameter $\frac{m}{m_0}\alpha$ and obtained the required result of FDR ≤ α.
4.2.3 Adaptive Procedures (Modified BH procedures):
Since $m_0$ is not known, we would like to estimate it and use the estimator in an attempt to obtain a procedure which will ensure FDR ≤ α. This approach gives rise to adaptive procedures, which we will now describe (see [2]). In an adaptive procedure the goal is to set the rejection threshold in an adaptive way which will be suitable for various $m_0$ values, such that eventually the obtained FDR value will be independent of $m_0$. These procedures generally follow the following scheme:
1. Compute an estimator $\hat{m}_0$ for $m_0$.
2. Conduct the BH procedure again with the parameter $\frac{m}{\hat{m}_0}\alpha$. That is, find:

$$i^* = \max\left\{ i \in \{1, \ldots, m\} \,\Big|\, p_{(i)} \leq \alpha \frac{i}{\hat{m}_0} \right\} \tag{4.1}$$

3. The rejected hypotheses are then $\mathrm{Rejected}_{ABH}(p_1, \ldots, p_m; \alpha) := \{i \in \{1, \ldots, m\} \,|\, p_i \leq \alpha \frac{i^*}{\hat{m}_0}\}$.
Iterative Method: Use the procedure iteratively until the estimator $\hat{m}_0$ converges (until the difference in the estimator between iterations is less than some small error value ε).
Remark 64. The simplest way to compute an estimator for $m_0$ is simply to use the standard BH procedure with parameter α and then take $\hat{m}_0 = m - R$ as an estimator. There are however many other variations, see [2, 4, 9].
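A minimal sketch of the resulting two-stage adaptive procedure with the simple estimator $\hat{m}_0 = m - R$ of Remark 64, reusing the bh_procedure sketch from above:

```python
import numpy as np

def adaptive_bh(pvals, alpha=0.05):
    """Two-stage adaptive BH: estimate m0 from a first BH pass, then rerun BH."""
    p = np.asarray(pvals)
    m = len(p)
    # Stage 1: standard BH with parameter alpha; estimate m0 = m - R.
    R = bh_procedure(p, alpha).sum()
    m0_hat = max(m - R, 1)
    # Stage 2: BH with the inflated parameter (m / m0_hat) * alpha.
    return bh_procedure(p, alpha * m / m0_hat)

rng = np.random.default_rng(9)
pvals = np.concatenate([rng.uniform(size=50), rng.uniform(0, 1e-4, size=50)])
print("BH rejections:         ", bh_procedure(pvals, 0.05).sum())
print("adaptive BH rejections:", adaptive_bh(pvals, 0.05).sum())
```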
Ideally, if this method worked perfectly, it would guarantee $\mathrm{FDR} \leq \frac{m_0}{E[\hat{m}_0]}\alpha$. In actuality this is not the case, and there are several technical difficulties in proving control of the FDR using an adaptive procedure:
• First, for any constant c the BH procedure with parameter cα guarantees $\mathrm{FDR} \leq \frac{m_0}{m} c\alpha$. However, if c is a RV there is no assurance that $\mathrm{FDR} \leq \frac{m_0}{m} E[c]\alpha$, and specifically $c = \frac{m}{\hat{m}_0}$ does not yield the desired result.
• Second, if $\hat{m}_0$ is an unbiased estimator of $m_0$, then $\frac{m}{\hat{m}_0}\alpha$ is biased upwards compared to α, since by Jensen's inequality we would know that $E\left[\frac{1}{\hat{m}_0}\right] > \frac{1}{E[\hat{m}_0]}$ and thus

$$E\left[\frac{m_0}{\hat{m}_0}\alpha\right] = E\left[\frac{1}{\hat{m}_0}\right] m_0 \alpha > \frac{m_0 \alpha}{E[\hat{m}_0]} = \alpha$$

which means that even if we achieve $\mathrm{FDR} \leq E\left[\frac{1}{\hat{m}_0}\right] m_0 \alpha$ we are not guaranteed FDR ≤ α.
A possible solution: this analysis immediately shows that we would prefer positively biased estimators of $m_0$ ($E[\hat{m}_0] > m_0$). This type of estimator is more conservative, in the sense that it estimates the number of hypotheses for which H0 is false ($m - m_0$) to be smaller than it actually is. It turns out that given such estimators it is possible to prove control of the FDR by α under the assumption of independence, as the following theorem claims:
Theorem 65. Suppose that $\hat{m}_0 = \hat{m}_0(P_1, \ldots, P_m)$ is a monotonic (in the $P_i$ values) estimator of $m_0$. Denote by $\hat{m}_0^{(-1)}$ the same estimator calculated for m − 1 hypotheses, with the same $P_i$ values except one value which is True Null (that is, except for one $P_j$ for which H0 is known to be true and thus $P_j \sim \mathrm{Uniform}[0, 1]$). Then, assuming $P_1, \ldots, P_m$ are independent, it holds that $\mathrm{FDR} \leq E\left[\frac{m_0}{\hat{m}_0^{(-1)}}\right]\alpha$.
Proof. See [13].
Remark 66. This theorem almost fulfills the demand $\mathrm{FDR} \leq \frac{m_0}{E[\hat{m}_0]}\alpha$, at the price of increasing $\hat{m}_0$ by at most 1. Furthermore, there are other (biased) estimators that, by using some additional conservative corrections, guarantee that $E\left[\frac{1}{\hat{m}_0}\right] \leq \frac{1}{m_0}$, and for these estimators control of the FDR at level α can be shown (under certain conditions).
Conclusion: At the price of being slightly conservative in the estimation of $m_0$, it is possible to construct an adaptive procedure that, under independence of the Pvalues, controls the FDR as required. Specifically, when the difference between $m_0$ and m is large, using such a procedure provides a great improvement in power compared to the standard BH procedure.
4.2.4 What to do if the Pvalues are not independent:
When the p-values are not independent, the variance of the estimators of m0 grows significantly and the estimators
become unstable, such that the corrections required in order to ensure that E[1/m̂0] ≤ 1/m0 become greater and
greater. Thus there is no adaptive procedure that controls the FDR under general dependence, or even under
specific types of dependence, unless one is willing to lose a lot of power by using very conservative estimators of
m0. However, doing so is pointless, since the original purpose of the adaptive procedure was to improve the power
compared to the standard BH procedure.
Conclusion: if it is known (or shown) that there is only weak dependence between the p-values, then the use
of an adaptive procedure can be very attractive. On the other hand, when the dependence is strong it is not
recommended to use these procedures, since the FDR can grow beyond what is expected.
It remains an open challenge to find an adaptive procedure that both ensures high power and controls the FDR at
level α even when the p-values are dependent (or to prove that no such procedure exists).
4.2.5 Estimation versus control of the FDR:
The purpose of the BH procedure and its variations is to control the size of the FDR. An alternative approach
to the problem is to first use a certain procedure for MHT and then attempt to estimate the FDR of the procedure
in order to evaluate it (see [9]). Denote by π0 = m0/m the proportion of True Null hypotheses.
Definition 67. Consider a MHT with a rejection threshold γ (reject the i'th hypothesis iff Pi ≤ γ); we denote:
V(γ) := #{false rejections with threshold γ}
R(γ) := #{rejections with threshold γ} = #{pi ≤ γ}
Storey's Procedure (2002):
1. Select a parameter λ ∈ (0, 1).
2. Compute all the p-values p1, ..., pm.
3. Estimate π0 by π̂0(λ) = #{pi > λ} / ((1 − λ)·m).
4. Given a threshold γ ∈ (0, 1), estimate P(Pi ≤ γ) by P̂(Pi ≤ γ) = max{#{pi ≤ γ}, 1} / m (note that #{pi ≤ γ} = R(γ)).
5. Estimate the FDR by ˆFDRλ(γ) := π̂0(λ)·γ / P̂(Pi ≤ γ).
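Before unpacking why these estimators make sense, here is a minimal Python sketch of the five steps above (the function name and the default λ = 0.5 are our own illustrative choices):

import numpy as np

def storey_fdr_estimate(pvals, gamma, lam=0.5):
    """Estimate FDR_hat_lambda(gamma) following the five steps above (a sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    # Step 3: pi0_hat(lambda) = #{p_i > lambda} / ((1 - lambda) * m)
    pi0_hat = np.sum(p > lam) / ((1.0 - lam) * m)
    # Step 4: P_hat(P_i <= gamma) = max(#{p_i <= gamma}, 1) / m = max(R(gamma), 1) / m
    pr_hat = max(int(np.sum(p <= gamma)), 1) / m
    # Step 5: FDR_hat = pi0_hat * gamma / P_hat(P_i <= gamma)
    return pi0_hat * gamma / pr_hat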
Explanation:
1. The reason π̂0(λ) is a sensible (and unbiased) estimator of π0 is that #{pi > λ}/(1 − λ) is an unbiased estimator of m0:

E[#{Pi > λ}] = E[∑_{i=1}^m 1{Pi > λ}] = ∑_{i=1}^m E[1{Pi > λ}] = ∑_{i=1}^m P(Pi > λ) (†)= m0 · P(Pi > λ | Pi ∼ Uniform[0, 1]) = m0·(1 − λ)
The marked equality (†) results from the fact that Pi > λ iff the i'th hypothesis was not rejected at confidence level
λ. Thus P(Pi > λ) = 0 unless Pi is distributed under H0, in which case it is distributed Uniform[0, 1] and has
probability (1 − λ) of being larger than λ. Since there are m0 values of i for which P(Pi > λ) = (1 − λ) and for
the remaining i values P(Pi > λ) = 0, the result is obtained (a quick simulation check appears after this list).
2. The reason ˆFDRλ(γ) is a sensible estimator of FDR = E[V / max(R, 1)] is twofold:
(a) First, V(γ)/m is the proportion of false rejections with threshold γ, and each of the m0 True Null p-values falls below γ with probability γ, so E[V(γ)/m] = π0·γ; thus π̂0(λ)·γ is an estimator of
(1/m)·E[V(γ)] = E[V(γ)/m], since π̂0 is an unbiased estimator of π0 = m0/m.
(b) Second, P̂(Pi ≤ γ) by definition is clearly an estimator of (1/m)·E[R(γ)] = E[R(γ)/m].
Thus the quotient π̂0(λ)·γ / P̂(Pi ≤ γ) is an estimator of E[V(γ)]/E[R(γ)], meaning we estimate the quotient of the expectations
instead of the expectation of the quotient, which is the FDR.
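A quick simulation check of the unbiasedness claim in item 1, under the idealized assumption above that non-null p-values are essentially 0 (the sample sizes and seed here are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
m, m0, lam = 1000, 800, 0.5
estimates = []
for _ in range(2000):
    # True Nulls have Uniform[0,1] p-values; false nulls are idealized as p = 0
    p = np.concatenate([rng.uniform(size=m0), np.zeros(m - m0)])
    estimates.append(np.sum(p > lam) / (1 - lam))
print(np.mean(estimates))  # should come out close to m0 = 800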
Under certain conditions it can be shown that the estimator ˆFDRλ(γ) has some desirable properties, for instance in the
case where m grows but the proportion π0 remains constant (in which case R and V grow in tandem).
Although we described this as an estimation procedure for the FDR, it can also be used for the purpose of control, as part
of the adaptive BH procedure, by using the estimator m̂0 = m·π̂0(λ); furthermore, we can optimize over λ to achieve
even better results.
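For instance, plugging Storey's estimator into the adaptive BH sketch from Section 4.2.3 (reusing the hypothetical bh_reject helper defined there) might look like:

import numpy as np

def storey_adaptive_bh(pvals, alpha, lam=0.5):
    """Adaptive BH with m0_hat = m * pi0_hat(lambda) -- a sketch."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    m0_hat = max(np.sum(p > lam) / (1.0 - lam), 1.0)  # = m * pi0_hat(lambda), kept >= 1
    return bh_reject(p, alpha * m / m0_hat)           # bh_reject as sketched in 4.2.3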
4.3 An alternative method for MHT using Qvalues:
Storey ([10]) proposed an alternative method for MHT using what he termed q-values. The goal of the procedure is
to produce a measure of confidence in each hypothesis which is analogous to the p-value but is suitable for multiple
hypothesis testing.
Definition 68. Given a MHT with m hypotheses, the q-values q1, ..., qm are defined so that qi is the minimal α value such that the
BH procedure with parameter α rejects the i'th hypothesis.
Remark 69. The q-values represent the significance of each hypothesis in the context of being tested as part of a
MHT. Using q-values is obviously equivalent to using the BH procedure, but it has the advantage of giving a more
accessible representation of MHT by transferring the majority of the difficulty to the computation of the q-values.
After said computation is done, each q-value is simply used to determine whether to reject the i'th hypothesis, just as one would use
the p-value for a single hypothesis test. Meaning, if we want to ensure FDR ≤ α, it suffices to reject all hypotheses
for which qi ≤ α. The following algorithm computes the q-values:
1. Compute all the p-values p1, ..., pm and order them p(1) ≤ ... ≤ p(m).
2. Compute q(i) = min(p(i)·m/i, 1).
3. Shrink and order: for i = m − 1 down to 1 set q(i) = min{q(i), q(i+1)}.
4. To get qi from q(i), apply the inverse of the permutation pi ↦ p(i).
Step 3 is performed in order to restore the monotonicity of the ordered values q(i), since it is impossible that p(i) < p(j)
and at the same time q(i) > q(j). One can show that this algorithm indeed computes the values in accordance with the
definition of the q-values.
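A minimal Python sketch of this four-step algorithm (the function name is our own):

import numpy as np

def q_values(pvals):
    """Compute q-values from p-values following the four steps above (a sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                                   # step 1: permutation p_i -> p_(i)
    # step 2: q_(i) = min(p_(i) * m / i, 1) for i = 1, ..., m
    q_sorted = np.minimum(p[order] * m / np.arange(1, m + 1), 1.0)
    # step 3: a running minimum from the largest index downwards restores monotonicity
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
    # step 4: undo the sorting permutation to recover q_i from q_(i)
    q = np.empty(m)
    q[order] = q_sorted
    return q

Rejecting all hypotheses with q_values(p) ≤ α then reproduces the BH procedure at level α, as in Remark 69.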
Remark 70. In general, when conducting a MHT we assume that we only have access to the p-values and that we
have no preference for certain hypotheses over others. Assuming this is true, we will always use the same rejection
threshold for all hypotheses, since there is no logical reason to reject one hypothesis with a certain p-value if we did
not reject another hypothesis that has a lower p-value. Under different assumptions it is possible to define other
procedures that do not use an identical rejection threshold for all hypotheses.
4.4 Other variations of FDR:
- positive FDR
- local FDR (lfdr)
4.5 Empirical Bayes View:
References
[1] Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) (1995), 289–300.
[2] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli, Adaptive linear step-up procedures that control the false discovery rate, Biometrika 93 (2006), no. 3, 491–507.
[3] Yoav Benjamini and Daniel Yekutieli, The control of the false discovery rate in multiple testing under dependency, Annals of Statistics (2001), 1165–1188.
[4] Bradley Efron, Robert Tibshirani, John D Storey, and Virginia Tusher, Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 96 (2001), no. 456, 1151–1160.
[5] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J Smola, A kernel method for the two-sample-problem, Advances in Neural Information Processing Systems 19 (2007), 513.
[6] Wassily Hoeffding, A non-parametric test of independence, The Annals of Mathematical Statistics (1948), 546–557.
[7] Shachar Kaufman, Ruth Heller, Yair Heller, and Malka Gorfine, Consistent distribution-free tests of association between univariate random variables, arXiv preprint arXiv:1308.1559 (2013).
[8] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti, Detecting novel associations in large data sets, Science 334 (2011), no. 6062, 1518–1524.
[9] John D Storey, A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (2002), no. 3, 479–498.
[10] John D Storey, The positive false discovery rate: a Bayesian interpretation and the q-value, Annals of Statistics (2003), 2013–2035.
[11] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007), no. 6, 2769–2794.
[12] Gábor J Székely and Maria L Rizzo, Brownian distance covariance, The Annals of Applied Statistics 3 (2009), no. 4, 1236–1265.
[13] Amit Zeisel, Or Zuk, and Eytan Domany, FDR control with adaptive procedures and FDR monotonicity, The Annals of Applied Statistics 5 (2011), no. 2A, 943–968.