
Lecture Notes


Modern Statistical Data Analysis:

February 27, 2015

Contents

I Hypothesis Testing:

1 Data and Data Preprocessing:
 1.1 Visualization of Data:
 1.2 Data Transformations:
 1.3 Completion of Missing Values:

2 Classical Hypothesis Testing:
 2.1 Some Notation and Reminders:
  2.1.1 Classical Problems in Parametric Statistics:
 2.2 Non-Parametric Tests:
  2.2.1 Permutation Tests:

3 Tests for Independence:
 3.1 Classical tests for independence of scalar RVs:
 3.2 Testing for independence using permutations:
  3.2.1 Testing for independence using a Kernel function:
  3.2.2 Testing for independence using the Distance Correlation (dCor) method:
  3.2.3 Another distance based method for continuous univariate RVs:
  3.2.4 An extension of Hoeffding's method:
  3.2.5 An information theoretic test for independence:
 3.3 Comparison of different tests for independence:
  3.3.1 Equitable tests for independence:

4 Multiple Hypothesis Testing (MHT):
 4.1 Family Wise Error Rate (FWER):
  4.1.1 The Bonferroni Correction:
 4.2 False Discovery Rate (FDR):
  4.2.1 The BH procedure under more general dependence:
  4.2.2 What is the true meaning of FDR control:
  4.2.3 Adaptive Procedures (Modified BH procedures):
  4.2.4 What to do if the Pvalues are not independent:
  4.2.5 Estimation versus control of the FDR:
 4.3 An alternative method for MHT using Qvalues:
 4.4 Other variations of FDR:
 4.5 Empirical Bayes View:

References

Part I

Hypothesis Testing:

1 Data and Data Preprocessing:

During this course, when we discuss "data" we will usually be referring to a collection of vectors x1, ..., xn ∈ R^p, which we usually assume to be realizations of an i.i.d. sequence X1, ..., Xn ∼ F, with F being some known or unknown distribution.

1.1 Visualization of Data:

One of the first things one wants to do when dealing with real data, before performing any analysis or employing any statistical methods, is to look at the data and try to get some initial idea about its nature. There are several ways to do this:

• Examination of scatter-plots of the various variables in relation to one another can help detect trends and relations, as well as outliers. For example, consider these two images:

Figure 1.1: Passing a linear trend-line through a scatter-plot for the collection of (x, y) values (blue dots). We would like to model the relation between X and Y and possibly predict, given an X value, what corresponding Y value would match it. The linear regression line (in black) is one way to do this.

Figure 1.2: In this scatter-plot we see an outlier. This outlier is clearly visible in this representation of the data, while we would not be able to detect it using a histogram of the X or Y values separately.

• Examination of density-plots might be more informative when the number of observations and their locations make a regular scatter-plot uninformative.

Figure 1.3: On the left side we see a scatter-plot of (x, y) values; the sheer amount of observations and the way they are scattered does not allow us to notice any trend, as all areas look equally dense (equally covered with black dots). On the other hand, the right side shows a density plot where different colors represent the point density in each area. In this particular example the density in the red area is roughly 1000 times higher than in the purple area - something that could not be seen in the left plot.

• Examination of box-plots gives a compact and convenient representation of the averages and standard deviations of data in various subgroups of the data. This is mostly useful when there is a large number of variables that need to be compared or examined together.

Figure 1.4: It can be seen that a box-plot can represent the distribution of the observations around the mean across several groups. This allows comparison of both the means and the standard deviations of the various groups, and allows detection of differences between the groups.
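As a minimal illustration of the per-group quantities a box-plot summarizes, the following sketch computes five-number summaries (min, quartiles, max) for two hypothetical groups in plain Python; in practice one would simply hand the raw groups to a plotting library.

```python
import statistics

def five_number_summary(values):
    """The quantities a box-plot displays for one group: min, Q1, median, Q3, max."""
    v = sorted(values)
    q1, med, q3 = statistics.quantiles(v, n=4)  # quartile cut points
    return {"min": v[0], "q1": q1, "median": med, "q3": q3, "max": v[-1]}

# Hypothetical data for two groups to be compared side by side.
groups = {
    "A": [1.0, 2.0, 2.5, 3.0, 4.0, 5.5],
    "B": [2.0, 4.0, 4.5, 5.0, 6.0, 9.0],
}
for name, vals in groups.items():
    print(name, five_number_summary(vals))
```

Laying these summaries side by side per group is exactly the comparison Figure 1.4 describes.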

1.2 Data Transformations:

In many cases there are non-linear relationships between variables in a dataset, and these relationships are harder to detect and understand. Using a transformation (for example logarithmic or polynomial) of one or several variables, it is often possible to get a clearer representation of the relationships in the data, or to compensate for differences in order of magnitude in the data.
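A small sketch of the logarithmic case, on synthetic data with an assumed multiplicative relationship y = 2·x³·noise: on the original scale the relation is non-linear, while on the log scale it becomes linear and the correlation rises.

```python
import math
import random

random.seed(0)
# Synthetic data with a multiplicative relationship: y = 2 * x^3 * noise.
xs = [random.uniform(1.0, 100.0) for _ in range(200)]
ys = [2.0 * x**3 * math.exp(random.gauss(0, 0.1)) for x in xs]

def pearson(u, v):
    """Sample Pearson correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# After the transform, log y = log 2 + 3 log x + noise is linear in log x,
# so the correlation on the log scale should be very close to 1.
r_raw = pearson(xs, ys)
r_log = pearson([math.log(x) for x in xs], [math.log(y) for y in ys])
print(r_raw, r_log)
```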

5

1.3 Completion of Missing Values: 2 CLASSICAL HYPOTHESIS TESTING:

1.3 Completion of Missing Values:

In many cases some of the values of some variables in the dataset are missing, and there are methods to complete these values using various estimation techniques. This is helpful when one wants to analyze the data with a method that does not tolerate missing values, but does not want to simply omit observations.
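The simplest such completion scheme is mean imputation; the text does not commit to a particular method, so the following is only a deliberately basic sketch (model-based imputation is also common).

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values.
    A deliberately simple completion scheme for illustration only."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Missing entries are filled with the mean of the observed values (here 3.0).
print(impute_mean([1.0, None, 3.0, None, 5.0]))
```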

2 Classical Hypothesis Testing:

2.1 Some Notation and Reminders:

Remark 1. From this point on we will abbreviate both a scalar random variable and a random vector by writing RV. The dimension will be left to be inferred from the context.

Definition 2. Given a RV X with distribution F (usually described by a cumulative distribution function), a distribution parameter θ is some (often unknown) value that characterizes the distribution (such as the expectation, variance, etc.).

Definition 3. An independent sample or realization x1, ..., xn ∈ R^d from the distribution F is a realization of a sequence of random variables X1, ..., Xn ∼ F i.i.d. (here Xi can be either scalar or vector valued).

Definition 4. Given a distribution F, a statistic of the distribution is a RV T(X1, ..., Xn) which is a function of a sequence of RVs X1, ..., Xn ∼ F i.i.d. The value of the statistic T(x1, ..., xn) is calculated based on realizations.

Remark 5. In many cases the statistics we will discuss will be estimators for the parameters of some distribution. For example, the sample average is an estimator for the parameter which is the distribution expectation.

Definition 6. Given a family of distributions F(x; θ) (a statistical model) parameterized by θ, an estimator of θ is any function θ̂(x1, ..., xn) that tries to approximate θ based on a sample from the distribution F.

Definition 7. The bias of an estimator θ̂(X1, ..., Xn) for the parameter θ is defined by Bias(θ̂) := E_θ[θ̂ − θ]. An estimator is said to be unbiased if its bias equals zero.

Definition 8. A loss function for an estimator θ̂ of the parameter θ is any function L(θ̂, θ) that quantifies the "nearness" of the estimator to the real parameter value. One example is the squared-error loss L(θ̂, θ) = ‖θ̂ − θ‖₂².

Definition 9. A risk function for an estimator θ̂ of the parameter θ is the expected value of some loss function, R(θ̂, θ) = E_θ[L(θ̂, θ)]. One common example is the mean-squared-error risk MSE(θ̂) := E_θ[‖θ̂ − θ‖₂²].

2.1.1 Classical Problems in Parametric Statistics:

In classical statistics the assumption is usually that, given observations x1, ..., xn, there is a family of distributions F(x; θ) parameterized by θ (both x and θ can be scalars or vectors) such that x1, ..., xn is an i.i.d. sample from F. In this context there are several classical questions that arise:


1. Point Estimation: Given the observations x1, ..., xn one tries to find an estimator θ̂(x1, ..., xn) which approximates the unknown parameter θ. There are a few desirable qualities in such an estimator:

(a) Consistency: we would like the estimator to converge in probability to the real value of θ.

(b) Lack of Bias: we would ideally like the estimator to have little or no bias, in the sense that E_θ[θ̂ − θ] ≈ 0.

(c) Low Risk: given some risk function R we would like an estimator with low risk.

2. Confidence Interval Estimation: a confidence interval [a, b] for the parameter θ with confidence level α is any interval such that P_θ[θ ∈ [a, b]] = 1 − α. We often want to estimate such intervals based on observations, and the interval endpoints (usually taken to be symmetric) are then functions of the observations.

3. Hypothesis Testing: hypothesis testing deals with decision problems of the type θ = θ0, or θ ≤ θ0 (or θ ≥ θ0). There is a null hypothesis regarding the true value of the parameter, denoted by θ0, and the objective is to decide with a given level of certainty whether to accept or reject this null hypothesis.

Several key elements of all of these problems are the number of parameters that need to be estimated, the amount of data available, and the number of hypotheses we wish to test.

Definition 10. Suppose x1, ..., xn is a realization of X1, ..., Xn ∼ F(x; θ) i.i.d. Given a null hypothesis H0 regarding the value of θ, and the complementary alternative hypothesis H1, there are several stages to a statistical test of H0:

1. Definition and calculation of a test statistic T(X1, ..., Xn) based on the observations.

2. Definition of a rejection area R(α) which the test statistic enters with probability α under H0.

3. Rejection of H0 iff T(x1, ..., xn) ∈ R(α) (equivalently, acceptance of H0 iff T ∈ R^c).

There are two kinds of errors that arise in this context:

1. Type 1 Error (false positive): α = P(T ∈ R | H0 is true)

2. Type 2 Error (false negative or miss): β = P(T ∈ R^c | H1 is true)

Definition 11. Given a statistical test with test statistic T(x1, ..., xn), the Pvalue of the test is defined as the probability of observing a value at least as extreme as T(x1, ..., xn) under the distribution of the observations implied by H0:

Pvalue = P(T(X1, ..., Xn) ≥ T(x1, ..., xn) | T(X1, ..., Xn) is distributed according to H0)

It is important to remember that, just like T(X1, ..., Xn), the Pvalue is a random variable which is a function of X1, ..., Xn. Furthermore, if H0 is true then the distribution of the Pvalue is Uniform[0, 1].

Figure 2.1: Here we see the empirical density (left) and empirical cumulative distribution (right) of 500 Pvalues calculated by a simulation in which the data was sampled according to H0. It can be seen that the simulation result indeed shows that the approximated density of the Pvalues in such a case is roughly Uniform[0, 1].
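A simulation of the kind behind Figure 2.1 can be sketched as follows (a hypothetical setup: a two-sided Z-test for the mean of a normal sample with known σ, repeated on data drawn under H0): the resulting Pvalues should be approximately Uniform[0, 1], so about a fraction α of them fall below any α.

```python
import math
import random

random.seed(1)

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided Z-test Pvalue for the mean of a normal sample with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return 2 * min(phi, 1 - phi)

# Sample data under H0 many times; the resulting Pvalues should be ~ Uniform[0, 1].
pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)]) for _ in range(500)]
frac_below = sum(p < 0.1 for p in pvals) / len(pvals)
print(frac_below)  # close to 0.1 under H0
```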

Remark 12. A few remarks:

• It is equivalent to reject H0 when T(x1, ..., xn) ∈ R(α) and to reject H0 when Pval < α.

• The Pvalue is a good measure of our confidence in the "correctness" of H0 when testing a single hypothesis.

• In the parametric case there is often an analytic way to calculate the Pvalue based on the distribution of the test statistic under H0 (for example, this is possible in the classic t-test).

• It is a common misconception that the following relation exists:

Pval = P(H0 is true | observed data) = P(T(X1, ..., Xn) is distributed according to H0 | T(X1, ..., Xn) = T(x1, ..., xn))

This is however not true, and in order to calculate P(H0 is true | observed data) one would need to rely on Bayesian statistics and assume some prior distribution on H0.

• In order to calculate the type-2 error β one would need an assumption on the distribution of θ under H1.

Definition 13. Given a statistical test with test statistic T and rejection area R(α), the power of the test is

Power = 1 − β = P(T ∈ R(α) | H0 is false)

This value of course depends on the choice of α. It is generally desired that a test have both a low α and high power, but in reality there is a trade-off between the two and that is not always possible.

Figure 2.2: Here we see the density of the test statistic under H0 (in blue) and under H1 (in red). Given some rejection threshold (denoted by the black line), we would reject H0 if the test statistic falls to the right of the line. The confidence level α is then the area to the right of the line under the density given H0; here that area is marked in red. The power of the test is the area to the right of the line under the density given H1, here marked in green. It can be seen that moving the rejection threshold impacts both α and the power.

Example 14. One of the simplest examples of a parametric test is the independent two-sample test for equality of means. Assume we are given observations {(xi, yi)}_{i=1}^n, where yi ∈ {0, 1} is a categorical variable, and we assume that the xi values are realizations from the distribution Xi | yi ∼ N(μ_{yi}, σ²). We would like to test the hypothesis H0: μ0 = μ1, that is, that the means in both groups are equal. If we assume that σ is known then we can conduct a simple Z-test by calculating the z-score of the mean difference:

n_j = Σ_{i=1}^n 1{yi = j}

μ̂_j = (1/n_j) Σ_{i=1}^n xi 1{yi = j}

Z = (μ̂1 − μ̂0) / √((1/n0 + 1/n1) σ²)

Under the assumption that H0 is true, Z ∼ N(0, 1), and thus we reject H0 iff Z > z_{α/2} or Z < −z_{α/2}, where z_{α/2} = Φ⁻¹(1 − α/2). This is a two-sided test in which we have equally divided the rejection area between the two tails of the distribution. We can also calculate the Pval of the test, given by Pval = 2 min{Φ(Z), 1 − Φ(Z)}. Similarly, if the variance of the distribution is unknown, we use a t-test with n − 2 degrees of freedom by calculating the following statistic:

t_{n−2} = (μ̂1 − μ̂0) / √(σ̂0²/n0 + σ̂1²/n1) ,  where  σ̂_j² = (1/(n_j − 1)) Σ_{i=1}^n (xi − μ̂_j)² 1{yi = j}

The quantity in the denominator is simply an estimator for σ². In this case the distribution of t_{n−2} is no longer normal, but it is a known distribution, the T-distribution with n − 2 degrees of freedom. This distribution can be used to calculate the rejection area and the Pval in the same way as with the Z-test.

One problem with the t-test is that it assumes an underlying normal distribution of the observations, an assumption which cannot be omitted and thus limits this test to specific cases. Additionally, the t-test is not guaranteed to control the type-1 error α, especially for skewed distributions.
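The known-σ Z-test of Example 14 can be sketched directly (synthetic data with a true mean difference of 1, so the test should reject):

```python
import math
import random

def two_sample_z(x, y, sigma):
    """Z statistic of Example 14: binary labels y in {0, 1}, known common sigma."""
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    n0, n1 = len(g0), len(g1)
    mu0, mu1 = sum(g0) / n0, sum(g1) / n1
    z = (mu1 - mu0) / math.sqrt((1 / n0 + 1 / n1) * sigma**2)
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))     # Phi(z)
    pval = 2 * min(phi, 1 - phi)                     # two-sided Pval
    return z, pval

random.seed(2)
labels = [0] * 50 + [1] * 50
# Group 1 is shifted by 1, so H0: mu0 = mu1 is false here.
xs = [random.gauss(0.0 if l == 0 else 1.0, 1.0) for l in labels]
z, p = two_sample_z(xs, labels, sigma=1.0)
print(z, p)
```

With the unknown-σ variant one would instead plug the per-group variance estimates σ̂_j² into the denominator, as in the t-statistic above.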

2.2 Non-Parametric Tests:

Definition 15. Given X1, ..., Xn ∼ F i.i.d., a statistic T(X1, ..., Xn) will be called Distribution Free if the distribution of T does not depend on F. Furthermore, a hypothesis test for H0 will be called a distribution free test if the test statistic is distribution free, that is, its distribution does not depend on the distribution of the observations under H0.

Remark 16. While the distribution of a distribution free test statistic must not depend on the distribution of the data under H0, it is possible that it does depend on the alternative hypothesis.

Example 17. We will now give an example of a distribution free non-parametric test for the two independent sample decision problem. Given the observations {(xi, yi)}_{i=1}^n, where again yi ∈ {0, 1}, we would like to determine whether Xi | yi = j has the same distribution for both j = 0 and j = 1. The Mann-Whitney-Wilcoxon Rank-Sum Test allows us to do this and works as follows:

1. The rank of each observation is calculated as r_i = Σ_{j=1}^n 1{xj ≤ xi}.

2. For j = 0, 1 the value R_j = Σ_{i=1}^n r_i 1{yi = j} is calculated.

3. Given n1 = Σ_{i=1}^n 1{yi = 1}, the test statistic U = R1 − n1(n1 + 1)/2 is calculated.

4. The distribution of U is calculated and a rejection area is chosen accordingly, given confidence level α.

Although U does not appear to be symmetric, it actually is, since R0 + R1 = n(n + 1)/2 is constant. The main advantage of this test is that it is distribution free and thus can be used regardless of the underlying distribution of the observations. The main question that remains is how to calculate the distribution of U in order to obtain the rejection area. In practice, for relatively small samples (up to 20 observations) there are tables giving said distribution directly, while for large samples there is a normal approximation to the distribution of U. An alternative, more modern method is to use numerical methods, as we will now discuss.
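The four steps above can be sketched in a few lines (computing U only; the rejection area would come from tables, the normal approximation, or permutations). The toy data are hypothetical, chosen so that all group-1 values exceed all group-0 values and U attains its maximum n0·n1:

```python
def rank_sum_statistic(x, y):
    """Mann-Whitney-Wilcoxon U statistic for binary labels y (assumes no ties)."""
    ranks = [sum(1 for xj in x if xj <= xi) for xi in x]   # r_i = #{j : x_j <= x_i}
    n1 = sum(1 for yi in y if yi == 1)
    r1 = sum(ri for ri, yi in zip(ranks, y) if yi == 1)    # rank sum of group 1
    return r1 - n1 * (n1 + 1) / 2

# Group-1 values are all larger than group-0 values, so U = n0 * n1 = 9.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
print(rank_sum_statistic(x, y))  # → 9.0
```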

2.2.1 Permutation Tests:

Given the data {(xi, yi)}_{i=1}^n, where the yi are again binary labels, we want to test the hypothesis H0: X, Y are independent (equivalently, if x1, ..., x_{n1} ∼ F and x′1, ..., x′_{n2} ∼ G, the hypothesis would be H0: F = G). Assume that Tobs := T((x1, y1), ..., (xn, yn)) is some statistic (essentially any function of the data). A permutation test based on the statistic T would be carried out as follows:

1. Draw N random permutations s1, ..., sN ∈ Sn.

2. For each permutation sk calculate the test statistic on the permuted data: Tk := T({(xi, y_{sk(i)})}_{i=1}^n).

3. Compute the empirical Pvalue P = (1/N) Σ_{k=1}^N 1{Tk > Tobs}.

4. Reject H0 if P < α for a given confidence level α.
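The four steps above can be sketched as follows, here with a hypothetical choice of statistic (the absolute difference of group means); any function of the data would do:

```python
import random

def mean_diff(x, y):
    """Absolute difference of group means -- one possible choice of T."""
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_pvalue(x, y, stat, n_perm=1000, seed=0):
    """Empirical Pvalue: fraction of label permutations whose statistic exceeds T_obs."""
    rng = random.Random(seed)
    t_obs = stat(x, y)
    count = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)       # permute the labels, keeping x fixed
        if stat(x, y_perm) > t_obs:
            count += 1
    return count / n_perm

# Toy data where the two groups clearly differ, so the Pvalue is small.
x = [0.1, 0.3, 0.2, 0.0, 2.1, 2.4, 1.9, 2.2]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(permutation_pvalue(x, y, mean_diff))
```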

Claim 18. For any joint distribution (Xi, Yi) ∼ F i.i.d. and any test statistic T({(xi, yi)}_{i=1}^n), the distribution of the empirical Pval under H0 is approximately Uniform[0, 1].

Proof. Sketch of proof: Under H0 the RVs T1, ..., TN, Tobs are i.i.d., and thus #{k : Tk > Tobs} is uniformly distributed on {0, ..., N}, which immediately gives us that P ∼ Uniform{0, 1/N, 2/N, ..., 1 − 1/N, 1}; thus when N → ∞ the distribution of P converges to Uniform[0, 1].

Remark 19. All of this is correct assuming that T is continuous and there are no ties. However, if T is discrete and ties are possible, the distribution of P is still approximately uniform.

Fact 20. The above procedure guarantees that P (reject |H0 = true) ≤ α.

The advantage of permutation tests over normal approximations is that the permutation test better captures the tails of the distribution. Even though for very large samples the normal approximation converges to the true distribution, in practice it still might not be satisfactory, especially when one wants to conduct a test with very small α values.

Advantages of permutation tests: accuracy, no underlying assumptions about the data, flexibility (can be used with complicated null models).

Disadvantages of permutation tests: computationally intensive, and they do not provide any insight regarding the analytic characteristics of the distribution (for example, the relation between the power of the test and the sample size is not easily understood).

2.2.1.1 Using permutations to calculate test power:

Suppose we have an assumption regarding the underlying distribution of the data under both H0 and H1, and we have a test statistic T meant for testing H0. We would often like to know the power of the test and how the power changes with the sample size. In the parametric approach we can often analyze the power by direct analytic computation or via an asymptotic approximation. In the non-parametric, computational approach we can estimate the power using permutations with the following algorithm:

• Input: confidence level α, an assumed distribution F of the data under H1, sample size n.

• Parameters: N - number of permutations, K - number of simulations.

• for k = 1, ..., K do:

 - Sample by simulation {(xi, yi)}_{i=1}^n ∼ F i.i.d.

 - Perform a permutation test with N permutations at confidence level α based on the sampled data.

 - Denote by Rk ∈ {0, 1} the result of the test (Rk = 1 if the k'th simulation rejected H0).

• The estimator for the power is then 1 − β̂ = (1/K) Σ_{k=1}^K Rk.

• Computational cost: O(N · K), which could be quite costly.
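The algorithm above can be sketched as follows, under an assumed H1 (a normal mean-shift of a hypothetical size `effect`) and with the mean-difference statistic standing in for T:

```python
import random

def mean_diff(x, y):
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_test(x, y, alpha, n_perm, rng):
    """Single permutation test; returns True iff H0 is rejected."""
    t_obs = mean_diff(x, y)
    count = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if mean_diff(x, y_perm) > t_obs:
            count += 1
    return count / n_perm < alpha

def estimate_power(effect, n, alpha=0.05, n_sim=100, n_perm=200, seed=3):
    """Fraction of simulated H1 datasets on which the permutation test rejects H0.
    Assumed H1: X | y ~ N(effect * y, 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        y = [0] * (n // 2) + [1] * (n - n // 2)
        x = [rng.gauss(effect * yi, 1.0) for yi in y]
        rejections += permutation_test(x, y, alpha, n_perm, rng)
    return rejections / n_sim

print(estimate_power(effect=1.5, n=20))  # large effect: power well above alpha
```

As the last bullet notes, the cost is O(N·K) statistic evaluations, which is why N and K are kept small here.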

3 Tests for Independence:

Suppose we are given two RVs X, Y and we would like to test whether they are independent. Ideally we would like a test able to detect any kind of probabilistic dependence between the two variables (not necessarily linear or monotonic, for example).

Definition 21. Given X1, ..., Xn ∼ F(x; θ) i.i.d., an estimator θ̂(X1, ..., Xn) for θ is said to be consistent if θ̂n → θ in probability.

Definition 22. A sequence of tests Tn = (gn, Rn), with sample size n, test statistic gn(X1, ..., Xn) and rejection area Rn for the rejection of H0, is said to be consistent if the two following properties hold:

1. P(reject | H0 is true) = P(gn ∈ Rn | H0 is true) → 0 as n → ∞.

2. P(reject | H1 is true) = P(gn ∈ Rn | H1 is true) → 1 as n → ∞, for any H1 ≠ H0.

The general method for constructing a consistent test for independence is as follows:

1. Find a distance measure D(X, Y) ≥ 0 such that D(X, Y) = 0 iff X, Y are independent.

2. Find a consistent sequence of estimators D̂n for D.

3. Find a sequence of thresholds εn and define rejection areas Rn = {D̂n > εn}.

Remark 23. The final step is often tricky, as it depends on the rate of convergence of D̂n.

Example 24. Testing for independence using a kernel method:

Suppose we again have data {(xi, yi)}_{i=1}^n ∼ F i.i.d., where yi is binary, and we would like to test H0: X, Y are independent. We define a kernel Kh(xi, xj) and compare the kernel values for pairs from the same and from different groups by calculating the following statistic:

T = Σ_{i,j=1}^n Kh(xi, xj) [ 1{yi = yj = 0}/n0² + 1{yi = yj = 1}/n1² − 2·1{yi ≠ yj}/(2 n0 n1) ]

The kernel function could, for example, be a Gaussian kernel Kh(xi, xj) = e^{−‖xi − xj‖²/h²}. Here h is a width parameter which determines how fast the kernel goes to zero and in turn impacts the rejection area (it is possible to pick an optimal h value using a more sophisticated technique). When the X values do not depend on Y we would have E[T] = 0, and thus it suffices to test the hypothesis that E[T] = 0 by comparing the value of |T| to an appropriate critical value. This can be done either by a normal approximation of the distribution of T or directly using permutations.
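A sketch of the statistic T for scalar data, with a Gaussian kernel and a hypothetical width h = 1 (significance would then be assessed by the permutation test described above, which also absorbs the small bias from the i = j diagonal terms):

```python
import math
import random

def kernel_stat(x, y, h=1.0):
    """The statistic T of Example 24 with a Gaussian kernel, for binary labels y."""
    n0 = sum(1 for yi in y if yi == 0)
    n1 = sum(1 for yi in y if yi == 1)
    t = 0.0
    for i in range(len(x)):
        for j in range(len(x)):
            k = math.exp(-((x[i] - x[j]) ** 2) / h**2)   # Gaussian kernel
            if y[i] == y[j] == 0:
                t += k / n0**2          # same-group pair, group 0
            elif y[i] == y[j] == 1:
                t += k / n1**2          # same-group pair, group 1
            else:
                t -= k / (n0 * n1)      # cross-group pair
    return t

random.seed(4)
y = [0] * 30 + [1] * 30
x_dep = [random.gauss(2.0 * yi, 1.0) for yi in y]   # X depends on Y (mean shift)
x_ind = [random.gauss(0.0, 1.0) for _ in y]         # X independent of Y
print(kernel_stat(x_dep, y), kernel_stat(x_ind, y))
```

In the dependent case same-group points are closer together than cross-group points, so T is clearly positive; in the independent case it stays near zero.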


3.1 Classical tests for independence of scalar RVs:

There are several measures of dependence for scalar random variables, the best known of which is the Pearson correlation coefficient, given by:

R(x, y) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )

This is an estimator of the correlation coefficient given by

ρ(X, Y) = E[(X − E(X))(Y − E(Y))] / √( E[(X − E(X))²] · E[(Y − E(Y))²] ) = Cov(X, Y) / √(Var(X) · Var(Y))

The correlation coefficient represents the strength of the linear dependence between X and Y, but is not informative regarding other types of dependence. If one assumes a normal distribution of X and Y then there is a closed form for the distribution of R; more generally, the distribution can always be approximated using permutations.

Fact 25. When ρ(X, Y) = 0 we say that X, Y are uncorrelated. It is known that independent RVs are always uncorrelated, but the opposite is not true.

Figure 3.1: Here we see the values of the Pearson correlation coefficient for various types of dependence and various levels of noise. It can be seen in the first and second rows that the value of the coefficient depends on the amount of noise but does not depend on the slope, as long as the slope is not zero. The third row shows that for non-linear dependence the coefficient can very well be zero, despite the fact that the variables have a functional relationship and are not independent.

Another measure of dependence is the Kendall tau rank correlation coefficient, defined by:

τ(x, y) = Σ_{i<j} sgn((xi − xj)(yi − yj)) / ( n(n − 1)/2 )

Using permutations it is again possible to perform a hypothesis test for the significance of this coefficient. The main disadvantages of τ are that it is only defined for scalar RVs and that it is only capable of detecting monotonic dependencies. On the other hand, it has the advantage of being a distribution free statistic.
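The definition of τ translates directly into code; the tiny examples below (made-up data, no ties) illustrate that τ reacts to any monotonic dependence, not only a linear one:

```python
def kendall_tau(x, y):
    """Kendall tau: normalized count of concordant minus discordant pairs."""
    n = len(x)
    def sgn(v):
        return (v > 0) - (v < 0)
    s = sum(sgn((x[i] - x[j]) * (y[i] - y[j]))
            for i in range(n) for j in range(i + 1, n))
    return s / (n * (n - 1) / 2)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(kendall_tau(xs, [v**3 for v in xs]))   # → 1.0 (monotone increasing, non-linear)
print(kendall_tau(xs, [-v for v in xs]))     # → -1.0 (monotone decreasing)
```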


3.2 Testing for independence using permutations:

Fact 26. Given two RVs X ∈ R^p, Y ∈ R^q with distributions PX, PY and joint distribution PXY, we know that X, Y are independent iff PX(x)PY(y) = PXY(x, y) for all (x, y) ∈ R^p × R^q. If X, Y are continuous with densities fX, fY and joint density fX,Y, then an equivalent condition is that fX(x) · fY(y) = fX,Y(x, y) for all (x, y) ∈ R^p × R^q.

Assume we are given two RVs X ∈ R^p, Y ∈ R^q with distributions PX, PY and joint distribution PXY; we would like a statistical test which determines whether the probability distributions PX PY and PXY are identical. When the distributions are both univariate, the Kolmogorov-Smirnov two-sample test provides an analytic solution to this problem, but it does not generalize to the multivariate case. We would thus want to approach the problem by defining a statistic whose distribution differs depending on whether X, Y are dependent or independent, and then use a permutation test in order to test for significance of this statistic.

Remark 27. The problem of testing for independence is similar (but not identical) to testing for equality of distributions in a two-sample setting. When testing for equality of distributions we are given a set of data sampled from a distribution P and a different set of data sampled from a distribution Q, and we would like to determine whether P ≡ Q based on these two independent samples. In comparison, when testing for independence we would like to determine whether PXY ≡ PX PY, but in this case we evaluate these distributions based on a single sample.

There are several types of methods for testing for independence using permutations:

• Kernel methods: these methods rely on the definition of a kernel K(x, y) which measures similarity between x and y. The values of the kernel are then computed and compared for pairs from the same group and from the two different groups.

• Geometric methods: these methods rely on defining a distance measure D(X, Y) between distributions.

• Information based methods: it can be shown that two RVs X, Y are independent iff their mutual information I(X, Y) equals zero. This provides a way of testing for independence by computing or estimating the mutual information.

3.2.1 Testing for independence using a Kernel function:

In example 24 we described a two-sample method for testing independence given a sample {(xi, yi)}_{i=1}^n, where yi was a binary value, based on the statistic

T = Σ_{i,j=1}^n Kh(xi, xj) [ 1{yi = yj = 0}/n0² + 1{yi = yj = 1}/n1² − 2·1{yi ≠ yj}/(2 n0 n1) ]   (3.1)

It can be seen that for yi = yj we get a value with a positive sign, and for yi ≠ yj a value with a negative sign; all these values are summed up, and a permutation test is used to check whether the computed value is significantly different from zero. In the more general case, where we are given a single sample {(xi, yi)}_{i=1}^n ∼ PXY i.i.d., we need to adapt this method to suit our needs (see [5]). One way to do this is to treat the pairs (xi, yi) as sampled from PXY and the pairs (xi, yj), i ≠ j, as samples from PX PY. To this end we define A := {(xi, yi)}_{i=1}^n and B := {(xi, yj) | 1 ≤ i ≠ j ≤ n}, and notice that under H0, A is an i.i.d. sample from PXY and B is a sample from PX PY. We thus want to determine whether these two samples are identically distributed, since this is equivalent to independence of X, Y. In order to do this we compute the following values:

+Kh((xi, yi), (xj, yj)) for all i ≠ j  (two points from the same group A)

−Kh((xi, yi), (xl, yk)) for all i and l ≠ k  (one point from A and one from B)

−Kh((xl, yk), (xi, yi)) for all l ≠ k and all i  (one point from B and one from A)

+Kh((xi, yj), (xl, yk)) for all i ≠ j, l ≠ k  (two points from group B)

where we note that the pairs (xi, yj) ∈ R^p × R^q are the "points" in our sample. We then calculate the test statistic given in formula 3.1 and perform a standard permutation test in order to test for significance.

Remark 28. One small problem with this method is that there are usually many more points in group B than in A.

3.2.2 Testing for independence using the Distance Correlation (dCor) method:

Definition 29. The characteristic function of a RV X is defined by ϕX(t) = E[e^{i t^T X}].

Remark 30. If X has density fX then ϕX is the Fourier transform of fX.

Definition 31. Given two vectors x, y ∈ R^p we define the relation x ≤ y iff xi ≤ yi for all i.

Definition 32. Given a realization x1, ..., xn ∈ R^p of the random sequence X1, ..., Xn ∼ F i.i.d., we define the empirical cumulative distribution of F as follows:

F̂n_X(t) = (1/n) Σ_{i=1}^n 1{xi ≤ t}   (3.2)

Definition 33. Given any sequence of values a_{ij}, indexed by i ∈ {1, ..., n} and j ∈ {1, ..., m}, we denote:

a_{i·} := (1/m) Σ_{j=1}^m a_{ij} ,  a_{·j} := (1/n) Σ_{i=1}^n a_{ij} ,  a_{··} := (1/(n·m)) Σ_{i=1}^n Σ_{j=1}^m a_{ij}

If the values a_{ij} were to be laid out in an n × m table, these would be the row, column and total averages respectively.

We mentioned previously that the Pearson correlation coefficient measures dependence, but only of a linear nature. More general dependence can be measured by looking at the correlations between the distances of the observations from one another, rather than at the original observations. The general intuition is that if X and Y are dependent, then small distances between X values should correspond to small distances between Y values. The dCov method relies on this intuition in order to test for dependence and is performed as follows:

1. Sample {(xi, yi)}_{i=1}^n where (xi, yi) ∈ R^p × R^q, and denote x := (x1, ..., xn), y := (y1, ..., yn).

2. Compute all the pairwise distances a_{ij} = ‖xi − xj‖₂ and b_{ij} = ‖yi − yj‖₂.

3. Center the distances by subtracting the row and column averages and adding back the total average:

A_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}

B_{ij} = b_{ij} − b_{i·} − b_{·j} + b_{··}

4. Define the dCov statistic and dCor statistic as follows:

dCov(x, y) := V²(x, y) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n A_{ij} B_{ij}

dCor(x, y) := R²(x, y) = V²(x, y) / ( √V²(x) · √V²(y) ) if V²(x)V²(y) > 0, and 0 if V²(x)V²(y) = 0,

where V²(x) = V²(x, x) and V²(y) = V²(y, y).

5. Use permutations to compute the distribution of R² and perform a hypothesis test for independence.
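Steps 1-4 can be sketched for scalar data as follows (using the standard double-centering A_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}); step 5 would wrap this in the permutation test described earlier. The example data y = x² are deliberately non-monotonic, a case where Pearson correlation fails but dCor does not:

```python
import math

def dcov_stats(x, y):
    """Sample distance covariance V^2(x, y) and distance correlation R^2(x, y)
    for scalar samples x, y."""
    n = len(x)
    a = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
    b = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]

    def center(m):
        # Double-center: subtract row and column means, add back the grand mean.
        row = [sum(r) / n for r in m]
        col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
        tot = sum(row) / n
        return [[m[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

    A, B = center(a), center(b)
    v_xy = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    v_x = sum(A[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    v_y = sum(B[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    r2 = v_xy / math.sqrt(v_x * v_y) if v_x * v_y > 0 else 0.0
    return v_xy, r2

xs = [i / 10 for i in range(-10, 11)]   # -1.0, ..., 1.0 (symmetric grid)
ys = [v * v for v in xs]                # non-monotonic dependence y = x^2
_, r2 = dcov_stats(xs, ys)
print(r2)  # clearly positive, although the Pearson correlation here is 0
```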

Remark 34. Note that the computational complexity here is O(n²), for the computation of all distance pairs.

Why this method works: It can be shown that V²(x, y) is an estimator of the parameter V²(X, Y) defined by:

V²(X, Y) := (1/(c_p c_q)) ∫_{R^{p+q}} |ϕ_{X,Y}(x, y) − ϕ_X(x) ϕ_Y(y)|² / (‖x‖_2^{1+p} ‖y‖_2^{1+q}) dx dy

where c_d := π^{(1+d)/2} / Γ((1+d)/2) is a normalizing constant. It then follows that dCor(x, y) is an estimator of

dCor(X, Y) = R²(X, Y) = V²(X, Y) / √(V²(X) V²(Y))  if V²(X) V²(Y) > 0,  and 0 if V²(X) V²(Y) = 0

It can be shown that V²(X, Y) ≥ 0, and that X, Y are independent iff V²(X, Y) = 0.

Another important property: It can be shown that the sample version V² equals the population version V² if one uses the empirical characteristic functions given by:

ϕ^n_X(x) = (1/n) ∑_{k=1}^n e^{i⟨x, x_k⟩},   ϕ^n_Y(y) = (1/n) ∑_{k=1}^n e^{i⟨y, y_k⟩},   ϕ^n_{X,Y}(x, y) = (1/n) ∑_{k=1}^n e^{i(⟨x, x_k⟩ + ⟨y, y_k⟩)}

That is, if we sample x := (x_1, ..., x_n) and y := (y_1, ..., y_n), calculate these empirical characteristic functions, and define RVs X_n, Y_n whose distribution is derived from these characteristic functions, then V²(X_n, Y_n) equals V²(x, y). Based on this relationship one can prove some interesting properties of the statistic R²(x, y), such as:


1. 0 ≤ R²(x, y) ≤ 1, and R²(x, y) = 0 if x_1 = ... = x_n or y_1 = ... = y_n.

2. The statistics V²(x, y) and R²(x, y) converge almost surely to V²(X, Y) and R²(X, Y) as n → ∞.

3. There exists a sequence of thresholds ε_n such that the sequence of tests that rejects H_0: "X, Y are independent" when R²((x_1, ..., x_n), (y_1, ..., y_n)) > ε_n is a consistent sequence of tests.

References: [11, 12]

3.2.3 Another distance based method for continuous univariate RVs:

Fact 35. If X, Y are two dependent univariate RVs then there exists (x, y) ∈ R² such that F_{X,Y}(x, y) ≠ F_X(x) F_Y(y). Furthermore, if X, Y are continuous then there exists (x, y) such that f_{X,Y}(x, y) ≠ f_X(x) f_Y(y), and by continuity there is a ball B := B((x, y), ε) such that f_{X,Y}(u, v) ≠ f_X(u) f_Y(v) for all (u, v) ∈ B.

Motivation: Based on the aforementioned fact, it can be shown that if X, Y are continuous and dependent univariate RVs then there exists p := (x_p, y_p) ∈ R² such that, if the plane is divided into four quadrants around p, denoted Q^p_{1,1}, Q^p_{1,2}, Q^p_{2,1}, Q^p_{2,2}, then, given O^p_{j,k} = ∫∫_{Q^p_{j,k}} f_{X,Y}(u, v) du dv, it holds that O^p_{1,1} · O^p_{2,2} ≠ O^p_{1,2} · O^p_{2,1}. This is also equivalent to the indicator RVs 1{X > x_p} and 1{Y > y_p} being dependent.

Conclusion: Dependence of two univariate RVs can be identified by inspecting divisions of the plane into quadrants. The main question is how to choose the center of a division that will reveal such dependence. The idea suggested by Hoeffding [6] is to simply let the data itself define the centers.

Figure 3.2: A division of the plane into four quadrants where one of the observations is selected as the center of the division. If we count the number of observations in each quadrant then, as the sample size grows, this approximates the probability of a random sample belonging to that quadrant.

The procedure is carried out as follows, given a sample {(x_i, y_i)}_{i=1}^n:

• Perform n divisions of the plane into quadrants, where each time p_i = (x_i, y_i) is taken to be the center point. For each division compute a 2×2 table of the values o^{p_i}_{j,k}, j, k ∈ {1, 2}; these values are, up to normalization, estimators of the values O^{p_i}_{j,k} previously defined.

• Compute the following test statistic:

H_n = (1/n⁴) ∑_{i=1}^n (o^{p_i}_{1,1} o^{p_i}_{2,2} − o^{p_i}_{1,2} o^{p_i}_{2,1})²

• Perform a permutation test to check for significance of the test statistic.

Remark 36. This test statistic is asymptotically equivalent to the following functional of the empirical CDF:

H̄_n := ∫∫_{R²} (F^n_{X,Y}(x, y) − F^n_X(x) F^n_Y(y))² dF^n_{X,Y}(x, y)    (3.3)

where the empirical CDFs are defined as in Definition 32. This fact can be used to prove consistency of the sequence of tests defined by H_n (with appropriate rejection regions). Since X, Y are independent iff F_{X,Y} ≡ F_X F_Y, it can be seen why H̄_n, and thus H_n, may be good measures of independence, and we would expect H̄_n to converge to zero in the case of independence. Note that the integration in the definition is taken according to the empirical measure dF^n_{X,Y}, and since this is in fact a discrete distribution, this actually means summing the values of the integrand at the atoms of the distribution.
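Hoeffding's quadrant procedure and its permutation test can be sketched as follows. This is a minimal illustration, not a canonical implementation of [6]: the function names and the tie-breaking convention for points on a quadrant boundary are my own choices.

```python
import numpy as np

def quadrant_statistic(x, y):
    """H_n = (1/n^4) * sum_i (o11*o22 - o12*o21)^2, where the counts o_jk
    are taken over the four quadrants centered at each observation (x_i, y_i)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        right = x > x[i]
        above = y > y[i]
        # Boundary convention (an assumption): ties, including the center
        # point itself, fall into the "not right / not above" counts.
        o11 = np.sum(right & above)
        o12 = np.sum(~right & above)
        o21 = np.sum(right & ~above)
        o22 = np.sum(~right & ~above)
        total += (o11 * o22 - o12 * o21) ** 2
    return total / n**4

def quadrant_permutation_test(x, y, n_perm=300, seed=0):
    """Permutation p-value for H0 'X and Y are independent'."""
    rng = np.random.default_rng(seed)
    obs = quadrant_statistic(x, y)
    hits = sum(quadrant_statistic(x, y[rng.permutation(len(y))]) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (n_perm + 1)
```

Each evaluation of the statistic is O(n²), matching the complexity noted for the related methods in this section.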

3.2.4 An extension of Hoeffding's method:

Kaufman et al. [7] suggested a variation on Hoeffding's method. The main difference is that instead of using divisions of the plane into quadrants based on a single center, one looks at k different points p_1, ..., p_k and divides the plane into (k+1)² regions (some of which are infinite) by drawing the vertical and horizontal lines that pass through the selected points. The test statistic is then defined as follows:

S_{n,k} = ∑_{1 ≤ i_1 < ... < i_{k−1} ≤ n} ∑_{j=1}^{(k+1)²} (o^{i_1,...,i_{k−1}}_j − e^{i_1,...,i_{k−1}}_j)² / e^{i_1,...,i_{k−1}}_j

The values o^{i_1,...,i_{k−1}}_j are the observed values (the number of observations found in region j of the division defined by the points p_{i_1}, ..., p_{i_{k−1}}), and the values e^{i_1,...,i_{k−1}}_j are the expected values under H_0 (independence), obtained by multiplying the marginal CDF values F_X, F_Y for that region of the division.

Remark 37. The value ∑_{j=1}^{(k+1)²} (o^{i_1,...,i_{k−1}}_j − e^{i_1,...,i_{k−1}}_j)² / e^{i_1,...,i_{k−1}}_j is actually the χ² value computed for a (k+1) × (k+1) contingency table with the aforementioned observed and expected values.

The computational complexity of this method is O(n^{max(2, min(k−1, 4))}).

• Advantage: Both this method and Hoeffding's method can be shown to be distribution-free.

• Disadvantage: Both methods are only applicable to two univariate RVs.

It remains an open question to find a distribution-free test for multivariate RVs.

Remark 38. The dCor test statistic is not distribution-free.


3.2.5 An information theoretic test for independence:

Definition 39. Given two continuous RVs X ∈ R^p, Y ∈ R^q with densities f_X, f_Y, the mutual information of X and Y is:

I(X, Y) := ∫_{R^{p+q}} f_{X,Y}(u, v) log( f_{X,Y}(u, v) / (f_X(u) f_Y(v)) ) du dv

Remark 40. The mutual information can similarly be defined for discrete or mixed RVs.

Fact 41. Given two RVs X ∈ R^p, Y ∈ R^q, I(X, Y) ≥ 0, and I(X, Y) = 0 iff the two are independent.

Given this fact, the problem of determining dependence is reduced to estimation of the mutual information. The problem is that this estimation is usually difficult, and tests based on this method usually have low power. One such test is the MIC test [8].

Remark 42. There is some evidence that better estimators of the mutual information can yield efficient tests; see [7].
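As a concrete illustration of the estimation problem, here is the simplest plug-in approach: bin the data, estimate I(X, Y) from the resulting 2-D histogram, and calibrate the statistic with permutations. The binned estimator is biased and sensitive to the arbitrary choice of bin count, which is exactly the difficulty described above; the function names and the default of 8 bins are my own assumptions.

```python
import numpy as np

def mutual_information_binned(x, y, bins=8):
    """Plug-in estimate of I(X, Y) from a 2-D histogram (a simple, biased
    estimator; serious use needs bias correction or k-NN estimators)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                      # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal of X
    py = pxy.sum(axis=0, keepdims=True)            # marginal of Y
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def mi_permutation_test(x, y, bins=8, n_perm=300, seed=0):
    """Permutation p-value for H0 'X and Y are independent', using the
    plug-in MI as the test statistic."""
    rng = np.random.default_rng(seed)
    obs = mutual_information_binned(x, y, bins)
    hits = sum(mutual_information_binned(x, y[rng.permutation(len(y))], bins) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (n_perm + 1)
```

The plug-in estimate is the KL divergence of the empirical joint from the product of its marginals, so it is always nonnegative, consistent with Fact 41.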

3.3 Comparison of different tests for independence:

For any type of hypothesis test there is always the question of whether there is a test (a test statistic) which is better than all others in some sense. Ideally we would want a "universal" test which is best under any or most alternatives, for example in a minimax sense.

Definition 43. Suppose we are given a parameterized family of distributions f(x; θ) depending on an unknown scalar parameter θ ∈ Θ, and a partition of the parameter space Θ = Θ_0 ∪ Θ_1 (a disjoint union); we denote by H_i the hypothesis θ ∈ Θ_i. Given a test statistic T(X⃗_n) := T(X_1, ..., X_n), consider the corresponding binary test function

φ(X⃗_n) = 1 if T(X⃗_n) ∈ R_n(α), and 0 otherwise,

for testing a one-sided hypothesis with confidence α. We say that T is uniformly most powerful (UMP) if for any other test statistic T′(X⃗_n) := T′(X_1, ..., X_n) and corresponding binary test function φ′(X⃗_n) for which

sup_{θ∈Θ_0} E_θ(φ′(X⃗_n)) = α′ ≤ α = sup_{θ∈Θ_0} E_θ(φ(X⃗_n))

it holds that:

E_θ(φ′(X⃗_n)) = 1 − β′ ≤ 1 − β = E_θ(φ(X⃗_n))   ∀ θ ∈ Θ_1

Fact 44. It can be shown that there are no UMP tests for two-sided hypothesis testing, and no UMP tests for vector-valued parameterized families. Furthermore, if the null is of the form H_0: θ = θ_0 and the alternative is one-sided, H_1: θ ≥ θ_0, then the Neyman-Pearson lemma guarantees that the likelihood ratio test is a UMP test.


Remark 45. In particular, the likelihood ratio test for independence (which is not UMP, since this is not a simple one-sided parameter hypothesis) rejects H_0 if

∑_{i=1}^n log( f_{X,Y}(x_i, y_i) / (f_X(x_i) f_Y(y_i)) ) > C

In the context of testing for independence it is almost certain that no UMP test exists; one way of reaching this conclusion is the fact that for different types of alternatives (different types of dependence) there are better-suited tests. Thus, instead of looking for a test which is optimal under any alternative, we would like to be able to compare different tests under a specific set of alternative hypotheses in which we are interested, for example specific alternatives we believe are likely to occur in real data. Given such a set of alternative hypotheses, we can compare test power using simulations as follows:

1. Generate data from various kinds of dependency models using simulations.

2. Perform the various hypothesis tests one wishes to compare on the simulated data, with a given confidence α.

3. Estimate the power of each test using permutations.

4. Evaluate the power of each test as a function of the dependency model, the sample size, and α.
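The comparison scheme above can be sketched as follows. As a stand-in for the various tests, this example uses a permutation test based on |Pearson correlation|, and a single hypothetical dependency model; `linear_model`, `estimate_power`, and all parameter choices are illustrative assumptions, not part of the notes.

```python
import numpy as np

def pearson_perm_pval(x, y, n_perm=199, rng=None):
    """Permutation p-value using |Pearson correlation| as the test statistic."""
    rng = rng or np.random.default_rng()
    obs = abs(np.corrcoef(x, y)[0, 1])
    hits = sum(abs(np.corrcoef(x, y[rng.permutation(len(y))])[0, 1]) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (n_perm + 1)

def estimate_power(model, test, n=50, alpha=0.05, n_rep=100, seed=0):
    """Fraction of simulated data sets (drawn from `model`) on which
    `test` rejects at level alpha -- the Monte Carlo power estimate."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        x, y = model(n, rng)
        if test(x, y, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_rep

def linear_model(n, rng):
    """A hypothetical dependency model: linear signal plus Gaussian noise."""
    x = rng.normal(size=n)
    return x, x + rng.normal(size=n)
```

Repeating `estimate_power` for several models (quadratic, circular, etc.) and several tests reproduces the kind of comparison summarized in Figure 3.3.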

Figure 3.3: A list of tests for Independence and their performance under various dependence models.

3.3.1 Equitable tests for independence:

When testing for independence of two RVs X, Y there are two possible questions one can ask:

1. Is there dependence between the RVs?

2. How strong is the dependence?

In this context the following definition arises:


Definition 46. A dependency measure D(X, Y) will be called equitable if, given RVs X_1, X_2, Y, a noise factor Z ∼ N(0, σ²), and two smooth functions g_1, g_2 such that X_i = g_i(Y) + Z, it holds that D(X_1, Y) = D(X_2, Y).

Remark 47. This basically means that the measure D(X, Y) evaluates the strength of the dependence equally, regardless of the type of functional dependence.

Figure 3.4: dCor values computed for different types of dependence and noise levels. This illustration shows that while the dCor measure is capable of detecting all types of dependence (the value is positive in all cases where dependence exists), it is not an equitable measure. For example, in the last row the noise level for the circle and semi-circle is almost identical but the dCor values are quite different (0.2 vs 0.5).

It has also been shown [8] that the mutual-information-based MIC statistic is not equitable. In fact, it remains an open question to find a dependency measure which is equitable, even under restrictions on the functional dependence.


4 Multiple Hypothesis Testing (MHT):

In many cases, when we are faced with performing some statistical test, we will be required to conduct more than a single hypothesis test. At the same time, we will want to control not only the probability of false rejection in each single hypothesis test, but also the probability of false rejection across all tests performed.

Remark 48. There is no assumption of any relation between the hypothesis tests; each hypothesis can have its own set of data, test statistic, rejection region, and p-value.

Example 49. Suppose we want to test the connection between m = 20,000 genes and a disease with confidence α = 0.05 in every hypothesis. Without any further restrictions, if we conduct each test separately with α = 0.05, then even if none of the genes are connected to the disease the expected number of rejections is mα = 1000. That is, we expect 1000 false rejections. This problem is known as the multiple-hypothesis problem.

Definition 50. Given a MHT with m hypotheses we will denote the following:

#                  True Null    True Alternative    Total
Not Significant        U                T            m − R
Significant            V                S            R
Total                 m_0           m − m_0          m

For example, U is the number of hypotheses where H_0 is true and was indeed not rejected, and V is the number of hypotheses where H_0 is true but was rejected (false rejections).

Remark 51. For simplicity we will assume that the test statistics we discuss are all continuous, and in particular that the distribution of the p-values under H_0 is Uniform[0, 1]. Most of the results we discuss remain approximately true even if the test statistics are discrete.

Remark 52. Recall that when conducting a hypothesis test based on a statistic T(X_1, ..., X_n), the p-value of the test is itself a statistic (a RV) which is a function of X_1, ..., X_n. We will now denote this RV by P := P(X_1, ..., X_n) and denote its realized value by p := P(x_1, ..., x_n).

4.1 Family Wise Error Rate (FWER):

When testing a single hypothesis we wanted to bound the type-1 error such that P(reject | H_0 true) ≤ α. Following this logic, the Family-Wise Error Rate (FWER) of a multiple hypothesis test is defined as:

FWER = P(there is at least one false rejection) = P(V > 0)

where V is, as previously denoted, the number of false rejections. Similarly to a single hypothesis test, where we wanted to upper bound the type-1 error, we would want to upper bound the FWER and ensure our testing method guarantees FWER ≤ α. As we have already seen, in order to do this it is not sufficient to simply set the confidence level of each test to α. There are, however, corrections that do allow controlling the FWER, simply by selecting appropriate rejection thresholds (or confidence levels) for all the individual tests.


4.1.1 The Bonferroni Correction:

A naive solution for FWER control when conducting m hypothesis tests with p-values P_i is to simply set the confidence level of each test to α/m. Using a union bound it can be seen that this achieves the required result:

FWER = P(there is at least one false rejection) ≤ P(∪_{i=1}^m {P_i < α/m}) ≤ ∑_{i=1}^m P({P_i < α/m}) = m · (α/m) = α

where P({P_i < α/m}) = α/m since under H_0 the assumption is that P_i ∼ Uniform[0, 1]. The main problem with this approach is that it is extremely conservative, and it would be extremely difficult to reject any hypothesis, even in cases where the alternative is true and even with powerful tests. That is, there is a significant loss of overall power. As is evident, control of the FWER is simply too restrictive to allow for an efficient testing method; we thus turn to an alternative approach for controlling the error when conducting multiple hypothesis tests:

4.2 False Discovery Rate (FDR):

The False Discovery Rate is a less conservative measure of the quality of a MHT. The idea is that we will allow false rejections, but we will want to ensure that the proportion of false rejections out of all rejections is low.

Definition 53. Given a MHT with m hypotheses, we denote R⁺ = max{R, 1} and Q = V/R⁺. Q is thus the proportion of false rejections out of the total number of rejections. We then define the FDR of the MHT to be FDR = E[Q], that is, the expected false rejection proportion.

Remark 54. The expectation here is taken with respect to the joint distribution of the p-values of all the hypothesis tests conducted. This is some distribution defined on [0, 1]^m such that the marginal distribution of every p-value is Uniform[0, 1] under H_0.

We would obviously want our testing procedure (note that in this context the procedure is something that we would in theory want to repeat for different data sets but with the same underlying hypotheses) to ensure a low FDR value. The following procedure, suggested by Benjamini and Hochberg [1], allows, given a confidence value α, determining which hypotheses should be rejected while ensuring FDR ≤ α:

The Benjamini-Hochberg (BH) Procedure:

1. Conduct all the individual hypothesis tests and compute the realized p-values p_1, ..., p_m.

2. Order the p-values such that p_(1) ≤ p_(2) ≤ ... ≤ p_(m).

3. Compute i* = max{ i ∈ {1, ..., m} | p_(i) ≤ α·i/m } (if it exists).

4. Reject all the hypotheses that match the p-values p_(1), ..., p_(i*) (if i* doesn't exist, reject none).

The rejected set of hypotheses is thus Rejected_BH(p_1, ..., p_m; α) := { i ∈ {1, ..., m} | p_i ≤ α·i*/m }.
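The BH procedure is easy to implement directly from the definition; a minimal sketch (the function name is mine):

```python
import numpy as np

def bh_procedure(pvals, alpha=0.05):
    """Benjamini-Hochberg: return a boolean mask of rejected hypotheses.
    Finds i* = max{i : p_(i) <= alpha * i / m} and rejects the hypotheses
    corresponding to the i* smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        reject[order[: below[-1] + 1]] = True   # reject p_(1), ..., p_(i*)
    return reject
```

Note that a p-value larger than its own threshold can still be rejected if some larger order statistic falls below its threshold; this step-up behavior is exactly the max over i in step 3.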


Figure 4.1:

Claim 55. Assume we conduct m hypothesis tests based on continuous independent test statistics T_1, ..., T_m using the BH procedure with parameter α. Then FDR = (m_0/m)·α ≤ α.

Remark 56. If the statistics are discrete it is still guaranteed that FDR ≤ (m_0/m)·α.

Proof. We note that the result of the BH procedure with parameter α is a function only of the vector (p_1, ..., p_m) of p-values obtained in the tests performed. Define the following events, informally:

C^(i)_k = {the i'th hypothesis was rejected ⟹ exactly k hypotheses were rejected}

The idea is that if we fix all the p-values p_j for j ≠ i, then any value of the i'th p-value that leads to rejection of the i'th hypothesis results in exactly k hypotheses being rejected in total. More formally, (p_1, ..., p_m) ∈ C^(i)_k iff for every q_i ∈ [0, 1] exactly one of the following holds:

1. i ∉ Reject_BH(p_1, ..., p_{i−1}, q_i, p_{i+1}, ..., p_m; α) - the p-value q_i does not lead to rejection of the i'th hypothesis.

2. i ∈ Reject_BH(p_1, ..., p_{i−1}, q_i, p_{i+1}, ..., p_m; α) and also |Reject_BH(p_1, ..., p_{i−1}, q_i, p_{i+1}, ..., p_m; α)| = k.

That is, for any value of q_i, if the i'th hypothesis is rejected then exactly k hypotheses are rejected. Two important properties of the events C^(i)_k (which we will not prove) are:

1. The event C^(i)_k does not depend on the value of the RV P_i, and is thus independent of the event {P_i ≤ kα/m}.

2. For any i, the collection {C^(i)_k | k ∈ {1, ..., m}} is a partition of the sample space, and thus ∑_{k=1}^m P(C^(i)_k) = 1.

We note that the event {P_i ≤ kα/m} ∩ C^(i)_k is identical by definition to the event {P_i ≤ kα/m} ∩ {R = k}, and thus:

P({R = k}) · P(P_i ≤ kα/m | R = k) = P({P_i ≤ kα/m} ∩ {R = k}) = P({P_i ≤ kα/m} ∩ C^(i)_k)

where R is again the number of rejections. Indexing the true null hypotheses by i = 1, ..., m_0, independence and linearity of expectation give:

FDR = E[V/R⁺]
    = ∑_{k=1}^m P({R = k}) · E[V/k | R = k]
    = ∑_{k=1}^m ∑_{i=1}^{m_0} (1/k) P({R = k}) P(P_i ≤ kα/m | R = k)
    = ∑_{k=1}^m ∑_{i=1}^{m_0} (1/k) P({P_i ≤ kα/m} ∩ C^(i)_k)        (same events)
    = ∑_{k=1}^m ∑_{i=1}^{m_0} (1/k) P({P_i ≤ kα/m}) · P(C^(i)_k)     (independence)
    = ∑_{k=1}^m ∑_{i=1}^{m_0} (1/k) · (kα/m) · P(C^(i)_k)            (P_i | H_0 ∼ U[0, 1])
    = (α/m) ∑_{i=1}^{m_0} [∑_{k=1}^m P(C^(i)_k)]                     (partition)
    = (α/m) · m_0

as required.

We have shown that the BH procedure controls the FDR given independent test statistics; the question remains what happens when independence cannot be assumed.

Definition 57. Recall that given x, y ∈ R^m we denote x ≤ y iff x_i ≤ y_i ∀i. Furthermore, we will say that a subset D ⊆ R^m is ascending if for all x ∈ D and all y ∈ R^m, x ≤ y ⟹ y ∈ D.

Definition 58. A random vector X := (X_1, ..., X_m) will be said to be Positively-Regression-Dependent on a Subset (PRDS) if for any ascending D ⊆ R^m and for all 1 ≤ i ≤ m, the function ψ_{i,D}(x) := P(X ∈ D | X_i = x) is non-decreasing in x ∈ R. This means that the probability that X ∈ D does not decrease when X_i increases.

Claim 59. Assume we conduct m hypothesis tests based on continuous test statistics T_1, ..., T_m that have the PRDS property. Then the BH procedure with parameter α guarantees FDR ≤ (m_0/m)·α.

Proof. A generalization of the proof for independent statistics; see [3].

Remark 60. Even though in the case of PRDS dependency (and more generally in other models of positive dependence) the BH procedure controls the FDR, which is the expectation of V/R⁺, the variance of V/R⁺ generally tends to grow as the level of positive dependence grows.

4.2.1 The BH procedure under more general dependence:

It turns out that without any assumptions regarding the nature of the dependence of the test statistics, it is possible for the BH procedure to fail, and the FDR can exceed the (m_0/m)·α bound. However, when such a deviation does occur, the degree to which the FDR exceeds the bound is not large, as is emphasized by the following claim:


Claim 61. Assume we conduct m hypothesis tests based on continuous test statistics T_1, ..., T_m. Then the BH procedure with parameter α guarantees the following bound:

FDR ≤ (m_0/m) · (∑_{i=1}^m 1/i) · α ≈ (m_0 log(m)/m) · α

Proof. See [3].

Remark 62. Simulation studies have shown that in most cases which are not pathological, the method usually achieves the (m_0/m)·α bound, and when it does not, the deviation from the bound is relatively small and does not reach the level of multiplication by the logarithmic factor.

Conclusion: The BH procedure is not particularly sensitive to the existence of dependence between the test statistics.

4.2.2 What is the true meaning of FDR control:

We are reminded that the FDR is the expected value of the false rejection ratio, and thus control of the FDR does not guarantee a low false rejection ratio; specifically, Var(V/R⁺) can be large even when E[V/R⁺] is kept small.

Example 63. Suppose we conduct a MHT for testing m = 10,000 hypotheses using the BH procedure with parameter α = 0.1, and 1000 rejections were obtained. We would like to deduce that approximately 100 of these rejections are false and thus approximately 900 of our findings are true. However, an alternative explanation is that there is a strong positive dependence between the various test statistics, so that in 10% of the cases in which we implement the procedure there would be 1000 rejections, while in the other 90% there would be none (since we chose α = 0.1). In both scenarios the FDR is at most 0.1, but the actual circumstances are very different.

The problem is that the BH procedure is still quite conservative; especially in cases where m_0 ≪ m, we get a situation in which FDR ≤ (m_0/m)·α ≪ α. In actuality we would want a procedure that has enough power to detect effects that are not significant under the BH procedure and still guarantees FDR ≤ α. Notice that if we knew m_0 before using the procedure (which is not possible), we could have used the BH procedure with parameter (m/m_0)·α and obtained the required result of FDR ≤ α.

4.2.3 Adaptive Procedures (Modified BH procedures):

Since m_0 is not known, we would like to estimate it and use the estimator in an attempt to obtain a procedure which will ensure FDR ≤ α. This approach gives rise to adaptive procedures, which we will now describe (see [2]). In an adaptive procedure the goal is to set the rejection threshold in an adaptive way, suitable for various m_0 values, such that eventually the obtained FDR value will be independent of m_0. These procedures generally follow the following scheme:

1. Compute an estimator m̂_0 of m_0.

2. Conduct the BH procedure again with the parameter (m/m̂_0)·α. That is, find:

i* = max{ i ∈ {1, ..., m} | p_(i) ≤ α·i/m̂_0 }    (4.1)

3. The rejected hypotheses are then Rejected_ABH(p_1, ..., p_m; α) := { i ∈ {1, ..., m} | p_i ≤ α·i*/m̂_0 }.

Iterative Method: Use the procedure iteratively until the estimator m̂_0 converges (i.e., until the difference in the estimator between iterations is less than some small error value ε).

Remark 64. The simplest way to compute an estimator of m_0 is simply to use the standard BH procedure with parameter α and then take m̂_0 = m − R as the estimator. There are, however, many other variations; see [2, 4, 9].
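A minimal, self-contained sketch of the two-stage adaptive scheme, using the simplest estimator m̂_0 = m − R from Remark 64 (the function names are mine, and the actual procedures of [2] include further corrections):

```python
import numpy as np

def _bh_mask(p, alpha):
    # Standard BH at level alpha: reject the i* smallest p-values where
    # i* = max{i : p_(i) <= alpha * i / m}.
    m = len(p)
    order = np.argsort(p)
    ok = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    mask = np.zeros(m, dtype=bool)
    if ok.size > 0:
        mask[order[: ok[-1] + 1]] = True
    return mask

def adaptive_bh(pvals, alpha=0.05):
    """Two-stage adaptive BH: estimate m0 by m - R from a first BH pass,
    then rerun BH with the inflated parameter alpha * m / m0_hat."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    first = _bh_mask(p, alpha)
    m0_hat = max(m - first.sum(), 1)   # simplest estimator (Remark 64)
    return _bh_mask(p, alpha * m / m0_hat)
```

When the first pass already rejects many hypotheses (m̂_0 small), the second pass runs at a larger level and can reject strictly more than standard BH, which is exactly the power gain the adaptive approach aims for.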

Ideally, if this method worked perfectly, it would guarantee FDR ≤ (m_0/E[m̂_0])·α. In actuality this is not the case, and there are several technical difficulties in proving control of the FDR using an adaptive procedure:

• First, for any constant c, the BH procedure with parameter c·α guarantees FDR ≤ (m_0/m)·c·α. However, if c is a RV there is no assurance that FDR ≤ (m_0/m)·E[c]·α, and specifically c = m/m̂_0 doesn't yield the desired result.

• Second, if m̂_0 is an unbiased estimator of m_0, then (m/m̂_0)·α is biased upwards, since by Jensen's inequality we know that E[1/m̂_0] > 1/E[m̂_0] = 1/m_0, and thus

E[(m/m̂_0)·α] = E[1/m̂_0]·m·α > (m·α)/E[m̂_0] = (m/m_0)·α

which means that even if we achieve FDR ≤ E[1/m̂_0]·m_0·α we are not guaranteed FDR ≤ α.

A possible solution: This analysis immediately shows that we would prefer positively biased estimators of m_0 (E[m̂_0] > m_0). Such estimators are more conservative, in the sense that they estimate the number of hypotheses for which H_0 is false (m − m_0) to be smaller than it actually is. It turns out that given such estimators it is possible to prove control of the FDR by α under the assumption of independence, as the following theorem claims:

Theorem 65. Suppose that m̂_0 = m̂_0(P_1, ..., P_m) is a monotone (in the P_i values) estimator of m_0. Denote by m̂_0^(−1) the same estimator calculated for m − 1 hypotheses, using the same P_i values except one value which is a true null (that is, except for one P_j for which H_0 is known to be true and thus P_j ∼ Uniform[0, 1]). Then, assuming P_1, ..., P_m are independent, it holds that FDR ≤ E[m_0 / m̂_0^(−1)]·α.

Proof. See [13].

Remark 66. This theorem almost fulfills the demand FDR ≤ (m_0/E[m̂_0])·α, at the price of increasing m̂_0 by at most 1. Furthermore, there are other (biased) estimators that, by using some additional conservative corrections, guarantee that E[1/m̂_0] ≤ 1/m_0, and for these estimators control of the FDR at level α can be shown (under certain conditions).

Conclusion: At the price of being slightly conservative in the estimation of m_0, it is possible to construct an adaptive procedure that, under independence of the p-values, controls the FDR as required. Specifically, when the difference between m_0 and m is large, using such a procedure will provide a great improvement in power compared to the standard BH procedure.

4.2.4 What to do if the p-values are not independent:

When the p-values are not independent, the variance of the estimators of m_0 grows significantly and the estimators become unstable, such that the corrections required in order to ensure that E[1/m̂_0] ≤ 1/m_0 become greater and greater. Thus there is no adaptive procedure that controls the FDR under general dependence, or even under specific types of dependence, unless one is willing to lose a lot of power by using very conservative estimators of m_0. However, doing so is pointless, since the original purpose of the adaptive procedure was to improve power compared to the standard BH procedure.

Conclusion: If it is known (or shown) that there is only a weak dependence between the p-values, then the use of an adaptive procedure can be very attractive. On the other hand, when the dependence is strong it is not recommended to use these procedures, since the FDR can grow beyond what is expected.

It remains an open challenge to find an adaptive procedure that both ensures high power and controls the FDR at level α even when the p-values are dependent (or to prove that no such procedure exists).

4.2.5 Estimation versus control of the FDR:

The purpose of the BH procedure and its variations was to control the size of the FDR. An alternative approach to the problem is to first use a certain procedure for MHT and then attempt to estimate the FDR of the procedure in order to evaluate it (see [9]). Denote by π_0 = m_0/m the proportion of true null hypotheses.

Definition 67. Given a MHT with a rejection threshold γ (reject the i'th hypothesis iff P_i ≤ γ), we denote:

V(γ) := #{false rejections with threshold γ}

R(γ) := #{rejections with threshold γ} = #{p_i ≤ γ}

Storey's Procedure (2002):

1. Select a parameter λ ∈ (0, 1).

2. Compute all the p-values p_1, ..., p_m.

3. Estimate π_0 by π̂_0(λ) = #{p_i > λ} / ((1 − λ)·m).

4. Given a threshold γ ∈ (0, 1), estimate P(P_i ≤ γ) by P̂(P_i ≤ γ) = (1/m)·max(#{p_i ≤ γ}, 1).

5. Estimate the FDR by FDR̂_λ(γ) := π̂_0(λ)·γ / P̂(P_i ≤ γ).


Explanation:

1. The reason π̂_0(λ) is a sensible (and unbiased) estimator of π_0 is that #{p_i > λ}/(1 − λ) is an unbiased estimator of m_0:

E[#{P_i > λ}] = E[∑_{i=1}^m 1{P_i > λ}] = ∑_{i=1}^m E[1{P_i > λ}] = ∑_{i=1}^m P(P_i > λ) =(†) m_0 · P(P_i > λ | P_i ∼ Uniform[0, 1]) = m_0·(1 − λ)

The marked equality treats P(P_i > λ) as 0 unless P_i is distributed under H_0, in which case it is distributed Uniform[0, 1] and has probability (1 − λ) of being larger than λ. Since there are m_0 values of i for which P(P_i > λ) = 1 − λ, and for the remaining values of i we take P(P_i > λ) = 0, the result is obtained.

2. The reason FDR̂_λ(γ) is a sensible estimator of FDR = E[V/R⁺] is twofold:

(a) First, V(γ)/m is the proportion of false rejections with threshold γ, and π̂_0(λ)·γ is an estimator of (1/m)·E[V(γ)] = E[V(γ)/m], since π̂_0 is an unbiased estimator of π_0 = m_0/m.

(b) Second, P̂(P_i ≤ γ) by definition is clearly an estimator of (1/m)·E[R(γ)] = E[R(γ)/m].

Thus the quotient π̂_0(λ)·γ / P̂(P_i ≤ γ) is an estimator of E[V(γ)]/E[R(γ)]; meaning, we estimated the quotient of the expectations instead of the expectation of the quotient, which is the FDR.

Under certain conditions it can be shown that the estimator FDR̂_λ(γ) has some desirable properties, for example in the case where m grows but the proportion π_0 remains constant (in which case R and V grow in tandem). Although we described this as an estimation procedure for the FDR, it can also be used for purposes of control, as part of the adaptive BH procedure, if we use the estimator m̂_0 = m·π̂_0(λ); furthermore, we can optimize over λ to achieve even better results.
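Storey's five steps translate almost line-by-line into code; a minimal sketch (the function name is mine, and the default λ = 0.5 is an arbitrary illustrative choice):

```python
import numpy as np

def storey_fdr_estimate(pvals, gamma, lam=0.5):
    """Storey's estimate of the FDR of the rule 'reject iff p_i <= gamma'."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    pi0_hat = np.sum(p > lam) / ((1 - lam) * m)   # step 3: estimate pi_0
    pr_hat = max(np.sum(p <= gamma), 1) / m       # step 4: estimate P(P_i <= gamma)
    return pi0_hat * gamma / pr_hat               # step 5: FDR estimate
```

On p-values that are all (approximately) uniform nulls, the estimate is close to 1 for any threshold, while mixing in a block of very small p-values drives it toward 0, matching the intuition in the Explanation above.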

4.3 An alternative method for MHT using Qvalues:

Storey [10] proposed an alternative method for MHT using what he termed q-values. The goal of the procedure is to produce a measure of confidence in each hypothesis which is analogous to the p-value but is suitable for multiple hypotheses.

Definition 68. Given a MHT with m hypotheses, the q-values q_1, ..., q_m are, respectively, the minimal α values such that the BH procedure with parameter α will reject the i'th hypothesis.

Remark 69. The q-values represent the significance of each hypothesis in the context of being tested as part of a MHT. Using q-values is obviously equivalent to using the BH procedure, but has the advantage of giving a more accessible representation of MHT by transferring the majority of the difficulty to the computation of the q-values. After said computation is done, a q-value is simply used to determine whether to reject the i'th hypothesis, like one would use the p-value for a single hypothesis test. Meaning, if we want to ensure FDR ≤ α, it suffices to reject all hypotheses for which q_i ≤ α. The following algorithm computes the q-values:

1. Compute all the p-values p_1, ..., p_m and order them p_(1) ≤ ... ≤ p_(m).

2. Compute q_(i) = min(p_(i)·m/i, 1).

3. Shrink and order: for i = m − 1 down to 1, set q_(i) = min{q_(i), q_(i+1)}.

4. To get q_i from q_(i), perform the permutation opposite to p_i ↦ p_(i).

Step 3 is performed in order to restore the ordering of the ordered values q_(i), since it is impossible that p_(i) < p_(j) and at the same time q_(i) > q_(j). One can show that this algorithm indeed computes the values in accordance with the definition of the q-values.
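The four-step algorithm above, as a direct sketch (the function name is mine):

```python
import numpy as np

def q_values(pvals):
    """q_i = the minimal BH level alpha at which hypothesis i is rejected."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                                           # step 1
    q_sorted = np.minimum(p[order] * m / np.arange(1, m + 1), 1.0)  # step 2
    for i in range(m - 2, -1, -1):                                  # step 3
        q_sorted[i] = min(q_sorted[i], q_sorted[i + 1])
    q = np.empty(m)
    q[order] = q_sorted                                             # step 4 (inverse permutation)
    return q
```

Rejecting all hypotheses with q_i ≤ α then reproduces the BH rejection set at level α, as Remark 69 describes.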

Remark 70. In general, when conducting a MHT, we assume that we only have access to the p-values and that we have no preference for certain hypotheses over others. Assuming this is true, we will always use the same rejection threshold for all hypotheses, since there is no logical reason to reject one hypothesis with a certain p-value if we did not reject another hypothesis that has a lower p-value. Under different assumptions it is possible to define other procedures that do not use an identical rejection threshold for all hypotheses.

4.4 Other variations of FDR:

- positive FDR

- local FDR (lfdr)

4.5 Empirical Bayes View:


References

[1] Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: a practical and powerful approach to

multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) (1995), 289–300.

[2] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli, Adaptive linear step-up procedures that control the

false discovery rate, Biometrika 93 (2006), no. 3, 491–507.

[3] Yoav Benjamini and Daniel Yekutieli, The control of the false discovery rate in multiple testing under depen-

dency, Annals of Statistics (2001), 1165–1188.

[4] Bradley Efron, Robert Tibshirani, John D Storey, and Virginia Tusher, Empirical Bayes analysis of a microarray

experiment, Journal of the American Statistical Association 96 (2001), no. 456, 1151–1160.

[5] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J Smola, A kernel

method for the two-sample-problem, Advances in Neural Information Processing Systems 19 (2007), 513.

[6] Wassily Hoeffding, A non-parametric test of independence, The Annals of Mathematical Statistics (1948),

546–557.

[7] Shachar Kaufman, Ruth Heller, Yair Heller, and Malka Gorfine, Consistent distribution-free tests of association

between univariate random variables, arXiv preprint arXiv:1308.1559 (2013).

[8] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh,

Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti, Detecting novel associations in large data sets,

Science 334 (2011), no. 6062, 1518–1524.

[9] John D Storey, A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B

(Statistical Methodology) 64 (2002), no. 3, 479–498.

[10] John D Storey, The positive false discovery rate: a Bayesian interpretation and the q-value, Annals of Statistics (2003),

2013–2035.

[11] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al., Measuring and testing dependence by correlation of

distances, The Annals of Statistics 35 (2007), no. 6, 2769–2794.

[12] Gábor J Székely, Maria L Rizzo, et al., Brownian distance covariance, The Annals of Applied Statistics 3 (2009),

no. 4, 1236–1265.

[13] Amit Zeisel, Or Zuk, Eytan Domany, et al., FDR control with adaptive procedures and FDR monotonicity, The

Annals of Applied Statistics 5 (2011), no. 2A, 943–968.
