7/25/2019 Hank Statisticswithmatlab Test&Anova 040317
2004/3/17
[2]
Assumptions and Test for Normality
  Plots: Histogram, QQplot
  Tests: Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test
Parametric Hypothesis Testing
  z-test, t-test
  two-sample t-test
  One-Way Analysis of Variance (ANOVA)
Non-Parametric Hypothesis Testing
  Sign test, Wilcoxon signed-rank test
  Wilcoxon rank-sum test (Mann-Whitney U test)
Data types: Paired data, Unpaired data, More than two Groups, Complex data, Microarray Data
[3]
A hypothesis test is a procedure for determining if an assertion about a characteristic of a population is reasonable.
For example, suppose that someone says that the average price of a gallon of regular unleaded gas in Massachusetts is $1.15. How would you decide whether this statement is true?
You could try to find out what every gas station in the state was charging and how many gallons they were selling at that price. That approach might be definitive, but it could end up costing more than the information is worth.
A simpler approach is to find out the price of gas at a small number of randomly chosen stations around the state and compare the average price to $1.15.
Of course, the average price you get will probably not be exactly $1.15 due to variability in price from one station to the next.
Suppose your average price was $1.18. Is this three cent difference a result of chance variability, or is the original assertion incorrect?
A hypothesis test can provide an answer.
[4]
[13]
Example H0: the gene is not differentially expressed.
The test is significant = Reject H0.
False Positive = P(Reject H0 | H0 true) = concluding that a gene is differentially expressed when in fact it is not.
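The false positive probability P(Reject H0 | H0 true) can be checked by simulation. The sketch below is illustrative only: it assumes a two-sided one-sample z-test with known variance, and the sample size, number of repetitions, and random seed are arbitrary choices, not values from the slides. Every sample is drawn with H0 true, so the fraction of rejections should land near the chosen alpha.

```python
import random
from statistics import NormalDist, mean

def z_test_rejects(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test with known sigma; True if H0: mean == mu0 is rejected."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def false_positive_rate(n_tests=5000, n=10, alpha=0.05, seed=0):
    """Fraction of tests that reject H0 when every sample is drawn with H0 true."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is repeatable
    rejections = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 holds: true mean is 0
        if z_test_rejects(sample, mu0=0.0, sigma=1.0, alpha=alpha):
            rejections += 1
    return rejections / n_tests

rate = false_positive_rate()
print(f"empirical false positive rate: {rate:.3f}")
```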
[14]
Question: What if I do a t-test on a pair of samples and fail to reject the null hypothesis? Does this mean that there is no significant difference?
Answer: Maybe yes, maybe no.
For two-sample t-test, power is the probability of rejecting the hypothesis that the means are equal when they are in fact not equal. Power is one minus the probability of Type-II error.
The power of the test depends upon the sample size, the magnitudes of the variances, the alpha level, and the actual difference between the two population means.
Usually you would only consider the power of a test when you failed to reject the null hypothesis.
High power is desirable (0.7 to 1.0). High power means that there is a high probability of rejecting the null hypothesis when the null hypothesis is false.
This is a critical measure of precision in hypothesis testing and needs to be considered with care.
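The dependence of power on sample size, variance, alpha, and the true difference can be made concrete with a short calculation. This is a rough sketch, not the exact t-test power: it assumes equal group sizes and a common known standard deviation, and it swaps in the normal (z) approximation for the t distribution. The function name and arguments are illustrative.

```python
from statistics import NormalDist

def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation).

    delta: true difference between the two population means
    sigma: common standard deviation (assumed known and equal in both groups)
    """
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard errors of the difference in means
    shift = abs(delta) / (sigma * (2 / n_per_group) ** 0.5)
    # Probability that the statistic falls in either rejection region
    return std_norm.cdf(shift - z_crit) + std_norm.cdf(-shift - z_crit)

for n in (5, 10, 20, 40):
    print(n, round(approx_power(delta=1.0, sigma=1.0, n_per_group=n), 3))
```

With no true difference (delta = 0) the function returns alpha, which is the Type-I error rate, and the value rises toward 1 as the sample size or the true difference grows.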
[15]
Two measurements are independent if knowing the value of one measurement does not give information about the value of the other.
For any gene, the measurements of expression in two different patients are independent.
Replicate measurements from the same patient are not independent (replicate features on an array).
[16]
Function    Description
ranksum     Wilcoxon rank sum test that two populations are identical (unpaired) (Mann-Whitney test)
signrank    Wilcoxon signed rank test of equality of medians (paired)
signtest    Sign test for paired samples (paired)
ttest2      Hypothesis testing for the difference in means of two samples (unpaired)
ttest       Hypothesis testing for a single sample mean (paired)
ztest       Hypothesis testing for the mean of one sample with known variance
kstest2     Kolmogorov-Smirnov test to compare the distribution of two samples
kstest      Kolmogorov-Smirnov test of the distribution of one sample
lillietest  Lilliefors test for goodness of fit to a normal distribution
jbtest      Jarque-Bera test for goodness-of-fit to a normal distribution
anova1      One-Way Analysis of Variance (ANOVA)
[17]
Paired or one-sample t-test (Related samples)
Unpaired or two-sample t-test (Independent samples)
[18]
The gene acetyl-Coenzyme A acetyltransferase 2 (ACAT2) is on the microarray used for the breast cancer data.
We can use a paired t-test to determine whether or not the gene is differentially expressed following doxorubicin chemotherapy.
The samples from before and after chemotherapy have been hybridized on separate arrays, with a reference sample in the other channel.
Normalize the data.
Because this is a reference sample experiment, we calculate the log ratio of the experimental sample relative to the reference sample for before and after treatment in each patient.
Calculate a single log ratio for each patient that represents the difference in gene expression due to treatment by subtracting the log ratio for the gene before treatment from the log ratio of the gene after treatment.
Perform the t-test: t = 3.22, compared to the t distribution with 19 degrees of freedom, t(19).
The p-value for a two-tailed one-sample t-test is 0.0045, which is significant at the 1% significance level.
Conclude: this gene has been significantly down-regulated following chemotherapy at the 1% level.
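The paired t-test steps above reduce to a one-sample t statistic on the per-patient differences. The following is a minimal sketch of that calculation; the function name and the log-ratio values are hypothetical and are not the ACAT2 data. The resulting statistic would then be compared to the t distribution with n - 1 degrees of freedom.

```python
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic for H0: mean(after - before) == 0.

    Returns (t, degrees of freedom).
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # One difference per patient: log ratio after minus log ratio before
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)
    return t, n - 1

# Hypothetical log ratios for five patients (illustration only)
before = [0.42, 0.10, 0.35, 0.58, 0.27]
after = [0.15, 0.02, 0.21, 0.30, 0.25]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```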
[19]
The gene metallothionein IB is on the Affymetrix array used for the leukemia data.
To identify whether or not this gene is differentially expressed between the AML and ALL patients.
To identify genes which are up- or down-regulated in AML relative to ALL.
Steps: the data are log transformed. t = -3.4177, p = 0.0016.
Conclude that the expression of metallothionein IB is significantly higher in AML than in ALL at the 1% level.
[20]
The distribution of the data being tested is normal.
  For paired t-test, it is the distribution of the subtracted data that must be normal.
  For unpaired t-test, the distribution of both data sets must be normal.
Homogeneous: the variances of the two populations are equal.
Plots: Histogram, Density Plot, QQplot.
Test for Normality: Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test.
Test for equality of the two variances: Variance ratio F-test.
Note:
If the two populations are symmetric, and if the variances are equal, then the t-test may be used.
If the two populations are symmetric, and the variances are not equal, then use the two-sample unequal variance t-test or Welch's t-test.
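The difference between the equal-variance t-test and Welch's t-test lies in how the standard error and the degrees of freedom are computed. The sketch below shows both statistics side by side; the function names and the sample values are illustrative assumptions, and the p-value lookup against the t distribution is left out.

```python
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5
    return t, nx + ny - 2

def welch_t(x, y):
    """Welch's unequal-variance t statistic; returns (t, approximate df)."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Hypothetical log expression values with visibly different spreads
group_a = [2.1, 2.4, 2.2, 2.3, 2.5]
group_b = [1.0, 3.9, 2.8, 0.4, 4.6, 2.2]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

When the two groups have the same size and the same sample variance, both functions return the same statistic and the same degrees of freedom; they diverge as the variances or group sizes become unequal.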
[29]
Given n pairs of data, the sign test tests the hypothesis that the median of the differences in the pairs is zero.
The test statistic is the number of positive differences.
If the null hypothesis is true, then the numbers of positive and negative differences should be approximately the same.
In fact, under the null hypothesis the number of positive differences has a Binomial distribution with parameters n and p = 0.5.
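Because the number of positive differences is Binomial(n, 0.5) under the null hypothesis, the sign test p-value can be computed exactly. A minimal sketch follows; dropping zero differences and doubling the smaller tail for a two-sided p-value are common conventions assumed here, and the example differences are made up.

```python
from math import comb

def sign_test(diffs):
    """Exact two-sided sign test for H0: the median difference is zero.

    Returns (number of positive differences, p-value).
    """
    nonzero = [d for d in diffs if d != 0]  # assumption: zero differences are dropped
    n = len(nonzero)
    n_pos = sum(1 for d in nonzero if d > 0)
    k = min(n_pos, n - n_pos)
    # P(X <= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, min(1.0, 2 * tail)

# Hypothetical paired differences: 8 positive, 2 negative
diffs = [0.3, 0.5, 0.1, 0.4, 0.2, 0.6, 0.7, 0.9, -0.2, -0.1]
n_pos, p_value = sign_test(diffs)
print(n_pos, p_value)
```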
[30]
Null hypothesis: the population median from which both samples were drawn is the same.
Sort the absolute values of the differences from smallest to largest.
Assign ranks to the absolute values.
Find the sum of the ranks of the positive differences.
The sum of the ranks for the "positive" (up-regulated) values is calculated and compared against a precomputed table to obtain a p-value.
If the null hypothesis is true, the sum of the ranks of the positive differences should be about the same as the sum of the ranks of the negative differences.
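The signed-rank procedure can be sketched in a few lines. Instead of a precomputed table, this illustration enumerates all 2^n equally likely sign patterns under the null hypothesis, which is only practical for small n. Dropping zero differences and giving tied values their average rank are assumed conventions, and all names are hypothetical.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]

def signed_rank_test(diffs):
    """Exact two-sided Wilcoxon signed-rank test; returns (W+, p-value)."""
    nonzero = [d for d in diffs if d != 0]  # assumption: zero differences are dropped
    ranks = average_ranks([abs(d) for d in nonzero])
    w_pos = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    center = sum(ranks) / 2  # expected W+ under H0
    observed_dev = abs(w_pos - center)
    n_extreme = n_patterns = 0
    # Under H0 each difference is positive or negative with equal probability
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        n_patterns += 1
        if abs(w - center) >= observed_dev - 1e-12:
            n_extreme += 1
    return w_pos, n_extreme / n_patterns

# Hypothetical paired differences (after minus before)
w_plus, p_value = signed_rank_test([0.8, -0.3, 1.1, 0.5, 0.9, -0.1, 1.4])
print(w_plus, p_value)
```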
[31]
The data from the two groups are combined and given ranks (1 for the smallest, 2 for the second smallest, ...).
The ranks for the larger group are summed and that number is compared against a precomputed table to obtain a p-value.
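The rank-sum procedure can be illustrated as follows. This sketch sums the ranks of the first group; because the two rank sums always add up to the same total, summing either group gives the same test. In place of a precomputed table it enumerates every way of assigning the pooled ranks to the first group, which is feasible only for small samples. Function names and data are hypothetical.

```python
from itertools import combinations

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]

def rank_sum_test(x, y):
    """Exact two-sided Wilcoxon rank-sum test; returns (rank sum of x, p-value)."""
    ranks = average_ranks(list(x) + list(y))  # rank the combined data
    n_x = len(x)
    w_obs = sum(ranks[:n_x])
    expected = n_x * (len(ranks) + 1) / 2  # mean rank sum of x under H0
    observed_dev = abs(w_obs - expected)
    n_extreme = n_total = 0
    # Under H0 every choice of n_x ranks for group x is equally likely
    for idx in combinations(range(len(ranks)), n_x):
        w = sum(ranks[i] for i in idx)
        n_total += 1
        if abs(w - expected) >= observed_dev - 1e-12:
            n_extreme += 1
    return w_obs, n_extreme / n_total

# Hypothetical expression values for two unpaired groups
w_x, p_value = rank_sum_test([1.2, 0.8, 1.5, 1.1], [2.3, 1.9, 2.8, 1.4, 2.5])
print(w_x, p_value)
```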
[32]
[41]
In a permutation test, the null hypothesis reduces the inference to a randomization problem.
The process of randomization makes it possible to ascribe a probability distribution to the difference in the outcome possible under H0.
The outcome data are analyzed many times (once for each acceptable assignment that could have been possible under H0) and then compared with the observed result, without dependence on additional distributional or model-based assumptions.
Perform a permutation test (general):
1. Analyze the problem and choose the null hypothesis.
2. Choose the test statistic T.
3. Calculate the value of the test statistic for the observed data: t_obs.
4. Apply the randomization principle and look at all possible permutations; this gives the distribution of the test statistic T under H0.
5. Calculate the p-value: the proportion of permutations whose test statistic is at least as extreme as t_obs.
Ref: Mansmann, U. (2002), Practical microarray analysis: resampling and the Bootstrap. Heidelberg.
[42]
The permutation test allows determining thestatistical significance of the score for every gene.
[43]
It often happens in research practice that you need to compare more than two groups (e.g., drug 1, drug 2, and placebo), or compare groups created by more than one independent variable while controlling for the separate influence of each of them (e.g., Gender, type of Drug, and size of Dose). In these cases, you need to analyze the data using Analysis of Variance, which can be considered to be a generalization of the t-test.
In fact, for two group comparisons, ANOVA will give results identical to a t-test.
When the design is more complex, ANOVA offers numerous advantages that t-tests cannot provide (even if you run a series of t-tests comparing various cells of the design).
Analysis of Variance (ANOVA) allows us to extend the t-test to more than two populations or measurements (treatments). That is, we can test the following:
Are all the means from more than two populations equal?
Are all the means from more than two treatments on one population equal? (This is equivalent to asking whether the treatments have any overall effect.)
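The one-way ANOVA F statistic compares the variation between group means with the variation within groups. The sketch below computes it from scratch; the function name and the data are hypothetical, and the p-value lookup against the F distribution is omitted. For exactly two groups, F equals the square of the equal-variance two-sample t statistic, which is why the two tests agree.

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic; returns (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements for drug 1, drug 2, and placebo
drug1 = [4.1, 3.8, 4.5, 4.0]
drug2 = [5.2, 5.0, 4.7, 5.5]
placebo = [3.1, 3.4, 2.9, 3.6]
print(one_way_anova_f([drug1, drug2, placebo]))
```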
[44]
[45]
To identify the genes that are differentially expressed in one or more of these four groups.
ARP1 (actin-related protein 1).
[46]
Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall.
Jarque, C. M. and Bera, A. K. (1980). Efficient tests for normality, homoscedasticity, and serial independence of regression residuals. Economics Letters 6, 255-9.
Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data, Journal of Computational Biology, 7: 819-837.
Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown, Journal of the American Statistical Association.
Martinez, W. L. (2002). Computational statistics handbook with MATLAB, Boca Raton: Chapman & Hall/CRC.
Runyon, R. P. (1977). Nonparametric statistics: a contemporary approach, Reading, Mass.: Addison-Wesley Pub. Co.
Statistics Toolbox User's Guide, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/toolbox/stats/stats.shtml
Stekel, D. (2003). Microarray bioinformatics, New York: Cambridge University Press.
Tsai, C. A., Chen, Y. J. and Chen, J. (2003). Testing for differentially expressed genes with microarray data, Nucleic Acids Research 31, No 9, e52.
Turner, J. R. and Thayer, J. F. (2001). Introduction to analysis of variance: design, analysis, & interpretation, Thousand Oaks, Calif.: Sage Publications.
E-mail: [email protected]
http://www.sinica.edu.tw/~hmwu/