Top Banner
BIOL 582 Lecture Set 21 One-Way MANOVA, Part I
27

BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Jan 02, 2016

Download

Documents

Angelina Shaw
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

BIOL 582

Lecture Set 21One-Way MANOVA, Part I

Page 2: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• So far we have learned two things about multivariate data:1. That linear models work equally well with multivariate data,

compared to univariate data2. That we can visualize general patterns in multivariate data

spaces in reduced dimensions• The next two lectures will attempt to wrap up some other issues

working with multivariate data, namely1. Can I use ANOVA on a multivariate linear model to test

group differences?2. Can I determine which of my response variables is better

for distinguishing group differences?3. What is the difference between scatter plots that resemble

data spaces and ones that resemble statistical spaces?• Before getting into these issues, let’s take a quick look again at

some issues with the Bumpus data

BIOL 582 Multivariate data

Page 3: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• From last time

• This time, let’s try this simple procedure

• It might be obvious that the plot shows the distribution of eigen values for the six possible PCs

• Let’s compare this to random, uncorrelated data…

BIOL 582 Bumpus data redux: interpreting PCA

> bumpus<-read.csv(”bumpus.csv")> attach(bumpus)> # the following morphological variables will be used> # AE, BHL, HL, FL, TTL, SW, SKL> Y<-cbind(AE,BHL,FL,TTL,SW,SKL)

> # PCA on correlation matrix> Y.cor.pca<-princomp(Y,cor=T)> plot(Y.cor.pca)

> Y.cor.pca$sdev^2 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 3.6394870 0.7481825 0.6176918 0.4498947 0.3605078 0.1842361

Page 4: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Random data:

• Compare plots

• The Bumpus data produced a principal eigenvalue that can be described as “overdispersed” (more variance than expected). Another way to describe the pattern is lower entropy (uncertainty), which means greater information content in fewer dimensions than expected by chance.

• There are (debatable) ways to test the “significance” of PCs (and some use these methods to decide which information to keep and which to throw out). This is not advocated here.

• Just realize that in the Bumpus data, prominent variable covariation is apparent.

BIOL 582 Bumpus data redux: interpreting PCA

> Y.rand<-matrix(rnorm(6*nrow(Y),0,1),nrow=nrow(Y),ncol=6)> Y.rand.cor.pca<-princomp(Y.rand)> par(mfrow=c(1,2))> plot(Y.cor.pca)> plot(Y.rand.cor.pca)

Pop quiz! What is the total variance of any PCA using a correlation matrix?

p, where p is the number of variables.

Page 5: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Why was the previous page important? Sometimes the covariation of variables helps explain group differences in multivariate responses. Utilizing a multivariate linear model might help reveal this. Below is a hypothetical example to illustrate this point

• Are there obvious differences between the two groups for each trait, independently?

• But the two groups occupy different regions of the data space• By considering the covariation of responses, it is apparent that a test to compare

locations should reveal a difference.

BIOL 582 Covariation of response / group differences

y1

y2

Page 6: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Let’s go through the steps of a linear model with multivariate response data, as we did with univariate response data (sort of a side by side comparison.

• From each model, the error can be obtained

• Then the error can be “squared” by

• Thus, for univariate data, the inner-product of two vectors produces a scalar, which is SSE. For multivariate data, the result is a sums of squares and cross-products matrix. This is the key difference. The error is not one value; it is a matrix of measures of covariance.

BIOL 582 Multivariate linear model

Page 7: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Recall that if one wishes to test group differences in means for univariate data, one can compare the error of a “full” linear model with both an intercept and parameters for estimating group effects to a “reduced” model that contains just an intercept. This can be done with a likelihood ratio test (using, e.g., an F stat)

• Inspection of this F value formula indicates that it will only work if the residuals are a single vector, such that the inner product produces a scalar. Here is a way to generalize this for matrices

• Note that the F is bolded but not italicized, as it is a matrix

BIOL 582 Multivariate linear model & ANOVA

Page 8: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Two questions: What are the dimensions of the matrix? Can we get a P-value?

• First answer is easy, the next answer, not so much

• A sums of squares and cross-products matrix (SSCP) will ALWAYS be a p x p matrix! The difference in two SSCP matrices, like the right side of an equation is also a SSCP matrix. It is the SSCP matrix for the effect that is tested.

• Sometimes a particular nomenclature is used to describe particular SSCPs. The error SSCP is called E (do not confuse with eigenvectors) and the difference between the E matrices of the two models is called H, which is much like “hypothesis.” This nomenclature has some benefit. For example, one could state that the expected value of tr(H) is 0 under the null hypothesis. Thus, F can be written as

BIOL 582 Multivariate linear model & ANOVA

Page 9: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Some sources take it a step further. Since SSCPs multiplied by the reciprocal of their degrees of freedom produces covariance matrices, some prefer to do the following:

• Where B and H are “between-group” and “within-group” covariance matrices, respectively. This approach obviously assumes one is doing a one-way “MANOVA” rather than regression. However, the original formula is more general for comparing any two nested models.

• As for the second question, there is no way to get a P-value for a matrix…

• But, there are a few multivariate coefficients that can be converted to values that approximately follow F distributions. In some cases (i.e., when Δk = 1), the values are said to be “Exact” F values.

• We will come back to what we can do with this F matrix next lecture

BIOL 582 Multivariate linear model & ANOVA

Page 10: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

• Let’s not forget assumptions!• Multivariate linear model assumptions are similar to univariate

assumptions, with a few twists1. The error is multivariate normal (not easy to test, but one can look at a PC plot of

residuals and check for an elliptical shape of scatter. R has a multivariate Shapiro-Wilk test)

2. Homogeneity of covariance (there is a test for this – Box’s M – but it is overly sensitive and generally not used. One can examine covariance matrices – the off-diagonal elements should be similar; the diagonal elements should be similar)

3. The relationship of dependent variables is linear.4. Independent observation vectors (not values) of subjects5. n >> p (This is more so for MANOVA than the linear model)6. Y is not rank deficient. I.e., it is full-rank. The “rank” is the number of variables.

However, if one variable is a function of another, the matrix has more columns than it needs. If many variables are highly correlated, rank deficiency might occur. (One ad-hoc solution it to do PCA and keep the components that have eigenvalues > 0. This is more so for MANOVA than the linear model. Multivariate test statistics are based on a full rank assumption.) Note: rank deficiency in X is also a problem for either univariate or multivariate responses. Computers have trouble inverting XTX if the X is not full rank. This is called multicollinearity. It means that variables are inherently related, and prediction should use a smaller set. (E.g., Assignment 8 – DPY and PercipIn – Stepwise regression removed one of these most likely)

BIOL 582 Multivariate linear model & ANOVA

Page 11: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

One-way MANOVA Steps

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic using Sf and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 12: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 13: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses

• H0: Morphology does not differ between survivors and non-survivors, and males and females (i.e., sex-survival groups)

• HA: Morphology differs between survivors and non-survivors, and males and females (i.e., sex-survival groups) in some way.

BIOL 582 Multivariate ANOVA

where ΣH is the population covariance matrix for the effectSince V = 1/Δk*H is a sample estimate of Σ , then the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value for the null hypothesis is tr(H) = 0.

Page 14: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 15: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model

BIOL 582 Multivariate ANOVA

> bumpus<-read.csv(“bumpus.csv")> attach(bumpus)> # the following morphological variables will be used> # AE, BHL, HL, FL, TTL, SW, SKL> Y<-cbind(AE,BHL,FL,TTL,SW,SKL)

> group<-factor(paste(sex,survived,sep='.'))> lm.group<-lm(Y~group)> lm.group

Call:lm(formula = Y ~ group)

Coefficients: AE BHL FL TTL SW SKL (Intercept) 241.571429 31.478571 18.027650 28.728307 15.279914 20.845236groupf.TRUE -0.571429 -0.045238 0.129721 0.319617 -0.037495 -0.035379groupm.FALSE 5.734127 0.082540 -0.008467 -0.113796 0.015825 0.610709groupm.TRUE 5.840336 0.215546 0.174687 0.120116 0.065670 0.887702

>

Page 16: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model

BIOL 582 Multivariate ANOVA

> # Consider assumptions......> # One can go crazy trying to do tests> # Almost any test will be too sensitive> # They will suggest assumptions are violated> # Yet the multivariate test stats are anti-conservative> # So testing assumptions is rather futile> # Better to look at residuals> > e<-resid(lm.group)> pca.e<-princomp(e,cor=F)> yh<-predict(lm.group)> pca.yh<-princomp(yh,cor=F)> par(mfrow=c(2,2))> plot(pca.e,main="distribution of eiqenvalues")> plot(pca.e$scores,asp=1,main="best 2-D view of residuals")> plot(pca.yh$scores[,1],pca.e$scores[,1],main="best residual vs predicted view")> plot(group,pca.e$scores[,1],main="best residual vs group view")

Page 17: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model

BIOL 582 Multivariate ANOVA

Page 18: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model

BIOL 582 Multivariate ANOVA

> Sf<-t(e)%*%e> Sf AE BHL FL TTL SW SKLAE 2966.8490 226.59416 256.26670 433.30517 111.23737 285.34416BHL 226.5942 65.18760 35.11097 57.50294 18.80858 41.58504FL 256.2667 35.11097 49.67377 68.00522 16.19890 35.69328TTL 433.3052 57.50294 68.00522 141.80487 24.52523 55.61453SW 111.2374 18.80858 16.19890 24.52523 19.39935 18.06867SKL 285.3442 41.58504 35.69328 55.61453 18.06867 115.81108

> # potential homogeneity of covariance problem…

Page 19: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 20: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model

BIOL 582 Multivariate ANOVA

> lm.null<-lm(Y~1)> e.r<-resid(lm.null)> Sr<-t(e.r)%*%e.r> Sr AE BHL FL TTL SW SKLAE 4115.0294 261.26912 263.98818 410.11550 123.17917 435.47590BHL 261.2691 66.59993 35.79883 57.34511 19.31347 46.82664FL 263.9882 35.79883 50.64240 69.25259 16.41615 37.77880TTL 410.1155 57.34511 69.25259 144.59029 24.40976 54.03802SW 123.1792 19.31347 16.41615 24.40976 19.58572 19.88596SKL 435.4759 46.82664 37.77880 54.03802 19.88596 136.92128

> # compare to Sf> Sf AE BHL FL TTL SW SKLAE 2966.8490 226.59416 256.26670 433.30517 111.23737 285.34416BHL 226.5942 65.18760 35.11097 57.50294 18.80858 41.58504FL 256.2667 35.11097 49.67377 68.00522 16.19890 35.69328TTL 433.3052 57.50294 68.00522 141.80487 24.52523 55.61453SW 111.2374 18.80858 16.19890 24.52523 19.39935 18.06867SKL 285.3442 41.58504 35.69328 55.61453 18.06867 115.81108

Because AE compared to other variables has such large variance, the Sf matrix might violate the homogeneity of covariance assumption (i.e., MANOVA devolves into ANOVA of AE)

Page 21: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 22: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use only Sf to get a test statistic if performing a randomization test.)

• There are several multivariate test statistics that can be used (e.g., check out Wikipedia for “MANOVA”)

• It can be debated which are the best (although differences only appear in small samples and if significance is questionable)

• The following is all we will consider in order to remain practical

• There are two basic ways to make this test stat more “recognizable”• Pillai’s trace PT = tr[(H + E)-1H] resembles an R2 value• Hotelling-Lawley trace HLT = tr[E-1H] resembles an F value (without df)• Both of these test statistics can be converted to values that approximately

follow F distributions, or one can randomize rows of Y and create error distributions of tr(H)

BIOL 582 Multivariate ANOVA

where ΣH is the population covariance matrix for the effectSince V = 1/Δk*H is a sample estimate of Σ , then the trace of H is an adequate sample statistic for testing the null hypothesis. The expected value for the null hypothesis is tr(H) = 0.

Page 23: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 24: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

BIOL 582 Multivariate ANOVA

> # simple solution in R> summary(manova(lm.group)) Df Pillai approx F num Df den Df Pr(>F) group 3 0.51629 4.4692 18 387 7.506e-09 ***Residuals 132 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(manova(lm.group),test="Hotelling") Df Hotelling-Lawley approx F num Df den Df Pr(>F) group 3 0.94039 6.5653 18 377 2.758e-14 ***Residuals 132 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Page 25: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data7. Evaluate the probability of the test statistic if the null hypothesis were true

b. By performing a randomization test to create an empirical probability distribution

BIOL 582 Multivariate ANOVA

> # Randomization approach> H<-Sr-Sf> t.H<-sum(diag(H)) # this is the test stat> permute<-999> result<-t.H; P<-1> for(i in 1:permute){+ Y.r<-Y[sample(nrow(Y)),] # this randomizes row vectors+ # important to realize that all values for each subject+ # are held together when shuffling values+ lm.r<-lm(Y.r~group)+ Sf.r<-t(resid(lm.r))%*%resid(lm.r)+ t.H.r<-sum(diag(Sr-Sf.r))+ result<-c(result,t.H.r)+ P<-ifelse(t.H.r>=t.H,P+1,P+0)+ }> > > # error distribution> hist(result,xlab="trace(H)",breaks=50,col="blue")> rug(result)> rug(result[1],lwd=2,col='red')> > #Results> P.value<-P/(permute+1)> R2<-sum(diag(solve(Sr,H)))> t.H[1] 1174.643> R2[1] 0.5162905> P.value[1] 0.001

Page 26: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

1. State Null/Alternative hypotheses2. Define model (evaluate assumptions)3. Calculate Sf for the error of the full model4. Define the reduced “null” model (always contains just an intercept)5. Calculate Sr for the error of the reduced model6. Calculate a multivariate test statistic Sf using and/or Sr . (Note: one can use

only Sf to get a test statistic if performing a randomization test.)7. Evaluate the probability of the test statistic if the null hypothesis were true

a. By converting the test statistic to an F stat, which approximately follows and F distribution

b. By performing a randomization test to create an empirical probability distribution

8. Generate plots/tables9. Maybe do a discriminant analysis

BIOL 582 Multivariate ANOVA

Page 27: BIOL 582 Lecture Set 21 One-Way MANOVA, Part I. So far we have learned two things about multivariate data: 1.That linear models work equally well with.

Example MANOVA: Bumpus Data

8. Generate plots/tables9. Maybe do a discriminant analysis

• These two are linked because we are in a position where we might want to do multiple comparisons. Therefore, what we plot or how we present a table might be contingent upon what a multiple comparisons test reveals

• Also, multiple comparisons can be done several ways• These topic will be explored in the next lecture

BIOL 582 Multivariate ANOVA