Unequal Number Observations

Two-way ANOVA with unequal numbers of observations per cell.

TFG Documentation.

The data. In previous discussions of two-way ANOVA, we assumed that there were $n$ observations in each cell of the $a \times b$ design. If the sample sizes are unequal, we will denote the number of observations in cell $A = i$, $B = j$ as $n_{ij}$. The total number of observations in the dataset is:

$$N = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij}$$

Why data are unbalanced. When A and B are experimental factors that are randomly assigned, the researchers will usually plan the experiment to achieve equal $n$'s per cell. But mishaps may occur, causing observations to be missing. When imbalance is caused by missing data, the analyses described in this lecture will be valid under the assumption that the missing values are missing at random (MAR).

In this context, MAR means that the probability that any given observation $Y_{ijk}$ is missing cannot depend on the actual value of $Y_{ijk}$, although it may be related to the factors A or B. Formal discussions of MAR are found in texts dealing with missing data, such as the book by Little and Rubin (2002, Wiley).

When one or both factors are not randomly assigned, as in an observational study, the investigators may have no control over the $n_{ij}$'s, and imbalance will be the rule rather than the exception.

OLS estimates for cell means are still the sample means. Using cell-means notation, the model is:

$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \qquad (1)$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$ and $k = 1, \ldots, n_{ij}$. This is no different from unbalanced one-way ANOVA with $ab$ groups. The OLS estimates for the cell means $\mu_{11}, \ldots, \mu_{ab}$ are simply the sample averages within the cells:

$$\hat{\mu}_{ij} = \bar{Y}_{ij.} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} Y_{ijk}$$
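As a minimal sketch of these estimates, the snippet below computes the $n_{ij}$'s, $N$ and the cell means for a small unbalanced 2 × 3 layout. The data values and the names `data`, `n` and `cell_means` are made up for illustration and do not come from the lecture.

```python
# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
# Keys are 0-based (i, j) cell labels; values are the observations Y_ijk in that cell.
data = {
    (0, 0): [10, 12],
    (0, 1): [14, 15, 16],
    (0, 2): [20],
    (1, 0): [11, 13, 12, 12],
    (1, 1): [18, 20],
    (1, 2): [25, 27, 26],
}

# Cell sample sizes n_ij and total sample size N = sum_i sum_j n_ij.
n = {cell: len(ys) for cell, ys in data.items()}
N = sum(n.values())

# The OLS estimate of each cell mean mu_ij is the within-cell sample average.
cell_means = {cell: sum(ys) / len(ys) for cell, ys in data.items()}

print("N =", N)
for cell in sorted(cell_means):
    print(cell, "n =", n[cell], "mean =", cell_means[cell])
```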

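Thus the fitted values are the $\bar{Y}_{ij.}$'s. The error sum of squares is:

$$\mathrm{SSErr} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{ij.} \right)^2$$

and the error degrees of freedom is

$$\sum_{i=1}^{a} \sum_{j=1}^{b} (n_{ij} - 1) = N - ab$$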

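The regression sum of squares no longer partitions into independent pieces corresponding to main effects and interactions. In an intercept-only model, every fitted value would be equal to the grand mean $\bar{Y}_{...} = N^{-1} \sum_i \sum_j \sum_k Y_{ijk}$. Therefore, the regression sum of squares for the cell-means model, measured against the intercept-only model, is:

$$\mathrm{SSReg} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( \bar{Y}_{ij.} - \bar{Y}_{...} \right)^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} \left( \bar{Y}_{ij.} - \bar{Y}_{...} \right)^2 .$$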

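As always, the total sum of squares:

$$\mathrm{SSTot} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{...} \right)^2$$

can still be written as

$$\mathrm{SSTot} = \mathrm{SSReg} + \mathrm{SSErr}.$$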
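The decomposition can be checked numerically. The sketch below uses made-up data for an unbalanced 2 × 3 layout (the values and variable names are illustrative assumptions) and confirms that the three sums of squares still satisfy SSTot = SSReg + SSErr, with each cell weighted by its own $n_{ij}$ in SSReg.

```python
# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
data = {
    (0, 0): [10, 12],
    (0, 1): [14, 15, 16],
    (0, 2): [20],
    (1, 0): [11, 13, 12, 12],
    (1, 1): [18, 20],
    (1, 2): [25, 27, 26],
}

all_obs = [y for ys in data.values() for y in ys]
N = len(all_obs)
grand_mean = sum(all_obs) / N                      # Ybar_...
cell_means = {c: sum(ys) / len(ys) for c, ys in data.items()}

# Error SS: deviations of observations from their own cell mean.
ss_err = sum((y - cell_means[c]) ** 2 for c, ys in data.items() for y in ys)

# Regression SS: deviations of cell means from the grand mean, weighted by n_ij.
ss_reg = sum(len(ys) * (cell_means[c] - grand_mean) ** 2 for c, ys in data.items())

# Total SS: deviations of observations from the grand mean.
ss_tot = sum((y - grand_mean) ** 2 for y in all_obs)

# Error degrees of freedom N - ab (every cell is non-empty here, so ab = len(data)).
df_err = N - len(data)

print(ss_reg, ss_err, ss_tot, df_err)
```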

However, when the $n_{ij}$'s are unequal, we can no longer partition SSReg into SSA, SSB and SSAB. We can no longer write the ANOVA table with lines corresponding to the main effect of A (Line 1), the main effect of B (Line 2), and the interaction (Line 3) whose SS's add up to SSReg. We can still test the same hypotheses that we tested in the balanced case. For example, we can still test for additivity (no interactions). But these tests will have to be carried out as partial F-tests, which we learned about in Stat 511.

Recall that in a partial F-test, you need to fit two models, the smaller one (null model) and the larger one (alternative model). The numerator for the F-statistic will be the change in the fitted values,

$$\sum_{i} \sum_{j} \sum_{k} \left( \hat{Y}_{\text{null}} - \hat{Y}_{\text{alternative}} \right)^2 ,$$

divided by the number of extra parameters (the number of parameters in the larger model minus the number of parameters in the smaller model). The denominator of the partial F-statistic is the MSE from the larger model.

As a practical matter, this means that, for many hypothesis tests, we may have to explicitly fit two models and see how the SS's change in order to compute the numerator of the F-statistic for the desired test. We will give examples of this shortly, after discussing how to estimate terms in the linear model.
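The partial F computation described above can be sketched with NumPy. The helper name `partial_f` and the toy regression data are hypothetical, and the function assumes that the null design matrix is nested in the alternative one.

```python
import numpy as np

def partial_f(y, X_null, X_alt):
    """Partial F-statistic for nested OLS models (X_null nested in X_alt).

    Returns (F, numerator df, denominator df). Hypothetical helper for illustration.
    """
    fit_null = X_null @ np.linalg.lstsq(X_null, y, rcond=None)[0]
    fit_alt = X_alt @ np.linalg.lstsq(X_alt, y, rcond=None)[0]
    # Number of extra parameters in the larger model.
    extra = np.linalg.matrix_rank(X_alt) - np.linalg.matrix_rank(X_null)
    df_err = len(y) - np.linalg.matrix_rank(X_alt)
    # Numerator: change in fitted values per extra parameter.
    numerator = np.sum((fit_null - fit_alt) ** 2) / extra
    # Denominator: MSE from the larger model.
    mse_alt = np.sum((y - fit_alt) ** 2) / df_err
    return numerator / mse_alt, extra, df_err

# Toy example: intercept-only model versus a straight-line model.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
X_null = np.ones((4, 1))
X_alt = np.column_stack([np.ones(4), x])

F, df_num, df_den = partial_f(y, X_null, X_alt)
print(F, df_num, df_den)
```

For nested OLS models the squared change in fitted values equals the drop in the error sum of squares, so the same statistic can be obtained from the two SSErr values alone.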

Linear model formulation. Just as in the balanced-data case, we can re-express each cell mean $\mu_{ij}$ as the sum of main effects and interactions. That is, we can rewrite the cell-means model (1) as

$$Y_{ijk} = \mu_{..} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \qquad (2)$$

where

$$\mu_{..} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \mu_{ij}$$

is the mean of the cell means,

$$\alpha_i = \mu_{i.} - \mu_{..} = \frac{1}{b} \sum_{j=1}^{b} \mu_{ij} - \mu_{..}$$

is the main effect of $A = i$,

$$\beta_j = \mu_{.j} - \mu_{..} = \frac{1}{a} \sum_{i=1}^{a} \mu_{ij} - \mu_{..}$$

is the main effect of $B = j$, and

$$(\alpha\beta)_{ij} = \mu_{ij} - (\mu_{..} + \alpha_i + \beta_j)$$

is the interaction. Notice that these parameters are defined in exactly the same way as they were when the data were balanced. The definition of these parameters does not depend on the $n_{ij}$'s.

However, the estimates of these parameters are now a little different. The OLS estimates of the $\mu_{ij}$'s are still the $\bar{Y}_{ij.}$'s. We can still substitute the $\bar{Y}_{ij.}$'s for the $\mu_{ij}$'s into these expressions to obtain estimates for these parameters. But those expressions do not simplify as they did before. With unbalanced data, we can still write:

$$\hat{\mu}_{..} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \bar{Y}_{ij.}$$

but now this expression does not simplify to $\bar{Y}_{...}$. The reason is that $\mu_{..}$ is defined as an unweighted average of the $\mu_{ij}$'s, but $\bar{Y}_{...}$ is a weighted average of the $\bar{Y}_{ij.}$'s, weighted by the sample sizes $n_{ij}$. Similarly, the estimated main effect of $A = i$ is still

$$\hat{\alpha}_i = \frac{1}{b} \sum_{j=1}^{b} \hat{\mu}_{ij} - \hat{\mu}_{..}$$

but this no longer simplifies to $\bar{Y}_{i..} - \bar{Y}_{...}$. The estimated main effect for $B = j$ is still

$$\hat{\beta}_j = \frac{1}{a} \sum_{i=1}^{a} \hat{\mu}_{ij} - \hat{\mu}_{..}$$

but this no longer simplifies to $\bar{Y}_{.j.} - \bar{Y}_{...}$. And the estimated interaction is still

$$\widehat{(\alpha\beta)}_{ij} = \bar{Y}_{ij.} - (\hat{\mu}_{..} + \hat{\alpha}_i + \hat{\beta}_j)$$

but this no longer simplifies to $\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...}$.

Another important change as we move from balanced to unbalanced data is that the estimates for one set of terms will change if others are removed from the model. For example, if we move from the full model (2) to an additive model

$$Y_{ijk} = \mu_{..} + \alpha_i + \beta_j + \epsilon_{ijk} \qquad (3)$$

then the estimates of the $\alpha_i$'s and $\beta_j$'s will change. With balanced data, the estimated main effects are the same whether or not the interactions are present in the model. But with unbalanced data, we no longer have an orthogonal partition of SSReg into SSA, SSB and SSAB, and the estimates for any set of terms may change as other terms are added to or removed from the model.
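To make the difference between unweighted and weighted averages concrete, here is a small NumPy sketch for the full model. The cell means and cell sizes are made-up values for a 2 × 3 layout, and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical cell means Ybar_ij. and cell sizes n_ij for an unbalanced 2 x 3 layout.
ybar = np.array([[11.0, 15.0, 20.0],
                 [12.0, 19.0, 26.0]])
n = np.array([[2, 3, 1],
              [4, 2, 3]])

# Full-model estimates: unweighted averages of the cell means.
mu_hat = ybar.mean()                              # mu-hat_..
alpha_hat = ybar.mean(axis=1) - mu_hat            # main effects of A
beta_hat = ybar.mean(axis=0) - mu_hat             # main effects of B
ab_hat = ybar - (mu_hat + alpha_hat[:, None] + beta_hat[None, :])  # interactions

# Observation-level averages weight each cell by n_ij.
grand_mean = (n * ybar).sum() / n.sum()                # Ybar_...
row_means = (n * ybar).sum(axis=1) / n.sum(axis=1)     # Ybar_i..

print("mu_hat    =", mu_hat, " vs grand mean", grand_mean)
print("alpha_hat =", alpha_hat, " vs Ybar_i.. - Ybar_... =", row_means - grand_mean)
print("beta_hat  =", beta_hat)
print("ab_hat    =", ab_hat)
```

With equal $n_{ij}$'s the two columns of output would agree; with these unequal cell sizes they do not.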

How to compute estimates. Despite these complications with unbalanced data, all of the computations are very easy to carry out using OLS software. First, consider the full model

$$Y_{ijk} = \mu_{..} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$

To fit this model, we set up a design matrix as follows.

The first column of the design matrix will be a constant 1.

The next $a - 1$ columns will be a set of effect codes to distinguish among the levels of factor A.

The next $b - 1$ columns will be a set of effect codes to distinguish among the levels of factor B.

The remaining $(a - 1)(b - 1)$ columns are the products of each effect code for A with each effect code for B.

The estimated coefficients for this model will then be

$$\hat{\mu}_{..}, \hat{\alpha}_1, \ldots, \hat{\alpha}_{a-1}, \hat{\beta}_1, \ldots, \hat{\beta}_{b-1}, \widehat{(\alpha\beta)}_{11}, \ldots, \widehat{(\alpha\beta)}_{a-1,b-1}.$$

Because all of these constraints

$$\sum_{i=1}^{a} \alpha_i = 0, \qquad \sum_{j=1}^{b} \beta_j = 0, \qquad \text{and} \qquad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0$$

still hold, we can still compute $\hat{\alpha}_a$ as $-\sum_{i=1}^{a-1} \hat{\alpha}_i$ and so on.

To fit the additive model (3), we would omit the $(a - 1)(b - 1)$ product terms from the design matrix and re-fit the model. The estimates for the main effects will be different from what they were in the full model. The change in the regression sum of squares between these two models, divided by $(a - 1)(b - 1)$, will become the numerator of the F-statistic for testing the null hypothesis of additivity.
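The design-matrix recipe above can be sketched with NumPy as follows. The data are made up for a 2 × 3 layout, and `effect_codes` and `ols` are hypothetical helper names. The effect coding used here gives the last level of a factor the value -1 in every column, which is one common convention and an assumption on our part.

```python
import numpy as np

# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
data = {
    (0, 0): [10, 12], (0, 1): [14, 15, 16], (0, 2): [20],
    (1, 0): [11, 13, 12, 12], (1, 1): [18, 20], (1, 2): [25, 27, 26],
}
a, b = 2, 3

def effect_codes(level, n_levels):
    """Effect codes: level l gets a 1 in column l; the last level gets -1 everywhere."""
    if level == n_levels - 1:
        return [-1.0] * (n_levels - 1)
    return [1.0 if col == level else 0.0 for col in range(n_levels - 1)]

rows, y = [], []
for (i, j), ys in data.items():
    ea, eb = effect_codes(i, a), effect_codes(j, b)
    products = [u * v for u in ea for v in eb]      # (a-1)(b-1) interaction columns
    for value in ys:
        rows.append([1.0] + ea + eb + products)
        y.append(value)
X_full = np.array(rows)
y = np.array(y, dtype=float)

# Additive model (3): drop the product columns.
X_add = X_full[:, : 1 + (a - 1) + (b - 1)]

def ols(X, y):
    """Return OLS coefficients and the error sum of squares."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return coef, float(resid @ resid)

coef_full, sse_full = ols(X_full, y)
coef_add, sse_add = ols(X_add, y)

# Partial F-test of additivity: change in SS over (a-1)(b-1), divided by full-model MSE.
df_int = (a - 1) * (b - 1)
df_err = len(y) - a * b
F_additivity = ((sse_add - sse_full) / df_int) / (sse_full / df_err)

print("full model coefficients:    ", coef_full)
print("additive model coefficients:", coef_add)
print("F for additivity:", F_additivity, "on", df_int, "and", df_err, "df")
```

In the full model the intercept equals the unweighted mean of the cell means, and the main-effect coefficients differ from their values in the additive fit.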

ANOVA table? Depending on the software being used, you may be given an ANOVA table that looks like this:

Source      SS      df               MS
1. A        SSA     a - 1            MSA
2. B        SSB     b - 1            MSB
3. AB       SSAB    (a - 1)(b - 1)   MSAB
4. Error    SSErr   N - ab           MSErr

This looks very much like the ANOVA table from the balanced two-way ANOVA.

When interpreting this table, however, you need to understand what is actually being

printed in the SS column for Lines 1, 2 and 3.

They might be sequential (Type I) sums of squares computed by adding the terms to the model in the order listed. In that case, Line 1 will represent the improvement in fit when the $\alpha_i$'s are added to a model with no predictors (an intercept-only model); Line 2 will represent the improvement in fit when the $\beta_j$'s are added to a model with the $\alpha_i$'s already present; and Line 3 will be the improvement in fit when the $(\alpha\beta)_{ij}$'s are added to a model with the $\alpha_i$'s and $\beta_j$'s already present. The sequential SS's from Lines 1, 2 and 3 will add up to the overall SSReg.

Or they might be partial (Type III) sums of squares, which represent the contribution of each term when it is added last. The partial SS for Line 1 will represent the improvement in fit when the $\alpha_i$'s are added to a model that already contains the $\beta_j$'s and $(\alpha\beta)_{ij}$'s. The partial SS for Line 2 will represent the improvement in fit when the $\beta_j$'s are added to a model that already contains the $\alpha_i$'s and $(\alpha\beta)_{ij}$'s. And the partial SS for Line 3 will represent the improvement in fit when the $(\alpha\beta)_{ij}$'s are added to a model that already contains the $\alpha_i$'s and $\beta_j$'s. The partial SS's from Lines 1, 2 and 3 will not add up to the overall SSReg.

For Line 3, the partial and sequential SS's will be the same. In either case, we

can test for additivity by the F-statistic comparing Line 3 to Line 4.
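The two kinds of sums of squares can be compared directly by fitting the relevant sub-models. The sketch below does this with NumPy on made-up 2 × 3 data. The helper names and the effect-coding convention (last level coded -1) are assumptions for illustration, and the output is not meant to reproduce any particular software package.

```python
import numpy as np

# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
data = {
    (0, 0): [10, 12], (0, 1): [14, 15, 16], (0, 2): [20],
    (1, 0): [11, 13, 12, 12], (1, 1): [18, 20], (1, 2): [25, 27, 26],
}
a, b = 2, 3

def effect_codes(level, n_levels):
    """Effect codes: level l gets a 1 in column l; the last level gets -1 everywhere."""
    if level == n_levels - 1:
        return [-1.0] * (n_levels - 1)
    return [1.0 if col == level else 0.0 for col in range(n_levels - 1)]

rows, y = [], []
for (i, j), ys in data.items():
    ea, eb = effect_codes(i, a), effect_codes(j, b)
    for value in ys:
        rows.append([1.0] + ea + eb + [u * v for u in ea for v in eb])
        y.append(value)
X, y = np.array(rows), np.array(y, dtype=float)

# Column positions of each block of terms in the full design matrix.
I = [0]
idx_A = list(range(1, a))
idx_B = list(range(a, a + b - 1))
idx_AB = list(range(a + b - 1, a * b))
full = I + idx_A + idx_B + idx_AB

def sse(columns):
    """Error sum of squares for the OLS fit using only the given columns."""
    Xs = X[:, columns]
    resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
    return float(resid @ resid)

# Sequential (Type I): terms added in the order A, B, AB.
seq = {
    "A": sse(I) - sse(I + idx_A),
    "B": sse(I + idx_A) - sse(I + idx_A + idx_B),
    "AB": sse(I + idx_A + idx_B) - sse(full),
}
# Partial (Type III): each term added last.
part = {
    "A": sse(I + idx_B + idx_AB) - sse(full),
    "B": sse(I + idx_A + idx_AB) - sse(full),
    "AB": sse(I + idx_A + idx_B) - sse(full),
}
ss_reg = sse(I) - sse(full)

print("sequential:", seq, "sum =", sum(seq.values()))
print("partial:   ", part, "sum =", sum(part.values()))
print("SSReg:     ", ss_reg)
```

The sequential pieces sum to SSReg, the partial pieces do not, and the AB line is identical in both.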

For other tests, however, we need to be careful and understand what we are doing. If the table reports partial (Type III) sums of squares, then a comparison of Line 1 to Line 4 will be a valid test for no main effects of A,

$$H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0,$$

and a comparison of Line 2 to Line 4 will be a valid test for no main effects of B,

$$H_0 : \beta_1 = \beta_2 = \cdots = \beta_b = 0.$$

But as we discussed in a previous lecture, the interpretation of these tests is somewhat unusual when the interactions are not small and insignificant. If interactions are present, the null hypothesis $\alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$ does not mean that Factor A has no effects. Rather, it means that the effects of Factor A, when averaged over the levels of B (an equally weighted average), become zero. The correct way to test the null hypothesis that Factor A has no effects whatsoever is to compare the fit of the full model (2) to the model without $\alpha_i$'s or $(\alpha\beta)_{ij}$'s. That is, you may need to fit both of these models to compute the change in the regression SS when the main effects for A and the AB interactions are introduced. Or, if your regression software prints out an ANOVA table with sequential SS's, you may be able to get the quantities that you need by fitting the full model with Factor B introduced first and adding the sequential SS's for A and AB.
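As a sketch of the test that Factor A has no effects whatsoever: the model without $\alpha_i$'s or $(\alpha\beta)_{ij}$'s contains only Factor B, so its fitted values are the observation-level averages within each level of B, and no design matrix is needed. The data and variable names below are made up for illustration.

```python
# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
data = {
    (0, 0): [10, 12], (0, 1): [14, 15, 16], (0, 2): [20],
    (1, 0): [11, 13, 12, 12], (1, 1): [18, 20], (1, 2): [25, 27, 26],
}
a, b = 2, 3
N = sum(len(ys) for ys in data.values())

# Full model (2): fitted values are the cell means.
cell_means = {c: sum(ys) / len(ys) for c, ys in data.items()}
sse_full = sum((y - cell_means[c]) ** 2 for c, ys in data.items() for y in ys)

# Null model with only Factor B: fitted values are the averages Ybar_.j.
col_obs = {j: [y for (i, jj), ys in data.items() if jj == j for y in ys]
           for j in range(b)}
col_means = {j: sum(v) / len(v) for j, v in col_obs.items()}
sse_b_only = sum((y - col_means[j]) ** 2 for j, v in col_obs.items() for y in v)

# Extra parameters: (a-1) main effects of A plus (a-1)(b-1) interactions.
extra_df = (a - 1) + (a - 1) * (b - 1)
df_err = N - a * b
F_no_A = ((sse_b_only - sse_full) / extra_df) / (sse_full / df_err)

print("SSErr full:", sse_full, " SSErr B-only:", sse_b_only)
print("F =", F_no_A, "on", extra_df, "and", df_err, "df")
```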

Defining the effects differently. Whether the data are balanced or unbalanced, we have defined the terms in the linear model (2) in the same way, by taking unweighted averages of the $\mu_{ij}$'s. This is usually the sensible thing to do. Under ordinary circumstances, the parameters of a statistical model should not be defined with respect to the $n_{ij}$'s. Most modern textbooks, including ours, do it this way. However, some textbooks (especially older ones) describe coding schemes that effectively weight the $\mu_{ij}$'s in proportions determined by the $n_{ij}$'s. Alternative definitions for the terms in the linear model will lead to different expressions for the sums of squares due to A, B, and the AB interaction. We will not cover those alternative definitions, because in practice they are not used much anymore.

Contrasts. With unbalanced data, we can define contrasts among the $\mu_{ij}$'s, among the $\mu_{i.}$'s, and among the $\mu_{.j}$'s. To derive the SS's for a contrast, we follow the same procedure as before.

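Consider an arbitrary contrast among the $\mu_{ij}$'s,

$$L = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij} \mu_{ij}$$

where $\sum_i \sum_j c_{ij} = 0$. The least-squares estimate of the contrast is

$$\hat{L} = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij} \bar{Y}_{ij.}$$

and the variance of this estimate is

$$V(\hat{L}) = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij}^2 \left( \frac{\sigma^2}{n_{ij}} \right) = \sigma^2 \sum_{i=1}^{a} \sum_{j=1}^{b} \left( \frac{c_{ij}^2}{n_{ij}} \right).$$

The F-statistic for testing $H_0 : L = 0$ is

$$F = T^2 = \frac{\hat{L}^2}{S^2 \sum_{i=1}^{a} \sum_{j=1}^{b} \left( c_{ij}^2 / n_{ij} \right)} = \frac{\mathrm{SSL}}{S^2}$$

where $S^2$ is the MSE from the full model and

$$\mathrm{SSL} = \frac{\hat{L}^2}{\sum_{i=1}^{a} \sum_{j=1}^{b} \left( c_{ij}^2 / n_{ij} \right)}$$

is the sum of squares due to the contrast.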
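The contrast formulas can be sketched in a few lines. The example below uses made-up 2 × 3 data and a hypothetical contrast that compares the two levels of A, averaged with equal weights over the levels of B. All names and values are illustrative assumptions.

```python
# Hypothetical unbalanced 2 x 3 data set (illustrative values, not from the lecture).
data = {
    (0, 0): [10, 12], (0, 1): [14, 15, 16], (0, 2): [20],
    (1, 0): [11, 13, 12, 12], (1, 1): [18, 20], (1, 2): [25, 27, 26],
}
a, b = 2, 3
N = sum(len(ys) for ys in data.values())
n = {c: len(ys) for c, ys in data.items()}
cell_means = {c: sum(ys) / len(ys) for c, ys in data.items()}

# S^2 = MSErr from the full (cell-means) model.
ss_err = sum((y - cell_means[c]) ** 2 for c, ys in data.items() for y in ys)
mse = ss_err / (N - a * b)

# Contrast coefficients c_ij: +1/b for the first level of A, -1/b for the second.
c = {(i, j): (1.0 / b if i == 0 else -1.0 / b) for i in range(a) for j in range(b)}
coef_sum = sum(c.values())                      # should be zero for a contrast

L_hat = sum(c[cell] * cell_means[cell] for cell in c)   # estimated contrast
scale = sum(c[cell] ** 2 / n[cell] for cell in c)        # sum of c_ij^2 / n_ij
ss_L = L_hat ** 2 / scale                                # SS due to the contrast
F = ss_L / mse                                           # F = T^2, 1 numerator df

print("L_hat =", L_hat, " SSL =", ss_L, " F =", F)
```

Because the coefficients do not involve the $n_{ij}$'s, the contrast targets the unweighted difference between the two levels of A; only its precision depends on the cell sizes.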

Similarly, we can derive the sums of squares for a contrast among the $\mu_{i.}$'s or a contrast among the $\mu_{.j}$'s. This will be left as an exercise. A contrast among the $\mu_{i.}$'s will describe the effect of Factor A averaged over the levels of B (unweighted average), and a contrast among the $\mu_{.j}$'s will describe the effect of Factor B averaged over the levels of A (unweighted average).

At the end of Chapter 23, the textbook gives some examples of contrasts involving weighted averages where the weights are determined by the context of the problem.
