
Short Course — Applied Linear and Nonlinear Mixed Models*

Introduction

Mixed-effect models (or simply, “mixed models”) are like classical (“fixed-effects”) statistical models, except that some of the parameters describing group effects or covariate effects are replaced by random variables, or random effects.

• Thus the model has both parameters, also known as “fixed effects”, and random effects; hence the model has mixed effects.

• Random effects can be thought of as random versions of parameters. So, in some sense, a mixed model has both fixed and random parameters.

– This can be a useful way to think about it, but it’s really not quite right and can lead to confusion.

– The word “parameter” means fixed, unknown constant, so it is really something of an oxymoron to say “random parameter.”

– As we’ll see, the distinction between a parameter and a random effect goes well beyond vocabulary.

* Temple-Inland Forest Products, Inc., Jan. 17–18, 2005


Random effects arise when the observations being analyzed are heterogeneous, and can be thought of as belonging to several groups or clusters.

• This happens when there is one observation per experimental unit (tree, patient, plot, animal) and the experimental units occur or are measured in different locations, at different time points, from different sires or genetic strains, etc.

• Random effects also often arise when repeated measurements are taken on each experimental unit.

– E.g., several observations are taken through time of the height of 100 trees. The repeated height measurements are grouped, or clustered, by tree.

The use of random effects in linear models leads to linear mixed models (LMMs).

• LMMs are not new. Some examples from this class are among the simplest, most familiar linear models, and are very old.

• However, until recently, software and statistical methods for inference were not well-developed enough to handle the general case.

– Thus, only recently has the full flexibility and power of this class of models been realized.


Some Simple LMMs:

The one-way random effects model — Railway Rails:

(See Pinheiro and Bates, §1.1.) The data displayed below are from an experiment conducted to measure longitudinal (lengthwise) stress in railway rails. Six rails were chosen at random and tested three times each by measuring the time it took for a certain type of ultrasonic wave to travel the length of the rail.

[Figure: dotplot of zero-force travel time (nanoseconds, roughly 40–100) by rail; rails ordered 2, 5, 1, 6, 3, 4.]

Clearly, these data are grouped, or clustered, by rail. This clustering has two closely related implications:

1. (within-cluster correlation) we should expect that observations from the same rail will be more similar to one another than observations from different rails; and

2. (between-cluster heterogeneity) we should expect that the mean response will vary from rail to rail in addition to varying from one measurement to the next.

• These ideas are really flip-sides of the same coin.


Although it is fairly obvious that clustering by rail must be incorporated in the modeling of these data somehow, we first consider a naive approach.

The primary interest here is in measuring the mean travel time. Therefore, we might naively consider the model

yij = µ + eij , i = 1, . . . , 6, j = 1, . . . , 3,

where yij is the travel time for the jth trial on the ith rail, and we assume

e11, . . . , e63 iid∼ N(0, σ2).

• Here, the notation iid∼ N(0, σ2) means, “are independent, identically distributed random variables, each with a normal distribution with mean 0 and (constant) variance σ2.”

• In addition, µ is the mean travel time, which we wish to estimate. Its maximum likelihood (ML)/ordinary least-squares (OLS) estimate is the grand sample mean of all observations in the data set: y·· = 66.5.

• The mean square error (MSE) is s2 = 23.645², which estimates the error variance σ2.
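This naive fit can be sketched in a couple of lines of R, using the Rail data that ship with the nlme package, which I am assuming match the data shown (the response column is travel):

```r
# Naive grand-mean fit, using nlme's Rail data (assumed to match the data shown).
library(nlme)
data(Rail)
fit <- lm(travel ~ 1, data = Rail)   # y_ij = mu + e_ij
coef(fit)                            # grand mean, about 66.5
summary(fit)$sigma                   # residual SD s, about 23.645
```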

However, an examination of the residuals from this model, plotted separately by rail, reveals the inadequacy of the model:

[Figure: boxplots of raw residuals (about −40 to 20) by rail for the simple mean model; rails ordered 2, 5, 1, 6, 3, 4.]


Clearly, the mean response is changing from rail to rail. Therefore, we consider a one-way ANOVA model:

yij = µ + αi + eij . (∗)

Here, µ is a grand mean across the rails included in the experiment, and αi is an effect up or down from the grand mean specific to the ith rail.

Alternatively, we could define µi = µ + αi as the mean response for the ith rail and reparameterize this model as

yij = µi + eij.

The OLS estimates of the parameters of this model are µ̂i = ȳi·, giving (µ̂1, . . . , µ̂6) = (54.00, 31.67, 84.67, 96.00, 50.00, 82.67) and s2 = 4.02². The residual plot looks much better:

[Figure: boxplots of raw residuals (about −6 to 6) by rail for the one-way fixed effects model; rails ordered 2, 5, 1, 6, 3, 4.]

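The one-way fixed effects fit can be sketched similarly (again assuming nlme's Rail data match the data shown):

```r
# One-way fixed-effects (cell means) fit to the rail data.
library(nlme)
data(Rail)
with(Rail, tapply(travel, Rail, mean))  # the six rail means
fit <- lm(travel ~ Rail, data = Rail)   # y_ij = mu_i + e_ij
summary(fit)$sigma                      # residual SD s, about 4.02
```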

However, there are still drawbacks to this one-way fixed effects model:

• It only models the specific sample of rails used in the experiment, while the main interest is in the population of rails from which these rails were drawn.

• It does not produce an estimate of the rail-to-rail variability in travel time, which is a quantity of significant interest in the study.

• The number of parameters increases linearly with the number of rails used in the experiment.

These deficiencies are overcome by the one-way random effects model.

To motivate this model, consider again the one-way fixed effects model. Model (∗) can be written as

yij = µ + (µi − µ) + eij

where, under the usual constraint Σi αi = 0, (µi − µ) = αi has mean 0 when averaged over the groups (rails).

The one-way random effects model replaces the fixed parameter (µi − µ) with a random effect bi, a random variable specific to the ith rail, which is assumed to have mean 0 and an unknown variance σ2b. This yields the model

yij = µ + bi + eij , (∗∗)

where b1, . . . , b6 are independent random variables, each with mean 0 and variance σ2b. Often the bi’s are assumed normal, and they are usually assumed independent of the eij’s. Thus we have

b1, . . . , ba iid∼ N(0, σ2b), independent of e11, . . . , ean iid∼ N(0, σ2),

where a is the number of rails and n is the number of observations on the ith rail.


• Note that now the interpretation of µ changes from the mean over the 6 rails included in the experiment (fixed effects model) to the mean over the population of all rails from which the six rails were sampled.

• In addition, we don’t estimate µi, the mean response for rail i, which is not of interest. Instead we estimate the population mean µ and the rail-to-rail variance in the population, σ2b.

– That is, our scope of inference has changed from the six rails included in the study to the population of rails from which those six rails were drawn.

In addition:

• we can estimate the rail-to-rail variability, σ2b; and

• the number of parameters no longer increases with the number of rails tested in the experiment.

– The parameters in the fixed-effect model were the grand mean µ, the rail-specific effects α1, . . . , αa, and the error variance σ2.

– In the random effects model, the only parameters are µ, σ2, and σ2b.

σ2b quantifies heterogeneity from rail to rail, which is one consequence of having observations that are grouped, or clustered, by rail. But what about within-rail correlation?


Unlike a purely fixed-effect model, the one-way random effects model does not assume that all of the responses are independent. Instead, it implies that observations that share the same random effect are correlated.

• E.g., for two observations from the ith rail, yi1 and yi3, say, the model implies

yi1 = µ + bi + ei1 and
yi3 = µ + bi + ei3.

That is, yi1 and yi3 share the random effect bi, and are therefore correlated.

Why?

Because one can easily show that

var(yij) = σ2b + σ2,
cov(yij , yij′) = σ2b , for j ≠ j′,
corr(yij , yij′) = ρ ≡ σ2b/(σ2b + σ2), for j ≠ j′, and
cov(yij , yi′j′) = 0, for i ≠ i′.

That is, if we stack up all of the observations from the ith rail (the observations that share the random effect bi) as yi = (yi1, . . . , yin)T , then

var(yi) = (σ2b + σ2) ·
    | 1  ρ  · · ·  ρ |
    | ρ  1  · · ·  ρ |
    | ·  ·  · · ·  · |
    | ρ  ρ  · · ·  1 |      (†)

and groups of observations from different rails (those that do not share random effects) are independent.
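The equivalence between the covariance expressions above and the matrix form (†) can be checked numerically in R; the variance-component values below are illustrative, chosen to be close to the rail estimates:

```r
# Check: sigma2b*J + sigma2*I equals (sigma2b + sigma2) times the
# compound-symmetry correlation matrix in (†). Values are illustrative.
sigma2b <- 24.81^2; sigma2 <- 4.02^2; n <- 3
Sigma <- matrix(sigma2b, n, n) + diag(sigma2, n)  # var(y_i) from the cov expressions
rho   <- sigma2b / (sigma2b + sigma2)
CS    <- (sigma2b + sigma2) * (matrix(rho, n, n) + diag(1 - rho, n))
all.equal(Sigma, CS)  # TRUE
```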


• The variance-covariance structure given by (†) has a special name: compound symmetry. This means that

– observations from the same rail all have constant variance equal to σ2 + σ2b, and

– all pairs of observations from the same rail have constant correlation equal to

ρ = σ2b/(σ2 + σ2b).

– ρ, the correlation between any two observations from the same rail, is called the intraclass correlation coefficient.

• In addition, because the total variance of any observation is var(yij) = σ2b + σ2, the sum of two terms, σ2b and σ2 are called variance components.
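For the rail data, plugging the variance-component estimates reported for model (**) (between-rail SD about 24.81, within-rail SD about 4.02) into the intraclass correlation gives a value near 1:

```r
# Intraclass correlation for the rail data, using the variance-component
# estimates reported for model (**).
sigma.b <- 24.81; sigma <- 4.02
rho <- sigma.b^2 / (sigma.b^2 + sigma^2)
round(rho, 3)  # about 0.974: measurements on the same rail are very highly correlated
```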


• Both fixed-effects and random-effects versions of the one-way model are fit to these data in intro.R.

• For (∗), the fixed-effect version of the one-way model, we obtain µ̂ = 66.5 with a standard error of 0.948.

• For (∗∗), the random-effect version of the one-way model, we obtain µ̂ = 66.5 with a standard error of 10.17.

– The standard error is larger in the random-effects model because this model has a larger scope of inference.

– That is, the two models are estimating different µ’s: the fixed effects model is estimating the grand mean for the six rails in the study; the random effects model is estimating the grand mean for all possible rails.

– It makes sense that we would be much less certain of (i.e., there would be more error in) our estimate of the latter quantity, especially if there is a lot of rail-to-rail variability.

• The usual method of moments/ANOVA/REML estimates of the variance components in model (∗∗) are σ̂2b = 24.81² and σ̂2 = 4.02², so here there is much more between-rail variability than within-rail variability.
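The random-effects fit (**) can be sketched with nlme's lme(), in the spirit of intro.R (again assuming the packaged Rail data match the data shown):

```r
# Random-effects fit of model (**) to the rail data.
library(nlme)
data(Rail)
fm <- lme(travel ~ 1, random = ~ 1 | Rail, data = Rail)
fixef(fm)    # mu-hat, about 66.5
VarCorr(fm)  # SDs about 24.81 (between rails) and 4.02 (within rails)
```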


The randomized complete block model — Stool Example:

In the last example, the data were grouped by rail and we were interested in only one treatment (there was only one experimental condition under which the travel time along the rail was measured).

Often, several treatments are of interest and the data are grouped. In a randomized complete block design (RCBD), each of a treatments is observed in each of n blocks.

As an example, consider the data displayed below. These data come from an experiment to compare the ergonomics of four different stool designs. n = 9 subjects were asked to sit in each of a = 4 stools. The response measured was the amount of effort required to stand up.

[Figure: effort required to arise (Borg scale, roughly 8–14) plotted by subject (subjects ordered 8, 5, 4, 9, 6, 3, 7, 1, 2), with separate symbols for stool types T1–T4.]


Here, subjects form the blocks, and we have a complete set of treatments observed in each block (each subject tests each stool). Thus we have an RCBD.

Let yij be the response for the jth stool type tested by the ith subject.

The classical fixed effects model for the RCBD assumes

yij = µ + αj + βi + eij
    = µj + βi + eij ,     i = 1, . . . , n, j = 1, . . . , a,

where e11, . . . , ena iid∼ N(0, σ2).

• Here, µj is the mean response for the jth stool type, which can be broken apart into a grand mean µ and a stool type effect αj . βi is a fixed subject effect.

• Again, the scope of inference for this model is the set of 9 subjects used in this experiment.

• If we wish to generalize to the population from which the 9 subjects were drawn, we would consider the subject effects to be random.


The RCBD model with random block effects is

yij = µj + bi + eij ,

where

b1, . . . , bn iid∼ N(0, σ2b), independent of e11, . . . , ena iid∼ N(0, σ2).

• Since the µj’s are fixed and the bi’s are random, this is a mixed model.

The variance-covariance structure here is quite similar to that in the one-way random effects model.

Again, the model implies that any two observations that share a random effect (i.e., any two observations from the same block) are correlated.

In fact, the same compound symmetry structure holds. In particular, if yi = (yi1, . . . , yia)T is the vector of observations from the ith block, then, as in the last example,

var(yi) = (σ2b + σ2) ·
    | 1  ρ  · · ·  ρ |
    | ρ  1  · · ·  ρ |
    | ·  ·  · · ·  · |
    | ρ  ρ  · · ·  1 |      (†)

• All pairs of observations from the same block have correlation ρ = σ2b/(σ2 + σ2b);

• all pairs of observations from different blocks are independent; and

• all observations have variance σ2 + σ2b (two components: the within-block and between-block variances).

• The RCBD model is fit to these data in intro.R, first treating blocks as fixed, then as random.
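Those two fits can be sketched as follows, using the ergoStool data that ship with nlme (assumed to match the data shown; effort is the Borg-scale response, Type the stool design, Subject the block):

```r
# RCBD fits with fixed vs. random block (Subject) effects.
library(nlme)
data(ergoStool)
fit.fixed  <- lm(effort ~ Type + Subject, data = ergoStool)                 # blocks fixed
fit.random <- lme(effort ~ Type, random = ~ 1 | Subject, data = ergoStool)  # blocks random
VarCorr(fit.random)  # between-subject and residual SDs, about 1.33 and 1.10
```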


It is often stated that whether block effects are assumed random or fixed does not affect the analysis of the RCBD.

This is not completely true.

It is true that whether or not blocks are treated as random does not affect the ANOVA F test for treatments. Either way, we test for equal treatment means with the test statistic F = MSTrt/MSE.

However, there are important differences in the analysis under the two models. These differences affect inferences on treatment means.

For instance, the variance of a treatment mean is

var(y·j) = σ2/n           for fixed block effects,
var(y·j) = (σ2b + σ2)/n   for random block effects.

Substituting the usual method of moments/ANOVA estimators for σ2 and σ2b leads to a standard error of

s.e.(y·j) = √var(y·j) = √[(MSBlocks + (a − 1)MSE)/(na)]   for random block effects,
s.e.(y·j) = √var(y·j) = √(MSE/n)                          for fixed block effects.

• Again, the standard error of a treatment mean is larger in the random effects model, because the scope of inference is broader.

– For these data, s.e.(µ̂j) = 0.367 in the fixed block effects model, and s.e.(µ̂j) = 0.576 in the random block effects model.

• For these data, the estimated between- and within-subject variance components are σ̂2b = 1.332² and σ̂2 = 1.102².

– This means that the estimated correlation between any pair of observations on the same subject is

ρ̂ = σ̂2b/(σ̂2 + σ̂2b) = 1.332²/(1.102² + 1.332²) = 0.59.
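The two standard errors and the intraclass correlation quoted above follow directly from the variance-component estimates; a quick base-R check:

```r
# Standard errors of a treatment mean under fixed vs. random block effects,
# and the intraclass correlation, from the stool-study estimates.
sigma.b <- 1.332; sigma <- 1.102; n <- 9
se.fixed  <- sigma / sqrt(n)                    # about 0.367
se.random <- sqrt((sigma.b^2 + sigma^2) / n)    # about 0.576
rho       <- sigma.b^2 / (sigma^2 + sigma.b^2)  # about 0.59
c(se.fixed, se.random, rho)
```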


A Split-plot model — Grass Example:

A split-plot experimental design is one in which two sizes of experimental unit are used. The larger experimental unit, known as the whole plot, is randomized to some experimental design (an RCBD, say).

The whole plot is then subdivided into smaller units, known as split plots, which are assigned to a second experimental design within each whole plot.

Example: A study of the effects of three bacterial inoculation treatments and two cultivar types on grass yield was conducted as follows.

• Four fields, or blocks, were divided in half, and the two cultivars (A1 and A2) were assigned at random to be grown in the two halves of each field.

• Then each half-field (the whole plot) was divided into three sub-units, or split plots, and the three inoculation treatments (B1, B2, and B3) were randomly assigned to the three split plots in each whole plot.

The resulting design and data are as follows:

Block 1:  A1: B2 29.7, B1 27.4, B3 34.5  |  A2: B3 34.4, B1 29.4, B2 32.5
Block 2:  A2: B3 36.4, B2 32.4, B1 28.7  |  A1: B1 28.9, B2 28.7, B3 33.4
Block 3:  A2: B2 29.1, B1 27.2, B3 32.6  |  A1: B1 28.6, B3 32.9, B2 29.7
Block 4:  A1: B1 26.7, B3 31.8, B2 28.9  |  A2: B3 30.7, B2 28.6, B1 26.8

(Each block is listed as its two half-fields, i.e., whole plots, with the three split-plot inoculation treatments and yields in field order.)

• Here it was easier to randomize the planting of the two cultivars to a few large units (the whole plots) than to many small units (the split plots).

– Convenience is the motivation for this design.


• Here the 8 half-fields (two per block) are the whole plots, and cultivar is the whole plot factor.

• The 24 sub-units within the half-fields are the split plots, and inoculation type is the split plot factor.

The Data:

yijk ,  i = 1, . . . , a (levels of the W.P. factor),
        j = 1, . . . , n (blocks),
        k = 1, . . . , b (levels of the S.P. factor)

• That is, yijk is the response for the ith cultivar in the jth block treated with the kth inoculation type.

Model:

yijk = µ + αi + τj + bij + βk + (αβ)ik + eijk,

where

αi = effect of the ith cultivar
τj = effect of the jth block (treated here as fixed)
βk = effect of the kth inoculation treatment

(αβ)ik = interaction between cultivars and inoculations

In addition,

bij’s iid∼ N(0, σ2b), independent of the eijk’s iid∼ N(0, σ2)

• The bij’s are sometimes described as “whole plot error terms.” In a sense, that is what a random effect is: an additional error term in the model.


The bij’s are random effects for each whole plot (one for each half-field). They account for:

• heterogeneity from one whole plot to the next (quantified by σ2b);

• correlation among the three split-plots within a given whole plot.

Again, the variance-covariance structure in this model is compound symmetric. The model implies

var(yij) = (σ2b + σ2) ·
    | 1  ρ  · · ·  ρ |
    | ρ  1  · · ·  ρ |
    | ·  ·  · · ·  · |
    | ρ  ρ  · · ·  1 |

where yij = (yij1, yij2, . . . , yijb)T (the vector of all observations on the ijth whole plot).

This means:

• all pairs of observations from the same whole plot have correlation ρ = σ2b/(σ2 + σ2b);

• all pairs of observations from different whole plots are independent; and

• all observations have variance σ2 + σ2b (two components: the within-whole-plot and between-whole-plot variances).

• The split plot model is fit to these data with the lme() function in S-PLUS/R in intro.R.

• Note that here we’ve treated blocks as fixed. Later, we’ll return to this example and model block effects as random.
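A sketch of that lme() fit, with the data entered from the table above (the variable names and the WP grouping factor are my own choices for illustration, not necessarily those used in intro.R):

```r
# Split-plot fit: fixed blocks, cultivar, and inoculation; random whole-plot effects.
library(nlme)
d <- data.frame(
  Block = factor(rep(1:4, each = 6)),
  Cult  = factor(c("A1","A1","A1","A2","A2","A2", "A2","A2","A2","A1","A1","A1",
                   "A2","A2","A2","A1","A1","A1", "A1","A1","A1","A2","A2","A2")),
  Inoc  = factor(c("B2","B1","B3","B3","B1","B2", "B3","B2","B1","B1","B2","B3",
                   "B2","B1","B3","B1","B3","B2", "B1","B3","B2","B3","B2","B1")),
  y     = c(29.7,27.4,34.5,34.4,29.4,32.5, 36.4,32.4,28.7,28.9,28.7,33.4,
            29.1,27.2,32.6,28.6,32.9,29.7, 26.7,31.8,28.9,30.7,28.6,26.8))
d$WP <- with(d, interaction(Block, Cult))  # one level per half-field (whole plot)
fit <- lme(y ~ Block + Cult * Inoc, random = ~ 1 | WP, data = d)
anova(fit)  # cultivar should be tested against whole-plot variation (3 denom. d.f.)
```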


The Experimental Unit, Pseudoreplication, D.F., & Balance

• The split-plot design involves two different experimental units: the whole plot and the split plot.

• Whole plots are randomly assigned to the whole plot factor.

– E.g., half-fields were randomized to the two cultivars.

• There are many fewer whole plot experimental units than there are observations (which equals the number of split plots in the experiment).

– Only 8 half-fields, but 24 observations in the grass experiment.

• With respect to cultivar, then, 8 experimental units are randomized. So degrees of freedom for testing cultivar are based on a sample size of 8.

– At the whole plot level, we have an RCBD with two treatments (cultivars) and four blocks, so the error d.f. for testing cultivars is the error d.f. in an RCBD of this size, namely (2 − 1)(4 − 1) = 3.

• With respect to cultivar, the measurements on the three split plots in each whole plot are pseudoreplicates (or subsamples).

– That is, they are not independently randomized to cultivars and thus provide no additional d.f. (information) regarding cultivar effects.


In some sense, modeling whole plots with random effects:

— identifies the appropriate error term for the whole plot factor;

— identifies the appropriate d.f. (the amount of relevant information in the data/design) for testing the whole plot factor (cultivar); and

— identifies which units are true experimental units and which are pseudoreplicates with respect to each experimental factor.

• If a purely fixed-effect model is used in the split-plot design*, then the usual MSE and DFE based upon the eijk error term will lead to incorrect inferences on the whole plot factor.

– See model grass.lm1 in intro.R.

• Correct inferences on whole plot factors can sometimes be obtained from a fixed-effects analysis, but

– you have to really know what you’re doing, especially in complex situations like split-split-plot models, etc.; and

– the design has to be balanced (rare!).

• So, the use of random effects is also motivated by

– the use of multiple sizes of experimental units with distinct randomizations;

– the presence of pseudoreplication; and

– imbalance.

• Mixed effects models handle these complications much more “automatically” than fixed-effects models and, consequently, avoid the incorrect inferences to which fixed-effects models are prone in these situations.

* This would be done by modeling variability among whole plots with a fixed cultivar × block interaction effect.


A More Complex Example — PMRC Site Preparation Study:

• Study of various site preparation and intensive management regimes on the growth of slash pine.

• Involved 191 0.2-ha plots nested within 16 sites in the lower coastal plain of GA and FL.

• Data consist of repeated plot-level measurements of hd = dominant height (m), ba = basal area (m2/ha), tph (100s of trees per ha), derived volume (total volume outside bark in m3/ha), and other variables at ages 2, 5, 8, 11, 14, 17, and 20 years.

• At each site, plots were randomized to eleven treatments consisting of a subset of the 2⁵ = 32 combinations of five two-level (absent/present) treatment factors:

A = Chop, site prep with a single pass of a rolling drum chopper;
B = Fert, fertilizer following the 1st, 12th, and 17th growing seasons;
C = Burn, broadcast burn of the site prior to planting;
D = Bed, a double-pass bedding of the site;
E = Herb, vegetation control with chemical herbicide.


Here is a plot of the data. Each panel represents a site, and each panel contains tph-over-time profiles for each plot on that site.

[Figure: separate tph (100s of trees per hectare) vs. age (yrs) profiles for each plot, graphed separately by site; each panel is a site.]

• These data are grouped, or clustered, by site.

– We would expect heterogeneity from site to site.
– We would expect correlation among plots within the same site.

• Data are also grouped by plot, since we have repeated measures through time on each plot.

– Again, we would expect plots to be heterogeneous.
– We would expect stronger correlation among observations from the same plot than among observations from different plots.


• In addition, we’d like to make inferences about the population of plantation sites for which these sites are representative, not just these sites alone.

• We would also like to be able to generalize to the population from which these plots are drawn.

• Hence, it makes sense to model sites with random site effects and plots with random plot effects.

– Plots are nested within sites. This would be an example of a multilevel mixed model.

• In addition, plots are randomized to treatments, and then repeated measures through time are taken on each plot.

– With respect to treatments, plots are the experimental unit, but measurement occurs at a finer scale: times within plots.

• These time-specific measurements are a bit like measurements on split plots.

– However, in a split-plot example, observations from the same whole plot are correlated due to shared characteristics of that whole plot. These are captured by whole plot random effects.

– In a repeated measures context, observations through time from the same unit are correlated due to shared characteristics of that unit, and are also subject to serial correlation (observations taken close together in time are more similar than observations taken far apart in time).

– Thus, in a repeated measures context, we may want both random effects and serial correlation built into our model.

• We’ll soon see how multilevel random effects, serial correlation, and other features can be handled in the general form of the LMM.
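The shape of such a multilevel fit in nlme can be sketched with simulated data (all names, sizes, and effect values below are invented purely for illustration; the PMRC data are not reproduced here):

```r
# Multilevel (nested) random effects: plots nested within sites.
library(nlme)
set.seed(42)
d <- expand.grid(site = factor(1:6), plot = factor(1:4), age = c(2, 5, 8, 11))
site.eff <- rnorm(6, sd = 2)   # site-level random effects
plot.eff <- rnorm(24, sd = 1)  # plot-within-site random effects
d$hd <- 1 + 0.8 * d$age +
        site.eff[d$site] +
        plot.eff[as.integer(interaction(d$site, d$plot))] +
        rnorm(nrow(d), sd = 0.5)
fm <- lme(hd ~ age, random = ~ 1 | site/plot, data = d)
VarCorr(fm)  # site, plot-within-site, and residual variance components
```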


Fixed vs. random effects: The effects in the model account for variability in the response across levels of treatment and design factors. The decision as to whether fixed effects or random effects should be used depends upon what the appropriate scope of generalization is.

• If it is appropriate to think of the levels of a factor as randomly drawn from, or otherwise representative of, a population to which we’d like to generalize, then random effects are suitable.

– Design or grouping factors are usually more appropriately modeled with random effects.

– E.g., blocks (sections of land) in an agricultural experiment, days when an experiment is conducted over several days, lab technician when measurements are taken by several technicians, subjects in a repeated measures design, or locations or sites along a river when we desire to generalize to the entire river.

• If, however, the specific levels of the factor are of interest in and of themselves, then fixed effects are more appropriate.

– Treatment factors are usually more appropriately modeled with fixed effects.

– E.g., in experiments to compare drugs, amounts of fertilizer, hybrids of corn, teaching techniques, and measurement devices, these factors are most appropriately modeled with fixed effects.

• A good litmus test for whether the level of some factor should be treated as fixed is to ask whether it would be of broad interest to report a mean for that level. For example, if I’m conducting an experiment in which each of four different classes of third-grade students are taught with each of three methods of instruction (e.g., in a crossover design), then it will be of broad interest to report the mean response (level of learning, say) for a particular method of instruction, but not for a particular classroom of third graders.

– Here, fixed effects are appropriate for instruction method, and random effects for class.


Preliminaries/Background

• In order to really understand the LMM, we need to study it in its vector/matrix form. So, we need to discuss/review random vectors and the multivariate normal distribution.

• We also need to review the classical linear model (CLM) before generalizing to the LMM.

• Estimation in the CLM is based on least squares, but in the LMM, maximum likelihood (ML) estimation is used. Therefore, we need to cover/review the basic ideas of ML estimation.

Random Vectors:

Random Vector: A vector whose elements are random variables. E.g.,

y = (y1, y2, . . . , yn)T ,

where y1, y2, . . . , yn are each random variables.

• Random vectors we will be concerned with:

– A vector containing the response variable measured on n units in the sample: y = (y1, . . . , yn)T .
– A vector of error terms in a model for y: e = (e1, . . . , en)T .
– A vector of random effects: b = (b1, b2, . . . , bq)T .

Expected Value: The expected value (population mean) of a random vector is the vector of expected values, often denoted µ. For yn×1,

E(y) = (E(y1), E(y2), . . . , E(yn))T ≡ (µ1, µ2, . . . , µn)T = µ.


(Population) Variance-Covariance Matrix: For a random vector y (n × 1) = (y1, y2, . . . , yn)^T with mean µ = (µ1, µ2, . . . , µn)^T, the matrix

E[(y − µ)(y − µ)^T] =

⎛ var(y1)      cov(y1, y2)  · · ·  cov(y1, yn) ⎞
⎜ cov(y2, y1)  var(y2)      · · ·  cov(y2, yn) ⎟
⎜    ⋮             ⋮          ⋱        ⋮       ⎟
⎝ cov(yn, y1)  cov(yn, y2)  · · ·  var(yn)     ⎠

≡

⎛ σ11  σ12  · · ·  σ1n ⎞
⎜ σ21  σ22  · · ·  σ2n ⎟
⎜  ⋮    ⋮     ⋱     ⋮  ⎟
⎝ σn1  σn2  · · ·  σnn ⎠

is called the variance-covariance matrix of y and is denoted var(y).

(Population) Correlation Matrix: For a random vector y (n × 1), the population correlation matrix is the matrix of correlations among the elements of y:

corr(y) =

⎛ 1             corr(y1, y2)  · · ·  corr(y1, yn) ⎞
⎜ corr(y2, y1)  1             · · ·  corr(y2, yn) ⎟
⎜     ⋮              ⋮          ⋱         ⋮       ⎟
⎝ corr(yn, y1)  corr(yn, y2)  · · ·  1            ⎠ .

• Recall: for random variables yi and yj,

corr(yi, yj) = cov(yi, yj) / √( var(yi) var(yj) )

measures the amount of linear association between yi and yj.

• Correlation matrices are symmetric.
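The relationship between the two matrices above can be sketched numerically: each covariance σij is scaled by the two standard deviations. A Python illustration (the numbers are made up, not from the course):

```python
# Illustration (hypothetical numbers): converting a var-cov matrix Sigma
# into the correlation matrix via corr(yi, yj) = sigma_ij / sqrt(sigma_ii * sigma_jj).
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 9.0]])          # var(y1) = 4, var(y2) = 9, cov = 1.2
d = 1.0 / np.sqrt(np.diag(Sigma))       # 1 / sqrt(sigma_ii) for each i
R = Sigma * np.outer(d, d)              # elementwise scaling gives corr(y)

print(R)                                # off-diagonal = 1.2 / (2 * 3) = 0.2
```

Note the diagonal of any correlation matrix is all 1's, since corr(yi, yi) = 1.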


Properties of expected value, variance:

Let x, y be random vectors of the same dimension, and let C and c be a matrix and vector, respectively, of constants. Then

1. E(y + c) = E(y) + c.

2. E(x + y) = E(x) + E(y).

3. E(Cy) = CE(y).

4. var(y + c) = var(y).

5. var(y + x) = var(y) + var(x) + cov(y, x) + cov(x, y), where cov(y, x) + cov(x, y) = 0 if x and y are independent.

6. var(Cy) = C var(y) C^T.
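Property 6 can be checked empirically with simulated data. A Python sketch (hypothetical Σ and C, not from the notes):

```python
# Empirical check of property 6: var(Cy) = C var(y) C^T.
# Simulate a large sample of bivariate y's and compare the sample
# var-cov matrix of Cy with the theoretical value.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # true var-cov matrix of y
C = np.array([[1.0, 1.0],
              [1.0, -1.0]])               # arbitrary constant matrix

y = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
var_Cy = np.cov((y @ C.T).T)              # sample var-cov matrix of Cy
theory = C @ Sigma @ C.T                  # C var(y) C^T = [[4, 1], [1, 2]]
```

With 200,000 draws the sample matrix agrees with the theoretical one to about two decimal places.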


Multivariate normal distribution:

• The multivariate normal distribution is to a random vector as the univariate (usual) normal distribution is to a random variable.

– It is the version of the normal distribution appropriate to the joint distribution of several random variables (collected and stacked as a vector) rather than a single random variable.

• Recall that we write y ∼ N(µ, σ²) to signify that the univariate r.v. y has the normal distribution with mean µ and variance σ².

– This means that y has probability density function (p.d.f.)

fY(y) = (1/√(2πσ²)) exp[ −(y − µ)² / (2σ²) ]

– Meaning: for two values y1 < y2, the area under the graph of the p.d.f. between y1 and y2 gives Pr(y1 < Y < y2).

• We write y ∼ Nn(µ, Σ) to denote that y follows the n-dimensional multivariate normal distribution with mean µ and variance-covariance matrix Σ.

• E.g., for a bivariate random vector y = (y1, y2)^T ∼ N2(µ, Σ), the p.d.f. of y maps out a bell over the (y1, y2) plane, centered at µ, with spread described by Σ.

• Recall that for y ∼ N(µ, σ²) the p.d.f. of y is

f(y) = (2πσ²)^{−1/2} exp{ −(1/2) (y − µ)²/σ² },

• In the multivariate case, for y ∼ Nn(µ, Σ), the p.d.f. of y is

f(y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (y − µ)^T Σ^{−1} (y − µ) }.

– Here |Σ| denotes the determinant of the var-cov matrix Σ.
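To make the formula concrete, here is a small Python check (an illustration, not part of the notes) that the multivariate p.d.f. reduces to the univariate normal p.d.f. when n = 1:

```python
# Numerical sketch of the N_n(mu, Sigma) density formula above.
# For n = 1 it must agree with the usual univariate normal density.
import numpy as np

def mvn_pdf(y, mu, Sigma):
    """Density of N_n(mu, Sigma) at y, computed from the formula above."""
    n = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (y - mu)^T Sigma^{-1} (y - mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# n = 1 check against (2*pi*sigma^2)^{-1/2} exp(-(y - mu)^2 / (2 sigma^2)):
y_val, mu_val, s2 = 1.3, 0.5, 2.0
uni = np.exp(-(y_val - mu_val) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
multi = mvn_pdf(np.array([y_val]), np.array([mu_val]), np.array([[s2]]))
```

Here `mvn_pdf` is a throwaway helper defined only for this illustration.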


Review of Classical (Fixed-Effects) Linear Model

Assume we observe a sample of independent pairs, (y1, x1), . . . , (yn, xn), where yi is a response variable and xi = (xi1, . . . , xip)^T is a p × 1 vector of explanatory variables.

The classical linear model can be written

yi = β1xi1 + · · · + βpxip + ei = xi^T β + ei,   i = 1, . . . , n,

where e1, . . . , en ∼ iid N(0, σ²).

Equivalently, we can stack these n equations and write the model as follows:

⎛ y1 ⎞   ⎛ x11  x12  · · ·  x1p ⎞ ⎛ β1 ⎞   ⎛ e1 ⎞
⎜ ⋮  ⎟ = ⎜  ⋮    ⋮     ⋱     ⋮  ⎟ ⎜ ⋮  ⎟ + ⎜ ⋮  ⎟
⎝ yn ⎠   ⎝ xn1  xn2  · · ·  xnp ⎠ ⎝ βp ⎠   ⎝ en ⎠

or y = Xβ + e.

• Our assumptions on e1, . . . , en can be equivalently restated as

e ∼ Nn(0, σ²In).

• Since y = Xβ + e and e ∼ Nn(0, σ²In), it follows that y is multivariate normal too:

y ∼ Nn(Xβ, σ²In).

• The var-cov matrix for y is

σ²In =

⎛ σ²  0   · · ·  0  ⎞
⎜ 0   σ²  · · ·  0  ⎟
⎜ ⋮    ⋮    ⋱    ⋮  ⎟
⎝ 0   0   · · ·  σ² ⎠

• ⇒ the yi's are uncorrelated and have constant variance σ².

• Therefore, in the CLM y is assumed to have a multivariate normal joint p.d.f.


Estimation of β and σ²:

Maximum likelihood estimation:

In general, the likelihood function is just the probability density function, but thought of as a function of the parameters rather than of the data.

• Interpretation: the likelihood function quantifies how likely the data are for a given value of the parameters.

• The idea behind maximum likelihood estimation is to find the values of β and σ² under which the data are most likely.

– That is, we find the β and σ² that maximize the likelihood function, or equivalently the loglikelihood function, for the value of y actually observed.

– These values are the maximum likelihood estimates (MLEs) of the parameters.

For the CLM, the loglikelihood is

ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ),

where the first term is a constant and the last two terms form the kernel of ℓ.


Notice that maximizing ℓ(β, σ²; y) with respect to β is equivalent to maximizing the third term,

−(1/(2σ²)) (y − Xβ)^T (y − Xβ),

which is equivalent to minimizing

(y − Xβ)^T (y − Xβ) = Σ_{i=1}^n (yi − xi^T β)²   (Least-Squares Criterion). (∗)

• (y − Xβ)^T (y − Xβ) is the squared distance between y and its mean, Xβ.

– The parameter estimate β̂ minimizes this distance.

– That is, β̂ gives the estimated mean Xβ̂ that is closest to y.

• So, the estimators of β given by ML and (ordinary) least squares (OLS) coincide.

– For β in the CLM: ML = OLS, and if X is of full rank (the model is not overparameterized) then

β̂ = (X^T X)^{−1} X^T y.
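The normal-equations solution can be sketched numerically. A Python illustration on simulated data (the design and coefficients here are made up), comparing (X^T X)^{−1} X^T y against a generic least-squares solver:

```python
# Sketch of the ML/OLS coincidence: beta-hat from the normal equations
# equals the answer from a generic least-squares routine.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 covariates
beta_true = np.array([2.0, -1.0, 0.5])                          # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # generic LS solver
```

Both routes give the same minimizer of the least-squares criterion (∗).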


Estimation of σ²:

• Setting the partial derivative of ℓ with respect to σ² to 0 and solving leads to the MLE of σ²:

σ̂²ML = (1/n) (y − Xβ̂)^T (y − Xβ̂) = (1/n) Σi (yi − xi^T β̂)² = (1/n) SSE

• Problem: This estimator is biased for σ².

• The bias can easily be fixed, which leads to the generally preferred estimator:

σ̂² = (1/(n − p)) (y − Xβ̂)^T (y − Xβ̂) = SSE/(n − p) = SSE/dfE = MSE

• The bias of the MLE of σ² comes from using the wrong divisor for SSE (n instead of dfE).

– dfE = n − p is the information in the data left for estimating σ² after having estimated β1, . . . , βp.

– Because σ̂²ML uses n rather than n − p, it is often said that the MLE of σ² fails to account for the d.f. used (or lost) in estimating β.

• MSE, the preferred estimator of σ², is an example of what is known as a restricted ML (REML) estimator.

– As we'll see, REML is the preferred method of estimating variance components in LMMs. It simply generalizes using σ̂² = MSE rather than σ̂²ML in the CLM.
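The bias can be seen by simulation. A Python sketch (hypothetical setup, not from the notes' data): with n = 20 and p = 3, the MLE should average about ((n − p)/n) σ² = 3.4 when σ² = 4, while MSE should average about 4.

```python
# Simulation sketch of the bias of the MLE of sigma^2 versus MSE.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix, so residuals = (I - H) y
beta = np.array([1.0, 2.0, -1.0])         # arbitrary true coefficients

mle_vals, mse_vals = [], []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    sse = ((y - H @ y) ** 2).sum()
    mle_vals.append(sse / n)              # MLE: divides by n (biased downward)
    mse_vals.append(sse / (n - p))        # MSE: divides by dfE = n - p (unbiased)
```

Averaging each list over the 20,000 replicates reproduces the theoretical expectations closely.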


Example – Volume of Cherry Trees:

For 31 black cherry trees the following measurements were obtained:

V = Volume of usable wood (cubic feet)
H = Height of tree (feet)
D = Diameter at breast height (inches)

Goal: Predict usable wood volume from diameter and height.

• See S-PLUS script, backgrnd.R.

• Here, we first consider a simple multiple regression model, cherry.lm1, for these data:

Vi = β0 + β1Hi + β2Di + ei,   i = 1, . . . , 31

• Initial plots of V against both explanatory variables, D and H, look linear, so this model may be reasonable.

• cherry.lm1 gives a high R² of .941 and most residual plots look pretty good. However, the plot of residuals vs. diameter looks “U”-shaped, so we consider some other models for these data.


Inference in the CLM:

Under the basic assumptions of the CLM (independence, homoscedasticity, normality), β̂, the ML/OLS estimator of β, has distribution

β̂ ∼ N(β, σ²(X^T X)^{−1})

That is,

• β̂ is unbiased for β;
• β̂ has var-cov matrix σ²(X^T X)^{−1};
• β̂j has standard error s.e.(β̂j) = √( MSE [(X^T X)^{−1}]jj );
• β̂ is normally distributed.

• It can also be shown that β̂ is an optimal estimator (BLUE, UMVUE).
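These quantities are easy to compute directly. A Python sketch on simulated data (design, coefficients, and error variance are made up; the critical value t.975(48) ≈ 2.0106 is taken from a t table):

```python
# Sketch: standard error of beta_j and a 95% CI for a slope, using
# s.e.(beta_j) = sqrt(MSE * [(X^T X)^{-1}]_jj) from the bullets above.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta_true = np.array([1.0, 0.8])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
mse = ((y - X @ beta_hat) ** 2).sum() / (n - p)         # MSE = SSE / dfE
se = np.sqrt(mse * np.diag(XtX_inv))                    # s.e.(beta_j)
tcrit = 2.0106                                          # t_{.975}(n - p) for n - p = 48
ci = (beta_hat[1] - tcrit * se[1], beta_hat[1] + tcrit * se[1])
```

The interval `ci` is the 95% CI for the slope, of the form β̂j ± t·s.e.(β̂j) used below.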

These properties lead to a number of normal-theory methods of inference:

1. t tests and confidence intervals for an individual regression coefficient βj are based on

(β̂j − βj) / s.e.(β̂j) ∼ t(n − p),

the t distribution with n − p d.f.

– A 100(1 − α)% CI for βj is given by β̂j ± t1−α/2(n − p) s.e.(β̂j).

– For an α-level test of H0 : βj = β0 versus H1 : βj ≠ β0 we use the rule: reject H0 if

|β̂j − β0| / s.e.(β̂j) > t1−α/2(n − p)

– Tests of H0 : βj = 0 for each βj are given by the summary() function in S-PLUS/R.


2. More generally, inference on linear combinations of the βj's of the form c^T β (e.g., contrasts) is based on the t distribution:

(c^T β̂ − c^T β) / √( MSE c^T (X^T X)^{−1} c ) ∼ t(n − p)

– E.g., a 100(1 − α)% C.I. for the expected response at a given value x0 of the vector of explanatory variables is given by

x0^T β̂ ± t1−α/2(n − p) √( MSE x0^T (X^T X)^{−1} x0 ).

– A 100(1 − α)% prediction interval for the response on a new subject with vector of explanatory variables x0 is given by

x0^T β̂ ± t1−α/2(n − p) √( MSE (1 + x0^T (X^T X)^{−1} x0) ).

– Confidence intervals for fitted and predicted values are given by the predict() function in S-PLUS/R.

3. Inference on the entire vector β is based on the fact that

(β̂ − β)^T (X^T X) (β̂ − β) / (p MSE) ∼ F(p, n − p),

the F distribution with p and n − p d.f.

– E.g., we can test any hypothesis of the form H0 : Aβ = c, where A is a k × p matrix of constants (e.g., contrast coefficients), with an F test. The appropriate test has rejection rule: reject if

F = (Aβ̂ − c)^T {A(X^T X)^{−1}A^T}^{−1} (Aβ̂ − c) / (k MSE) > F1−α(k, n − p).

4. The fit of nested models can be compared via an F test comparing their MSE's.

– Accomplished with the anova() function in S-PLUS/R.
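The nested-model comparison in item 4 can be sketched in Python (simulated data; the full model adds one predictor, so k = 1):

```python
# Sketch of item 4: F test comparing nested models,
# F = [(SSE_reduced - SSE_full) / k] / MSE_full, as anova() computes.
import numpy as np

rng = np.random.default_rng(5)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)  # x2 truly matters

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_red = np.column_stack([np.ones(n), x1])        # reduced model: drops x2
X_full = np.column_stack([np.ones(n), x1, x2])   # full model
k, p_full = 1, 3                                 # d.f. difference; full-model params
F = ((sse(X_red, y) - sse(X_full, y)) / k) / (sse(X_full, y) / (n - p_full))
```

Since x2 has a strong effect here, F is large and the reduced model is rejected against F(1, 57) critical values.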


Clustered Data:

Clustered data are data collected on subjects/animals/trees/units which are heterogeneous, falling into natural groupings, or clusters, based upon characteristics of the units themselves or the experimental design, but not on the basis of treatments or interventions.

• The most common example of clustered data is repeated measures data.

• By repeated measures, people typically mean data consisting of multiple measurements of essentially the same variable on a given subject or unit of observation.

– Repeated measurements are typically taken through time, but can be at different spatial locations, or can arise from multiple measuring devices, observers, etc.

– When repeated measures are taken through time, the terms longitudinal data and panel data are roughly synonymous.

• We'll use the more generic term clustered data to refer to any of these situations.

– Clustered data also include data from split-plot designs, crossover designs, hierarchical sampling, and designs with pseudoreplication/subsampling.


Advantages of longitudinal/clustered data:

• Allow study of individual patterns of change — i.e., growth.

• Economize on experimental units.

• Heterogeneous experimental units are often more representative of the population to which we'd like to generalize.

• Each subject/unit can “serve as his or her own control”.

– E.g., in a split-plot experiment or crossover design, comparisons between treatments can be made within the same subject.

– In a longitudinal study, comparisons of time effects (growth) can be made within a subject rather than between subjects.

– Between-unit heterogeneity can be eliminated when assessing treatment or time effects. This leads to more power/efficiency (think paired t-test versus two-sample t-test).

Disadvantages:

• Correlation, multiple sources of heterogeneity in the data.

– Makes statistical methods harder to understand and implement.

– LMMs are flexible enough to deal with these features.

• Imbalance and incompleteness are more common in the data.

– This can be hard for some statistical methods, especially if missing data are not missing at random.

– LMMs handle unbalanced data relatively easily and well.


Linear Mixed Models (LMMs)

• We will present the LMM for clustered data. It can be presented and used in a somewhat more general context, but most applications are to clustered data, and this is the simpler case to discuss/understand.

Examples revisited:

Example 1, One-way random effects model — Rails

• Recall that we had three observations on each of 6 rails.

Model:

yij = µ + bi + eij,   i = 1, . . . , 6,  j = 1, . . . , 3,

where

yij = response from the jth measurement on the ith rail
µ = grand mean response across the population of all rails
bi = random effect for the ith rail
eij = error term

• Data are clustered by rail.

Model for all data from the ith rail can be written in vector/matrix form:

⎛ yi1 ⎞   ⎛ 1 ⎞     ⎛ 1 ⎞      ⎛ ei1 ⎞
⎜ yi2 ⎟ = ⎜ 1 ⎟ µ + ⎜ 1 ⎟ bi + ⎜ ei2 ⎟
⎝ yi3 ⎠   ⎝ 1 ⎠     ⎝ 1 ⎠      ⎝ ei3 ⎠

or yi = Xiβ + Zibi + ei


Example 2, RCBD model — Stools

• Recall that we had n = 9 subjects, each of whom tested all a = 4 stool designs under study.

Model:

yij = µj + bi + eij,   i = 1, . . . , n,  j = 1, . . . , a,

where

yij = response from the jth stool tested by the ith subject
µj = mean response for stool type j across the population of all subjects
bi = random effect for the ith subject
eij = error term

• Data are clustered by subject.

Model for all data from the ith subject can be written in vector/matrix form:

⎛ yi1 ⎞   ⎛ 1 0 0 0 ⎞ ⎛ µ1 ⎞   ⎛ 1 ⎞      ⎛ ei1 ⎞
⎜ yi2 ⎟ = ⎜ 0 1 0 0 ⎟ ⎜ µ2 ⎟ + ⎜ 1 ⎟ bi + ⎜ ei2 ⎟
⎜ yi3 ⎟   ⎜ 0 0 1 0 ⎟ ⎜ µ3 ⎟   ⎜ 1 ⎟      ⎜ ei3 ⎟
⎝ yi4 ⎠   ⎝ 0 0 0 1 ⎠ ⎝ µ4 ⎠   ⎝ 1 ⎠      ⎝ ei4 ⎠

or yi = Xiβ + Zibi + ei


Example 3, Split-plot model — Grass

• Recall that we had 8 whole plots (half-fields) randomized to cultivars in an RCBD, with each whole plot then split into 3 split-plots, which were randomized to the 3 different inoculation types.

Model:

yijk = µ + αi + βk + (αβ)ik + τj + bij + eijk,

where yijk is the response from the split-plot assigned to the kth inoculation type within the (i, j)th whole plot (which is assigned to the ith cultivar in the jth block).

In addition,

µ = grand mean
αi = ith cultivar effect (fixed)
βk = kth inoculation type effect (fixed)
(αβ)ik = cultivar × inoculation interaction effect (fixed)
τj = jth block effect (treated as fixed, but could be random)
bij = effect for the (i, j)th whole plot (random)
eijk = error term (random)

• Data are clustered by whole plot.

Model for all data from the (i, j)th whole plot can be written in vector/matrix form:

⎛ yij1 ⎞   ⎛ 1 1 1 0 0 1 0 0 1 ⎞       ⎛ 1 ⎞        ⎛ eij1 ⎞
⎜ yij2 ⎟ = ⎜ 1 1 0 1 0 0 1 0 1 ⎟ β  +  ⎜ 1 ⎟ bij  + ⎜ eij2 ⎟
⎝ yij3 ⎠   ⎝ 1 1 0 0 1 0 0 1 1 ⎠       ⎝ 1 ⎠        ⎝ eij3 ⎠

where β = (µ, αi, β1, β2, β3, (αβ)i1, (αβ)i2, (αβ)i3, τj)^T, or

yij = Xijβ + Zijbij + eij


The Linear Mixed Model for Clustered Data:

• Notice that all 3 of the previous examples have the same form.

• They are all examples of LMMs with a single (univariate) random effect: a random cluster-specific intercept.

Suppose we have data on n clusters, where yi = (yi1, . . . , yi,ti)^T are the ti observations available on the ith cluster, i = 1, . . . , n.

Then the LMM with a random cluster-specific intercept is given (in general) by

yi = Xiβ + Zibi + ei,   i = 1, . . . , n,

where Xi is a ti × p design matrix for the fixed effects β, Zi is a ti × 1 vector of ones, and ei is a vector of error terms.

If you're not comfortable with the vector/matrix representation, another way to write it is

yij = β1x1ij + β2x2ij + · · · + βpxpij + zijbi + eij,

where the β terms form the fixed part, zijbi is the random part, and zij = 1.

Assumptions:

— cluster effects: the bi's are independent, normal with variance (variance component) σ²b.

— error terms: the eij's are independent, normal with variance (variance component) σ².

• We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.

— the bi's and eij's are assumed independent of each other.


• Often, it makes sense to have more than one random effect in the model. To motivate this, let's consider another example.

Example — Microfibril Angle in Loblolly Pine

• Whole-disk cross-sectional microfibril angle was measured at 1.4, 4.6, 7.6, 10.7, and 13.7 meters up the stem of 59 trees, sampled from four physiographic regions.

– Regions (no. trees) were Atlantic Coastal Plain (24), Piedmont (17), Gulf Coastal Plain (9), and Hilly Coastal Plain (9).

A plot of the data:

[Figure: whole-disk cross-sectional microfibril angle (deg) plotted against height on stem (m), in four panels: Atlantic, Gulf, Hilly, Piedmont.]


• Here we have 4 or 5 repeated measures on each tree.

– Repeated measures taken not through time, but through space, up the stem of the tree.

• Any reasonable model would account for

– heterogeneity between individual trees;

– correlation among observations on the same tree; and

– dependence of MFA on height at which it is measured.

• From the plots it is clear that MFA decreases with height.

– For simplicity, suppose it decreases linearly with height (it doesn't, but let's keep things easy).

Let yijk be the MFA on the jth tree in the ith region, measured at the kth height.

Then a reasonable model might be

yijk = µi + βheightijk + bij + eijk,

where µi + βheightijk is the fixed part, and

µi = mean response for the ith region
β = slope for the linear effect of height on MFA
bij = random effect for the (i, j)th tree
eijk = error term for the height-specific measurements

• The fixed part of the model says that MFA decreases linearly in height, with an intercept that depends on region.

– I.e., mean MFA is different from one region to the next.

• The random effects (the bij's) say that the intercept varies from tree to tree within region.


Rather than just random tree-specific intercepts, suppose we believe that the slope (the linear effect of height on MFA) also varies from subject to subject.

• This leads to a random intercept and slope model:

yijk = (µi + b1ij) + (β + b2ij)heightijk + eijk
     = µi + βheightijk + b1ij + b2ijheightijk + eijk,

where (µi + b1ij) is the intercept and (β + b2ij) is the slope.

• Now there are two random effects, b1ij and b2ij, or a bivariate random effect: bij = (b1ij, b2ij)^T.

– There is no reason to expect that an individual tree's effect on the intercept would be independent of that same tree's effect on the slope.

– So, we would assume b1ij and b2ij are correlated (probably negatively).

The model can be written as

⎛ yij1 ⎞   ⎛ 1  heightij1 ⎞ ⎛ µi ⎞   ⎛ 1  heightij1 ⎞ ⎛ b1ij ⎞   ⎛ eij1 ⎞
⎜  ⋮   ⎟ = ⎜ ⋮      ⋮     ⎟ ⎝ β  ⎠ + ⎜ ⋮      ⋮     ⎟ ⎝ b2ij ⎠ + ⎜  ⋮   ⎟
⎝ yij5 ⎠   ⎝ 1  heightij5 ⎠          ⎝ 1  heightij5 ⎠            ⎝ eij5 ⎠

or yij = Xijβ + Zijbij + eij


So, the LMM in general may have > 1 random effect, which leads us to the general form of the model:

yi = Xiβ + Zibi + ei,   i = 1, . . . , n,

where

Xi = design matrix for fixed effects
β = p × 1 vector of fixed effects (parameters)
Zi = design matrix for random effects
bi = q × 1 vector of random effects
ei = vector of error terms

If you're not comfortable with the vector/matrix representation, another way to write it is

yij = β1x1ij + β2x2ij + · · · + βpxpij + z1ijb1i + · · · + zqijbqi + eij,

where the β terms form the fixed part and the z·b terms form the random part.

Assumptions:

— cluster effects: the bi's are normal and independent from cluster to cluster. We allow b1i, . . . , bqi (random effects from the same cluster, e.g., a random intercept and slope) to be correlated, with var-cov matrix D:

bi ∼ iid Nq(0, D)

— error terms: the eij's are independent, normal with variance (variance component) σ². That is, ei ∼ iid Nti(0, σ²I).

• We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.

— the bi's and eij's are assumed independent of each other.


Example — Microfibril Angle in Loblolly Pine (Continued)

Recall the original random-intercept (only) model:

yijk = µi + βheightijk + bij + eijk.

• This model is fit with the lme() function in LMM.R:

> mfa.lme1 <- lme(mfa ~ regname + diskht -1 , data=mfa, random= ~1|tree)

> summary(mfa.lme1)

Linear mixed-effects model fit by REML

Data: mfa

AIC BIC logLik

1501.148 1526.311 -743.5738

Random effects:

Formula: ~1 | tree

(Intercept) Residual

StdDev: 1.795762 3.347371

Fixed effects: mfa ~ regname + diskht - 1

Value Std.Error DF t-value p-value

regnameAtlantic 20.267993 0.5996115 55 33.80187 0

regnameGulf 18.266745 0.8800006 55 20.75765 0

regnameHilly 18.425222 0.8719278 55 21.13159 0

regnamePiedmont 20.948914 0.6741677 55 31.07374 0

diskht -0.116966 0.0147071 215 -7.95304 0

Correlation:

rgnmAt rgnmGl rgnmHl rgnmPd

regnameGulf 0.192

regnameHilly 0.219 0.117

regnamePiedmont 0.323 0.172 0.196

diskht -0.601 -0.320 -0.365 -0.537

Standardized Within-Group Residuals:

Min Q1 Med Q3 Max

-1.7505407 -0.6968633 -0.1144518 0.5237712 3.9879323

Number of Observations: 274

Number of Groups: 59


The random intercept and random slope model was

yijk = µi + βheightijk + b1ij + b2ijheightijk + eijk.

• This model can be fit with lme() too, but an easy way to refit a model with a slight change is via update():

> mfa.lme2 <- update(mfa.lme1, random= ~diskht|tree)

> summary(mfa.lme2)

Linear mixed-effects model fit by REML

Data: mfa

AIC BIC logLik

1504.586 1536.938 -743.293

Random effects:

Formula: ~diskht | tree

Structure: General positive-definite, Log-Cholesky parametrization

StdDev Corr

(Intercept) 2.21069507 (Intr)

diskht 0.02712458 -0.678

Residual 3.31192368

Fixed effects: mfa ~ regname + diskht - 1

Value Std.Error DF t-value p-value

regnameAtlantic 20.237540 0.6221012 55 32.53095 0

regnameGulf 18.312019 0.9048791 55 20.23698 0

regnameHilly 18.449352 0.8925792 55 20.66971 0

regnamePiedmont 20.950184 0.6944940 55 30.16611 0

diskht -0.116822 0.0149822 215 -7.79740 0

Correlation:

rgnmAt rgnmGl rgnmHl rgnmPd

regnameGulf 0.215

regnameHilly 0.248 0.132

regnamePiedmont 0.364 0.193 0.223

diskht -0.636 -0.338 -0.390 -0.573

Standardized Within-Group Residuals:

Min Q1 Med Q3 Max

-1.7613139 -0.7114237 -0.1082732 0.5346069 3.7535692

Number of Observations: 274

Number of Groups: 59


Questions:

• The models were fit by REML. What does that mean?
• Which model is better?
• How do we know if the model assumptions are met (diagnostics)?
• How do we predict MFA at a given height for a given tree? For the population of all trees from a given region?

Estimation and Inference in the LMM:

Estimation:

• In the classical linear model, the usual method of estimation is ordinary least squares.

• However, we saw that if we assume normal errors, then OLS givesthe same estimates of β as maximum likelihood (ML) estimation.

In the LMM, there are fixed effects β, but also parameters related to the distribution of the random effects (e.g., variance components such as σ²b) as well as parameters related to the error terms (e.g., the error variance σ²).

• Least-squares doesn't provide a framework for estimation and inference for all of these parameters, so ML and related likelihood-based methods (i.e., restricted maximum likelihood, or REML) are generally preferred.


ML: recall that ML proceeds by finding the parameters that maximize the loglikelihood, i.e., the log of the joint p.d.f. of the data.

• Finds the parameter values under which the observed data are mostlikely.

• Since the LMM assumes that the errors are normal, the random effects are normal, and the response y is linearly related to the errors and random effects via y = Xβ + Zb + e, it's not hard to show that the LMM implies that the response vector y is normal too.

– That is, it's easy to show that the observations from different clusters are independent, with

yi ∼ N(Xiβ, Vi), where Vi = ZiDZi^T + σ²I

• ⇒ the joint p.d.f. of the data is multivariate normal
• ⇒ the loglikelihood is the log of a multivariate normal p.d.f.

– This loglikelihood is easy to write down, but requires an iterative algorithm to maximize.

• Implemented optionally in lme() with the method="ML" option.
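As a sketch, Vi can be computed directly for the random intercept-and-slope model, plugging in (rounded) variance-component estimates from the mfa.lme2 output above; the two rows of Zi here use the first two measurement heights, 1.4 and 4.6 m (a Python illustration, not output from lme()):

```python
# Marginal var-cov matrix V_i = Z_i D Z_i^T + sigma^2 I for one tree with
# two measurements, using rounded estimates from the mfa.lme2 fit.
import numpy as np

sd_int, sd_slope, corr = 2.2107, 0.02712, -0.678   # random-effect SDs and corr
sigma = 3.3119                                     # residual SD
D = np.array([[sd_int**2, corr * sd_int * sd_slope],
              [corr * sd_int * sd_slope, sd_slope**2]])
Z = np.array([[1.0, 1.4],     # rows are (1, height) at heights 1.4 and 4.6 m
              [1.0, 4.6]])
V = Z @ D @ Z.T + sigma**2 * np.eye(2)
```

The off-diagonal element of V is the marginal covariance between the two measurements on the same tree; it is nonzero even though the eij's are independent.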


REML: Recall from the classical linear model that the MLE of σ² was biased.

• It did not adjust for the d.f. lost in estimating β (the fixed effects).

• Instead we used MSE as the preferred estimator of σ².

REML was developed as a general likelihood-based methodology that would be applicable to all LMMs, but which would

• take account of the d.f. lost in estimating β, to produce less biased estimates of variance-covariance parameters (e.g., variance components) than ML;

• generalize the old, well-known, unbiased estimators in those simple cases of the LMM where such estimators are known;

– e.g., REML yields MSE as its estimator of σ² in the CLM.

• REML is based upon maximizing the restricted loglikelihood.

– This can be thought of as the portion of the loglikelihood that doesn't depend on β.

• Like ML estimation, REML requires an iterative algorithm to produce estimates.

• REML is the default estimation method for the lme() function and for PROC MIXED in SAS.

• It's generally regarded as the preferred method of estimation for LMMs.

– However, some aspects of model selection are easier with ML, so sometimes competing models are fit and compared with ML, and then the "best" model is refit with REML at the end.


Inference on Fixed Effects:

• Remember, the framework for estimation and inference in the LMM is ML or REML, not least-squares as in the CLM.

The standard methods of inference in a likelihood-based framework are Wald tests and likelihood ratio tests (LRTs).

• LRTs and Wald tests are based upon asymptotic theory. That is, they provide methods that hold exactly when the sample size goes to infinity and only approximately for finite sample sizes.

• LRTs are useful for comparing nested models.

– They shouldn't be used for comparing random-effect structures/variance-covariance structures.

– They shouldn't be used with REML, only ML.

• Wald tests are useful for testing linear hypotheses (e.g., contrasts) on fixed effects.

– Wald tests yield approximate z and chi-square tests.

– These tests can be improved, as t and F tests, to produce better inferences in small samples.


Wald Tests:

It can be shown that the approximate (i.e., large-sample) distribution of the (restricted) ML estimator β̂ in the LMM is

β̂ ∼ N( β, var(β̂) ),  where var(β̂) = [ Σ_{i=1}^n Xi^T Vi^{−1} Xi ]^{−1},   (♣)

and Vi = ZiDZi^T + σ²I.

• In practice, var(β̂) is estimated by plugging in the final (restricted) ML estimates obtained from fitting the model.

• Standard errors of β̂j, the jth component of β̂, are obtained as the square root of the jth diagonal element of the estimated var(β̂).

The distributional result (♣) leads to the general Wald test on β.

In particular, we reject H0 : Aβ = c at level α, where A is k × p, if

(Aβ̂ − c)^T {A var(β̂) A^T}^{−1} (Aβ̂ − c) > χ²1−α(k),

where χ²1−α(k) is the upper αth critical value of the chi-square distribution on k d.f. (with var(β̂) replaced by its estimate).

• As a special case, an approximate z test of H0 : βj = 0 versus H1 : βj ≠ 0 rejects H0 if

|β̂j| / s.e.(β̂j) > z1−α/2,

where z1−α/2 is the (1 − α/2) quantile of the standard normal distribution.

• In addition, an approximate 100(1 − α)% CI for βj is given by

β̂j ± z1−α/2 s.e.(β̂j).


These Wald tests can be improved in small samples by using the t and F distributions in place of the z and χ².

• An approximate F test of H0 : Aβ = c is based on the test statistic

(Aβ̂ − c)^T {A var(β̂) A^T}^{−1} (Aβ̂ − c) / k  .∼  F(k, ν).

• In addition, H0 : βj = 0 can be tested via the test statistic

β̂j / s.e.(β̂j)  .∼  t(ν).

What is the appropriate choice for the denominator d.f. ν in thesetests?

• This is a question which is difficult to answer, in general.

• Pinheiro and Bates’ lme() function uses the “containment” method.

– This method produces the right answers in simple cases suchas a split-plot model where those answers are known.

– This approach can give non-optimal answers in non-standardexamples of the LMM, but it tends to work pretty well overall.

– Same method (essentially) is implemented in SAS’s PROCMIXED with the ddfm=contain option (which is the default).

– However, this is one place where PROC MIXED is superiorto lme() because other, better approaches are implemented.In particular, the “Kenward-Roger” (ddfm=kr) method workswell much more generally than the containment method.

• Approximate t and F tests are implemented in lme() in the summary() function and in the anova() function when the Terms or L options are specified.


Example — Microfibril Angle in Loblolly Pine (Continued)

• The random-intercept model is fit as mfa.lme1 in LMM.R:

yijk = µi + βheightijk + bij + eijk.

• The hypothesis of equal means/intercepts is

H0 : µ1 = µ2 = µ3 = µ4

which can be written as

H0 : Aβ = 0

where

Aβ =
⎛ 1 −1  0  0  0 ⎞ ⎛ µ1 ⎞   ⎛ µ1 − µ2 ⎞
⎜ 0  1 −1  0  0 ⎟ ⎜ µ2 ⎟ = ⎜ µ2 − µ3 ⎟
⎝ 0  0  1 −1  0 ⎠ ⎜ µ3 ⎟   ⎝ µ3 − µ4 ⎠
                  ⎜ µ4 ⎟
                  ⎝ β  ⎠

• Can be tested with the anova function:

> A <- matrix( c(1,-1,0,0,0, 0,1,-1,0,0, 0,0,1,-1,0),nrow=3,byrow=T)

> A

[,1] [,2] [,3] [,4] [,5]

[1,] 1 -1 0 0 0

[2,] 0 1 -1 0 0

[3,] 0 0 1 -1 0

> fixef(mfa.lme1)

regnameAtlantic regnameGulf regnameHilly regnamePiedmont diskht

20.2679929 18.2667446 18.4252221 20.9489136 -0.1169664

> anova(mfa.lme1,L=A,type="marginal") # use marginal, or "Type 3" SSs

F-test for linear combination(s)

regnameAtlantic regnameGulf regnameHilly regnamePiedmont

1 1 -1 0 0

2 0 1 -1 0

3 0 0 1 -1

numDF denDF F-value p-value

1 3 55 3.679848 0.0173


• Alternatively, the same test can be obtained by generating the anova table for the equivalent model

yijk = µ + αi + βheightijk + bij + eijk

which can be done as follows:

> mfa.lme1a <- lme(mfa ~ regname + diskht, data=mfa, random= ~1|tree)

> anova( mfa.lme1a, type="marginal")

numDF denDF F-value p-value

(Intercept) 1 214 1142.5667 <.0001

regname 3 55 3.6798 0.0173

diskht 1 214 63.2508 <.0001

• t-based confidence intervals for the parameters can be obtained with the intervals() function:

> intervals(mfa.lme1)

Approximate 95% confidence intervals

Fixed effects:

lower est. upper

regnameAtlantic 19.0663446 20.2679929 21.46964124

regnameGulf 16.5031840 18.2667446 20.03030520

regnameHilly 16.6778398 18.4252221 20.17260436

regnamePiedmont 19.5978513 20.9489136 22.29997592

diskht -0.1459550 -0.1169664 -0.08797776


LRTs:

In fairly broad generality, for models with a parametric likelihood function L(γ) depending on a parameter γ, nested models can be tested by examining the ratio

λ ≡ L(γ) / L(γ0)

where

γ0 = MLE under H0 (under null, or partial model)
γ = MLE under HA (under alternative, or full model)

• Logic: If the observed data are much less likely under the simple than under the complex model, then λ will be large, and we should choose the complex one.

• If the models explain the data equally well, then λ ≈ 1 and we prefer H0, the simpler model.

• We reject H0 for large values of λ, or equivalently, for large valuesof log(λ).

Asymptotic version of the test: Reject the partial model in favor of the full model if

2{log L(γ) − log L(γ0)} > χ2_{1−α}(ν)

where

ν = (# parameters estimated under full model) − (# parameters estimated under partial model)
  = number of restrictions imposed by H0
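A minimal numeric sketch of this asymptotic LRT in R, using hypothetical maximized loglikelihood values (they are illustrative stand-ins, not taken from any fitted model here):

```r
# Asymptotic LRT: reject the partial model if
# 2*(logL.full - logL.partial) exceeds the chi-square critical value.
logL.full    <- -653.06   # hypothetical maximized loglik, full model
logL.partial <- -742.19   # hypothetical maximized loglik, partial model
nu    <- 1                # number of restrictions imposed by H0
alpha <- 0.05

lrt    <- 2 * (logL.full - logL.partial)   # likelihood ratio test statistic
crit   <- qchisq(1 - alpha, df = nu)       # chi-square critical value
reject <- lrt > crit
```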


• LRTs generalize the F test for nested models in the CLM.

• LRTs are implemented in the lme software via the anova() function.

• The maximized loglikelihood for any fitted model is given by the logLik() function.

• Important: LRTs should never be performed for two models with different fixed-effect specifications when using REML, only with ML!

Example — Microfibril Angle in Loblolly Pine (Continued)

• Suppose we believe that MFA changes in a quadratic way with height. Then we might consider the model

yijk = µi + β1heightijk + β2height2ijk + bij + eijk.

• If we fit this model and the linear-in-height model with ML, then a LRT can be done to test the quadratic effect in height.


> mfa.lme3.ML <- lme(mfa ~ regname + diskht + I(diskht^2) -1 ,

+ data=mfa, random= ~1|tree,method="ML") #full model

> mfa.lme1.ML <- update(mfa.lme3.ML,

fixed= ~ regname + diskht -1 )#partial model

>

> anova(mfa.lme3.ML,mfa.lme1.ML) # do LRT of diskht^2

Model df AIC BIC logLik Test L.Ratio p-value

mfa.lme3.ML 1 8 1322.110 1351.015 -653.0552

mfa.lme1.ML 2 7 1498.380 1523.672 -742.1899 1 vs 2 178.2694 <.0001

> 2*(logLik(mfa.lme3.ML)[1]-logLik(mfa.lme1.ML)[1])

[1] 178.2694

> summary(mfa.lme3.ML)$tTable #gives Wald-based t-test on diskht^2,

#alternative to LRT

Value Std.Error DF t-value p-value

regnameAtlantic 25.29239285 0.6096669291 55 41.48559 3.437954e-43

regnameGulf 23.70407853 0.8796914640 55 26.94590 2.269282e-33

regnameHilly 23.61188490 0.8702167111 55 27.13334 1.591075e-33

regnamePiedmont 25.94446296 0.6800771466 55 38.14929 2.969921e-41

diskht -0.74916125 0.0395495622 214 -18.94234 1.212177e-47

I(diskht^2) 0.01308780 0.0007938345 214 16.48682 5.789352e-40

> anova(mfa.lme3.ML,Terms=3) #Wald F test, just square of previous

F-test for: I(diskht^2)

numDF denDF F-value p-value

1 1 214 271.8151 <.0001

• The Wald F test is obtained above from the anova() function using the Terms option.

• Using either the LRT or the Wald F test, it's clear that the quadratic term in height is necessary.

Which test do I use?

Recommendation: use Wald-based F tests for inference on fixed effects.


Inference on Random Effects, Var-Cov Structure:

Inference on the variance-covariance structure (e.g., variance components, serial correlation parameters, heteroscedasticity parameters) is complicated by a number of theoretical and technical difficulties.

• For example, we may want to test H0 : σ2b = 0, i.e., that the variance component associated with some random effect is zero.

– Under H0 the corresponding random effect is zero.

– Complications are caused by σ2b being constrained to be ≥ 0 and by the fact that the null places σ2b at the boundary of its set of possible values.

• In principle LRTs still apply, but the reference distribution is not necessarily χ2, and is difficult to determine.

• Wald tests apply as well, but can perform extremely poorly unless the sample size is very large.

Instead, a simpler but somewhat less formal approach to inference on the var-cov structure is via model selection criteria.

• The two most common model selection criteria are AIC and BIC.

– Both are based on the maximized value of the loglikelihood or restricted loglikelihood.

– The idea is to choose the model under which the data are most likely, but each criterion imposes a penalty for model complexity (lack of parsimony).

– Penalties differ for AIC, BIC.

– Hard to say which is better; BIC tends to lead to simpler models, AIC is slightly more commonly used.

• Use: choose one or the other criterion, then choose the model which minimizes that criterion.
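The two criteria can be sketched directly from their usual definitions, AIC = −2·logLik + 2k and BIC = −2·logLik + k·log(n). The numbers below are hypothetical, chosen only to illustrate the computation:

```r
# AIC/BIC from a maximized (restricted) loglikelihood.
logL <- -661.11   # hypothetical maximized loglikelihood
k    <- 8         # number of estimated parameters
n    <- 273       # number of observations

aic <- -2 * logL + 2 * k
bic <- -2 * logL + k * log(n)   # BIC penalizes complexity more when log(n) > 2
```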


Example — Microfibril Angle in Loblolly Pine (Continued)

• We've now concluded that the following quadratic (in height) model for MFA is better than the linear one. Model:

yijk = µi + β1heightijk + β2height2ijk + bij + eijk (mfa.lme3)

• We now consider whether the linear and quadratic effects of height vary from subject to subject. That is, we consider models

yijk = µi + β1heightijk + β2height2ijk + b1ij + b2ijheightijk + eijk (mfa.lme4)

yijk = µi + β1heightijk + β2height2ijk + b1ij + b2ijheightijk + b3ijheight2ijk + eijk (mfa.lme5)

• These models differ in their random effects structure.

• Can't test model mfa.lme4 vs. mfa.lme3 or mfa.lme5 vs. mfa.lme3 with a LRT or Wald test.

– Instead use AIC or BIC to choose the best model. Smallest AIC, say, wins:

> mfa.lme3 <- update(mfa.lme3.ML, method="REML" )

> mfa.lme4 <- update(mfa.lme3, random=~diskht|tree )

> mfa.lme5 <- update(mfa.lme3, random=~diskht+I(diskht^2)|tree )

> anova(mfa.lme3,mfa.lme4,mfa.lme5)

Model df AIC BIC logLik Test L.Ratio p-value

mfa.lme3 1 8 1338.214 1366.942 -661.1069

mfa.lme4 2 10 1329.896 1365.805 -654.9477 1 vs 2 12.31831 0.0021

mfa.lme5 3 13 1318.841 1365.524 -646.4205 2 vs 3 17.05449 0.0007

• Conclusion: according to both AIC and BIC, model mfa.lme5 is the best model considered so far.

• LRTs and the p-values given here should not be used in this context.


Prediction of Random Effects:

To keep things relatively simple, let's return to the random-intercept (only) version of the quadratic-in-height model for MFA:

yijk = µi + β1heightijk + β2height2ijk + bij + eijk (mfa.lme3)

• Mean response in the model is the fixed part:

E(yijk) = µi + β1heightijk + β2height2ijk (†)

• Describes the average behavior over the population of all trees from which the trees in the study were drawn.

– (†) is estimated by plugging in the parameter estimates.

• The model also “localizes” to the individual tree level. The i, jth tree behaves somewhat differently than average. That tree's mean MFA is described by

µi + β1heightijk + β2height2ijk + bij (‡)

– Since bij is random, the quantity above is random.

– Makes sense, because it describes a particular tree's responses, where that tree is random (randomly drawn from a population).

– (‡) is predicted by plugging in parameter estimates and predictions of the bij's.

Terminology:

— We estimate fixed, unknown constants (parameters) or quantities depending only on parameters like (†).

— We predict unknown random variables (random effects), or quantities involving random effects like (‡).


The preferred method of predicting the random effects is to use estimated best linear unbiased predictors (BLUPs).

These predictions of the (unobserved) random effects are based upon

— the response vector y (observed)

— the distribution of the random effects according to the model (e.g., bij iid∼ N(0, σ2b)) (estimated from the fitted model).

— the joint distribution of y and the random effects (estimated from the fitted model).
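In the simplest case, a random-intercept model with known variance components, the BLUP has a closed form: it shrinks the cluster's mean residual toward zero by the factor σ2b/(σ2b + σ2e/ni). A hedged sketch of that textbook formula (not the course's code), with hypothetical numbers:

```r
# BLUP of a cluster's random intercept b_i in a random-intercept model,
# assuming known variance components. All numbers are hypothetical.
sigma2.b <- 2.0    # between-cluster variance component
sigma2.e <- 1.5    # within-cluster error variance
n.i      <- 7      # number of observations in cluster i
rbar.i   <- 0.9    # mean raw residual for cluster i (ybar_i - fitted mean)

shrink <- sigma2.b / (sigma2.b + sigma2.e / n.i)  # shrinkage factor in (0, 1)
blup.i <- shrink * rbar.i                         # predicted random effect
```

The more observations in the cluster, or the larger σ2b relative to σ2e, the less shrinkage; ranef() returns the analogous quantities with estimated variance components plugged in.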

• Estimated BLUPs of the random effects are given by the ranef() function in S-PLUS/R.

• E.g., for model mfa.lme3, these predictions given by

> ranef(mfa.lme3)

(Intercept)

1 -1.643437203

2 -0.481270493

<portion omitted>

51 -0.908065397

52 0.900232600

53 -0.965476440

54 0.880095610

55 -0.090215557

56 -0.009334064

57 0.541355263

58 -0.145192338

59 -0.203399677


• Estimates of (†) can be obtained from the predict() function by specifying level=0 (population level).

– Can think of this as plugging in 0 for the random effects in the fitted model equation.

– Appropriate for 1) estimating the mean for the population of all trees; or for 2) predicting the response for a tree whose random effect is unknown (a new tree).

• Predictions of (‡) can be obtained from the predict() function by specifying level=1 (first level or tree level, in this case).
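A runnable sketch of the level-0 vs. level-1 distinction, using the nlme package's built-in Orthodont data rather than the MFA data (which are not distributed with the notes):

```r
# Population-level (level 0) vs. cluster-level (level 1) fitted values
# from a random-intercept fit to nlme's built-in Orthodont data.
library(nlme)
fit  <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Subject)
pop  <- fitted(fit, level = 0)  # fixed part only: plugs in 0 for b_i
subj <- fitted(fit, level = 1)  # adds each subject's predicted intercept
```

predict() with the level argument behaves the same way when newdata is supplied.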

• A nice way to plot the data, the population-level estimates and the tree-level predictions is via augPred().

– Using this function for the Hilly region only (trees 51–59) produces the following plot:

[Figure: augPred plot, "Data from Hilly Region only" — one panel per tree (51–59), height on stem (m) vs. whole-disk cross-sectional microfibril angle (deg), showing the fixed (population-level) and tree-level fitted curves.]


Model Diagnostics:

As in classical linear models, residual plots are the workhorse of model diagnostics.

In a LMM, however, we need to think a bit more carefully about what we mean by "residuals".

• Residuals can be based on the difference between a response and its estimated mean:

yij − µij,   where µij is the estimated mean

– These are level 0 (population-level) residuals.

• Alternatively, residuals can be based on the difference between a response from the ith cluster and its cluster-specific predicted value:

yij − (µij + bi),   where µij + bi is the predicted value

– These are level 1 (cluster-level) residuals.

• Typically, we are interested in the cluster-level residuals.

– Residuals are extracted via the residuals() (or resid(), for short) function. The option type controls the residual type (raw, Pearson, etc.).

– Fitted values are extracted with the fitted() function.


• The general form of the command for residual plots is

plot(fitted.model, y ~ x, options)

• In LMM.R, we produce various standard residual plots.

• The resulting conclusion is that the model is misspecified. It appears that

– the shape of the MFA vs height function depends on region.

– model is often underpredicting mfa at breast height.

• After considering various alternatives, a much improved model is one where we treat diskht as a factor, and use a standard two-way layout model with random tree effects:

yijk = µ + αi + βk + (αβ)ik + bij + eijk (mfa.lme8)

where

µ = grand mean
αi = region effects
βk = disk height effects
(αβ)ik = disk height × region interactions
bij = random tree effect
eijk = within-tree error term


Extensions of LMMs

Accommodating Heteroscedasticity (Non-constant Variance):

In the LMM as we've presented it so far, the error terms are assumed to have constant variance.

• I.e., if we write the model as

yi = Xiβ + Zibi + ei

then we assume

var(ei) = σ2I =
⎛ σ2  0  ···  0  ⎞
⎜ 0   σ2 ···  0  ⎟
⎜ ⋮   ⋮   ⋱   ⋮  ⎟
⎝ 0   0  ···  σ2 ⎠

– Constant variance among errors (also implies the response has constant variance).

Such an assumption is often unrealistic and can be violated both empirically and theoretically.

• A common example is variance which increases with the magnitude of the response.

– E.g., heights of tall trees are more variable than those of short trees.

• Another example is where the error variance differs across groups.

– Intensively managed trees less variable than natural stands.

• Another possibility is that variability depends upon a covariate.

– E.g., variability in tree heights decreases with increasing site quality (site index, or soil quality measure)?


Such non-constant variance can be accommodated by modeling the variance as a function of covariates, factors (groups), and/or the mean response, in addition to one or more unknown parameters.

• In particular, the lme software allows the error variance to be of theform

var(eij) = σ2g2(vi, δ)

where

vi = a vector of one or more variance covariates,
δ = a vector of unknown variance parameters to be estimated,

g2(·) = a known variance function.

• We allow vi, the variance covariates, to include µi = E(yi), the mean response.

– This requires a more complex fitting algorithm and a bit different theory than the standard ML, REML theory that applies when the variance doesn't depend on the mean.

– However, from the user's perspective, variance depending on the mean causes no complication and is extremely useful.


Variance Functions Available in the lme/nlme Software:

• Variance functions in the nlme software are described in §5.2.1 in Pinheiro and Bates (2000). Here, we give only brief descriptions.

1. varFixed. The varFixed variance function is g2(vi) = vi. That is,

var(ei) = σ2vi.

– says that the error variance is proportional to the value of a covariate.

– This is the traditional weighted least squares form.

2. varIdent. This variance specification corresponds to different variances at each level of some stratification (grouping) variable s.

3. varPower. This generalizes the varFixed function so that the error variance can be a to-be-estimated power of the magnitude of a variance covariate:

var(ei) = σ2|vi|2δ so that g2(vi, δ) = |vi|2δ.

The power is taken to be 2δ rather than δ so that s.d.(ei) = σ|vi|δ. A very useful specification is to take the variance covariate to be the mean response. That is,

var(ei) = σ2|µi|2δ

4. varComb. Finally, the varComb class allows the other variance classes to be combined so that the variance function of the model is a product of two or more component variance functions.
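A small sketch of what the varPower specification implies numerically; σ, δ, and the variance covariate values below are hypothetical:

```r
# varPower: g^2(v, delta) = |v|^(2*delta), so s.d.(e) = sigma * |v|^delta.
sigma <- 0.006
delta <- 2.0
v     <- c(14, 18, 22)   # hypothetical variance covariate (e.g., mean response)

sd.e  <- sigma * abs(v)^delta          # standard deviations
var.e <- sigma^2 * abs(v)^(2 * delta)  # equals sd.e^2
```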


Example — Microfibril Angle in Loblolly Pine (Continued)

We return to the two-way layout model with random tree effects:

yijk = µ + αi + βk + (αβ)ik + bij + eijk (mfa.lme8)

• The residuals vs. fitted values plot shows increasing variance as the fitted values increase.

• It appears that variance increases with the mean response, not systematically with a covariate such as height.

• Can allow variance to be proportional to power of mean as follows:

> mfa.lme9 <- update(mfa.lme8, weights=varPower(form= ~ fitted(.)))

> anova(mfa.lme9,mfa.lme8)

Model df AIC BIC logLik Test L.Ratio p-value

mfa.lme9 1 23 1205.206 1286.564 -579.6028

mfa.lme8 2 22 1261.743 1339.564 -608.8712 1 vs 2 58.53691 <.0001

> summary(mfa.lme9)

Linear mixed-effects model fit by REML

Data: mfa

AIC BIC logLik

1205.206 1286.564 -579.6028

Random effects:

Formula: ~1 | tree

(Intercept) Residual

StdDev: 1.987213 0.00575837

Variance function:

Structure: Power of variance covariate

Formula: ~fitted(.)

Parameter estimates:

power

2.018668

<portion omitted>

• Fitted variance model says:

s.d.(ei) = σ|µi|δ = 0.00575837 |µi|^2.018668


• According to AIC and BIC, the heteroscedastic model mfa.lme9 is a big improvement over mfa.lme8.

• Residual plots look much better too.

Accommodating Serial Correlation:

In the LMM as we've presented it so far, the error terms are assumed to be uncorrelated.

• Only correlation among the responses due to shared random effects.

– Accounts for shared characteristics.

– Ignores serial correlation.

• Serial correlation (autocorrelation) is temporally or spatially related.

– Typically, observations close together in time/space are more similar than observations far apart.

• We can accommodate serial correlation by relaxing our independent errors assumption.

• Instead, we model the correlation among error terms as a function of time lag, spatial distance and unknown parameters.

The lme/nlme software allows a correlation model of the form

corr(eij, eik) = h{d(pij, pik), ρ}

where

ρ = a vector of correlation parameters,
h(·) = a known correlation function,
pij, pik = position variables corresponding to observations yij, yik,
d(·, ·) = a known distance function.

• The correlation function h(·) is assumed continuous in ρ, returning values in [−1, +1]. In addition, h(0, ρ) = 1, so that observations that are 0 distance apart (identical observations) are perfectly correlated.


Correlation Structures Available in the lme/nlme Software:

• Correlation structures in the nlme software are described in §5.3 in Pinheiro and Bates (2000).

• There are also several spatial correlation structures.

• Here, we give brief descriptions of the serial and general correlation structures.

Serial Correlation Structures:

1. corAR1. Autoregressive of order 1.

• Appropriate for observations taken at evenly spaced time points.

• E.g., for ei = (ei1, ei2, . . . , eit)T, taken at times 1, 2, . . . , t, the model says

corr(eij, eik) = ρ^|j−k|

– E.g., for t = 5 the model implies

corr(ei) =
⎛ 1  ρ   ρ2  ρ3  ρ4 ⎞
⎜    1   ρ   ρ2  ρ3 ⎟
⎜        1   ρ   ρ2 ⎟   (symmetric)
⎜            1   ρ  ⎟
⎝                1  ⎠
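The AR(1) matrix above is easy to construct directly, which can be a useful check; a sketch for t = 5 and an arbitrary illustrative ρ:

```r
# AR(1) correlation matrix: corr(e_ij, e_ik) = rho^|j - k|.
rho <- 0.6   # arbitrary illustrative value
t   <- 5
R <- outer(1:t, 1:t, function(j, k) rho^abs(j - k))
```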

2. corCAR1. This correlation structure is a continuous-time version of an AR(1) correlation structure. The specification is the same as in corAR1, but now the covariate indexing time can take any non-negative non-repeated value and we restrict the correlation parameter ρ ≥ 0.

3. corARMA. This correlation structure corresponds to an ARMA(p, q) model. AR(p) and MA(q) models can be specified with this function, but keep in mind that the corAR1 specification is more efficient than specifying corARMA with p = 1 and q = 0.


General Correlation Structures:

1. corCompSymm. Compound symmetry. In this structure,

corr(eij, eik) = 1 if j = k; ρ if j ≠ k.

• Same correlation structure as implied by a random cluster-specific intercept with independent errors (e.g., split-plot model).

2. corSymm. Specifies a completely general correlation structure with a separate parameter for every non-redundant correlation.

• E.g., for ei = (ei1, ei2, ei3, ei4, ei5)T,

corr(ei) =
⎛ 1  ρ1  ρ2  ρ3  ρ4  ⎞
⎜    1   ρ5  ρ6  ρ7  ⎟
⎜        1   ρ8  ρ9  ⎟   (symmetric)
⎜            1   ρ10 ⎟
⎝                1   ⎠


Q: How do we choose a correlation structure?

• Hard question.

• In a single dimension (e.g., time, position on the bole), AR(1) models are often sufficient.

If we are willing to consider other ARMA models, two tools that are useful in selecting the right ARMA model are the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF).

• The ACF is produced in the lme/nlme software with the ACF() function. The PACF is harder to obtain.

• AR(p) models have PACFs that are non-zero for lags ≤ p and 0 for lags > p. Therefore, we can look at the magnitude of the sample PACF to try to identify the order of an AR process that will fit the data. The number of "significant" partial autocorrelations is a good guess at the order of an appropriate AR process.

• MA(q) models have ACFs that are nonzero for lags ≤ q and 0 for lags > q. Again, we can look at the sample ACF to choose q.

• Simpler approach: trial and error until ACF looks good.
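The cut-off behavior is easy to see on simulated data; a sketch using base R's arima.sim (simulated series, not the course's MFA residuals):

```r
# Simulate an AR(1) series and inspect the sample ACF/PACF.
# For an AR(p) process the PACF should cut off after lag p.
set.seed(1)
x <- arima.sim(model = list(ar = 0.7), n = 500)
a <- acf(x, plot = FALSE)    # sample autocorrelations (lag 0, 1, 2, ...)
p <- pacf(x, plot = FALSE)   # sample partial autocorrelations (lag 1, 2, ...)
```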


Example — Microfibril Angle in Loblolly Pine (Continued)

• In the MFA example, the following code produces the ACF plot below:

> plot(ACF(mfa.lme9,resType="n", form=~1|tree),alpha=.05)

[Figure: sample ACF of the normalized residuals from mfa.lme9 — autocorrelation vs. lag (0–4), with α = .05 significance bounds.]

• Plot doesn’t look great, not terrible either.

• Can try continuous AR(1) correlation structure as follows

> mfa.lme10 <- update(mfa.lme9, corr=corCAR1(form= ~diskht|tree))

> anova(mfa.lme10,mfa.lme9)

Model df AIC BIC logLik Test L.Ratio p-value

mfa.lme10 1 24 1206.418 1291.314 -579.2088

mfa.lme9 2 23 1205.206 1286.564 -579.6028 1 vs 2 0.7879748 0.3747

• Doesn't help. No strong evidence of serial correlation here, and it probably can be safely ignored.


Multilevel Models

Sometimes data are clustered at two or more nested levels. E.g.,

• Educational data: students' test scores are clustered by class within school, school within school district.

• Multisite clinical trial with repeated measures: observations are clustered by patient within clinics, and by clinics within the overall study.

• Forestry data: repeated measures on trees clustered by tree withinplot, plot within stand, etc.

Example — Microfibril Angle in Loblolly Pine (Continued)

When introducing these data, I lied and simplified the description of the sampling design.

• In reality, several stands were sampled within each physiographic region, and then three trees per stand were sampled.

– Data clustered by tree within stand, and by stands within region.

– May want to account for stand-level heterogeneity and tree-level heterogeneity separately, with nested random effects.


Now let yijkℓ be the response at the ℓth height, for the kth tree, in the jth stand, in the ith region.

Multilevel extension of our two-way anova with random tree-specific intercepts:

yijkℓ = µ + αi + βℓ + (αβ)iℓ + bij + bijk + eijkℓ (mfa2.lme2)

• This model is fit as mfa2.lme2 in extend.R. But first we must construct a two-level groupedData object mfa2:

mfa2 <- groupedData(mfa~diskht|stand/tree,data=mfa,
    labels=list(x="Height on stem",
    y="Whole-disk cross-sectional microfibril angle"),
    units=list(x="(m)",y="(deg)"),
    order.groups=F)


• Now refit model mfa.lme9 to the new data set (call it mfa2.lme1), and then update by adding a stand-level random intercept:

> mfa2.lme2 <- update(mfa2.lme1,random=list(stand= ~1,tree= ~1))

> anova(mfa2.lme1,mfa2.lme2)

Model df AIC BIC logLik Test L.Ratio p-value

mfa2.lme1 1 23 1205.206 1286.564 -579.6028

mfa2.lme2 2 24 1198.543 1283.439 -575.2714 1 vs 2 8.662672 0.0032

> summary(mfa2.lme2)

Linear mixed-effects model fit by REML

Data: mfa2

AIC BIC logLik

1198.543 1283.439 -575.2714

Random effects:

Formula: ~1 | stand

(Intercept)

StdDev: 1.586422

Formula: ~1 | tree %in% stand

(Intercept) Residual

StdDev: 1.401302 0.007036924

Variance function:

Structure: Power of variance covariate

Formula: ~fitted(.)

Parameter estimates:

power

1.948697

<portion omitted>

• According to AIC, BIC, two-level model fits better.

• The estimated variance component from stand to stand is 1.586².

• The estimated variance component between trees within stands is 1.401².


Nonlinear Mixed Effects Models

A Motivating Example — Circumference of Orange Trees

The data in the table below are the circumferences of five orange trees over time.

                     Tree No.
Time (days)     1     2     3     4     5
 118           30    33    30    32    30
 484           58    69    51    62    49
 664           87   111    75   112    81
1004          115   156   108   167   125
1231          120   172   115   179   142
1372          142   203   139   209   174
1582          145   203   140   214   177

A plot of the data, with observations from the same tree connected, appears below.

[Figure: "Orange Tree Data w/ NLS Fit" — trunk circumference (mm) vs. time since December 31, 1968 (days), with observed growth curves for Trees 1–5 and the fitted NLS growth curve.]


Also displayed in this plot is the fitted curve from a logistic function fit with NLS.

That is, if we let yij = circumference of the ith tree at age tij, i = 1, . . . , 5, j = 1, . . . , 7, then the fitted model is

yij = θ1 / {1 + exp[−(tij − θ2)/θ3]} + eij (m1Oran.gnls)

where {eij} iid∼ N(0, σ2).
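This single-curve fit can be reproduced with R's built-in Orange data (the same five trees) and the self-starting logistic SSlogis; a sketch, with the parameters mapped as θ1 = Asym, θ2 = xmid, θ3 = scal:

```r
# One logistic curve for all five trees, ignoring the grouping by tree.
fit1  <- nls(circumference ~ SSlogis(age, Asym, xmid, scal), data = Orange)
theta <- coef(fit1)   # theta1 = Asym, theta2 = xmid, theta3 = scal
```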

• Clearly, model (m1Oran.gnls) is inadequate.

The fitted curve goes through the center of the combined data from all trees, but the growth curves of individual trees are poorly estimated.

• Because the growth curves of the different trees spread out as the trees get older, this misspecification will show up as a cone-shaped residuals vs. fitted values plot suggesting heteroskedasticity.

• That is not the problem here: it is only (or at least mainly) between-tree variability that is increasing over time. The within-tree error variance looks to be homoskedastic.


Another problem with m1Oran.gnls: it treats observations as independent. Two obvious potential sources of correlation in these data:

1. Clustering. The data are grouped, or clustered, by tree. Observations from the same tree should share characteristics of that tree which make them similar, or correlated.

– Minimized by very homogeneous groups.

2. Serial Dependence. Observations close together in time will tend to be correlated more highly than observations far apart.

– Often reduced by long lags between measurements, and/or homogeneous environmental conditions through time.

The first of these sources almost certainly affects the orange tree data and the second may as well.

To deal with these problems, we could fit different parameters to each tree, or perhaps it would be sufficient to just fit different asymptotes, or different rate parameters, to each tree.

E.g., the model with 5 separate asymptote parameters, one for each tree, is:

yij = θ1i / {1 + exp[−(tij − θ2)/θ3]} + eij (m2Oran.gnls)

where we assume

corr(ei) = C(ρ),

where C is an assumed form for the within-group correlation matrix, depending on an unknown parameter ρ.

• In m2Oran.gnls, C(ρ) = I, but we could fit an AR(1) or other correlation structure.


While this approach is clearly an improvement over (m1Oran.gnls), it has some disadvantages:

A. # of parameters grows with sample size. In (m2Oran.gnls) we've introduced a distinct fixed asymptote parameter for each tree. Therefore, if we had measured 500 trees, our model would have 502 regression parameters.

Having the number of parameters increase with the sample size introduces a number of problems:

• Theoretical: in ML and LS estimation, asymptotic arguments establishing consistency, optimality break down.

• Computational: Difficult to optimize a criterion of estimation with respect to many parameters.

• Interpretation: We have 500 separate asymptotes and no single parameter describing the average limit of growth. Do we really care what the limit of growth was for tree #391?

• Conceptual: θ1i is the asymptote parameter for tree i. That is, it's the fixed theoretical population constant for the limit of growth for tree i. But what's the population? And why is the asymptote of tree i a fixed constant? Wasn't tree i randomly selected from a population of trees? If so, the asymptote of this randomly drawn tree should be regarded as a random variable, not a parameter.

• Scope of inference: Results apply to the sample at hand, not to the population from which the trees were drawn.


B. Correlation structure. The correlation structure in model (m2Oran.gnls) accounts for within-tree correlation by modelling source 2 (serial correlation) but not source 1 (grouping correlation). It is often difficult and unnecessary to model both sources, but for short time series, modelling 2 is often harder than modelling 1.

That is, it is often not easy to fit an ARMA model to the within-group observations through time. This can be so because of:

• Short series.

• Non-stationary series.

• Unbalanced/missing data and/or irregular or continuous time indexing.

• Having many fixed, cluster-specific parameters to better fit the data from each cluster (tree, plot, etc.) in nonlinear growth curve models is (essentially) the approach taken in the self-referencing functions popular in forest biometrics.

– I'm not a fan.


An alternative: a nonlinear mixed-effects model (NLMM) for the orange tree data.

Again, our fixed-effects nonlinear model (m2Oran.gnls) with 5 separate tree-specific asymptotes is

y_ij = θ_1i / (1 + exp[−(t_ij − θ_2)/θ_3]) + e_ij.    (m2Oran.gnls)

Using an ANOVA-type parameterization for θ_1i, we can write θ_1i = θ̄_1 + τ_i, where Σ_{i=1}^5 τ_i = 0. Here θ̄_1 is the average or typical θ_1-value (asymptote) and τ_i is the ith tree effect.

Under this parameterization, model (m2Oran.gnls) becomes

y_ij = (θ̄_1 + τ_i) / (1 + exp[−(t_ij − θ_2)/θ_3]) + e_ij,    Σ_{i=1}^5 τ_i = 0.

In the ordinary nonlinear regression model, the θ's and the τ's are all considered to be fixed unknown parameters, a.k.a. fixed effects.

In the NLMM, we consider the τ_i's to be random variables, or random effects. τ_i is the deviation from θ̄_1 of the asymptote of the ith tree; it is considered random because the tree itself is a randomly selected representative of the population to which we want to generalize.


Changing symbols from τ_i to b_i, the model becomes

y_ij = (θ_1 + b_i) / (1 + exp[−(t_ij − θ_2)/θ_3]) + e_ij,
b_1, …, b_5 ~ iid N(0, σ_b²),  {e_ij} ~ iid N(0, σ²).    (†)

Here we've also dropped the bar from θ̄_1, writing θ_1 for the mean asymptote.

• Now the asymptote for the ith tree is θ_1 + b_i, a random variable because b_i is a random variable. The asymptote for the typical tree is θ_1 (when b_i = 0).

• If we write θ_1i ≡ θ_1 + b_i, then the 5 asymptotes are randomly distributed around θ_1: θ_11, …, θ_15 ~ iid N(θ_1, σ_b²).
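To make (†) concrete, here is a small simulation sketch. The parameter values are illustrative choices of these notes (roughly in line with the m1Oran.nlme fit reported later), and the measurement ages are those of the Orange data:

```r
# Simulate 5 trees from model (dagger): random asymptotes theta1 + b_i.
set.seed(1)
theta1 <- 190; theta2 <- 725; theta3 <- 350   # illustrative fixed effects
sigma_b <- 30; sigma_e <- 8                   # sd of b_i and of e_ij
ages <- c(118, 484, 664, 1004, 1231, 1372, 1582)  # days, as in Orange
b <- rnorm(5, mean = 0, sd = sigma_b)         # one random effect per tree
y <- sapply(b, function(bi)
  (theta1 + bi) / (1 + exp(-(ages - theta2)/theta3)) +
    rnorm(length(ages), sd = sigma_e))
dim(y)  # 7 ages by 5 trees
```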

Fitting model (†): The fact that the random effects {b_i} enter the NLMM (†) nonlinearly complicates the methodology and theory of NLMMs substantially compared to ordinary NLMs and LMMs.

• To focus on the motivation, interpretation, and basic ideas of NLMMs, we temporarily skip this material and just assume that the nlme() function in S-PLUS can fit an NLMM with a "good" method.

• See the R script NLMM.R, where we analyze these data with NLMMs.

• In this script, we first fit models (m1Oran.gnls) and (m2Oran.gnls). We then fit the NLMM (†) as m1Oran.nlme using the nlme() function.


• Notice that the NLMM (m1Oran.nlme) has estimated regression parameters similar to those in the fixed-effects model that fit one mean curve to all the data, ignoring tree effects (m1Oran.gnls):

> m1Oran.nlme <- nlme(circumference ~ SSlogis(age, Asym, th2, th3), data = Orange,
+    fixed = Asym + th2 + th3 ~ 1, random = Asym ~ 1, start = coef(m1Oran.gnls))
>
> fixef(m1Oran.nlme)  # compare fixed effect estimates
    Asym      th2      th3
191.0499 722.5590 344.1681
> coef(m1Oran.gnls)
    Asym      th2      th3
192.6876 728.7564 353.5337

> summary(m1Oran.nlme)
Nonlinear mixed-effects model fit by maximum likelihood
  Model: circumference ~ SSlogis(age, Asym, th2, th3)
  Data: Orange
       AIC      BIC    logLik
  273.1691 280.9459 -131.5846

Random effects:
  Formula: Asym ~ 1 | Tree
            Asym Residual
StdDev: 31.48255 7.846255

<portion omitted>

• Variability in the asymptotes from tree to tree is captured through b_i, which is assumed normal with mean 0 and estimated variance σ̂_b² = (31.48)². The error variance is estimated to be σ̂² = (7.85)².
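These two variance components can be combined into a rough intraclass-correlation-style summary (a loose analogue only, since the random effect enters this model nonlinearly):

```r
# Share of asymptote-scale variability attributable to tree-to-tree
# differences, using the reported estimates:
sigma_b <- 31.48255; sigma_e <- 7.846255
icc <- sigma_b^2 / (sigma_b^2 + sigma_e^2)
round(icc, 2)  # about 0.94: most variability is between trees
```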


> AIC(m1Oran.nlme, m1Oran.gnls, m2Oran.gnls)
            df      AIC
m1Oran.nlme  5 273.1691
m1Oran.gnls  4 324.7974
m2Oran.gnls  8 254.1040

• The NLMM m1Oran.nlme has AIC = 273.2, BIC = 280.9 for 5 estimated parameters: θ_1, θ_2, θ_3, σ_b², σ². This compares with AIC = 324.8, BIC = 331.0 for the 4-parameter model (m1Oran.gnls) and AIC = 254.1, BIC = 266.5 for the 8-parameter model (m2Oran.gnls).

• So, adding a random effect to the asymptote of (m1Oran.gnls) costs us only 1 df and yields a vast improvement in fit.

• The fit is even better when fitting separate asymptotes to each tree (m2Oran.gnls), but that shouldn't be surprising.

– In (m1Oran.nlme) we save on df in comparison to (m2Oran.gnls) by making a parametric assumption about the distribution of the random effects: that they're normal, with only an unknown variance to be estimated.

– In contrast, model (m2Oran.gnls) makes no assumption about the tree-to-tree variability in asymptotes; it estimates each asymptote separately.

However:

– (m2Oran.gnls) has the problems with many cluster-specific parameters cited above; and

– its advantage would go away if we had more trees in the data set. Then the penalty for lack of parsimony would increase, and model m2Oran.gnls would have higher AIC and BIC than m1Oran.nlme.
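The AIC and BIC values above can be reproduced directly from the reported log-likelihood, since AIC = −2ℓ + 2k and BIC = −2ℓ + k log n, with n = 35 observations here:

```r
# Recompute AIC/BIC for m1Oran.nlme from its reported log-likelihood.
ll <- -131.5846   # logLik from summary(m1Oran.nlme)
k  <- 5           # theta1, theta2, theta3, sigma_b^2, sigma^2
n  <- 35          # 7 measurements on each of 5 trees
aic <- -2*ll + 2*k
bic <- -2*ll + log(n)*k
round(c(AIC = aic, BIC = bic), 3)  # matches the summary up to rounding
```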


• Of course the residuals of model (m1Oran.gnls) looked terrible, because the individual trees were poorly fit by the average curve. The residuals of models (m2Oran.gnls) and (m1Oran.nlme) look about equally good.

Our fitted model for tree i at time t_ij is

ŷ_ij = (θ̂_1 + b̂_i) / (1 + exp[−(t_ij − θ̂_2)/θ̂_3]).

• The b̂_i's aren't estimated parameters of the model. They're predicted quantities based on the fitted model, the data, and the assumption that b_1, …, b_5 ~ iid N(0, σ_b²).

The bi’s are as follows (note these aren’t sorted by tree #):

> ranef(m1Oran.nlme)

Asym

3 -37.000247

1 -29.403585

5 -5.179485

2 31.565006

4 40.018311

Therefore, the predicted circumference of tree 1, say, at time t_1j is given by

ŷ_1j = (191.0 − 29.4) / (1 + exp[−(t_1j − 722.6)/344.2]).
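As a quick sketch using these rounded estimates, the prediction function for tree 1 can be evaluated at any age:

```r
# Predicted circumference (mm) for tree 1 at age t days, from the rounded
# estimates theta1 = 191.0, b1 = -29.4, theta2 = 722.6, theta3 = 344.2.
pred_tree1 <- function(t) (191.0 - 29.4) / (1 + exp(-(t - 722.6)/344.2))
round(pred_tree1(c(118, 664, 1582)), 1)  # three of the measured ages
```

At t = θ̂_2 = 722.6 the logistic term equals 1/2, so the prediction is half of this tree's asymptote, (191.0 − 29.4)/2 = 80.8 mm.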


• Plugging the b̂_i's into the fitted model equation yields the pink curves (tree-level), and plugging in b_i = 0 yields the blue curves (population-level) in the following plot, obtained from the augPred() function:

[Figure: observed and predicted trunk circumference (mm) vs. time since December 31, 1968 (days), one panel per tree (3, 1, 5, 2, 4), with fixed (population-level) and Tree (tree-level) curves.]


• Finally, we examine the ACFs for models (m2Oran.gnls) and (m1Oran.nlme).

[Figure: ACF of the residuals vs. lag (0–6) for model m2Oran.gnls (left) and model m1Oran.nlme (right).]

• The ACFs of models (m2Oran.gnls) and (m1Oran.nlme) are similar: the two models "account for" the residual correlation structure similarly and adequately here.


The NLME Model Formulation

• We consider a single level of clustering before tackling the multilevel case.

Formulation for Single-Level Data:

Let y_ij denote the jth observation (e.g., through time) on the ith cluster (e.g., subject, tree, plot), where we have n clusters and t_i observations in the ith cluster. Let w_ij be a vector of covariates corresponding to response y_ij.

The general form of the NLMM for this situation is

y_ij = f(θ_ij, w_ij) + e_ij,   i = 1, …, n,  j = 1, …, t_i,

where

θ_ij = X_ij β + Z_ij b_i,
b_1, …, b_n ~ iid N(0, D),  {e_ij} ~ iid N(0, σ²),    (*)

where

β = a p × 1 vector of fixed effects,
b_i = a q × 1 vector of cluster-specific random effects with var-cov matrix D,
X_ij = a model/design matrix for β,
Z_ij = a model/design matrix for b_i.

• Note that we assume homoscedastic, uncorrelated errors for now, but this can be relaxed as in the LMM.


Model (*) can be equivalently expressed in matrix form as

y_i = f_i(θ_i, w_i) + e_i,
θ_i = X_i β + Z_i b_i,    (**)

for i = 1, …, n, where

y_i = (y_i1, …, y_it_i)ᵀ,  θ_i = (θ_i1, …, θ_it_i)ᵀ,  e_i = (e_i1, …, e_it_i)ᵀ,
f_i(θ_i, w_i) = (f(θ_i1, w_i1), …, f(θ_it_i, w_it_i))ᵀ,  w_i = (w_i1, …, w_it_i)ᵀ,

and X_i and Z_i are formed by stacking the rows X_i1, …, X_it_i and Z_i1, …, Z_it_i, respectively.

We assume

b_1, …, b_n ~ iid N_q(0, D),  {e_i} ~ iid N_{t_i}(0, σ²I),

and the random effects {b_i} are independent of the errors {e_i}.

Example — Orange Tree Data

To illustrate the model formulation, we write model (†) = (m1Oran.nlme) that we used for these data in the form (**). Model (†) can be written as

y_ij = θ_1ij / (1 + exp[−(t_ij − θ_2ij)/θ_3ij]) + e_ij,

where

θ_ij = (θ_1ij, θ_2ij, θ_3ij)ᵀ = X_ij β + Z_ij b_i,

with

X_ij = I₃ (the 3 × 3 identity matrix),  β = (β_1, β_2, β_3)ᵀ,  Z_ij = (1, 0, 0)ᵀ,  b_i = (b_1i),

so that

θ_ij = (β_1 + b_1i, β_2, β_3)ᵀ.

Here b_i = b_1i is a scalar, so q = 1, D = σ_b², and

b_1, …, b_n ~ iid N(0, σ_b²),  {e_ij} ~ iid N(0, σ²).


A Multilevel Example — TPH in the PMRC Site Preparation Study:

• Study of various site preparation and intensive management regimes on the growth of slash pine.

• Involved 191 0.2-ha plots nested within 16 sites in the lower coastal plain of GA and FL.

• Data consist of tph (100s of trees per ha), site index, soil type, and treatment variables (herb, fert, chop, etc.) at ages 2, 5, 8, 11, 14, 17, and 20 years.

A plot of the data:

[Figure: separate tph profiles for each plot, graphed separately by site (each panel is a site); x-axis: Age (yrs); y-axis: Trees/Hectare (100s of trees).]


• Because the profiles of tph over time appear to be sigmoidal ('S'-shaped) for at least some plots, we consider the four-parameter logistic model for these data.

• The function SSfpl() in the lme/nlme software implements this function in the following form:

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk    (♥)

where

θ_1ij = left (upper) asymptote (mixed effect),
θ_2ij = right (lower) asymptote (mixed effect),
θ_3ij = age at inflection point (mixed effect),
θ_4ij = scale parameter determining how quickly the response approaches the lower asymptote (mixed effect),
y_ijk = tph at the kth age for the jth plot in the ith site,
e_ijk = error term.


The θijk’s are mixed effects. Each one can be modeled in terms of ex-planatory variables, parameters, and random plot and random site effects.

• E.g., we might believe that the lower asymptote θ2ijk might dependon whether or not the site was fertilized.

– In that case we might set

θ2ijk = β20 + β21Fertij (fixed effects only)

• However, we might also believe that there is variability from plot toplot and variability from site to site in the lower asymptote.

– In that case, we would consider modeling the lower asymptoteas

θ2ijk = β20 + β21Fertij + b2i + b2ij (mixed effects)

where the b2i’s are site-specific random effects and b2ij ’s areplot-specific random effects.

• The other θijk’s (the basic “parameters” of the nonlinear model) caneach be modeled this way as well.
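A small numeric sketch of this decomposition (all values below are hypothetical, chosen only to illustrate how the pieces combine):

```r
# theta2 for one fertilized plot = fixed part + site effect + plot effect.
set.seed(2)
beta20 <- 4.0; beta21 <- -0.8      # hypothetical fixed effects
Fert   <- 1                        # indicator: the plot was fertilized
b2_site <- rnorm(1, sd = 1.8)      # site-level random effect
b2_plot <- rnorm(1, sd = 1.2)      # plot-within-site random effect
theta2 <- beta20 + beta21 * Fert + b2_site + b2_plot
theta2
```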


• E.g., suppose that:

– the upper asymptote depends upon whether the site was fertilized, and varies across sites and plots;

– the lower asymptote depends upon whether the site was bedded and whether it was burned, and varies across sites;

– the inflection point depends upon site index;

– and the scale parameter is constant.

• Then an NLMM to describe such a situation would be

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk
      = (β_10 + β_11 Fert_ij + b_1i + b_1ij)
        + [(β_20 + β_21 Bed_ij + β_22 Burn_ij + b_2i) − (β_10 + β_11 Fert_ij + b_1i + b_1ij)] / (1 + exp{[(β_30 + β_31 SI_ij) − Age_ijk]/β_40}) + e_ijk

Note that multivariate random effects at a given level occur more naturally and often in NLMMs than in LMMs.

• E.g., here we have random site effects on both the upper and lower asymptotes. That is, the site effects are bivariate:

b_i = (b_1i, b_2i)ᵀ.

• It's natural to expect that if a site has a unique effect on the lower asymptote and a unique effect on the upper asymptote, then those effects are probably related. Therefore, we typically assume multivariate random effects are correlated. E.g.,

{b_i} ~ iid N(0, D),  where D = [ σ_11  σ_12
                                  σ_12  σ_22 ].    (*)


• There are several functions built into the lme/nlme software that can specify various var-cov structures for multivariate random effects in the model.

– pdDiag() specifies uncorrelated (independent) random effects.

– pdSymm() specifies a general covariance among random effects (as above in (*)).

– pdBlocked() specifies a block-diagonal covariance among random effects.
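For instance, the diagonal and general structures for a 3-dimensional random effect can be constructed directly; a sketch, assuming the nlme package is available:

```r
library(nlme)
# Independent random effects (diagonal D) vs. general (unstructured) D,
# both initialized at the identity matrix:
d_diag <- pdDiag(diag(3), nam = c("th1", "th2", "th3"))
d_symm <- pdSymm(diag(3), nam = c("th1", "th2", "th3"))
as.matrix(d_diag)  # 3 variances, covariances fixed at zero
as.matrix(d_symm)  # 3 variances plus 3 free covariances (zero at start)
```

In a call to nlme(), these objects (or the bare pdDiag()/pdSymm() constructors) are supplied through the random argument, as in the fits below.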

Example — TPH (continued):

In NLMM.R we consider model (♥) for the tph data. This involves

— building mixed models for the "basic" parameters of the logistic function,

— modeling the variance-covariance structures for the random effects, and

— modeling the error variance-covariance structure to allow heteroscedasticity and serial correlation, if necessary.

• First, we create a two-level groupedData object siteprep.tph to contain the data. The data are clustered by site (level 1) and plot within site (level 2).

– Level 0 refers to the population level (corresponding to random effects equal to their mean, 0).


• Next, we try to fit the basic model with θ's not depending on covariates and no serial correlation or heteroscedasticity in the error terms. We do assume random plot and site effects in θ_1, θ_2, and θ_3:

m1tph.fpl <- nlme(tph100 ~ SSfpl(age, th1, th2, th3, th4), data = siteprep.tph,
  fixed = list(th1 ~ 1, th2 ~ 1, th3 ~ 1, th4 ~ 1),
  random = list(site = pdDiag(th1 + th2 + th3 ~ 1),
                plot = pdDiag(th1 + th2 + th3 ~ 1)),
  start = c(10, 8, 10, 1))

• The model here is

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk
      = (β_10 + b_1i + b_1ij)
        + [(β_20 + b_2i + b_2ij) − (β_10 + b_1i + b_1ij)] / (1 + exp{[(β_30 + b_3i + b_3ij) − Age_ijk]/β_40}) + e_ijk

– The pdDiag() specification in the random option requests that the random site effects in the model be independent, and likewise for the random plot effects. I.e., for the site effects:

var(b_i) = var((b_1i, b_2i, b_3i)ᵀ) = diag(σ_11⁽¹⁾, σ_22⁽¹⁾, σ_33⁽¹⁾),

and for the plot effects

var(b_ij) = var((b_1ij, b_2ij, b_3ij)ᵀ) = diag(σ_11⁽²⁾, σ_22⁽²⁾, σ_33⁽²⁾).

– Assuming independence of random effects that operate on the same level of clustering (e.g., site) is usually not realistic. However, it may be necessary, at least when fitting the model initially, to obtain convergence.


• Here is a summary of the model fit:

> summary(m1tph.fpl)
Nonlinear mixed-effects model fit by maximum likelihood
  Model: tph100 ~ SSfpl(age, th1, th2, th3, th4)
  Data: siteprep.tph
       AIC      BIC    logLik
  2151.744 2207.161 -1064.872

Random effects:
  Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
  Level: site
  Structure: Diagonal
             th1     th2      th3
StdDev: 0.855265 1.86076 2.941078

  Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
  Level: plot %in% site
  Structure: Diagonal
             th1      th2      th3  Residual
StdDev: 1.424189 1.262397 1.378790 0.3068705

Fixed effects: list(th1 ~ 1, th2 ~ 1, th3 ~ 1, th4 ~ 1)
        Value Std.Error  DF  t-value p-value
th1 11.969625 0.2395291 945 49.97149       0
th2 10.646728 0.4759046 945 22.37156       0
th3 12.675188 0.7959850 945 15.92390       0
th4  2.020528 0.1151651 945 17.54462       0
<portion omitted>

Number of Observations: 1139
Number of Groups:
  site  plot %in% site
    16             191

• Variance components for all site and plot effects appear to be large, so we don't consider dropping any of these effects (yet).


• A plot of the observed and fitted values (obtained with augPred()) indicates how well the model fits and reveals its nested structure:

[Figure: observed and fitted tph profiles vs. Age (yrs), one panel per plot (labeled site/plot: 1/1, …, 1/13, 4/1, …, 4/8), with fixed-, site-, and plot-level predictions; y-axis: Trees/Hectare (100s of trees).]

• Only a subset of the data is plotted here.

• At this stage we might consider relaxing the assumption of uncorrelated (independent) random site and plot effects. However, the data and model are big enough here that this causes lots of convergence problems.


• Instead, we consider whether the asymptotes and other parameters depend on covariates (e.g., treatment indicators, site index, soil variables, etc.).

– These are all plot-level measurements.

• One useful way to determine which parameters may depend upon which covariates is to graph the predicted plot-specific random effects for each parameter against potential covariates.

– It's as though we're building a separate little linear model for each mixed-effect parameter (each θ), and the predicted plot-level random effects are the residuals from those models.

• The most obvious covariate on which one of the parameters depends is initial trees per hectare (itph100, measured at age 2). Clearly, we'd expect the left (upper) asymptote to be highly dependent on this variable.

– This is apparent from the plot of the θ_1ij random effects vs. itph100.

• Therefore, we add itph100 to the model, yielding m2tph.fpl, which fits much better than m1tph.fpl.

• Continuing in this manner, we build up the mixed effects until arriving at the specifications (m5tph.fpl):

θ_1ij = β_10 + β_11 itph_ij + (fixed trtmnt effects) + b_1i + b_1ij
θ_2ij = β_20 + (fixed trtmnt effects) + b_2i + b_2ij
θ_3ij = β_30 + (fixed trtmnt effects) + b_3i + b_3ij
θ_4ij = β_40


No additional fixed effects appear necessary at this point from the plots of random effects and residuals vs. explanatory variables, so we now consider the assumptions on the error var-cov structure:

• A plot of the residuals vs. fitted values reveals no obvious pattern of non-constant variance.

• However, a plot of the residuals versus site index (si) shows a slight increasing-variance pattern.

• Therefore, in model m6tph.fpl, we add heteroscedasticity of the form

var(e_ijk) = σ² si^(2δ),  where σ̂ = 0.0041, δ̂ = 1.49:

> m6tph.fpl <- update(m5tph.fpl, weights = varPower(form = ~si),
+                     start = fixef(m5tph.fpl))

> summary(m6tph.fpl)
<portion omitted>
Random effects:
  Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
  Level: site
  Structure: Diagonal
        th1.(Intercept) th2.(Intercept) th3.(Intercept)
StdDev:       0.1559281        1.849531        2.303681

  Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
  Level: plot %in% site
  Structure: Diagonal
        th1.(Intercept) th2.(Intercept) th3.(Intercept)    Residual
StdDev:       0.5110908        1.203069        1.403265 0.004075045

Variance function:
  Structure: Power of variance covariate
  Formula: ~si
  Parameter estimates:
     power
  1.494952
<portion omitted>

• This improves the model (decreases AIC) substantially.
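The fitted varPower form can be unpacked numerically: it says sd(e_ijk) = σ · si^δ, so the residual standard deviation grows with site index. A sketch using the reported estimates (the two site-index values are hypothetical):

```r
# Implied residual sd at two hypothetical site-index values, using the
# reported estimates sigma = 0.004075045 and delta = 1.494952:
sigma <- 0.004075045; delta <- 1.494952
sd_at <- function(si) sigma * si^delta
round(c(sd_at(40), sd_at(70)), 3)  # residual sd rises with si
```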


Now consider serial correlation:

• The ACF plot of m6tph.fpl shows clear evidence of negative autocorrelation among the residuals.

– To deal with this, we consider adding an AR(1) autocorrelation structure to the model (done in two steps, models m7tph.fpl and m8tph.fpl).

– However, this does not improve the model according to AIC and BIC, and it has almost no impact on the ACF plot.

– Whenever modeling the autocorrelation has virtually no effect on the ACF plot, we should be suspicious that the problem is not autocorrelation per se, but misspecification of the mean, which manifests as autocorrelated residuals.

– To investigate this, a plot of the residuals by age can help:

[Figure: standardized residuals vs. Age (yrs).]


• The plot reveals a slight mean misspecification in the model, which results in a wavy shape in the residuals over time. This leads to the large negative autocorrelation at lag 1 found in the ACF plot.

– To remove this autocorrelation, it would probably be necessary to consider another nonlinear form for tph over time, one that decreases more rapidly from the upper asymptote. For now, however, we satisfy ourselves with the logistic model and live with the apparent autocorrelation.

• Finally, I fit a number of other models in an attempt to simplify the random-effect structures (not all shown).

– In particular, in model m9tph.fpl we find that dropping the plot-level random effect in θ_2ij decreases the AIC (improves the fit).

– Therefore, the "final" model here is m9tph.fpl, although, undoubtedly, we could do some further tweaking.
