Mixed Models
Symposium Agricultural field trials, 10.10.2007
University of Hohenheim
Dr. sc. agr. Andreas Büchse
07/2002 - 04/2007: Fachgebiet Bioinformatik, Universität Hohenheim
since 05/2007: Agrarzentrum Limburgerhof, BASF Aktiengesellschaft
Littell, Milliken, Stroup, Wolfinger (1996): SAS System for Mixed Models. SAS Institute, Cary, NC. (2nd ed. 2006)
Verbeke, G., and Molenberghs, G. (2000): Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York.
Piepho, H.P., Büchse, A., and Emrich, K. (2003): A hitchhiker's guide to the mixed model analysis of randomized experiments. Journal of Agronomy and Crop Science, 189, 310-322.
Piepho, H.P., Büchse, A., and Richter, C. (2004): A mixed modelling approach to randomized experiments with repeated measures. Journal of Agronomy and Crop Science, 190, 230-247.
Zimmerman, D. L., and Harville, D. A. (1991): A Random Field Approach to the Analysis of Field-Plot Experiments and Other Spatial Experiments. Biometrics, 47, 223-23.
Longford, N. T. (1993): Random Coefficient Models. Oxford University Press.
Definition
What does "mixed" model mean?
a variety trial
One location, several varieties, e.g. 4 replications
Randomized Complete Block Design
fixed effects model
Model of a randomized complete block design
y_ik = μ + b_k + α_i + e_ik
y_ik = yield of the i-th variety in the k-th block
μ = constant (fixed)
b_k = effect of the k-th block (fixed)
α_i = effect of the i-th variety (fixed)
e_ik = residual error ~ N(0, σ²_e)
Only one random effect ⇒ no mixed model
a variety trial (again)
One location, several varieties, maybe from a cross of two lines, e.g. 4 replications
Randomized Complete Block Design
If the breeder is interested in estimating response to selection
R = i · h · σ_G
R = response to selection
i = selection intensity
h = square root of heritability
σ_G = square root of genetic variance
Then it is reasonable to treat variety effects as random effects (sampled from a Gaussian distribution)
mixed model
Model of a randomized complete block design, with random variety effects
y_ik = μ + b_k + α_i + e_ik
y_ik = yield of the i-th variety in the k-th block
μ = constant (fixed)
b_k = effect of the k-th block (fixed)
α_i = effect of the i-th variety ~ N(0, σ²_G)
e_ik = residual error ~ N(0, σ²_e)
more than one random effect ⇒ mixed model !
Definition
A mixed linear model is a model that has more than one random effect (not only the residual error).
It has random and fixed effects.
Random = random sample from a (normal) distribution
Fixed effects versus random effects
Interest in linear models [...] lies mainly in estimating (and testing hypotheses about) linear functions of the effects in the models. These effects are what we call fixed effects, and the models are correspondingly called fixed effects models. There are, however, situations where we have no interest in linear functions of effects but where, by the nature of the data and their derivation, the things of prime interest concerning the effects are variances. Effects of this nature are called random effects [...] and the models are called random effects models.
Models involving a mixture of fixed and random effects are called mixed models.
SEARLE, Linear models, 1971
Fixed or random?
Are inferences going to be drawn from these data about just these levels of the factor?
"Yes" - then the effects are to be considered as fixed effects
"No" - then, presumably, inferences will be made not just about the levels occurring in the data but about some population of levels [...] the effects are considered as being random.
SEARLE, Linear models, 1971
Fixed or random?
Are the levels of the factor a sample from a random distribution?
• no: fixed factor, estimate fixed effects (BLUE)
• yes: random factor. What is the main interest?
    • only the distribution of the random effects: estimate variance components (REML)
    • the distribution of the random effects as well as the effects themselves: estimate variance components (REML) and realized random effects (BLUP)
Are years random? (SEARLE, Linear models, 1971)
[...] take the case of year effects, for example in studying wheat yields: are the effects of years on yield to be considered fixed or random?
The years themselves are unlikely to be random, for they will probably be a group of consecutive years over which the data have been gathered or the experiments run.
But the effects on yield may reasonably be considered random - unless, perhaps, one is interested in comparing specific years for some purpose.
Are blocks fixed or random?
Everlasting story...
• With fixed blocks the s.e.m. is underestimated
• For the s.e.d. it does not matter!
• With random blocks, in small experiments, the variance component is sometimes estimated < 0, which results in model reduction
• My opinion: fixed blocks are OK.
Some examples
• Genotypic lines in a breeding experiment (random)
• Varieties in a VCU trial (fixed, or random with shrinkage -> BLUP, see Michel & Piepho)
• Different doses of plant protection agents (fixed)
• Different agents at the same level (fixed)
• Locations (fixed or random)
• Regions (fixed)
• Years (random, mostly)
• Complete blocks (both are possible)
• Incomplete blocks (both are possible; random allows use of interblock information, see Williams, Proceedings p. 266)
some theory
General form of a mixed model in matrix notation
y = Xβ + Zu + e
y = a vector with n observations
X, Z = design matrices
β = a vector of fixed effects
u = a vector of random effects
e = a vector of residuals
e ∼ N(0,R) u ∼ N(0,G)
R, G = variance-covariance matrices
General form of a mixed model in matrix style
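y = Xβ + Zu + e,   e ∼ N(0, R),   u ∼ N(0, G)

\[
\begin{pmatrix} 12 \\ 16 \\ 15 \\ 15 \\ 15 \\ 14 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} \mu \\ b_1 \\ b_2 \end{pmatrix}
+
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix}
+
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \end{pmatrix}
\]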
6 plots, 2 fixed block effects, 3 random variety effects
G: the variance-covariance-matrix of random effects
A 3 by 3 matrix because we have 3 varieties
Genetic variance on the diagonal
No covariance between genotypic effects
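\[
G =
\begin{pmatrix} \sigma^2_G & 0 & 0 \\ 0 & \sigma^2_G & 0 \\ 0 & 0 & \sigma^2_G \end{pmatrix}
= \sigma^2_G
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \sigma^2_G I
\]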
R: the residual variance-covariance-matrix (iid)
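A 6 by 6 matrix because we have 6 plots.
Residual variance on the diagonal
No covariance (in case of independent errors)
This is always the form of R in a fixed effects model, but it does not have to be in a mixed model. In a mixed model alternative forms of R are possible, as we will see later.

\[
R =
\begin{pmatrix}
\sigma^2_e & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^2_e & 0 & 0 & 0 & 0 \\
0 & 0 & \sigma^2_e & 0 & 0 & 0 \\
0 & 0 & 0 & \sigma^2_e & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma^2_e & 0 \\
0 & 0 & 0 & 0 & 0 & \sigma^2_e
\end{pmatrix}
= \sigma^2_e
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
= \sigma^2_e I
\]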
Estimation
The fixed and random effects are estimated with the method of WEIGHTED LEAST SQUARES
Weights depend on the variance components (G, R)
Normally the variance components are unknown and have to be estimated themselves
An accepted method for the estimation is REML (Patterson & Thompson 1971)
Estimation of random and fixed effects
Weighted least squares estimation of β:
\[ \hat{\beta} = (X' V^{-1} X)^{-} X' V^{-1} y \]
with
\[ V = Z G Z' + R \]
Here (X'V⁻¹X)⁻ denotes a generalized inverse.
For more details see e.g. Littell et al., "SAS System for Mixed Models", and the literature cited there
Difficult matrix equations, but the computer does this for us ☺
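To make the matrix equations less abstract, here is a minimal sketch in plain Python of V = ZGZ' + R and the weighted least squares estimate for a 6-plot example with 2 blocks and 3 varieties. Several things are assumptions for illustration only: the variance components (sigma2_g, sigma2_e) are hypothetical, the observations y and the plot order are example values, and the fixed part uses one column per block (so it estimates μ + b_k) to avoid the generalized inverse. The helper functions are not from any library.

```python
# Sketch: V = Z G Z' + R and the weighted least squares estimate,
# written without external libraries (small matrices only).

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def identity(n, scale=1.0):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(a)
    eye = identity(n)
    m = [list(map(float, row)) + eye[i] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

# Hypothetical variance components (assumption, not estimates).
sigma2_g, sigma2_e = 2.0, 1.0

# Example observations for 6 plots (assumed order: block 1, then block 2).
y = [[12.0], [16.0], [15.0], [15.0], [15.0], [14.0]]
# Full-rank fixed part: one indicator column per block.
X = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
# Random part: one indicator column per variety.
Z = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

G = identity(3, sigma2_g)
R = identity(6, sigma2_e)
ZGZt = matmul(matmul(Z, G), transpose(Z))
V = [[ZGZt[i][j] + R[i][j] for j in range(6)] for i in range(6)]

# beta_hat = (X' V^-1 X)^-1 X' V^-1 y
V_inv = inverse(V)
Xt_Vinv = matmul(transpose(X), V_inv)
beta_hat = matmul(inverse(matmul(Xt_Vinv, X)), matmul(Xt_Vinv, y))
block_means = [row[0] for row in beta_hat]
print(block_means)
```

In this balanced layout every variety occurs once per block, so the weighted estimates coincide with the simple block means; the weights start to matter once the data are unbalanced.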
Software for Linear Mixed Models
SAS PROC MIXED
MIXED Procedure in SPSS
GenStat
ASReml
S-Plus
R
Several other packages (Statistica, Systat, ...)
Some examples for using mixed models
Example 1: Trial series
A cultivar performance trial
14 oat varieties, 18 sites
Model for a single site
y_ik = μ + b_k + α_i + e_ik
y_ik = yield of the i-th variety in the k-th block
μ = constant (fixed)
b_k = effect of the k-th block (fixed)
α_i = effect of the i-th variety (fixed)
e_ik = residual error ~ N(0, σ²_e)
From single trial to a series
y_ik = μ + α_i + b_k + e_ik      (add index j for sites)
y_ijk = μ_j + α_ij + b_jk + e_ijk
y_ijk = yield of the i-th variety in the k-th block at site j
μ_j = mean yield at site j (fixed or random?)
α_ij = effect of the i-th variety at site j (fixed?)
b_jk = effect of the k-th block at site j (fixed?)
e_ijk = residual error ~ N(0, σ²_e)

Splitting the site-specific terms into main effects and interaction gives
y_ijk = μ + α_i + β_j + (αβ)_ij + b_jk + e_ijk
y_ijk = yield of the i-th variety in the k-th block at site j
μ = global mean
α_i = effect of the i-th variety
β_j = effect of the j-th site
(αβ)_ij = interaction of the i-th variety with the j-th site
b_jk = effect of the k-th block at site j
e_ijk = residual error ~ N(0, σ²_e)
Which effects should be random?
Do sites have fixed or random effects?
Aim of the series?
We want to make inferences for a region or a state (in this example the whole of Germany) -> random
Are the sites a random sample from "a population" of sites in Germany?
The sites are mostly not a random sample, but maybe the site effects are.
(for single years, site effects are confounded with the site by year interaction!)
Are variety effects fixed or random?
Aim of the series?
We want to know about the performance of the tested varieties
-> fixed
The varieties are not a random sample from "a population" of varieties
mixed model
yijk = μ + αi + βj + (αβ)ij + bjk + eijk
sites: random
varieties: fixed
site × variety: random (sites are random)
blocks within site: random (sites are random)
proc mixed data=t lognote;
  class site variety block;
  model yield = variety / ddfm=kenwardroger;
  random site variety*site block(site);
  repeated / group=site;
  lsmeans variety / pdiff;
run;
CPU-time effect of different syntax
/* heterogeneous errors - long cpu time */
proc mixed data=t lognote;
  class site variety block;
  model yield = variety / ddfm=kenwardroger;
  random site variety*site block(site);   /* large G-matrix */
  repeated / group=site;
run;

/* heterogeneous errors - short cpu time */
proc mixed data=t lognote;
  class site variety block;
  model yield = variety / ddfm=kenwardroger;
  random int variety block / subject=site;
  repeated / group=site;
run;
Allow a heterogeneous error variance for each site

Cov Parm    Subject   Group      Estimate
Intercept   site                  82.3446
variety     site                   6.0686
block       site                   7.7449
Residual              site 1       3.2903
Residual              site 2       5.3894
Residual              site 3      11.0351
Residual              site 4       5.9991
Residual              site 5       3.5523
Residual              site 6       9.7576
Residual              site 7       5.9514
Residual              site 8       4.7286
Residual              site 9       9.7070
Residual              site 10     15.2087
Residual              site 11      6.3460
Residual              site 12      5.3596
Residual              site 13      7.2015
Residual              site 14      7.9454
Residual              site 15     44.0650
Residual              site 16      4.9044
Residual              site 17      7.6436
Residual              site 18     11.1075
Model selection
Homogeneous errors or heterogeneous errors?
Comparison of AIC (Akaike criterion)

Fit Statistics              heterogeneous   homogeneous
-2 Res Log Likelihood           5471.4         5635.7
AIC (smaller is better)         5513.4         5643.7
AICC (smaller is better)        5514.3         5643.7
BIC (smaller is better)         5532.1         5647.2
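Model selection
Likelihood ratio test

Fit Statistics              heterogeneous   homogeneous
-2 Res Log Likelihood           5471.4         5635.7

The restricted log-likelihood of the heterogeneous model is higher (its -2 Res Log Likelihood is smaller)
T = 5635.7 - 5471.4 = 164.3
Compare to a Chi²-distribution with 17 df (18 residual variances instead of 1).
Chi²(5%, 17) = 27.6 ⇒ the heterogeneous model fits significantly better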
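The arithmetic behind both comparisons can be retraced in a few lines. This is only a sketch: the parameter counts are an assumption (3 variance components for site, variety × site and block within site, plus 18 residual variances or 1), and the critical value 27.6 is the tabulated one quoted in the text rather than computed.

```python
# -2 residual log-likelihoods from the fit statistics
m2rll_het = 5471.4   # heterogeneous error variances
m2rll_hom = 5635.7   # homogeneous error variance

# Number of covariance parameters (assumed counts, see lead-in)
n_par_het = 3 + 18
n_par_hom = 3 + 1

# AIC = -2 Res Log Likelihood + 2 * number of covariance parameters
aic_het = m2rll_het + 2 * n_par_het
aic_hom = m2rll_hom + 2 * n_par_hom

# Likelihood ratio test: difference of the -2 log-likelihoods
lrt = m2rll_hom - m2rll_het
df = n_par_het - n_par_hom
chi2_crit = 27.6     # tabulated Chi²(5%, 17), as quoted in the text

print(round(aic_het, 1), round(aic_hom, 1))
print(round(lrt, 1), df, lrt > chi2_crit)
```

Both criteria point in the same direction here: the heterogeneous model has the smaller AIC and the test statistic is far above the critical value.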
outlook
Several different structures for genotype-environment interaction are possible
Piepho, H.-P., 1997: Analyzing genotype-environment data by mixed models with multiplicative effects. Biometrics 53: 761-766.
F. v. Eeuwijk: Statistical methods for the analysis of series of plant breeding experiments. (Proceedings of this meeting, 242 ff.)
2-step analysis: Estimate the "variety × site" means in the first step. Then analyse these means (not the plot values) over all sites in the second step
Michel et al. (Proceedings 136-141)
Example 2: Hierarchical designs
(analogous to an example from Snedecor & Cochran, Statistical methods, 1967)
plant   calcium content [%] of single leaves
        leaf 1   leaf 2   leaf 3   leaf 4
1        3.28     3.09     3.03     3.03
2        3.52     3.48     3.38     3.38
3        2.88     2.80     2.81     2.76
4        3.34     3.38     3.23     3.26
[Figure: Calcium content of the four plants; x-axis: plant (1 to 4), y-axis: calcium (2.7 to 3.6)]
Statistical inference
What is the best estimator for the average Ca content?
1) Simple arithmetic mean of the 16 leaves
2) Calculate the average per plant first and then calculate the average over the four plant means
Does this matter?
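Both estimators can be tried on the calcium data directly. The sketch below uses only the Python standard library; the variable names are illustrative. It also computes two standard errors, one that naively treats the 16 leaves as independent and one based on the four plant means, which is an addition for illustration and not part of the original slide.

```python
from math import sqrt
from statistics import mean, stdev

# Calcium content [%] of four leaves from each of four plants
calcium = {
    1: [3.28, 3.09, 3.03, 3.03],
    2: [3.52, 3.48, 3.38, 3.38],
    3: [2.88, 2.80, 2.81, 2.76],
    4: [3.34, 3.38, 3.23, 3.26],
}

leaves = [value for plant in calcium.values() for value in plant]
plant_means = [mean(values) for values in calcium.values()]

# 1) simple mean over all leaves, 2) mean of the plant means
mean_leaves = mean(leaves)
mean_plants = mean(plant_means)

# Standard errors under the two views of the data
se_naive = stdev(leaves) / sqrt(len(leaves))             # leaves as independent
se_plants = stdev(plant_means) / sqrt(len(plant_means))  # plants as units

print(mean_leaves, mean_plants)
print(se_naive, se_plants)
```

With the same number of leaves per plant the two means agree, so for the point estimate the choice does not matter. The standard errors differ clearly, because leaves from the same plant are correlated and do not count as independent replications.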
Calculate plant means
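data beet;
  input plant @@;         /* plant number; hold the input line */
  do leaf = 1 to 4;
    input calcium @@;     /* one calcium value per leaf */
    output;               /* one observation per leaf */
  end;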
Every row or column represents one leaf; every 4 by 4 block represents one plant. The values on the diagonal are the variances of single observations, the sum of the residual variance and the plant variance. Off the diagonal are the covariances of leaves from the same plant (= the plant variance).
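The structure described here can be written down explicitly. The following sketch builds the 16 by 16 matrix for 4 plants with 4 leaves each; the function name and the two variance values are hypothetical and serve only to show the pattern.

```python
def plant_leaf_covariance(n_plants, n_leaves, var_plant, var_resid):
    """Block-diagonal covariance matrix of leaves nested within plants."""
    n = n_plants * n_leaves
    v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i // n_leaves == j // n_leaves:   # leaves i and j on same plant
                v[i][j] = var_plant              # covariance = plant variance
            if i == j:
                v[i][j] += var_resid             # diagonal adds residual variance
    return v

# Hypothetical variance components, chosen only for illustration
V = plant_leaf_covariance(4, 4, var_plant=0.07, var_resid=0.007)

for row in V[:5]:
    print(["%.3f" % value for value in row[:8]])
```

Leaves from different plants get a covariance of zero, which is why the matrix falls apart into four identical 4 by 4 blocks.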
Repeated measurements
Each plant is measured four times (4 leaves)
The plant is the "subject" in repeated measurement syntax and is measured "repeatedly"
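Field layout: 3 blocks, 6 crops (A to F), 9 varieties (S1 to S9)

block 1:  A B C D E F
block 2:  D F A E C B
block 3:  E D C B F A

Within each crop plot the 9 varieties are arranged in their own order, e.g. S1 S2 S3 S4 S5 S6 S7 S8 S9 in one plot and S6 S2 S8 S3 S1 S7 S9 S4 S5 in another.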
Linear Model for RCB
y_ijk = µ + a_i + b_j + r_k + ab_ij + e_ijk
y_ijk   yield of the j-th variety after the i-th crop in the k-th block
µ       global mean or constant (fixed)
a_i     effect of the i-th crop (fixed)
b_j     effect of the j-th variety (fixed)
r_k     effect of the k-th block (fixed)
ab_ij   interaction between the i-th crop and the j-th variety (fixed)
e_ijk   residual error of the ijk-th plot ~ N(0, σ²_e)
Linear Model for Split-plot-design
y_ijk = µ + a_i + b_j + r_k + ab_ij + f_ik + e_ijk
y_ijk   yield of the j-th variety after the i-th crop in the k-th block
µ       global mean or constant (fixed)
a_i     effect of the i-th crop (fixed)
b_j     effect of the j-th variety (fixed)
r_k     effect of the k-th block (fixed)
ab_ij   interaction between the i-th crop and the j-th variety (fixed)
f_ik    error of the ik-th main plot ~ N(0, σ²_ra)
e_ijk   residual error of the ijk-th plot ~ N(0, σ²_e)
Why do we need the mainplot error?
1: The observations are not independent. Plots in the same main plot are closer to one another and should be more similar to one another. -> Covariance between plots from the same main plot
2: Repeated measurement design: main plots are measured 9 times (9 varieties)
3: "Analyze as randomized": two steps of randomization. Each step is connected to one effect. Randomization units that do not contain all treatments should be random.
Complete model for the trial series
y_ijk = µ + α_i + u_j + (αu)_ij + (β + γ_i + v_j + (γv)_ij) x_ijk + b_jk + e_ijk
µ = intercept
α_i = effect of the i-th variety
u_j = effect of the j-th site (random)
(αu)_ij = interaction variety by site (random)
x_ijk = infestation of the ijk-th plot
β = general regression of yield on infestation
γ_i = deviation of the i-th variety from the regression
v_j = deviation of the j-th site from the regression (random)
(γv)_ij = deviation from the regression due to interaction (random)
b_jk = effect of the k-th block at the j-th site (random)
e_ijk = residual error
„Random Coefficient Model“
Random intercept (site effects) and random deviations from the mean regression
Regressions are a random sample from a population of regression lines
• Laird & Ware, Biometrics 1982
• Longford 1993, Random Coefficient Models
• Wolfinger 1996, JABES
• Littell et al. 1996, SAS System for Mixed Models
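Σ1 and Σ2 are unstructured variance-covariance matrices

\[
\operatorname{var}\begin{pmatrix} u_j \\ v_j \end{pmatrix}
= \Sigma_1 =
\begin{pmatrix} \sigma^2_u & \sigma_{uv} \\ \sigma_{uv} & \sigma^2_v \end{pmatrix}
\]

\[
\operatorname{var}\begin{pmatrix} (\alpha u)_{ij} \\ (\gamma v)_{ij} \end{pmatrix}
= \Sigma_2 =
\begin{pmatrix} \sigma^2_{\alpha u} & \sigma_{\alpha u,\gamma v} \\ \sigma_{\alpha u,\gamma v} & \sigma^2_{\gamma v} \end{pmatrix}
\]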
Model
• Allowing for covariance between intercept and slope makes the parameter estimates "resistant" to scale effects
Variance of site intercepts (σ²u) and site slopes (σ²v); σuv = covariance between intercept and slope
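Why the covariance matters can be seen from a short derivation. The shift constant c and the shifted covariate x' are notation introduced here for illustration only; the argument is a sketch for the site-level terms and applies in the same way to the site × variety terms.

```latex
% Shift the covariate by a constant c:  x' = x - c
u_j + v_j x = (u_j + c\,v_j) + v_j x' = u'_j + v_j x'

% The intercept on the new scale is u'_j = u_j + c v_j, hence
\operatorname{var}(u'_j) = \sigma^2_u + 2c\,\sigma_{uv} + c^2 \sigma^2_v
\operatorname{cov}(u'_j, v_j) = \sigma_{uv} + c\,\sigma^2_v
```

Even if σuv were zero on the original scale, it is generally not zero after shifting the covariate. A model that forces the covariance to zero therefore changes its meaning with the scaling of x, while the unstructured form does not.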
SAS-Code (1)
proc mixed data=gesamt method=reml maxiter=100;
  class site variety block;
  model yield = variety variety*infestation / solution noint ddfm=kr;
  random int infestation / subject=site type=unr;
  random int infestation / subject=site*variety type=unr;
  random block / subject=site;
  lsmeans variety / pdiff at infestation=0;
  lsmeans variety / pdiff at infestation=0.1;
  lsmeans variety / pdiff at infestation=0.3;
  lsmeans variety / pdiff at infestation=0.5;
  lsmeans variety / pdiff at infestation=1;
  ods output lsmeans=estimates_yield;
  ods output diffs=diffs_yield;
run;
SAS-Output (1)
WARNING: Did not converge.

Covariance Parameter Values At Last Iteration

Cov Parm     Subject         Estimate
UN(1,1)      site              1.5704
UN(2,1)      site             -0.5173
UN(2,2)      site             -0.4490
Var(1)       site*variety      0.2140
Var(2)       site*variety      1.4670
Corr(2,1)    site*variety    8.13E-34
Block        site             0.02270
Residual                       1.4028
Model selection to reach convergence
• Change the scaling of the covariate (addition, subtraction, division, multiplication) so that the variances on the diagonal of the variance-covariance matrices are of a similar order of magnitude
• Reduce the model
• Try models with heterogeneous errors
SAS-Code (2)
data gesamt;
  set Gesamt;
  infestation_v = ((infestation) - 0) / 100;
run;

proc mixed data=gesamt method=reml maxiter=100;
  class site variety block;
  model yield = variety variety*infestation / solution noint ddfm=kr;
  * random int infestation / subject=site type=unr;
  random int infestation / subject=site*variety type=unr;
  random block / subject=site;
  parms (2)(2)(0.2)(0.03)(1.3);
  lsmeans variety / pdiff at infestation=0;
  lsmeans variety / pdiff at infestation=0.1;
  lsmeans variety / pdiff at infestation=0.3;
  lsmeans variety / pdiff at infestation=0.5;
  lsmeans variety / pdiff at infestation=1;
  ods output lsmeans=estimates_yield;
  ods output diffs=diffs_yield;
run;
Covariance Parameter Estimates

Cov Parm     Subject         Estimate
Var(1)       site*variety      2.8899
Var(2)       site*variety      2.3892
Corr(2,1)    site*variety     -0.6637
Block        site             0.03468
Residual                       1.3555

Effect   variety   Estimate   Standard Error   DF   t Value   Pr > |t|
High residual error, due to small plots (30 plants)
Example 6: Spatial models
Several examples of spatial models in this symposium (e.g. SCHNEIDER et al., page 209, Proceedings)
What are the basic methods of spatial statistics?
Starting point
Spatial statistics can be seen as a multidimensional extension of repeated measurements. [...] Observations are correlated in two spatial (mostly) dimensions.
LITTELL et al. 1996
• The nearer two points are together, the higher the correlation should be
• Correlation depends on distance
• Several models are possible to describe this distance-correlation dependency
Semivariance: a measure of similarity
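\[
\gamma(h) = \frac{1}{2N(h)} \sum_{(i,j)\,|\,h_{ij}=h} (v_i - v_j)^2
\]
h = a distance "class"
h_ij = distance class of the pair of points i and j
N(h) = number of pairs (i, j) in distance class h
v_i, v_j = measurements at the single observations i and j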
The higher the correlation, the lower the semi-variance
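The empirical semivariance can be computed with a short loop over all pairs of points. The function below is a sketch with a hypothetical name; assigning each pair to the nearest multiple of lag_width is one simple way to form distance classes, and the transect data are invented for illustration.

```python
from math import hypot

def empirical_semivariance(points, values, lag_width):
    """Return {distance class h: gamma(h)} for all pairs of points.

    Each pair is assigned to the nearest multiple of lag_width.
    """
    sums, counts = {}, {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = hypot(points[i][0] - points[j][0],
                      points[i][1] - points[j][1])
            h = round(d / lag_width)                 # index of distance class
            sums[h] = sums.get(h, 0.0) + (values[i] - values[j]) ** 2
            counts[h] = counts.get(h, 0) + 1         # N(h)
    return {h * lag_width: sums[h] / (2 * counts[h]) for h in sorted(sums)}

# Invented example: five plots along a transect, one unit apart
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
values = [10.0, 12.0, 11.0, 15.0, 14.0]

gamma = empirical_semivariance(points, values, lag_width=1.0)
for h, g in gamma.items():
    print(h, g)
```

Plotting gamma against h gives the empirical variogram. Computing it separately for pairs aligned in different directions would show whether the spatial dependence is the same in all directions.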
[Figure: Anisotropic variogram; dotted line = east-west, solid line = north-south; sill and nugget are marked]
Some models to describe the empirical variogram
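Exponential model:
\[ f(d_{ij}) = \exp(-d_{ij}/\rho) \]
Spherical model:
\[ f(d_{ij}) = 1 - \frac{3 d_{ij}}{2\rho} + \frac{d_{ij}^3}{2\rho^3} \quad \text{if } d_{ij} \le \rho; \qquad \text{else } f(d_{ij}) = 0 \]
Gaussian model:
\[ f(d_{ij}) = \exp(-d_{ij}^2/\rho^2) \]
d_ij = distance between two points i and j
ρ = an estimable parameter (it governs how quickly the correlation between points declines with distance)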
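The three correlation functions are easy to compare numerically. The sketch below writes them as plain Python functions; the function names and the value of ρ used in the loop are chosen for illustration only.

```python
from math import exp

def exponential(d, rho):
    """Exponential model: exp(-d / rho)."""
    return exp(-d / rho)

def spherical(d, rho):
    """Spherical model: reaches exactly zero at distance rho."""
    if d > rho:
        return 0.0
    return 1.0 - 1.5 * d / rho + 0.5 * (d / rho) ** 3

def gaussian(d, rho):
    """Gaussian model: exp(-d^2 / rho^2)."""
    return exp(-(d * d) / (rho * rho))

rho = 10.0  # hypothetical value of the range parameter
for d in (0.0, 2.5, 5.0, 10.0, 20.0):
    print(d, round(exponential(d, rho), 3),
          round(spherical(d, rho), 3),
          round(gaussian(d, rho), 3))
```

All three functions equal 1 at distance zero and decline with distance. Only the spherical model drops to exactly zero at a finite distance (ρ); the other two approach zero asymptotically.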
How to fit spatial models with SAS: A uniformity trial