Hierarchical Bayesian Modeling of Heterogeneity in the
Association between Milk Production and Reproductive Performance of
Dairy Cows
Beyond the Generalized Linear Mixed Model: a Hierarchical Bayesian Perspective
Robert J. Tempelman, Professor
Department of Animal Science, Michigan State University, East Lansing, MI, USA
KSU Conference on Applied Statistics in Agriculture, April 30, 2012
"It is safe to say that improper attention to the presence of random effects is one of the most common and serious mistakes in the statistical analysis of data."
Littell, R.C., W.W. Stroup and R.J. Freund. SAS System for Linear Models (2002), pg. 92.
This statement was likely intended to apply to biologists analyzing their own data.
It surely does not apply to the experts... right?
Mixed models in genetics & genomics and agriculture
Have we often thought carefully about how we use mixed models?
Do we sometimes mis-state the appropriate scope of inference? (lapses in both design and analyses?)
Do we always fully appreciate/stipulate what knowledge we are conditioning on in data analyses?
Are there too many other efficiencies going untapped? Shrinkage is a good thing (Allison et al., 2006).
Hierarchical/mixed model inference should be carefully calibrated to exploit data structure while maintaining the integrity of the inference scope and being upfront about any conditioning specifications. Particularly important with GLMM as opposed to LMM?
Research and Public Credibility
A disheartening article: "Raise standards for preclinical cancer research" by Glenn Begley in Nature (March 29, 2012).
Out of 53 papers deemed to be landmark studies in the last decade, scientific findings were confirmed in only 6 cases. Some of these non-reproducible studies have spawned entire fields (and were cited hundreds of times!) and triggered subsequent clinical trials.
What happened?
1) some studies based on a small number of cell lines (narrow scope!);
2) an obsession to provide a perfectly clean story;
3) poor data analyses too?
Would having the data available in the public domain help? Maybe not; consider the case of gene expression microarrays.
Data from microarray studies are routinely deposited in public repositories (GEO and ArrayExpress), most based on very simple designs.
Ioannidis, J.P.A. et al. 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.
Out of 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006, only two analyses could be reproduced in principle.
Why? Generally because of incomplete data annotation or specification of data processing and analysis.
Outline of talk
The scope of inference issue! It's a little murky sometimes.
How scope depends on proper variance component (VC) estimation.
(Generalized) linear mixed models: the implications of good/poor VC estimates on the proper scope of inference.
Bayesian methods.
Beyond the generalized linear mixed model: hierarchical extensions provide a potentially richer class of models, and a calibration of the scope of inference that may best match the intended scope.
A ridiculously obvious example
Treatment A vs. B; two mice per treatment. Suppose you weigh each mouse 3 times:

Weighing    A1       B1       A2       B2
1           20.1     21.2     22.1     21.5
2           20.0     21.3     22.2     21.5
3           20.1     21.3     22.2     21.5
Mean        20.067   21.267   22.167   21.5

How many experimental (biological) replicates? Duh, Rob... it's n = 2.
Are the subsamples (technical replicates) useful? Oh yeah, well sure... it helps control measurement error.
Rethinking experimental replication
(Based on a true story!)
Suppose you have one randomly selected (representative) litter with 12 piglets; assign 4 piglets at random to each of 3 treatments (A), (B), and (C).
[Diagram: 12 piglets from one litter, 4 each assigned to A, B, and C.]
Is there any biological replication??? Well, actually no: n = 1.
4 piglets/trt is better than 2 piglets/trt, but you're only controlling for measurement error or subsampling with one litter.
Ok, well let's now replicate.
Have three randomly selected (representative) litters with 6 pigs each; assign 2 pigs at random to each of 3 treatments (A), (B), and (C) within each litter: 6 pigs per treatment.
[Diagram: three litters of 6 pigs, each litter containing 2 A's, 2 B's, and 2 C's.]
How many experimental replicates per trt?
Scope of inference
Well, it might depend on your scope of inference.
It's n = 6 (pigs) if the intended scope of inference is just those three litters (narrow scope).
It's n = 3 (litters) if the intended scope of inference is the population of litters from which the three litters are a random sample (broad scope).
Analysis better respects the experimental design and intended scope.
McLean, R.A., W.L. Sanders, and W.W. Stroup. 1991. A unified approach to mixed linear models. American Statistician 45: 54-64.
How do we properly decipher scope of inference?
Hierarchical statistical modeling; specifically, (generalized) linear mixed models.
Proper distinction of, say, litters as fixed versus random effects.
Previous (RCBD with subsampling) example:
Litter effects as fixed: narrow scope (sometimes desired!).
Litter effects as random: broad scope; specify Litter*Treatment as the experimental error -> determines n.
Proper mixed model analysis helps delineate true (experimental) from pseudo (subsampling) replication.
Scope of inference
The focus of much agricultural research is ordinary replication (Johnson, 2006; Crop Science): all replication conducted at a single research farm/station.
If the inference scope is for all farms for which the study farm is representative (i.e., farms are random effects), then a single-farm study has no replication in principle: treatment x herd serves as the experimental error term. Treatment inferences may be influenced by management nuances.
A similar criticism could be levied against university research. Hence, the continued need for multi-state research projects.
Meta scope of inference
Treatments (A, B, and C) cross-classified with 6 random farms.
Specify farm as fixed: n = 12 cows per trt.
Specify farm as random: n = 6 trt*farm replicates per group.
Shouldn't treatment effects be expected to be consistent across farms?.....
You can never tell even in the best of cases.
THE CURSE OF ENVIRONMENTAL STANDARDIZATION: "environmental standardization is a cause rather than a cure for poor reproducibility of experimental outcomes."
We shouldn't aspire for environmental standardization in agricultural research, for the sake of agricultural sustainability & organismic (plant and animal) plasticity. We just can't!
2 billion MORE people to feed by 2050, on less land!
"Management systems & environments are changing more rapidly than animal populations can adapt to such changes through natural selection" (Hohenboken et al., 2005), e.g.:
Energy policy (corn distillers grain)
More intensive management (larger farms)
Climate change
What are the implications for agricultural statisticians? Even greater importance in terms of using reliable inference procedures AND careful calibration of the scope of inference.
Desired calibration of mixed model analyses requires reliable inference procedures
The broader the scope, the greater the importance.
Under classical assumptions (CA), inference on treatment effects depends on the reliability of fixed effects and variance component inference.
Linear mixed models (LMM) under CA: no real issues; we've already got great software (e.g., PROC MIXED).
E-GLS based on REML works reasonably well. ANOVA (METHOD=TYPE3) for balanced designs might be even better (Littell et al., 2006).
A common tool for many agricultural statistical consulting centers.
Analysis of non-normal data (i.e., binary/binomial/count data): maybe a different story.
Fixed effects models (no design structure): generalized linear model (GLM) inference is based on Wald tests / likelihood ratio tests, with nice asymptotic/large-sample properties.
Mixed effects models (treatment and design structure): more at stake with generalized linear mixed models (GLMM). Asymptotic inference on fixed effects is conditioned upon asymptotic inference on variance components (VC).
From Rod Little's 2005 ASA Presidential address: does asymptotic behavior really depend on just n?
What is the impact of design complexity and the number of fixed effects / random effects factors/levels relative to n? Murky sub-asymptotial forests... how many more to reach the promised land of asymptotia?
Status of VC inference in GLMM
Default method in SAS PROC GLIMMIX: RSPL (a PQL-based method). PQL has been discouraged (McCulloch and Searle, 2002), especially with binary data: generally downward bias!
Transformations of count data followed by LMM analyses are sometimes even advocated (Young et al., 1998).
METHOD=LAPLACE and METHOD=QUAD might be better suited. Aggh... but ML-like rather than REML-like. Also, can't use QUAD for all model specifications.
How big is this issue? PQL (RSPL) inference (Bolker et al., 2009): as a rule, works poorly for Poisson data when the mean < 5 or for binomial data when y and n-y are both < 5. Yet 95% of analyses of binary responses (n=205), 92% of Poisson responses with means < 5 (n=48), and 89% of binomial responses with y, n-y < 5 used PQL anyway.
Split Plot -> Animal Science Example
[Diagram: three pens of 4 animals each under Diet 1 and three pens of 4 animals each under Diet 2.]
PEN serves a dual role: experimental unit for Diet; block for Drug.
Animal is the experimental unit for drug. 4 animals per pen; 1 of 4 drugs assigned to one animal within each pen. Pens numbered within diet.
Split Plot -> Plant Science Example
[Diagram: three fields of 4 plots each under Irrigation level 1 and three fields of 4 plots each under Irrigation level 2.]
Field serves a dual role: experimental unit for irrigation level; block for variety.
Plot is the experimental unit for variety. 4 plots per field; 1 of 4 corn varieties assigned to one plot within each field. Fields numbered within irrigation level.
Split plot ANOVA (LMM)
So inference on Treatment A (whole-plot factor) effects should be more sensitive to VC inference than Treatment B (sub-plot factor) effects.
Since σ²e is constrained for binary data in GLMM (logit or probit link), it should be even less of an issue for Treatment B there... right?
A simulation study: simulate data from a split plot design.
A: whole-plot factor, a = 3 levels. B: sub-plot factor, b = 3 levels.
[Diagram: for each level of A (A1, A2, A3), n = 3 whole plots, each containing sub-plots B1, B2, B3 in randomized order.]
n = number of whole plots (WP) per whole-plot factor level; n = 3 in the figure.
Note: if data are binary at the sub-plot level, they are binomial at the whole-plot level.
Simulation study details
Let's simulate normal (l) and binary (y) data: σ²e = 1.00 (let's assume known), σ²wp = 0.5.
Convert normal to binary data as y = I(l > 0.5).
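A minimal stdlib-only sketch of this data-generating step (variable names, the n = 8 whole plots per A level, and the seed are my own choices, not from the talk):

```python
import math
import random

random.seed(42)

# True cell means mu_ij = a_i + b_j for the 3 x 3 split plot simulated above
a_eff = {"A1": -0.5, "A2": 0.0, "A3": 0.5}
b_eff = {"B1": -0.5, "B2": 0.0, "B3": 0.5}
s2_wp, s2_e = 0.5, 1.0   # whole-plot and residual variances
n_wp = 8                 # whole plots per A level

records = []
for a_lab, a in a_eff.items():
    for rep in range(n_wp):
        wp = random.gauss(0.0, math.sqrt(s2_wp))   # shared whole-plot error
        for b_lab, b in b_eff.items():
            l = a + b + wp + random.gauss(0.0, math.sqrt(s2_e))  # latent normal
            y = int(l > 0.5)                       # binary conversion
            records.append((a_lab, b_lab, rep, l, y))
```

Each whole plot contributes one shared draw of the whole-plot error to its three sub-plot records, which is exactly what makes the binary data binomial at the whole-plot level.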
Note that with binary data, σ²e = 1.00 is not identifiable in a probit link GLMM (likewise for the logit link). Whole-plot factor: binomial; sub-plot factor: binary.
Let's compare standard errors of differences (A1 vs. A2, B1 vs. B2) as functions of σ²wp.
μij:
        A1      A2      A3
B1     -1.0    -0.5     0.0
B2     -0.5     0.0     0.5
B3      0.0     0.5     1.0

Prob(l > 0.5 | μij):
        A1      A2      A3
B1     0.067   0.158   0.308
B2     0.158   0.308   0.500
B3     0.308   0.500   0.692

SED of differences (A1 vs A2; B1 vs B2) for conventional LMM analysis of Gaussian data (n=8)
[Figure: standard errors of A1 vs A2 and B1 vs B2 as functions of σ²wp.]
No surprises here: whole-plot factor inferences are sensitive to σ²wp; sub-plot factor inferences are insensitive to σ²wp.
SED of differences (A1 vs A2; B1 vs B2) for conventional (asymptotic) GLMM analysis of binary data (n=8)
[Figure: standard errors of A1 vs A2 and B1 vs B2 as functions of σ²wp.]
Whole-plot factor inferences are sensitive to σ²wp. But so are sub-plot factor inferences (albeit less so)!
Misspecification of VC may have stronger implications for GLMM than for LMM?
Implications
If you underestimate σ²wp, then you understate standard errors and inflate Type I errors in conventional GLMM inference on marginal mean comparisons involving both whole-plot AND sub-plot factors! The obvious opposite implications hold whenever σ²wp is overestimated.
What kind of performance on VC estimation do we get with METHOD = RSPL (PQL), LAPLACE, or QUAD?
And what about Bayesian (MCMC) methods? Should you use the Bayesian posterior mean? median? Others?
Back to the simulation study
Consider the same 3 x 3 split plot under two different scenarios: n = 16 whole plots per A level and n = 4 whole plots per A level; 20 replicated datasets for each comparison.
Compare LAPLACE, QUAD, and RSPL with Bayesian estimates (posterior mean or median; others?) of σ²wp.
Prior on σ²wp: [prior specification not shown]; the prior variance is not defined for ν <= 4.
Scatterplot of VC estimates from 20 replicated datasets, n = 16 whole plots per A level: everything lines up pretty well!
Boxplots of VC estimates from 20 replicated datasets, n = 16 whole plots per A level.
Proportion of reps with convergence: 19/20, 16/20, 17/20.
RSPL biased downwards (conventional wisdom).
Scatterplot of VC estimates from 20 replicated datasets, n = 4 whole plots per A level: much less agreement between methods.
Boxplots of VC estimates from 20 replicated datasets, n = 4 whole plots per A level.
Proportion of reps with convergence: 16/20, 7/20, 0/20. Influenced by the prior?
Are Bayesian point estimators better/worse than other GLMM VC estimators? No clear answers; I could have tried different non-informative priors and actually gotten rather different point estimates of VC for n = 4. Implications then for Bayesian inference on fixed treatment effects?
Embarrassment of riches: inference is more than just point estimates; it involves entire posterior densities! Bayesian inferences on fixed treatment effects average out (integrate over) the uncertainty on VC.
n = 4 might be so badly underpowered that the point is moot for this simulation study... but recall Bolker's review!!!
Any published formal comparisons between GLS/REML/EB(M/PQL) and MCMC for GLMM? Check Browne and Draper (2006).
Normal data (LMM): generally, inferences based on GLS/REML and MCMC are sufficiently close. Since GLS/REML is faster, it is the method of choice under classical assumptions.
Non-normal data (GLMM): quasi-likelihood based methods are particularly problematic in the bias of point estimates and the interval coverage of variance components. Some fixed effects are poorly estimated too! Bayesian methods with certain diffuse priors are well calibrated for both properties for all parameters. Comparisons with Laplace not done yet (Masters project, anyone?).
Browne, W.J. and Draper, D. 2006. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis 1: 473-514.
Why do (some) animal breeders do Bayesian analysis?
Consider the linear mixed model (Henderson et al., 1959; Biometrics): Y = Xb + Zu + e; e ~ N(0, Iσ²e), u ~ N(0, Aσ²u), with Y being n x 1 and b being p x 1 (p > n); i.e., more animals to genetically evaluate than have data: animal models.
Somewhat pathological models, but mixed model inference is viable (Tempelman and Rosa, 2004): borrowing of information. REML seems to work just fine for Gaussian data. Put Bayes on the shelf...
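For the Gaussian case, the machinery behind this model is Henderson's mixed model equations, which jointly yield the BLUE of b and the (shrunken) BLUP of u. A toy, stdlib-only sketch; the records, design matrices, and λ = σ²e/σ²u are invented, and A = I (no pedigree) purely to keep it self-contained:

```python
# Toy Henderson mixed model equations (MME):
#   [X'X      X'Z     ] [b]   [X'y]
#   [Z'X  Z'Z + A^-1*lam] [u] = [Z'y],   lam = sigma2_e / sigma2_u
y = [3.0, 2.0, 4.0, 3.5]                           # 4 records
X = [[1, 0], [1, 0], [0, 1], [0, 1]]               # 2 fixed-effect levels
Z = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]   # 3 random (animal) effects
lam = 2.0                                          # assumed known here

def t(M):
    return [list(r) for r in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [xr + zr for xr, zr in zip(X, Z)]              # [X Z]
C = matmul(t(W), W)                                # MME coefficient matrix
for i in range(2, 5):                              # add lam * A^-1 = lam * I
    C[i][i] += lam
rhs = [row[0] for row in matmul(t(W), [[v] for v in y])]

def solve(A, b):                                   # Gaussian elimination
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

sol = solve(C, rhs)
b_hat, u_hat = sol[:2], sol[2:]   # BLUE of b; BLUP of u, shrunk toward 0 by lam
```

In real animal models A is the (huge) numerator relationship matrix and u can be far longer than y, which is exactly where the borrowing of information comes from.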
Greatest interest is in u, σ²u, and σ²e.
For GLMM in animal breeding, it's hard not to be Bayesian: binary or ordinal categorical data; probit link animal models (Tempelman and Rosa, 2004). PQL methods are completely unreliable in animal models; restricted Laplace is a little better but still biased (Tempelman, 1998; Journal of Dairy Science). Fully Bayesian inference using MCMC is the most viable.
Our models are becoming increasingly pathological! Tens/hundreds of thousands of genetic markers (see later) on each animal for both normal and non-normal traits. In that case, Bayesian methods become increasingly important, even for normal data; i.e., asymptotic inference issues due to increasing p for the same n.
Tempelman, R.J. and G.J.M. Rosa (2004). Empirical Bayes approaches to mixed model inference in quantitative genetics. In: Genetic Analysis of Complex Traits Using SAS (A.M. Saxton, ed.). Cary, NC, SAS Institute Inc.: 149-176.
Where is the greatest need for Bayesian modeling?
Multi-stage hierarchical models: when the classical distributional assumptions do not fit, e.g., e is NOT ~ N(0, Rσ²e) or u is NOT ~ N(0, Aσ²u).
Examples:
Heterogeneous variances and covariances across environments (a scope of inference issue?)
Different distributional forms (e.g., heavy-tailed or mixtures for residual/random effects)
High-dimensional variable selection models (animal genomics)
Heteroskedastic error (Kizilkaya and Tempelman, 2005)
Given: the residual has a certain heteroskedastic specification; [specification not shown] determines the nature of the heterogeneous residual variances.
Could do something similar for GLMM (with overdispersion: binomial, Poisson, or ordinal categorical data with > 3 categories).
Mixed Model for Heterogeneous Variances
Suppose the residual variance is modeled multiplicatively as σ²e * gk * vl, with:
σ²e a fixed intercept residual variance;
gk > 0 the kth fixed scaling effect (e.g., sex);
vl > 0 the lth random scaling effect (e.g., herd), with vl ~ IG(αv, αv - 1), so that E(vl) = 1 and Var(vl) = 1/(αv - 2).
Adjusting (estimating) αv calibrates the scope of inference: high αv, less heterogeneity; low αv, higher heterogeneity.
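The claimed moments of vl can be checked by simulation: if G ~ Gamma(αv, 1), then (αv - 1)/G is an IG(αv, αv - 1) draw. A stdlib-only sketch (the sampler wrapper and the αv grid are my own choices):

```python
import random

random.seed(1)

def draw_v(alpha_v):
    # v ~ Inverse-Gamma(alpha_v, alpha_v - 1): scale divided by a Gamma(alpha_v, 1) draw
    g = random.gammavariate(alpha_v, 1.0)
    return (alpha_v - 1.0) / g

for alpha_v in (5.0, 20.0, 100.0):
    vs = [draw_v(alpha_v) for _ in range(100_000)]
    mean = sum(vs) / len(vs)
    var = sum((v - mean) ** 2 for v in vs) / len(vs)
    # E(v) = 1 for every alpha_v; Var(v) = 1/(alpha_v - 2) shrinks as alpha_v
    # grows, i.e., high alpha_v means herds share nearly one residual variance.
    print(alpha_v, round(mean, 3), round(var, 3))
```

This is the sense in which αv acts as a calibration knob: the multiplicative herd effects all hover around 1, and αv controls how far they are allowed to stray.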
Birthweights in Italian Piedmontese cattle
[Figure: 95% credible intervals of residual variances for birthweights for each of 66 random herds.]
Also fitted fixed effects of calf sex.

Parameter   Post. Mean (Std)   95% Cred. Int.
            14.44 (1.03)       [12.63; 16.70]
            10.19 (0.73)       [8.89; 11.77]
            4.26 (0.53)        [3.29; 5.36]
            0.60 (0.09)        [0.46; 0.82]

The lower the [parameter not shown], the greater the shrinkage in estimated residual variances across herds: a calibrated pooling of error degrees of freedom.
Heterogeneous bivariate G-side and R-side inferences! Bello et al. (2010, 2012)
Investigated herd-level and cow-level relationships between 305-day milk production and calving interval (CI) as a function of various factors: random (herd) effects and residual (cow) effects.
It is well established that joint modeling of correlated traits provides efficiencies, especially with GLMM!
P.S. Nora has also done this for bivariate Gaussian-binary analyses too; see Bello et al. (2012b) in Biometrical Journal.
Herd-Specific and Cow-Specific (Co)variances
[Herd-k and cow-j (co)variance specifications not shown.]
Model each of these (co)variance terms as functions of fixed and random effects (in addition to the classical b and u)!
Random effect variability in RESIDUAL associations between traits across herds
[Figure: increase in # of days of CI per 100 kg of herd milk yield on a 0.0-1.0 scale, with 0.16 and 0.86 marked as the extreme herd-years.]
DIC(M0) - DIC(M1) = 243.
Expected range between extreme herd-years: 0.7 d / 100 kg (Ott and Longnecker, 2001).
So... is research irreproducibility sometimes heterogeneity across studies? and/or a failure to distinguish or calibrate between narrow versus broad scope of inference?
Recall treatment*station (or study) as the error term for treatment in a meta-replicated study. But heterogeneity might exist at other levels as well: heterogeneous residual and random effects (co)variances across farms; even overdispersion???
So inference may require proper calibration at various levels.
Discrete (binary/count) data: recall the problem with VC inference in GLMM? We actually may need to go even deeper than that!
Rethinking that perfect story
Should model heterogeneity (of means, variances, covariances) across studies/farms/times, etc. explicitly with multi-stage models. Estimates of the corresponding hyperparameters will indicate how clean or messy the story really is.
Estimate low heterogeneity? HEAVY SHRINKAGE: broad scope and narrow scope inference match up more closely.
Estimate high heterogeneity? LIGHT SHRINKAGE: broad scope inference might be almost pointless; inference should be better calibrated.
Shrinkage is a good thing (Allison et al., 2006)! Neither too broad nor too narrow... but calibrated. Borrow information across studies/environments.
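A minimal sketch of this calibration idea (all numbers invented for illustration): each study-specific estimate is pulled toward the meta-estimate by a weight that depends on the estimated between-study heterogeneity variance τ² relative to the within-study sampling variance:

```python
# Toy empirical-Bayes style shrinkage: weight tau2 / (tau2 + se2) is the
# fraction of a study's own estimate that survives pooling.
def shrink(effects, tau2, se2):
    grand = sum(effects) / len(effects)   # meta-estimate (broad scope)
    w = tau2 / (tau2 + se2)               # 0 = full pooling, 1 = no pooling
    return [grand + w * (e - grand) for e in effects]

study_effects = [0.9, 1.4, 0.6, 1.1]      # study-specific estimates (invented)
se2 = 0.04                                # within-study sampling variance

low_het = shrink(study_effects, tau2=0.001, se2=se2)   # heavy shrinkage
high_het = shrink(study_effects, tau2=1.0, se2=se2)    # light shrinkage
```

With low estimated heterogeneity, every study is reported close to the meta-estimate (broad and narrow scope agree); with high estimated heterogeneity, each study keeps most of its own estimate and a single broad-scope summary says little.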
[Figure: for parameters θ1 and θ2, study-specific estimates versus the meta-estimate under high-heterogeneity calibration (light shrinkage) and low-heterogeneity calibration (heavy shrinkage); broad-scope (K'b) versus narrow-scope (K'b + M'u) targets.]
Mixture (including point mass on 0) priors could be considered as well.
What would it take to model informative heterogeneity? Useful estimates require a moderate to large number of environments!
Revisit the utility/rigor of public data repositories (e.g., through journals): data AND source code should be provided (Peng et al., 2011) for ordinary replication studies. Consider the Biostatistics journal reproducibility review standard. Worried about somebody else's prior, or even model? You could then reassess for yourself.
Concluding comments
Mixed model inference continues to have the primary role for inference in agriculture, genetics, and other fields.
Multi-stage hierarchical Bayesian extensions offer shrinkage-based inference that may better calibrate scope for ordinary replication (K'b + M'u), yet provide reliable broad-scope inference (K'b).
It might be useful to retrospectively identify covariates contributing to heterogeneity to facilitate further shrinkage and hence even better agreement between broad and narrow scope inference.
Let's be careful about inferences that condition on other estimates as if they were true values (e.g., empirical Bayes, conventional GLMM). Otherwise we may be overselling the precision of our inferences, particularly with ordinary replication!
Strong implications for the technology transfer mission in agriculture.

Split plot ANOVA (LMM) table (n whole plots per level of A):

Source                   df             SS        MS        EMS
A                        a-1            SSA       MSA
Whole Plot Error WP(A)   a(n-1)         SSWP(A)   MSWP(A)
B                        b-1            SSB       MSB
A*B                      (a-1)(b-1)     SSAB      MSAB
Subplot Error            a(n-1)(b-1)    SSE       MSE
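As a quick consistency check on the split plot ANOVA degrees of freedom above (with n whole plots per level of A, a standard bookkeeping identity is that the df sum to abn - 1):

```python
def split_plot_df(a, b, n):
    # n = number of whole plots per level of A, as on the simulation slides
    return {
        "A": a - 1,
        "Whole Plot Error WP(A)": a * (n - 1),
        "B": b - 1,
        "A*B": (a - 1) * (b - 1),
        "Subplot Error": a * (n - 1) * (b - 1),
    }

df = split_plot_df(a=3, b=3, n=8)   # the simulated 3 x 3 split plot, n = 8
# Sanity check: entries must sum to the total corrected df, a*b*n - 1
assert sum(df.values()) == 3 * 3 * 8 - 1
```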