
Stat Comput (2012) 22:597–613, DOI 10.1007/s11222-011-9254-z

On a hybrid data cloning method and its application in generalized linear mixed models

Hossein Baghishani · Håvard Rue · Mohsen Mohammadzadeh

Received: 9 May 2010 / Accepted: 27 April 2011 / Published online: 12 May 2011
© Springer Science+Business Media, LLC 2011

Abstract The data cloning method is a new computational tool for computing maximum likelihood estimates in complex statistical models such as mixed models. This method is synthesized with integrated nested Laplace approximation to compute maximum likelihood estimates efficiently via a fast implementation in generalized linear mixed models. The asymptotic behavior of the hybrid data cloning method is discussed. The performance of the proposed method is illustrated through a simulation study and real examples. It is shown that the proposed method performs well and bears out the theory. Supplemental materials for this article are available online.

Keywords Asymptotic normality · Data cloning · Generalized linear mixed models · Integrated nested Laplace approximation

Electronic supplementary material The online version of this article (doi:10.1007/s11222-011-9254-z) contains supplementary material, which is available to authorized users.

H. Baghishani (✉) · M. Mohammadzadeh
Department of Statistics, Tarbiat Modares University, Tehran, Iran
e-mail: [email protected]

M. Mohammadzadeh
e-mail: [email protected]

H. Rue
The Norwegian University of Science and Technology, Trondheim, Norway
e-mail: [email protected]

1 Introduction

Non-Gaussian repeated measurements such as longitudinal and clustered data are common in many sciences such as biology, ecology, epidemiology and medicine. The Generalized Linear Mixed Models (GLMMs) are a flexible and extensively used class of models for modeling these types of data. As an extension of Generalized Linear Models (GLMs) (McCullagh and Nelder 1989), a GLMM assumes that the response variable follows a distribution from the exponential family and is conditionally independent given latent variables, while the latent variables are modeled by random effects that are typically Gaussian (Breslow and Clayton 1993).

Statistical inference in such models has been the subject of a great deal of research over the past two decades. Both frequentist and Bayesian methods have been developed for inference in GLMMs (McCulloch 1997). Computational difficulties have made likelihood-based inference for GLMMs cumbersome: the likelihood function required for such inference involves an intractable, high-dimensional integral. Owing to advances in Markov Chain Monte Carlo (MCMC) sampling methods, a rather commonly used approach for inference in these models is nowadays based on the Bayesian paradigm. However, Bayesian inference depends on the choice of the prior distributions, and the specification of prior distributions is not straightforward, in particular for variance components (Fong et al. 2010). Moreover, MCMC algorithms applied to these models come with a wide range of problems in terms of convergence and computational time.

A suitable recent alternative for carrying out statistical inference in a GLMM is the Data Cloning (DC) method, first introduced by Lele et al. (2007) in ecological studies. This method, as a computational trick, uses an MCMC algorithm to sample from an artificially constructed distribution, named the DC-based distribution, in order to compute Maximum Likelihood Estimates (MLEs) and their variance estimates. The trick is to generate samples


from a DC-based distribution constructed by duplicating the original data set enough times, say k, such that the sample mean as well as the scaled sample variance converge to the MLE and its variance estimate.
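This convergence can be made transparent in a toy conjugate model where the k-clone posterior is available in closed form. The following sketch is our own illustration, not code from the paper; the function name `dc_posterior` is invented for the example.

```python
import math

# Toy check (not from the paper): y_1..y_n ~ N(theta, sigma^2) with a
# N(0, tau^2) prior on theta.  Cloning the data k times multiplies the
# log-likelihood by k, so the k-clone posterior is still Gaussian with
#   mean = (k*n*ybar/sigma^2) / (k*n/sigma^2 + 1/tau^2)
#   var  = 1 / (k*n/sigma^2 + 1/tau^2)
def dc_posterior(ybar, n, sigma2, tau2, k):
    prec = k * n / sigma2 + 1.0 / tau2
    return (k * n * ybar / sigma2) / prec, 1.0 / prec

ybar, n, sigma2, tau2 = 2.5, 10, 4.0, 1.0
for k in (1, 10, 100, 1000):
    mean, var = dc_posterior(ybar, n, sigma2, tau2, k)
    # mean -> MLE (ybar); k * var -> Var(MLE) = sigma^2 / n
    print(k, round(mean, 4), round(k * var, 4))
```

As k grows, the k-clone posterior mean approaches the MLE ȳ = 2.5 and k times the posterior variance approaches σ²/n = 0.4, which is exactly the behavior the DC method exploits.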

Computation, however, is an important issue, since the method applies intensive MCMC simulations to k clones of the data. The issue becomes more acute when an increasing sequence of k values is required. As Christian P. Robert has discussed on his personal weblog (http://xianblog.wordpress.com), if k is large enough, the MCMC algorithm will face difficulties in exploring the parameter space, and hence in discovering the global modes, while if k is too small, there is no certainty that the algorithm will identify the right mode. The practical implementation thus requires using an increasing sequence of k's, which is very demanding in computing time, especially when k is large, and consequently somewhat diminishes the appeal of the method.

The computational challenges of the DC method motivate us to synthesize the method with Integrated Nested Laplace Approximation (INLA), introduced by Rue and Martino (2007) and Rue et al. (2009), to compute the MLE efficiently; this is the main purpose of this paper. It is expected that synthesizing the DC method and INLA can reduce the computational effort dramatically. The asymptotic behavior of the proposed method for a GLMM is explored as well.

The paper is organized into two parts. The first part illustrates computational aspects of the proposed hybrid method and is accessible to most readers, including those less interested in theoretical aspects (Sects. 2–5). In this part, we present our main results, though in an imprecise form. Formal statements of our results are given and proved in the second part (Appendices A–C).

In the next section, we describe the model and the INLA methodology. Section 3 describes the new Hybrid DC (HDC) method. The performance of the method is explored through simulation studies and real data examples in Sects. 4 and 5. Finally, Sect. 6 concludes with a brief discussion. All technical details and proofs of our results are relegated to Appendices A–C, and more technical details are available in the online supplemental file.

2 Model and INLA

In this section, we introduce our basic model and the INLA methodology.

2.1 The model

Generalized linear mixed models are flexible models for modeling non-Gaussian repeated measurements. On the basis of GLMs, a GLMM assumes that the responses are independent conditional on the random effects and are distributed according to a member of the exponential family.

We consider clustered data in which repeated measures of a response variable are taken on a random sample of m clusters. Consider the response vectors y_i = (y_{i1}, \ldots, y_{in_i})^T, i = 1, \ldots, m. Let n = \sum_{i=1}^{m} n_i be the total sample size. Conditional on the r \times 1 vector of unobservable cluster-specific random effects u_i = (u_{i1}, \ldots, u_{ir})^T, these data are distributed according to a member of the exponential family:

f(y_{ij} | u_i, \beta) = \exp\{ y_{ij}(x_{ij}^T \beta + v_{ij}^T u_i) - a(x_{ij}^T \beta + v_{ij}^T u_i) + c(y_{ij}) \},

for i = 1, \ldots, m; j = 1, \ldots, n_i, in which x_{ij} and v_{ij} are the corresponding p- and r-dimensional covariate vectors associated with the fixed effects and the random effects, respectively, \beta is a p-dimensional vector of unknown regression parameters, and a(\cdot) and c(\cdot) are specific functions. Here \tau_{ij} = x_{ij}^T \beta + v_{ij}^T u_i is the canonical parameter. Let \mu_{ij} = E[Y_{ij} | \beta, u_i] = a'(\tau_{ij}) with g(\mu_{ij}) = \eta_{ij} = x_{ij}^T \beta + v_{ij}^T u_i, where g(\cdot) is a monotonic link function. Furthermore, assume u_i comes from a Gaussian distribution, u_i | Q \sim N(0, Q^{-1}), in which the precision matrix Q = Q(\theta) depends on the parameters \theta. Let \theta denote the d \times 1 vector of the variance components, for which the prior \pi(\theta) is assigned. Let also \psi = (\beta, u) denote the s \times 1 vector of parameters assigned Gaussian priors. Moreover, let y = (y_1, \ldots, y_m), \theta \in \Theta, an open subset of \Re^d, and \psi \in \Psi, an open subset of \Re^s. The joint posterior density is defined by

\pi(\psi, \theta | y) \propto \pi(\psi | y, \theta) \pi(\theta | y)
                      \propto \pi(\theta) \pi(\beta) |Q(\theta)|^{1/2} \exp\Big\{ -\frac{1}{2} u^T Q(\theta) u + \sum_{i=1}^{m} \log f(y_i | \psi) \Big\}.    (1)
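For concreteness, the unnormalized log version of the bracketed part of (1) can be evaluated directly in the Poisson random-intercept special case, where Q(\theta) = \sigma^{-2} I. The sketch below is our own, not the authors' code; the function name `log_joint` and the toy data are invented, and the prior terms \pi(\theta)\pi(\beta) and |Q(\theta)|^{1/2} are deliberately omitted.

```python
import math

# Sketch (our own, not the authors' code): the unnormalized log of the
# exponential term in (1) for a Poisson random-intercept model, where
# Q(theta) = (1/sigma^2) * I, so -0.5*u'Qu = -0.5*sum(u_i^2)/sigma^2,
# and log f(y_ij | psi) = y_ij*eta_ij - exp(eta_ij) - log(y_ij!).
def log_joint(y, x, beta, u, sigma2):
    # y[i][j] are counts; x[i][j] is a covariate vector; one random
    # intercept u[i] per cluster
    lp = -0.5 * sum(ui * ui for ui in u) / sigma2  # Gaussian term of (1)
    for i, (yi, xi) in enumerate(zip(y, x)):
        for yij, xij in zip(yi, xi):
            eta = sum(b * v for b, v in zip(beta, xij)) + u[i]
            lp += yij * eta - math.exp(eta) - math.lgamma(yij + 1)
    return lp  # up to the prior terms pi(theta)pi(beta) and |Q|^{1/2}

# two clusters, two observations each, intercept-only design
y = [[1, 2], [0, 3]]
x = [[[1.0], [1.0]], [[1.0], [1.0]]]
print(log_joint(y, x, beta=[0.5], u=[0.1, -0.2], sigma2=0.5))
```

Maximizing such a function jointly in (\beta, u) is exactly the inner optimization that the Laplace-type approximations of the next section perform.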

2.2 Integrated nested Laplace approximation

Because of the usefulness and easy implementation of MCMC methods, a rather commonly used approach for inference in GLMMs is based on Bayesian methods and MCMC sampling. Considering (1), the main aim is to compute the posterior marginals \pi(\psi_l | y), l = 1, \ldots, s, and \pi(\theta_v | y), v = 1, \ldots, d. It is well known, however, that MCMC methods tend to exhibit poor performance when applied to such models (Rue et al. 2009).

INLA is a new tool for Bayesian inference on latent Gaussian models introduced by Rue et al. (2009). The method combines Laplace approximations and numerical integration in a very efficient manner. INLA substitutes


MCMC simulations with accurate, deterministic approximations to posterior marginal distributions. The quality of such approximations is high in most cases, such that even very long MCMC runs could not detect any error in them.

We can write

\pi(\psi_l | y) = \int \pi(\psi_l | \theta, y) \pi(\theta | y) \, d\theta, \quad l = 1, \ldots, s,
\pi(\theta_v | y) = \int \pi(\theta | y) \, d\theta_{-v}, \quad v = 1, \ldots, d,

where \theta_{-v} is equal to \theta with the vth element eliminated. The key feature of INLA is to use this form to construct nested approximations

\tilde{\pi}(\psi_l | y) = \int \tilde{\pi}(\psi_l | \theta, y) \tilde{\pi}(\theta | y) \, d\theta,
\tilde{\pi}(\theta_v | y) = \int \tilde{\pi}(\theta | y) \, d\theta_{-v},

where the Laplace approximation is applied to carry out the integrations required for evaluating \tilde{\pi}(\psi_l | \theta, y). We have to note that the Laplace approximation does depend on the reparameterization, but for these models it is natural to keep the parameterization unchanged. For more details, we refer the reader to Rue et al. (2009, p. 387). The approximate posterior marginals obtained from INLA can then be used to compute summary statistics of interest, such as posterior means, variances or quantiles.
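The Laplace step at the heart of these approximations can be illustrated on a one-dimensional toy posterior. This is our own example, not INLA's implementation: a Beta density is approximated by a Gaussian matched at its mode, with variance given by the negative inverse Hessian of the log-density; the function name `laplace_beta` is invented.

```python
import math

# Toy illustration (not INLA itself): Laplace-approximate a Beta(a, b)
# density by a Gaussian centred at the mode, with variance equal to the
# negative inverse second derivative of the log-density at the mode.
def laplace_beta(a, b):
    alpha, beta = a - 1.0, b - 1.0          # exponents of the log-density
    mode = alpha / (alpha + beta)           # argmax of alpha*log(t) + beta*log(1-t)
    hess = -alpha / mode**2 - beta / (1 - mode)**2
    return mode, math.sqrt(-1.0 / hess)     # Gaussian mean and sd

mode, sd = laplace_beta(30, 70)
exact_sd = math.sqrt(30 * 70 / (100.0**2 * 101))  # exact Beta(30, 70) sd
print(round(mode, 4), round(sd, 4), round(exact_sd, 4))
```

For Beta(30, 70) the Laplace sd is about 0.0461 against the exact 0.0456, illustrating why a Gaussian matched at the mode is accurate for well-concentrated, unimodal densities — the regime INLA targets.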

3 A hybrid data cloning method

In this section, we describe how the DC method can be combined with INLA. According to the theoretical results given in Appendices A–C, we need to establish the asymptotic normality of the HDC-based distribution to be able to use INLA within the DC method. For this purpose, we first establish the asymptotic normality of the approximate posterior as well as of the DC-based distribution. These two results enable us to establish the asymptotic normality of the HDC-based distribution. The formal statements and their proofs are provided in Appendices A–C.

The DC-based distribution is constructed by duplicating the original data set k times. In other words, we create a k-repeated cloned data set y^{(k)} = (y, \ldots, y), where the observed data vector is repeated k times. In this way, the covariates are cloned as well. Finally, k independent copies of the random effects, u, are generated from their Gaussian density, thus contributing to the cloned likelihood. Then, using the new cloned data, the DC-based distribution is constructed. Figure 1 shows schematically the steps of cloning data to construct the DC-based distribution.

Let \pi^{(k)}(\beta, \theta | y) \propto \pi^{(k)}(\beta | y, \theta) \pi^{(k)}(\theta | y) be the artificially constructed density, the DC-based density, from k identical and independent clones of the data and the prior distributions.

Fig. 1 Cloning the original data to construct the DC-based distribution

Although this distribution looks like a Bayesian posterior distribution, it is constructed from two functions which are not, in fact, a prior distribution and a likelihood. However, treating them as a prior and a likelihood can be mimicked virtually (Baghishani and Mohammadzadeh 2011). A simple explanation for why the data cloning method works is that, by cloning the data, the effect of the prior distribution is diminished and the DC-based estimators converge to the MLEs. Furthermore, by the Central Limit Theorem, the DC-based estimators are approximately normally distributed and their variances are scaled versions of the asymptotic variance estimates for the MLEs, i.e.

E^{(k)}(\beta, \theta | y) \longrightarrow (\hat{\beta}, \hat{\theta}),
Var^{(k)}(\beta, \theta | y) \longrightarrow k^{-1} \times Var(\hat{\beta}, \hat{\theta}),

as k \longrightarrow \infty, where (\hat{\beta}, \hat{\theta}) denotes the MLE.

In the following, we combine the DC method with INLA
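The two displayed limits can be verified numerically in a conjugate Beta-Binomial toy model, where the k-clone distribution is available in closed form. This check is our own, not from the paper; the function name `dc_beta_binomial` is invented.

```python
# Numerical check (our own) of the displayed limits, using a conjugate
# Beta-Binomial model: with y successes in n trials and a Beta(a, b) prior,
# k clones give a Beta(a + k*y, b + k*(n - y)) DC-based distribution.
def dc_beta_binomial(y, n, a, b, k):
    A, B = a + k * y, b + k * (n - y)
    mean = A / (A + B)
    var = A * B / ((A + B)**2 * (A + B + 1))
    return mean, var

y, n, a, b = 7, 20, 2.0, 2.0
mle = y / n                                  # 0.35
avar = mle * (1 - mle) / n                   # asymptotic Var of the MLE
for k in (1, 10, 100, 1000):
    mean, var = dc_beta_binomial(y, n, a, b, k)
    # mean -> MLE; k * var -> asymptotic variance of the MLE
    print(k, round(mean, 4), round(k * var, 5), round(avar, 5))
```

The stabilization of k·Var at the MLE's asymptotic variance is also the standard practical diagnostic for choosing k large enough in DC.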

in three steps. The marginal posterior of the parameters is given by

\pi(\beta, \theta | y) = L(\beta, \theta; y) \pi(\beta, \theta),

where L(\beta, \theta; y) is the marginal likelihood function. Then

\log \frac{\pi(\beta, \theta | y)}{\pi(\beta, \theta)} = \log L(\beta, \theta; y) = \ell_n(\beta, \theta).

For our first step, we denote the approximation of \pi(\beta, \theta | y) by \tilde{\pi}(\beta, \theta | y), which is obtained by using INLA. Therefore, we can write

\tilde{\pi}(\beta, \theta | y) \propto \exp\{\tilde{\ell}_n(\beta, \theta)\} \pi(\beta, \theta).

Result 1 The approximate posterior distribution obtained by using INLA, \tilde{\pi}(\beta, \theta | y), converges to a multivariate normal distribution as n \longrightarrow \infty.

The precise statement of Result 1 and its proof are provided in Appendix A. Result 1 states that the INLA-based estimators


in a GLMM are asymptotically normally distributed. However, for our purpose we need two additional results. Let

\log \frac{\pi^{(k)}(\beta, \theta | y)}{\pi(\beta, \theta)} = \log [L(\beta, \theta; y)]^k = \ell_n^{(k)}(\beta, \theta).

Then, the DC-based density is given by

\pi^{(k)}(\beta, \theta | y) = \exp\{\ell_n^{(k)}(\beta, \theta)\} \pi(\beta, \theta).

Result 2 The DC-based distribution, \pi^{(k)}(\beta, \theta | y), converges to a multivariate normal distribution as k \longrightarrow \infty.

The precise statement of Result 2 and its proof are provided in Appendix B. This result has recently been presented and proved by Lele et al. (2007) and Baghishani and Mohammadzadeh (2011). Combining Results 1 and 2, we can establish the asymptotic normality of the new HDC-based distribution. Let \tilde{\pi}^{(k)}(\beta, \theta | y) be the corresponding approximation of \pi^{(k)}(\beta, \theta | y) obtained by INLA. Then,

\tilde{\pi}^{(k)}(\beta, \theta | y) \propto \exp\{\tilde{\ell}_n^{(k)}(\beta, \theta)\} \pi(\beta, \theta).

Now we can state the following result, which is stated precisely and proved in Appendix C.

Result 3 The HDC-based distribution, \tilde{\pi}^{(k)}(\beta, \theta | y), converges to a multivariate normal distribution as k \longrightarrow \infty.

With Result 3 in hand, we can develop a new HDC algorithm to carry out likelihood-based inference in GLMMs with high accuracy and low computational cost. This claim is illustrated in the following two sections through simulation studies and real examples.

4 Simulation study

In this section, we present a simulation study designed to assess the performance of the HDC algorithm across a range of conditions that are realistic for clustered/longitudinal data. We also compare its performance to that of the Adaptive Gauss-Hermite Quadrature (AGHQ) method.

All analyses were conducted in R version 2.10.1 on a Windows workstation with two Intel 2.53 GHz processors. The R packages INLA (www.r-inla.org) and lme4 (www.r-project.org) were used for the HDC- and AGHQ-based analyses, respectively. The R script for reproducing all results in the paper can be downloaded from the Paper Section of the www.r-inla.org website.

The setting mimics a study with clustered count data where both continuous and dichotomous covariates influence the distribution of the response variable. In addition, there might be overall heterogeneity between clusters.

Namely, the data were generated according to the random intercept model:

\ln(\mu_{ij}) = \eta_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + u_i, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i,

where u_i | \sigma^2 \stackrel{iid}{\sim} N(0, \sigma^2) with \sigma = 0.75 and \beta = (1, 2, -3). The covariate x_{1ij} was binary, taking the value 1 with probability 0.6. The covariate x_{2ij} was generated according to the uniform distribution U(0, 1).
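This data-generating process can be sketched in a few lines. The snippet below is our own illustration (the paper's simulations were run in R); the function name `simulate` and the Poisson-by-inversion sampler are our own choices, the latter only because the Python standard library has no Poisson generator.

```python
import math
import random

# Sketch (our own) of the simulation design: clustered Poisson counts from
# the random-intercept model ln(mu_ij) = 1 + 2*x1 - 3*x2 + u_i.
def simulate(m, ni, beta=(1.0, 2.0, -3.0), sigma=0.75, seed=1):
    rng = random.Random(seed)
    data = []
    for i in range(m):
        u = rng.gauss(0.0, sigma)                 # cluster random intercept
        for _ in range(ni):
            x1 = 1 if rng.random() < 0.6 else 0   # binary covariate, P(1) = 0.6
            x2 = rng.random()                     # Uniform(0, 1) covariate
            mu = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2 + u)
            # draw a Poisson(mu) variate by CDF inversion
            y, p, cdf = 0, math.exp(-mu), math.exp(-mu)
            r = rng.random()
            while r > cdf:
                y += 1
                p *= mu / y
                cdf += p
            data.append((i, x1, x2, y))
    return data

data = simulate(m=20, ni=2)
print(len(data), data[0])
```

Each simulated data set like this one would then be cloned k times and fed to the estimation methods under comparison.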

We examined all combinations of three numbers of clones k (20, 40 and 80), three numbers of clusters m (20, 50 and 100) and two cluster sizes n_i (2 and 4), resulting in 18 conditions. For each condition, we simulated 100 data sets and estimated the model using the HDC and AGHQ methods. For each condition, the same 100 data sets were analyzed by the two methods to enable accurate comparisons. For the AGHQ method, 51 quadrature points were used to obtain the exact MLEs of the parameters.

To show that the proposed HDC-based inferences are invariant to the choice of the priors, we used three different prior distributions for the variance component: an informative log-gamma prior for \log(\sigma^{-2}), with mean equal to 0.5 and variance equal to 0.25; a non-informative log-uniform prior for \log(\sigma^{-2}); and a vague log-gamma prior, with mean equal to 60 and variance equal to 1200. We also used the default INLA Gaussian priors for the fixed effects \beta.
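If, as we read it, the stated means and variances refer to the implied Gamma(a, b) prior on the precision \sigma^{-2} (an assumption on our part; the text does not spell this out), the shape and rate follow from mean = a/b and variance = a/b². The helper name `gamma_shape_rate` is our own.

```python
# Helper (our own): convert a stated prior mean and variance into the
# shape/rate parameters of a Gamma(a, b) prior on the precision,
# assuming mean = a/b and variance = a/b^2.
def gamma_shape_rate(mean, var):
    b = mean / var
    return mean * b, b

print(gamma_shape_rate(0.5, 0.25))   # informative prior -> (1.0, 2.0)
print(gamma_shape_rate(60, 1200))    # vague prior -> (3.0, 0.05)
```

Under this reading, the informative prior corresponds to Ga(1, 2) and the vague prior to Ga(3, 0.05).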

Tables 1, 2 and 3 show the mean parameter estimates (Est.), their standard deviations (SD) and mean squared errors (MSE) over the 100 simulated data sets. Due to space limitations, only the results for m = 20, 100 and for the informative and non-informative priors are reported in Tables 1–3. The full results are given in the online supplemental file. In addition, because the results for the AGHQ method are identical for different values of k, they are not repeated in Tables 2 and 3. Here, HDC1 and HDC2 refer to the HDC method with informative and non-informative priors, respectively. According to the results, the regression parameters are estimated very accurately under the various conditions. Although the variance component \sigma is generally well estimated, it is slightly underestimated. Further, in most cases, as the number of clusters increases the estimate of the variance component gets closer to the true value. The results obtained using the HDC method with different priors are also quite close to those obtained using the AGHQ method, i.e. the exact MLEs.

Besides its good performance in obtaining the MLE efficiently, a main advantage of the HDC method is its ease of implementation. In our simulation study, we considered a random intercept model for which the AGHQ method could be implemented. However, for crossed designs the AGHQ method is impractical, e.g. using the R package lme4. For the


Table 1 Results from AGHQ and HDC estimations on 100 simulated data sets from the random intercept Poisson model with k = 20

m    n_i  Par.  True   AGHQ                   HDC1                   HDC2
                       Est.    SD     MSE     Est.    SD     MSE     Est.    SD     MSE
20   2    β0    1      0.967   0.411  0.168   0.962   0.412  0.169   0.967   0.411  0.168
          β1    2      1.987   0.364  0.132   1.988   0.365  0.132   1.987   0.364  0.132
          β2    −3     −2.936  0.532  0.284   −2.939  0.531  0.283   −2.936  0.531  0.284
          σ     0.75   0.691   0.177  0.035   0.708   0.168  0.030   0.688   0.179  0.036
     4    β0    1      1.026   0.249  0.062   1.024   0.250  0.062   1.026   0.249  0.062
          β1    2      1.999   0.195  0.038   2.000   0.195  0.038   2.000   0.195  0.038
          β2    −3     −3.007  0.254  0.064   −3.007  0.254  0.064   −3.007  0.254  0.064
          σ     0.75   0.702   0.129  0.019   0.710   0.126  0.017   0.699   0.129  0.019
100  2    β0    1      0.985   0.168  0.028   0.986   0.167  0.028   0.986   0.168  0.028
          β1    2      2.004   0.151  0.023   2.003   0.151  0.022   2.003   0.151  0.023
          β2    −3     −2.987  0.205  0.042   −2.986  0.205  0.042   −2.986  0.205  0.042
          σ     0.75   0.743   0.073  0.005   0.739   0.072  0.005   0.736   0.073  0.005
     4    β0    1      1.007   0.129  0.017   1.008   0.129  0.017   1.008   0.129  0.017
          β1    2      2.000   0.088  0.008   2.000   0.088  0.008   2.000   0.088  0.008
          β2    −3     −2.997  0.128  0.016   −2.997  0.128  0.016   −2.997  0.128  0.016
          σ     0.75   0.737   0.071  0.005   0.732   0.070  0.005   0.729   0.070  0.005

Table 2 Results from AGHQ and HDC estimations on 100 simulated data sets from the random intercept Poisson model with k = 40

m    n_i  Par.  True   HDC1                   HDC2
                       Est.    SD     MSE     Est.    SD     MSE
20   2    β0    1      0.965   0.411  0.169   0.967   0.411  0.168
          β1    2      1.987   0.365  0.132   1.987   0.364  0.132
          β2    −3     −2.937  0.531  0.284   −2.936  0.532  0.284
          σ     0.75   0.696   0.171  0.032   0.686   0.178  0.035
     4    β0    1      1.026   0.250  0.062   1.027   0.249  0.062
          β1    2      2.000   0.195  0.038   2.000   0.195  0.038
          β2    −3     −3.007  0.254  0.064   −3.007  0.254  0.064
          σ     0.75   0.703   0.126  0.018   0.697   0.128  0.019
100  2    β0    1      0.986   0.168  0.028   0.986   0.168  0.028
          β1    2      2.003   0.151  0.023   2.003   0.151  0.023
          β2    −3     −2.986  0.205  0.042   −2.986  0.205  0.042
          σ     0.75   0.737   0.073  0.005   0.736   0.073  0.005
     4    β0    1      1.008   0.129  0.017   1.008   0.129  0.017
          β1    2      2.000   0.088  0.008   2.000   0.088  0.008
          β2    −3     −2.997  0.128  0.016   −2.997  0.128  0.016
          σ     0.75   0.726   0.069  0.005   0.725   0.070  0.005

AGHQ method with 51 quadrature points, evaluating the deviance at a single parameter value would require 51^r individual GLM deviance evaluations, where r is the total number of random effects. Even evaluating the deviance of a GLMM once, at the starting values of the parameters, could take several days, and that is before the optimization with respect to the parameter values begins. By contrast, for a crossed random effects model the HDC method can be implemented easily. All computations required by the HDC method are efficiently performed by the R INLA package.


Table 3 Results from AGHQ and HDC estimations on 100 simulated data sets from the random intercept Poisson model with k = 80

m    n_i  Par.  True   HDC1                   HDC2
                       Est.    SD     MSE     Est.    SD     MSE
20   2    β0    1      0.967   0.412  0.169   0.968   0.411  0.168
          β1    2      1.987   0.365  0.132   1.987   0.364  0.132
          β2    −3     −2.936  0.531  0.284   −2.935  0.531  0.284
          σ     0.75   0.689   0.172  0.033   0.684   0.176  0.035
     4    β0    1      1.027   0.249  0.062   1.027   0.249  0.062
          β1    2      2.000   0.195  0.038   2.000   0.195  0.038
          β2    −3     −3.007  0.254  0.064   −3.007  0.254  0.064
          σ     0.75   0.699   0.127  0.019   0.696   0.128  0.019
100  2    β0    1      0.986   0.168  0.028   0.987   0.168  0.028
          β1    2      2.003   0.151  0.023   2.003   0.151  0.023
          β2    −3     −2.986  0.205  0.042   −2.986  0.205  0.042
          σ     0.75   0.735   0.074  0.006   0.734   0.074  0.006
     4    β0    1      1.007   0.128  0.017   1.008   0.129  0.017
          β1    2      2.000   0.088  0.007   2.000   0.088  0.008
          β2    −3     −2.997  0.128  0.015   −2.997  0.128  0.016
          σ     0.75   0.726   0.069  0.005   0.725   0.070  0.005

The computing time for the HDC method on a typical data set was 31 s for k = 100, while the usual DC estimates for the same data set and the same k were computed in 2026 s. For the DC method, we used the one-block MCMC sampler described in Chap. 4 of the book by Rue and Held (2005).

5 Real data examples

This section provides examples of applications of the proposed hybrid method. We consider three examples in which both nested (Sect. 5.1) and crossed (Sects. 5.2 and 5.3) random effects are introduced. These examples have been considered previously by Breslow and Clayton (1993) and Fong et al. (2010).

Fong et al. (2010) analyzed these data sets using the INLA method and gave a number of prescriptions for prior specification, especially for variance components. They also noticed that the specification of a prior distribution for variance components is sometimes not straightforward. The DC-based results, however, are invariant to the choice of the prior distributions.

5.1 Overdispersion

This example concerns data on the proportion of seeds that germinated on each of m = 21 plates arranged according to a 2 × 2 factorial design with respect to seed variety and type of root extract (Crowder 1978). The sampling model is Y_i | \beta, u_i \sim Bin(n_i, p_i) where, for plate i, y_i is the number of germinating seeds and n_i is the total number of seeds, for i = 1, \ldots, m. To account for the extraneous between-plate variability, Breslow and Clayton (1993) introduced plate-level random effects and then fitted the main effects and interaction models:

logit(p_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i,    (2)
logit(p_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + u_i,

where u_i | \sigma^2 \stackrel{iid}{\sim} N(0, \sigma^2), and x_{1i}, x_{2i} represent the seed variety and type of root extract for plate i, respectively.

To show that the DC-based estimators are invariant to the choice of the priors, we used three different sets of prior distributions. Following Fong et al. (2010), for the main effects model the first set includes N(0, 10) priors for the fixed effects and Ga(0.5, 0.0164) for \sigma^{-2}; the second set includes N(1, 100) priors for the fixed effects and a flat prior for the log-precision \log(\sigma^{-2}); and the third set includes N(-2, 1) and N(-1, 1) priors for the fixed effects and the vague prior Ga(0.01, 0.01) for \sigma^{-2}. For the interaction model, the first set includes N(1, 100) priors for the fixed effects and Ga(0.5, 0.0164) for \sigma^{-2}; the second set includes N(-1, 1), N(-2, 1) and N(-1, 1) priors for the fixed effects and a flat prior for \log(\sigma^{-2}); and the third set includes N(-2, 100), N(2, 100) and N(1, 100) priors for the fixed effects and Ga(0.01, 0.01) for \sigma^{-2}.

To implement the method using the R INLA package, we first prepared the cloned data by duplicating the original data. Notice that the number of plates for the cloned data, plate.k, is m × k = 21k. Let x.k1_i and x.k2_i denote the new cloned covariates. Let also r.k_i and n.k_i denote the proportion and the total number of seeds for i = 1, \ldots, 21k, respectively.
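The data-preparation step just described can be sketched as follows. This is our own illustration in Python (the paper does this in R); the function name `clone_data` and the dictionary layout are invented. The key point is that each of the k copies receives fresh plate labels, so plate.k runs over 1, ..., 21k.

```python
# Sketch (our own) of preparing the k-cloned seed data for inla(): stack k
# copies of the 21 plates and give each copy fresh plate labels so that
# plate.k runs over 1, ..., 21*k.
def clone_data(rows, k):
    # rows: list of dicts with keys r, n, x1, x2, plate (plates numbered 1..m)
    cloned = []
    m = len(rows)
    for c in range(k):
        for row in rows:
            cloned.append({"r.k": row["r"], "n.k": row["n"],
                           "x.k1": row["x1"], "x.k2": row["x2"],
                           "plate.k": row["plate"] + c * m})
    return cloned

# first two plates of the seed data, cloned k = 3 times
rows = [{"r": 10, "n": 39, "x1": 0, "x2": 0, "plate": 1},
        {"r": 23, "n": 62, "x1": 0, "x2": 0, "plate": 2}]
out = clone_data(rows, k=3)
print(len(out), out[-1]["plate.k"])   # 6 6
```

Relabelling the plates is what makes the k copies of the random effects independent in the cloned likelihood, as required by the DC construction.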

Table 4 shows how the cloned data set is provided. Fitting model (2), say, is done by calling the inla() function:

> formula = r.k ~ x.k1 + x.k2 + f(plate.k, model="iid", param=c(.5,.0164))
> result = inla(formula, data=clone.data, family="binomial", Ntrials=n.k)

The option param = c(.5, .0164) specifies the parameters of the gamma prior for \sigma^{-2}. The summary() function is available to obtain summary results:

> summary(result)

Fixed effects:
                  mean         sd 0.025quant   0.5quant 0.975quant        kld
(Intercept) -0.3890820 0.01163724 -0.4119139 -0.3890906 -0.3662326 0.07416333
x.k1        -0.3452493 0.01491875 -0.3746143 -0.3452261 -0.3160662 0.03071876
x.k2         1.0290746 0.01449569  1.0008734  1.0290775  1.0572161 0.30432151

Random effects:
Name     Model      Max KLD
plate.k  IID model  0.00026

Model hyperparameters:
                        mean     sd 0.025quant 0.5quant 0.975quant
Precision for plate.k 11.7230 0.6342    10.4854  11.7246    12.9717

Expected number of effective parameters (std dev): 1653.33 (36.31)
Number of equivalent replicates: 2.540

The variance estimates for the MLEs can be obtained as the variance estimates of the HDC-based estimators multiplied by k. For example, the square root of the variance estimate for \beta_2 in model (2) is equal to \sqrt{200} \times 0.0145 = 0.205. The estimate and standard deviation of the variance parameter \sigma can easily be extracted by calling the inla.hyperpar() and inla.expectation() functions:

> result.hyperpar = inla.hyperpar(result)
> precision = result.hyperpar$marginals[[1]]
> m1 = inla.expectation(function(x) (1/x)^0.5, precision)
> m2 = inla.expectation(function(x) 1/x, precision)
> sd = sqrt(k*(m2 - m1^2))
> print(c(mean=m1, sd=sd))

     mean        sd
0.2922371   0.10823
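The rescaling rule is plain arithmetic and worth checking: with k = 200 clones, an HDC posterior sd of 0.0145 for \beta_2 scales back to the MLE standard error via \sqrt{k} \times sd. The snippet below is our own check of that computation.

```python
import math

# Our own check of the rescaling rule: with k = 200 clones, the HDC
# posterior sd for beta_2 (0.0145 in the summary output above) scales
# back to the MLE standard error via sqrt(k) * sd.
k, sd_hdc = 200, 0.0145
print(round(math.sqrt(k) * sd_hdc, 3))   # 0.205, matching the text
```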

Table 5 presents the results obtained by the HDC method with k = 200 for the three different prior sets in the main and interaction effects models, respectively. These results are compared with the MLEs obtained by Breslow and Clayton (1993) using Gaussian quadrature. There is very close correspondence between the MLEs and the results obtained from the HDC method for the different priors. Furthermore, the HDC-based densities for the different priors are indistinguishable. These findings are illustrated in Fig. 2 for the interaction effect and the precision parameter of the random effect in the interaction model. For the other fixed effects, the HDC-based densities are likewise very close for the different priors.

To compare the accuracy and computational costs of the DC and HDC methods, we generated 10000 samples from the DC-based distribution using a one-block MCMC sampler. The great advantage of the HDC method is its high accuracy at low computational cost. We obtained the results for the HDC method in the main effects model in less than 6 s, whereas the MCMC samples for the DC-based distribution required about 557 s. The high accuracy of the HDC method is shown in Fig. 3 as well.

5.2 Longitudinal data

The Epilepsy data of Thall and Vail (1990) are a well-known dataset that has been analyzed several times by various authors.


Table 4 The proportion of seeds that germinated on each of 21 plates arranged according to a 2 by 2 factorial layout by seed (x1) and type of root extract (x2)

Original data               Cloned data
r    n    x1   x2   plate   r.k   n.k   x.k1   x.k2   plate.k
10   39   0    0    1       10    39    0      0      1
23   62   0    0    2       23    62    0      0      2
...                         ...
3    7    1    1    21      3     7     1      1      21
                            10    39    0      0      22
                            23    62    0      0      23
                            ...
                            3     7     1      1      21k

Table 5 MLEs and HDC-based estimates with k = 200 for the Seed data in the main and interaction effects models

Model        Par.         MLE^a           HDC1            HDC2            HDC3

Main         Intercept    −0.389 (0.166)  −0.389 (0.165)  −0.389 (0.164)  −0.389 (0.165)
             Seed         −0.347 (0.215)  −0.346 (0.211)  −0.345 (0.211)  −0.345 (0.211)
             Extract       1.029 (0.205)   1.029 (0.203)   1.029 (0.203)   1.029 (0.203)
             σ             0.295 (0.112)   0.292 (0.108)   0.292 (0.110)   0.292 (0.110)

Interaction  Intercept    −0.548 (0.167)  −0.548 (0.165)  −0.548 (0.165)  −0.548 (0.166)
             Seed          0.097 (0.278)   0.097 (0.276)   0.098 (0.276)   0.098 (0.276)
             Extract       1.337 (0.237)   1.337 (0.235)   1.337 (0.235)   1.337 (0.235)
             Interaction  −0.811 (0.385)  −0.810 (0.383)  −0.810 (0.382)  −0.810 (0.382)
             σ             0.236 (0.110)   0.234 (0.108)   0.233 (0.108)   0.234 (0.108)

^a From Breslow and Clayton (1993). Standard deviations are in brackets

They presented data from a clinical trial of 59 epileptics who were randomized to a new drug (Trt = 1) or a placebo (Trt = 0). Baseline data available at entry into the trial included the number of epileptic seizures recorded in the preceding 8-week period and age in years. The logarithm of 1/4 the number of baseline seizures (Base) and the logarithm of age (Age) were treated as covariates in the analysis. A multivariate response variable consisted of the counts of epileptic seizures, yij, for patient i during the 2 weeks before each of four clinic visits j (Visit, coded −3, −1, +1, +3), with Yij | β, ui ~iid Po(μij), i = 1, ..., 59; j = 1, ..., 4. An indicator of the fourth visit was also constructed to account for a possible differential effect. Following Fong et al. (2010), we concentrate on the three random effects models fitted by Breslow and Clayton (1993):

log(μij) = xijᵀβ + u1i,   (3)

log(μij) = xijᵀβ + u1i + u0ij,   (4)

log(μij) = xijᵀβ + u1i + u2iVj/10,   (5)

where xij is a 6 × 1 vector containing a 1 for the intercept, the baseline-by-treatment interaction and the above-mentioned covariates, with β as the associated vector of fixed effect coefficients. All three models include patient-specific random effects u1i ~ N(0, σ1²), while in model (4) we introduce independent measurement errors u0ij ~ N(0, σ0²), and model (5) includes random effects on the slope associated with visit, u2i, with

(u1i, u2i)ᵀ ~ N(0, Q⁻¹).
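For concreteness, the linear predictor of model (5) can be simulated as follows; the covariate values, coefficients and random effects standard deviations below are hypothetical placeholders, not the fitted quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pat, n_vis = 59, 4
visits = np.array([-3.0, -1.0, 1.0, 3.0])      # Visit codes V_j
X = rng.normal(size=(n_pat, n_vis, 6))         # placeholder 6-dim covariates
beta = rng.normal(0.0, 0.3, size=6)            # placeholder fixed effects
u1 = rng.normal(0.0, 0.5, size=n_pat)          # patient-specific intercepts
u2 = rng.normal(0.0, 0.7, size=n_pat)          # patient-specific slopes

# model (5): log(mu_ij) = x_ij' beta + u1_i + u2_i * V_j / 10
log_mu = X @ beta + u1[:, None] + u2[:, None] * visits[None, :] / 10.0
y = rng.poisson(np.exp(log_mu))                # conditionally Poisson counts
print(y.shape)                                 # (59, 4)
```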

Following Fong et al. (2010), we assume Q ~ Wishart(r, T) with

T = [T11 T12; T21 T22].

Here, similar to the previous subsection, we used three different sets of prior distributions. The first set, for all three models, consists of the priors considered by Fong et al. (2010). For the first model, the second set includes N(1,100) for β and a flat prior for log(σ1⁻²), while the third set includes N(0,10), N(−1,10), N(−2,100), N(2,100), N(0,10), N(1,100) for β and Ga(0.05,0.02) for σ1⁻². For the second model, the second prior set includes N(1,100) for β and flat priors for log(σ1⁻²) and log(σ0⁻²). The third set includes N(0,10), N(−1,10), N(−2,100), N(2,100), N(0,10), N(1,100)


Fig. 2 HDC-based densities of the interaction effect (top panel) and the precision parameter of the random effect (bottom panel) in the interaction model with k = 200. The graphs show the densities for the first (solid), second (dots), and third (dot-small dash) prior sets

for β, and Ga(0.05,0.02) and Ga(0.01,0.01) for σ1⁻² and σ0⁻², respectively. Finally, for the third model, the second and third prior sets include the same priors for the fixed effects as the second model, but with r = 4 and T = diag(3,4), and r = 6 and T = diag(0.5,0.5), respectively.

Table 6 presents the results obtained by the HDC method with k = 100, compared with the AGHQ and INLA results. It is clear that the results are indistinguishable for the different priors. Note that σ2² = T22⁻¹.

Figures 4 and 5 show the HDC-based densities of the fixed effects and of the precision parameters of the random effects for model (5), respectively. According to the figures, all three HDC-based densities for the different prior sets coincide. The results remain the same for the other models as well.

5.3 Crossed random effects: the salamander data

McCullagh and Nelder (1989) described an interesting dataset on the success of matings between male and female salamanders of two population types, roughbutts (RB) and whitesides (WS). The experimental design involves three experiments having multiple pairings, with each salamander being involved in multiple matings, so that crossed random effects are required. The first experiment was conducted during the summer of 1986, and the second and third were conducted in the fall. Each experiment involved 30 matings of each of


Fig. 3 Comparison of the HDC (solid) and DC-based (dashed) densities of the seed variety effect (top panel) and the precision parameter of the random effect (bottom panel) in the main effects model with k = 100

the four gender-population combinations. There are 360 binary responses in total. This complex dataset has been reanalyzed by several authors, such as Karim and Zeger (1992), Breslow and Clayton (1993), Bellio and Varin (2005) and Fong et al. (2010).

Let yijk be the binary response for female i and male j in experiment k. Here, we focus on the model that was considered by Fong et al. (2010):

logit Pr(Yijk = 1 | β, ufik, umjk) = xijkᵀβk + ufik + umjk,

where xijk is a 4 × 1 vector representing the intercept, an indicator WSf of whiteside females, an indicator WSm of whiteside males, and their interaction, and βk is the corresponding fixed effect. As Fong et al. (2010) have mentioned, this model allows the fixed effects to vary by experiment, and the model contains six random effects,

ufik ~iid N(0, σ²fk),   umjk ~iid N(0, σ²mk),   k = 1, 2, 3,

one for each of males and females in each experiment. Again, similar to the previous subsections, we used three different sets of prior distributions, the first set being the priors considered by Fong et al. (2010). The second set includes N(0,10) for β and flat priors for both log(σfk⁻²) and


Table 6 HDC-based estimates obtained using k = 100 for Epilepsy data in models (3)–(5), compared with AGHQ and INLA estimates. Standard deviations are in brackets

Model Par. AGHQ INLA HDC1 HDC2 HDC3

(3) Intercept −1.34 (1.18) −1.30 (1.30) −1.34 (1.18) −1.34 (1.17) −1.34 (1.17)

Base 0.88 (0.13) 0.88 (0.15) 0.88 (0.13) 0.88 (0.13) 0.88 (0.13)

Trt −0.93 (0.40) −0.94 (0.44) −0.93 (0.40) −0.93 (0.40) −0.93 (0.40)

Base×Trt 0.34 (0.20) 0.34 (0.22) 0.34 (0.20) 0.34 (0.20) 0.34 (0.20)

Age 0.48 (0.35) 0.47 (0.38) 0.48 (0.34) 0.49 (0.34) 0.48 (0.34)

V4 or V/10 −0.16 (0.05) −0.16 (0.05) −0.16 (0.05) −0.16 (0.05) −0.16 (0.05)

σ1 0.50 (0.09) 0.50 (0.06) 0.50 (0.04) 0.50 (0.04) 0.50 (0.04)

(4) Intercept −1.41 (1.16) −1.40 (1.32) −1.41 (1.17) −1.41 (1.18) −1.41 (1.17)

Base 0.88 (0.13) 0.88 (0.15) 0.88 (0.13) 0.88 (0.14) 0.88 (0.13)

Trt −0.95 (0.40) −0.96 (0.45) −0.95 (0.40) −0.95 (0.40) −0.95 (0.40)

Base×Trt 0.35 (0.20) 0.35 (0.23) 0.35 (0.20) 0.35 (0.20) 0.35 (0.20)

Age 0.49 (0.34) 0.48 (0.39) 0.48 (0.34) 0.49 (0.34) 0.48 (0.34)

V4 or V/10 −0.10 (0.09) −0.10 (0.09) −0.10 (0.09) −0.10 (0.09) −0.10 (0.09)

σ0 0.36 (0.08) 0.41 (0.04) 0.36 (0.04) 0.36 (0.04) 0.36 (0.04)

σ1 0.46 (0.09) 0.54 (0.06) 0.46 (0.06) 0.46 (0.06) 0.46 (0.06)

(5) Intercept −1.37 (1.17) −1.36 (1.30) −1.36 (1.19) −1.38 (1.20) −1.37 (1.19)

Base 0.89 (0.13) 0.88 (0.14) 0.89 (0.13) 0.88 (0.13) 0.89 (0.13)

Trt −0.93 (0.40) −0.94 (0.44) −0.93 (0.40) −0.93 (0.40) −0.93 (0.40)

Base×Trt 0.34 (0.20) 0.34 (0.22) 0.34 (0.20) 0.34 (0.20) 0.34 (0.20)

Age 0.48 (0.35) 0.47 (0.38) 0.47 (0.35) 0.48 (0.35) 0.48 (0.35)

V4 or V/10 −0.27 (0.16) −0.27 (0.16) −0.26 (0.17) −0.27 (0.17) −0.27 (0.16)

σ1 0.50 (0.09) 0.56 (0.08) 0.50 (0.06) 0.50 (0.06) 0.50 (0.06)

σ2 0.73 (0.11) 0.70 (0.14) 0.73 (0.15) 0.73 (0.15) 0.72 (0.15)

log(σmk⁻²). The third set also includes N(0,10) for β and Ga(0.1,0.1) for both σfk⁻² and σmk⁻².

Table 7 shows the results obtained by the HDC method with k = 100, compared with Laplace Approximation (LA) and INLA results. As one can see, the results are indistinguishable for the different priors. The results are also close to those of LA, although in some cases there are differences between their standard deviations. There is also a marked discrepancy between some of the HDC and INLA estimates, usually with slightly larger standard deviations under the latter.

Figures 6, 7 and 8 show the HDC-based densities of the precision parameters of the random effects obtained for the three experiments. According to the figures, in most cases there is a good match between the curves.

We also compared the computational costs of the DC and HDC methods using 10000 samples generated from the DC-based distribution. The computing time on the summer experiment data for the HDC method with k = 100 was about 65 s, whereas for the DC method it was about 5331 s.

6 Discussion

Although Fong et al. (2010) gave a number of prescriptions for prior specification, especially for variance components in a GLMM, the specification of a prior in these models is sometimes not straightforward. DC-based inferences, by contrast, are invariant to the choice of the prior distributions. Computation and convergence, however, are issues, since the usual implementation of the DC method is via MCMC. On the other hand, INLA provides precise estimates in seconds or minutes, even for models involving thousands of variables, in situations where any MCMC computation typically takes hours or even days. In this paper, we synthesized these two approaches and introduced a new HDC method whose performance, according to the obtained results, is very good and which inherits the invariance property of the DC method as well.

The benefits of our proposed method are the simplicity of implementation using the R INLA package and the efficiency with which the MLE is obtained. Most available alternative methods for computing the MLE in GLMMs, especially in models with crossed


Fig. 4 HDC-based densities of the fixed effects in model (5) with k = 100. The graphs show the densities for the first (solid), second (dots), and third (dot-dash) prior sets

Fig. 5 HDC-based densities of the precision parameters of the random effects in model (5) with k = 100. The graphs show the densities for the first (solid), second (dots), and third (dot-dash) prior sets

random effects, have disadvantages in terms of consistent estimation, loss of efficiency, required computational time and convergence assessment, e.g. penalized quasi-likelihood (Breslow and Clayton 1993), composite likelihood (Bellio and Varin 2005) and Monte Carlo expectation maximization (Booth et al. 2001).

A disadvantage of our work is that, according to the INLA methodology, the prior distributions for the fixed effects β must be Gaussian. However, we can use Gaussian priors with large variances as approximately flat priors, and the results, theoretically, are invariant to the choice of the priors as well. Alongside the good performance of the


Table 7 HDC-based estimates obtained using k = 100 for Salamander data for the summer and fall experiments, compared with LA and INLA estimates. Standard deviations are in brackets

Model Par. LA INLA HDC1 HDC2 HDC3

Summer Intercept 1.34 (0.62) 1.48 (0.72) 1.34 (0.62) 1.32 (0.62) 1.34 (0.62)

WSf −2.94 (0.88) −3.26 (1.01) −2.94 (0.89) −2.91 (0.90) −2.94 (0.90)

WSm −0.42 (0.63) −0.50 (0.73) −0.43 (0.64) −0.42 (0.63) −0.43 (0.64)

WSf × WSm 3.18 (0.94) 3.52 (1.03) 3.17 (0.94) 3.14 (0.96) 3.18 (0.96)

σf 1 1.25 (0.10) 1.29 (0.46) 1.24 (0.38) 1.24 (0.38) 1.23 (0.38)

σm1 0.27 (0.05) 0.78 (0.29) 0.34 (0.32) 0.38 (0.31) 0.37 (0.26)

First fall Intercept 0.57 (0.67) 0.56 (0.71) 0.54 (0.64) 0.55 (0.65) 0.55 (0.65)

WSf −2.46 (0.93) −2.51 (1.01) −2.37 (0.94) −2.39 (0.94) −2.40 (0.95)

WSm −0.77 (0.72) −0.75 (0.75) −0.72 (0.69) −0.73 (0.70) −0.73 (0.70)

WSf × WSm 3.71 (0.96) 3.74 (1.03) 3.55 (1.01) 3.58 (1.01) 3.60 (1.03)

σf 2 1.35 (0.11) 1.38 (0.50) 1.30 (0.42) 1.30 (0.42) 1.31 (0.42)

σm2 0.96 (0.09) 1.00 (0.36) 0.90 (0.38) 0.91 (0.39) 0.91 (0.40)

Second fall Intercept 1.02 (0.65) 1.07 (0.73) 0.99 (0.64) 0.93 (0.64) 1.00 (0.64)

WSf −3.23 (0.83) −3.39 (0.92) −3.16 (0.85) −3.15 (0.86) −3.18 (0.80)

WSm −0.82 (0.86) −0.85 (0.94) −0.79 (0.85) −0.75 (0.85) −0.79 (0.85)

WSf × WSm 3.82 (0.99) 4.02 (1.05) 3.74 (1.03) 3.75 (1.04) 3.76 (0.97)

σf 3 0.59 (0.07) 0.81 (0.28) 0.55 (0.40) 0.54 (0.42) 0.54 (0.43)

σm3 1.36 (0.11) 1.47 (0.48) 1.33 (0.39) 1.33 (0.43) 1.33 (0.42)

Fig. 6 HDC-based densities of the precision parameters of the random effects obtained for the summer experiment with k = 100. The graphs show the densities for the first (solid), second (dots), and third (dot-dash) prior sets

DC method, Baghishani and Mohammadzadeh (2011) have mentioned some of its limitations and possible solutions. We have not discussed the selection of the number of clones, k, since, in general, selecting k under different circumstances, such as sample size, random effects dimension and the number of parameters, needs further research.

In this paper, we assumed that the dimension of the random effects is fixed. But in some frameworks, such as spatial models and spline smoothing models, the number of random effects (spline basis functions) increases with the sample size. Extending the HDC method to such frameworks is interesting and a topic of further research.

Based on the INLA methodology, we need to assume a Gaussian distribution for the random effects. However, it is common to assume other distributions for the random effects, such as a (penalized) mixture of normals (Komarek and Lesaffre


Fig. 7 HDC-based densities of the precision parameters of the random effects obtained for the first fall experiment with k = 100. The graphs show the densities for the first (solid), second (dots), and third (dot-dash) prior sets

Fig. 8 HDC-based densities of the precision parameters of the random effects obtained for the second fall experiment with k = 100. The graphs show the densities for the first (solid), second (dots), and third (dot-dash) prior sets

2008) or a skew normal distribution (Hosseini et al. 2011). It might be possible to combine the DC method with other approximate Bayesian methods, like the one proposed by Eidsvik et al. (2009), in order to use other distributions for the random effects; again, this needs further research.

Finally, as Ponciano et al. (2009) have noted, nowadays the choice between Bayesian and frequentist approaches in GLMMs is no longer a matter of feasibility but rather can be based on the philosophical views of researchers.

Acknowledgements We thank the associate editor and three anonymous referees for their valuable comments and suggestions, which led to a considerable improvement of the paper. The first author wishes to thank the NTNU Department of Mathematical Sciences, Norway, for their support during his visit there. The third author would like to thank the Ordered and Spatial Data Center of Excellence of Ferdowsi University of Mashhad for partial support.

Appendix A: Asymptotic normality of the approximate posterior distribution

A formal statement of Result 1, stated in Sect. 3, is given and proved in this section. The lemmas needed for the following proofs are given in the supplemental materials. We first introduce some notation and calculations.

For notational simplicity, let η = (β, θ) be the q × 1 vector of parameters. Assume that the log-likelihood function ℓn(η) and its approximation ℓ̃n(η) are twice continuously differentiable with respect to η. Let ∇ℓn(η) and ∇ℓ̃n(η) be the vectors of first-order partial derivatives with respect to η. Furthermore, let ∇²ℓn(η) and ∇²ℓ̃n(η) be the matrices of second-order partial derivatives with respect to η. Here and subsequently, let ηn be the mode of ℓn(η), satisfying ∇ℓn(ηn) = 0.

To facilitate the asymptotic arguments, whenever ηn exists and −∇²ℓn(ηn) and −∇²ℓ̃n(ηn) are positive definite,


we define Qn and xn as follows:

QnᵀQn = −∇²ℓ̃n(ηn),   xn = Qn(η − ηn).
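Numerically, this standardization is a Cholesky factorization of the negative Hessian: if the negative Hessian equals LLᵀ, one may take Qn = Lᵀ. A small sketch with a hypothetical 2 × 2 Hessian and mode, checking that for a Gaussian with this precision the transformed variable has approximately identity covariance:

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])            # hypothetical negative Hessian at the mode
eta_n = np.array([0.5, -1.2])         # hypothetical mode

L = np.linalg.cholesky(H)             # H = L @ L.T
Q = L.T                               # hence Q.T @ Q = H

def standardize(eta):
    return Q @ (eta - eta_n)          # x_n = Q_n (eta - eta_n)

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(eta_n, np.linalg.inv(H), size=200_000)
x = (draws - eta_n) @ Q.T             # standardize all draws at once
print(np.cov(x.T))                    # approximately the 2 x 2 identity
```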

Then the approximate posterior density of xn is given by

πn(xn|y) ∝ πn(η(xn)|y) ∝ exp{ℓ̃n(η) − ℓ̃n(ηn)}π(η),   (A.1)

where the subscript n emphasizes the dependence of πn(xn|y) on the sample size. Let η0 be the true underlying parameter value, and let Pᶜn and Eᶜn denote the approximate conditional probability and expectation given the data y. In what follows, all probability statements are with respect to the true underlying probability distribution. We must then show that

Pᶜn(xn ∈ B) → Φq(B), as n → ∞,

where B is any Borel set in ℝq and Φq is the standard q-variate Gaussian distribution.

To bring the posterior distribution into a form suitable for Stein's Identity, we need the following calculations. To convert ℓ̃n(η) into a form close to normal, we first take the Taylor expansion

ℓn(η) = ℓn(ηn) + (1/2)(η − ηn)ᵀ∇²ℓn(η*)(η − ηn),

where η* lies between η and ηn. Now, by Remark 1 in the supplement, we have

ℓ̃n(η) = ℓ̃n(ηn) + (1/2)(η − ηn)ᵀ∇²ℓ̃n(η*)(η − ηn) + Rn,

where

Rn = [ℓn(ηn) − ℓ̃n(ηn)] + [ℓ̃n(η) − ℓn(η)] + (1/2)(η − ηn)ᵀ[∇²ℓn(η*) − ∇²ℓ̃n(η*)](η − ηn).

Let

kn(η) = −(1/2)(η − ηn)ᵀ[∇²ℓ̃n(ηn) − ∇²ℓ̃n(η*)](η − ηn).

Thus,

ℓ̃n(η) ≈ ℓ̃n(ηn) − (1/2)‖xn‖² + kn(η).   (A.2)
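As a check (our verification, not in the original), the remainder term is exactly what reconciles the exact and approximate expansions:

```latex
\begin{aligned}
\tilde\ell_n(\eta)
&= \ell_n(\eta) + \bigl[\tilde\ell_n(\eta) - \ell_n(\eta)\bigr] \\
&= \ell_n(\eta_n) + \tfrac12(\eta-\eta_n)^{T}\nabla^2\ell_n(\eta^*)(\eta-\eta_n)
   + \bigl[\tilde\ell_n(\eta) - \ell_n(\eta)\bigr] \\
&= \tilde\ell_n(\eta_n) + \tfrac12(\eta-\eta_n)^{T}\nabla^2\tilde\ell_n(\eta^*)(\eta-\eta_n) + R_n,
\end{aligned}
```

where the three terms of Rn collect, respectively, the gap between the two functions at the mode, the gap at η, and the Hessian difference. Splitting ∇²ℓ̃n(η*) into ∇²ℓ̃n(ηn) plus the difference, and using QnᵀQn = −∇²ℓ̃n(ηn), then gives (A.2) up to the asymptotically negligible Rn.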

Therefore, we can rewrite (A.1) as

πn(xn|y) ∝ φq(xn)fn(xn),   (A.3)

where fn(xn) = exp{kn(η)}π(η) and φq(·) denotes the standard q-variate Gaussian density.

Let ∇xn fn(xn) denote the derivative of fn(xn) with respect to xn. Hence,

∇xn fn(xn)/fn(xn) = (Qnᵀ)⁻¹[∇π(η)/π(η) + ∇kn(η)].   (A.4)

Let also D = {∇ℓn(ηn) = 0, −∇²ℓ̃n(ηn) > 0}. Here, A > 0 means that the matrix A is positive definite.

A.1 Derivation and Proof of Result 1

Let

S = {xn : xn = Qn(η − ηn); η ∈ N},   (A.5)

where N is given in the supplement. The following theorem establishes the asymptotic normality of the approximate posterior distribution. The regularity conditions (B1)–(B3) and (C1)–(C4) are given in the supplement.

Theorem 1 Let h ∈ Hs, let the prior π(η) satisfy (B1)–(B3) and let ℓ̃n(η) satisfy (C1)–(C4). Then Eᶜn[h(xn)] → h̄ in probability.

Proof Note that Uh and π(η) are bounded, by Lemma 3.1 of Weng and Tsai (2008) and (B1)–(B2). From (A.4) and Lemma 3 in the supplement, almost everywhere on D we have

Eᶜn[h(xn)] − h̄ = Eᶜn,xn,

where

Eᶜn,xn = Eᶜn{(Uh(xn))ᵀ(Qnᵀ)⁻¹∇π(η)/π(η)} + Eᶜn{(Uh(xn))ᵀ(Qnᵀ)⁻¹∇kn(η)} = A1 + A2.   (A.6)

Since P(Dᶜ) → 0 by (C1), it suffices to show that A1 + A2 → 0. First, A1 → 0 follows from Lemma 6 in the supplement and the assumption that ‖Qn⁻¹‖ → 0 in probability under (C1). Next, from (A.6) we have

A2 = ∫S (Uh(xn))ᵀ(Qnᵀ)⁻¹∇kn(η)φq(xn)fn(xn)dxn / ∫S φq(xn)fn(xn)dxn.

The denominator is bounded below by some K1 > 0, by Lemma 5(D1) in the supplement, so we need only show that the numerator converges to 0 in probability. We decompose the numerator into two integrals, over {‖xn‖ ≤ bn} ∩ S and {‖xn‖ > bn} ∩ S, and call the corresponding integrals A2,1 and A2,2, respectively. By (B1)–(B2), Lemma 3.1 of Weng and Tsai (2008), and

(Qnᵀ)⁻¹∇kn(η) = {Iq − (Qnᵀ)⁻¹[−(∂²ℓ̃n/∂ηi∂ηj)(η*ij)]Qn⁻¹}xn,

there exists a constant C1 > 0 such that

|A2,1| ≤ ∫{‖xn‖≤bn}∩S |(Uh(xn))ᵀ(Qnᵀ)⁻¹∇kn(η)| exp{ℓ̃n(η) − ℓ̃n(ηn)}dxn
≤ C1 sup{η: ‖xn‖≤bn, xn∈S} ‖Iq − (Qnᵀ)⁻¹[−(∂²ℓ̃n/∂ηi∂ηj)(η*ij)]Qn⁻¹‖ × ∫{‖xn‖≤bn}∩S ‖xn‖ exp{ℓ̃n(η) − ℓ̃n(ηn)}dxn.

Using (C2) and Lemma 4, part 2, in the supplement, we conclude that A2,1 → 0 in probability. Next, by (C3), (B1)–(B2) and Lemma 3.1 of Weng and Tsai (2008), there exists a constant C2 > 0 such that

|A2,2| ≤ C2 ∫{‖xn‖>bn}∩S ‖xn‖ʳ exp{ℓ̃n(η) − ℓ̃n(ηn)}dxn,

which, by Lemma 4, part 2, in the supplement, converges to 0 in probability. Hence, A2 → 0 in probability. This completes the proof. □

Appendix B: Asymptotic normality of the DC-based distribution

We define Fn,k and zn,k as follows:

Fn,kᵀFn,k = −∇²ℓ⁽ᵏ⁾n(ηn),   zn,k = Fn,k(η − ηn).

Therefore,

π⁽ᵏ⁾n(zn,k|y) ∝ π⁽ᵏ⁾n(η(zn,k)|y) ∝ exp{ℓ⁽ᵏ⁾n(η) − ℓ⁽ᵏ⁾n(ηn)}π(η).

Theorem 2 below shows that the asymptotic distribution of the DC-based distribution is normal.

Theorem 2 Let h ∈ Hs, let the prior π(η) satisfy (B1)–(B3) and let ℓ⁽ᵏ⁾n(η) satisfy conditions similar to (C1)–(C4) of Theorem 1, with n replaced by nk. Then Eᶜnk[h(zn,k)] → h̄ in probability as k → ∞.

Proof The proof uses the same techniques as the proof of Theorem 1. □

Appendix C: Asymptotic normality of the hybrid DC-based distribution

Combining the results of Theorems 1 and 2, we can establish the asymptotic normality of the HDC-based distribution. Let

Qn,kᵀQn,k = −∇²ℓ̃⁽ᵏ⁾n(ηn),   xn,k = Qn,k(η − ηn).   (C.1)

Now we can state the following theorem.

Theorem 3 Let h ∈ Hs, let the prior π(η) satisfy (B1)–(B3) and let ℓ̃⁽ᵏ⁾n(η) satisfy the appropriate conditions with n replaced by nk. Then Eᶜnk[h(xn,k)] → h̄ in probability as k → ∞.

Proof The proof follows by combining (C.1) with Theorem 1. □

References

Baghishani, H., Mohammadzadeh, M.: A data cloning algorithm for computing maximum likelihood estimates in spatial generalized linear mixed models. Comput. Stat. Data Anal. 55, 1748–1759 (2011)

Bellio, R., Varin, C.: A pairwise likelihood approach to generalized linear models with crossed random effects. Stat. Model. 5, 217–227 (2005)

Booth, J.G., Hobert, J.P., Jank, W.: A survey of Monte Carlo algorithms for maximizing the likelihood of a two-stage hierarchical model. Stat. Model. 1, 333–349 (2001)

Breslow, N.E., Clayton, D.G.: Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88, 9–25 (1993)

Crowder, M.J.: Beta-binomial ANOVA for proportions. Appl. Stat. 27, 24–37 (1978)

Eidsvik, J., Martino, S., Rue, H.: Approximate Bayesian inference in spatial generalized linear mixed models. Scand. J. Stat. 36, 1–22 (2009)

Fong, Y., Rue, H., Wakefield, J.: Bayesian inference for generalized linear mixed models. Biostatistics 11, 397–412 (2010)

Hosseini, F., Eidsvik, J., Mohammadzadeh, M.: Approximate Bayesian inference in spatial generalized linear mixed models with skew normal latent variables. Comput. Stat. Data Anal. 55, 1791–1806 (2011)

Karim, M.R., Zeger, S.L.: Generalized linear models with random effects: salamander mating revisited. Biometrics 48, 631–644 (1992)

Komarek, A., Lesaffre, E.: Generalized linear mixed model with a penalized Gaussian mixture as a random effects distribution. Comput. Stat. Data Anal. 52, 3441–3458 (2008)

Lele, S.R., Dennis, B., Lutscher, F.: Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecol. Lett. 10, 551–563 (2007)

McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall, London (1989)

McCulloch, C.E.: Maximum likelihood algorithms for generalized linear mixed models. J. Am. Stat. Assoc. 92, 162–170 (1997)

Ponciano, J.M., Taper, M.L., Dennis, B., Lele, S.R.: Hierarchical models in ecology: confidence intervals, hypothesis testing, and model selection using data cloning. Ecology 90, 356–362 (2009)

Rue, H., Held, L.: Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC Press, Boca Raton/London (2005)

Rue, H., Martino, S.: Approximate Bayesian inference for hierarchical Gaussian Markov random field models. J. Stat. Plan. Inference 137, 3177–3192 (2007)

Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc., Ser. B 71, 319–392 (2009)

Thall, P.F., Vail, S.C.: Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657–671 (1990)

Weng, R.C., Tsai, W.C.: Asymptotic posterior normality for multiparameter problems. J. Stat. Plan. Inference 138, 4068–4080 (2008)