Modelling the impact and in MCMC JR · Modelling the impact of households and geographies in health research: Multilevel models and MCMC methods using the new Stat‐JR package Practical

Modelling the impact of households and

geographies in health research:

Multilevel models and MCMC methods

using the new Stat‐JR package

Practical 3: Normal response multilevel analysis of the

Health Survey for England using Stat‐JR’s MCMC‐based

in‐house engine: e‐STAT

Contents

3.1 Fitting a single‐level model using the e‐STAT engine

3.2 Setting‐up a variance components model

3.2.1 Residuals

3.3 Fitting a main effects model

3.4 Generating predicted values

3.5 Random slopes

3.6 Adding interaction terms and quadratic age effects

3.7 Exploring other options

3.8 References


In this practical we will revisit the modelling that we performed this morning using MLwiN via Stat‐

JR, but this time we will use Stat‐JR’s own MCMC‐based engine: e‐STAT. We will fit similar models to

those we fitted this morning, using the same 2005 subset of the Health Survey for England dataset.

3.1 Fitting a single‐level model using the e‐STAT engine

We will, in fact, begin by fitting the simplest possible model to the bmival response, namely a null

model containing just a mean and variance estimate. To begin, start‐up webtest and on the main

screen choose the template Regression1 and the dataset HSE2005, clicking Set for each.

Regression1 is a simplified version of the Regression2 template, in that it only allows you to use the

e‐STAT estimation engine.

After you have made these choices, click Run and set‐up the template inputs as follows (i.e.

response: bmival; explanatory variables: cons; number of chains: 3; Random Seed: 1; length of

burnin: 500; number of iterations: 2000; thinning: 1; Use default algorithm settings: Yes; Name of

output results: out):

As discussed, since this template only uses the e‐STAT engine you are not offered a choice of

engines, but since e‐STAT is an MCMC‐based engine there are more inputs to specify. Here we are

asked how many chains we will run, and for each how long we will run in terms of a burnin followed

by a main run. These should all be terms that were described in the MCMC lecture. Thinning

describes how often we store values: a choice of 1 means every value is stored. For MCMC methods

Stat‐JR will create an output file containing the parameter chains – this will be stored as a dataset

(which we, again, have called “out” here) which can then be post‐processed.
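As a quick check on these inputs, the number of stored values can be worked out by hand. The sketch below assumes that the burnin is discarded and that, after the burnin, every thinning‐th iteration of each chain is kept; the function name stored_draws is ours and is not part of Stat‐JR.

```python
def stored_draws(chains, iterations, thinning=1):
    """Number of draws kept after the burnin, pooled over all chains.

    Assumption: burnin iterations are discarded and every `thinning`-th
    iteration of the main run is stored.
    """
    per_chain = iterations // thinning
    return chains * per_chain


# Inputs used in this practical: 3 chains, 2000 iterations, thinning of 1
total = stored_draws(chains=3, iterations=2000, thinning=1)
print(total)  # 6000
```

With a thinning of 2 the same run would store half as many values per chain.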


Clicking on Next will construct a model description based on the inputs, and will feed this through e‐

STAT’s algebra system for processing. If we then choose the object equation.tex in the right‐hand

pane the screen will look as shown below:

To the left we see the model description that e‐STAT uses, written in a variant of the WinBUGS

model language. To the right‐hand side you will see a LaTeX description of the model. Essentially we

are stating that each observed BMI comes from a Normal distribution with a mean β0 × cons_i (which is

actually just β0 as cons is a constant vector) and precision tau (variance sigma2). We also see the

default priors used: an improper uniform prior for beta0 and a gamma(0.001,0.001) prior for tau.
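Written out in full, the model described above is roughly the following. This is a sketch in our own notation, and we assume the gamma prior is parameterised by shape and rate, as in WinBUGS:

```latex
\begin{aligned}
\text{bmival}_i &\sim \mathrm{N}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0\,\text{cons}_i, \\
\tau &= 1/\sigma^2, \\
p(\beta_0) &\propto 1, \qquad \tau \sim \mathrm{Gamma}(0.001,\, 0.001).
\end{aligned}
```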

With the algebra system now run, the algebra for fitting the model can be seen by selecting the

object algorithm.tex from the right‐hand list and displaying it in its own tab:

As we saw in the lecture, the fixed effect beta0 has a Normal conditional posterior distribution,

whilst the precision has a gamma distribution. The other parameters are just functions of the


precision. The right‐hand list also includes several pieces of C++ code that will be compiled and run

to fit the model, but we will not look at these now.
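To make the two updating steps concrete, here is a minimal Python sketch of a Gibbs sampler for this model. It is not the C++ code that e‐STAT generates; it assumes the flat prior on beta0 and the gamma(0.001,0.001) prior on tau (taken as shape and rate) described above, and it runs on simulated stand‐in data rather than the HSE2005 values. All function and variable names are our own.

```python
import random
import statistics


def gibbs_normal(y, n_iter=2000, burnin=500, seed=1, a=0.001, b=0.001):
    """Gibbs sampler for y_i ~ Normal(beta0, 1/tau).

    Assumed priors: flat on beta0, Gamma(shape=a, rate=b) on tau.
    Returns a list of (beta0, sigma2) draws from the main run.
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    beta0, tau = ybar, 1.0  # arbitrary starting values
    draws = []
    for it in range(burnin + n_iter):
        # beta0 | tau, y ~ Normal(ybar, variance 1 / (n * tau))
        beta0 = rng.gauss(ybar, (1.0 / (n * tau)) ** 0.5)
        # tau | beta0, y ~ Gamma(shape a + n/2, rate b + SS/2)
        ss = sum((yi - beta0) ** 2 for yi in y)
        # random.gammavariate takes a SCALE, so pass 1 / rate
        tau = rng.gammavariate(a + n / 2.0, 1.0 / (b + ss / 2.0))
        if it >= burnin:
            # sigma2 is a deterministic function of the precision
            draws.append((beta0, 1.0 / tau))
    return draws


# Simulated stand-in data (NOT the HSE2005 values)
data_rng = random.Random(42)
y = [data_rng.gauss(25.5, 5.8) for _ in range(500)]

draws = gibbs_normal(y)
post_mean_beta0 = statistics.fmean(d[0] for d in draws)
post_mean_sigma2 = statistics.fmean(d[1] for d in draws)
print(round(post_mean_beta0, 2), round(post_mean_sigma2, 2))
```

The posterior mean of beta0 should sit very close to the sample mean of the simulated data, mirroring the agreement with the ML estimate that we would expect under a flat prior.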

If we now click on Run then the model will compile and run, and after a short while the screen will

update accordingly. All the new outputs will appear in the right‐hand list and so firstly we can look at

the familiar ModelResults output:

Here we see (posterior mean) estimates for the model parameters, along with posterior standard

deviations and effective sample sizes (ESSs); as described in the lecture, the latter indicate how well

the method performed and whether we need to run for more iterations. Here all parameters have

ESS of >5000 based on 6000 stored iterations, so the mixing of the chains was good.
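For readers who want to see what an ESS measures, the following Python sketch uses one common definition: the number of stored values divided by kappa = 1 + 2 × (sum of the autocorrelations). We truncate the sum at the first non‐positive autocorrelation; this truncation rule, and all names in the code, are assumptions made for illustration, and Stat‐JR's own calculation may differ in detail.

```python
import random


def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


def effective_sample_size(x, max_lag=200):
    """ESS = n / kappa, with kappa = 1 + 2 * sum of positive autocorrelations.

    Assumption: the sum stops at the first non-positive autocorrelation.
    """
    kappa = 1.0
    for lag in range(1, min(max_lag, len(x) - 1) + 1):
        rho = autocorr(x, lag)
        if rho <= 0:
            break
        kappa += 2.0 * rho
    return len(x) / kappa


rng = random.Random(1)
# A chain of independent draws (ideal mixing)
iid = [rng.gauss(0, 1) for _ in range(2000)]
# A strongly autocorrelated chain (poor mixing)
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0, 1))

ess_iid = effective_sample_size(iid)
ess_sticky = effective_sample_size(sticky)
print(round(ess_iid), round(ess_sticky))
```

The independent chain keeps an ESS close to its length, whereas the autocorrelated chain of the same length is worth far fewer independent values.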

When we fitted the same model in Practical 2 (towards the end of Section Error! Reference source

not found.), we found the ML estimates from MLwiN were 25.53 (0.066) for the mean, and 33.68

(0.54) for the variance: i.e. identical to those we see here, which is to be expected as both

parameters should have fairly symmetric posterior distributions and hence the mode (ML estimate)

and posterior mean with flat priors coincide.

The results here also include a fit diagnostic, the DIC (with value 48643.61), as discussed in the

lecture; this is of little use in isolation, but can be compared with the DIC from other models.

The final point of note is that the value of pD, the effective number of parameters, is 2, which is the

actual number of parameters in the model.

In the right‐hand list you will also find diagnostic plots for the parameters; for example, if you

choose beta0.png from the right‐hand list, and view it in a new tab, you will see the following:


Here we have six plots, as described in the lectures. The three chains are shown on top of each other

in the top‐left plot, and the predominance of the red chain, which was drawn last and almost covers

the other two, is indicative that the mixing here was very good. The top‐right panel gives kernel

density plots for each chain, and these all look approximately the same, and Normally‐distributed.

Both of the time series diagnostic plots in the middle row indicate little, if any, autocorrelation.

The Monte Carlo Standard Error (MCSE) graph in the bottom‐left shows that the MCSE is

already very small, and the Brooks‐Gelman‐Rubin diagnostic in the bottom‐right (BGRD in

red) reaches a value of 1 almost immediately, indicating good mixing.
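The idea behind this last diagnostic can be sketched in a few lines of Python. The version below is the classic variance‐ratio form of the Gelman‐Rubin statistic, which compares between‐chain and within‐chain variability; we assume it here purely for illustration, and the statistic Stat‐JR plots may be computed differently. The function name and the simulated chains are our own.

```python
import random


def gelman_rubin(chains):
    """Classic Gelman-Rubin ratio for m chains of equal length n.

    Values close to 1 suggest the chains are sampling the same distribution.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance (scaled by the chain length)
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # Average within-chain variance
    within = sum(
        sum((v - mu) ** 2 for v in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Three chains drawn from the same distribution
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values
stuck = [[rng.gauss(shift, 1) for _ in range(2000)] for shift in (0, 2, 4)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(round(rhat_mixed, 3), round(rhat_stuck, 3))
```

Chains that agree give a ratio very close to 1, while chains centred on different values give a ratio well above 1.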

If we next look at the precision (tau.png) graph we see similar good behaviour:


We will contrast these findings with later graphs when we fit further models. Next we consider a

multilevel model.

3.2 Setting‐up a variance components model

In Practical 2 we fitted a series of multilevel models to study the relationship between BMI and

several predictors in the context of clustering due to area and household. Here we will repeat this

analysis using MCMC, employing the same datasets (HSE2005 and HSE2005b) which have been

ordered in a way that facilitates the fitting of multilevel models. We begin again with the ‘variance

components’ model, which enables us to assess the extent of variation in BMI by area, household,

and between individuals.

The first model we will fit is a variance components model for bmival which, as previously discussed,

effectively divides the variance into amounts attributable to each level of the hierarchical structure.

As we have 3 levels (people within households within areas) we will again use the template

NLevelMod.


To begin select NLevelMod from the template list and click on Set and then click on Run to run the

template. Fill in the inputs as follows (i.e. Number of Classifications: 2; Classification 1: hserial;

Classification 2: area; response: bmival; specify distribution: Normal; explanatory variables: cons;

Store residuals: No; Choose estimation engine: eSTAT; number of chains: 3; Random Seed: 1; length

of burnin: 500; number of iterations: 2000; thinning: 1; Use default algorithm settings: Yes; Name of

output results: out):

Here you can see that the first inputs are the same as in Practical 2 but having chosen eSTAT as the

engine we need to add the MCMC inputs. We have kept the classifications in the order of nesting, so

observations are nested within classification 1 (hserial) which is, in turn, nested within classification

2 (area). This is not necessary for eSTAT, as it basically places together all observations with the

same hserial identifier, but this does mean that the data need to have unique identifiers: i.e. we

cannot have Household 1 in Area 1 being labelled as Household 1, when the first household indexed

in Area 2 is also labelled as Household 1 – each will need a unique number, which fortunately is

already the case in this dataset. Clicking on Next we see the model code and the maths equations

(after selecting equation.tex from the right‐hand list) as follows:


Here we see that the model has been extended to allow the mean to contain two random effects,

one for household (hserial) and one for area (area) which themselves are Normally‐distributed and

have precisions with gamma priors. We can also look at the algebra required to fit this model by

choosing algorithm.tex from the right‐hand list, viewing it in a separate tab:

There are several steps (which need to be scrolled through to view all) but essentially the same

building blocks of Normal and gamma posteriors, and deterministic calculations, are used.
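For reference, the variance components model can be written along the following lines. This is a sketch in our own notation: the name sigma2_u1 for the household‐level variance is assumed by analogy with sigma2_u2, and the gamma hyperparameters a and b are left general because the defaults for the random‐effect precisions are not stated in the text.

```latex
\begin{aligned}
\text{bmival}_i &\sim \mathrm{N}(\mu_i, \sigma^2), \\
\mu_i &= \beta_0\,\text{cons}_i + u^{(1)}_{\text{hserial}(i)} + u^{(2)}_{\text{area}(i)}, \\
u^{(1)}_j &\sim \mathrm{N}(0, \sigma^2_{u1}), \qquad u^{(2)}_k \sim \mathrm{N}(0, \sigma^2_{u2}), \\
\tau &= 1/\sigma^2, \quad \tau_{u1} = 1/\sigma^2_{u1}, \quad \tau_{u2} = 1/\sigma^2_{u2}, \\
p(\beta_0) &\propto 1, \qquad \tau,\ \tau_{u1},\ \tau_{u2} \sim \mathrm{Gamma}(a,\, b).
\end{aligned}
```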


Clicking on Run will start the estimation in the background, although fitting the model may take a

little while. When the estimation has finished we can select ModelResults in the right‐hand list to

get the following numbers:

These results are fairly similar to those we achieved from the ML fit in MLwiN in Practical 2,

although the between‐areas variance is a little smaller. More

importantly, the ESS value for this parameter (sigma2_u2) is very small at only 25. Looking at the

corresponding plot (sigma2_u2.png) in a separate tab we get the following interesting diagnostics:


The three chains here overlap, but the plot shows that the posterior is very skewed and so they follow

somewhat different trajectories. Their kernel plots show similar shapes but again with considerable

variability. The time series plots show severe autocorrelation, whilst the MCSE is large and the BGRD

is still quite high. All these diagnostics point to the need to run the chains for longer. To do this we

could, for example, type 8000 into the Extra Iterations box (to run 10,000 iterations per chain in

total) and hit the More button. If we do this, then after a short delay, we see the graph for

sigma2_u2 changes as follows:


Here we do not see much improvement! Selecting ModelResults from the right‐hand list we get:

As we discussed in Practical 2, the variance in question (between‐areas) is fairly small and there is

some evidence that it is not significant. The other parameters are not substantially changed by

running for longer and are not very different from the ML estimates. In fact, we can repeat the ML

interpretation of the model and can conclude from these results:

1. The overall average for bmival in our dataset is 25.742 (with a standard error of 0.076), as with

the ML method.


2. The standard assumption for multilevel models with a nested hierarchy (each level contained

within the next) is that the random effects are uncorrelated. This means that the total variation in

bmival is the sum of the three variance components. Hence, the total variance is 0.173 + 5.769 +

27.686 = 33.628

3. The percentage of the overall variation in BMI between areas is therefore 0.173 / 33.628 * 100 =

0.51 %

4. The percentage of the overall variation in BMI between households within areas is 5.769 / 33.628

* 100 = 17.16 %

5. The percentage of the overall variation in BMI between individuals within households within areas

is 27.686 / 33.628 * 100 = 82.33 %
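The arithmetic in points 2 to 5 can be reproduced with a few lines of Python; the variance estimates are those quoted above, and the dictionary keys are simply our own labels for the three levels.

```python
# Posterior mean variance estimates quoted in the text
components = {"area": 0.173, "household": 5.769, "individual": 27.686}

# Random effects at different levels are assumed uncorrelated,
# so the total variance is the sum of the components
total = sum(components.values())

# Percentage of the total variance attributable to each level
shares = {level: 100 * value / total for level, value in components.items()}

print(f"total variance: {total:.3f}")
for level, pct in shares.items():
    print(f"{level}: {pct:.2f} %")
```

The three percentages sum to 100, as they must when the components are added to form the total.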

These numbers are very close to those given by our earlier maximum likelihood estimates, and so

the same points apply: i.e. before controlling for explanatory information collected in the HSE, we

find that most of the variation is at the individual‐level, but there is also some variation at the

household‐level and, to a very small extent, at the area‐level. If we ignored the household and area‐

levels in a single‐level analysis, our results, especially the standard errors of regression coefficients,

would be incorrectly calculated because we would be ignoring the clustering of the data at these

levels. At the household‐level there is a lot of clustering, and far too many households to use dummy

variables; there are also too many areas to accommodate with dummy variables. Hence the

multilevel modelling approach is appropriate here.

Finally we can see towards the top of the ModelResults that the DIC is 48230.7; this is substantially

smaller than the DIC for the single‐level model (48643.6), indicating that this is a much better model.

Remember that the DIC is an information criterion, so it controls for model complexity thus allowing

us to compare the values directly; however, models whose DIC values are very close (i.e. less than 1 or 2 apart: e.g.

the change in DIC between running for 6,000 and 30,000 iterations, in this example) are probably

equally good. We also see that the pD (the effective number of parameters) in this case is

approximately 1,094. The model has 718 area effects and 3,872 household effects, but these share

prior distributions and so the effective number of parameters is shrunken from the actual number of

parameters.
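As a reminder of how these two quantities relate, the usual definition of the DIC is sketched below in our own notation, where D(θ) = −2 log L(θ) is the deviance, the bar over D denotes its posterior mean, and the bar over θ denotes the posterior mean of the parameters:

```latex
\begin{aligned}
p_D &= \bar{D} - D(\bar{\theta}), \\
\mathrm{DIC} &= \bar{D} + p_D = D(\bar{\theta}) + 2\,p_D .
\end{aligned}
```

Because the household and area effects are shrunken towards zero by their shared priors, each contributes less than one full parameter to pD.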

If we fit the model without area effects, after running for 6,000 iterations we get a DIC of 48,232.2

which is actually better than the model with area effects, run for the same length. Despite this, we

will keep area in the model for now, as it is important in the random slopes example we look at later.


When we start adding explanatory variables to the model, we can also use the DIC diagnostic to see

if the additional parameters we have added to the model make a difference to the model fit.

3.2.1 Residuals

As in Practical 2, although residual plotting and diagnostics are an important part of fitting a

multilevel model, you will see that we answered “No” to the suggestion of storing residuals – this is

because there are so many residuals that we would not want to store a chain of many thousand

values for each. We intend to work on storing summaries of the residuals as an output when using

Stat‐JR, and hence have a template for easily plotting caterpillar plots, etc., without having to store

each residual, but this has not yet been implemented. A caterpillar plot is possible for smaller

datasets with fewer residuals, though, as illustrated in the User’s Guide.

3.3 Fitting a main effects model

In this section we will see how MCMC can be used to add some covariates to the model from the

HSE: sex, age and ethnic group. In the earlier practicals we conducted some exploratory data analysis on

the dataset, using Stat‐JR’s SummaryStats, Tabulate, and XYPlot templates, so rather than

repeating that here we will proceed and fit the main effects model.

As we saw in Practical 1, there are a number of

ways we can deal with categorical predictors: they can be constructed using the MakeCats template,

the model can be fitted using a template that specifically asks for categorical predictors (e.g.

1LevelCat), or we can model a dataset that already has the categorical predictors and interactions

constructed. Here we will once again employ the third solution and use the dataset HSE2005b.

For our first model we will continue with the NLevelMod template and include main effects for the

three predictors that we looked at in our exploratory analyses. We will need to return to the main

menu, by clicking on either of the Change buttons towards the top of the browser window, and then

select the HSE2005b dataset, pressing Set, and then Run. Then fill in the template inputs as follows

(i.e. Number of Classifications: 2; Classification 1: hserial; Classification 2: area; response: bmival;

specify distribution: Normal; explanatory variables: cons, sex, agedev, ethnic_2, ethnic_3, ethnic_4,

ethnic_5; Store residuals: No; Choose estimation engine: eSTAT; number of chains: 3; Random Seed:

1; length of burnin: 500; number of iterations: 5000; thinning: 1; Use default algorithm settings: Yes;

Name of output results: out):


Given the poor mixing previously seen for the between‐area variance, here we have increased the

number of iterations to 5,000. If we then click Next and Run, then after a short delay during which

the model runs, we can then select ModelResults from the right‐hand list to see the following:


Here, as with the simpler variance components model, the ESS value for sigma2_u2 is poor, but

the ESS values for the other parameters are reasonable. The fixed effects beta0‐beta6 have values

similar to those from the maximum likelihood fit in Practical 2. Note that you

need to interrogate the model code to find out which predictor each beta refers to, but they are in

the order of the inputs!

As before, the results indicate that the coefficient for sex (female) is positive (0.020), although we

can readily see that the result is non‐significant, because the standard error (0.108) is bigger than

the absolute value of the coefficient (as discussed earlier, the absolute value of the coefficient would

need to be about twice (1.96 times) as big as the standard error to be statistically significant at the 5%

level). In addition, with MCMC we can look at the kernel plot for this parameter (beta1.png), as

shown below, and see that the value 0 is very close to the centre of the posterior distribution, again

indicating it is not statistically‐significant at the 5% level.
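The rule of thumb used here is easy to check numerically. The short Python sketch below uses the estimate and standard error quoted above for sex, and forms an approximate 95% interval under a Normal approximation (reasonable here given the symmetric posterior); the variable names are our own.

```python
# Posterior mean and standard deviation quoted for the sex coefficient
estimate, std_err = 0.020, 0.108

# Ratio of the estimate to its standard error
ratio = estimate / std_err

# Approximate 95% interval under a Normal approximation
lower = estimate - 1.96 * std_err
upper = estimate + 1.96 * std_err

# Significant at the 5% level only if |estimate| exceeds 1.96 standard errors
significant = abs(ratio) > 1.96

print(f"ratio = {ratio:.3f}")
print(f"approx. 95% interval = ({lower:.3f}, {upper:.3f})")
print(f"significant at 5% level: {significant}")
```

The interval comfortably includes zero, in line with the kernel plot.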


Overall the total variance has been reduced, and the variance components are smaller in magnitude

than those for the variance components model, aside from the variance attributable to the area‐

level which has increased slightly; otherwise, some of the variation at the household‐ and individual‐

level has been explained by the explanatory variables.

We can see that age is highly significant with its estimate being far larger than its standard error. The

ethnicity effects vary from 1.071 for ethnic_4 (Black or Black British) to -1.246 for ethnic_5

(Chinese or any other group). These are all relative to the base category, ethnic_1 (‘white’). Although none of the

estimates are greater than their standard errors, as discussed before the fact that some are positive

and some negative, and that the base category is somewhat arbitrary, means that differences between ethnic groups might still be

significant.

Overall, the model DIC is reduced to 46017 from 48230.7 indicating it is a much better‐fitting model.

This will be mainly due to the importance of the very significant age predictor. If we wish to

determine if ethnicity is, in fact, important we can fit the model without the ethnicity terms and

compare the DIC. We leave this as an exercise for the interested reader.
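To keep track of the comparisons made so far, the DIC values quoted in this practical can be tabulated and differenced, as in the Python sketch below. The model labels are our own; the values are those reported above.

```python
# DIC values quoted so far in this practical
dic = [
    ("single-level null model", 48643.61),
    ("variance components model", 48230.7),
    ("main effects model", 46017.0),
]

# Change in DIC between each model and the one before it
# (a negative change indicates a better-fitting model)
changes = [cur - prev for (_, prev), (_, cur) in zip(dic, dic[1:])]

for (name, _), change in zip(dic[1:], changes):
    print(f"{name}: DIC change of {change:.1f}")
```

Both changes are far larger than the differences of 1 or 2 that we would regard as negligible. The same comparison could be applied to a model refitted without the ethnicity terms.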


3.4 Generating predicted values

As noted in Practical 2, Stat‐JR can do some predictions, for example plotting predicted lines for each

ethnicity for an average area, by using the Calculate and XYGroupPlot templates, and we can do

exactly the same with the MCMC estimates.

So, firstly click on Change to return to the main menu and then select Calculate from the template

list, then Set and Run, before filling in the template inputs as follows (i.e. Output column name:

prediction; Numeric expression: 25.466666+0.135026*agedev-0.984491*ethnic_2-0.318983*

ethnic_3+1.071005*ethnic_4-1.246019*ethnic_5; Name of output dataset: out):

Here we include the age and ethnicity coefficients, but leave out gender, thus predicting for the

latter’s base category (men). If we now click on Run, we will create a new dataset, called “out”,

containing an additional column called “prediction”. To next create the plot, we first click Change to

get to the main menu, and then select XYGroupPlot as the template, and “out” as the dataset

(pressing Set for each). After you have done that, click Run and set‐up the template inputs as shown

(X values: agedev; Y values: prediction; Grouped by: ethnic), before finally pressing Next and Run

once more to get the following:


Here we see the prediction lines for all five ethnicities and we get an idea of the differences. These

predictions are almost identical to those obtained by the maximum likelihood fit.
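The expression entered into the Calculate template can also be mirrored outside Stat‐JR, which is handy for checking individual predictions. The Python sketch below uses the coefficients from that expression; it assumes that the grouping variable ethnic is coded 1 to 5 with category 1 as the base, and the function name predict_bmi is our own.

```python
# Coefficients taken from the Calculate expression above
COEF = {
    "cons": 25.466666,
    "agedev": 0.135026,
    "ethnic_2": -0.984491,
    "ethnic_3": -0.318983,
    "ethnic_4": 1.071005,
    "ethnic_5": -1.246019,
}


def predict_bmi(agedev, ethnic=1):
    """Predicted BMI for a man in an average area and household.

    Assumption: `ethnic` is coded 1-5, with 1 the base category
    (so no dummy-variable coefficient is added for it).
    """
    value = COEF["cons"] + COEF["agedev"] * agedev
    if ethnic != 1:
        value += COEF[f"ethnic_{ethnic}"]
    return value


# Prediction at the centre of agedev for the base category
print(round(predict_bmi(0, 1), 3))
# Prediction 10 units above the centre of agedev for ethnic category 4
print(round(predict_bmi(10, 4), 3))
```

Since sex is omitted and the random effects are set to zero, each line in the plot corresponds to men in an average household within an average area.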

Next we will investigate expanding the model still further, by fitting random slopes models using

MCMC.

3.5 Random slopes

As discussed in Practical 2, if we wanted to assess whether the relationship between BMI and age is

different in different areas, we could do this by adding a ‘random slope’ for age to our model. To fit

random slopes models in Stat‐JR using MCMC we require the template NLevelRS. To set‐up the

model, firstly click on Change to return to the main menu, and then select NLevelRS from the

template list before pressing Set. Then, choose HSE2005b as the dataset, and click on the Set button

alongside that. Once you have made those choices, click on Run and specify the template inputs as

follows (i.e. Number of Classifications: 2; Classification 1: hserial; Classification 2: area; response:

bmival; specify distribution: Normal; explanatory variables: cons, sex, agedev, ethnic_2, ethnic_3,

ethnic_4, ethnic_5; explanatory variables random at hserial classification: cons; explanatory

variables random at area classification: cons, agedev; Priors: Uniform; Store residuals: No; Choose

estimation engine: eSTAT; number of chains: 3; Random Seed: 1; length of burnin: 500; number of

iterations: 5000; thinning: 1; Use default algorithm settings: Yes):


We will start by using an (improper) uniform prior for the variance matrix at the area‐level. Fitting a

random slopes model is harder than fitting a random intercept model, and in fact Stat‐JR’s algebra

system currently cannot fit such models (there is work in progress to rectify this) as it doesn’t know

how to update the variance matrix. However, the template we have chosen will fit the model, thanks

to some of its in‐built code. If (after entering a name for the output results) we click on Next and then

select algorithm.tex from the right‐hand list, viewing it in its own tab, we will see the following:


The steps are slightly larger, and the algorithm updates the area intercepts (u0_1) and the slopes

(u1_1) in separate steps. We also see that omega_u2 (the area‐level variance) is “updated using a

custom user defined step” which indicates that the template has itself supplied code for this

parameter.

If we return to the main browser tab and click Run, we find the model takes quite a while to run;

after it has done so, however, if we select ModelResults from the right‐hand list we see the

following:


Here we see that we get three terms beginning with the prefix omega_u2: these are the variance

terms at the area‐level. The variance values are generally larger than those obtained using maximum

likelihood estimation in Practical 2, which may be an artefact of

using a uniform prior on the variance matrix, and also the skewed nature of the variance parameters

when comparing means to modes: the between‐area variance of the intercept changes (0.479 here;

0.299 for ML), the between‐area variance of the slope for age‐gm changes (0.0009 here; 0.0007 for

ML), and the co‐variance between intercept and slope at the area‐level changes as well (0.0166

here; 0.021 for ML).
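One way to read these three terms together is through the implied between‐area variance at a given value of the age variable: for an area effect of the form u0 + u1 × agedev, the variance is var(u0) + 2 × cov(u0, u1) × agedev + var(u1) × agedev². The Python sketch below evaluates this standard formula using the uniform‐prior estimates quoted above; the function and variable names are our own, and the agedev values chosen are purely illustrative.

```python
# Area-level estimates quoted above (uniform prior)
VAR_INTERCEPT = 0.479   # between-area variance of the intercept
COVARIANCE = 0.0166     # covariance between intercept and slope
VAR_SLOPE = 0.0009      # between-area variance of the age slope


def area_variance(agedev):
    """Between-area variance of u0 + u1 * agedev at a given agedev."""
    return VAR_INTERCEPT + 2 * COVARIANCE * agedev + VAR_SLOPE * agedev ** 2


# Illustrative values of agedev (chosen by us, not taken from the data)
for agedev in (-10, 0, 10, 20):
    print(f"agedev = {agedev:>3}: between-area variance = {area_variance(agedev):.3f}")
```

With a positive covariance, the between‐area variance increases with agedev over this range, so areas differ more in BMI at older ages than at younger ones.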

We can test the sensitivity of our results to the choice of prior by comparing these results to those

from an alternative model run with a Wishart prior. This requires us to supply a prior guess (the “R

matrix” in the template inputs), and so we will try 0.1 for both variances and 0 for the covariance,

and set the degrees of freedom to be 2: the minimum possible. To do this we click on Start again and

then set‐up the inputs as follows (i.e. Number of Classifications: 2; Classification 1: hserial;

Classification 2: area; response: bmival; specify distribution: Normal; explanatory variables: cons,

sex, agedev, ethnic_2, ethnic_3, ethnic_4, ethnic_5; explanatory variables random at hserial

classification: cons; explanatory variables random at area classification: cons, agedev; Priors:

Wishart; R matrix: 0.1,0,0.1; Degrees of Freedom: 2; Store residuals: No; Choose estimation engine:

eSTAT; number of chains: 3; Random Seed: 1; length of burnin: 500; number of iterations: 5000;

thinning: 1; Use default algorithm settings: Yes):


This time after clicking on Next, entering a name for the output results, and then clicking on Next

and Run again, we get the following ModelResults:

Compared to the model we ran with a uniform prior, we can see that the area‐level variance terms

in this model are closer to the prior values we provided (in the R matrix), indicating that these

parameter estimates are quite sensitive to the choice of prior, although it has less impact on the

other parameters in the model. Indeed, if we look at the DIC diagnostic we get 45516.7 for the

model with a uniform prior, and 45930.5 for the model with the Wishart prior we have just specified,


suggesting that the latter was a poorer model, possibly because our prior guess was a long way from

that suggested by the data.

Both priors, however, give better DIC values than the 46017 for the model without random slopes,

suggesting that it is important to allow the slopes to vary randomly in this manner. As we discussed

earlier, the variance in slopes is small because age is recorded on a large scale (0‐99): if

we had standardised age instead, we would have seen a larger slope variance.
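The effect of rescaling can be seen from a short derivation. Suppose we replace age by z = age / s for some scaling constant s (for standardisation, s would be the standard deviation of age); the symbols here are our own, and the numerical value of s below is purely illustrative rather than taken from the data.

```latex
\begin{aligned}
u_{1}\,\text{age} &= (s\,u_{1})\,z
  \quad\Longrightarrow\quad
  \operatorname{Var}(s\,u_{1}) = s^{2}\,\operatorname{Var}(u_{1}), \\
\text{e.g. } s = 10:\quad & 10^{2} \times 0.0009 = 0.09 .
\end{aligned}
```

The slope variance therefore grows with the square of the scaling constant, although the fitted model itself is unchanged.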

Default priors for variance matrices are not a solved issue (see Browne and Draper, 2000).

3.6 Adding interaction terms and quadratic age effects

In Practical 2 we explored fitting further predictors to the model. Rather than repeat instructions for

this here you are very welcome, if time allows, to try this yourself and see if you get similar answers

from MCMC to those we obtained in Practical 2.

3.7 Exploring other options

By now you should be familiar with using the e‐STAT engine in Stat‐JR. MLwiN has its own MCMC

sampler, so if you wanted to try using that instead, it’s just a case of selecting it when asked to

choose the estimation method. You can then check it gives similar estimates to those you have

obtained from e‐STAT.

The NLevelMod template we have used fits each fixed effect via a univariate Normal updating step.

If we were to use the NLevelBlock template, on the other hand, this instead performs one

multivariate Normal step for all fixed effects. You are welcome to try this template for any models

with several predictors to see if it improves the ESS values.

3.8 References

Browne, W. J. and Draper, D. (2000). Implementation and performance issues in the Bayesian and

likelihood fitting of multilevel models. Computational Statistics, 15: 391‐420.
