
Stat 544 Outline
Spring 2008

Steve Vardeman
Iowa State University

August 13, 2009

Abstract

This outline summarizes the main points of the course lectures.

Contents

1 Bayes Statistics: What? Why? A Worry ...? How?
  1.1 What is Bayesian Statistics?
  1.2 Why Use Bayesian Statistics?
  1.3 A Worry ...?
  1.4 How Does One Implement the Bayesian Paradigm?

2 Some Simulation Methods Useful in Bayesian Computation
  2.1 The Rejection Algorithm
  2.2 Gibbs (or Successive Substitution) Sampling
  2.3 Slice Sampling
  2.4 The Metropolis-Hastings Algorithm
  2.5 Metropolis-Hastings-in-Gibbs Algorithms

3 The Practice of Modern Bayes Inference 1: Some General Issues
  3.1 MCMC Diagnostics
  3.2 Considerations in Choosing Priors
    3.2.1 "The Prior"
    3.2.2 Conjugate Priors
    3.2.3 "Flat"/"Diffuse"/"Non-Informative"/"Robust" Priors
    3.2.4 Jeffreys Priors
  3.3 Considerations in Choice of Parametrization
    3.3.1 Identifiability
    3.3.2 Gibbs and Posterior Independence
    3.3.3 Honoring Restrictions Without Restricting Parameters
  3.4 Posterior (Credible) Intervals
  3.5 Bayes Model Diagnostics and Bayes Factors for Model Choice
  3.6 WinBUGS, Numerical Problems, Restarts, and "Tighter Priors"
  3.7 Auxiliary Variables
  3.8 Handling Interval Censoring and Truncation in WinBUGS

4 The Practice of Bayes Inference 2: Simple One-Sample Models
  4.1 Binomial Observations
  4.2 Poisson Observations
  4.3 Univariate Normal Observations
    4.3.1 σ² Fixed/Known
    4.3.2 μ Fixed/Known
    4.3.3 Both μ and σ² Unknown
  4.4 Multivariate Normal Observations
    4.4.1 Σ Fixed/Known
    4.4.2 μ Fixed/Known
    4.4.3 Both μ and Σ Unknown
  4.5 Multinomial Observations

5 Graphical Representation of Some Aspects of Large Joint Distributions
  5.1 Conditional Independence
  5.2 Directed Graphs and Joint Probability Distributions
    5.2.1 Some Graph-Theoretic Concepts
    5.2.2 First Probabilistic Concepts and DAGs
    5.2.3 Some Additional Graph-Theoretic Concepts and More on Conditional Independence
  5.3 Undirected Graphs and Joint Probability Distributions
    5.3.1 Some Graph-Theoretic Concepts
    5.3.2 Some Probabilistic Concepts and Undirected Graphs

6 The Practice of Bayes Inference 3: (Mostly) Multi-Sample Models
  6.1 Two-Sample Normal Models (and Some Comments on "Nested" Models)
  6.2 r-Sample Normal Models
  6.3 Normal Linear Models (Regression Models)
  6.4 One-Way Random Effects Models
  6.5 Hierarchical Models (Normal and Others)
  6.6 Mixed Linear Models (in General) (and Other MVN Models With Patterned Means and Covariance Matrices)
  6.7 Non-Linear Regression Models, etc.
  6.8 Generalized Linear Models, etc.
  6.9 Models With Order Restrictions
  6.10 One-Sample Mixture Models
  6.11 "Bayes" Analysis for Inference About a Function g(t)

7 Bayesian Nonparametrics
  7.1 Dirichlet and Finite "Stick-Breaking" Processes
  7.2 Polya Tree Processes

8 Some Scraps (WinBUGS and Other)
  8.1 The "Zeroes Trick"
  8.2 Convenient Parametric Forms for Sums and Products

9 Some Theory of MCMC for Discrete Cases
  9.1 General Theory
  9.2 Application to the Metropolis-Hastings Algorithm
  9.3 Application to the Gibbs Sampler
  9.4 Application to Metropolis-Hastings-in-Gibbs Algorithms
  9.5 Application to "Alternating" Algorithms


1 Bayes Statistics: What? Why? A Worry ...? How?

In this initial qualitative introduction to Bayesian statistics, we'll consider four questions:

1. What is it?

2. Why use it?

3. What about a worry?

4. How does one implement it in practice?

1.1 What is Bayesian Statistics?

Standard probability-based statistical inference begins with (typically vector) data Y modeled as an observable random vector/variable. The distribution of Y is presumed to depend upon some (typically vector) parameter θ that is unknown/unobservable, and potentially on some (typically vector) "covariate" X that is observed. The object is often to make plausibility statements about θ. Sometimes, one thinks of Y as comprised of two parts, that is

Y = (Y1, Y2)

where one first observes Y1 and also needs to make plausibility statements about Y2. In any case, one supposes that the θ distribution of Y is specified by some probability density

f(y|θ, X)   (1)

and this function then specifies an entire family of probability models for Y, one for each different θ. A couple of comments are in order regarding the form (1). In the first place, f(y|θ, X) could be a probability density used to get probabilities by doing "ordinary" Riemann integration over some part of ℝ^k, or it could be a probability mass function used to get probabilities by adding over some discrete set of points y, or it could be some combination of the two, used to get probabilities by doing Riemann integration over some coordinates of y while adding over values of the other coordinates of y. Secondly, since one takes the values of the covariates as known/fixed, we will typically not bother to display the dependence of (1) on X.

"Classical" statistical inference treats the density (with the observed data y plugged in)

L(θ) ≡ f(y|θ)

as a (random) function of θ called the likelihood function and uses it alone to guide inference/data analysis. Formally, "Bayes" statistical inference adds to the model assumptions embodied in (1) a model assumption on θ, that says


there is a density g(θ) that specifies a "prior" distribution for θ. This is intended to describe a "pre-data" view of the parameter. It too can be a probability density, a probability mass function, or a combination of the two. (More generally, it can in some cases simply be a non-negative function of θ, but more on that in a bit.)

In the standard/most easily understood case where g(θ) does specify a probability distribution for θ, the product

f(y, θ) = f(y|θ) g(θ)

specifies a joint distribution for (Y, θ). This in turn means that the conditional distribution of θ given Y = y is specified by the conditional density

g(θ|y) = f(y, θ) / ∫ f(y, θ) dθ   (2)

(where the "integral" in the denominator of (2) is a Riemann integral, a sum, or some combination of the two). Of course, the denominator of (2) is

f_Y(y) ≡ ∫ f(y, θ) dθ

which is NOT a function of θ. Thus the posterior density g(θ|y) is a function of θ proportional to

f(y|θ) g(θ) = L(θ) g(θ)

and Bayes statistical inference is based on the notion that it is this product that should be the basis of plausibility statements about θ (and in the case that only Y1 = y1 is observed, that the product f(y1, y2|θ) g(θ) should be the basis of all plausibility statements about Y2 and/or θ).

Notice that it is sufficient but not always necessary that g(θ) be a density for the product f(y|θ) g(θ) to be proportional to a density for θ, or for f(y1, y2|θ) g(θ) to be proportional to a joint density for Y2 and θ. That is, sometimes g(θ) can fail to be a density because it has an infinite "integral" and yet f(y|θ) g(θ) or f(y1, y2|θ) g(θ) can be perfectly useful (after normalization) as a density for θ or (Y2, θ). In this case, it is common to say that g(θ) specifies an "improper prior" (a prior "distribution" that has total mass not 1, but rather ∞).

The Bayes Paradigm is then:

All plausibility statements about θ are based on a product

f(y|θ) g(θ) = L(θ) g(θ)

(and in the case that only Y1 = y1 is observed, plausibility statements about Y2 and/or θ are based on a product

f(y1, y2|θ) g(θ))

the first of which specifies a "posterior distribution" for θ, the second of which specifies a joint predictive posterior/posterior distribution for (Y2, θ).
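As a small concrete check of the paradigm (a hypothetical Python illustration, not from the lectures), one can normalize L(θ)g(θ) numerically on a grid for a Binomial(n, θ) likelihood with a Beta(a, b) prior, and compare against the known conjugate answer:

```python
import numpy as np

# Hypothetical illustration: the posterior is proportional to L(theta)*g(theta),
# normalized here on a grid of theta values in [0, 1].
def grid_posterior(y, n, a, b, m=10001):
    theta = np.linspace(0.0, 1.0, m)
    # Binomial(n, theta) likelihood times Beta(a, b) prior, constants dropped
    unnorm = theta**y * (1 - theta)**(n - y) * theta**(a - 1) * (1 - theta)**(b - 1)
    return theta, unnorm / unnorm.sum()  # normalize over the grid

theta, post = grid_posterior(y=7, n=10, a=2, b=2)
post_mean = (theta * post).sum()
# Conjugacy gives the exact posterior Beta(a + y, b + n - y) = Beta(9, 5) here,
# whose mean is 9/14, so post_mean should be close to 9/14.
```

The point of the sketch is that only the product likelihood-times-prior is ever needed; the normalizing constant is recovered numerically.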


1.2 Why Use Bayesian Statistics?

There are at least 3 kinds of answers to the "why?" question concerning Bayes statistical methods. These are

1. philosophical answers,

2. optimality/decision theoretic answers, and

3. pragmatic answers.

Real "card-carrying" philosophical Bayesians argue that the only rationallycoherent way of making statistical inferences is through the use of the BayesParadigm. The early parts of most books on Bayes inference provide thesekinds of arguments. I�m not terribly interested in them. You should probablyhave a look at one or two such discussions.Optimality/decision theory arguments for Bayes methods are based on min-

imization of expected costs. That is, supposes that

- θ ∈ Θ, a parameter space,

- there are possible "actions" a ∈ A (an action space),

- associated with each pair (θ, a) there is some "loss" L(θ, a) ≥ 0,

- actions may be chosen on the basis of Y ~ f(y|θ), a distribution over some observation space 𝒴; that is, there are "decision rules" δ : 𝒴 → A.

Then there are theorems that say things roughly like "Essentially only decision rules δ_g(y) that for some prior specified by g have the form

δ_g(y) = an a minimizing ∫ L(θ, a) f(y|θ) g(θ) dθ   (3)

can be any good in terms of

E_θ L(θ, δ(Y)) = ∫ L(θ, δ(y)) f(y|θ) dy,

the expected loss function." Notice that ∫ L(θ, a) f(y|θ) g(θ) dθ is proportional to the "posterior mean loss of action a," that is,

∫ L(θ, a) f(y|θ) g(θ) dθ = f_Y(y) ∫ L(θ, a) g(θ|y) dθ

so δ_g(y) of the form (3) is an action that minimizes the (g) posterior expected loss (and is called a "Bayes rule" for prior g).

As a bit of a digression, it's worth noting that most philosophical Bayesians do not like optimality arguments for Bayes procedures (or at least do not find them compelling). This is because an expected loss E_θ L(θ, δ(Y)) involves an integration/averaging over the observation space 𝒴. A philosophical Bayesian


would question the relevance of averaging over outcomes that one knows did not occur ... that is, once Y = y is known, such a person would argue that the only probability structure that is at all relevant is g(θ|y).

The third kind of answer to the "why?" question of Bayesian statistics is purely pragmatic. The Bayesian paradigm provides an almost alarmingly simple and unified framework for statistical analysis. There is the model for the data (the likelihood) and the prior that give a joint distribution for "everything" (data, parameters, and future observations) that in turn gives a conditional (posterior) for everything that is not observed given everything that is observed. End of story. "All" one has to do is describe/understand/summarize the posterior. All of statistical inference has been reduced to probability calculations within a single probability model.

In contrast to "classical" statistics with its family of probability models indexed by θ, and the seeming necessity of doing "custom development" of methods for each different family and each different inference goal (estimation, testing, prediction, etc.), Bayesian statistics takes essentially the same approach to all problems of inference. (Bayesians might go so far as to say that while there is a well-defined "Bayes approach" to inference, there really is no corresponding classical or non-Bayesian "approach"!) Further, recent advances in Bayesian computation have made it possible to implement sensible Bayes solutions to statistical problems that are highly problematic when attacked from other vantage points. These are problems with particularly complicated data models, and especially ones where θ is of high dimension. For example, maximum likelihood for a 100-dimensional θ involves optimization of a function of 100 variables ... something that (lacking some kind of very specific helpful analytic structure) is often numerically difficult-to-impossible. In the same problem, modern Bayesian computation methods can make implementation of the Bayes paradigm almost routine.

1.3 A Worry ...?

Possibly the most worrisome feature of the Bayes paradigm is that the posterior distribution specified by g(θ|y) (or g(θ, y2|y1)) of course depends upon the choice of prior distribution, specified by g(θ). Change the form of the prior and the final inferences change. This obvious point has long been a focal point of debate between philosophical Bayesians and anti-Bayesians. Anti-Bayesians have charged that this fact makes Bayes inferences completely "subjective" (a serious charge in scientific contexts). Bayesians have replied that in the first place "objectivity" is largely an illusion, and besides, the choice of prior is a modeling assumption in the same class as the modeling choice of a likelihood, that even anti-Bayesians seem willing to make. Anti-Bayesians reply "No, a likelihood and a prior are very different things. A likelihood is something that describes in probabilistic terms what reality generates for data. In theory at least, its appropriateness could be investigated through repetitions of data collection. Everyone admits that a prior exists only in one's head. Putting these two different kinds of things into a single probability model is not sensible." Bayesians reply that doing so is the only way to be logically consistent in inferences. And so the debate has gone ...

Two developments have to a large extent made this debate seem largely

irrelevant to most onlookers. In the first place, as the probability models that people wish to use in data analysis have grown more and more complicated, the distinction between what is properly thought of as a model parameter and what is simply some part of a data vector that will go unobserved has become less and less clear. To many pragmatists, there are simply big probability models with some things observable and some things unobservable. To the extent that "Bayes" methods provide a way to routinely handle inference in such models, pragmatists are willing to consider them without taking sides in a philosophical debate.

The second development is that Bayesians have put a fair amount of work into the search for "flat" or "diffuse" or "objective" or "non-informative" or "robust" priors (or building blocks for priors) that tend to give posteriors leading to inferences similar to those of "classical"/non-Bayesian methods in simple problems. The idea is that one could then hope that when these building blocks are used in complicated problems, the result will be inferences that are "like" classical inferences and do not depend heavily on the exact forms used for the priors, i.e. perform reasonably, regardless of what the parameter actually is. (An extreme example of an "informative" prior lacking this kind of "robustness" is one that says that with prior probability 1, θ = 13. The posterior distribution of θ given Y = y says that with posterior probability 1, θ = 13. This is fine as long as the truth is that θ ≈ 13. But if the prior is badly wrong, the Bayes inference will be badly wrong.)

1.4 How Does One Implement the Bayesian Paradigm?

Conceptually, the Bayes paradigm is completely straightforward. Prior and likelihood are "multiplied" to produce something proportional to the posterior. Nothing could be simpler. But the practical problem is making sense of what one ends up with. The questions become "What does a distribution specified by

f(y|θ) g(θ)   or   f(y1, y2|θ) g(θ)   (4)

look like? What are posterior probabilities that it specifies? What are (posterior) means and standard deviations of the unobserved quantities?"

Except for very special circumstances where ordinary freshman/sophomore pencil-and-paper calculus works, making sense of a posterior specified by (4) is a matter of numerical analysis. But numerical analysis (particularly integration) in any more than a very few (2 or 3) dimensions is problematic. (Asymptotic approximations are sometimes mentioned as a possible "solution" to this computational problem. But that possibility is illusory, as large sample approximations for Bayes methods turn out to be fundamentally non-Bayesian (the prior really washes out of consideration for large samples) and it is, after all, the non-asymptotic behavior that is of real interest.) So it might seem


that the discussion has reached an impasse. While the paradigm is attractive, actually using it to do data analysis seems typically impossible.

But there is another way. The basic insight is that one doesn't have to compute with form (4) if one can simulate from form (4). Armed with a large number of realizations of simulations from a posterior, one can do simple arithmetic to approximate probabilities, moments, etc. as descriptors of the posterior. The first impulse would be to look for ways of drawing iid observations from the posterior. Sometimes that can be done. But by far the most powerful development in Bayesian statistics has been methods for doing not iid simulation from a posterior, but rather appropriate so-called "Markov Chain Monte Carlo" simulation. This is finding and using a suitable Markov Chain whose state space is the set of θ or (y2, θ) receiving positive posterior probability and whose empirical distribution of states visited for long runs of the chain approximates the posterior.

People with superior computing skills often program their own MCMC simulations. At the present time, the rest of us typically make use of a free Bayes simulation package called WinBUGS. You are welcome to use any means at your disposal to do computing in Stat 544. In practice, that is likely to mean some combination of WinBUGS and R programming.

2 Some Simulation Methods Useful in Bayesian Computation

There are a number of basic methods of generating realizations from standard simple distributions discussed in Stat 542 that begin from the assumption that one has available a stream of iid U(0, 1) realizations. For example, if U, U1, U2, ... are such iid uniform realizations,

1. F⁻¹(U) for a univariate cdf F has distribution F,

2. −ln(U) is exponential with mean 1,

3. max{integers j ≥ 0 | −Σ_{i=1}^{j} ln(U_i) < λ} is Poisson(λ),

4. I[U < p] is Bernoulli(p),

5. Σ_{i=1}^{n} I[U_i < p] is Binomial(n, p),

6. Z1 = √(−2 ln(U1)) cos(2πU2) and Z2 = √(−2 ln(U1)) sin(2πU2) are iid N(0, 1),

and so on.

In the following introductory discussion, we consider several much more general simulation methods that are widely useful in Bayesian computation, namely

1. rejection sampling,


2. Gibbs (or more properly, successive substitution) sampling,

3. slice sampling,

4. the Metropolis-Hastings algorithm, and

5. Metropolis-Hastings-in-Gibbs algorithms.

The last four of these are MCMC algorithms. Later in the course we will discuss some theory of Markov Chains and why one might expect the MCMC algorithms to work. This initial introduction will be simply concerned with what these methods are and some aspects of their use.

Throughout this discussion, we will concern ourselves with simulation from some distribution for a vector η that is specified by a "density" that is proportional to a function

h(η)

We will not need to assume that h has been normalized to produce integral 1 and therefore already be a density. The fact that we don't have to know the integral of h(η) is an essential point for practical Bayes computation. In Bayesian applications of this material, most often η will be either θ or (θ, Y2), the unknown parameter vector or the parameter vector and some future (possibly vector) observation, h(η) will be respectively either f(y|θ)g(θ) or f(y1, y2|θ)g(θ), and computation of the integral may not be feasible.
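Before turning to the general methods, the uniform-based recipes listed at the opening of this section are easy to sketch in code. The following is a hypothetical Python illustration (the course itself uses WinBUGS and R), with p = 0.3 and n = 20 chosen arbitrarily:

```python
import math, random

random.seed(0)

U = random.random()
exp_draw = -math.log(U)            # item 2: -ln(U) is exponential with mean 1
bern_draw = 1 if U < 0.3 else 0    # item 4: I[U < p] is Bernoulli(p), p = 0.3

# item 5: Binomial(n, p) as a sum of n Bernoulli indicators
n, p = 20, 0.3
binom_draw = sum(random.random() < p for _ in range(n))

# item 6 (Box-Muller): two iid N(0, 1) draws from two iid U(0, 1) draws
U1, U2 = random.random(), random.random()
Z1 = math.sqrt(-2 * math.log(U1)) * math.cos(2 * math.pi * U2)
Z2 = math.sqrt(-2 * math.log(U1)) * math.sin(2 * math.pi * U2)
```

Each recipe turns one or more uniform draws into a draw from the target distribution; the general methods below are needed precisely when no such direct transformation is available.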

2.1 The Rejection Algorithm

Suppose that I can identify a density p(η) (of the same type as a normalized version of h(η)) from which I know how to simulate, and such that

1. p(η) = 0 implies that h(η) = 0, so that the distribution specified by p(η) has support at least as large as that of the distribution specified by h(η), and

2. one knows a finite upper bound M for the ratio

h(η) / p(η)

(this is essentially a requirement that the p(η) tails be at least as heavy as those of h(η), and that one can do the (pencil-and-paper or numerical) calculus necessary to produce a numerical value for M).

Then it is a standard Stat 542 argument to establish that the following works to produce η ~ h(η) (we'll henceforth abuse notation and write "~" when we mean that the variable on the left has a distribution with density proportional to the function on the right):

1. generate η* ~ p(η),


2. generate U ~ U(0, 1) independent of η*, and

3. if

h(η*) / p(η*) ≥ U · M

then set η = η*, otherwise return to step 1.
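The three steps above can be sketched in Python (rather than the course's WinBUGS/R). The target h here is a hypothetical choice, a Beta(3, 5) density up to a constant, with proposal p = U(0, 1):

```python
import random

random.seed(1)

def h(lam):
    # hypothetical unnormalized target: a Beta(3, 5) density up to a constant
    return lam**2 * (1 - lam)**4 if 0.0 <= lam <= 1.0 else 0.0

# with proposal p = U(0, 1) (so p(lam) = 1 on [0, 1]), any M >= max h/p works;
# h peaks at lam = 1/3 with h(1/3) = 16/729 < 0.03
M = 0.03

def rejection_draw():
    while True:
        lam_star = random.random()    # step 1: proposal from p
        u = random.random()           # step 2: U(0, 1) independent of proposal
        if h(lam_star) >= u * M:      # step 3: accept if h/p >= U*M ...
            return lam_star           # ... otherwise propose again

draws = [rejection_draw() for _ in range(20000)]
mean_est = sum(draws) / len(draws)    # Beta(3, 5) has mean 3/8 = 0.375
```

Note that only the unnormalized h is ever evaluated; the normalizing constant of the Beta(3, 5) density plays no role.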

Notice that if I can use the rejection algorithm (repeatedly) to create iid realizations η1, η2, η3, ..., ηN, I can use (sample) properties of

the empirical distribution of η1, η2, η3, ..., ηN   (5)

to approximate properties of the distribution specified by h(η).

An application of this algorithm most naturally relevant to Bayes calculation

is that where p(η) is g(θ) and h(η) is L(θ)g(θ). In this case the ratio h(θ*)/p(θ*) is simply L(θ*), and if one can bound L(θ) by some number M (for example, because an MLE of θ, say θ̂, can be found and one can take M = L(θ̂)), the rejection algorithm becomes:

Generate θ* (a "proposal" for θ) from the prior distribution and accept that proposal with probability L(θ*)/M; otherwise generate another proposal from the prior ...

This may initially seem like a natural and general solution of the Bayes computation problem. But it is not. Both in theoretical and operational terms, there are problems where it is not possible to find a bound for the likelihood. And more importantly (particularly in problems where θ is high-dimensional), even when a bound for the likelihood can be identified, the part of the parameter space where the likelihood is large can be so "small" (can get such tiny probability from the prior) that the acceptance rate for proposals is so low as to make the algorithm unusable in practical terms. (A huge number of iterations and thus huge computing time would be required in order to generate a large sample of realizations.)

The nature of the typical failure of the rejection algorithm in high-dimensional Bayes computation provides qualitative motivation for the MCMC algorithms that can be successful in Bayes computation more generally. Rejection sampling from a posterior would involve iid proposals that take no account of any "success" earlier proposals have had in landing in regions where h(η) is large. It would seem like one might want to somehow "find a place where h(η) is large and move around in η-space, typically generating realizations 'near' or 'like' ones that produce large h(η)." This kind of thinking necessarily involves algorithms that make realized η's dependent. It is essential to the success of modern Bayes analysis that there are ways other than iid sampling (like the next four MCMC algorithms) to create (5) with sample properties approximating those of the distribution specified by h(η).


2.2 Gibbs (or Successive Substitution) Sampling

Suppose now that η is explicitly k-dimensional or is divided into k pieces/sub-vectors (that may or may not each be 1-dimensional), that is, write

η = (η_1, η_2, ..., η_k)

Then with some starting vector

η^0 = (η_1^0, η_2^0, ..., η_k^0)

for j = 1, 2, ... a Gibbs sampler

1. samples η_1^j from h(·, η_2^{j−1}, η_3^{j−1}, ..., η_k^{j−1}),

2. samples η_2^j from h(η_1^j, ·, η_3^{j−1}, ..., η_k^{j−1}),

3. samples η_3^j from h(η_1^j, η_2^j, ·, η_4^{j−1}, ..., η_k^{j−1}),

...

(k−1). samples η_{k−1}^j from h(η_1^j, η_2^j, ..., η_{k−2}^j, ·, η_k^{j−1}), and

k. samples η_k^j from h(η_1^j, η_2^j, ..., η_{k−1}^j, ·)

in order to create η^j from η^{j−1}.

Under appropriate circumstances, for large N, at least approximately

η^N ~ h   (6)

and theoretical properties of the h distribution can be approximated using sample properties of

η^{B+1}, η^{B+2}, ..., η^N   (7)

(for B a number of "burn-in" iterations disregarded in order to hopefully mitigate the effects of an unfortunate choice of starting vector).

Use of this algorithm requires that one be able to make the random draws indicated in each of the steps 1 through k. This is sometimes possible because the indicated "sections" of the function h (h with all but one η_l held fixed) are recognizable as standard densities. Sometimes more clever methods are needed, like use of the rejection algorithm or the "slice sampling" algorithm we will discuss next.

Why one might expect the "Gibbs sampler" to work under fairly general circumstances is something that we will discuss later in the term, as an application of properties of Markov Chains. For the time being, I will present a very small numerical example in class, and then point out what can "go wrong"


in the sense of (6) failing to hold and the empirical properties of (7) failing to approximate properties of h.

The principal failings of the Gibbs sampler occur when there are relatively isolated "islands of probability" in the distribution described by h, leading to "poor mixing" of the record of successive η^j's. Tools for detecting the possibility that the output of the Gibbs algorithm can't be trusted to represent h include:

1. making and comparing summaries of the results for several "widely dispersed" starts for the algorithm (different starts producing widely different results is clearly a bad sign!),

2. making and interpreting "history plots" and computing serial correlations for long runs of the algorithm (obvious jumps on the history plots and important high-order serial correlations suggest that the Gibbs output may not be useful), and

3. the Brooks-Gelman-Rubin statistic and corresponding plots.

As the term goes along, we will discuss these and some of their applications. At this point we only note that all are available in WinBUGS.

2.3 Slice Sampling

The Gibbs sampling idea can be used to sample from a 1-dimensional continuous distribution. In fact, WinBUGS seems to use this idea (called "slice sampling") to do its 1-dimensional updates for non-standard distributions of bounded support (i.e. where the density is 0 outside a finite interval). The "trick" is that in order to sample from a 1-dimensional
$$h(\theta)$$
I implicitly invent a convenient 2-dimensional distribution for $(\theta, V)$ and do what amounts to Gibbs sampling from this distribution to produce
$$(\theta^0, V^0), (\theta^1, V^1), (\theta^2, V^2), \ldots, (\theta^N, V^N)$$
and then for large $N$ use $\theta^N$ as a simulated value for $\theta$.

The slice sampling algorithm begins with some starting vector $(\theta^0, V^0)$ and then for $j = 1, 2, \ldots$ one

1. samples $\theta^j$ from a distribution uniform on $\{\theta \mid h(\theta) \geq V^{j-1}\}$, and

2. samples $V^j$ from the Uniform$(0, h(\theta^j))$ distribution


in order to create $(\theta^j, V^j)$ from $(\theta^{j-1}, V^{j-1})$.

Slice sampling is the Gibbs sampler on a distribution that is uniform on
$$\{(\theta, v) \mid v < h(\theta)\} \subset \mathbb{R}^2$$

The only di¢ cult part of implementing the algorithm is �guring out how toaccomplish step 1. Sometimes it�s possible to do the algebra necessary toidentify the set of ��s indicated in step 1. When it is not, but I know thath (�) is positive only on a �nite interval [a; b], I can instead generate iid U(a; b)realizations, checking the corresponding values of h until I get one larger thanV j�1.It is worth noting that at least in theory (whether the following is practically

e¢ cient is a separate question), the restriction of slice sampling to cases whereh (�) is known to be positive only on a �nite interval [a; b] is not really intrinsic.That is, one may de�ne a smooth strictly monotone transformation : < !(0; 1), use slice sampling to sample from the distribution (�), and then applythe inverse transform to get realizations of � from h (�). Take, for example, thetransformation

(�) =1

1 + exp (��)with inverse transformation

�1 (t) = � ln�1

t� 1�= ln

�t

1� t

�that has derivative

d

dt �1 (t) =

1

t (1� t)If � has pdf proportional to h (�), then (�) has pdf on (0; 1) proportional tothe function of t

h� �1 (t)

�t (1� t) (8)

and one can do slice sampling for (�) as indicated above based on (8) andapply �1 to the result to simulate from h:Together, the rejection algorithm and slice sampling (each with its own lim-

itations) make two ways of implementing one of the k Gibbs updates for thecommon cases where the indicated density is not one of a standard form (i.e. isnot one for which simulation methods are well known).
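The bounded-support version of the slice sampler described above can be sketched in a few lines of Python (this is an illustration of mine, not code from the course; the Beta$(2,2)$ target is an arbitrary example):

```python
import random

def slice_sample(h, a, b, n_iter=2000, seed=0):
    """Slice sampling for a density proportional to h on the finite
    interval [a, b], following the two-step update in the notes:
      1. draw theta uniform on {theta : h(theta) >= v}, here by
         generating U(a, b) candidates until one lands in the slice,
      2. draw v uniform on (0, h(theta))."""
    rng = random.Random(seed)
    theta = (a + b) / 2.0            # starting value
    v = rng.uniform(0.0, h(theta))   # starting "height"
    draws = []
    for _ in range(n_iter):
        # step 1: U(a, b) candidates until h(candidate) >= v
        while True:
            cand = rng.uniform(a, b)
            if h(cand) >= v:
                theta = cand
                break
        # step 2: fresh height under the density at the new theta
        v = rng.uniform(0.0, h(theta))
        draws.append(theta)
    return draws

# example: h proportional to a Beta(2, 2) density on [0, 1]
draws = slice_sample(lambda t: t * (1.0 - t), 0.0, 1.0)
print(sum(draws) / len(draws))  # should be near the true mean 0.5
```

The rejection step in step 1 is exactly the "generate iid U$(a, b)$ realizations until one exceeds $V^{j-1}$" device described above, so the sketch inherits that device's inefficiency when the slice is narrow.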

2.4 The Metropolis-Hastings Algorithm

A second basic MCMC algorithm, alternative or complementary to the Gibbs algorithm, is the so-called Metropolis-Hastings algorithm. It begins from some starting vector $\theta^0$. Then for $j = 1, 2, \ldots$


1. let $J_j(\theta' \mid \theta)$ specify for each $\theta$ a distribution (for $\theta'$) over the part of Euclidean space where $h(\theta') > 0$, called the "jumping" or "proposal" distribution for the $j$th iteration of updating (a distribution that I know how to simulate from), and generate
$$\theta^{j*} \sim J_j\!\left(\cdot \mid \theta^{j-1}\right)$$
as a proposal or candidate for $\theta^j$,

2. compute
$$r_j = \frac{h(\theta^{j*}) / J_j(\theta^{j*} \mid \theta^{j-1})}{h(\theta^{j-1}) / J_j(\theta^{j-1} \mid \theta^{j*})}$$
and generate
$$W_j \sim \text{Bernoulli}\left(\min(1, r_j)\right)$$
and,

3. take
$$\theta^j = W_j\theta^{j*} + (1 - W_j)\theta^{j-1}$$
(i.e. one jumps from $\theta^{j-1}$ to the proposal $\theta^{j*}$ with probability $\min(1, r_j)$ and otherwise stays put at $\theta^{j-1}$).

In contrast to the Gibbs algorithm, this algorithm has the great virtue of requiring only simulation from the proposal distribution (and not from non-standard conditionals of $h$). These can be chosen to be "standard distributions" with well-known fast simulation methods.

The situation where each
$$J_j(\theta' \mid \theta) = J_j(\theta \mid \theta')$$
(i.e. the jumping distributions are symmetric) is especially simple and gives the variant of the algorithm known simply as the "Metropolis Algorithm." Note too that the proposal distributions may depend upon the iteration number and the current iterate, $\theta^{j-1}$. Strictly speaking, they may not depend upon any more of the history of iterates beyond $\theta^{j-1}$. However, it is very common practice to violate this restriction early in a run of an MCMC algorithm, letting the algorithm "adapt" for a while before beginning to save iterates as potentially representing $h$. The idea of this tuning of the algorithm early in a run is to both "get from the starting vector to the 'important part of the distribution'" and to "tune the parameters of the jumping distributions to make the algorithm efficient" (i.e. make the $r_j$'s tend to be large and create frequent jumps).
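A minimal sketch of the symmetric (Metropolis) special case, with a normal random-walk proposal so that the $J_j$ terms cancel in $r_j$ (my own illustration, with a standard normal target as an arbitrary example):

```python
import math
import random

def metropolis(log_h, theta0, step=1.0, n_iter=5000, seed=0):
    """Random-walk Metropolis sketch: normal proposals centered at the
    current iterate are symmetric, so the proposal densities cancel and
    r_j reduces to h(proposal) / h(current). Working on the log scale
    avoids numerical underflow."""
    rng = random.Random(seed)
    theta = theta0
    draws = []
    for _ in range(n_iter):
        cand = theta + rng.gauss(0.0, step)  # symmetric proposal
        log_r = log_h(cand) - log_h(theta)   # log of r_j
        # jump with probability min(1, r_j), else stay put
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta = cand
        draws.append(theta)
    return draws

# example: h proportional to a standard normal density
draws = metropolis(lambda t: -0.5 * t * t, theta0=0.0)
```

The `step` parameter is exactly the kind of tuning knob the adaptation discussion above refers to: too small and the chain takes tiny steps, too large and proposals are rarely accepted.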

2.5 Metropolis-Hastings-in-Gibbs Algorithms

The Gibbs sampler is attractive in that one can use it to break a large simulation problem down into small, manageable chunks, the updating of the $k$ subvectors/pieces of $\theta$. It requires, however, methods of sampling from each of


the ($h$) conditional distributions of a $\theta_l$ given the rest of the $\theta$ vector. This requires the recognition of each conditional as of some convenient parametric form, or the use of the rejection or slice sampling algorithm, or yet something else. Sometimes it's not so easy to find a suitable method for sampling from each of these conditionals.

The Metropolis-Hastings algorithm does not require sampling from any distribution defined directly by $h$, but rather only from proposal distributions that the analyst gets to choose. But, at least as described to this point, it seems that one must deal with the entirety of the vector $\theta$ all at once. As it turns out, this is not necessary. One may take advantage of the attractive features of both the Gibbs and Metropolis-Hastings algorithms in a single MCMC simulation. That is, there are Metropolis-Hastings-in-Gibbs algorithms.

That is, in the Gibbs sampling setup, for the update of any particular subvector $\theta_l$, one may substitute a "Metropolis-Hastings step." In place of

sampling $\theta_l^j$ from $h\!\left(\theta_1^j, \ldots, \theta_{l-1}^j, \cdot\,, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)$ one may

1. let $J_{lj}\!\left(\theta_l' \mid \theta_1, \ldots, \theta_{l-1}, \theta_l, \theta_{l+1}, \ldots, \theta_k\right)$ specify for each $\left(\theta_1, \ldots, \theta_{l-1}, \theta_l, \theta_{l+1}, \ldots, \theta_k\right)$ a distribution (for $\theta_l'$) over the part of Euclidean space where the function of $\theta_l'$,
$$h\!\left(\theta_1, \ldots, \theta_{l-1}, \theta_l', \theta_{l+1}, \ldots, \theta_k\right) > 0$$
and generate
$$\theta_l^{j*} \sim J_{lj}\!\left(\cdot \mid \theta_1^j, \ldots, \theta_{l-1}^j, \theta_l^{j-1}, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)$$
as a proposal or candidate for $\theta_l^j$,

2. compute
$$r_{lj} = \frac{h\!\left(\theta_1^j, \ldots, \theta_{l-1}^j, \theta_l^{j*}, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)}{h\!\left(\theta_1^j, \ldots, \theta_{l-1}^j, \theta_l^{j-1}, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)} \cdot \frac{J_{lj}\!\left(\theta_l^{j-1} \mid \theta_1^j, \ldots, \theta_{l-1}^j, \theta_l^{j*}, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)}{J_{lj}\!\left(\theta_l^{j*} \mid \theta_1^j, \ldots, \theta_{l-1}^j, \theta_l^{j-1}, \theta_{l+1}^{j-1}, \ldots, \theta_k^{j-1}\right)}$$
and generate
$$W_{lj} \sim \text{Bernoulli}\left(\min(1, r_{lj})\right)$$
and,

3. take
$$\theta_l^j = W_{lj}\theta_l^{j*} + (1 - W_{lj})\theta_l^{j-1}$$
(i.e. one jumps from $\theta_l^{j-1}$ to the proposal $\theta_l^{j*}$ with probability $\min(1, r_{lj})$ and otherwise stays put at $\theta_l^{j-1}$).

This kind of algorithm is probably the most commonly used MCMC algorithm in modern Bayesian computation, at least where people do their own programming instead of relying on WinBUGS.
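The coordinate-at-a-time scheme above can be sketched as follows. This is my own illustration: it uses symmetric normal random-walk proposals for each coordinate (so the $J_{lj}$ terms cancel in $r_{lj}$), and the bivariate normal target is an arbitrary example.

```python
import math
import random

def mh_in_gibbs(log_h, theta0, step=0.5, n_iter=4000, seed=0):
    """Metropolis-Hastings-in-Gibbs sketch: visit the coordinates of
    theta in turn, and update each one with a symmetric random-walk
    Metropolis step that evaluates the full joint log density log_h at
    the partially updated vector."""
    rng = random.Random(seed)
    theta = list(theta0)
    draws = []
    for _ in range(n_iter):
        for l in range(len(theta)):
            cand = theta[:]
            cand[l] = theta[l] + rng.gauss(0.0, step)
            log_r = log_h(cand) - log_h(theta)  # symmetric proposal: J's cancel
            if log_r >= 0.0 or rng.random() < math.exp(log_r):
                theta = cand                    # accept the coordinate update
        draws.append(theta[:])
    return draws

# example: h proportional to a bivariate normal density with unit
# variances and correlation 0.5
def log_h(t):
    x, y = t
    return -(x * x - x * y + y * y) / 1.5

draws = mh_in_gibbs(log_h, [0.0, 0.0])
```

Each inner-loop pass is one "Metropolis-Hastings step" substituted for a Gibbs conditional draw; only evaluations of $h$, never sampling from its conditionals, are required.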


3 The Practice of Modern Bayes Inference 1: Some General Issues

We now take for granted computing algorithms for approximating a posterior distribution via MCMC and consider a series of issues in the practical application of the Bayes paradigm.

3.1 MCMC Diagnostics

For purposes of being in a position to detect whether there are potential problems with "poor mixing"/"islands of probability" in an MCMC simulation from a posterior (or posterior/predictive posterior) distribution, it is standard practice to:

1. pick several widely dispersed and perhaps even "unlikely under the posterior" starting vectors for posterior MCMC iterations,

2. run several (say m) chains in parallel from the starting points in 1.,

3. monitor these several chains until "transient" effects of the starting vectors wash out and they start to have "similar" behaviors, i.e. monitor them until they "burn in," and

4. use for inference purposes only simulated $\theta$ and/or $Y_2$ values coming from iterations after burn-in.

The question is how one is to judge if and when burn-in has taken place.

A fairly qualitative way of trying to assess burn-in is to visually monitor "history plots" (of all parallel chains on a given plot) of individual coordinates of $\theta$ and/or $Y_2$. (These are simply plots of values of the coordinate against iteration number, with consecutive points for a given chain connected by line segments.) WinBUGS allows one to run multiple chains and make such plots with each chain getting a different color on the plot. One simply waits until these look "alike" to the statistically practiced eye.

A more or less quantitative tool for judging when burn-in has occurred is the "Gelman-Rubin statistic" and related plots, implemented in WinBUGS in a variant form called the "BGR" (Brooks-Gelman-Rubin) statistic and plots. The original version of the idea (discussed in the textbook) is the following. Let $\psi$ stand for some coordinate of $\theta$ and/or $Y_2$ (possibly after "transformation to normality"). Beginning after some number of iterations of MCMC simulations, let
$$\psi_i^j = j\text{th saved iterate of } \psi \text{ in chain } i \quad \text{for } i = 1, 2, \ldots, m \text{ and } j = 1, 2, \ldots, n$$

If burn-in has occurred, I expect that the set of $\psi_i^j$ obtained from each chain $i$ will "look like" the set of $\psi_i^j$ obtained from pooling across all chains. Ways


of measuring the extent to which this is true can be based on within-chain and grand means
$$\bar{\psi}_i = \frac{1}{n}\sum_{j=1}^{n}\psi_i^j \quad \text{and} \quad \bar{\psi}_{\cdot} = \frac{1}{m}\sum_{i=1}^{m}\bar{\psi}_i$$
and within-chain sample variances and a pooled version of these
$$s_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(\psi_i^j - \bar{\psi}_i\right)^2 \quad \text{and} \quad W = \frac{1}{m}\sum_{i=1}^{m}s_i^2$$
and a kind of between-chain variance
$$B = \frac{n}{m-1}\sum_{i=1}^{m}\left(\bar{\psi}_i - \bar{\psi}_{\cdot}\right)^2$$
$W$ and $B$ are, in fact, respectively the "One-Way ANOVA" error and treatment (within and between) mean squares from a One-Way analysis with "chains" as "treatments." The Gelman-Rubin statistic based on these quantities is
$$\hat{R}_n = \sqrt{\frac{n-1}{n} + \frac{1}{n}\left(\frac{B}{W}\right)}$$
If each chain's record begins to "look like" a random sample from the same distribution as $n \to \infty$, $\hat{R}_n$ should approach 1. If the records of the $m$ chains "look different" one should expect $\hat{R}_n$ to stay larger than 1 with increasing $n$. (One plots $\hat{R}_n$ against $n$.) (Note also in passing that the ratio $B/W$ is exactly the one-way ANOVA $F$ statistic for this problem.)

The Brooks-Gelman modification of this idea implemented in WinBUGS is as

follows. Let
$$L_i^n = \text{the lower 10\% point of the } n \text{ values } \psi_i^j \text{ (from chain } i)$$
$$U_i^n = \text{the upper 10\% point of the } n \text{ values } \psi_i^j \text{ (from chain } i)$$
$$L^n = \text{the lower 10\% point of the } nm \text{ values } \psi_i^j \text{ (from all chains)}$$
$$U^n = \text{the upper 10\% point of the } nm \text{ values } \psi_i^j \text{ (from all chains)}$$
Then plotted versus $n$ in WinBUGS are 3 quantities:

1. $(U^n - L^n)/c$ plotted in green,

2. $\left(\frac{1}{m}\sum_{i=1}^{m}(U_i^n - L_i^n)\right)/c$ plotted in blue, and

3. $(U^n - L^n)/\left(\frac{1}{m}\sum_{i=1}^{m}(U_i^n - L_i^n)\right)$ plotted in red.

The idea is that the value in red needs to approach 1 and the values plotted in green and blue need to stabilize. The constant $c$ used in 1. and 2. is chosen to make the largest plotted green or blue value 1. The WinBUGS


manual says that "bins of 50 are used" and I believe that this means that the computation and plotting is done at multiples of 50 iterations.

Ideally, properly burned-in history plots look like patternless "white noise" (iid observations) plots. When instead they show (similar across the chains) behavior that might be characterized as "slow drift," one is faced with a situation where long MCMC runs will be necessary if there is any hope of adequately representing the posterior. In some sense, one has many fewer "observations" from the posterior than one has iterations. A "slowly drifting MCMC record" means that values of a coordinate of $\theta$ and/or $Y_2$ change only slowly. This can be measured in terms of how fast serial correlations in the MCMC records fall off with lag. For example, suppose $\theta_1$ is the first coordinate of the parameter vector $\theta$ and that the $j$th MCMC iterate of this variable is $\theta_1^j$. One might compute the sample correlation between the first and second coordinates of ordered pairs
$$\left(\theta_1^j, \theta_1^{j+s}\right)$$
for $s = 1, 2, 3, \ldots$ (for $j$ after burn-in) as a measure of "lag-$s$ serial correlation" in the $\theta_1$ record. Nontrivial positive serial correlations for large $s$ are indicative of "slow drift"/"poor mixing" in the simulation and the necessity of long runs for adequate representation of the posterior.

3.2 Considerations in Choosing Priors

How does one choose a prior distribution? The answer to this question is obviously critical to Bayes analysis, and must be faced before one can even "get started" in an application. A couple of points are obvious at the outset. In the first place, a posterior can place probability on only those parts of a parameter space where the prior has placed probability. So unless one is absolutely "sure" that some subset of $\theta$'s simply can not contain the actual parameter vector, it is dangerous to use a prior distribution that ignores that set of parameters. (Unless I am willing to take poison on the proposition that $\theta < 13$, I should not use a prior that places 0 probability on the event that $\theta \geq 13$.) Secondly, all things being equal, if several different choices of prior produce roughly the same posterior results (and in particular, if they produce results consistent with those derivable from non-Bayesian methods) any of those priors might be thought of as attractive from a "robustness" perspective.

3.2.1 "The Prior"

A real philosophical Bayesian would find the previous statement to be heretical-to-irrelevant. That is, for a card-carrying Bayesian, there is only one "true" prior, that reflects his or her carefully considered prior opinions about $\theta$. This probability structure is unashamedly personal and beyond criticism on any other than logical or philosophical grounds. Bayesians have put a fair amount of effort into developing theory and tools for the "elicitation of prior beliefs" and would argue that the way one ought to get a prior is through the careful use


of these. While this logical consistency is in some respects quite admirable, I am unconvinced that it can really be pushed this far in a practical problem. However, you are invited to investigate this line of thought on your own. We will take more eclectic approaches in this set of notes.

3.2.2 Conjugate Priors

Before the advent of MCMC methods, there was a particular premium placed on priors for which one can do posterior calculations with pencil-and-paper calculus, and "conjugate" priors were central to applications of the Bayes paradigm. That is, some simple forms of the likelihood $L(\theta) = f(y \mid \theta)$ themselves look like all or parts of a density for $\theta$. In those cases, it is often possible to identify a simple prior $g(\theta)$ that when multiplied by $L(\theta)$ produces a function that by simple inspection can be seen to be "of the same family or form as $g(\theta)$." (For example, a Binomial likelihood multiplied by a Beta prior density produces a product proportional to a different Beta density.) When this is the case and the form of $g(\theta)$ is simple, posterior probability calculations can be done without resort to MCMC simulation. The jargon for this kind of nice interaction between the form of a likelihood and a convenient prior form is that the prior is a "conjugate" prior.

These days, conjugate priors are important not so much because they are

the only ones for which posterior computations can be done (MCMC methods have removed the necessity of limiting consideration to posteriors that yield to pencil-and-paper calculus), but rather because the explicit formulas that they can provide often enlighten the search for priors that are minimally informative/robust. In fact, many useful "non-informative" prior distributions can be seen to be limits (as one sends parameters of a prior to some extreme) of conjugate priors. (For example, where the elements of $Y$ are iid N$(\mu, 1)$ variables, one conjugate prior for $\mu$ is the N$(0, \sigma^2)$ distribution, and in some sense the "limit" of this prior as $\sigma^2 \to \infty$ is the "uniform on $\mathbb{R}$" improper prior. This is in many respects an attractive non-informative choice for this problem.)

3.2.3 "Flat"/"Di¤use"/"Non-Informative"/"Robust" Priors

The notions of prior "diffuseness," "flatness," and "non-informativeness"/"robustness" are not really terribly concrete concepts. What one hopes to achieve in the search for priors that might be described in these ways is fairly clear: posterior distributions that behave sensibly no matter what be $\theta$. But it is worth saying explicitly here that whether a prior "looks" flat or diffuse is dependent upon the particular parameterization that one adopts, and thus whether a flat/diffuse choice of prior will function in a robust/non-informative way is not obvious from simply examining its shape.

For example, consider a hypothetical inference problem with parameter $p \in (0, 1)$. One "flat"/"diffuse" prior for a Bayes problem involving $p$ would be a U$(0, 1)$ prior for $p$. But an alternative parameterization for the problem might


be in terms of the log-odds
$$\lambda = \ln\left(\frac{p}{1-p}\right)$$
and a "flat" improper prior for $\lambda$ is "uniform on $\mathbb{R}$." These are not equivalent specifications. For example, the first says that the prior probabilities assigned to the intervals $(.5, .6)$ and $(.6, .7)$ are the same, while the second says that the (improper) prior weights assigned to these sets of $p$'s are in the ratio
$$\frac{\ln\left(\frac{.6}{1-.6}\right) - \ln\left(\frac{.5}{1-.5}\right)}{\ln\left(\frac{.7}{1-.7}\right) - \ln\left(\frac{.6}{1-.6}\right)} = .9177$$
Whether either of these priors will function in a "non-informative" way in a Bayes analysis is not obvious from their qualitative "flatness"/"diffuseness" evident from simple inspection.
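The ratio displayed above is a one-line computation; a quick check of the arithmetic:

```python
import math

def logit(p):
    """Log-odds transformation ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# ratio of the improper prior weights that "flat on the log-odds scale"
# assigns to the intervals (.5, .6) and (.6, .7)
ratio = (logit(0.6) - logit(0.5)) / (logit(0.7) - logit(0.6))
print(round(ratio, 4))  # → 0.9177
```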

3.2.4 Jeffreys Priors

In the case that $\theta$ is 1-dimensional, there is a standard method due to H. Jeffreys for identifying a prior (or improper prior) that often turns out to be operationally "non-informative." That is this. Associated with a likelihood $f(y \mid \theta)$ (differentiable in $\theta$) is the Fisher Information (a function of $\theta$)
$$I_Y(\theta) = \mathrm{E}_\theta\left(\frac{d}{d\theta}\ln f(Y \mid \theta)\right)^2$$
It is well known that sometimes (but not always) the Fisher Information may also be computed as
$$-\mathrm{E}_\theta\,\frac{d^2}{d\theta^2}\ln f(Y \mid \theta)$$
In any case, the Jeffreys prior for a Bayes analysis involving this likelihood is specified by
$$g(\theta) \propto \sqrt{I_Y(\theta)} \quad (9)$$

An especially attractive feature of this prescription is that it is invariant to monotone reparameterization. So one may speak of "the" Jeffreys prior for the problem without ambiguity. That is, for a monotone function $u(\theta)$, consider a second parameterization of this problem with parameter
$$\gamma = u(\theta)$$
With prior (say) pdf (9), Stat 542 transformation theorem material shows that $\gamma$ has pdf proportional to
$$\sqrt{I_Y(u^{-1}(\gamma))}\,\left|\frac{1}{u'(u^{-1}(\gamma))}\right| \quad (10)$$
But the information in $y$ about $\gamma$ is (for $\dot{f}(Y \mid \theta)$ the partial derivative of $f(Y \mid \theta)$ with respect to $\theta$)
$$\mathrm{E}_{u^{-1}(\gamma)}\left(\frac{d}{d\gamma}\ln f\!\left(Y \mid u^{-1}(\gamma)\right)\right)^2 = \mathrm{E}_{u^{-1}(\gamma)}\left(\frac{\dot{f}\!\left(Y \mid u^{-1}(\gamma)\right)\frac{\partial}{\partial\gamma}u^{-1}(\gamma)}{f\!\left(Y \mid u^{-1}(\gamma)\right)}\right)^2 = I_Y\!\left(u^{-1}(\gamma)\right)\frac{1}{\left(u'(u^{-1}(\gamma))\right)^2} \quad (11)$$
Clearly, the square root of rhs(11) is the pdf (10) that $\gamma$ inherits from the assumption (9) that $\theta$ has a Jeffreys prior.
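As a worked example of prescription (9) (my addition, not in the original notes): for a single Binomial$(n, p)$ observation $Y$,

```latex
\ln f(y \mid p) = \text{const} + y \ln p + (n - y)\ln(1 - p)
\quad\Rightarrow\quad
I_Y(p) = -\mathrm{E}_p\,\frac{d^2}{dp^2}\ln f(Y \mid p)
       = \frac{np}{p^2} + \frac{n(1-p)}{(1-p)^2}
       = \frac{n}{p(1-p)}
```

so the Jeffreys prior is $g(p) \propto \sqrt{n / (p(1-p))} \propto p^{-1/2}(1-p)^{-1/2}$, i.e. a (proper) Beta$(1/2, 1/2)$ distribution.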

3.3 Considerations in Choice of Parametrization

The necessity of specifying a prior distribution for $\theta$ and then sampling from a posterior for it probably causes one to think harder about the most convenient way to parameterize a model for $Y$ than might otherwise be necessary. We proceed to make several observations about the issue of parameterization.

3.3.1 Identifiability

A basic requirement for sensible inference, Bayesian or non-Bayesian, is that any two different parameter vectors $\theta$ and $\theta'$ correspond to genuinely different distributions for $Y$. But it is not impossible to fail to recognize that one has violated this basic sanity requirement. When this happens, MCMC simulations can behave in seemingly inexplicable ways.

For example, consider a mixture problem, where one is presented with iid observations, which each are N$(\mu_1, 1)$ with probability $\pi$ and N$(\mu_2, 1)$ with probability $(1 - \pi)$. As just stated (with the implicit choice of parameter space $\mathbb{R} \times \mathbb{R} \times (0, 1)$ for $(\mu_1, \mu_2, \pi)$) this model is not identifiable. The parameter vectors $(0, 1, .7)$ and $(1, 0, .3)$ produce the same distribution for the data. MCMC for "obvious" choices of prior in this problem will behave in what seem to be "odd" ways. One needs to somehow either reduce the parameter space to something like
$$\{(\mu_1, \mu_2, \pi) \mid \mu_1 < \mu_2 \text{ and } 0 < \pi < 1\} \quad (12)$$
and place a prior on that subset of $\mathbb{R}^3$ or find an alternative parameterization. For example, one might think of
$$\mu_1 = \text{the smaller of the two means}$$
and set
$$\delta = \mu_2 - \mu_1$$
(so that $\mu_2 = \mu_1 + \delta$) and do the inference in terms of $(\mu_1, \delta, \pi)$ rather than $(\mu_1, \mu_2, \pi)$ directly. Note then that the parameter space becomes $\mathbb{R} \times (0, \infty) \times (0, 1)$ and that choosing a prior over this space seems less complicated than making a choice of one over (12).


3.3.2 Gibbs and Posterior Independence

In terms of efficiency/properly representing a posterior in as few iterations as possible, Gibbs-like algorithms work best when the subvectors of $\theta$ being updated in turn are (according to the posterior) roughly independent. When the posterior portrays strong dependencies, the range of each update is in effect limited substantially by the dependence, and Gibbs algorithms tend to take very small steps through the parameter space and thus take a large number of iterations to "cover the parameter space" adequately.

This means that all other things being equal, for purposes of efficient computation, one prefers parameterizations with product parameter spaces, and that tend to produce likelihoods that as functions of $\theta$ do not contribute to posterior dependencies. (To the extent that large sample loglikelihoods tend to be approximately quadratic with character determined by the corresponding Fisher information matrix, one prefers parameterizations with essentially diagonal Fisher information matrices.) And again for purposes of computational efficiency (at least if a prior is not going to be effectively "overwhelmed" by a "large sample" likelihood), priors of independence for such parameterizations seem most attractive.

This discussion suggests that at least from a computational standpoint, the parameter space (12) discussed above is less attractive than the $\mathbb{R} \times (0, \infty) \times (0, 1)$ product space associated with the second ($(\mu_1, \delta, \pi)$) parameterization. A second, very familiar, example relevant to this discussion is that of simple linear regression. The simple linear regression Fisher information matrix typically fails to be diagonal in the usual parameterization where the regression coefficients are the slope and $y$ intercept (the mean value of $y$ when $x = 0$). However, if instead of using raw values of covariates one centers them, so that the regression coefficients become the slope and the mean value of the response when the covariate is at its sample mean value ($\bar{x}$), this potential computational complication disappears.

3.3.3 Honoring Restrictions Without Restricting Parameters

The most convenient/straightforward way of specifying a high-dimensional prior distribution is by making an independence assumption and specifying only marginal distributions for coordinates of $\theta$ on some product space. That makes parameter spaces like (12) that involve some restrictions in a product space problematic. There are at least 2 ways of getting around this unpleasantness. First, one might look for alternate parameterizations that simply avoid the difficulty altogether. (In the mixture example, this is the approach of using $(\mu_1, \delta, \pi)$ instead of the original $(\mu_1, \mu_2, \pi)$ parameterization.) A second possibility (that might not work in the mixture problem, but will work in other contexts) is to ignore the restrictions and use a prior of independence on a product space for purposes of running an MCMC algorithm, but to "post-process" the MCMC output, deleting from consideration vectors from any iteration whose vector violates the restrictions. For example, in a problem where a parameter vector $(p_1, p_2, p_3) \in (0, 1)^3$ must satisfy the order restriction $p_1 \leq p_2 \leq p_3$, one


might adopt and use in an MCMC algorithm independent Beta priors for each $p_i$. After-the-fact using only those simulated values whose vectors $(p_1, p_2, p_3)$ satisfy the order restrictions essentially then employs a prior with density proportional to the product of Beta densities but restricted to the part of $(0, 1)^3$ where the order restriction holds. (Using WinBUGS and R this can be accomplished by saving the WinBUGS results using the Coda option, turning them into a text file, and loading the text file into R using the Coda package for post-processing.)

3.4 Posterior (Credible) Intervals

A posterior distribution for $\theta$ (or for $(Y_2, \theta)$) is often summarized by making representations of the corresponding marginal (posterior) distributions. For sake of discussion here, let $\eta$ stand for some 1-dimensional element of $\theta$ (or $(Y_2, \theta)$). Upon finishing an MCMC simulation from a posterior one has a large number of realizations of $\eta$, say $\eta^1, \eta^2, \ldots, \eta^N$. These can be summarized in terms of a histogram, or in the case that $\eta$ is a continuous variable, with some kind of estimated probability density (WinBUGS provides such density estimates). It is also common to compute and report standard summaries of these values: the sample mean, sample median, sample standard deviation, and so on.

Probably the most effective way of conveying where most of the posterior probability is located is through the making and reporting of posterior probability intervals, or so-called Bayesian "credible intervals." The simplest of these are based on (approximate) quantiles of the marginal posterior. That is, if
$$\eta_{.025} = \text{the } .025 \text{ quantile of } \eta^1, \eta^2, \ldots, \eta^N$$
and
$$\eta_{.975} = \text{the } .975 \text{ quantile of } \eta^1, \eta^2, \ldots, \eta^N$$
then the interval
$$[\eta_{.025}, \eta_{.975}]$$
encloses posterior probability $.95$ (at least approximately) and can be termed a 95% credible interval for $\eta$. It might be thought of as a Bayes alternative to a "classical" 95% confidence interval (though there is no guarantee at all that the method that produced it is anything like a 95% confidence procedure).

A theoretically better/smaller construction of credible sets is the "highest

posterior density" (hpd) construction. That is, rather than using quantiles to identify a credible interval, one might look for a number $c$ so that with $g(\eta \mid y)$ the posterior marginal density of $\eta$, the set
$$\{\eta \mid g(\eta \mid y) > c\} \quad (13)$$
has posterior probability $.95$. That set is then the smallest one that has posterior probability content $.95$, and can be called the "95% highest posterior density credible set for $\eta$."

Unless $g(\eta \mid y)$ is unimodal, there is no guarantee that the hpd construction will produce an interval. And unless $g(\eta \mid y)$ has a simple analytic form, it may not be easy to identify the set (13). Further, while the use of quantiles to make credible intervals is invariant under monotone transformations of the parameter, the result of using the hpd construction is not. (This is really a manifestation of the same phenomenon that makes apparent "flatness" of a prior dependent upon the particular parameterization one adopts.) For these reasons, the quantile method of producing intervals is more common in practice than the hpd construction.
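The quantile construction is easily computed from saved simulations. A sketch of mine, using a simple empirical-quantile rule (there are several reasonable conventions for picking the order statistics):

```python
def quantile_interval(draws, level=0.95):
    """Equal-tail credible interval from MCMC realizations of a scalar,
    via empirical quantiles: the (1 - level)/2 and (1 + level)/2
    quantiles of the sorted draws."""
    s = sorted(draws)
    n = len(s)
    lo = s[int((1.0 - level) / 2.0 * n)]
    hi = s[int((1.0 + level) / 2.0 * n) - 1]
    return lo, hi

# e.g. with stand-in "draws" 1, 2, ..., 1000 the interval runs from
# near the .025 to near the .975 empirical quantile
print(quantile_interval(list(range(1, 1001))))  # → (26, 975)
```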

3.5 Bayes Model Diagnostics and Bayes Factors for Model Choice

Since a Bayes statistical model is simply an "ordinary" statistical model with the addition of a prior $g(\theta)$, any kind of "model checking" appropriate in a non-Bayesian context (e.g. residual plotting, etc.) is equally appropriate in the Bayes context. The new feature present in the Bayes context is what the prior does. Specifically Bayesian model checking is usually approached from the point of view of "posterior predictive distribution checking." That is, if the density of the observable $Y$ is
$$f(y \mid \theta)$$
let $Y^{\text{new}}$ also have this density and (conditional on $\theta$) be independent of the observable $Y$. So the joint density of all of $(Y, Y^{\text{new}}, \theta)$ is proportional to
$$f(y \mid \theta)\, f(y^{\text{new}} \mid \theta)\, g(\theta)$$
and one can make posterior (to $Y = y$) simulations of $Y^{\text{new}}$. One can then ask whether the data in hand, $y$, look anything like the simulated values. Chapter 6 of the text discusses some ways of assessing this numerically and graphically. For my money, this strikes me as "stacked in favor of concluding that the Bayes analysis is OK." Roughly speaking, the posterior uncertainty in $\theta$ will have the effect of making the posterior predictive distribution of $Y^{\text{new}}$ more spread out than any single $f(y \mid \theta)$ for a fixed $\theta$. So it seems rare that one will get posterior predictive simulations that fail to "cover" the observed data, unless there is some huge blunder in the modeling or simulation.

A somewhat different question is how one might compare the appropriateness of several (either nested or un-nested) Bayes models for an observable $Y$. So called "Bayes factors" have been offered as one means of doing this. Suppose $m$ different models have densities $f_i(y \mid \theta_i)$ of the same type, where the parameters $\theta_i$ take values in (possibly different) $k_i$-dimensional Euclidean spaces, $\mathbb{R}^{k_i}$, and priors are specified by densities (improper densities are not allowed in this development) $g_i(\theta_i)$. Each of these models produces a (marginal) density for $Y$,
$$f_i(y) = \int f_i(y \mid \theta_i)\, g_i(\theta_i)\, d\theta_i$$


(where, as usual, the indicated integral is a Riemann integral, a sum, or some combination of the two). One might then look at
$$BF_{i'i} = \frac{f_{i'}(y)}{f_i(y)} \quad (14)$$
as an appropriate statistic for comparing models $i$ and $i'$.

In Neyman-Pearson testing of
$$\text{H}_0: \text{the correct model is model } i \ (Y \sim f_i)$$
versus
$$\text{H}_a: \text{the correct model is model } i' \ (Y \sim f_{i'})$$
$BF_{i'i}$ is the optimal test statistic (is the "likelihood ratio"). Further, if one sets prior probabilities on models 1 through $m$, say $p_1, p_2, \ldots, p_m$, the posterior probability for model $i$ is
$$\frac{p_i f_i(y)}{\sum_{l=1}^{m} p_l f_l(y)}$$
so that the posterior "odds ratio" for models $i'$ and $i$ is
$$\frac{p_{i'} f_{i'}(y)}{p_i f_i(y)} = \left(\frac{p_{i'}}{p_i}\right) BF_{i'i}$$
which is the prior odds ratio times the Bayes factor.

Notice that one is typically not going to be able to do the calculus necessary

to compute the fi (y) needed to �nd Bayes factors. But often (especiallybecause it is common to make independence assumptions between coordinatesof � in specifying priors) it�s easy to generate

$$\theta_i^1, \theta_i^2, \ldots, \theta_i^n \text{ that are iid } g_i(\theta_i)$$

Then the law of large numbers implies that

$$\frac{1}{n}\sum_{l=1}^n f_i\!\left(y|\theta_i^l\right) \xrightarrow{P} f_i(y)$$

from which one can get approximate values for Bayes factors.

How to interpret Bayes factors has been a matter of some dispute. One set of qualitative interpretations suggested by Jeffreys for a Bayes factor $BF_{21}$ is

- $BF_{21} > 1$ favors model 2,
- $0 > \log_{10} BF_{21} > -\tfrac{1}{2}$ provides minimal evidence against model 2,
- $-\tfrac{1}{2} > \log_{10} BF_{21} > -1$ provides substantial evidence against model 2,
- $-1 > \log_{10} BF_{21} > -2$ provides strong evidence against model 2, and
- $-2 > \log_{10} BF_{21}$ provides decisive evidence against model 2.
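The law-of-large-numbers approximation of $f_i(y)$ above is easy to implement. The sketch below uses made-up models (a single Poisson count compared under two different Exponential priors on its mean, chosen because the exact marginals $f(y) = \beta/(1+\beta)^{y+1}$ are available in closed form to check against); none of the specifics are from the course.

```python
import math
import random

def poisson_pmf(y, lam):
    """Poisson likelihood f(y | lam)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def marginal_mc(y, prior_draw, n=100_000):
    """Monte Carlo estimate of f_i(y) = (1/n) * sum_l f(y | theta_i^l)."""
    return sum(poisson_pmf(y, prior_draw()) for _ in range(n)) / n

random.seed(544)
y = 4  # the single observed count (made up)

# Model 1: lambda ~ Exponential(rate 1); Model 2: lambda ~ Exponential(rate 0.2).
f1_hat = marginal_mc(y, lambda: random.expovariate(1.0))
f2_hat = marginal_mc(y, lambda: random.expovariate(0.2))
bf21_hat = f2_hat / f1_hat

# Exact marginals for an Exponential(beta) prior: f(y) = beta / (1 + beta)^(y+1).
bf21_exact = (0.2 / 1.2 ** (y + 1)) / (1.0 / 2.0 ** (y + 1))
```

With $n = 100{,}000$ prior draws the Monte Carlo Bayes factor lands within a few percent of the exact value.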

One variant on this Bayes factor idea arises when comparing a Bayes model for observable $Y$ (say model 1) to a Bayes model for observable $S = s(Y)$, where $s(\cdot)$ is a 1-1 function (say model 2). That is, suppose that what is to be compared are models specified by

$$f_1(y|\theta_1) \text{ and } g_1(\theta_1)$$

and by

$$h(s|\theta_2) \text{ and } g_2(\theta_2)$$

Now the ratio that is a Bayes factor involves two marginal densities for the same observable. So in this case we must express both models in terms of the same observable. That requires remembering what was learned in Stat 542 about distributions of transformations of random variables. In the case that $Y$ is discrete, it is easy enough to see that

$$f_2(y|\theta_2) = h(s(y)|\theta_2)$$

so that

$$BF_{21} = \frac{\int h(s(y)|\theta_2)\, g_2(\theta_2)\, d\theta_2}{\int f_1(y|\theta_1)\, g_1(\theta_1)\, d\theta_1}$$

And in the case that $Y$ is continuous, for $J_s(y)$ the Jacobian of the transformation $s$, the probability density for $y$ under model 2 is

$$f_2(y|\theta_2) = |J_s(y)|\, h(s(y)|\theta_2)$$

so that

$$BF_{21} = |J_s(y)|\, \frac{\int h(s(y)|\theta_2)\, g_2(\theta_2)\, d\theta_2}{\int f_1(y|\theta_1)\, g_1(\theta_1)\, d\theta_1}$$

3.6 WinBUGS, Numerical Problems, Restarts, and "Tighter Priors"

In complicated problems it is not uncommon for WinBUGS to stop in the middle of a simulation and report having numerical problems. It is rarely clear from the diagnostics the program provides exactly what has gone wrong. One can usually restart the simulation from the previous iterate (often after several attempts) and continue on in the simulation. The WinBUGS documentation suggests "tighter"/more informative priors as a general "fix" for this kind of problem. It is worth thinking about (even in the face of complete ignorance of numerical details) what could be the implications of this kind of difficulty, what might happen if one "ignores it" and routinely restarts the simulation, and the implication of following the manual's advice.

Getting the WinBUGS error warning is an indication that there is some part of the $\theta$ or $(y_2, \theta)$ space that gets nontrivial posterior probability and where the current implementation of some evaluation of some function or some updating algorithm breaks down. One could hope that in the most benign possible situation, this part of the space is some "relatively small/unimportant isolated corner of the space" and that a strategy of just blindly restarting the simulation will effectively replace the real posterior with the posterior conditioned on being in the "large/important part" of the space. (That counts on restarts from "just inside the 'good' part of the space and restricted to landing in the 'good' part" being equivalent to steps into the good part from inside the 'bad' part.)

Of course there are also less benign possibilities. Consider, for example, the possibility that the region where there are numerical problems serves as a boundary between two large and equally important parts of the $\theta$ or $(y_2, \theta)$ space. It's possible that one would then only see realizations from the part in which the chain is started, and thus end up with a completely erroneous view of the nature of the posterior. And it's not clear that there is really any way to tell whether the difficulty that one faces is benign or malignant.

The WinBUGS "fix" for this problem is a "fix" only in that it restricts the part of the $\theta$ or $(y_2, \theta)$ space that gets nontrivial posterior probability, and thereby keeps the sampler from getting into trouble. That is helpful only if one decides that a less diffuse prior really is adequate/appropriate in the context of the application. At the end of the day, the "real" fix for this kind of problem is doing one's own MCMC coding, so that there is a chance of understanding exactly what has happened when something does go wrong.

3.7 Auxiliary Variables

Return to the notation of the exposition of the Gibbs, Metropolis-Hastings, and Metropolis-Hastings-in-Gibbs algorithms. It can sometimes be advantageous to simulate not only realizations of $\theta$, but realizations of $(\theta, \psi)$ for some additional (typically vector) unobserved variable $\psi$. That is, suppose that $r(\psi|\theta)$ is a conditional density for $\psi$. Rather than doing MCMC from

$$h(\theta)$$

it can be more effective to do MCMC from

$$r(\psi|\theta)\, h(\theta)$$

and then simply ignore the values of $\psi$ so generated, using the $\theta$'s to approximate properties of the ($h(\theta)$) marginal of the joint distribution of $(\theta, \psi)$. As a matter of fact, slice sampling is an example of this idea. But it is also more generally helpful, and is related to the idea of data augmentation used in the famous EM algorithm for maximization of a likelihood.

One nice application of the idea is in the analysis of interval-censored data from a continuous distribution belonging to some parametric family. In this context, $\theta$ consists of the parameter vector, and the likelihood is a product of probabilities (depending on $\theta$) of the intervals in which a sample of observations are known to lie (one term for each observation). But the Bayes analysis would typically be simpler if one had instead the exact values of the observations, and could use a likelihood that is the product of the density values for the observations. The application of the auxiliary variables idea is then to let $\psi$ consist of the unobserved sample of exact values. In a Gibbs algorithm, when one is updating the parameter vector, one gets to use the posterior based on the exact values (instead of the typically more complicated posterior based on the identities of the intervals corresponding to the observations). The updates on the exact values are made using the (fixed-parameter) conditional distributions over the intervals in which they are known to lie.

Another helpful application of the auxiliary variables idea is in the analysis of mixture data. That is a context where one has several parametric forms and assumes that data in hand are iid from a weighted (with positive weights summing to 1) average of these. The objects of interest are usually the weights and the parameters of the constituent distributions. A way of using auxiliary variables is to conceive of the individual observations as produced by a two-stage process, where first one of the constituents is chosen at random according to the weights, and then the observation is generated from that constituent distribution. The "constituent identities" of all observations then become helpful auxiliary variables.

Finally, any "missing data" problem where one would naturally model an entire vector of observations but actually gets to observe only part of the vector is a candidate for use of the auxiliary variables idea. The missing or unobserved values are the obvious auxiliary variables.
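For the interval-censoring application, a Gibbs sampler of the kind described above can be sketched in a few lines. Everything concrete here (Exponential observations, a Gamma prior on the rate, the particular intervals) is made up for illustration: the sampler alternates between drawing the latent exact values from truncated exponentials and drawing the rate from its conjugate Gamma posterior given those exact values.

```python
import math
import random

random.seed(0)

# Interval-censored data: each observation known only to lie in (a_i, b_i).
intervals = [(0.5, 1.5), (0.2, 0.8), (1.0, 3.0), (0.1, 0.9), (2.0, 4.0)]
alpha, beta = 2.0, 1.0  # Gamma(alpha, beta) prior on the exponential rate lam

def draw_truncated_exp(lam, a, b):
    """Inverse-CDF draw from an Exponential(lam) truncated to (a, b)."""
    u = random.random()
    fa, fb = math.exp(-lam * a), math.exp(-lam * b)
    return -math.log(fa - u * (fa - fb)) / lam

lam = 1.0  # starting value for the rate
draws = []
for sweep in range(5000):
    # Update the auxiliary exact values given the current rate.
    exact = [draw_truncated_exp(lam, a, b) for (a, b) in intervals]
    # Update the rate from its conjugate Gamma posterior given the exact values
    # (gammavariate takes shape and *scale*, hence the reciprocal of the rate).
    lam = random.gammavariate(alpha + len(exact), 1.0 / (beta + sum(exact)))
    draws.append(lam)

posterior_mean = sum(draws[1000:]) / len(draws[1000:])
```

Each sweep's exact values always land inside their censoring intervals, and the retained $\lambda$ draws approximate the posterior for the rate.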

3.8 Handling Interval Censoring and Truncation in WinBUGS

WinBUGS provides an "automatic" implementation of the auxiliary variables idea for interval censoring. Suppose that a part of the data vector, $y$, amounts to provision of the information that an incompletely observed variable from a parametric probability density $f(\cdot|\gamma)$ ($\gamma$ is part of the parameter vector $\theta$) is somewhere in the interval $(a,b)$. Let $y^{\text{aux}}$ be this uncensored observation and $F(\cdot|\gamma)$ the cdf corresponding to $f(\cdot|\gamma)$. $y$'s contribution to the likelihood is

$$F(b|\gamma) - F(a|\gamma) \qquad (15)$$

and conditioned on $y$, $y^{\text{aux}}$ has pdf

$$\frac{f(y^{\text{aux}}|\gamma)}{F(b|\gamma) - F(a|\gamma)}\, I[a < y^{\text{aux}} < b] \qquad (16)$$

So (multiplying (15) and (16)) we see that the net effect of including the auxiliary variable in MCMC is to replace (15) with

$$f(y^{\text{aux}}|\gamma)\, I[a < y^{\text{aux}} < b] \qquad (17)$$

in the $h(\theta)$ from which one must simulate. The WinBUGS method for doing this is that instead of even trying to enter something like (15) into consideration, one specifies that an unobserved variable $y^{\text{aux}}$ contributes a term like (17) to $h(\theta)$. For the specific correct syntax, see the Censoring and truncation subsection of the Model Specification section of the WinBUGS User Manual.

The pdf (16) is of independent interest. It is the pdf on $(a,b)$ that has the same shape as the density $f(\cdot|\gamma)$ (a pdf typically on all of $\mathbb{R}$ or on $\mathbb{R}^+$) on that interval. It is usually known as a truncated version of $f(\cdot|\gamma)$. One might imagine generating observations according to $f(\cdot|\gamma)$, but that somehow all escape detection except those falling in $(a,b)$. Density (16) is the pdf of any observation that is detected.

There is no easy/automatic way to use a truncated distribution as a model in WinBUGS. In particular one CANNOT simply somehow make use of the censoring idea, somehow declaring that an observed variable has the distribution $f(\cdot|\gamma)$ but is censored to $(a,b)$. In the first place, (16) and (17) are not the same functions. Besides, if one uses the WinBUGS code for censoring and essentially includes terms like (17) in $h(\theta)$ but then turns around and provides observed values, one might as well have simply specified that the observation was from $f(\cdot|\gamma)$ alone (the indicator takes the value 1). And the density (16) is NOT equivalent to $f(\cdot|\gamma)$ as a contributor to a likelihood function.

The only way to make use of a truncated distribution as part of a WinBUGS model specification is to essentially program one's own version of the truncated pdf and use the WinBUGS "zeros trick" to get it included as a factor in the $h(\theta)$ from which WinBUGS samples. (See the "Tricks: Advanced Use of the BUGS Language" section of the WinBUGS User Manual.)
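Outside of WinBUGS, simulating from a truncated density like (16) is straightforward by inversion: draw $U$ uniform on $(F(a), F(b))$ and apply $F^{-1}$. A sketch for a truncated standard normal, with a generic bisection routine as my own stand-in for a proper quantile function:

```python
import math
import random

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Invert norm_cdf by bisection (a generic stand-in for a quantile routine)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_truncated_normal(a, b):
    """Inverse-cdf draw from N(0,1) truncated to (a, b), per density (16)."""
    u = random.uniform(norm_cdf(a), norm_cdf(b))
    return norm_ppf(u)

random.seed(1)
draws = [draw_truncated_normal(1.0, 2.0) for _ in range(1000)]
```

Every draw falls in the truncation interval, and within it the draws have the same shape as the untruncated density.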

4 The Practice of Bayes Inference 2: Simple One-Sample Models

Both because one-sample statistical problems are of interest in their own right and because what we will find to be true for one-sample models becomes raw material for building and using more complicated models, we now consider the application of the Bayes paradigm to single samples from some common parametric models.

4.1 Binomial Observations

Suppose that observable $Y \sim \text{Binomial}(n,p)$ for the unknown parameter $p \in (0,1)$. Then when $Y = y \in \{0,1,2,\ldots,n\}$ the likelihood function becomes

$$L(p) = \binom{n}{y} p^y (1-p)^{n-y}$$

From this it is immediate that a convenient form for a prior will be $\text{Beta}(\alpha,\beta)$, that is, a continuous distribution on $(0,1)$ with pdf

$$g(p) = \frac{1}{B(\alpha,\beta)}\, p^{\alpha-1} (1-p)^{\beta-1} \qquad (18)$$

for some values $\alpha > 0$ and $\beta > 0$. ($\alpha$ and $\beta$ thus become parameters of the prior distribution and are thus often termed "hyperparameters.") It is clear that the product $L(p)\,g(p)$ is proportional to a $\text{Beta}(\alpha+y,\ \beta+(n-y))$ density, i.e. with prior specified by (18)

$$g(p|y) \text{ is } \text{Beta}(\alpha+y,\ \beta+(n-y))$$

The $\text{Beta}(\alpha,\beta)$ distributions are conjugate priors for the simple binomial model.

Notice that the Fisher information in $Y$ about $p$ is

$$I_Y(p) = -E_p \frac{d^2}{dp^2} \ln f(Y|p) = \frac{n}{p(1-p)}$$

So the Jeffreys prior for the binomial model is specified by

$$g(p) = \sqrt{I_Y(p)} \propto p^{-1/2} (1-p)^{-1/2}$$

That is, in this case the Jeffreys prior is a member of the conjugate Beta family, the $\text{Beta}(1/2,1/2)$ distribution.

The $\text{Beta}(\alpha,\beta)$ mean is $\alpha/(\alpha+\beta)$, so the posterior mean of $p$ with a Beta prior is

$$E[p|Y=y] = \frac{\alpha+y}{\alpha+\beta+n} = \left(\frac{\alpha+\beta}{\alpha+\beta+n}\right)\left(\frac{\alpha}{\alpha+\beta}\right) + \left(\frac{n}{\alpha+\beta+n}\right)\left(\frac{y}{n}\right)$$

and this motivates thinking about the hyperparameters of a Beta prior in terms of $(\alpha+\beta)$ being a kind of "prior sample size" and $\alpha$ being a corresponding "prior number of successes." (The posterior mean is a weighted average of the prior mean and sample mean, with respective weights in proportion to $(\alpha+\beta)$ and $n$.)

Notice that if one defines

$$\theta = \text{logit}(p) \equiv \ln\left(\frac{p}{1-p}\right)$$

and chooses an improper prior for $\theta$ that is "uniform on $\mathbb{R}$," then

$$p = \text{logit}^{-1}(\theta) = \frac{\exp(\theta)}{1+\exp(\theta)}$$

has an improper prior with "density" (proportional to)

$$g(p) = p^{-1}(1-p)^{-1} \qquad (19)$$

The meaning here is that for $0 < a < b < 1$

$$\int_a^b p^{-1}(1-p)^{-1}\, dp \propto \text{logit}(b) - \text{logit}(a) = \int_{\text{logit}(a)}^{\text{logit}(b)} 1\, d\theta$$

Now the improper prior specified by (19) is in some sense the $\alpha = 0$ and $\beta = 0$ limit of (proper) $\text{Beta}(\alpha,\beta)$ priors. As long as $0 < y < n$, this improper prior for $p$ and the likelihood combine to give a proper Beta posterior for $p$.
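The conjugate updating here is simple enough to check by hand; a sketch with made-up data, using the Jeffreys Beta(1/2, 1/2) prior and verifying the weighted-average form of the posterior mean:

```python
alpha, beta = 0.5, 0.5   # Jeffreys Beta(1/2, 1/2) prior
n, y = 20, 7             # 7 successes in 20 trials (made-up data)

# Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + y, beta + (n - y)).
post_alpha, post_beta = alpha + y, beta + (n - y)
post_mean = post_alpha / (post_alpha + post_beta)

# The same number as the weighted average of prior mean and sample fraction,
# with weights proportional to the "prior sample size" alpha + beta and to n.
prior_n = alpha + beta
weighted = (prior_n / (prior_n + n)) * (alpha / prior_n) \
         + (n / (prior_n + n)) * (y / n)
```

Both routes give $E[p|Y=y] = 7.5/21$ here.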


4.2 Poisson Observations

Suppose that observable $Y \sim \text{Poisson}(\lambda)$ for the unknown parameter $\lambda \in (0,\infty)$. Then when $Y = y \in \{0,1,2,\ldots\}$ the likelihood function becomes

$$L(\lambda) = \frac{\exp(-\lambda)\,\lambda^y}{y!}$$

A conjugate form for a prior is $\Gamma(\alpha,\beta)$, that is, a distribution on $(0,\infty)$ with pdf

$$g(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} \exp(-\beta\lambda) \qquad (20)$$

It is then clear that the product $L(\lambda)\,g(\lambda)$ is proportional to the $\Gamma(\alpha+y,\ \beta+1)$ density, that is, with prior specified by (20)

$$g(\lambda|y) \text{ is } \Gamma(\alpha+y,\ \beta+1)$$

The $\Gamma(\alpha,\beta)$ mean is $\alpha/\beta$ and the variance of the distribution is $\alpha/\beta^2$. So for $\alpha = \beta =$ some small number, the prior (20) has mean 1 and a large variance. The corresponding posterior mean is $(\alpha+y)/(\beta+1) \approx y$ and the posterior standard deviation is $\sqrt{(\alpha+y)/(\beta+1)^2} \approx \sqrt{y}$.

Notice that the Fisher information in $Y$ about $\lambda$ is

$$I_Y(\lambda) = -E_\lambda \frac{d^2}{d\lambda^2} \ln f(Y|\lambda) = \frac{1}{\lambda}$$

So the (improper) Jeffreys prior for $\lambda$ is specified by

$$g(\lambda) = \frac{1}{\sqrt{\lambda}}$$

and for $\alpha = 1/2$ and $\beta =$ some small number, the $\Gamma(\alpha,\beta)$ prior is approximately the Jeffreys prior.

Finally, note that for an improper prior for $\lambda$ that is "uniform on $(0,\infty)$," i.e. $g(\lambda) = 1$ on that interval, the posterior density is

$$g(\lambda|y) \propto \exp(-\lambda)\,\lambda^y$$

i.e. the posterior is $\Gamma(y+1,\ 1)$.
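A quick simulation check of the $\Gamma(\alpha+y,\ \beta+1)$ posterior with made-up data. (Note that Python's `random.gammavariate` takes a shape and a *scale*, so the rate $\beta+1$ enters as its reciprocal.)

```python
import random

random.seed(2)
alpha, beta = 0.5, 0.01  # approximately the Jeffreys prior, as in the text
y = 9                    # a single observed count (made up)

post_shape, post_rate = alpha + y, beta + 1.0
draws = [random.gammavariate(post_shape, 1.0 / post_rate) for _ in range(50_000)]
sim_mean = sum(draws) / len(draws)
exact_mean = post_shape / post_rate  # (alpha + y) / (beta + 1), roughly y
```

The simulated mean matches the closed-form posterior mean to Monte Carlo accuracy.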

4.3 Univariate Normal Observations

One- and two-parameter versions of models involving $N(\mu,\sigma^2)$ observations can be considered. We start with the one-parameter versions.

4.3.1 $\sigma^2$ Fixed/Known

Suppose first that $Y \sim N(\mu,\sigma^2)$ where $\sigma^2$ is a known constant (and thus is not an object of inference). Notice that here $Y$ could be a sample mean of iid normal observations, in which case $\sigma^2$ would be a population variance over sample size. (Note too that in such a case, sufficiency considerations promise that inference based on the sample mean is equivalent to inference based on the original set of individual observations.)

The likelihood function here is

$$L(\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \qquad (21)$$

Then consider a (conjugate) $N(m,\gamma^2)$ prior for $\mu$ with density

$$g(\mu) = \frac{1}{\sqrt{2\pi\gamma^2}} \exp\left(-\frac{(\mu-m)^2}{2\gamma^2}\right) \qquad (22)$$

Then

$$L(\mu)\,g(\mu) \propto \exp\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma^2}+\frac{1}{\gamma^2}\right)\mu^2 - 2\left(\frac{y}{\sigma^2}+\frac{m}{\gamma^2}\right)\mu\right]\right)$$

So the posterior pdf $g(\mu|y)$ is again normal with

$$\text{variance} = \left(\frac{1}{\sigma^2}+\frac{1}{\gamma^2}\right)^{-1} = \frac{\gamma^2\sigma^2}{\sigma^2+\gamma^2} \qquad (23)$$

and

$$\text{mean} = \left(\frac{y}{\sigma^2}+\frac{m}{\gamma^2}\right) \cdot \text{variance} = \frac{\dfrac{y}{\sigma^2}+\dfrac{m}{\gamma^2}}{\dfrac{1}{\sigma^2}+\dfrac{1}{\gamma^2}} \qquad (24)$$
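Formulas (23) and (24) amount to two lines of arithmetic; a sketch with made-up numbers:

```python
sigma2 = 4.0           # known sampling variance
m, gamma2 = 0.0, 9.0   # N(m, gamma2) prior for mu
y = 6.0                # the observation (made up)

# (23): posterior variance is the reciprocal of summed precisions.
post_var = 1.0 / (1.0 / sigma2 + 1.0 / gamma2)
# (24): posterior mean is the precision-weighted average of y and m.
post_mean = (y / sigma2 + m / gamma2) * post_var
```

With these numbers the posterior is $N(54/13,\ 36/13)$: the posterior mean sits between the prior mean 0 and the observation 6, pulled toward whichever has the higher precision.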

For purposes of Bayes analysis, it is often convenient to think in terms of a distribution's

$$\text{precision} = \frac{1}{\text{variance}}$$

In these terms, equation (23) says that in this model

$$\text{posterior precision} = \text{prior precision} + \text{precision of likelihood} \qquad (25)$$

and equation (24) says that in this model the posterior mean is a precision-weighted average of the prior and sample means.

As a bit of an aside, notice that while (25) and (24) describe the posterior distribution of $\mu|Y=y$, one might also be interested in the marginal distribution of $Y$. This distribution is also normal, with

$$EY = m \quad \text{and} \quad \text{Var}\,Y = \sigma^2 + \gamma^2$$

So in the case of the marginal distribution, it is variances (not precisions) that add.

The Fisher information in $Y$ about $\mu$ is

$$I_Y(\mu) = -E_\mu \frac{d^2}{d\mu^2} \ln L(\mu) = \frac{1}{\sigma^2}$$

This is constant in $\mu$. So the (improper) Jeffreys prior for $\mu$ is "uniform on $\mathbb{R}$," $g(\mu) = 1$. With this improper prior, the posterior is proportional to $L(\mu)$. Looking again at (21), we see that the posterior density $g(\mu|y)$ is then $N(y,\sigma^2)$. Notice that in some sense this improper prior is the $\gamma^2 = \infty$ limit of a conjugate prior (22), and the corresponding $N(y,\sigma^2)$ posterior is the $\gamma^2 = \infty$ limit of the posterior for the conjugate prior.

Consider the proper $U(a,b)$ prior with density

$$g(\mu) \propto I[a < \mu < b]$$

With this prior the posterior has density

$$g(\mu|y) \propto I[a < \mu < b] \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$

That is, the posterior is the $N(y,\sigma^2)$ distribution truncated to the interval $(a,b)$. Then, as long as $a \ll y \ll b$ (relative to the size of $\sigma$), the posterior is essentially $N(y,\sigma^2)$. That is, this structure will allow one to approximate a Jeffreys analysis using a proper prior.

4.3.2 $\mu$ Fixed/Known

Suppose now that $Y = (Y_1, Y_2, \ldots, Y_n)$ has components that are iid $N(\mu,\sigma^2)$, where $\mu$ is a known constant (and thus is not an object of inference and can be used in formulas for statistics to be calculated from the data). The likelihood function here is

$$L(\sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}\right)$$

Let

$$w = \frac{1}{n}\sum_{i=1}^n (y_i - \mu)^2$$

and with this notation note that

$$L(\sigma^2) \propto \left(\sigma^2\right)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}w\right)$$

A conjugate prior here is the so-called Inv-$\Gamma(\alpha,\beta)$ distribution on $(0,\infty)$ with pdf

$$g(\sigma^2) \propto \left(\sigma^2\right)^{-(\alpha+1)} \exp\left(-\frac{\beta}{\sigma^2}\right) \qquad (26)$$

It is then obvious (upon inspection of the product $L(\sigma^2)\,g(\sigma^2)$) that using prior (26), the posterior is

$$\text{Inv-}\Gamma\left(\alpha + \frac{n}{2},\ \beta + \frac{nw}{2}\right) \qquad (27)$$

A useful/standard re-expression of this development is in terms of the so-called "scaled inverse $\chi^2$ distributions." That is, one could start by using a prior for $\sigma^2$ that is the distribution of

$$\frac{\nu\eta^2}{X} \quad \text{for } X \sim \chi^2_\nu \qquad (28)$$

From (28) it is clear that $\eta^2$ is a scale parameter for this distribution and that $\nu$ governs the shape of the distribution. The textbook uses the notation

$$\sigma^2 \sim \text{Inv-}\chi^2\left(\nu, \eta^2\right) \quad \text{or} \quad \sigma^2 \sim \text{Inv-}\Gamma\left(\frac{\nu}{2}, \frac{\nu}{2}\eta^2\right)$$

for the assumption that $\sigma^2$ has the distribution of (28). With this notation, the posterior is

$$\text{Inv-}\Gamma\left(\frac{1}{2}(\nu+n),\ \frac{\nu}{2}\eta^2 + \frac{nw}{2}\right) \quad \text{or} \quad \text{Inv-}\chi^2\left(\nu+n,\ \frac{\nu\eta^2+nw}{\nu+n}\right) \qquad (29)$$

This second form in (29) provides a very nice interpretation of what happens when the prior and likelihood are combined. The degrees of freedom add, with the prior essentially having the same influence on the posterior as would a legitimate sample of size $\nu$. The posterior scale parameter is an appropriately weighted average of the prior scale parameter and $w$ (the known-mean, $n$-denominator sample variance).

It's fairly easy to determine that the Fisher information in $Y = (Y_1, Y_2, \ldots, Y_n)$ about $\sigma^2$ is

$$I_Y(\sigma^2) = \frac{n}{2\sigma^4}$$

so that a Jeffreys (improper) prior for $\sigma^2$ is specified by

$$g(\sigma^2) \propto \frac{1}{\sigma^2} \qquad (30)$$

Notice that since for $0 < a < b < \infty$

$$\int_a^b \frac{1}{x}\, dx = \ln b - \ln a = \int_{\ln a}^{\ln b} 1\, dx$$

the improper prior for $\sigma^2$ specified by (30) is equivalent to an (improper) prior for $\ln \sigma^2$ (or $\ln \sigma$) that is uniform on $\mathbb{R}$.

Notice also that the Jeffreys improper prior (30) is in some sense the $\alpha = 0$ and $\beta = 0$ limit of the Inv-$\Gamma(\alpha,\beta)$ prior, or equivalently the fixed $\eta^2$ and $\nu = 0$ limit of the Inv-$\chi^2(\nu,\eta^2)$ prior. The posterior for this improper prior is specified by

$$g(\sigma^2|w) \propto \left(\sigma^2\right)^{-1} \left(\sigma^2\right)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}w\right)$$

that is, the (proper) posterior is

$$\text{Inv-}\Gamma\left(\frac{n}{2}, \frac{nw}{2}\right) \text{ or } \text{Inv-}\chi^2(n, w)$$
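Representation (28) makes simulation from any of these posteriors easy: draw $X \sim \chi^2_\nu$ (a Gamma variable with shape $\nu/2$ and scale 2) and invert. A sketch checking the Inv-$\chi^2(n, w)$ posterior above on made-up data with known mean 0:

```python
import random

random.seed(3)
mu = 0.0
data = [1.2, -0.4, 2.1, 0.3, -1.5, 0.8, -0.9, 1.7]  # made-up observations
n = len(data)
w = sum((yi - mu) ** 2 for yi in data) / n  # known-mean sample variance

def draw_inv_chi2(nu, scale2):
    """One draw of nu * scale2 / X for X ~ chi-squared(nu), per (28)."""
    x = random.gammavariate(nu / 2.0, 2.0)  # chi-squared(nu)
    return nu * scale2 / x

draws = [draw_inv_chi2(n, w) for _ in range(50_000)]
sim_mean = sum(draws) / len(draws)
exact_mean = n * w / (n - 2)  # mean of Inv-chi2(n, w), defined for n > 2
```

The simulated draws are all positive, and their average matches the closed-form mean of the scaled inverse $\chi^2$ distribution to Monte Carlo accuracy.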

4.3.3 Both $\mu$ and $\sigma^2$ Unknown

Suppose finally that $Y = (Y_1, Y_2, \ldots, Y_n)$ has components that are iid $N(\mu,\sigma^2)$, where neither of the parameters is known. The likelihood function is

$$L(\mu,\sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}\right)$$

There are several obvious choices for a (joint) prior distribution for $(\mu,\sigma^2)$.

First, one might put together the two improper Jeffreys priors for $\mu$ and $\sigma^2$ individually. That is, one might try using an improper prior on $\mathbb{R} \times (0,\infty)$ specified by

$$g(\mu,\sigma^2) \propto 1 \cdot \frac{1}{\sigma^2} \qquad (31)$$

Since this is a product of a function of $\mu$ and a function of $\sigma^2$, this is a prior of "independence." As it turns out, provided that $n \geq 2$, the prior (31) has a corresponding proper posterior, that is of course specified by a joint density for $\mu$ and $\sigma^2$ of the form

$$g(\mu,\sigma^2|y) \propto L(\mu,\sigma^2)\, g(\mu,\sigma^2) \qquad (32)$$

The posterior density (32) is NOT the product of a function of $\mu$ and a function of $\sigma^2$, and thus does not specify a posterior of independence. This is not a bad feature of (31). We should, for example, want cases where the usual sample variance $s^2$ is small to be ones that produce posteriors that indicate 1) that $\sigma^2$ is likely small, and therefore 2) that $\mu$ has been fairly precisely determined ... one does not want the posterior to have the same conditional variance for $\mu$ for all $\sigma^2$ values.

Pages 73-77 of the text show that posterior (32) has attractive marginals. First, it turns out that

$$g(\sigma^2|y) \text{ is } \text{Inv-}\chi^2\left(n-1,\ s^2\right)$$

that is, conditioned on $Y = y$ (and therefore the value of the usual sample variance $s^2$), $\sigma^2$ has the distribution of

$$\frac{(n-1)s^2}{X} \quad \text{for } X \sim \chi^2_{n-1}$$

Further, conditioned on $Y = y$ (and therefore the values of $\bar{y}$ and $s^2$), $\mu$ has the distribution of

$$\bar{y} + T\frac{s}{\sqrt{n}} \quad \text{for } T \sim t_{n-1}$$

These two facts imply that Bayes posterior (credible) intervals for $\mu$ and $\sigma^2$ will agree exactly with standard Stat 500 confidence intervals (at the same level) for the parameters.

Of course it is possible to approximate the improper prior (31) with proper joint distributions and get posterior inferences that are essentially the same as for this improper prior. For example, as an approximation to (31), one might specify that a priori $\mu$ and $\ln \sigma^2$ are independent with

$$\mu \sim U(\text{small}_1, \text{large}_1) \quad \text{and} \quad \ln \sigma^2 \sim U(\text{small}_2, \text{large}_2)$$

and expect to get essentially frequentist posterior inferences.

Another possibility is to use a product of two proper conjugate marginal priors as a joint prior. That is, one might specify that a priori $\mu$ and $\sigma^2$ are independent with

$$\mu \sim N\left(m, \gamma^2\right) \quad \text{and} \quad \sigma^2 \sim \text{Inv-}\chi^2\left(\nu, \eta^2\right)$$

As it turns out, nothing works out very cleanly with this choice of prior. See pages 80-82 of the textbook. Analysis of the posterior here is really a job for simulation. Obviously, one expects that for large $\gamma^2$ and small $\nu$, inferences based on this structure should look much like those made using the form (31), and therefore a lot like Stat 500 inferences.

Finally, on pages 78-80 the textbook discusses what seems to me to be a very unattractive but conjugate prior for $(\mu,\sigma^2)$. I find the assumed prior dependence between the two parameters and the specification of the constant $\kappa_0$ to be quite unnatural.
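Under the improper prior (31), the two marginal facts above make joint posterior simulation trivial. The sketch below draws $\sigma^2$ from its Inv-$\chi^2(n-1, s^2)$ marginal and then $\mu$ given $\sigma^2$ from $N(\bar{y}, \sigma^2/n)$ (the standard compositional draw, which reproduces the marginal t fact quoted above); the data are made up:

```python
import random

random.seed(4)
data = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2]  # made-up sample
n = len(data)
ybar = sum(data) / n
s2 = sum((y - ybar) ** 2 for y in data) / (n - 1)  # usual sample variance

mus, sig2s = [], []
for _ in range(20_000):
    # sigma^2 | y ~ Inv-chi2(n - 1, s^2): (n - 1) s^2 / X with X ~ chi2_{n-1}
    x = random.gammavariate((n - 1) / 2.0, 2.0)
    sig2 = (n - 1) * s2 / x
    # mu | sigma^2, y ~ N(ybar, sigma^2 / n)
    mu = random.gauss(ybar, (sig2 / n) ** 0.5)
    mus.append(mu)
    sig2s.append(sig2)

post_mean_mu = sum(mus) / len(mus)
```

The empirical quantiles of `mus` and `sig2s` then reproduce (to Monte Carlo accuracy) the Stat 500 t and $\chi^2$ intervals.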

4.4 Multivariate Normal Observations

As is completely standard, for a nonsingular covariance matrix $\Sigma$, we will say that a $k$-dimensional random vector $Y \sim \text{MVN}_k(\mu,\Sigma)$ provided it has a pdf on $\mathbb{R}^k$

$$f(y|\mu,\Sigma) \propto (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right)$$

Then if $Y = (Y_1, Y_2, \ldots, Y_n)$ where the components $Y_i$ are iid $\text{MVN}_k(\mu,\Sigma)$, the joint pdf is

$$f(y|\mu,\Sigma) \propto (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i-\mu)'\Sigma^{-1}(y_i-\mu)\right) = (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Sigma^{-1}S_0\right)\right)$$

where

$$S_0 = \sum_{i=1}^n (y_i-\mu)(y_i-\mu)' \qquad (33)$$

We proceed to consider models involving multivariate normal observations.

4.4.1 $\Sigma$ Fixed/Known

Suppose that $Y = (Y_1, Y_2, \ldots, Y_n)$ where the components $Y_i$ are iid $\text{MVN}_k(\mu,\Sigma)$. If $\Sigma$ is known, the likelihood function is

$$L(\mu) = (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Sigma^{-1}S_0\right)\right)$$

for $S_0$ the function of $\mu$ defined in (33). Then consider a conjugate $\text{MVN}_k(m, \Gamma_0)$ prior for $\mu$. As it turns out, in direct generalization of the univariate normal case with known variance and (24) and (23), the posterior pdf $g(\mu|y)$ is $\text{MVN}_k$ with mean vector

$$\mu_n = \left(\Gamma_0^{-1} + n\Sigma^{-1}\right)^{-1}\left(\Gamma_0^{-1}m + n\Sigma^{-1}\bar{y}\right) \qquad (34)$$

and covariance matrix

$$\Gamma_n = \left(\Gamma_0^{-1} + n\Sigma^{-1}\right)^{-1} \qquad (35)$$

Thinking of a covariance matrix as an "inverse precision matrix," the sampling precision of $\bar{Y}$ is $n\Sigma^{-1}$, and (35) says that the posterior precision is the sum of the prior precision and the precision of the likelihood, while (34) says the posterior mean is a precision-weighted average of the prior mean and the sample mean. If the matrix $\Gamma_0$ is "big" (the prior precision matrix $\Gamma_0^{-1}$ is "small"), then the posterior for the conjugate prior is approximately $\text{MVN}_k\left(\bar{y}, \frac{1}{n}\Sigma\right)$. More directly, if one uses an improper prior for $\mu$ that is uniform on $\mathbb{R}^k$, one gets this $\text{MVN}_k\left(\bar{y}, \frac{1}{n}\Sigma\right)$ posterior exactly.
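Formulas (34) and (35) in code, for a made-up $k = 2$ example (the tiny hand-rolled 2×2 matrix helpers are just stand-ins for real linear algebra routines):

```python
def inv2(a):
    """Inverse of a 2x2 matrix given as [[p, q], [r, s]]."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def add2(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def scal2(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]

def matvec(a, v):
    return [sum(a[i][j] * v[j] for j in range(2)) for i in range(2)]

n = 10
Sigma = [[1.0, 0.3], [0.3, 2.0]]   # known sampling covariance
Gamma0 = [[4.0, 0.0], [0.0, 4.0]]  # prior covariance for mu
m = [0.0, 0.0]                     # prior mean
ybar = [1.5, -0.5]                 # sample mean vector (made up)

# (35): posterior precision = prior precision + n * sampling precision.
prec = add2(inv2(Gamma0), scal2(n, inv2(Sigma)))
Gamma_n = inv2(prec)
# (34): posterior mean = Gamma_n (Gamma0^{-1} m + n Sigma^{-1} ybar).
rhs = [a + b for a, b in zip(matvec(inv2(Gamma0), m),
                             matvec(scal2(n, inv2(Sigma)), ybar))]
mu_n = matvec(Gamma_n, rhs)
```

With $n = 10$ observations and a diffuse prior, the posterior mean lands close to $\bar{y}$, as the "big $\Gamma_0$" remark above predicts.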

4.4.2 $\mu$ Fixed/Known

Suppose that $Y = (Y_1, Y_2, \ldots, Y_n)$ where the components $Y_i$ are iid $\text{MVN}_k(\mu,\Sigma)$. If $\mu$ is known, the likelihood function is

$$L(\Sigma) = (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Sigma^{-1}S_0\right)\right) = (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\text{tr}\left(S_0\Sigma^{-1}\right)\right) \qquad (36)$$

for $S_0$ defined in (33). In order to do Bayes inference, we then need to place prior distributions on covariance matrices. This requires doing some post-Stat 542 probability (that essentially generalizes the chi-squared distributions to multivariate cases) and introducing the Wishart (and inverse Wishart) distributions.

Wishart Distributions

Let $\psi_1, \psi_2, \ldots, \psi_\nu$ be iid $\text{MVN}_k(0,\Phi)$ for a nonsingular covariance matrix $\Phi$. Then for $\nu \geq k$ consider the "sum of squares and cross-products matrix" (for these mean-0 random vectors)

$$W = \sum_{i=1}^\nu \psi_i\psi_i' = \left(\sum_{i=1}^\nu \psi_{il}\psi_{im}\right)_{\substack{l=1,2,\ldots,k \\ m=1,2,\ldots,k}} \qquad (37)$$

This random $k \times k$ matrix has the so-called Wishart$(\nu,\Phi)$ distribution. Now $W$ has only $k + \frac{1}{2}\left(k^2-k\right)$ different entries, as those below the diagonal are the same as their opposite numbers above the diagonal. So if one wishes to write a pdf to describe the distribution of $W$, it will really be a function of those fewer than $k^2$ distinct elements. It turns out that (thought of on the right as a function of the $k \times k$ matrix $w$ and on the left as a function of the elements on and above the diagonal of $w$) $W$ has a pdf

$$f(w|\nu,\Phi) \propto (\det w)^{(\nu-k-1)/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Phi^{-1}w\right)\right) \qquad (38)$$

It follows from either representation (37) or density (38) that if $W \sim \text{Wishart}(\nu,\Phi)$

$$EW = \nu\Phi$$

and in fact the diagonal elements of $W$ are scaled $\chi^2$ variables. That is

$$W_{ii} \sim \Gamma\left(\frac{\nu}{2}, \frac{1}{2\phi_{ii}}\right)$$

where $\phi_{ii}$ is the $i$th diagonal element of $\Phi$. That is, $W_{ii}$ has the same distribution as $\phi_{ii}X$ for $X \sim \chi^2_\nu$.

For users of WinBUGS a serious caution needs to be interjected at this point.

$$V \sim \text{WinBUGS-Wishart}(\nu,\Phi)$$

means that

$$V \sim \text{Wishart}\left(\nu,\Phi^{-1}\right)$$

in the present notation/language. That is, WinBUGS parameterizes with precision matrices, not covariance matrices.

Inverse Wishart Distributions

Next, for $W \sim \text{Wishart}(\nu,\Phi)$, consider

$$U = W^{-1}$$

This $k \times k$ inverse sum of squares and cross-products random matrix can be shown to have probability density

$$f(u|\nu,\Phi) \propto (\det u)^{-(\nu+k+1)/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Phi^{-1}u^{-1}\right)\right) \qquad (39)$$

We will call the distribution of $U = W^{-1}$ (as specified by the pdf in (39)) Inv-Wishart$(\nu,\Phi)$. That is

$$W \sim \text{Wishart}(\nu,\Phi) \ \Rightarrow\ U = W^{-1} \sim \text{Inv-Wishart}(\nu,\Phi)$$

As it turns out,

$$EU = EW^{-1} = \frac{1}{\nu-k-1}\Phi^{-1} \qquad (40)$$

and the diagonal entries of $U$ are scaled inverse $\chi^2$ variables. That is, for $\phi^{ii}$ the $i$th diagonal entry of $\Phi^{-1}$, the $i$th diagonal entry of $U$ is

$$U_{ii} \sim \text{Inv-}\Gamma\left(\frac{\nu-k+1}{2}, \frac{\phi^{ii}}{2}\right) \quad \text{or} \quad \text{Inv-}\chi^2\left(\nu-k+1, \frac{\phi^{ii}}{\nu-k+1}\right)$$

(recall the definitions in Section 4.3.2), i.e. $U_{ii}$ has the same distribution as $\phi^{ii}/X$ for $X \sim \chi^2_{\nu-k+1}$. Notice also that with the conventions of this discussion, $W \sim \text{Wishart}(\nu,\Phi)$ implies that $\Phi$ is a scaling matrix for $W$ and $\Phi^{-1}$ is a scaling matrix for $U = W^{-1}$.

Application of Inverse Wishart Priors

So, now comparing the form of the "known $\mu$ multivariate normal likelihood" in (36) and the Inv-Wishart pdf in (39), it is clear that the inverse Wishart distributions provide conjugate priors for this situation. If for $\nu \geq k$ and a given nonsingular covariance matrix $\Phi$ one makes the prior assumption that $\Sigma \sim \text{Inv-Wishart}(\nu,\Phi)$, i.e. assumes that

$$g(\Sigma) \propto (\det \Sigma)^{-(\nu+k+1)/2} \exp\left(-\frac{1}{2}\text{tr}\left(\Phi^{-1}\Sigma^{-1}\right)\right) \qquad (41)$$

multiplying (36) and (41) shows that the posterior is

$$\text{Inv-Wishart}\left(n+\nu,\ \left(S_0+\Phi^{-1}\right)^{-1}\right) \qquad (42)$$

So, for example, the posterior mean for $\Sigma$ is

$$\frac{1}{n+\nu-k-1}\left(S_0+\Phi^{-1}\right)$$

and, for example, $\Sigma_{ii}$ has the posterior distribution of $s^{ii}/X$ for $X \sim \chi^2_{n+\nu-k+1}$ and $s^{ii}$ the $i$th diagonal entry of $S_0+\Phi^{-1}$, i.e. the posterior distribution of the $i$th diagonal entry of $\Sigma$ is Inv-$\Gamma\left(\frac{n+\nu-k+1}{2}, \frac{s^{ii}}{2}\right)$.

It's fairly obvious that the smaller one makes $\nu$, the less influential is the prior on the form of the posterior (42). $\nu$ is sometimes thought of as a fictitious "prior sample size" in comparison to $n$.

Using WinBUGS With Inverse Wishart Priors

Consider what is required in order to set an Inv-Wishart prior for $\Sigma$ with a desired/target prior mean. The form (40) implies that if one has in mind some target prior mean for $\Sigma$, say $\Sigma^*$, one wants prior parameters $\nu$ and $\Phi$ such that

$$\Sigma^* = \frac{1}{\nu-k-1}\Phi^{-1}$$

that is

$$\Phi = \frac{1}{\nu-k-1}\left(\Sigma^*\right)^{-1} \qquad (43)$$

One may set a prior $\Sigma \sim \text{Inv-Wishart}(\nu,\Phi)$ by setting $\Sigma^{-1} \sim \text{Wishart}(\nu,\Phi)$ and have a desired prior mean for $\Sigma$ by using (43). If one is then using WinBUGS, this is done by setting $\Sigma^{-1} \sim \text{WinBUGS-Wishart}\left(\nu,\Phi^{-1}\right)$, and to get a target prior mean of $\Sigma^*$ requires that one use $\Sigma^{-1} \sim \text{WinBUGS-Wishart}(\nu,\ (\nu-k-1)\Sigma^*)$.

An Improper Limit of Inverse Wishart Priors

A candidate for a "non-informative" improper prior for $\Sigma$ is

$$g(\Sigma) \propto (\det \Sigma)^{-(k+1)/2} \qquad (44)$$

which is in some sense the $\nu = 0$ and "$\Phi = \infty$" formal limit of the form (41). The product of forms (36) and (44) produces an Inv-Wishart$\left(n, S_0^{-1}\right)$ posterior. So, for example, under (44) the posterior mean is

$$\frac{1}{n-k-1}\left(S_0^{-1}\right)^{-1} = \frac{1}{n-k-1}S_0$$

4.4.3 Both $\mu$ and $\Sigma$ Unknown

If one now treats both $\mu$ and $\Sigma$ as unknown, for data $\boldsymbol{Y} = (\boldsymbol{Y}_1, \boldsymbol{Y}_2, \ldots, \boldsymbol{Y}_n)$ where the components $\boldsymbol{Y}_i$ are iid $\text{MVN}_k(\mu, \Sigma)$, the likelihood function is

$$L(\mu, \Sigma) = (\det \Sigma)^{-n/2} \exp\left(-\frac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1} S_0\right)\right)$$

where $S_0$ is still (the function of $\mu$) defined in (33). Probably the most appealing story to be told concerning Bayes inference in this context concerns what happens when one uses an improper prior for the elements of $\mu$ and $\Sigma$ that is put together by taking the product of the two non-informative priors from the "known $\Sigma$" and "known $\mu$" cases. That is, one might consider

$$g(\mu, \Sigma) \propto 1 \cdot (\det \Sigma)^{-(k+1)/2} \quad (45)$$

With improper prior (45), as a direct generalization of what one gets in the univariate normal problem with unknown mean and variance, the posterior distribution of $\Sigma$ is Inv-Wishart, i.e.

$$\Sigma \mid \boldsymbol{y} \sim \text{Inv-Wishart}\left(n - 1, S^{-1}\right) \quad (46)$$


where

$$S = \sum_{i=1}^{n}\left(\boldsymbol{y}_i - \bar{\boldsymbol{y}}\right)\left(\boldsymbol{y}_i - \bar{\boldsymbol{y}}\right)'$$

is the sum of squares and cross-products around the sample means matrix (i.e., is $n - 1$ times the sample covariance matrix). By the way, the textbook has this wrong on its page 88 (wrongly substituting $S$ for $S^{-1}$ in the Inv-Wishart form for the posterior). (46) and (40) then imply that the posterior mean for $\Sigma$ using (45) is

$$\frac{1}{n - k - 2}S$$

Further, again as a direct generalization of what one gets in the univariate normal problem with unknown mean and variance, the posterior distribution of $\mu$ is multivariate t. That is,

$$(\mu - \bar{\boldsymbol{y}}) \mid \boldsymbol{y} \sim \text{Multivariate t}\left(n - k,\; \frac{1}{n}\left(\frac{1}{n-k}S\right)\right)$$

meaning that $\mu$ has the (posterior) distribution of

$$\bar{\boldsymbol{y}} + \frac{1}{\sqrt{n}}\left(\frac{1}{n-k}S\right)^{1/2}\sqrt{\frac{n-k}{W}}\,Z$$

where $Z$ is a $k \times 1$ vector of independent $\text{N}(0,1)$ random variables, independent of $W \sim \chi^2_{n-k}$, and $\left(\frac{1}{n-k}S\right)^{1/2}$ is a matrix square root of $\frac{1}{n-k}S$. (Notice that this fact allows one to easily (at least by simulation) find the posterior distribution of (and thus credible sets for) any parametric function $h(\mu)$.)

An alternative to the improper prior (45) is a product of two proper priors for $\mu$ and $\Sigma$. The obvious choices for the two marginal priors are a $\text{MVN}_k(\boldsymbol{m}, \Sigma_0)$ prior for $\mu$ and an $\text{Inv-Wishart}(\nu, \Lambda)$ prior for $\Sigma$. Nothing works out very cleanly (in terms of analytical formulas) under such assumptions, but one should expect that for $\Sigma_0$ "big," $\nu$ small, and $\Lambda$ "big," inferences for the proper prior should approximate those for the improper prior (45).
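The location-scale representation above makes simulation of the posterior of $\mu$ (and of any $h(\mu)$) essentially trivial. A NumPy sketch with made-up data (the particular $n$, $k$, and data-generating values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data: n iid k-variate observations
n, k = 30, 3
y = rng.multivariate_normal([0.0, 1.0, 2.0], np.eye(3), size=n)

ybar = y.mean(axis=0)
S = (y - ybar).T @ (y - ybar)            # sum of squares and cross-products matrix

# draws of mu via: ybar + (1/sqrt(n)) (S/(n-k))^{1/2} sqrt((n-k)/W) Z
L = np.linalg.cholesky(S / (n - k))      # one matrix square root of S/(n-k)
m = 20000
Z = rng.standard_normal((m, k))
W = rng.chisquare(n - k, size=m)
mu_draws = ybar + (np.sqrt((n - k) / W)[:, None] * Z) @ L.T / np.sqrt(n)

# e.g. an equal-tail 95% credible interval for h(mu) = mu_1 - mu_2
h = mu_draws[:, 0] - mu_draws[:, 1]
ci = np.percentile(h, [2.5, 97.5])
```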

4.5 Multinomial Observations

Consider now $n$ independent identical trials where each of these has $k$ possible outcomes with respective probabilities $p_1, p_2, \ldots, p_k$ (where each $p_i \in (0,1)$ and $\sum p_i = 1$). If

$$Y_i = \text{the number of outcomes that are of type } i$$

then $Y = (Y_1, Y_2, \ldots, Y_k)$ is $\text{Multinomial}_k(n, \boldsymbol{p})$ (for $\boldsymbol{p} = (p_1, p_2, \ldots, p_k)$) and has (joint) probability mass function

$$f(\boldsymbol{y} \mid \boldsymbol{p}) = \binom{n}{y_1, y_2, \ldots, y_k}\prod_{i=1}^{k}p_i^{y_i}$$


(for vectors $\boldsymbol{y}$ of non-negative integers $y_i$ with sum $n$). The coordinate variables $Y_i$ are, of course, $\text{Binomial}(n, p_i)$.

Consider inference for $\boldsymbol{p}$ based on $Y \sim \text{Multinomial}_k(n, \boldsymbol{p})$. The likelihood function is

$$L(\boldsymbol{p}) = \binom{n}{y_1, y_2, \ldots, y_k}\prod_{i=1}^{k}p_i^{y_i} \quad (47)$$

and in order to do Bayes inference, one must find a way to put a prior distribution on the set of ($k$-vectors) $\boldsymbol{p}$ that have each $p_i \in (0,1)$ and $\sum p_i = 1$.

The most convenient (and conjugate) form for a distribution on the set of $\boldsymbol{p}$'s that have each $p_i \in (0,1)$ and $\sum p_i = 1$ is the Dirichlet form. If $X_1, X_2, \ldots, X_k$ are independent random variables with $X_i \sim \Gamma(\alpha_i, 1)$ for positive constants $\alpha_1, \alpha_2, \ldots, \alpha_k$ and one defines

$$W_i = \frac{X_i}{\sum_{j=1}^{k}X_j}$$

then

$$W = (W_1, W_2, \ldots, W_k)' \sim \text{Dirichlet}_k(\boldsymbol{\alpha})$$

Using this characterization it is easy to argue that the $i$th marginal of a Dirichlet vector is $\text{Beta}\left(\alpha_i, \sum_{j \neq i}\alpha_j\right)$ and that conditional distributions of some coordinates given the values of the others are the distributions of multiples of Dirichlet vectors. The pdf for $k - 1$ coordinates of $W \sim \text{Dirichlet}_k(\boldsymbol{\alpha})$ (written in terms of all $k$ coordinates) is

$$f(\boldsymbol{w} \mid \boldsymbol{\alpha}) \propto \prod_{i=1}^{k}w_i^{\alpha_i - 1} \quad (48)$$

Using (47) and (48) it is clear that using a $\text{Dirichlet}_k(\boldsymbol{\alpha})$ prior for $\boldsymbol{p}$, the posterior is

$$\boldsymbol{p} \mid \boldsymbol{y} \sim \text{Dirichlet}_k(\boldsymbol{\alpha} + \boldsymbol{y})$$

So, for example, the Beta posterior marginal of $p_i$ has mean

$$\frac{\alpha_i + y_i}{\sum_{i=1}^{k}\alpha_i + n} = \frac{\left(\sum_{i=1}^{k}\alpha_i\right)\dfrac{\alpha_i}{\sum_{i=1}^{k}\alpha_i} + n\,\dfrac{y_i}{n}}{\sum_{i=1}^{k}\alpha_i + n} \quad (49)$$

The form of the posterior mean (49) suggests the common interpretation that $\sum_{i=1}^{k}\alpha_i$ functions as a kind of "prior sample size" in comparison to $n$ for weighting the prior against the sample information (encoded in the relative frequencies $y_i/n$). If the former is small in comparison to $n$, the posterior means (49) are nearly the sample relative frequencies. Otherwise, the posterior means are the sample relative frequencies shrunken towards the prior means $\alpha_i/\left(\sum_{i=1}^{k}\alpha_i\right)$. Of course, the larger is $\sum_{i=1}^{k}\alpha_i$ the more concentrated/less dispersed is the prior, and the larger is $\sum_{i=1}^{k}\alpha_i + n$ the more concentrated/less dispersed is the posterior.
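The gamma characterization and the conjugate updating can both be exercised numerically. The sketch below (with made-up $\boldsymbol{\alpha}$ and counts $\boldsymbol{y}$) simulates Dirichlet vectors as normalized gammas and verifies the weighted-average identity (49) exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 3.0, 5.0])        # illustrative Dirichlet prior parameters
y = np.array([12, 30, 18])               # illustrative multinomial counts
n = y.sum()

# gamma characterization: X_i ~ Gamma(alpha_i, 1), W = X / sum(X) ~ Dirichlet(alpha)
X = rng.gamma(shape=alpha, size=(50000, 3))
W = X / X.sum(axis=1, keepdims=True)
mc_prior_mean = W.mean(axis=0)           # should approximate alpha / sum(alpha)

# conjugate update: posterior is Dirichlet(alpha + y)
post = alpha + y
post_mean = post / post.sum()

# (49): posterior mean = weighted average of prior means and sample relative frequencies
prior_weight = alpha.sum() / (alpha.sum() + n)
weighted = prior_weight * alpha / alpha.sum() + (1 - prior_weight) * y / n
```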

5 Graphical Representation of Some Aspects of Large Joint Distributions

This section of the outline covers some material taken from Chapters 17 and 18 of All of Statistics by Wasserman. (Wasserman's book refers its readers to Introduction to Graphical Modeling by Edwards for a complete treatment of this subject. Lauritzen's Graphical Models is another standard reference.) It concerns using graphs (both directed and undirected) as aids to understanding simple (independence) structure in high-dimensional distributions of random variables $(X, Y, Z, \ldots)$ and in relating that structure to functional forms for corresponding densities.

The developers of WinBUGS recommend making a "directed graphical" version of every model one uses in the software, and the logic of how one naturally builds Bayes models is most easily related to directed graphs. So although concepts for undirected graphs (and how they represent independence relationships) are simpler than those for directed graphs, we will discuss the more complicated case (of directed graphs) first. But before doing even this, we make some observations about conditional independence.

5.1 Conditional Independence

Random variables $X$ and $Y$ are conditionally independent given $Z$, written

$$X \perp Y \mid Z$$

provided

$$f_{X,Y|Z}(x, y \mid z) = f_{X|Z}(x \mid z)\,f_{Y|Z}(y \mid z)$$

A basic result about conditional independence is that

$$X \perp Y \mid Z \iff f_{X|Y,Z}(x \mid y, z) = f_{X|Z}(x \mid z)$$

Conditional independence (like ordinary independence) has some important/useful properties/implications. Among these are

1. $X \perp Y \mid Z \Rightarrow Y \perp X \mid Z$

2. $X \perp Y \mid Z$ and $U = h(X) \Rightarrow U \perp Y \mid Z$

3. $X \perp Y \mid Z$ and $U = h(X) \Rightarrow X \perp Y \mid (Z, U)$

4. $X \perp Y \mid Z$ and $X \perp W \mid (Y, Z) \Rightarrow X \perp (W, Y) \mid Z$

5. $X \perp Y \mid Z$ and $X \perp Z \mid Y \Rightarrow X \perp (Y, Z)$

A possibly more natural (but equivalent) version of property 3 is

$$X \perp Y \mid Z \text{ and } U = h(X) \Rightarrow Y \perp (X, U) \mid Z$$
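These definitions are easy to exercise numerically for small discrete distributions. The sketch below builds a joint $f(x, y, z) = f(z)\,f(x \mid z)\,f(y \mid z)$ from arbitrary made-up probability tables (so $X \perp Y \mid Z$ holds by construction) and checks the equivalent condition $f_{X|Y,Z} = f_{X|Z}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# arbitrary made-up conditional tables on small finite ranges
nx, ny, nz = 3, 4, 2
f_z = rng.dirichlet(np.ones(nz))                     # f(z)
f_x_given_z = rng.dirichlet(np.ones(nx), size=nz).T  # f(x|z), axes (x, z)
f_y_given_z = rng.dirichlet(np.ones(ny), size=nz).T  # f(y|z), axes (y, z)

# joint built to satisfy X _||_ Y | Z: f(x,y,z) = f(z) f(x|z) f(y|z)
joint = np.einsum('z,xz,yz->xyz', f_z, f_x_given_z, f_y_given_z)

# check the equivalent condition f(x|y,z) = f(x|z)
f_yz = joint.sum(axis=0)                             # f(y,z)
f_x_given_yz = joint / f_yz                          # broadcasts over the x axis
ok = np.allclose(f_x_given_yz, f_x_given_z[:, None, :])
```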

A main goal of this material is representing large joint distributions in graphical ways that allow one to "see" conditional independence relationships in the graphs.

5.2 Directed Graphs and Joint Probability Distributions

A directed graph (that might potentially represent some aspects of the joint distribution of $(X, Y, Z, \ldots)$) consists of nodes (or vertices) $X, Y, Z, \ldots$ and arrows (or edges) pointing between some of them.

5.2.1 Some Graph-Theoretic Concepts

For a graph with nodes/vertices $X, Y, Z, \ldots$

1. if an arrow points from $X$ to $Y$ we will say that $X$ is a parent of $Y$ and that $Y$ is a child of $X$

2. a sequence of arrows beginning at $X$ and ending at $Y$ will be called a directed path from $X$ to $Y$

3. if $X = Y$ or there is a directed path from $X$ to $Y$, we will say that $X$ is an ancestor of $Y$ and $Y$ is a descendent of $X$

4. if an arrow pointing in either direction connects $X$ and $Y$ they will be said to be adjacent

5. a sequence of adjacent vertices starting at $X$ and ending at $Y$ without reference to direction of any of the arrows will be called an undirected path from $X$ to $Y$

6. an undirected path from $X$ to $Y$ has a collider at $Z$ if there are two arrows in the path pointing to $Z$

7. a directed path that starts and ends at the same vertex is called a cycle

8. a directed graph is acyclic if it has no cycles

As a matter of notation/shorthand, an acyclic directed graph is usually called a DAG (a directed acyclic graph), although the corresponding word order is not really as good as that corresponding to the unpronounceable acronym "ADG."

Example 1 A first DAG

In Figure 1, $X$ and $Y$ are adjacent. $X$ and $Z$ are not adjacent. $X$ is a parent of $Y$ and an ancestor of $W$. There is a directed path from $X$ to $W$ and an undirected path from $X$ to $Z$. $Y$ is a collider on the path $XYZ$ and is not a collider on the path $XYW$.


Figure 1: A First DAG

5.2.2 First Probabilistic Concepts and DAGs

For a vector of random variables and vertices $X = (X_1, X_2, \ldots, X_k)$ and a distribution $F$ for $X$, it is said that a DAG $G$ represents $F$ (or $F$ is Markov to $G$) if and only if

$$f_X(\boldsymbol{x}) = \prod_{i=1}^{k} f_{X_i \mid \text{parents}_i}(x_i \mid \text{parents}_i)$$

where

$$\text{parents}_i = \{\text{parents of } X_i \text{ in the DAG } G\}$$

Example 2 More on the first DAG

A joint distribution $F$ for $(X, Y, Z, W)$ is represented by the DAG pictured in Figure 1 if and only if

$$f_{X,Y,Z,W}(x, y, z, w) = f_X(x)\,f_Z(z)\,f_{Y|X,Z}(y \mid x, z)\,f_{W|Y}(w \mid y) \quad (50)$$

In WinBUGS there is the "Doodles" facility that allows one to input a model in terms of an associated DAG (augmented with information about specific forms of the conditional distributions). The joint distribution that is built by the software is then one represented by the Doodle DAG. Notice, for example, what a DAG tells one about how Gibbs sampling can be done. The DAG pictured in Figure 1 with guaranteed corresponding form (50) implies that when updating $X$ one samples from a distribution specified by

$$f_X(\cdot)\,f_{Y|X,Z}(y_{\text{current}} \mid \cdot\,, z_{\text{current}})$$

updating of $Z$ is similar, updating of $Y$ is done sampling from a distribution specified by

$$f_{Y|X,Z}(\cdot \mid x_{\text{current}}, z_{\text{current}})\,f_{W|Y}(w_{\text{current}} \mid \cdot)$$

and updating of $W$ is done by sampling from a distribution specified by

$$f_{W|Y}(\cdot \mid y_{\text{current}})$$
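For a small discrete version of this model (arbitrary made-up conditional tables), one can verify that the exact full conditional of $X$ given all the other variables is proportional to just the factors of the joint density that involve $X$, exactly as the DAG suggests:

```python
import numpy as np

rng = np.random.default_rng(4)
nx = ny = nz = nw = 3

# made-up ingredients of the factorization for the DAG X -> Y <- Z, Y -> W
f_x = rng.dirichlet(np.ones(nx))
f_z = rng.dirichlet(np.ones(nz))
f_y_given_xz = rng.dirichlet(np.ones(ny), size=(nx, nz))  # f(y|x,z), axes (x, z, y)
f_w_given_y = rng.dirichlet(np.ones(nw), size=ny)         # f(w|y),   axes (y, w)

# joint density with axes ordered (x, y, z, w)
joint = np.einsum('x,z,xzy,yw->xyzw', f_x, f_z, f_y_given_xz, f_w_given_y)

# exact full conditional of X at current values (y0, z0, w0)
y0, z0, w0 = 1, 2, 0
cond_exact = joint[:, y0, z0, w0] / joint[:, y0, z0, w0].sum()

# Gibbs update read off the DAG: proportional to f_X(x) f_{Y|X,Z}(y0 | x, z0)
cond_dag = f_x * f_y_given_xz[:, z0, y0]
cond_dag = cond_dag / cond_dag.sum()
```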

A condition equivalent to the Markov condition can be stated in terms of conditional independence relationships. That is, let $\tilde{X}_i$ stand for the set of all vertices $X_1, X_2, \ldots, X_k$ in a DAG $G$ except for the parents and descendents of $X_i$. Then

$$F \text{ is represented by } G \iff \text{for every vertex } X_i,\; X_i \perp \tilde{X}_i \mid \text{parents}_i \quad (51)$$

Example 3 Yet more on the first DAG

If a joint distribution $F$ for $(X, Y, Z, W)$ is represented by the DAG pictured in Figure 1, it follows that

$$X \perp Z \quad \text{and} \quad W \perp (X, Z) \mid Y$$

5.2.3 Some Additional Graph-Theoretic Concepts and More on Conditional Independence

Relationship (51) provides some conditional independence relationships implied by a DAG representation of a joint distribution $F$. Upon introducing some more machinery, other conditional independence relationships that will always hold for such $F$ can sometimes be identified. These can be helpful for thinking about the nature of a large joint distribution.

Example 4 A second, more complicated DAG

Figure 2: A Second DAG

Figure 2 provides a second example of a DAG. It follows from (51) that for $F$ represented by the DAG in Figure 2, all of the following conditional independence relationships hold:

1. $X_1 \perp X_2$

2. $X_2 \perp (X_1, X_4)$

3. $X_3 \perp X_4 \mid (X_1, X_2)$

4. $X_4 \perp (X_2, X_3) \mid X_1$

5. $X_5 \perp (X_1, X_2) \mid (X_3, X_4)$


But it is also true that

$$(X_4, X_5) \perp X_2 \mid (X_1, X_3)$$

and that with proper additional machinery, this relationship can be read from the DAG.

The basic new graph-theoretic concepts needed concern connectedness and separatedness of vertices on a DAG. For a particular DAG, $G$,

1. if $X$ and $Y$ are distinct vertices and $Q$ a set of vertices not containing either $X$ or $Y$, then we will say that $X$ and $Y$ are d-connected given $Q$ if there is an undirected path $P$ between $X$ and $Y$ such that

(a) every collider on $P$ has a descendent in $Q$, and

(b) no other vertex (besides possibly those mentioned in (a)) on $P$ is in $Q$.

2. if $X$ and $Y$ are not d-connected given $Q$, they are d-separated given $Q$.

3. if $A$, $B$, and $Q$ are non-overlapping sets of vertices, $A \neq \emptyset$ and $B \neq \emptyset$, then we will say that $A$ and $B$ are d-separated given $Q$ if every $X \in A$ and $Y \in B$ are d-separated given $Q$.

4. if $A$, $B$, and $Q$ are as in 3. and $A$ and $B$ are not d-separated given $Q$, then we will say that $A$ and $B$ are d-connected given $Q$.

Example 5 A third DAG (Example 17.9 of Wasserman)

Figure 3: A Third DAG

In the DAG shown in Figure 3

1. $X$ and $Y$ are d-separated given $\emptyset$.

2. $X$ and $Y$ are d-connected given $\{S_1, S_2\}$.

3. $X$ and $Y$ are d-connected given $\{U, W\}$.


4. $X$ and $Y$ are d-separated given $\{S_1, S_2, V\}$.

The relationship between these graph-theoretic concepts and conditional independence for vertices of a DAG is then as follows. For disjoint sets of vertices $A$, $B$, and $C$ of a DAG, $G$, that represents a joint distribution $F$,

$$A \perp B \mid C \iff A \text{ and } B \text{ are d-separated by } C \quad (52)$$

Example 6 More on the second DAG

Consider a joint distribution $F$ for $X_1, X_2, X_3, X_4$, and $X_5$ represented by the DAG shown in Figure 2. Take

$$A = \{X_4, X_5\},\quad B = \{X_2\},\quad \text{and } C = \{X_1, X_3\}$$

Then

1. $X_4$ and $X_2$ are d-separated given $C$,

2. $X_5$ and $X_2$ are d-separated given $C$, so

3. $A$ and $B$ are d-separated given $C$.

Thus by (52) one may conclude that

$$(X_4, X_5) \perp X_2 \mid (X_1, X_3)$$

as suggested earlier.

Figure 4: A Fourth DAG

Notice that in Figure 4, $X_3$ is a collider on the undirected path from $X_1$ to $X_2$, and $X_1$ and $X_2$ are d-connected given $X_3$. So in general, $X_1$ and $X_2$ will not be conditionally independent given $X_3$ for $F$ represented by the DAG. This should not surprise us, given our experience with Bayes analysis. For example, we know from Section 4.3.3 that in a Bayes model where $Y = (Y_1, Y_2, \ldots, Y_n)$ has components that are iid $\text{N}(\mu, \sigma^2)$ (conditioned on $(\mu, \sigma^2)$), even where a priori $\mu$ and $\sigma^2$ are assumed to be independent, the posterior $g(\mu, \sigma^2 \mid \boldsymbol{y})$ will typically NOT be one of (conditional) independence (given $Y = \boldsymbol{y}$). The DAG for this model is, of course, the version of Figure 4 shown in Figure 5.


Figure 5: DAG for the 2 Parameter Single Sample Normal Bayes Model

5.3 Undirected Graphs and Joint Probability Distributions

An undirected graph (that might potentially represent some aspects of the joint distribution of $(X, Y, Z, \ldots)$) consists of nodes (or vertices) $X, Y, Z, \ldots$ and edges between some of the possible pairs of vertices. (Formally, one might think of edges as vertex pairs.)

5.3.1 Some Graph-Theoretic Concepts

Some of the terminology introduced above for directed graphs carries over to undirected graphs. And there are also some important additional concepts. For a graph with nodes/vertices $X, Y, Z, \ldots$

1. two vertices $X$ and $Y$ are said to be adjacent if there is an edge between them, and this will here be symbolized as $X \sim Y$

2. a sequence of vertices $\{X_1, X_2, \ldots, X_n\}$ is a path if $X_i \sim X_{i+1}$ for each $i$

3. if $A$, $B$, and $C$ are disjoint sets of vertices, we will say that $C$ separates $A$ and $B$ provided every path from a vertex $X \in A$ to a vertex $Y \in B$ contains an element of $C$

4. a clique is a set of vertices of a graph that are all adjacent to each other

5. a clique is maximal if it is not possible to add another vertex to it and still have a clique

Example 7 A first undirected graph

Figure 6: A First Undirected Graph

In Figure 6, $X_1$, $X_2$, and $X_3$ are vertices, and there is one edge connecting $X_1$ and $X_3$ and another connecting $X_2$ and $X_3$.


Example 8 A second undirected graph

Figure 7: A Second Undirected Graph

In Figure 7

1. $\{X_1, X_3\}$ and $\{X_4\}$ are separated by $\{X_2\}$

2. $\{X_3\}$ and $\{X_4\}$ are separated by $\{X_2\}$

3. $\{X_1, X_2\}$ is a clique

4. $\{X_1, X_2, X_3\}$ is a maximal clique

5.3.2 Some Probabilistic Concepts and Undirected Graphs

Suppose that $F$ is a joint distribution for $X_1, X_2, \ldots, X_k$. For each $i$ and $j$ let $\bar{X}_{ij}$ stand for all elements of $\{X_1, X_2, \ldots, X_k\}$ except elements $i$ and $j$. We may associate with $F$ a pairwise Markov graph $G$ by

failing to connect $X_i$ and $X_j$ with an edge if and only if $X_i \perp X_j \mid \bar{X}_{ij}$

A pairwise Markov graph for $F$ can in theory be made by considering only $\binom{k}{2}$ pairwise conditional independence questions. But as it turns out, many other conditional independence relationships can be read from it. That is, it turns out that if $G$ is a pairwise Markov graph for $F$, then for non-overlapping sets of vertices $A$, $B$, and $C$,

$$C \text{ separates } A \text{ and } B \Rightarrow A \perp B \mid C \quad (53)$$

Example 9 A third undirected graph and conditional independence

Figure 8: A Pairwise Markov (Undirected) Graph for F


If Figure 8 is a pairwise Markov graph for a distribution $F$ for $X_1, X_2, \ldots, X_5$, we may conclude from (53) that

$$(X_1, X_2, X_5) \perp (X_3, X_4) \quad \text{and} \quad X_2 \perp X_5 \mid X_1$$

Property (53) says that for a pairwise Markov (undirected) graph for $F$, separation implies conditional independence. Condition (52) says that for a DAG representing $F$, d-separation is equivalent to conditional independence. A natural question is whether the forward implication in (53) might be strengthened to equivalence. As it turns out, this is possible, as follows. For $F$ a joint distribution for $X_1, X_2, \ldots, X_k$ and $G$ an undirected graph, we will say that $F$ is globally $G$ Markov provided for non-overlapping sets of vertices $A$, $B$, and $C$

$$C \text{ separates } A \text{ and } B \iff A \perp B \mid C$$

Then, as it turns out,

$$F \text{ is globally } G \text{ Markov} \iff G \text{ is a pairwise Markov graph associated with } F$$

so that separation on a pairwise Markov graph is equivalent to conditional independence.

Example 10 A fourth undirected graph and conditional independence

Figure 9: A Second (Undirected) Pairwise Markov Graph

Whether one thinks of Figure 9 as a pairwise Markov graph $G$ associated with $F$ or thinks of $F$ as globally $G$ Markov, it follows (for example) that

$$X_1 \perp X_3 \mid X_2 \quad \text{and} \quad X_1 \perp X_4 \mid X_2$$

There remains to consider what connections there might be between an undirected graph related to $F$ and a functional form for $F$. It turns out that, subject to some other (here unspecified) technical conditions, a distribution $F$ for $X = (X_1, X_2, \ldots, X_k)$ is globally $G$ Markov if and only if there are positive functions $\psi_C$ such that

$$f_X(\boldsymbol{x}) \propto \prod_{C \in \mathcal{C}} \psi_C(C)$$

where $\mathcal{C}$ is the set of maximal cliques associated with $G$. (Any vertices that share no edges get their own individual factors in this kind of product.)

Example 11 (Example 18.7 of Wasserman) Another undirected graph and the form of $f_X$


Figure 10: Another (Undirected) Pairwise Markov Graph

The set of maximal cliques associated with the undirected graph $G$ in Figure 10 is

$$\mathcal{C} = \{\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_5, X_6\}, \{X_2, X_4\}, \{X_3, X_5\}\}$$

So (subject to some technical conditions) $F$ is globally $G$ Markov if and only if

$$f_X(\boldsymbol{x}) \propto \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{24}(x_2, x_4)\,\psi_{35}(x_3, x_5)\,\psi_{256}(x_2, x_5, x_6)$$

for some positive functions $\psi_{12}, \psi_{13}, \psi_{24}, \psi_{35}$, and $\psi_{256}$.
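This clique bookkeeping can be checked mechanically. Assuming Figure 10 has precisely the edges implied by the clique list above (an inference from the text, since the figure itself is not reproduced here), a brute-force enumeration over vertex subsets recovers $\mathcal{C}$:

```python
from itertools import combinations

# edges of the (assumed) Figure 10 graph, vertices numbered 1..6
edges = {(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (5, 6), (3, 5)}
vertices = range(1, 7)

def is_clique(s):
    """Every pair of vertices in s must be adjacent."""
    return all(tuple(sorted(p)) in edges for p in combinations(s, 2))

cliques = [set(s) for r in range(1, 7)
           for s in combinations(vertices, r) if is_clique(s)]

# a clique is maximal if no other clique strictly contains it
maximal = [c for c in cliques if not any(c < d for d in cliques)]

expected = [{1, 2}, {1, 3}, {2, 5, 6}, {2, 4}, {3, 5}]
```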

6 The Practice of Bayes Inference 3: (Mostly) Multi-Sample Models

Essentially everything that is done in M.S.-level statistical methods courses like Stat 500 and Stat 511 (and more besides) can be recast in a Bayes framework and addressed using the kind of methods discussed thus far in this outline. We proceed to indicate how some of these analyses can be built.

6.1 Two-Sample Normal Models (and Some Comments on "Nested" Models)

Several versions of two-sample univariate normal models are possible. That is, suppose that observable $\boldsymbol{Y} = (\boldsymbol{Y}_1, \boldsymbol{Y}_2)$ consists of iid univariate $\text{N}(\mu_1, \sigma_1^2)$ variables $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ independent of iid univariate $\text{N}(\mu_2, \sigma_2^2)$ variables $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$. The joint pdf of $\boldsymbol{Y}$ is then

$$f\left(\boldsymbol{y} \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2\right) = \left(\frac{1}{\sqrt{2\pi\sigma_1^2}}\right)^{n_1}\exp\left(-\frac{\sum_{j=1}^{n_1}(y_{1j} - \mu_1)^2}{2\sigma_1^2}\right)\left(\frac{1}{\sqrt{2\pi\sigma_2^2}}\right)^{n_2}\exp\left(-\frac{\sum_{j=1}^{n_2}(y_{2j} - \mu_2)^2}{2\sigma_2^2}\right) \quad (54)$$

Depending then upon what one wishes to assume about the 4 parameters $\mu_1, \mu_2, \sigma_1^2$, and $\sigma_2^2$, there are submodels of the full model specified by this joint density (54) that might be considered. There is the full 4-parameter model that we will here term model $\mathcal{M}_1$ with likelihood

$$L\left(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2\right) = f\left(\boldsymbol{y} \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2\right)$$

It is fairly common to make the model assumption that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, thereby producing a 3-parameter model that we will call model $\mathcal{M}_2$ with likelihood

$$L\left(\mu_1, \mu_2, \sigma^2\right) = f\left(\boldsymbol{y} \mid \mu_1, \mu_2, \sigma^2, \sigma^2\right)$$

In both models $\mathcal{M}_1$ and $\mathcal{M}_2$, primary interest usually centers on how $\mu_1$ and $\mu_2$ compare. The assumption $\mu_1 = \mu_2$ imposed on $\mu_1$ and $\mu_2$ in model $\mathcal{M}_2$ produces the one-sample univariate normal model of Section 4.3 that we might call model $\mathcal{M}_3$.

"Obvious" priors for $\left(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2\right)$ in model $\mathcal{M}_1$ or $\left(\mu_1, \mu_2, \sigma^2\right)$ in model $\mathcal{M}_2$ can be built using the pieces introduced in Section 4.3. In particular, priors (improper or proper) of "independence" (of product form) for the parameters seem attractive/simple, where

1. means are a priori either "iid" uniform on $\Re$ (or some large interval) or are iid normal (typically with large variance), and

2. log variance(s) is (are) a priori either uniform on $\Re$ (or some large interval) or variance(s) is (are) inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

Models $\mathcal{M}_1$, $\mathcal{M}_2$, and $\mathcal{M}_3$ are nested. But under priors like those just suggested for $\mathcal{M}_1$, $\mathcal{M}_2$ (and therefore $\mathcal{M}_3$) has prior and posterior probability 0. So a Bayes "test of $\mathcal{M}_2$ in model $\mathcal{M}_1$" can never decide in favor of "exactly $\mathcal{M}_2$." Similarly, under priors like those just suggested for $\mathcal{M}_2$, $\mathcal{M}_3$ has prior and posterior probability 0. So a Bayes "test of $\mu_1 = \mu_2$ in model $\mathcal{M}_2$" can never decide in favor of "exactly $\mu_1 = \mu_2$."

This situation is a simple illustration of the fact that in a Bayesian context, rational consideration of whether a lower-dimensional submodel of the working model is plausible must typically be done by either explicitly placing positive probability on the submodel or by taking some other approach. The Bayes factors of Section 3.5 can be employed. Or, sticking strictly to calculations with the working model, one can assess the posterior probability "near" the submodel.

Take for explicit example the case of working model $\mathcal{M}_2$ and submodel $\mathcal{M}_3$. If one wants to allow for positive posterior probability to be assigned to the submodel, one will need to do something like assign prior probability $p$ to the working model and then a prior distribution for $\mu_1, \mu_2$, and $\sigma^2$ in the working model, together with prior probability $1 - p$ to the submodel and then a prior distribution for $\mu = \mu_1 = \mu_2$ and $\sigma^2$ in the submodel. Lacking this kind of explicit weighting of $\mathcal{M}_2$ and $\mathcal{M}_3$, one might find a Bayes factor for comparing Bayes models for $\mathcal{M}_2$ and $\mathcal{M}_3$. Or, working entirely within model $\mathcal{M}_2$, one might simply find a posterior distribution of $\mu_1 - \mu_2$ and investigate how much posterior probability for this variable there is near the value $\mu_1 - \mu_2 = 0$.
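The last approach is straightforward by simulation. Under the improper prior $g(\mu_1, \mu_2, \sigma^2) \propto 1/\sigma^2$ in $\mathcal{M}_2$, a standard calculation (analogous to the one-sample results of Section 4.3) gives $\sigma^2 \mid \boldsymbol{y}$ a scaled inverse $\chi^2$ distribution and the means conditionally independent normals given $\sigma^2$. The sketch below (made-up summary statistics) uses this to estimate the posterior probability that $\mu_1 - \mu_2$ is near 0:

```python
import numpy as np

rng = np.random.default_rng(5)

# made-up summary statistics for the two samples
n1, ybar1, s1sq = 15, 10.2, 4.0
n2, ybar2, s2sq = 12, 9.1, 3.5

nu = n1 + n2 - 2
sp_sq = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / nu   # pooled sample variance

# posterior sampling under g(mu1, mu2, sigma^2) proportional to 1/sigma^2:
# sigma^2 | y ~ nu * sp_sq / chi^2_nu, then mu_i | sigma^2, y ~ N(ybar_i, sigma^2/n_i)
m = 50000
sigma_sq = nu * sp_sq / rng.chisquare(nu, size=m)
mu1 = rng.normal(ybar1, np.sqrt(sigma_sq / n1))
mu2 = rng.normal(ybar2, np.sqrt(sigma_sq / n2))

diff = mu1 - mu2
prob_near_zero = np.mean(np.abs(diff) < 0.5)   # posterior probability "near" mu1 - mu2 = 0
```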


6.2 r-Sample Normal Models

This is the natural generalization of the two-sample normal model just discussed. $\boldsymbol{Y} = (\boldsymbol{Y}_1, \boldsymbol{Y}_2, \ldots, \boldsymbol{Y}_r)$ is assumed to consist of $r$ independent vectors, where $\boldsymbol{Y}_i = (Y_{i1}, Y_{i2}, \ldots, Y_{in_i})$ has iid $\text{N}(\mu_i, \sigma_i^2)$ components. The joint pdf for $\boldsymbol{Y}$ is then

$$f\left(\boldsymbol{y} \mid \mu_1, \ldots, \mu_r, \sigma_1^2, \ldots, \sigma_r^2\right) = \prod_{i=1}^{r}\left(\frac{1}{\sqrt{2\pi\sigma_i^2}}\right)^{n_i}\exp\left(-\frac{\sum_{j=1}^{n_i}(y_{ij} - \mu_i)^2}{2\sigma_i^2}\right)$$

and the most commonly used version of this model is one where one assumes that $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_r^2 = \sigma^2$ and thus has a model with $r + 1$ parameters and likelihood

$$L\left(\mu_1, \ldots, \mu_r, \sigma^2\right) = f\left(\boldsymbol{y} \mid \mu_1, \ldots, \mu_r, \sigma^2, \ldots, \sigma^2\right)$$

Exactly as in the two-sample case, "obvious" priors for $\left(\mu_1, \ldots, \mu_r, \sigma^2\right)$ can be built using the pieces introduced in Section 4.3. In particular, priors (improper or proper) of "independence" (of product form) for the parameters seem attractive/simple, where

1. means are a priori either "iid" uniform on $\Re$ (or some large interval) or are iid normal (typically with large variance), and

2. $\ln \sigma^2$ is a priori either uniform on $\Re$ (or some large interval) or $\sigma^2$ is inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

(Of course, if one doesn't wish to make the constant variance assumption, it is possible to use independent priors of the type in 2. above for $r$ different variances.)

6.3 Normal Linear Models (Regression Models)

The first half of Stat 511 concerns statistical analysis based on the linear model

$$\underset{n \times 1}{\boldsymbol{Y}} = \underset{n \times k}{\boldsymbol{X}}\,\underset{k \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\epsilon}}$$

for $\boldsymbol{\epsilon} \sim \text{MVN}_n\left(\boldsymbol{0}, \sigma^2 I\right)$ and known matrix $\boldsymbol{X}$ (that for present purposes we will assume has full rank). This implies that $\boldsymbol{Y} \sim \text{MVN}_n\left(\boldsymbol{X\beta}, \sigma^2 I\right)$, so that this model has parameters $\boldsymbol{\beta}$ and $\sigma^2$ and likelihood

$$L\left(\boldsymbol{\beta}, \sigma^2\right) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n}\exp\left(-\frac{1}{2\sigma^2}\left(\boldsymbol{y} - \boldsymbol{X\beta}\right)'\left(\boldsymbol{y} - \boldsymbol{X\beta}\right)\right)$$

(With proper choice of $\boldsymbol{X}$ this is well known to include the constant variance cases of the two- and $r$-sample normal models just discussed.)

The most obvious priors for $\left(\boldsymbol{\beta}, \sigma^2\right)$ are of a product/independence form, where


1. $\boldsymbol{\beta}$ is either uniform on $\Re^k$ (or some large $k$-dimensional rectangle) or is $\text{MVN}_k$ (typically with large covariance matrix), and

2. $\ln \sigma^2$ is a priori either uniform on $\Re$ (or some large interval) or $\sigma^2$ is inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

The textbook in Section 14.2 considers the case of the improper prior

$$g\left(\boldsymbol{\beta}, \sigma^2\right) \propto \frac{1}{\sigma^2} \quad (55)$$

and argues that with this choice, the conditional distribution of $\sigma^2 \mid \boldsymbol{Y} = \boldsymbol{y}$ is

$$\text{Inv-}\chi^2\left(n - k, s^2\right)$$

where

$$s^2 = \frac{1}{n - k}\left(\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}\right)'\left(\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\beta}}\right)$$

for $\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1}\boldsymbol{X}'\boldsymbol{y}$ the least squares estimate of $\boldsymbol{\beta}$. ($s^2$ is the usual linear model estimate of $\sigma^2$.) Further, the conditional distribution of $\left(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\right) \mid \boldsymbol{Y} = \boldsymbol{y}$ is multivariate t. That is, the posterior distribution of $\boldsymbol{\beta}$ is that of

$$\hat{\boldsymbol{\beta}} + s\left(\left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1}\right)^{1/2}\sqrt{\frac{n-k}{W}}\,Z$$

where $Z$ is a $k$-vector of iid $\text{N}(0,1)$ random variables, independent of $W \sim \chi^2_{n-k}$, and $\left(\left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1}\right)^{1/2}$ is a matrix square root of $\left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1}$.

Further, if $\boldsymbol{x}_{\text{new}}$ is $k \times 1$ and one has not yet observed

$$y_{\text{new}} = \boldsymbol{x}_{\text{new}}'\boldsymbol{\beta} + \epsilon_{\text{new}}$$

for $\epsilon_{\text{new}}$ independent of $\underset{n \times 1}{\boldsymbol{\epsilon}}$ with mean 0 and variance $\sigma^2$, one might consider the posterior predictive distribution of $y_{\text{new}}$ based on the improper prior (55). As it turns out, the posterior predictive distribution is that of

$$\boldsymbol{x}_{\text{new}}'\hat{\boldsymbol{\beta}} + s\sqrt{\boldsymbol{x}_{\text{new}}'\left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1}\boldsymbol{x}_{\text{new}} + 1}\;T$$

for $T \sim t_{n-k}$.

The upshot of all this is that Bayes credible intervals for model parameters (and linear combinations of elements of the vector $\boldsymbol{\beta}$) and future observations based on the improper prior (55) for any full rank linear model are the same as confidence intervals based on the usual Stat 511 linear model theory. (This, of course, is true for the (constant variance) one-, two-, and $r$-sample normal models, as they are instances of this model.) A clear advantage of taking this Bayes point of view is that beyond the "ordinary" inference formulas of Stat 511, one can easily simulate the posterior distribution of any parametric function $h\left(\boldsymbol{\beta}, \sigma^2\right)$ and provide credible sets for this.
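Such a simulation takes only a few lines. The sketch below (made-up full-rank design and data) draws from the posterior of $\boldsymbol{\beta}$ under (55) via the multivariate t representation above and produces a credible interval for an example linear combination:

```python
import numpy as np

rng = np.random.default_rng(6)

# made-up full-rank regression data
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - k))     # usual linear model estimate of sigma

# posterior draws: beta_hat + s (X'X)^{-1/2} sqrt((n-k)/W) Z, with W ~ chi^2_{n-k}
L = np.linalg.cholesky(XtX_inv)          # one matrix square root of (X'X)^{-1}
m = 20000
Z = rng.standard_normal((m, k))
W = rng.chisquare(n - k, size=m)
beta_draws = beta_hat + s * (np.sqrt((n - k) / W)[:, None] * Z) @ L.T

# equal-tail 95% credible interval for, say, h(beta) = beta_1 + beta_2
h = beta_draws[:, 1] + beta_draws[:, 2]
ci = np.percentile(h, [2.5, 97.5])
```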


6.4 One-Way Random Effects Models

The standard $r$-sample normal model of Section 6.2 is often written in the form

$$Y_{ij} = \mu_i + \epsilon_{ij} \quad (56)$$

where the $\epsilon_{ij}$ are iid $\text{N}(0, \sigma^2)$ (and as before, $\mu_1, \ldots, \mu_r, \sigma^2$ are unknown parameters). The one-way random effects model treats $\mu_1, \ldots, \mu_r$ as unobservable iid random draws from a $\text{N}(\mu, \sigma_\mu^2)$ distribution, producing a model with parameters

$$\mu, \sigma_\mu^2, \text{ and } \sigma^2 \quad (57)$$

Considering only the observables $Y_{ij}$, the joint distribution of these is multivariate normal, where the mean vector is $\mu\boldsymbol{1}$ and the covariance matrix has diagonal entries

$$\text{Var}\,Y_{ij} = \sigma_\mu^2 + \sigma^2$$

and off-diagonal elements

$$\text{Cov}\left(Y_{ij}, Y_{ij'}\right) = \sigma_\mu^2 \;(\text{if } j \neq j') \quad \text{and} \quad \text{Cov}\left(Y_{ij}, Y_{i'j'}\right) = 0 \;(\text{if } i \neq i')$$
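For a balanced layout, the implied covariance matrix of the stacked observables can be written down and checked entry by entry (illustrative variance components):

```python
import numpy as np

# illustrative variance components and balanced layout
sigma_mu_sq, sigma_sq = 2.0, 1.0
r, n_per = 3, 4                          # r groups, n_per observations each

# Cov of the stacked Y_ij: block diagonal, each block sigma_mu_sq * J + sigma_sq * I
block = sigma_mu_sq * np.ones((n_per, n_per)) + sigma_sq * np.eye(n_per)
cov = np.kron(np.eye(r), block)

# entries match the formulas above:
# diagonal = sigma_mu_sq + sigma_sq; within-group off-diagonal = sigma_mu_sq;
# between-group entries = 0
```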

However, particularly for purposes of setting up a WinBUGS simulation from a posterior distribution, it is often very convenient to use the unobservable auxiliary variables $\mu_1, \ldots, \mu_r$. (See Section 3.7 regarding the use of auxiliary variables.) Further, just as in classical treatments of this model that often include prediction of the random effects, there may be independent interest in the realized but unobservable values $\mu_1, \ldots, \mu_r$.

Whether one models only in terms of the observables or includes the unobservable $\mu_1, \ldots, \mu_r$, ultimately the one-way random effects model has only the 3 parameters (57). Once again, choices of joint priors (improper or proper) of "independence" (of product form) for the parameters seem attractive/simple, where

1. $\mu$ is a priori either uniform on $\Re$ (or some large interval) or is normal (typically with large variance), and

2. $\ln \sigma_\mu^2$ and $\ln \sigma^2$ are a priori either uniform on $\Re$ (or some large interval) or variances $\sigma_\mu^2$ and $\sigma^2$ are inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

The formal connection (56) between the one-way random effects model here and the $r$-sample normal model in Section 6.2 invites consideration of how the present modeling might be appropriate in the earlier $r$-sample (fixed effects) context. Formally, a Bayes version of the present one-way random effects model with unobservable auxiliary variables $\mu_1, \ldots, \mu_r$ might be thought of as an alternative Bayes model for the $r$-sample normal situation, where instead of the prior of independence for the $r$ means, one uses a prior of conditional independence given parameters $\mu$ and $\sigma_\mu^2$, and puts priors on these. This kind of modeling might be termed use of a two-stage prior or hyper-prior in the fixed effects model. (The values $\mu$ and $\sigma_\mu^2$, as parameters of the first-level prior on the $r$ means that themselves get prior distributions, are termed hyper-parameters in this kind of language.) The ultimate effect of such modeling is that instead of making the $r$ means a priori independent, they are dependent. Posterior means for $\mu_1, \ldots, \mu_r$ tend to be shrunken from the $r$ sample means toward a common compromise value representing one's perception of $\mu$ (whereas if a proper normal prior for them is used in the style of the discussion of Section 6.2, the shrinking is towards the known prior mean).

This discussion should not be allowed to obscure the basic fact that the "data models," corresponding parameters, and likelihoods here and in Section 6.2 are fundamentally different, a point that is very important to philosophical anti-Bayesians. (It is much less important to most Bayesians, for whom the distinction between unknown parameters and unobserved auxiliary variables is quite unimportant, if not completely artificial.) In the $r$-sample model there are $r$ unknown means and one unknown variance parameter. Here there is a single unknown mean and two unknown variance parameters.

6.5 Hierarchical Models (Normal and Others)

The kind of hierarchy of parameters and auxiliary variables just illustrated in the one-way random effects model can be generalized/extended in at least two directions. First, more levels of hierarchy can be appropriate. Second, the conditional distributions involved can be other than normal. This section provides a small introduction to these possibilities.

We consider a context where (more or less in "tree diagram fashion") each

level of some factor A gives rise to levels (peculiar to the given level of A) of a factor B, which in turn each gives rise to levels (peculiar to the given level of B within A) of a factor C, etc., and at the end of each branch there is some kind of observation. For example, heats of steel (A) could be poured into ingots (B), which are in turn cut into specimens (C), on which carbon content is measured. Or work weeks (A) have days (B), which have in them hours of production (C), in which items (D) are produced and subjected to some final product test like a "blemish count." Notice that in the first of these examples, a normal measurement (of carbon content) might ultimately be made, while in the second, a Poisson model for each blemish count might be appropriate.

To be slightly more concrete, let us consider a hierarchical situation involving

factors A, B, and C, with (possibly multivariate)

$$Y_{ijk} = \text{the data observed at level } k \text{ of factor C within level } j \text{ of factor B within level } i \text{ of factor A}$$

A hierarchical model for the entire set of observables is then constructed as follows. Suppose that the distribution of $Y_{ijk}$ depends upon some parameter $\gamma_{ijk}$ and possibly a parameter $c$, and that conditional on the $\gamma_{ijk}$'s and $c$, the $Y_{ijk}$'s are independent. Then, in the obvious (abused) notation, a conditional


density for the observables becomes

$$f(y \mid \gamma, c) = \prod_{i,j,k} f\left(y_{ijk} \mid \gamma_{ijk}, c\right)$$

Then we suppose that for some parameters $\lambda_{ij}$ and possibly a parameter $b$, the $\gamma_{ijk}$'s are conditionally independent, the distribution of each $\gamma_{ijk}$ governed by its $\lambda_{ij}$ and $b$. That is, the conditional density for the $\gamma_{ijk}$'s becomes (again in obvious notation)

$$f(\gamma \mid \lambda, b) = \prod_{i,j,k} f\left(\gamma_{ijk} \mid \lambda_{ij}, b\right)$$

And then we suppose that for some parameters $\mu_i$ and possibly a parameter $a$, the $\lambda_{ij}$'s are conditionally independent, the distribution of each $\lambda_{ij}$ governed by its $\mu_i$ and $a$. So the conditional density for the $\lambda_{ij}$'s becomes

$$f(\lambda \mid \mu, a) = \prod_{i,j} f\left(\lambda_{ij} \mid \mu_i, a\right)$$

Finally, we suppose that conditional on a parameter vector $\delta$, the $\mu_i$ are conditionally independent. So the conditional density for the $\mu_i$'s becomes

$$f(\mu \mid \delta) = \prod_i f(\mu_i \mid \delta)$$

The joint density for all of the $Y_{ijk}$'s, $\gamma_{ijk}$'s, $\lambda_{ij}$'s, and $\mu_i$'s is then

$$f(y, \gamma, \lambda, \mu \mid c, b, a, \delta) = f(y \mid \gamma, c)\, f(\gamma \mid \lambda, b)\, f(\lambda \mid \mu, a)\, f(\mu \mid \delta) \tag{58}$$

Notice that this form is consistent with a directed graph representing the joint distribution of the $Y_{ijk}$'s, $\gamma_{ijk}$'s, $\lambda_{ij}$'s, and $\mu_i$'s where

1. each $Y_{ijk}$ has parent $\gamma_{ijk}$

2. each $\gamma_{ijk}$ has parent $\lambda_{ij}$

3. each $\lambda_{ij}$ has parent $\mu_i$

This is illustrated in the small example in Figure 11.

The hierarchical form indicated in (58) has parameter $\delta$ (and possibly parameters $a$, $b$, and $c$). A Bayes analysis of a hierarchical data structure then requires specifying a prior for $\delta$ and, if relevant, $a$, $b$, and $c$. This would put $\delta$ onto a directed graph like that in Figure 11 as a parent of all $\mu_i$'s, $a$ as a parent of all $\lambda_{ij}$'s, $b$ as a parent of all $\gamma_{ijk}$'s, and $c$ as a parent of all $Y_{ijk}$'s. The Bayes modeling thus breaks the independence of the two main branches of the directed graph in Figure 11 and makes all of the data relevant in inferences about all of the quantities represented on the figure and all the parameters.


Figure 11: A Small Hierarchical Structure and Directed Graph
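A small simulation sketch of an all-normal version of this structure may help fix ideas. The factor sizes, the normality at every level, the Greek-letter naming, and all numeric values below are invented purely for illustration; the point is only that each level is drawn conditionally on its parent, and the log of a joint density of the form (58) is a sum of one term per node of the tree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented all-normal example: 2 levels of A, 2 of B within each A,
# 3 of C within each B; delta and the level spreads are fixed constants
# here rather than being given priors.
delta = 0.0
a_sd, b_sd, c_sd, obs_sd = 1.0, 0.5, 0.25, 0.1

mu = rng.normal(delta, a_sd, size=2)                     # mu_i | delta
lam = rng.normal(mu[:, None], b_sd, size=(2, 2))         # lambda_ij | mu_i
gam = rng.normal(lam[:, :, None], c_sd, size=(2, 2, 3))  # gamma_ijk | lambda_ij
y = rng.normal(gam, obs_sd)                              # Y_ijk | gamma_ijk

def norm_logpdf(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m) ** 2 / (2 * s**2)

# log of the joint density in the style of (58): one factor per node
log_joint = (norm_logpdf(y, gam, obs_sd).sum()
             + norm_logpdf(gam, lam[:, :, None], c_sd).sum()
             + norm_logpdf(lam, mu[:, None], b_sd).sum()
             + norm_logpdf(mu, delta, a_sd).sum())
assert np.isfinite(log_joint)
assert y.shape == (2, 2, 3)
```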

6.6 Mixed Linear Models (in General) (and Other MVN Models With Patterned Means and Covariance Matrices)

The normal one-way random effects model of Section 6.4 is not only a special case of the hierarchical modeling just discussed in Section 6.5, it is a special case of the so-called mixed linear model of Stat 511. That is, it is also a special case of a model that is usually represented as

$$\underset{n\times 1}{Y} = \underset{n\times k}{X}\,\underset{k\times 1}{\beta} + \underset{n\times q}{Z}\,\underset{q\times 1}{u} + \underset{n\times 1}{\epsilon}$$

where $X$ and $Z$ are known matrices, $\beta$ is a parameter vector and

$$\begin{pmatrix} u \\ \epsilon \end{pmatrix} \sim \text{MVN}_{q+n}\left(0, \begin{pmatrix} \underset{q\times q}{G} & \underset{q\times n}{0} \\ \underset{n\times q}{0} & \underset{n\times n}{R} \end{pmatrix}\right)$$

from which $Y$ is multivariate normal with

$$EY = X\beta \quad\text{and}\quad \text{Var}\,Y = ZGZ' + R \equiv V$$

In typical applications of this model, the covariance matrix $V$ is a patterned function of several variance components, say $\sigma^2 = \left(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2\right)$, and we might then write $V(\sigma^2)$. This then produces a likelihood based on the multivariate normal density

$$L(\beta, \sigma^2) \propto \left|\det V(\sigma^2)\right|^{-1/2} \exp\left(-\frac{1}{2}(y - X\beta)'\, V(\sigma^2)^{-1}\, (y - X\beta)\right)$$

As in Section 6.3 the most obvious priors for $(\beta, \sigma^2)$ are of a product/independence form, where


1. $\beta$ is either uniform on $\Re^k$ (or some large $k$-dimensional rectangle) or is $\text{MVN}_k$ (typically with large covariance matrix), and

2. each $\ln \sigma_i^2$ is a priori either uniform on $\Re$ (or some large interval) or $\sigma_i^2$ is inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

Notice that although only $Y$ is observable, just as noted in the specific mixed model of Section 6.4, there may be good reasons to include the vector of random effects $u$ in a posterior simulation. There may be independent interest in these. And since common models for $u$ make its components independent (and thus $G$ diagonal) and simply assemble linear combinations of these in the definition of the entries of $Y$, the coding of a model that includes these variables may be operationally much simpler than coding of the multivariate normal form for $Y$ alone.

One way to look at the mixed linear model is that it is a multivariate normal model with both mean vector and covariance matrix that are parametric functions. There are problems where one observes one or more multivariate normal vectors that don't fit the completely-unrestricted-mean-and-covariance-matrix context of Section 4.4 or the linear-mean-and-patterned-function-of-variances context of mixed linear models. Instead, for some parameter vector $\theta$ and parametric forms for a mean vector $\mu(\theta)$ and covariance matrix $\Sigma(\theta)$, one observes $n \ge 1$ iid multivariate normal vectors $Y_1, Y_2, \ldots, Y_n$ and has likelihood

$$L(\theta) \propto |\det \Sigma(\theta)|^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu(\theta))'\, \Sigma(\theta)^{-1}\, (y_i - \mu(\theta))\right)$$

Consideration of the particulars of the situation being modeled and physical meanings of the coordinates of $\theta$ can then sometimes be called on to produce a plausible prior for $\theta$ and then a Bayes analysis.
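A minimal numeric sketch of this likelihood evaluation, writing the one-way random effects model of Section 6.4 in the $Y = X\beta + Zu + \epsilon$ form, may be useful. The group structure and the variance-component values below are invented, and the variance components are held fixed rather than given priors; the point is only the evaluation of the patterned-covariance multivariate normal likelihood.

```python
import numpy as np

# One-way random effects written as Y = X beta + Z u + e:
# r = 3 groups of m = 4 observations each; all numeric values invented.
r, m = 3, 4
n = r * m
X = np.ones((n, 1))
Z = np.kron(np.eye(r), np.ones((m, 1)))      # maps group effects into Y
sig2_u, sig2 = 2.0, 1.0
G = sig2_u * np.eye(r)
R = sig2 * np.eye(n)
V = Z @ G @ Z.T + R                          # patterned Var Y = Z G Z' + R

rng = np.random.default_rng(1)
beta = np.array([5.0])
y = rng.multivariate_normal((X @ beta).ravel(), V)

def mvn_loglik(y, mean, V):
    # log of the multivariate normal likelihood, up to an additive constant
    dev = y - mean
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (logdet + dev @ np.linalg.solve(V, dev))

ll = mvn_loglik(y, (X @ beta).ravel(), V)
assert np.isfinite(ll)
assert np.allclose(V, V.T)                   # V is a valid covariance pattern
```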

6.7 Non-Linear Regression Models, etc.

A natural generalization of the linear model discussed in Section 6.3 (and a special case of the parametric mean vector and covariance matrix multivariate normal inference problem just alluded to) is a model where the means of $n$ independent univariate normal observations $y_i$ depend upon corresponding $k$-vectors of predictors $x_i$ and some parameter vector $\beta$ through a function $m(x; \beta)$, and the variances are some constant $\sigma^2$. This is usually written as

$$y_i = m(x_i; \beta) + \epsilon_i$$

where the $\epsilon_i$ are iid $N(0, \sigma^2)$. This produces a likelihood that is

$$L(\beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - m(x_i; \beta)\right)^2\right)$$


(The usual normal linear model is the case where $x$ and $\beta$ have the same dimension and $m(x_i; \beta) = x_i'\beta$.)

For a Bayes analysis in this context, a prior distribution is needed for $(\beta, \sigma^2)$. A product (independence between $\beta$ and $\sigma^2$) form seems most obvious, where

1. consideration of the particulars of the situation being modeled and physical meanings of the coordinates of $\beta$ can then sometimes be called on to produce a plausible prior for $\beta$, and

2. $\ln \sigma^2$ is a priori either uniform on $\Re$ (or some large interval) or $\sigma^2$ is inverse gamma, i.e. scaled inverse $\chi^2$ (typically with small degrees of freedom).

The main point here is that operationally, where non-Bayesian analyses for the linear and non-linear regression models are quite different (for example, different software and theory), Bayes analyses for the linear and non-linear regression models are not substantially different.

Notice too that mixed effects versions of non-linear regression models are available by assuming that $\epsilon$ is a $\text{MVN}_n\left(0, V(\sigma^2)\right)$ vector of random effects with patterned covariance matrix $V(\sigma^2)$ depending upon a vector of variance components $\sigma^2 = \left(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2\right)$. The parameters for which one needs to specify a prior are $\beta$ and $\sigma^2$, and this is a special case of the "Other MVN Models With Patterned Means and Covariance Matrices" discussion of the previous section.
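To illustrate how little the Bayes computation changes relative to the linear case, here is a random-walk Metropolis sketch for a hypothetical mean function $m(x;\beta) = \beta_0 e^{\beta_1 x}$. The mean function, the data, the flat prior on $\beta$, and the treatment of $\sigma$ as known are all simplifying assumptions made only to keep the example compact.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical non-linear mean function m(x; beta) = beta0 * exp(beta1 * x)
def m(x, beta):
    return beta[0] * np.exp(beta[1] * x)

x = np.linspace(0.0, 1.0, 20)
true_beta, sigma = np.array([2.0, 1.0]), 0.1
y = m(x, true_beta) + rng.normal(0.0, sigma, size=x.size)

def log_post(beta):
    # flat prior on beta; sigma treated as known for this sketch
    return -0.5 * np.sum((y - m(x, beta)) ** 2) / sigma**2

beta = np.array([1.0, 0.5])
lp = log_post(beta)
draws = []
for _ in range(5000):                     # random-walk Metropolis
    prop = beta + rng.normal(0.0, 0.05, size=2)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        beta, lp = prop, lp_prop
    draws.append(beta)
draws = np.array(draws)[1000:]            # discard burn-in
assert abs(draws[:, 0].mean() - true_beta[0]) < 0.5
```

Exactly the same loop, with $m(x_i;\beta) = x_i'\beta$, handles the linear case; nothing in the sampler cares whether $m$ is linear.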

6.8 Generalized Linear Models, etc.

The intent of the so-called "generalized linear model" introduced in Stat 511 is to extend regression/linear model type modeling of the effects of covariates beyond the realm of normal observations, particularly to cases of discrete (binomial and Poisson) responses. In the generalized linear model, one assumes that $n$ independent univariate (binomial or Poisson) observations $y_i$ have distributions depending upon corresponding $k$-vectors of predictors $x_i$ and some parameter vector $\beta$ through some appropriate "link" function $h(\cdot)$. That is, one assumes that

$$Ey_i = h^{-1}(x_i'\beta)$$

Probably the most common Poisson version of the generalized linear model is the case where one assumes that

$$Ey_i = \exp(x_i'\beta)$$

which is the case of the so-called "log-linear model." Notice that the joint pmf for an $n$-vector of observations under the log-linear model is then

$$f(y \mid \beta) = \prod_{i=1}^n \frac{\exp\left(-\exp(x_i'\beta)\right)\left(\exp(x_i'\beta)\right)^{y_i}}{y_i!}$$


(more generally, one replaces $\exp(x_i'\beta)$ with $h^{-1}(x_i'\beta)$). So the likelihood under the log-linear model is

$$L(\beta) = \prod_{i=1}^n \frac{\exp\left(-\exp(x_i'\beta)\right)\left(\exp(x_i'\beta)\right)^{y_i}}{y_i!}$$

and upon making some choice of (proper) prior distribution for $\beta$, say $\text{MVN}_k$ with large covariance matrix or uniform on a large but bounded part of $\Re^k$ (one would need to think about whether an improper "flat" prior on $\Re^k$ for $\beta$ will produce a proper posterior), a Bayes analysis will proceed as usual.

This could be easily extended to a mixed effects version by assuming that

for $\delta$ some $\text{MVN}_n\left(0, V(\sigma^2)\right)$ vector of random effects with patterned covariance matrix $V(\sigma^2)$ depending upon a vector of variance components $\sigma^2 = \left(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2\right)$, conditional on $\delta$ the $y_i$ are independent with

$$y_i \sim \text{Poisson}\left(\exp(x_i'\beta + \delta_i)\right)$$

This would produce a model with parameters $\beta$ and $\sigma^2$ that could be handled in WinBUGS by including in the analysis the auxiliary variables in $\delta$ (or likely even more fundamental independent mean 0 normal random effects that when added appropriately produce $\delta$ with the desired patterned covariance matrix). That is, in principle, there is no special difficulty involved in handling regression type or even mixed effects type modeling and analysis of Poisson responses from a Bayes viewpoint.

Common binomial versions of the generalized linear model set the binomial "success probability" parameter $p$ to be

$$p_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$

(the case of so-called "logistic regression") or

$$p_i = \Phi(x_i'\beta)$$

(the case of so-called "probit analysis") or

$$p_i = 1 - \exp\left(-\exp(x_i'\beta)\right)$$

(the case of the "complementary log-log" link). Under any of these, a joint pmf for $n$ independent binomial observations $y_i$ is

$$f(y \mid \beta) = \prod_{i=1}^n \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i}$$

and the likelihood is thus

$$L(\beta) = \prod_{i=1}^n \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i}$$


Again upon making some prior assumption (like multivariate normal, uniform on a subset of $\Re^k$, or possibly uniform on all of $\Re^k$) on $\beta$, a Bayes analysis is in principle straightforward. And just as discussed above in the Poisson case, the generalization to a random or mixed effects version is available by replacing $x_i'\beta$ with $x_i'\beta + \delta_i$ in any of the expressions for $p_i$ above.

Finally, notice that from the point of view of simulation-based Bayes analysis that doesn't require that one develop specialized inference methods or distribution theory before doing statistical analysis, there is not even anything special about the linear form $x_i'\beta$ appearing in the expressions of this section. It is conceptually no more difficult to replace $x_i'\beta$ with an expression like $m(x_i; \beta)$ than it was in the normal non-linear regression case of Section 6.7.
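A short sketch of the logistic-regression case makes the pieces concrete: the binomial likelihood written on the log scale, plus a proper multivariate normal prior contributing a quadratic penalty. The design, the "true" coefficients, and the prior variance of 100 are all invented for illustration, and $n_i = 1$ trials per observation is assumed to keep the pmf simple.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented logistic-regression data with n_i = 1 trials per observation
x = np.column_stack([np.ones(50), rng.normal(size=50)])
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1 / (1 + np.exp(-x @ true_beta)))

def log_lik(beta):
    eta = x @ beta
    # log of prod p_i^{y_i} (1 - p_i)^{1 - y_i}, written stably
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def log_post(beta):
    # proper MVN_2(0, 100 I) prior adds a quadratic penalty to the log-likelihood
    return log_lik(beta) - 0.5 * beta @ beta / 100.0

assert np.isfinite(log_post(np.zeros(2)))
assert log_post(true_beta) > log_post(np.array([5.0, -5.0]))
```

Any generic MCMC scheme (for example the random-walk Metropolis loop of Section 6.7) can be pointed at `log_post` unchanged.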

6.9 Models With Order Restrictions

The following is a bit of an amplification of the discussion of Section 3.3.3. As indicated in that section, if a natural parameter space $\Theta \subset \Re^k$ is of product form but some $\Theta_0 \subset \Theta$ that is not of a product form is of real interest, direct MCMC simulation from a posterior on $\Theta_0$ may not be obvious. But if $h(\theta)$ specifies a posterior on $\Theta$, one can sample from the posterior specified by

$$h(\theta)\, I[\theta \in \Theta_0]$$

by sampling instead from $h(\theta)$ and simply "throwing away" those MCMC iterates $\theta_j$ that do not belong to $\Theta_0$. As indicated in Section 3.3.3 this can be done in WinBUGS using coda to transfer iterates to R.

The other way to address this kind of issue is to find a parameterization that avoids it altogether. Consider, for example, what is possible for the common type of order restriction

$$\theta_1 \le \theta_2 \le \cdots \le \theta_k$$

1. Where $\Theta = \Re^k$, one can define

$$\delta_i = \theta_i - \theta_{i-1} \text{ for } i = 2, 3, \ldots, k$$

(so that $\theta_j = \theta_1 + \sum_{i=2}^j \delta_i$ for $j \ge 2$) and replace the parameter vector $\theta$ with the parameter vector $(\theta_1, \delta_2, \ldots, \delta_k) \in \Re \times [0, \infty)^{k-1}$. Placing a prior distribution of product form on $\Re \times [0, \infty)^{k-1}$ leads to a posterior on a product space and straightforward posterior simulation.

2. Where $\Theta = (0, \infty)^k$, one can do essentially as in 1., or parametrize in ratio form. That is, with

$$r_i = \frac{\theta_i}{\theta_{i-1}} \text{ for } i = 2, 3, \ldots, k$$

(so that $\theta_j = \theta_1 \cdot \prod_{i=2}^j r_i$ for $j \ge 2$), one may replace the parameter vector $\theta$ with the parameter vector $(\theta_1, r_2, \ldots, r_k) \in (0, \infty) \times [1, \infty)^{k-1}$. Placing a prior distribution of product form on $(0, \infty) \times [1, \infty)^{k-1}$ leads to a posterior on a product space and straightforward posterior simulation.


3. Where $\Theta = (0, 1)^k$, a modification of the ratio idea can be used. That is, with

$$d_i = \frac{\theta_i}{\theta_{i+1}} \text{ for } i = 1, 2, \ldots, k-1$$

(so that $\theta_j = \theta_k \cdot \prod_{i=j}^{k-1} d_i$ for $j \le k-1$), one may replace the parameter vector $\theta$ with the parameter vector $(d_1, d_2, \ldots, d_{k-1}, \theta_k) \in (0, 1]^k$. Placing a prior distribution of product form on $(0, 1]^k$ leads to a posterior on a product space and straightforward posterior simulation.

Of course the reparameterization ideas above are not specifically or essentially Bayesian, but they are especially helpful in the Bayes context.
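The increment idea in 1. above is simple enough to sketch directly: map an unrestricted point of $\Re \times [0,\infty)^{k-1}$ to an ordered $\theta$ and back. The function names and the particular numbers are of course just illustrative.

```python
import numpy as np

# Increment parameterization for theta_1 <= ... <= theta_k on R^k:
# (theta_1, delta_2, ..., delta_k) in R x [0, inf)^(k-1)
def increments_to_ordered(theta1, delta):
    return theta1 + np.concatenate([[0.0], np.cumsum(delta)])

def ordered_to_increments(theta):
    return theta[0], np.diff(theta)

theta = increments_to_ordered(-1.0, np.array([0.5, 0.0, 2.0]))
assert np.all(np.diff(theta) >= 0)                        # order restriction holds
t1, d = ordered_to_increments(theta)
assert np.allclose(increments_to_ordered(t1, d), theta)   # exact round trip
```

An MCMC sampler can then work on $(\theta_1, \delta_2, \ldots, \delta_k)$ with a product-form prior and never needs to worry about the order restriction.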

6.10 One-Sample Mixture Models

For pdf's $f_1, f_2, \ldots, f_M$ and $\pi = (\pi_1, \pi_2, \ldots, \pi_M)$ a probability vector (each $\pi_i \ge 0$ and $\sum_{i=1}^M \pi_i = 1$), a pdf

$$f_\pi = \sum_{i=1}^M \pi_i f_i \tag{59}$$

specifies a so-called "mixture distribution." In cases where the $f_i$ are completely specified and linearly independent functions, (frequentist or) Bayes estimation of $\pi$ is straightforward. On the other hand, where each $f_i$ is parameterized by a (potentially multivariate) parameter $\gamma_i$ and the whole vector

$$\theta = (\pi, \gamma_1, \gamma_2, \ldots, \gamma_M)$$

is unknown, the problem is typically technically more difficult.

In the first place, there are often identifiability problems (see Section 3.3.1)

unless one is careful. For example, as suggested in Section 3.3.1, in a problem where $M = 2$, $f_1$ is $N\left(\mu_1, \sigma_1^2\right)$ and $f_2$ is $N\left(\mu_2, \sigma_2^2\right)$, with all of $\pi_1, \mu_1, \sigma_1^2, \mu_2,$ and $\sigma_2^2$ unknown, the parameter vectors

$$\left(\pi_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2\right) = (.3, 1, 1, 2, 1)$$

and

$$\left(\pi_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2\right) = (.7, 2, 1, 1, 1)$$

produce the same mixture distribution. In order to avoid this kind of difficulty, one must do something like parameterize not by the two means, but rather by the smaller of the two means and the difference between the larger and the smaller of the means.

The form (59) can be thought of as the density of an observable $Y$ generated by first generating $I$ from $\{1, 2, \ldots, M\}$ according to the distribution specified by $\pi$, and then conditional on $I$, generating $Y$ according to the density $f_I$. Then given an iid sample from $f_\pi$, say $Y = (Y_1, Y_2, \ldots, Y_n)$, this motivates


associating with each $Y_j$ a (potentially completely fictitious) auxiliary variable $I_j$ indicating which $f_i$ gave rise to $Y_j$. Bayes analyses of mixture samples typically make use of such variables. And this perspective begins to motivate a well known difficulty often encountered in Bayes analyses of the one-sample mixture problem. That is, unless one constrains $\pi$ (either by use of a very strong prior essentially outlawing the possibility that any $\pi_i = 0$, or by simply adopting a parameter space that is bounded away from cases where any $\pi_i = 0$) to prevent "extreme" mixture parameters and cases where not all of the elements of $\{1, 2, \ldots, M\}$ are represented in $\{I_1, I_2, \ldots, I_n\}$, a posterior sampling algorithm can behave badly. Degenerate submodels of the full mixture model (59) that have one or more $\pi_i = 0$ can act (at least for practical purposes) as "absorbing states" for MCMC algorithms. In the language of Definition 15 on page 79, the chains in effect fail to be "irreducible."
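The use of the auxiliary indicators can be sketched in the simplest setting, where the component densities are completely specified and only the mixing weight is unknown. The two normal components, the sample sizes, and the uniform Beta(1,1) prior below are all invented for illustration; the Gibbs sampler alternates between drawing the indicators $I_j$ given the weight and drawing the weight from its Beta full conditional given the indicators.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two fully specified components, f1 = N(0,1) and f2 = N(3,1), so only
# the mixing weight pi_1 is unknown.
def npdf(y, mu):
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)

y = np.concatenate([rng.normal(0, 1, 30), rng.normal(3, 1, 70)])
pi1 = 0.5
pi1_draws = []
for _ in range(2000):
    w1 = pi1 * npdf(y, 0.0)
    w2 = (1 - pi1) * npdf(y, 3.0)
    I = rng.uniform(size=y.size) < w1 / (w1 + w2)   # auxiliary indicators
    n1 = I.sum()
    pi1 = rng.beta(1 + n1, 1 + y.size - n1)         # Beta(1,1) prior full conditional
    pi1_draws.append(pi1)
post_mean = np.mean(pi1_draws[500:])
assert 0.1 < post_mean < 0.5                        # true mixing weight is 0.3
```

With unknown component parameters $\gamma_i$ the same structure survives, but extra Gibbs steps for the $\gamma_i$ (and the labeling/degeneracy cautions above) become essential.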

6.11 "Bayes" Analysis for Inference About a Function g (t)

An interesting application of random process theory and what are really Bayes ideas is to the estimation/interpolation of values of a function $g(t)$ for $t \in (a, b)$ (possibly observed with error) from values at some points in $(a, b)$. (A $t \in \Re^k$ version of what follows can be created using the ideas in the present "function of a single real variable" version, but the simplest case will suffice here.) This kind of material is very important in modern "analysis of computer experiments" applications, where in order to evaluate $g(t)$ a long and therefore expensive computer run is required. It is then desirable to get a few values of $g$ and use them to derive some cheap/computationally simple interpolator/approximator/surrogate for $g$ at other values of $t$.

So suppose that for

$$t_1 < t_2 < \cdots < t_k$$

one calculates or observes

$$g(t_1), g(t_2), \ldots, g(t_k) \tag{60}$$

or perhaps

$$g(t_1) + \epsilon_1, g(t_2) + \epsilon_2, \ldots, g(t_k) + \epsilon_k \tag{61}$$

where one might model the $\epsilon_i$ as iid $N(0, \tau^2)$ random variables, and that one wishes to estimate/predict $g(t^*)$ for some $t^* \in (a, b)$. A way to use Bayes machinery and the theory of stationary Gaussian random processes here is to model and calculate as follows. I might invent a "prior" for the function $g(\cdot)$ by

1. specifying my best prior guess at the function $g(\cdot)$ as $\mu(\cdot)$,

2. writing

$$g(t) = \mu(t) + (g(t) - \mu(t)) = \mu(t) + \delta(t)$$

and


3. modeling $\delta(t)$ as a mean 0 stationary Gaussian process.

Item 3. here means that we assume that $E\delta(t) = 0\ \forall t$, $\text{Var}\,\delta(t) = \sigma^2\ \forall t$, that for some positive definite function $\rho(\cdot)$ taking values in $(0, 1)$ (a mathematically valid correlation function)

$$\text{Cov}\left(\delta(t), \delta(t')\right) = \sigma^2 \rho\left(|t - t'|\right)$$

and that for any finite number of values $t$, the joint distribution of the corresponding $\delta(t)$'s is multivariate normal. Standard choices of the function $\rho(\cdot)$, as functions of the distance $d = |t - t'|$, are $\exp(-\alpha d)$ and $\exp\left(-\alpha d^2\right)$. (The first tends to produce "rougher" realizations than does the second. In both cases, the positive parameter $\alpha$ governs how fast correlation dies off as a function of the distance between two values $t$ and $t'$.) In this model for $\delta(t)$, $\sigma^2$ in some sense quantifies overall uncertainty about $g(t)$, the form of $\rho(\cdot)$ can be made to reflect what one expects in terms of smoothness/roughness of deviations of $g(t)$ from $\mu(t)$, and for a typical choice of $\rho(\cdot)$, a parameter $\alpha$ governs how fast (as one moves away from one of $t_1, t_2, \ldots, t_k$) a prediction of $g(t)$ should move toward $\mu(t)$.

Then, say in the case (60) where there are no errors of observation, with $t^*$

different from any of $t_1, t_2, \ldots, t_k$, the model here (a prior for $g(t)$) implies that

$$\begin{pmatrix} g(t_1) \\ \vdots \\ g(t_k) \\ g(t^*) \end{pmatrix} \sim \text{MVN}_{k+1}\left( \begin{pmatrix} \mu(t_1) \\ \vdots \\ \mu(t_k) \\ \mu(t^*) \end{pmatrix}, \Sigma \right) \tag{62}$$

for

$$\underset{(k+1)\times(k+1)}{\Sigma} = \sigma^2 \left( \rho\left(|t_i - t_j|\right) \right)_{i=1,\ldots,k+1;\ j=1,\ldots,k+1} \tag{63}$$

with the understanding that we are letting $t_{k+1} = t^*$. Then multivariate normal theory gives fairly simple formulas for the conditional distribution of part of a multivariate normal vector given the value of the rest of the vector. That is, it is straightforward to find from (62) the normal conditional (posterior) distribution of

$$g(t^*) \mid (g(t_1), g(t_2), \ldots, g(t_k))$$

This, in turn, produces plausibility statements about the unevaluated $g(t^*)$.

The case (61) is much the same. The only differences are that the covariance matrix for $(g(t_1) + \epsilon_1, g(t_2) + \epsilon_2, \ldots, g(t_k) + \epsilon_k, g(t^*))$ is not $\Sigma$ specified in (63), but rather

$$\Sigma^* = \Sigma + \text{diag}\left(\tau^2, \tau^2, \ldots, \tau^2, 0\right)$$

and that one is concerned with the conditional distribution of

$$g(t^*) \mid (g(t_1) + \epsilon_1, g(t_2) + \epsilon_2, \ldots, g(t_k) + \epsilon_k)$$

for purposes of prediction/interpolation at $t^*$.
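A small numeric sketch of the no-error case (60) can be built directly from the multivariate normal conditioning formulas. The choices $\mu(\cdot) \equiv 0$, $\rho(d) = \exp(-\alpha d^2)$, the values of $\sigma^2$ and $\alpha$, and the stand-in test function playing the role of the expensive $g$ are all assumptions made for illustration.

```python
import numpy as np

# mu(.) taken to be 0 and rho(d) = exp(-alpha d^2); sig2, alpha, and the
# stand-in function g are invented for illustration.
def cov(s, t, sig2=1.0, alpha=2.0):
    return sig2 * np.exp(-alpha * np.subtract.outer(s, t) ** 2)

g = np.sin                                   # stand-in for the expensive g
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])      # design points t_1 < ... < t_k
t_star = 0.75

K = cov(t, t)                                # Sigma restricted to the data
k_star = cov(np.array([t_star]), t).ravel()
# conditional (posterior) mean and variance of g(t*) | g(t_1),...,g(t_k)
pred_mean = k_star @ np.linalg.solve(K, g(t))
pred_var = (cov(np.array([t_star]), np.array([t_star]))[0, 0]
            - k_star @ np.linalg.solve(K, k_star))

assert abs(pred_mean - np.sin(t_star)) < 0.2
assert pred_var > -1e-9                      # nonnegative up to rounding
```

The noisy case (61) changes only one line: `K` becomes `K + tau2 * np.eye(t.size)`.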


7 Bayesian Nonparametrics

This section outlines an introduction to Bayesian analysis of some "nonparametric" and "semi-parametric" models. The standard textbook reference for such material is Bayesian Nonparametrics by Ghosh and Ramamoorthi, and the topics here are discussed in Chapter 3 of that book. The basic concern is distributions on (and thus "priors" on) distributions $F$ (a cdf, or $P$ the corresponding "probability measure" that assigns probabilities to sets of outcomes $B$) where $F$ (or $P$) is not confined to any relatively simple parametric family (like the Gaussian, the Weibull, the beta, etc.).

7.1 Dirichlet and Finite "Stick-Breaking" Processes

Suppose that $\alpha(\cdot)$ is a multiple of a probability distribution on $\Re$ (meaning that it assigns "mass" to subsets of $\Re$ like a probability distribution does, but is not necessarily normalized to have total mass 1, i.e. potentially $\alpha(\Re) \ne 1$). (Much of what follows could be done in $\Re^d$, but for simplicity of exposition, we will here work in 1 dimension.) It turns out to be "mathematically OK" to invent a probability distribution for $P$ (or $F$) (itself a probability distribution on $\Re$) by specifying that for any partition of $\Re$ into a finite number of disjoint sets $B_1, B_2, \ldots, B_k$ with $\bigcup_{i=1}^k B_i = \Re$, the vector of probabilities that $P$ assigns to these sets is Dirichlet distributed with parameters specified by $\alpha$, that is

$$(P(B_1), P(B_2), \ldots, P(B_k)) \sim \text{Dirichlet}_k\left(\alpha(B_1), \alpha(B_2), \ldots, \alpha(B_k)\right) \tag{64}$$

When (64) holds for all such partitions of $\Re$, we'll say that $P$ is a Dirichlet process on $\Re$ with parameter measure $\alpha$, and write

$$P \sim D_\alpha$$

Now the defining property (64) doesn't give one much feeling about what realizations from $D_\alpha$ "look like," but it does turn out to be very tractable and enable the proof of all sorts of interesting and useful facts about Dirichlet processes, and particularly about models where

$$P \sim D_\alpha \tag{65}$$

and conditioned on $P$,

$$Y_1, Y_2, \ldots, Y_n \sim \text{iid } P \text{ (or } F\text{)} \tag{66}$$

(this is the one-sample model where $P$ or $F$ has the "prior" $D_\alpha$). Some Dirichlet process facts are:

1. If $P$ (or $F$) $\sim D_\alpha$ there is the "neutral to the right" property that says for $t_1 < t_2 < \cdots < t_k$, the random variables

$$(1 - F(t_1)),\ \frac{1 - F(t_2)}{1 - F(t_1)},\ \frac{1 - F(t_3)}{1 - F(t_2)},\ \ldots,\ \frac{1 - F(t_k)}{1 - F(t_{k-1})}$$

are independent.


2. Under the assumptions (65) and (66), the posterior for $P$ is also a Dirichlet process, that is

$$P \mid (Y_1, Y_2, \ldots, Y_n) \sim D_{\left(\alpha + \sum_{i=1}^n \delta_{Y_i}\right)}$$

for $\delta_{Y_i}$ a unit point mass distribution located at $Y_i$. That is, the posterior for $P$ is derived from the prior for $P$ by updating the "parameter" measure by the addition of unit point masses at each observation.

3. It follows directly from 2. and (64) that under the assumptions (65) and (66), conditioned on $Y_1, Y_2, \ldots, Y_n$ the variable $F(t) = P((-\infty, t])$ is Beta distributed and has mean

$$E(F(t) \mid Y_1, Y_2, \ldots, Y_n) = \frac{\alpha((-\infty, t]) + \sum_{i=1}^n \delta_{Y_i}((-\infty, t])}{\alpha(\Re) + n} = \frac{\alpha(\Re)}{\alpha(\Re) + n} \cdot \frac{\alpha((-\infty, t])}{\alpha(\Re)} + \frac{n}{\alpha(\Re) + n} \cdot \frac{\#[Y_i \le t]}{n}$$

That is, this conditional mean is a weighted average of the probability that a normalized version of $\alpha$ assigns to $(-\infty, t]$ and the relative frequency with which the observations $Y_i$ are in $(-\infty, t]$, where the weights are $\alpha(\Re)$ (the prior mass) and $n$ (the sample size).

4. It similarly follows from 2. (and the fact that if $P \sim D_\alpha$ and $Y \mid P \sim P$ then $Y \sim \alpha / \alpha(\Re)$) that under the assumptions (65) and (66), (posterior) predictive distributions are tractable. That is

$$Y_{n+1} \mid (Y_1, Y_2, \ldots, Y_n) \sim \frac{\alpha + \sum_{i=1}^n \delta_{Y_i}}{\alpha(\Re) + n}$$

($Y_{n+1}$ has a predictive distribution that is a normalized version of the parameter measure of the posterior.)

5. Despite the fact that 4. has a "sequential" nature, under assumptions (65) and (66), the marginal of $(Y_1, Y_2, \ldots, Y_n)$ is "exchangeable"/symmetric. Every $Y_i$ has the same marginal, every pair $(Y_i, Y_{i'})$ has the same bivariate distribution, etc.

6. Probably the initially least appealing elementary fact about Dirichlet processes is that with probability 1 their realizations are discrete. That is

$$D_\alpha\left(\{\text{discrete distributions on } \Re\}\right) = 1$$

($P$ generated according to a Dirichlet "prior" is sure to be concentrated on a countable set of values.)
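The weighted-average form in fact 3 is easy to compute directly. In the sketch below the base measure is assumed to be $\alpha = c \cdot N(0, 1)$ with prior mass $c = \alpha(\Re) = 4$, and the data values are invented.

```python
import numpy as np
from math import erf, sqrt

def base_cdf(t):                    # normalized alpha((-inf, t]) for N(0,1) base
    return 0.5 * (1 + erf(t / sqrt(2)))

c = 4.0                             # alpha(R), the prior "mass"
Y = np.array([-0.2, 0.1, 0.5, 1.3])
n = Y.size

def post_mean_F(t):
    # weights c/(c+n) on the base cdf and n/(c+n) on the empirical cdf
    return (c / (c + n)) * base_cdf(t) + (n / (c + n)) * np.mean(Y <= t)

# agrees with the unsimplified form of the fact-3 formula (2 of the Y's are <= 0.3)
assert abs(post_mean_F(0.3) - (c * base_cdf(0.3) + 2) / (c + n)) < 1e-12
assert post_mean_F(10.0) > 0.99
```

As $n$ grows with $c$ fixed, the posterior mean of $F(t)$ is pulled toward the empirical cdf, as the weights suggest.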

More insight into fact 6. above, and ultimately motivation for other related nonparametric priors for probability distributions, is provided by an important


representation theorem of Sethuraman. $D_\alpha$ has a representation as a "stick-breaking prior" as follows. Suppose that

$$X_1, X_2, X_3, \ldots \text{ are iid according to } \frac{1}{\alpha(\Re)}\alpha$$

independent of

$$\beta_1, \beta_2, \beta_3, \ldots \text{ that are iid Beta}(1, \alpha(\Re))$$

Set

$$p_1 = \beta_1 \quad\text{and}\quad p_m = \beta_m \prod_{i=1}^{m-1}(1 - \beta_i)\ \ \forall m > 1$$

(these probabilities $p_m$ are created in "stick-breaking" fashion). Then the (random) probability distribution

$$P \equiv \sum_{m=1}^{\infty} p_m \delta_{X_m} \sim D_\alpha \tag{67}$$

This representation says that to simulate a realization from $D_\alpha$, one places probability mass $\beta_1$ at $X_1$, then places a fraction $\beta_2$ of the remaining probability mass at $X_2$, then places a fraction $\beta_3$ of the remaining probability mass at $X_3$, etc.

Representation (67) (involving as it does an infinite sum) is nothing that

can be used in practical computations/data analysis. But it motivates the consideration of something that can be used in practice, namely a truncated version of $P$ that has not a countable number of discrete components, but only a finite number, $N$, instead. That is, suppose that

$$X_1, X_2, X_3, \ldots, X_N \text{ are iid according to } \frac{1}{\alpha(\Re)}\alpha$$

independent of

$$\beta_1, \beta_2, \beta_3, \ldots, \beta_{N-1} \text{ that are iid Beta}(1, \alpha(\Re))$$

Set

$$p_m = \beta_m \prod_{i=1}^{m-1}(1 - \beta_i)\ \ \forall\, 1 \le m < N \quad\text{and}\quad p_N = \prod_{i=1}^{N-1}(1 - \beta_i)$$

and define

$$P_N = \sum_{m=1}^{N} p_m \delta_{X_m}$$

Presumably, for "large" $N$, in some appropriate sense $P_N \approx D_\alpha$.

A natural generalization of this "truncated Dirichlet process" idea can be

formulated as follows. Let $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_{N-1})$ and $\eta = (\eta_1, \eta_2, \ldots, \eta_{N-1})$ be vectors of positive constants. For

$$X_1, X_2, X_3, \ldots, X_N \text{ iid according to a probability distribution } H \tag{68}$$

independent of

$$\beta_i \text{ for } i = 1, \ldots, N-1 \text{ independent Beta}(\gamma_i, \eta_i) \text{ variables} \tag{69}$$

set

$$p_m = \beta_m \prod_{i=1}^{m-1}(1 - \beta_i)\ \ \forall\, 1 \le m < N \quad\text{and}\quad p_N = \prod_{i=1}^{N-1}(1 - \beta_i)$$

and define

$$P_N = \sum_{m=1}^{N} p_m \delta_{X_m} \tag{70}$$

One might say that

$$P_N \sim \text{SB}(N, H, \gamma, \eta)$$

i.e. that $P_N$ is a general $N$ component stick-breaking process.

The beauty of representation (70) is that it involves only the $N$ ordinary random variables (68) and the $N - 1$ ordinary random variables (69). So it can be used to specify nonparametric components of practically implementable Bayes models and thus be used in data analysis.
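The finiteness of (70) means a realization of a stick-breaking process is a few lines of simulation. Below, the sampler for $H$, the value of $N$, and the Dirichlet-like choice $\gamma_i = 1$, $\eta_i = \alpha(\Re) = 4$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_breaking(N, H_sampler, gamma, eta):
    """One draw of P_N = sum_m p_m delta_{X_m}, in the style of (68)-(70)."""
    X = H_sampler(N)
    beta = rng.beta(gamma, eta)                     # beta_1 .. beta_{N-1}
    remaining = np.concatenate([[1.0], np.cumprod(1 - beta)])
    p = np.empty(N)
    p[:-1] = beta * remaining[:-1]                  # p_m = beta_m prod_{i<m}(1-beta_i)
    p[-1] = remaining[-1]                           # p_N takes what is left
    return X, p

N = 50
# truncated-Dirichlet-process special case: gamma_i = 1, eta_i = alpha(R) = 4
X, p = stick_breaking(N, lambda k: rng.normal(size=k),
                      np.ones(N - 1), 4.0 * np.ones(N - 1))
assert np.isclose(p.sum(), 1.0)                     # weights telescope to 1
assert np.all(p >= 0)
```

The pair `(X, p)` is exactly the kind of finite object that can sit inside a larger Bayes model as its nonparametric component.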

7.2 Polya Tree Processes

A second nonparametric way of specifying a distribution on distributions is through the use of so-called "Polya trees." We begin the exposition of Polya tree processes with a small, relatively concrete example.

Suppose that one has in mind 8 real numbers $x_1 < x_2 < \cdots < x_8$ and is interested in distributions over distributions on these values. (Actually, it is not at all essential that these $x$'s are real numbers rather than just arbitrary "points," but with an eye to the ultimate application we might as well think of them as ordered real numbers.) For convenience, we will rename the values with binary labels and think of them at the bottom of a binary tree as in Figure 12.

The $p$'s marked on Figure 12 are meant to add to 1 in the obvious pairs ($p_0 + p_1 = 1$, $p_{00} + p_{01} = 1$, $p_{10} + p_{11} = 1$, etc.). For fixed values of these, the tree structure in the figure can be used to define a probability distribution over the 8 elements at the bottom of the tree according to the prescription "multiply $p$'s on the branches you take to go from the top to a given final node." That is, with $\epsilon = (\epsilon_1, \epsilon_2, \epsilon_3) \in \{0, 1\}^3$, the $p$'s define a probability distribution on $\epsilon$'s by

$$P(\{\epsilon\}) = p_{\epsilon_1} p_{\epsilon_1\epsilon_2} p_{\epsilon_1\epsilon_2\epsilon_3} \tag{71}$$

Then, if one places an appropriate probability distribution on the set of $p$'s, one has placed a distribution on the distribution $P$. In fact, since the $p$'s with labels ending in 0 are 1 minus the corresponding $p$'s where the label is changed only by switching the last digit, one needs only to place a joint distribution on the $p$'s with labels ending in 1.
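One concrete way to place such a distribution, foreshadowing the Beta assumption below, is to draw each $p$ with label ending in 1 independently (here from a uniform distribution, purely for illustration), set its partner ending in 0 to the complement, and then check that the multiply-down-the-branches rule (71) really yields a probability distribution over the 8 leaves.

```python
import numpy as np

rng = np.random.default_rng(6)

# Draw all the p's for the 3-level tree: each p with label ending in 1
# gets its own draw (uniform, i.e. Beta(1,1), purely for illustration)
# and its partner ending in 0 is the complement.
def draw_tree_ps(levels=3):
    p = {}
    for lvl in range(1, levels + 1):
        for node in range(2 ** (lvl - 1)):
            prefix = format(node, f"0{lvl - 1}b") if lvl > 1 else ""
            p1 = rng.beta(1.0, 1.0)
            p[prefix + "1"] = p1
            p[prefix + "0"] = 1.0 - p1
    return p

def leaf_prob(p, eps):
    """P({eps}) = p_{e1} p_{e1 e2} p_{e1 e2 e3}, the form (71)."""
    out = 1.0
    for j in range(1, len(eps) + 1):
        out *= p[eps[:j]]
    return out

p = draw_tree_ps()
total = sum(leaf_prob(p, format(k, "03b")) for k in range(8))
assert np.isclose(total, 1.0)   # (71) defines a probability distribution
```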


Figure 12: A 3-Level Binary Tree

A so-called "Polya tree" method of endowing the p�s with a distribution isto assume that those with labels ending in 1 are independent Beta variables.That is, for � some string of 0; 1; or 2 zeros and ones (so that �1 and �0 arestrings of zeros and ones of length 1; 2; or 3) suppose that the

p�1 � ind Beta (��1; ��0) (72)

(for parameters ��1 and ��0). Letting � stand for the whole collection of ��s(two for each p�1) we will say that the distribution over distributions P of theform (71) produced by this assumption is a PT 3 (�) process. (P is a simple"�nite Polya tree" process.)The form (71) and assumptions (72) immediately produce the result that for

$P \sim PT_3(\alpha)$,

$$E P(\{\epsilon\}) = \left(\frac{\alpha_{\epsilon_1}}{\alpha_1 + \alpha_0}\right)\left(\frac{\alpha_{\epsilon_1\epsilon_2}}{\alpha_{\epsilon_1 1} + \alpha_{\epsilon_1 0}}\right)\left(\frac{\alpha_{\epsilon_1\epsilon_2\epsilon_3}}{\alpha_{\epsilon_1\epsilon_2 1} + \alpha_{\epsilon_1\epsilon_2 0}}\right) \tag{73}$$

If conditioned on $P$ the variable $Y$ is $P$ distributed and $P \sim PT_3(\alpha)$, it is immediate that the marginal probabilities for $Y$ are also given by (73).

Next observe that if $P \sim PT_3(\alpha)$ and conditional on $P$, $Y \sim P$, then for $p$ the set of $p_{\epsilon 1}$'s, a joint density for $p$ and $Y$ is proportional to the product

$$\left(\prod_{\epsilon} p_{\epsilon 1}^{\alpha_{\epsilon 1} - 1}\, p_{\epsilon 0}^{\alpha_{\epsilon 0} - 1}\right)\left(p_{\epsilon_1} p_{\epsilon_1\epsilon_2} p_{\epsilon_1\epsilon_2\epsilon_3}\right) \tag{74}$$

(In (74) the product in the first term is over $\epsilon$'s of length 0 through 2.) The first term of (74) is proportional to the joint density of the $p$'s and the second is the pmf for $Y$. It is then obvious that the posterior distribution of $P \mid Y$ is again a Polya tree process. That is because with $\epsilon$ a vector of 1, 2, or 3 zeros and ones and

�� (Y ) ��1 if the �rst part of Y is �0 otherwise

;


conditioned on Y = (ε₁, ε₂, ε₃),

p_{γ1} ∼ ind. Beta(α_{γ1} + δ_{γ1}(Y), α_{γ0} + δ_{γ0}(Y))

That is, in order to update a PT₃(α) "prior" for P to a posterior, one simply looks through the tree adding 1 to each α traversed to produce Y.

The conjugacy of the Polya tree process and the form (71) for a single observation obviously generalizes to Y₁, Y₂, …, Yₙ iid according to P defined in (71) and further allows for easy identification of posterior predictive distributions. That is, adopt the notation

α*(Y₁, Y₂, …, Yₙ)

for a set of α's updated by adding to each α_γ a count of the number of Yᵢ's that involve use of the corresponding branch of the tree. If P ∼ PT₃(α) and conditioned on P the Y₁, Y₂, …, Yₙ are iid according to P, the posterior of P is PT₃(α*(Y₁, Y₂, …, Yₙ)). Further, if P ∼ PT₃(α) and conditioned on P the Y₁, Y₂, …, Yₙ, Y_new are iid P, the posterior predictive distribution of Y_new | (Y₁, Y₂, …, Yₙ) is specified by

Pr[Y_new = (ε₁, ε₂, ε₃) | (Y₁, Y₂, …, Yₙ)] =
((α_{ε₁} + Σᵢ δ_{ε₁}(Yᵢ)) / (α₁ + α₀ + n))
× ((α_{ε₁ε₂} + Σᵢ δ_{ε₁ε₂}(Yᵢ)) / (α_{ε₁1} + α_{ε₁0} + Σᵢ δ_{ε₁}(Yᵢ)))
× ((α_{ε₁ε₂ε₃} + Σᵢ δ_{ε₁ε₂ε₃}(Yᵢ)) / (α_{ε₁ε₂1} + α_{ε₁ε₂0} + Σᵢ δ_{ε₁ε₂}(Yᵢ)))

which is in some sense the generalization of the statement that (73) gives marginal probabilities for Y if, given P, the variable Y has distribution P and P ∼ PT₃(α).

An important question is how one might sensibly choose the parameters α for a PT₃(α) process (or, differently put, what are the consequences of various choices). To begin, formula (73) shows how the mean of P distribution values depends upon the choice of α. If one has in mind some "best guess" distribution H(·), it is simple to choose α to produce EP({ε}) = H({ε}) ∀ ε ∈ {0,1}³. This is accomplished by choosing elements in α so that

α₁/α₀ = H({100,101,110,111}) / H({000,001,010,011}),
α₀₁/α₀₀ = H({010,011}) / H({000,001}),
α₁₁/α₁₀ = H({110,111}) / H({100,101}),
α₀₀₁/α₀₀₀ = H({001}) / H({000}),
α₀₁₁/α₀₁₀ = H({011}) / H({010}),
α₁₀₁/α₁₀₀ = H({101}) / H({100}), and
α₁₁₁/α₁₁₀ = H({111}) / H({110}).

Any pairs of α's with the correct ratios will do to produce a Polya tree process with mean H(·), and subject to these ratio relationships, one is still free to choose, say, the sums

α₀ + α₁; α₀₀ + α₀₁; α₁₀ + α₁₁; α₀₀₀ + α₀₀₁; α₀₁₀ + α₀₁₁; α₁₀₀ + α₁₀₁; and α₁₁₀ + α₁₁₁    (75)


to govern how variable realizations P from PT₃(α) are and "at what scale(s)" they vary the most.

To understand this last point, note that if U ∼ Beta(a, b), then if I fix a/b I have fixed

(a/b)/(1 + a/b) = a/(a + b) = EU,

but the larger is a + b, the smaller is Var U. So in the PT₃(α) context, the larger are the sums α_{γ1} + α_{γ0}, the less variable are the realizations P. (And in the case where conditioned on P the variables Y₁, Y₂, …, Yₙ are iid P, the less strongly the posterior is pulled from the prior mean H toward the empirical distribution of the Yᵢ's.) Control of the scales at which P varies (or the posterior is pulled toward the empirical distribution of the Yᵢ's) can be made via control of the relative sizes of these sums at different levels in the tree. For example, for a given mean H for P,

1. the sum α₀ + α₁ being large in comparison to the other sums in (75) allows the "left and right half total P probability" to vary little from the corresponding "left and right half total H probability" but allows the details within those "halves" to vary relatively substantially (so posterior left and right half totals stay near H totals, but the specifics of posterior probabilities within the halves can become approximately proportional to the empirical frequency pattern of Yᵢ's within the halves), and

2. the sum α₀ + α₁ being small in comparison to the other sums in (75) allows the "left and right half total P probability" to fluctuate substantially, but forces the patterns within those halves to be like those of the mean H (so posterior left and right half totals are pulled toward the sample fractions, but the specifics of posterior probabilities within the halves are pulled towards being in proportion to those of the prior mean distribution, H).
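As a concrete illustration of the PT₃(α) mechanics, here is a minimal Python sketch (not part of the original notes; the function names are invented). It computes the prior mean probabilities via (73), then updates the α's by adding 1 along each branch an observation traverses, the conjugate update described above.

```python
from itertools import product

def binary_strings(k):
    return ["".join(bits) for bits in product("01", repeat=k)]

def posterior_alpha(alpha, data):
    """Conjugate PT_3 update: add 1 to alpha_gamma for every prefix
    gamma (branch of the tree) used in producing each observation."""
    post = dict(alpha)
    for y in data:
        for k in (1, 2, 3):
            post[y[:k]] += 1
    return post

def mean_prob(alpha, eps):
    """E P({eps}) as in (73): a product over tree levels of
    alpha_{prefix} / (alpha_{parent+'0'} + alpha_{parent+'1'})."""
    p = 1.0
    for k in (1, 2, 3):
        parent = eps[:k - 1]
        p *= alpha[eps[:k]] / (alpha[parent + "0"] + alpha[parent + "1"])
    return p

# All alphas equal gives a uniform prior mean over the 8 points...
alpha = {s: 1.0 for k in (1, 2, 3) for s in binary_strings(k)}
prior_mean = {e: mean_prob(alpha, e) for e in binary_strings(3)}

# ...and observing "101" three times pulls the posterior mean toward it.
post = posterior_alpha(alpha, ["101"] * 3)
post_mean = {e: mean_prob(post, e) for e in binary_strings(3)}
```

With all α's equal to 1, every prior mean probability is 1/8; after three observations of "101" the posterior mean of P({101}) is (4/5)³ = 0.512, illustrating the pull toward the empirical distribution discussed in items 1 and 2.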

In keeping with the qualitative notion that a sample Y₁, Y₂, …, Yₙ should provide less and less information about the finer and finer detail of P ∼ PT₃(α) as one goes further down the tree (and one must thus lean more and more heavily on prior assumptions), it is more or less conventional to make α_{γ0} + α_{γ1} increase in the length of γ (the depth one is at in the tree).

The issue of how to take the basic idea illustrated with the foregoing small 8-point example and make a general tool for Bayes data analysis for distributions on ℝ from it can be attacked in at least 2 ways. The simplest is to spread a power of 2 (say 2^k) values xᵢ across a part of ℝ thought to contain essentially all the probability of an unknown probability distribution and use a finite PT_k(α) process on those points as a nonparametric prior over distributions (on the 2^k points) that might approximate the unknown one. A second, more interesting one is to use (approximate truncated versions of) a general Polya tree process on ℝ. In the balance of this discussion we consider this second possibility.

Definition of a Polya tree process on ℝ depends upon the choice of a nested set of partitions of ℝ. That is, let


1. B₀ and B₁ be disjoint sets with B₀ ∪ B₁ = ℝ,

2. B₀₀ and B₀₁ be disjoint sets with B₀₀ ∪ B₀₁ = B₀, and B₁₀ and B₁₁ be disjoint sets with B₁₀ ∪ B₁₁ = B₁, and

3. ∀ k ≥ 2 and γ ∈ {0,1}^k, B_{γ0} and B_{γ1} be disjoint sets with B_{γ0} ∪ B_{γ1} = B_γ.

Then consider an infinite set of independent random variables

p_{γ1} ∼ Beta(α_{γ1}, α_{γ0})

for each γ a (possibly degenerate) finite vector of zeros and ones (for positive constants α_{γ1} and α_{γ0}). Define a (random) set of conditional probabilities

P(B_{γ1}|B_γ) = p_{γ1}

These specifications in turn imply (random) P probabilities for all of the sets B_γ. For example, much as in (71), for any γ = (γ₁, γ₂, …, γ_k) ∈ {0,1}^k

P(B_γ) = p_{γ₁} p_{γ₁γ₂} ⋯ p_{γ₁γ₂⋯γ_k}

Theorem 3.3.2 of Ghosh and Ramamoorthi then says that under quite mild conditions on the set of α's this random set of probabilities on the sets B_γ extends to a random distribution P giving probabilities not just to the sets B_γ but to all (measurable) subsets of ℝ, and we can term that a realization from a PT(α) process.

Of course (involving as it does limits and "infinities") this definition of a PT(α) process on ℝ can not be used exactly in real data analysis. An approximation to it that can be used in practice is this. For H a desired mean distribution on ℝ, suppose that for γ's vectors of zeros and ones of lengths no more than k, we have defined B_γ's, p_γ's and P(B_γ)'s as above with α's chosen to make

EP(B_γ) = H(B_γ)

Then for any γ of length k and A ⊂ B_γ, define

P(A|B_γ) = H(A)/H(B_γ)

This works to define a random process for assigning probabilities to all subsets of ℝ. In particular, in cases where H has pdf h, this random distribution on ℝ has (random, because the P(B_γ) are random) pdf defined piecewise by

f_k(y) = P(B_γ) h(y)/H(B_γ) for y ∈ B_γ (for all γ ∈ {0,1}^k)

(This is P(B_γ) times the H conditional density over B_γ.)

If this model is used to do data analysis based on Y₁, Y₂, …, Yₙ iid according to P, to get a posterior for P, the original α's are simply updated as in the first


PT₃(α) example above, according to counts of Yᵢ's falling into the B_γ (for all γ ∈ {0,1}^k). The posterior is then of the "approximate PT(α)" type just described. The "posterior mean pdf" (and posterior predictive density for Y_new) under this structure is defined piecewise by

E[P(B_γ) | Y₁, Y₂, …, Yₙ] h(y)/H(B_γ) for y ∈ B_γ

where E[P(B_γ) | Y₁, Y₂, …, Yₙ] is of a form similar to (73) but based on the updated α's.

A particularly attractive choice of the partitions leading to a PT(α) process on ℝ (and thus to a truncation of it), in the case that H has density h that is positive exactly at every point of the (potentially infinite) interval (a, b), is

1. B₀ = (a, H⁻¹(1/2)] and B₁ = (H⁻¹(1/2), b);

2. B₀₀ = (a, H⁻¹(1/4)], B₀₁ = (H⁻¹(1/4), H⁻¹(1/2)], B₁₀ = (H⁻¹(1/2), H⁻¹(3/4)], and B₁₁ = (H⁻¹(3/4), b);

and so on. This not only provides a natural set of partitions, but suggests the choices α_{γ1} = α_{γ0} and thus all Ep_{γ1} = 1/2 under the Polya scheme (making EP(B_γ) = H(B_γ) the power of 1/2 corresponding to the length of γ). In keeping with a desire for (relatively) smooth (though admittedly only piecewise continuous) posterior mean densities and the kind of considerations discussed for choice of the α's for PT₃(α), it is conventional to make α_{γ0} + α_{γ1} increase in the length of γ (increase with depth in the tree) and, in particular, growth of the sum at a "squared length of γ" rate is often recommended.
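The quantile-based partition just described is easy to compute whenever H⁻¹ is available in closed form. Below is a small Python sketch (invented for illustration, taking H = Exponential(1) as the "best guess" distribution, so H⁻¹(u) = −ln(1 − u)); each depth-k set B_γ then carries H-probability 2^(−k), and a "squared depth" rule for the α's is included.

```python
from math import log, exp

def H(x):
    """cdf of the assumed best-guess distribution, Exponential(1)."""
    return 1.0 - exp(-x)

def Hinv(u):
    """inverse cdf of Exponential(1)."""
    return -log(1.0 - u)

def B(gamma):
    """Quantile partition set B_gamma = (Hinv(j/2^k), Hinv((j+1)/2^k)]
    where k = len(gamma) and j is gamma read as a binary integer."""
    k, j = len(gamma), int(gamma, 2)
    lo = 0.0 if j == 0 else Hinv(j / 2**k)
    hi = float("inf") if j + 1 == 2**k else Hinv((j + 1) / 2**k)
    return lo, hi

def alpha(gamma, c=1.0):
    """One conventional choice: alpha_gamma = c * (depth)^2, equal for
    siblings, so that every E p_{gamma 1} = 1/2."""
    return c * len(gamma) ** 2

lo, hi = B("10")          # depth-2 set B_10 = (Hinv(1/2), Hinv(3/4)]
mass = H(hi) - H(lo)      # every depth-2 set gets H-probability 1/4
```

Here B("10") = (ln 2, ln 4] and its H mass is exactly 3/4 − 1/2 = 1/4, as the construction promises.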

8 Some Scraps (WinBUGS and Other)

8.1 The "Zeroes Trick"

WinBUGS is a very flexible/general program. But it obviously can not automatically handle every distribution one could invent and want to use as part of one of its models. That is, there is sometimes a need to include some non-standard factor h₁(θ) in an expression

h(θ) = h₁(θ) h₂(θ)

from which one wishes to sample. (For example, one might invent a nonstandard prior distribution specified by g(θ) that needs to be combined with a likelihood L(θ) in order to create a product L(θ)g(θ) proportional to a posterior density from which samples need to be drawn.) Obviously, in such cases one will somehow need to code a formula for h₁(θ) and get it used as a factor in a formula for h(θ). The WinBUGS method of doing this is to employ "the zeroes trick" based on the use of a fictitious Poisson observation with fictitious observed value 0.


That is, a Poisson variable X with mean λ has probability of taking the value 0

P_λ[X = 0] = exp(−λ)

So if, in addition to all else one does in a WinBUGS model statement, one specifies that a variable Y is Poisson with mean

c − ln(h₁(θ))

and gives the program "data" that says Y = 0, the overall effect is to include a multiplicative factor of

exp(−c + ln(h₁(θ))) = exp(−c) h₁(θ)

in the h(θ) from which WinBUGS samples. Notice that since WinBUGS expects a positive mean for a Poisson variable, one will typically have to use a non-zero value of c when employing the "trick," with

c > max_θ ln(h₁(θ))

in order to prevent WinBUGS from balking at some point in its Gibbs iterations.
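The identity behind the trick is easy to check numerically. The following Python sketch (h₁ here is an arbitrary invented factor, not anything from WinBUGS itself) confirms that a "Poisson observation of 0 with mean c − ln h₁(θ)" contributes exactly the factor exp(−c) h₁(θ), i.e. h₁(θ) up to a constant not involving θ:

```python
from math import exp, log

def h1(theta):
    """Some invented nonstandard (unnormalized) factor; it is bounded
    above by 2, so any c > ln 2 keeps the fictitious mean positive."""
    return exp(-abs(theta)) * (1.0 + theta ** 2 / (1.0 + theta ** 2))

def poisson_zero_likelihood(theta, c):
    lam = c - log(h1(theta))   # the fictitious Poisson mean
    assert lam > 0             # this is why c must exceed max ln(h1)
    return exp(-lam)           # P[Y = 0] for a Poisson(lam) variable

theta, c = 0.7, 5.0
factor = poisson_zero_likelihood(theta, c)
target = exp(-c) * h1(theta)   # the intended multiplicative factor
```

Since the exp(−c) is the same for every θ, the sampled distribution is unaffected by it.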

8.2 Convenient Parametric Forms for Sums and Products

In building probability models (including "Bayes" models that treat parameters as random variables) it is often convenient to be able to think of a variable as a sum or product of independent pieces that "combine nicely," i.e. to be able to model Y as either

X₁ + X₂ + ⋯ + X_k    (76)

or as

X₁ × X₂ × ⋯ × X_k    (77)

for independent Xᵢ with suitable marginal distributions. It is thus useful to review and extend a bit of Stat 542 probability that is relevant in accomplishing this. For the case of (76) recall that

1. X₁ ∼ N(μ₁, σ₁²) independent of X₂ ∼ N(μ₂, σ₂²) implies that Y = X₁ + X₂ ∼ N(μ₁ + μ₂, σ₁² + σ₂²), producing a fact useful when one is modeling a continuous variable taking values in all of ℝ,

2. X₁ ∼ Poisson(λ₁) independent of X₂ ∼ Poisson(λ₂) implies that Y = X₁ + X₂ ∼ Poisson(λ₁ + λ₂), producing a fact useful when one is modeling a discrete variable taking values in {0, 1, 2, …}, and

3. X₁ ∼ Γ(α₁, β) independent of X₂ ∼ Γ(α₂, β) implies that Y = X₁ + X₂ ∼ Γ(α₁ + α₂, β), producing a fact useful when one is modeling a continuous variable taking values in (0, ∞).
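Fact 2 above can be verified directly by convolving pmfs. A quick Python check (an invented illustration, not part of the original notes):

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """Poisson(lam) probability mass at k."""
    return exp(-lam) * lam ** k / factorial(k)

def conv_pmf(k, lam1, lam2):
    """pmf of X1 + X2 at k for independent Poissons, by convolution."""
    return sum(pois_pmf(j, lam1) * pois_pmf(k - j, lam2)
               for j in range(k + 1))

lam1, lam2 = 1.3, 2.2
# the convolution agrees with the Poisson(lam1 + lam2) pmf term by term
err = max(abs(conv_pmf(k, lam1, lam2) - pois_pmf(k, lam1 + lam2))
          for k in range(20))
```

The agreement is exact (up to rounding): the convolution sum is the binomial expansion of (λ₁ + λ₂)^k times exp(−(λ₁ + λ₂))/k!.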


And, of course, the facts 1) that independent Binomial variables with the same success probability add to give another Binomial variable with that success probability, and 2) that independent negative Binomials (or geometrics) with a common success probability add to give a negative Binomial variable with that success probability, are sometimes helpful.

For purposes of convenient modeling of products (77), one can make use of facts about sums and the exponential function (i.e. the fact that exponentiation turns sums into products). That is, if

Xᵢ = exp(X′ᵢ)

then

Y = X₁ × X₂ × ⋯ × X_k = exp(X′₁ + X′₂ + ⋯ + X′_k)

so facts about convenient forms for sums have corresponding facts about convenient product forms. Using 1. and 3. above, one has for example

1. X′ ∼ N(μ, σ²) implies that X = exp(X′) has what is typically called a "lognormal" distribution on (0, ∞), and the product of independent lognormal variables is again lognormal,

2. X′ ∼ Γ(α, β) implies that X = exp(X′) has what might be called a "log-gamma" distribution on (1, ∞), and the product of independent log-gamma variables with a common β is again log-gamma, and

3. X′ ∼ Γ(α, β) implies that X = exp(−X′) has what might be called a "negative log-gamma" distribution on (0, 1), and the product of independent negative log-gamma variables with a common β is again negative log-gamma.

This last fact can be useful, for example, when modeling the reliability of series systems (where system reliability is assumed to be the product of component reliabilities).
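As an invented numerical illustration of the last point: for X′ ∼ Γ(shape α, rate β), E exp(−X′) is the Gamma moment generating function evaluated at −1, namely (β/(β + 1))^α. So the mean reliability of a series system of independent negative log-gamma components sharing a rate β has the same closed form as a single negative log-gamma variable whose shape is the sum of the component shapes:

```python
def neg_log_gamma_mean(shape, rate):
    """E exp(-X') for X' ~ Gamma(shape, rate): the Gamma mgf at t = -1,
    i.e. (rate / (rate + 1)) ** shape."""
    return (rate / (rate + 1.0)) ** shape

shapes, rate = [0.5, 1.0, 2.0], 3.0

# mean component reliabilities multiply (by independence)...
prod_of_means = 1.0
for a in shapes:
    prod_of_means *= neg_log_gamma_mean(a, rate)

# ...and match the mean of a single negative log-gamma variable whose
# shape is the sum of the shapes (since the X'_i add to a Gamma)
combined = neg_log_gamma_mean(sum(shapes), rate)
```

Both quantities equal (3/4)^3.5, exactly the closure-under-products property described in item 3.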

9 Some Theory of MCMC for Discrete Cases

The following exposition for discrete cases of h(θ) is based on old ISU lecture notes obtained from Noel Cressie. (Parallel theory for general cases can be found in Luke Tierney's December 1994 Annals of Statistics paper.)

9.1 General Theory

The question addressed here is how the theory of Markov Chains is useful in the simulation of realizations from a (joint) distribution for θ specified by a function h(θ) proportional to a "density" (here a pmf).


Definition 12 A (discrete time/discrete state space) Markov Chain is a sequence of random quantities {θ^k}, each taking values in a (finite or) countable set X, with the property that

P[θ^k = x_k | θ¹ = x₁, …, θ^{k−1} = x_{k−1}] = P[θ^k = x_k | θ^{k−1} = x_{k−1}].

Definition 13 A Markov Chain is stationary provided P[θ^k = x_k | θ^{k−1} = x_{k−1}] is independent of k.

WOLOG we will for the time being name the elements of X with the integers 1, 2, 3, … and call them "states."

Definition 14 With p_ij := P[θ^k = j | θ^{k−1} = i], the square matrix P := (p_ij) is called the transition matrix for a stationary Markov Chain and the p_ij are called transition probabilities.

Note that a transition matrix has nonnegative entries and row sums of 1. Such matrices are often called "stochastic" matrices. As a matter of further notation for a stationary Markov Chain, let

p_ij^t = P[θ^{t+k} = j | θ^k = i]

(this is the i, j entry of the tth power of P, P^t = P × P × ⋯ × P (t factors)) and

f_ij^t = P[θ^{k+t} = j, θ^{k+t−1} ≠ j, …, θ^{k+1} ≠ j | θ^k = i].

(These are respectively the probabilities of moving from i to j in t steps and of first moving from i to j in t steps.)

Definition 15 We say that a MC is irreducible if for each i and j ∃ t (possibly depending upon i and j) such that p_ij^t > 0.

(A chain is irreducible if it is possible to eventually get from any state i to any other state j.)

Definition 16 We say that the ith state of a MC is transient if Σ_{t=1}^∞ f_ii^t < 1 and say that the state is persistent if Σ_{t=1}^∞ f_ii^t = 1. A chain is called persistent if all of its states are persistent.

(A state is transient if once in it, there is some possibility that the chain will never return. A state is persistent if once in it, the chain will with certainty be in it again.)

Definition 17 We say that state i of a MC has period t if p_ii^s = 0 unless s = νt (s is a multiple of t) and t is the largest integer with this property. The state is aperiodic if no such t > 1 exists. And a MC is called aperiodic if all of its states are aperiodic.


Many sources (including Chapter 15 of the 3rd Edition of Feller Volume 1) present a number of useful simple results about MC's. Among them are the following.

Theorem 18 All states of an irreducible MC are of the same type (with regard to persistence and periodicity).

Theorem 19 A finite state space irreducible MC is persistent.

Theorem 20 Suppose that a MC is irreducible, aperiodic, and persistent. Suppose further that for each state i the mean recurrence time is finite, i.e.

Σ_{t=1}^∞ t f_ii^t < ∞.

Then an invariant/stationary distribution for the MC exists, i.e. ∃ {u_j} with u_j > 0 and Σ u_j = 1 such that

u_j = Σ_i u_i p_ij.

(If the chain is started with distribution {u_j}, after one transition it is in states 1, 2, 3, … with probabilities {u_j}.) Further, this distribution {u_j} satisfies

u_j = lim_{t→∞} p_ij^t ∀ i,

and

u_j = 1 / (Σ_{t=1}^∞ t f_jj^t).

There is a converse of this theorem.

Theorem 21 An irreducible, aperiodic MC for which ∃ {u_j} with u_j > 0 and Σ u_j = 1 such that u_j = Σ_i u_i p_ij must be persistent with u_j = 1 / (Σ_{t=1}^∞ t f_jj^t).

And there is an important "ergodic" result that guarantees that "time averages" have the right limits.

Theorem 22 Under the hypotheses of Theorem 20, if g is a real-valued function such that

Σ_j |g(j)| u_j < ∞

then for any j, if θ⁰ = j,

P[(1/n) Σ_{k=1}^n g(θ^k) → Σ_j g(j) u_j] = 1


(Note that the choice of g as an indicator provides approximations for stationary probabilities.)

With this background, the basic idea of MCMC for Bayes computation is the following. If we wish to simulate from a distribution {u_j} with

u_j ∝ h(j)    (78)

or approximate properties of the distribution that can be expressed as moments of some function g(j), we find a convenient P whose invariant distribution is {u_j}. From a starting state θ⁰ = i, we use P to generate a value for θ¹. Using the realization of θ¹ and P, we generate θ², etc. Then one applies Theorem 22 to approximate the quantity of interest.

In answer to the question "How does one argue that the common algorithms of Bayes computation have P's with invariant distribution proportional to {h(j)}?" there is the following useful sufficient condition (that has application in the original motivating problem of simulating from high dimensional distributions) for a chain to have {u_j} for an invariant distribution.

Lemma 23 If {θ^k} is a MC with transition probabilities satisfying

u_i p_ij = u_j p_ji,    (79)

then it has invariant distribution {u_j}.

Note then that if a candidate P satisfies (79) and is irreducible and aperiodic, Theorem 21 shows that it is persistent. Theorem 20 then shows that any arbitrary starting value can be used and yields approximate realizations from {u_j}, and Theorem 22 implies that "time averages" can be used to approximate properties of {u_j}. Of course, in the Bayes context, it is distributions (78) that are of interest in MCMC from a posterior.
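A tiny numerical illustration of Lemma 23 (an invented example, not from the notes): build the Metropolis transition matrix for h = (1, 2, 3) on three states with a uniform symmetric jumping distribution, then check both the detailed balance condition (79) and the invariance u = uP.

```python
# target: u_j proportional to h(j) on states {0, 1, 2}
h = [1.0, 2.0, 3.0]
n = len(h)
t = [[1.0 / n] * n for _ in range(n)]      # symmetric jumping matrix T

# Metropolis: off-diagonal p_ij = t_ij * min(1, h_j t_ji / (h_i t_ij))
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if j != i:
            P[i][j] = t[i][j] * min(1.0, h[j] * t[j][i] / (h[i] * t[i][j]))
    P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)

u = [x / sum(h) for x in h]                # normalized target
uP = [sum(u[i] * P[i][j] for i in range(n)) for j in range(n)]
db_err = max(abs(u[i] * P[i][j] - u[j] * P[j][i])
             for i in range(n) for j in range(n))
inv_err = max(abs(a - b) for a, b in zip(uP, u))
```

Both errors come out to zero (up to rounding), exactly as the lemma predicts.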

9.2 Application to the Metropolis-Hastings Algorithm

Sometimes MCMC schemes useful in Bayes computation can be shown to have the "correct" invariant distributions by observing that they satisfy (79). For example, Lemma 23 can be applied to the Metropolis-Hastings Algorithm. That is, let T = (t_ij) be any stochastic matrix corresponding to an irreducible aperiodic MC. This specifies, for each i, a jumping distribution. Note that in a finite case, one can take t_ij = 1/(the number of states). As indicated in Section 2.4, the Metropolis-Hastings algorithm operates as follows:

• Supposing that θ^{k−1} = i, generate J (at random) according to the distribution over the state space specified by row i of T (that is, according to {t_ij}).

• Then generate θ^k based on i and (the randomly generated) J according to

θ^k = J with probability min(1, u_J t_Ji/(u_i t_iJ)), and
θ^k = i with probability max(0, 1 − u_J t_Ji/(u_i t_iJ))    (80)


Note that for {θ^k} so generated, for j ≠ i

p_ij = P[θ^k = j | θ^{k−1} = i] = min(1, u_j t_ji/(u_i t_ij)) t_ij

and

p_ii = P[θ^k = i | θ^{k−1} = i] = t_ii + Σ_{j≠i} max(0, 1 − u_j t_ji/(u_i t_ij)) t_ij.

So, for i ≠ j

u_i p_ij = min(u_i t_ij, u_j t_ji) = u_j p_ji.

That is, (79) holds and the MC {θ^k} has stationary distribution {u_j}. (Further, the assumption that T corresponds to an irreducible aperiodic chain implies that {θ^k} is irreducible and aperiodic.)

As indicated in Section 2.4, in order to use the Metropolis-Hastings Algorithm one only has to know the u_j's up to a multiplicative constant. If (78) holds,

u_J/u_i = h(J)/h(i)

and we may write (80) as

θ^k = J with probability min(1, h(J)t_Ji/(h(i)t_iJ)), and
θ^k = i with probability max(0, 1 − h(J)t_Ji/(h(i)t_iJ))    (81)

Notice also that if T is symmetric (i.e. t_ij = t_ji and the jumping distribution is symmetric), (81) reduces to the Metropolis algorithm with

θ^k = J with probability min(1, h(J)/h(i)), and
θ^k = i with probability max(0, 1 − h(J)/h(i))

A variant of the Metropolis-Hastings algorithm is the "Barker Algorithm."

The Barker algorithm modifies the above by replacing

min(1, u_J t_Ji/(u_i t_iJ)) with u_J t_Ji/(u_i t_iJ + u_J t_Ji)

and

max(0, 1 − u_J t_Ji/(u_i t_iJ)) with u_i t_iJ/(u_i t_iJ + u_J t_Ji)

in (80). Note that for this algorithm, for j ≠ i

p_ij = (u_j t_ji/(u_i t_ij + u_j t_ji)) t_ij,

so


u_i p_ij = (u_i t_ij)(u_j t_ji)/(u_i t_ij + u_j t_ji) = u_j p_ji.

That is, (79) holds and thus Lemma 23 guarantees that under the Barker algorithm {θ^k} has invariant distribution {u_j}. (And T irreducible and aperiodic continues to imply that {θ^k} is also irreducible and aperiodic.)

Note also that since

u_j t_ji/(u_i t_ij + u_j t_ji) = (u_j/u_i) t_ji / (t_ij + (u_j/u_i) t_ji),

once again it suffices to know the u_j up to a multiplicative constant in order to implement Barker's algorithm. In the Bayes posterior simulation context, this means that the Barker analogue of the Metropolis-Hastings form (81) is

θ^k = J with probability h(J)t_Ji/(h(i)t_iJ + h(J)t_Ji), and
θ^k = i with probability h(i)t_iJ/(h(i)t_iJ + h(J)t_Ji)    (82)

and if T is symmetric, (82) becomes

θ^k = J with probability h(J)/(h(i) + h(J)), and
θ^k = i with probability h(i)/(h(i) + h(J))
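As with Metropolis-Hastings, the Barker acceptance probabilities can be checked numerically to satisfy (79), even for an asymmetric jumping matrix T (again an invented illustration):

```python
# target proportional to h on four states, with an asymmetric T
h = [1.0, 4.0, 2.0, 3.0]
n = len(h)
t = [[0.10, 0.20, 0.30, 0.40],     # an arbitrary row-stochastic,
     [0.25, 0.25, 0.25, 0.25],     # everywhere-positive jumping matrix
     [0.40, 0.30, 0.20, 0.10],
     [0.20, 0.20, 0.20, 0.40]]

# Barker: accept J with probability h(J)t_Ji / (h(i)t_iJ + h(J)t_Ji)
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if j != i:
            P[i][j] = t[i][j] * (h[j] * t[j][i]) / (h[i] * t[i][j] + h[j] * t[j][i])
    P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)

u = [x / sum(h) for x in h]
db_err = max(abs(u[i] * P[i][j] - u[j] * P[j][i])
             for i in range(n) for j in range(n))
```

The detailed balance error is zero up to rounding, since u_i p_ij = h_i t_ij h_j t_ji / (Σh × (h_i t_ij + h_j t_ji)) is symmetric in i and j.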

9.3 Application to the Gibbs Sampler

Consider now the Gibbs Sampler (of Section 2.2). For sake of concreteness, consider the situation where the distribution of a discrete 3-dimensional random vector θ = (θ₁, θ₂, θ₃) with probability mass function proportional to h(θ) is at issue. One defines a MC {θ^k} as follows. For an arbitrary starting state θ⁰ = (θ₁⁰, θ₂⁰, θ₃⁰), once one has θ^{k−1} = (θ₁^{k−1}, θ₂^{k−1}, θ₃^{k−1}):

• Generate θ^{k−2/3} = (θ₁^k, θ₂^{k−1}, θ₃^{k−1}) by generating θ₁^k from the conditional distribution of θ₁ | θ₂ = θ₂^{k−1} and θ₃ = θ₃^{k−1}, i.e. from the (conditional) distribution with probability mass function h_{θ₁|θ₂,θ₃}(θ₁ | θ₂^{k−1}, θ₃^{k−1}) := h(θ₁, θ₂^{k−1}, θ₃^{k−1}) / Σ_{θ₁} h(θ₁, θ₂^{k−1}, θ₃^{k−1}).

• Generate θ^{k−1/3} = (θ₁^k, θ₂^k, θ₃^{k−1}) by generating θ₂^k from the conditional distribution of θ₂ | θ₁ = θ₁^k and θ₃ = θ₃^{k−1}, i.e. from the (conditional) distribution with probability function h_{θ₂|θ₁,θ₃}(θ₂ | θ₁^k, θ₃^{k−1}) := h(θ₁^k, θ₂, θ₃^{k−1}) / Σ_{θ₂} h(θ₁^k, θ₂, θ₃^{k−1}).

• Generate θ^k = (θ₁^k, θ₂^k, θ₃^k) by generating θ₃^k from the conditional distribution of θ₃ | θ₁ = θ₁^k and θ₂ = θ₂^k, i.e. from the (conditional) distribution with probability function h_{θ₃|θ₁,θ₂}(θ₃ | θ₁^k, θ₂^k) := h(θ₁^k, θ₂^k, θ₃) / Σ_{θ₃} h(θ₁^k, θ₂^k, θ₃).


Note that with this algorithm, a typical transition probability (for a step where a θ^{k−2/3} is going to be generated) is

P[θ^{k−2/3} = (θ₁, θ₂^{k−1}, θ₃^{k−1}) | θ^{k−1} = (θ₁^{k−1}, θ₂^{k−1}, θ₃^{k−1})] = h(θ₁, θ₂^{k−1}, θ₃^{k−1}) / Σ_{θ₁} h(θ₁, θ₂^{k−1}, θ₃^{k−1})

so if θ^{k−1} has distribution specified by h, the probability that θ^{k−2/3} = (θ₁, θ₂, θ₃) is

Σ_ψ (h(ψ, θ₂, θ₃) / Σ_{ψ′,θ₂′,θ₃′} h(ψ′, θ₂′, θ₃′)) (h(θ₁, θ₂, θ₃) / Σ_{θ₁′} h(θ₁′, θ₂, θ₃)) = h(θ₁, θ₂, θ₃) / Σ_{θ₁′,θ₂′,θ₃′} h(θ₁′, θ₂′, θ₃′)

so θ^{k−2/3} also has the distribution specified by h. And analogous results hold for the other two types of transitions (where θ₂^k and θ^k are to be generated). That is, direct calculation (as opposed to the use of Lemma 23) shows that if P₁, P₂ and P₃ are the 3 (different) transition matrices respectively for the transitions θ^{k−1} → θ^{k−2/3}, θ^{k−2/3} → θ^{k−1/3}, and θ^{k−1/3} → θ^k, they each have the distribution specified by h as their invariant distribution. This means that the transition matrix for θ^{k−1} → θ^k, namely

P = P₁P₂P₃

also has the distribution specified by h as its invariant distribution, and describes a whole cycle of the Gibbs/Successive Substitution Sampling algorithm. {θ^k} is thus a stationary Markov Chain with transition matrix P. So one is in a position to apply Theorems 21 and 22. If P is irreducible and aperiodic (this has to be checked), Theorem 21 says that the chain {θ^k} is persistent, and then Theorems 20 and 22 say that observations from h can be simulated using an arbitrary starting state.
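The three-step cycle just described is easy to code for a toy h on {0,1}³. The sketch below (the particular h is invented purely for illustration) runs the sampler and compares time-average frequencies, per Theorem 22, to the normalized h:

```python
import random

def h(t1, t2, t3):
    """An arbitrary positive function on {0,1}^3 (invented example)."""
    return 1.0 + t1 + 2 * t2 + 3 * t3 + 2 * t1 * t3

def draw_coord(state, idx):
    """Resample coordinate idx from its full conditional h(.|others)."""
    s0, s1 = list(state), list(state)
    s0[idx], s1[idx] = 0, 1
    w0, w1 = h(*s0), h(*s1)
    state[idx] = 1 if random.random() < w1 / (w0 + w1) else 0

random.seed(0)
state = [0, 0, 0]
counts = {}
n_iter = 200_000
for _ in range(n_iter):
    for idx in (0, 1, 2):          # one full Gibbs cycle: P = P1 P2 P3
        draw_coord(state, idx)
    counts[tuple(state)] = counts.get(tuple(state), 0) + 1

Z = sum(h(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
max_err = max(abs(counts.get((a, b, c), 0) / n_iter - h(a, b, c) / Z)
              for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

For this h the empirical cell frequencies settle close to h/Z, illustrating the ergodic averaging of Theorem 22.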

9.4 Application to Metropolis-Hastings-in-Gibbs Algorithms

Consider now the kind of combination of Metropolis-Hastings and Gibbs sampling algorithms considered in Section 2.5. For sake of concreteness, suppose again that a discrete 3-dimensional random vector θ = (θ₁, θ₂, θ₃) with probability mass function proportional to h(θ) is at issue. Suppose further that it is clear how to make θ₂ and θ₃ updates using conditionals of θ₂|θ₁,θ₃ and θ₃|θ₁,θ₂ (these are recognizable as of a standard form) but that a "Metropolis-Hastings step" is to be used to make θ₁ updates.

For an arbitrary starting state θ⁰ = (θ₁⁰, θ₂⁰, θ₃⁰), once one has θ^{k−1} = (θ₁^{k−1}, θ₂^{k−1}, θ₃^{k−1}), one first makes a θ₁ update as follows. Suppose that for every pair (θ₂, θ₃),

t(θ₁, θ₁′ | θ₂, θ₃)

specifies a transition matrix on the set of corresponding possible θ₁'s (for transitions θ₁ → θ₁′), and for safety's sake, let's require that all t(θ₁, θ₁′ | θ₂, θ₃) > 0. Then

• sample a candidate θ₁* from t(θ₁^{k−1}, · | θ₂^{k−1}, θ₃^{k−1}), and


• set

θ₁^k = θ₁* with probability min(1, [h(θ₁*, θ₂^{k−1}, θ₃^{k−1}) t(θ₁*, θ₁^{k−1} | θ₂^{k−1}, θ₃^{k−1})] / [h(θ₁^{k−1}, θ₂^{k−1}, θ₃^{k−1}) t(θ₁^{k−1}, θ₁* | θ₂^{k−1}, θ₃^{k−1})]), and
θ₁^k = θ₁^{k−1} otherwise,

and then, just as in Section 9.3,

• generate θ₂^k from the conditional distribution of θ₂ | θ₁ = θ₁^k and θ₃ = θ₃^{k−1}, i.e. from the (conditional) distribution with probability function h_{θ₂|θ₁,θ₃}(θ₂ | θ₁^k, θ₃^{k−1}) := h(θ₁^k, θ₂, θ₃^{k−1}) / Σ_{θ₂} h(θ₁^k, θ₂, θ₃^{k−1}), and

• generate θ^k = (θ₁^k, θ₂^k, θ₃^k) by generating θ₃^k from the conditional distribution of θ₃ | θ₁ = θ₁^k and θ₂ = θ₂^k, i.e. from the (conditional) distribution with probability function h_{θ₃|θ₁,θ₂}(θ₃ | θ₁^k, θ₂^k) := h(θ₁^k, θ₂^k, θ₃) / Σ_{θ₃} h(θ₁^k, θ₂^k, θ₃).

We have already argued that the two "straight Gibbs updates" above have the distribution specified by h as their invariant distribution. We need to argue that the first (Metropolis-Hastings) step leaves the distribution specified by h invariant (notice that this is not obviously covered by the argument for the overall Metropolis-Hastings algorithm offered in Section 9.2). So suppose that θ^{k−1} has distribution specified by h and consider the distribution of θ^{k−2/3} = (θ₁^k, θ₂^{k−1}, θ₃^{k−1}) obtained from θ^{k−1} by employing the Metropolis-Hastings step to replace θ₁^{k−1}.

P[θ^{k−2/3} = (θ₁′, θ₂, θ₃)] = Σ_{θ₁} P[θ^{k−1} = (θ₁, θ₂, θ₃)] × P[(θ₁, θ₂, θ₃) → (θ₁′, θ₂, θ₃)]

where (θ₁, θ₂, θ₃) → (θ₁′, θ₂, θ₃) is shorthand for the Metropolis-Hastings step resulting in the indicated transition. Then if c = Σ_θ h(θ), so that it is (1/c)h that


is the pmf of interest,

c × P[θ^{k−2/3} = (θ₁′, θ₂, θ₃)]
= Σ_{θ₁≠θ₁′} h(θ₁, θ₂, θ₃) t(θ₁, θ₁′ | θ₂, θ₃) min(1, [h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃)] / [h(θ₁, θ₂, θ₃) t(θ₁, θ₁′ | θ₂, θ₃)])
+ h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁′ | θ₂, θ₃) × 1
+ h(θ₁′, θ₂, θ₃) Σ_{θ₁≠θ₁′} t(θ₁′, θ₁ | θ₂, θ₃) max(0, 1 − [h(θ₁, θ₂, θ₃) t(θ₁, θ₁′ | θ₂, θ₃)] / [h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃)])
= Σ_{θ₁≠θ₁′} min(h(θ₁, θ₂, θ₃) t(θ₁, θ₁′ | θ₂, θ₃), h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃))
+ h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁′ | θ₂, θ₃)
+ Σ_{θ₁≠θ₁′} max(0, h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃) − h(θ₁, θ₂, θ₃) t(θ₁, θ₁′ | θ₂, θ₃))
= h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁′ | θ₂, θ₃) + Σ_{θ₁≠θ₁′} h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃)
= Σ_{θ₁} h(θ₁′, θ₂, θ₃) t(θ₁′, θ₁ | θ₂, θ₃)
= h(θ₁′, θ₂, θ₃)

(the last reduction using the fact that min(a, b) + max(0, b − a) = b, and that the rows of t sum to 1)

and θ^{k−2/3} = (θ₁^k, θ₂^{k−1}, θ₃^{k−1}) also has the distribution specified by h. That is, the Metropolis-Hastings step leaves the distribution specified by h invariant. This can be represented by some transition matrix P₁ for the θ^{k−1} → θ^{k−2/3} transition. Then if, as in Section 9.3, P₂ and P₃ represent respectively the θ^{k−2/3} → θ^{k−1/3} and θ^{k−1/3} → θ^k transitions, the whole transition matrix for θ^{k−1} → θ^k,

P = P₁P₂P₃

has the distribution specified by h as its invariant distribution, and describes a complete cycle of the Metropolis-Hastings-in-Gibbs algorithm. {θ^k} is thus a stationary Markov Chain with transition matrix P. So again one is in a position to apply Theorems 21 and 22. If P is irreducible and aperiodic (this has to be checked), Theorem 21 says that the chain {θ^k} is persistent, and then Theorems 20 and 22 say that observations from h can be simulated using an arbitrary starting state.
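A runnable sketch of such a hybrid scheme on {0,1}³ (the h and the proposal are invented; the proposal simply flips θ₁ with probability 0.8, a symmetric t, while θ₂ and θ₃ get exact conditional updates):

```python
import random

def h(t1, t2, t3):
    """An arbitrary positive function on {0,1}^3 (invented example)."""
    return 1.0 + 2 * t1 + t2 + 2 * t3 + 3 * t1 * t2

def mh_update_t1(state):
    """M-H step for theta_1: symmetric proposal (flip w.p. 0.8), so the
    acceptance ratio reduces to a ratio of h values."""
    t1 = state[0]
    prop = 1 - t1 if random.random() < 0.8 else t1
    if prop != t1:
        ratio = h(prop, state[1], state[2]) / h(t1, state[1], state[2])
        if random.random() < min(1.0, ratio):
            state[0] = prop

def gibbs_update(state, idx):
    """Exact full-conditional update for coordinate idx."""
    s0, s1 = list(state), list(state)
    s0[idx], s1[idx] = 0, 1
    state[idx] = 1 if random.random() < h(*s1) / (h(*s0) + h(*s1)) else 0

random.seed(1)
state = [0, 0, 0]
counts = {}
n_iter = 200_000
for _ in range(n_iter):
    mh_update_t1(state)        # P1: M-H step on theta_1
    gibbs_update(state, 1)     # P2: conditional draw of theta_2
    gibbs_update(state, 2)     # P3: conditional draw of theta_3
    counts[tuple(state)] = counts.get(tuple(state), 0) + 1

Z = sum(h(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
max_err = max(abs(counts.get((a, b, c), 0) / n_iter - h(a, b, c) / Z)
              for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Again the time-average cell frequencies approach h/Z, as the invariance argument above guarantees.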

9.5 Application to "Alternating" Algorithms

The kind of logic used above in considering the Gibbs and Metropolis-Hastings-in-Gibbs algorithms suggests another variant on MCMC for Bayes computation. That is, one might think about alternating in some regular way between two or more basic algorithms. That is, if P_Gibbs is a transition matrix for a complete cycle of Gibbs substitutions and P_M-H is a transition matrix for an iteration of


a Metropolis-Hastings algorithm, then

P = P_Gibbs P_M-H

is a transition matrix for an algorithm that can be implemented by following a Gibbs cycle with a Metropolis-Hastings iteration, followed by a Gibbs cycle, and so on. It is possible that in some cases such an alternating algorithm might avoid difficulties in "mixing" that would be encountered by either of the component algorithms applied alone.
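The invariance of the composite kernel can be seen by direct matrix computation on a small state space (an invented 4-point example on {0,1}²): each factor matrix leaves the target invariant, so their product does too.

```python
# target proportional to h on the four points of {0,1}^2
h = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
states = sorted(h)
u = [h[s] / sum(h.values()) for s in states]

def gibbs_coord_matrix(idx):
    """Matrix that resamples coordinate idx from h(.|other coordinate)."""
    P = [[0.0] * 4 for _ in range(4)]
    other = 1 - idx
    for a, s in enumerate(states):
        for b, s2 in enumerate(states):
            if s2[other] == s[other]:
                P[a][b] = h[s2] / sum(h[r] for r in states
                                      if r[other] == s[other])
    return P

def metropolis_matrix():
    """Metropolis with a uniform jumping distribution over all 4 states."""
    P = [[0.0] * 4 for _ in range(4)]
    for a, s in enumerate(states):
        for b, s2 in enumerate(states):
            if a != b:
                P[a][b] = 0.25 * min(1.0, h[s2] / h[s])
        P[a][a] = 1.0 - sum(P[a])
    return P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# P_Gibbs (a full cycle) followed by P_M-H, as one composite kernel
P = matmul(matmul(gibbs_coord_matrix(0), gibbs_coord_matrix(1)),
           metropolis_matrix())
uP = [sum(u[i] * P[i][j] for i in range(4)) for j in range(4)]
inv_err = max(abs(a - b) for a, b in zip(uP, u))
```

The composite P need not satisfy detailed balance, but u = uP still holds (up to rounding), which is all the alternating scheme requires.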
