QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION

MARK FISHER

Preliminary and incomplete.

Abstract. The Quasi-Bernstein Density (QBD) model is a Bayesian nonparametric model for density estimation on a finite interval that incorporates beta distributions into a Dirichlet Process Mixture (DPM) model. The coefficients of the beta distributions are restricted to the natural numbers, producing a model that is related to the Bernstein polynomial model introduced by Petrone (1999a). The potential number of mixture components is unbounded. The prior predictive density is uniform over the finite interval.

The model is first applied to observed data, and then applied to latent variables via indirect density estimation. In this latter setting the model becomes a DPM model for the prior. In the context of binomial observations and latent success rates, this prior is compared with the Dirichlet process (DP) prior found in the literature. (The DP prior can be viewed as a special case of the DPM prior.)

1. Introduction

This paper is about density estimation on a finite interval using a mixture of beta distributions. (A simple extension allows for two-dimensional density estimation on a rectangle.) Here are the main features: (1) the approach to inference is Bayesian; (2) the parameters of the beta distributions are restricted to the natural numbers; (3) the potential number of mixture components is unbounded; and (4) the prior predictive density is uniform. (The prior predictive density can be made nonuniform if desired.) As illustrations, the model is applied to a number of standard data sets in one and two dimensions.

In addition, I show how to apply the model to latent variables via what I call indirect density estimation. (In this context I introduce the distinction between generic and specific cases.) To illustrate this technique, I apply it to compute the density of unobserved success rates that underlie the observations from binomial experiments and/or units. The results are compared with those generated by an alternative model that has appeared in the literature.

Related literature. This model is related to the Bernstein polynomial density technique of Petrone (1999a). See also Trippa et al. (2011) and Kottas (2006).

For asymptotic properties of random Bernstein polynomials, see Petrone (1999b) and Petrone and Wasserman (2002). For a multivariate extension, see Zhao et al. (2013). For a discussion of samplers for Dirichlet process mixture models, see Neal (2000).

Date: July 22, 2014 @ 05:23.
The views expressed herein are the author's and do not necessarily reflect those of the Federal Reserve Bank of Atlanta or the Federal Reserve System.

Liu (1996) presents a related model in which the latent success rates for binomial observations have a Dirichlet Process (DP) prior. (This model is discussed in Section 8.)

Outline. Sections 2 through 6 present the model as applied to observables. Section 2 provides a general introduction to Dirichlet Process Mixture (DPM) models and the Chinese Restaurant Process (CRP) and derives the general form of the conditional predictive distribution. Section 3 specializes the model to the quasi-Bernstein model. Section 4 presents a two-dimensional extension of the model. Section 5 describes the details of the Markov chain Monte Carlo (MCMC) sampler. An empirical investigation is presented in Section 6.

Sections 7 through 9 apply the model to the case of latent variables. Section 7 applies the model to latent variables via indirect density estimation, introducing the distinction between generic and specific cases. Section 8 presents an alternative (DP-based) prior that has been used in the literature for the case of binomial observations with latent success rates. Section 9 presents an empirical investigation for the latent-variable case.

In addition there are five appendices. Appendix A discusses potential variation in the predictive distribution arising from additional observations. Appendix B presents additional features of the probabilities associated with the Chinese Restaurant Process. Appendix C presents a generalization of the two-dimensional model that includes local dependence via a copula. Appendix D shows how to integrate out the unobserved success rate in the case of binomial data. Appendix E discusses pooling, shrinkage, and sharing.

2. The standard model in general terms

This section describes the model. Much of the exposition in this section is dedicated to providing the less knowledgeable reader with some insight into (i) the structure of the standard Dirichlet Process Mixture (DPM) model, (ii) its marginalization into a classification model based on the Chinese Restaurant Process (CRP), and (iii) the derivation of the conditional predictive distribution (which forms the basis for posterior estimation). To that end, this section borrows from the exposition of Teh (2010) [to which the reader is referred for additional introductory material].

The general form of the standard model adopted in this paper is stated in (2.13). The novelty is entirely in the specializations, which are presented in Section 3 in (3.1), (3.3), and (3.12).

Inferential goal. Let x_{1:n} = (x_1, . . . , x_n) denote a set of observations.^1 The goal of inference is the posterior distribution for the next observation (which we express hereafter as a density), also known as the posterior predictive distribution:

p(x_{n+1} | x_{1:n}, I),   (2.1)

where I stands for the model and the prior information.^2

Computing this predictive distribution amounts to an exercise in Bayesian density estimation. For density estimation, it is natural to adopt assumptions that allow for a wide range of possible distributions, including the possibility of multi-modality.

^1 In Section 3, the model will be specialized to require x_i ∈ [0, 1] for all i. This restriction, however, is not relevant in the current section.

^2 Appendix A discusses the potential variation in the predictive distribution arising from additional observations.

In what follows, we provide a brief introduction to what has become the standard approach. The discussion of DPM models is intended to help give the reader a sense of where the flexibility of the Bayesian approach to density estimation comes from. However, one may skim this discussion (for notation) and start in earnest with the CRP model [summarized in (2.13)] upon which the predictive distribution is based.

DPM models. Let θ_{1:n} = (θ_1, . . . , θ_n) denote a set of latent parameters, where θ_i ∈ Θ. The Dirichlet Process Mixture (DPM) model has the following hierarchical structure:^3

x_i | θ_i ∼ F(θ_i)   (2.2a)
θ_i ∼ G (iid)   (2.2b)
G | α ∼ DP(α, H)   (2.2c)
α ∼ P_α.   (2.2d)

According to (2.2a), the observations are conditionally independent but not (necessarily) identically distributed. The distribution F plays the role of a kernel in density estimation. The location and scale (i.e., bandwidth) of the kernel associated with the ith observation are controlled by the parameter θ_i. The prior distribution for θ_i is G.

G is a random probability distribution with a prior given by a Dirichlet Process (DP).

The DP depends on a concentration parameter α > 0 and a base distribution H, where H is a distribution over Θ. The flexibility of the DP prior is such that G can approximate arbitrarily well (in the weak topology^4) any distribution defined on the support of H.

The DP prior for G is completely characterized by the following proposition: If A_1, . . . , A_s is any finite measurable partition of Θ, then

(G(A_1), . . . , G(A_s)) ∼ Dirichlet(α H(A_1), . . . , α H(A_s)),   (2.3)

where Dirichlet denotes the Dirichlet distribution. Note \sum_{j=1}^{s} G(A_j) = \sum_{j=1}^{s} H(A_j) = 1. It follows from the properties of the Dirichlet distribution that

E[G(A)] = H(A),   for any measurable set A ⊂ Θ.   (2.4)

Thus, the base distribution H can be thought of as the expectation of the random distribution G. In addition, the variance of G around H is inversely related to the concentration parameter α:

V[G(A)] = H(A) (1 − H(A)) / (1 + α).   (2.5)

Thus, the concentration parameter determines how concentrated G is around its expectation H.

Realizations of G are (almost surely) discrete — even if H is not — and consequently there is positive probability that θ_i = θ_j for some j ≠ i. The smaller the concentration parameter, the greater the probability of duplication. (This will become evident shortly.)

^3 The distributions F, H, and P_α will be specified in Section 3.
^4 The central limit theorem involves the weak topology.

There is an explicit representation for G. Let θ = (θ_1, θ_2, . . .) denote the support of G and let v = (v_1, v_2, . . .) denote the associated probabilities. In other words, v_c is the probability that θ_i = θ_c. Then G can be expressed as

G = \sum_{c=1}^{∞} v_c δ_{θ_c},   (2.6)

where δ_w denotes a point-mass (distribution) located at w. The randomness of G follows from the randomness of both the support and the associated probabilities:

θ_c ∼ H (iid)   (2.7a)
v | α ∼ Stick(α),   (2.7b)

where Stick(α) denotes the stick-breaking distribution given by^5

v_c = ε_c \prod_{ℓ=1}^{c−1} (1 − ε_ℓ)   where   ε_c ∼ Beta(1, α).   (2.8)

The parameter α controls the rate at which the weights decline on average. In particular, the weights decline geometrically in expectation:

E[v_c | α, I] = α^{c−1} (1 + α)^{−c}.   (2.9)

Note E[v_1 | α, I] = 1/(1 + α) and E[\sum_{c=n+1}^{∞} v_c | α, I] = (α/(1 + α))^n. When α is small, a few weights dominate the distribution, thereby producing a substantial probability of duplicates.
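To make the stick-breaking construction concrete, the following minimal sketch (not part of the original paper; Python with NumPy is assumed) draws weights via (2.8) and checks the expectation formula (2.9) by Monte Carlo.

```python
# Illustrative sketch: Monte Carlo check of the stick-breaking construction (2.8)
# and the expected weights (2.9).
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, n_weights, rng):
    """Draw the first n_weights stick-breaking weights v_1, v_2, ..."""
    eps = rng.beta(1.0, alpha, size=n_weights)               # epsilon_c ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - eps)[:-1]))
    return eps * remaining                                   # v_c = eps_c * prod_{l<c}(1 - eps_l)

alpha, C, R = 2.0, 10, 100_000
draws = np.array([stick_breaking_weights(alpha, C, rng) for _ in range(R)])
mc_mean = draws.mean(axis=0)
exact = alpha ** np.arange(C) / (1.0 + alpha) ** np.arange(1, C + 1)   # (2.9)
print(np.round(mc_mean, 4))
print(np.round(exact, 4))
```

The Monte Carlo means and the closed-form expectations agree to within simulation error.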

Marginalization. There are a number of schemes to estimate the DPM-based model (assuming F, H, and P_α have been specified). However, it is possible to reexpress the model (via marginalization) in a form that allows for estimation schemes that are both simpler and lower-variance. Although the marginalized model cannot deliver inferences about G, it does deliver the predictive inferences that are the subject of this paper.

As a first step in reexpressing the model, introduce the classification variables z_{1:n} = (z_1, . . . , z_n), where z_i = c indicates θ_i = θ_c.^6 The classifications are solely determined by the mixture weights v. In particular, z_i takes on value c with probability v_c, which can be expressed as

z_{1:n} | v ∼ Multinomial(v).   (2.10)

The reexpression is completed by integrating out v using its prior, producing the marginal distribution for the classifications:

p(z_{1:n} | α) = \int p(z_{1:n} | v) p(v | α) dv.   (2.11)

This distribution can be expressed as follows:

z_{1:n} | α ∼ CRP(α),   (2.12)

^5 Start with a stick of length one. Break off the fraction ε_1, leaving a stick of length 1 − ε_1. Then break off the fraction ε_2 of the remaining stick, leaving a stick of length (1 − ε_1)(1 − ε_2). Continue in this manner.

^6 Note that z_{1:n} can be interpreted as a partition of the set {1, . . . , n}. We may interpret the partition as determining the way in which the parameters are shared among the observations.

where CRP(α) denotes the Chinese Restaurant Process (CRP) with concentration parameter α. The CRP plays a central role in the model and we will discuss it shortly.

Summary. The marginalized model is summarized as follows:

x_i | θ_i ∼ F(θ_i)   (2.13a)
θ_i = θ_{z_i}   (2.13b)
θ_c ∼ H (iid)   (2.13c)
z_{1:n} | α ∼ CRP(α)   (2.13d)
α ∼ P_α.   (2.13e)

This is a standard model in the literature. The novelty introduced in this paper comes in the specification of F, H, and P_α (in Section 3). In the meantime, it is convenient to discuss the CRP and also to derive the general forms of some additional expressions.

CRP as sharing. Perhaps the most interesting aspect of (2.13) is the sharing of parameters among the observations. We begin by stating certain features of this sharing. In particular, there is a finite number of clusters, where the number of clusters is random. Each of the clusters has a common parameter, where θ_c is the parameter for cluster c. Each observation has a classification parameter z_i, where z_i = c denotes that θ_i belongs to cluster c, and consequently

θ_i = θ_{z_i}.   (2.14)

We assume the cluster parameters are independently and identically distributed according to some distribution function H:

θ_c ∼ H (iid).   (2.15)

The way in which the parameters are shared is determined by a probability distribution for the classifications, z_{1:n} = (z_1, . . . , z_n). For this purpose we adopt the Chinese Restaurant Process (CRP). The imagery behind the CRP is that there is a restaurant with an unlimited number of tables at which customers are randomly seated, and all customers seated at the same table share the same entree. In the analogy, the tables correspond to the clusters, the customers correspond to the observations, and the entrees correspond to the parameter values.

The seating assignment works as follows. Customers arrive one at a time, and the first customer is always seated at table number one. Subsequent customers are seated either at an already occupied table or at the next unoccupied table. The probabilities of each of these possible seating choices are described next.

Suppose n customers have already been seated. Let m_n denote the number of occupied tables, which can be anywhere from 1 to n. Let d_{n,c} denote the number of customers seated at table c. (For future reference, note that m_n = max[z_{1:n}], that d_{n,c} is the multiplicity of c in z_{1:n}, and that \sum_{c=1}^{m_n} d_{n,c} = n.) The probability that the next customer is seated at table c is given by

p(z_{n+1} = c | z_{1:n}, α) = \begin{cases} \frac{d_{n,c}}{n + α} & c ∈ \{1, . . . , m_n\} \\ \frac{α}{n + α} & c = m_n + 1, \end{cases}   (2.16)

where α is the concentration parameter.^7

The effect of α on the amount of sharing can be seen in (2.16). If α = 0, then all customers are seated at the first table and there is complete sharing (θ_i = θ_1 for all i). At the other extreme, if α = ∞, then each customer is seated separately and there is no sharing (each observation has its own cluster parameter). Intermediate values for α produce partial sharing. Because the value of α plays such an important role in determining the amount of sharing, we provide it with a prior and allow the data to help determine its value.

The probability of z_{1:n} (given α) can be computed using (2.16). It depends only on the set of multiplicities associated with z_{1:n}:

p(z_{1:n} | α) = p(z_1 | α) \prod_{i=1}^{n−1} p(z_{i+1} | z_{1:i}, α) = \frac{α^{m_n} \prod_{c=1}^{m_n} (d_{n,c} − 1)!}{(α)_n},   (2.17)

where (α)_n := \prod_{i=1}^{n} (i − 1 + α) = Γ(n + α)/Γ(α).^8
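As an illustration of (2.16) and (2.17), here is a minimal sketch (my own, not the author's code; Python/NumPy assumed) that draws classifications sequentially from the CRP and evaluates the closed-form probability.

```python
# Illustrative sketch: sequential CRP draws via (2.16) and the closed-form
# probability (2.17).
import numpy as np
from math import lgamma, factorial

def draw_crp(n, alpha, rng):
    """Draw classifications z_{1:n} from CRP(alpha) using the seating rule (2.16)."""
    z = [1]
    counts = [1]                                    # d_{i,c}: customers at each table
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                        # denominators are i + alpha
        c = rng.choice(len(probs), p=probs) + 1
        if c == len(counts) + 1:
            counts.append(1)                        # open a new table
        else:
            counts[c - 1] += 1
        z.append(c)
    return z

def log_p_crp(z, alpha):
    """log p(z_{1:n} | alpha) from (2.17)."""
    n, m = len(z), max(z)
    d = [z.count(c) for c in range(1, m + 1)]
    log_num = m * np.log(alpha) + sum(np.log(factorial(dc - 1)) for dc in d)
    log_den = lgamma(n + alpha) - lgamma(alpha)     # log (alpha)_n
    return log_num - log_den

rng = np.random.default_rng(0)
z = draw_crp(10, alpha=1.0, rng=rng)
print(z, np.exp(log_p_crp(z, 1.0)))
```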

Posterior predictive distribution revisited. The following parameter vector is central to what follows:

ψ_n := (α, z_{1:n}, θ_{1:m_n}).   (2.18)

We can express the posterior predictive distribution in terms of ψ_n:

p(x_{n+1} | x_{1:n}, I) = \int p(x_{n+1} | ψ_n, I) p(ψ_n | x_{1:n}, I) dψ_n.   (2.19)

Calculation of the posterior distribution according to (2.19) involves the following steps: Find the explicit form for the conditional predictive distribution

p(x_{n+1} | ψ_n, I),   (2.20)

make draws of {ψ_n^{(r)}}_{r=1}^{R} from the posterior distribution p(ψ_n | x_{1:n}, I), and compute the approximation (Rao–Blackwellization):

p(x_{n+1} | x_{1:n}, I) ≈ \frac{1}{R} \sum_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}, I).   (2.21)

The task of finding the explicit form for p(x_{n+1} | ψ_n) begins by recognizing

x_{n+1} | ψ_{n+1} ∼ F(θ_{z_{n+1}}).   (2.22)

Then, noting

ψ_{n+1} = ψ_n ∪ (z_{n+1}, θ_{m_n+1}),   (2.23)

the task concludes by integrating out (z_{n+1}, θ_{m_n+1}) using distributions conditional on ψ_n. The result is a mixture of m_n + 1 terms:

p(x_{n+1} | ψ_n, I) = \sum_{c=1}^{m_n} \frac{d_{n,c}}{n + α} f(x_{n+1} | θ_c) + \frac{α}{n + α} p(x_{n+1} | I),   (2.24)

^7 The CRP can be interpreted as the limit of a sharing scheme involving K categories. Let z_{1:n} ∼ Multinomial(n, w) where w ∼ Dirichlet(α/K, . . . , α/K). Integrate out w and let K → ∞. See Neal (2000) for details.

^8 See Appendix B for additional information regarding the assignment of probabilities to partitions.

where f(· | θ_i) is the density of F(θ_i), h(· | I) is the density of H, and

p(x_i | I) = \int f(x_i | θ_c) h(θ_c | I) dθ_c   (2.25)

for any i ≥ 1 (including n + 1). Equation (2.24) is the standard workhorse of Bayesian nonparametric density estimation.

A key feature of (2.24) is this: If α = ∞ then nothing is learned about x_{n+1} from x_{1:n} because no information flows through the hyperparameter and p(x_{n+1} | ψ_n, I) = p(x_{n+1} | I). Thus the no-sharing environment is also a no-learning environment.

3. Specializing the model

Having set the stage with the general structure of the standard Bayesian nonparametric density model, we are now in a position to specify the model by specializing F, H, and P_α. Recall the general structure of the model is summarized by (2.13).

Specializing F. Let θ_i = (a_i, b_i) ∈ Θ = N × N and let

F(θ_i) = Beta(a_i, b_i),   (3.1)

where

Beta(x | a, b) = 1_{[0,1]}(x) \frac{x^{a−1} (1 − x)^{b−1}}{B(a, b)},   (3.2)

and where B(a, b) is the beta function, B(a, b) = \frac{Γ(a) Γ(b)}{Γ(a+b)} = \int_0^1 x^{a−1} (1 − x)^{b−1} dx. Note the parameters in the Beta distributions are restricted to the positive integers. Also note the model assumes x_i ∈ [0, 1]. It is straightforward to adapt the model to any finite interval.

Specializing H. Let θ_c = (a_c, b_c) ∈ Θ. In order to describe the prior for θ_c it is convenient to change variables to (j_c, k_c), where (a_c, b_c) = (j_c, k_c − j_c + 1). Note k_c ∈ N and j_c ∈ {1, . . . , k_c}. Let H be characterized by

k_c ∼ Pascal(1, ξ)   (3.3a)
j_c | k_c ∼ Discrete(1/k_c, . . . , 1/k_c).   (3.3b)

This prior delivers a model that shares much in common with Petrone's random Bernstein polynomial model. Because we do not treat an entire Bernstein basis as a unit, we refer to this as a quasi-Bernstein model.

The prior for (j_c, k_c) can be expressed as

p(j_c, k_c | I) = p(k_c | I)/k_c,   (3.4)

where

p(k_c | I) = ξ (1 − ξ)^{k_c − 1}.   (3.5)

We are now equipped to express p(x_i | I) explicitly:

p(x_i | I) = \int p(x_i | θ_c) p(θ_c | I) dθ_c
= \sum_{k_c=1}^{∞} p(k_c | I) \sum_{j_c=1}^{k_c} \frac{Beta(x_i | j_c, k_c − j_c + 1)}{k_c}
= \sum_{k_c=1}^{∞} p(k_c | I) Uniform(x_i | 0, 1)
= 1_{[0,1]}(x_i).   (3.6)

The simplification in (3.6) does not depend on the form of p(k_c | I). It follows from a property of Bernstein polynomials, for which \sum_{ℓ=0}^{d} B_{ℓ,d}(x) = 1, where B_{ℓ,d}(x) = \binom{d}{ℓ} x^ℓ (1 − x)^{d−ℓ} denotes the ℓth Bernstein basis function of degree d. Equation (3.6) follows since Beta(x | j, k − j + 1)/k ≡ B_{j−1,k−1}(x).
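The identity behind (3.6) is easy to verify numerically. The sketch below (illustrative only, not from the paper; SciPy assumed) averages Beta(x | j, k − j + 1) over j = 1, . . . , k and confirms the result is identically one for several values of k.

```python
# Illustrative sketch: numerical check of (3.6), i.e. that averaging
# Beta(x | j, k - j + 1) over j = 1,...,k with weight 1/k gives 1 for any k.
import numpy as np
from scipy import stats

for k in (1, 3, 10, 50):
    x = np.linspace(0.01, 0.99, 7)
    mix = sum(stats.beta.pdf(x, j, k - j + 1) for j in range(1, k + 1)) / k
    print(k, np.round(mix, 10))        # prints 1.0 at every x for every k
```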

Here is the prior in terms of (a_c, b_c):

p(a_c, b_c | I) = p(j_c, k_c | I)|_{(j_c,k_c)=(a_c, a_c+b_c−1)} = \frac{ξ (1 − ξ)^{a_c+b_c−2}}{a_c + b_c − 1}.   (3.7)

(The Jacobian of the transformation is identically one.)

Conditional predictive distribution revisited. Given the specification of F and H, we can now express the conditional predictive distribution in the form used for estimation in this paper:

p(x_{n+1} | ψ_n, I) = \sum_{c=1}^{m_n} \left( \frac{d_{n,c}}{n + α} \right) Beta(x_{n+1} | a_c, b_c) + \left( \frac{α}{n + α} \right) 1_{[0,1]}(x_{n+1}),   (3.8)

where the prior for (a_c, b_c) ∈ N × N is given in (3.7). Thus the conditional predictive distribution for x_{n+1} is a mixture of Beta distributions with integer coefficients — and consequently so is the predictive distribution p(x_{n+1} | x_{1:n}, I).

Specializing P_α. In formulating the prior for α it is instructive to examine the joint distribution for x_1 and x_2 conditional on α. First, note that the joint distribution conditional on ψ_1 can be expressed as

p(x_1, x_2 | ψ_1, I) = f(x_1 | θ_1) p(x_2 | ψ_1, I),   (3.9)

where [refer to (2.24)]

p(x_2 | ψ_1, I) = \left( \frac{1}{1 + α} \right) f(x_2 | θ_1) + \left( \frac{α}{1 + α} \right) p(x_2 | I).   (3.10)

Figure 1. Density histogram of 10^5 draws from p(x_1, x_2 | I) with ξ = 0.005. (Axes: x_1 and x_2, each over [0, 1].)

Therefore, the joint distribution conditional on α can be obtained by integrating out θ_1:

p(x_1, x_2 | α, I) = \int p(x_1, x_2 | ψ_1, I) p(θ_1 | I) dθ_1
= \left( \frac{1}{1 + α} \right) \int f(x_1 | θ_1) f(x_2 | θ_1) h(θ_1 | I) dθ_1 + \left( \frac{α}{1 + α} \right) p(x_1 | I) p(x_2 | I).   (3.11)

We see that the joint distribution is a weighted average of the two extreme sharing environments.

Equation (3.11) illustrates sharing in its simplest form. In light of this equation, we choose the prior for α such that λ = α/(1 + α) ∼ Uniform(0, 1). In particular, P_α is characterized by the density

p(α | I) = \frac{1}{(1 + α)^2}.   (3.12)

This prior differs from the standard conjugate prior for α.

With this prior for α, the joint distribution for x_1 and x_2 is

p(x_1, x_2 | I) = \frac{1}{2} \int f(x_1 | θ_1) f(x_2 | θ_1) h(θ_1 | I) dθ_1 + \frac{1}{2} 1_{[0,1]}(x_1) 1_{[0,1]}(x_2),   (3.13)

where the densities f and h are given above. See Figure 1 for a binned density plot of 10^5 draws from p(x_1, x_2 | I) where ξ = 0.005.
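A minimal sketch of how such draws can be produced, assuming the mixture representation (3.13): with probability 1/2 a common θ_1 is drawn from H and both coordinates come from the same beta density; otherwise the coordinates are independent uniforms. (The helper name draw_pair and the number of draws are illustrative choices, not from the paper.)

```python
# Illustrative sketch: draws from the joint prior predictive (3.13) with xi = 0.005,
# mirroring the construction behind Figure 1 (shared-cluster vs. independent draws).
import numpy as np

rng = np.random.default_rng(0)
xi_param = 0.005

def draw_pair(rng):
    if rng.random() < 0.5:                       # shared theta_1 with probability 1/2
        k = rng.geometric(xi_param)              # k_c ~ Pascal(1, xi), support 1, 2, ...
        j = rng.integers(1, k + 1)               # j_c | k_c uniform on {1, ..., k}
        return rng.beta(j, k - j + 1, size=2)    # x1, x2 iid Beta(j, k - j + 1)
    return rng.random(2)                         # otherwise independent uniforms

pairs = np.array([draw_pair(rng) for _ in range(10_000)])
print(pairs.mean(axis=0), np.corrcoef(pairs.T)[0, 1])
```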

4. Two-dimensional density estimation

The extension of the model to two-dimensional observations is straightforward.^9

Let x_i = (x_{i1}, x_{i2}), θ_i = (θ_{i1}, θ_{i2}), and θ_c = (θ_{c1}, θ_{c2}), where θ_{iℓ} = (a_{iℓ}, b_{iℓ}) and θ_{cℓ} = (a_{cℓ}, b_{cℓ}) for ℓ ∈ {1, 2}. Let F be characterized by

f(x_i | θ_i) = Beta(x_{i1} | a_{i1}, b_{i1}) Beta(x_{i2} | a_{i2}, b_{i2}).   (4.1)

Note the local independence. A model with local dependence is described in Appendix C. Let H be characterized by p(θ_c | I) = p(θ_{c1} | I) p(θ_{c2} | I), where p(θ_{cℓ} | I) = p(k_{cℓ} | I)/k_{cℓ}. Consequently,

p(x_i | I) = 1_{[0,1]^2}(x_i).   (4.2)

The priors for α and z_{1:n} are unchanged. The expression for the conditional predictive distribution becomes

p(x_{n+1} | ψ_n, I) = \sum_{c=1}^{m_n} \left( \frac{d_{n,c}}{n + α} \right) Beta(x_{n+1,1} | a_{c1}, b_{c1}) Beta(x_{n+1,2} | a_{c2}, b_{c2}) + \left( \frac{α}{n + α} \right) 1_{[0,1]^2}(x_{n+1}).   (4.3)

Simple adaptations of the sampling scheme described in Section 5 allow one to make draws of ψ_n in the two-dimensional case and thereby use (2.21) in conjunction with (4.3) to compute the approximation to the posterior predictive density.

5. Drawing from the posterior distribution

We discuss how to make draws of ψ_n = (α, z_{1:n}, θ_{1:m_n}) from the posterior distribution via MCMC (Markov chain Monte Carlo). We adopt a Gibbs sampling approach, relying on full conditional posterior distributions. The sampling scheme described here follows the outline summarized by Algorithm 2 in Neal (2000).

Classifications. We begin with making draws of the classifications, z_{1:n}. Let z_{1:n}^{−i} denote the (possibly renormalized) vector of classifications after having removed case i. Renormalization will occur if — before removal — the cluster associated with observation i is a singleton (i.e., d_{n,z_i} = 1), in which case θ_{z_i} will be discarded and the remaining clusters will be relabeled.

Based on (2.16) and the exchangeability of the classifications, we have

p(z_i = c | α, z_{1:n}^{−i}) = \begin{cases} \frac{d_{n,c}^{−i}}{n − 1 + α} & c ∈ \{1, . . . , m_n^{−i}\} \\ \frac{α}{n − 1 + α} & c = m_n^{−i} + 1, \end{cases}   (5.1)

where m_n^{−i} = max[z_{1:n}^{−i}] and d_{n,c}^{−i} denotes the multiplicity of class c in z_{1:n}^{−i}. Therefore, the full conditional probability of z_i given the observations is

p(z_i | α, z_{1:n}^{−i}, θ_{1:m_n^{−i}}, x_{1:n}, I) ∝ \begin{cases} \frac{d_{n,c}^{−i}}{n − 1 + α} Beta(x_i | a_c, b_c) & c ∈ \{1, . . . , m_n^{−i}\} \\ \frac{α}{n − 1 + α} 1_{[0,1]}(x_i) & c = m_n^{−i} + 1, \end{cases}   (5.2)

^9 See Zhao et al. (2013) for a discussion of the properties of a two-dimensional Bernstein polynomial density.

where 1_{[0,1]}(x_i) = 1 is the likelihood for a new, as yet unobserved, component parameter.

If a new cluster is selected (i.e., if z_i = m_n^{−i} + 1), then the new cluster parameter is drawn from the posterior using x_i as the sole observation. The posterior distribution for the parameters for a new cluster factors as follows (letting c = m_n^{−i} + 1):

p(j_c, k_c | x_i, I) = \frac{p(x_i | j_c, k_c) p(j_c | k_c, I) p(k_c | I)}{p(x_i | I)} = p(j_c | k_c, x_i, I) p(k_c | I),   (5.3)

where

p(j_c | k_c, x_i, I) = \binom{k_c − 1}{j_c − 1} x_i^{j_c−1} (1 − x_i)^{k_c−j_c} = Binomial(j_c − 1 | k_c − 1, x_i).   (5.4)

Note that k_c is not identified and is drawn from its prior distribution, and then (j_c − 1) ∼ Binomial(k_c − 1, x_i), with the proviso that if k_c = 1 then j_c = 1.
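Here is a minimal sketch of this classification step, combining (5.2) with the new-cluster draw (5.4); the function name, the cluster counts, and the parameter values are hypothetical, and Python/SciPy is assumed.

```python
# Illustrative sketch of the classification step (5.2) and the new-cluster
# draw (5.4) for a single observation x_i, given the other assignments.
import numpy as np
from scipy import stats

def update_z_i(x_i, alpha, d_minus, theta, xi_param, rng):
    """d_minus[c]: counts with i removed; theta[c] = (a_c, b_c).  Returns (z_i, theta)."""
    n_minus = sum(d_minus)
    w = [d_c / (n_minus + alpha) * stats.beta.pdf(x_i, a, b)
         for d_c, (a, b) in zip(d_minus, theta)]
    w.append(alpha / (n_minus + alpha) * 1.0)      # 1_[0,1](x_i) = 1 for a new cluster
    w = np.array(w) / np.sum(w)
    c = rng.choice(len(w), p=w) + 1
    if c == len(theta) + 1:                        # open a new cluster: draw (j, k) via (5.4)
        k = rng.geometric(xi_param)                # k_c from its prior
        j = 1 if k == 1 else rng.binomial(k - 1, x_i) + 1
        theta = theta + [(j, k - j + 1)]           # (a_c, b_c) = (j, k - j + 1)
    return c, theta

rng = np.random.default_rng(0)
print(update_z_i(0.3, 1.0, d_minus=[4, 2], theta=[(2, 8), (7, 3)], xi_param=0.005, rng=rng))
```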

Cluster parameters. Having classified each of the observations (and having, in the process, discarded old values of θ_c and drawn new values of θ_c), we then resample each element of θ_{1:m_n} conditional on the classifications. Let

I_c = {i : z_i = c},   (5.5)

and let x_{1:n}^c = {x_i | i ∈ I_c}, where the number of observations in x_{1:n}^c equals d_{n,c}. We have already shown how to make draws if d_{n,c} = 1.

Now assume d_{n,c} ≥ 2. The posterior for (j_c, k_c) is given by

p(θ_c | x_{1:n}, z_{1:n}, α, I) = p(θ_c | x_{1:n}^c, I) ∝ h(θ_c | I) \prod_{i∈I_c} f(x_i | θ_c)
= p(j_c, k_c | I) × \frac{\left( \prod_{i∈I_c} x_i \right)^{j_c−1} \left( \prod_{i∈I_c} (1 − x_i) \right)^{k_c−j_c}}{B(j_c, k_c − j_c + 1)^{d_{n,c}}}.   (5.6)

One can adopt a Metropolis–Hastings scheme with the following proposal:

k'_c − 1 ∼ Poisson(k_c^{(r)})   (5.7)
j'_c − 1 ∼ Binomial(k'_c − 1, \bar{x}_{1:n}^{c(r)}),   (5.8)

where \bar{x}_{1:n}^c = \frac{1}{d_{n,c}} \sum_{i∈I_c} x_i is the sample mean of x_{1:n}^c.^10 Let

q(θ'_c | θ_c, x_{1:n}^c) := Poisson(k'_c − 1 | k_c) Binomial(j'_c − 1 | k'_c − 1, \bar{x}_{1:n}^c).   (5.9)

Then

θ_c^{(r+1)} = \begin{cases} θ'_c & M_c^{(r)} ≥ u^{(r+1)} \\ θ_c^{(r)} & otherwise, \end{cases}   (5.10)

where u^{(r+1)} ∼ Uniform(0, 1) and

M_c^{(r)} = \frac{p(θ'_c | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I)}{p(θ_c^{(r)} | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I)} × \frac{q(θ_c^{(r)} | θ'_c, x_{1:n}^{c(r)})}{q(θ'_c | θ_c^{(r)}, x_{1:n}^{c(r)})}.   (5.11)

^10 As an alternative, one can adopt a Metropolis scheme with a random-walk proposal for (a_c, b_c). For example, one can draw from a bivariate normal distribution and round the draw to the integers.
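The following sketch implements one such Metropolis–Hastings update for (j_c, k_c) under the proposal (5.7)–(5.8); it is illustrative only (hypothetical cluster data, Python/SciPy assumed), not the author's code.

```python
# Illustrative sketch of one Metropolis-Hastings update (5.7)-(5.11) for a
# cluster parameter theta_c = (j_c, k_c) given the observations assigned to it.
import numpy as np
from scipy import stats

def log_target(j, k, x, xi_param):
    """log of (5.6) up to a constant: prior (3.4)-(3.5) times the beta likelihood."""
    log_prior = np.log(xi_param) + (k - 1) * np.log(1 - xi_param) - np.log(k)
    log_lik = np.sum(stats.beta.logpdf(x, j, k - j + 1))
    return log_prior + log_lik

def log_q(j_to, k_to, k_from, xbar):
    """log q(theta_to | theta_from) from (5.9)."""
    return (stats.poisson.logpmf(k_to - 1, k_from)
            + stats.binom.logpmf(j_to - 1, k_to - 1, xbar))

def mh_step(j, k, x, xi_param, rng):
    xbar = x.mean()
    k_new = rng.poisson(k) + 1
    j_new = 1 if k_new == 1 else rng.binomial(k_new - 1, xbar) + 1
    log_M = (log_target(j_new, k_new, x, xi_param) - log_target(j, k, x, xi_param)
             + log_q(j, k, k_new, xbar) - log_q(j_new, k_new, k, xbar))
    return (j_new, k_new) if np.log(rng.random()) < log_M else (j, k)

rng = np.random.default_rng(0)
x_c = np.array([0.21, 0.25, 0.33, 0.28])       # observations in cluster c (hypothetical)
print(mh_step(j=3, k=10, x=x_c, xi_param=0.005, rng=rng))
```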

Concentration parameter. The remaining component of ψ_n is α. The likelihood for α is given by (2.17):

p(z_{1:n} | α) ∝ \frac{α^{m_n}}{(α)_n} = \frac{α^{m_n} Γ(α)}{Γ(n + α)}.   (5.12)

Draws of α can be made using a Metropolis–Hastings scheme.^11 Let the proposal be

α' ∼ Singh–Maddala(α^{(r)}, h),   (5.13)

where^12

Singh–Maddala(x | m, h) = \frac{h x^{h−1} m^h}{(x^h + m^h)^2}.   (5.14)

Note that P_α = Singh–Maddala(1, 1). The inverse-CDF method can be used to make draws from Singh–Maddala(m, h): Draw u ∼ Uniform(0, 1) and set x = m (u/(1 − u))^{1/h}. For determining whether or not to accept the proposal, we require the “Hastings ratio,”

\frac{Singh–Maddala(α^{(r)} | α', h)}{Singh–Maddala(α' | α^{(r)}, h)} = \frac{α'}{α^{(r)}}.   (5.15)

Then

α^{(r+1)} = \begin{cases} α' & M^{(r)} ≥ u^{(r+1)} \\ α^{(r)} & otherwise, \end{cases}   (5.16)

where

M^{(r)} = \frac{p(z_{1:n}^{(r)} | α')}{p(z_{1:n}^{(r)} | α^{(r)})} × \frac{α'}{α^{(r)}} = \frac{(α^{(r)})_n}{(α')_n} \left( \frac{α'}{α^{(r)}} \right)^{m_n^{(r)}+1}.   (5.17)
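As a concrete illustration, here is a sketch of the alternative update described in footnote 11: a random-walk Metropolis step on λ = α/(1 + α), whose prior is Uniform(0, 1), so only the likelihood ratio built from (5.12) enters the acceptance decision; proposals outside (0, 1) are rejected. The helper names and the values of n and m_n are hypothetical.

```python
# Illustrative sketch of the random-walk Metropolis update on lambda = alpha/(1+alpha)
# from footnote 11, using the likelihood (5.12).
import numpy as np
from scipy.special import gammaln

def log_lik_alpha(alpha, n, m_n):
    """log of (5.12): m_n * log(alpha) + log Gamma(alpha) - log Gamma(n + alpha)."""
    return m_n * np.log(alpha) + gammaln(alpha) - gammaln(n + alpha)

def update_lambda(lam, n, m_n, s, rng):
    lam_new = lam + s * rng.standard_normal()
    if not 0.0 < lam_new < 1.0:                  # outside the prior support: reject
        return lam
    a_old, a_new = lam / (1 - lam), lam_new / (1 - lam_new)
    log_ratio = log_lik_alpha(a_new, n, m_n) - log_lik_alpha(a_old, n, m_n)
    return lam_new if np.log(rng.random()) < log_ratio else lam

rng = np.random.default_rng(0)
lam = 0.5                                        # corresponds to alpha = 1
for _ in range(1000):
    lam = update_lambda(lam, n=82, m_n=6, s=0.1, rng=rng)
print(lam, lam / (1 - lam))
```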

6. Investigation: Part I

In this section we apply the quasi-Bernstein density model to a number of applications and investigate its performance: the galaxy data, the Buffalo snowfall data, and the Old Faithful data (in two dimensions).

Unless otherwise noted, ξ = 1/200. With this setting, the prior mean for k_c equals 200 and the prior standard deviation equals \sqrt{200 × 199} ≈ 199.5. The 90% highest prior density (i.e., probability mass) region runs from k_c = 1 to k_c = 460.

As noted above, the prior for α is given by p(α | I) = 1/(1 + α)^2, so that the prior median for α equals one.
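These prior summaries are easy to reproduce; a small sketch (illustrative only, assuming the Pascal/geometric prior (3.5)) follows.

```python
# Illustrative sketch: checking the prior summaries for k_c when xi = 1/200
# (mean 200, sd ~ 199.5, and a 90% highest-mass region reaching about k = 460).
import numpy as np

xi_param = 1.0 / 200.0
mean = 1.0 / xi_param
sd = np.sqrt(1.0 - xi_param) / xi_param                        # = sqrt(200 * 199) ~ 199.5
k90 = int(np.ceil(np.log(0.10) / np.log(1.0 - xi_param)))      # smallest k with P(k_c <= k) >= 0.9
print(mean, sd, k90)                                           # 200.0  199.49...  460
```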

Galaxy data. Figure 5 shows the quasi-Bernstein predictive density for the galaxy data with support over the interval [5, 40]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. In Figures 6–8 are shown the posterior distributions of the number of clusters, λ = α/(1 + α), and k_c.

^11 Alternatively, draws of α can be made using a Metropolis scheme. Make a random-walk proposal of λ' ∼ N(λ^{(r)}, s^2), where λ^{(r)} = α^{(r)}/(1 + α^{(r)}) and s^2 is a suitable scale. Then evaluate the likelihood ratio for α' = λ'/(1 − λ') relative to α^{(r)} to determine whether or not to accept the proposal α'.
^12 The Singh–Maddala distribution is also known as the Burr Type XII distribution. The version of the Singh–Maddala distribution expressed in (5.14) is specialized from the more general version that involves a third parameter. In Mathematica, the distribution is represented by SinghMaddalaDistribution[1, h, m].

Buffalo snowfall data. Figure 9 shows the quasi-Bernstein predictive density for the Buffalo snowfall data with support over the interval [0, 150]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. In Figures 10–12 are shown the posterior distributions of the number of clusters, λ = α/(1 + α), and k_c.

The density in Figure 9 is substantially smoother than what is produced by many alternative models, which typically display three modes. In the current model, fixing α = 5 will produce three modes, but this value for α is deemed unlikely according to the model when we learn about α. The posterior median for α is about 0.31. The posterior probability of α ≥ 5 is about 20 times lower than the prior probability. (Increasing α also has the effect of increasing the probability of a new cluster, which in turn has the effect of increasing the predictive density at the boundaries of the region. For example, the predictive density increases by roughly a factor of 10 at x_{n+1} = 150.)

With this data set there is a strong (posterior) relation between α and k_c. The posterior median of k_c equals about 10 given α < 1, but it equals about 140 given α ≥ 1.

Old Faithful data. Here we examine the Old Faithful data, which comprises 272 observations of pairs composed of eruption time (the duration of the current eruption in minutes) and waiting time (the amount of time until the subsequent eruption in minutes). Figure 13 shows a scatter plot of the data, a contour plot of the joint predictive distribution, and two line plots of conditional expectations computed from the joint distribution. The distribution was given positive support over the region [1, 5.75] × [35, 105]. The distribution is distinctly bimodal. Figure 14 shows the marginal predictive distributions computed from the joint distribution (along with rug plots of the data). The following four figures provide diagnostics.

7. Indirect density estimation for latent variables

Up to this point we have assumed that x_{1:n} was observed and the goal of inference was the posterior predictive distribution p(x_{n+1} | x_{1:n}, I). In this section, we now suppose x_{1:n} is latent and instead Y_{1:n} = (Y_1, . . . , Y_n) is observed. Note that Y_i may be a vector of observations. To accommodate this situation, we augment (2.13) with

Y_i | x_i ∼ P_{Y_i}(x_i).   (7.1)

The form of the density p(Y_i | x_i) will depend on the specific application. Nuisance parameters may have been integrated out to obtain p(Y_i | x_i). One may interpret p(Y_i | x_i) as a noisy signal for x_i.

When x_{1:n} was observed, there was an obvious asymmetry between x_i ∈ x_{1:n} and x_{n+1}. This asymmetry carries over to the situation where x_{1:n} is latent. In particular, we observe signals for x_i ∈ x_{1:n} but we do not observe a signal for x_{n+1}. Based on this distinction, we refer to x_i as a specific case because we have a (specific) signal for it, and we refer to x_{n+1} as the generic case because it applies to any coefficient for which we have no signal as yet (and for which we judge x_{1:n} to provide an appropriate basis for inference).

The specific cases are the ones that appear in the likelihood:

p(Y_{1:n} | x_{1:n+1}) = p(Y_{1:n} | x_{1:n}) = \prod_{i=1}^{n} p(Y_i | x_i).   (7.2)

The goal of inference is the indirect posterior predictive distribution, p(x_{n+1} | Y_{1:n}, I). From a conceptual standpoint, the indirect predictive distribution is the average of the “direct” distribution, where the average is computed with respect to the posterior uncertainty of x_{1:n}:

p(x_{n+1} | Y_{1:n}, I) = \int p(x_{n+1} | x_{1:n}, I) p(x_{1:n} | Y_{1:n}, I) dx_{1:n}.   (7.3)

Once again we can express the distribution of interest in terms of ψ_n:

p(x_{n+1} | Y_{1:n}, I) = \int p(x_{n+1} | ψ_n) p(ψ_n | Y_{1:n}, I) dψ_n ≈ \frac{1}{R} \sum_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}),   (7.4)

where {ψ_n^{(r)}}_{r=1}^{R} now represents draws from p(ψ_n | Y_{1:n}, I).^13

The sampler works as before with the additional step of drawing x_{1:n} | Y_{1:n}, ψ_n for each sweep of the sampler. Since the joint likelihood factors [see (7.2)], the full conditional posterior for x_i reduces to the posterior for x_i in isolation (conditional on θ_i):

p(x_i | Y_{1:n}, x_{1:n}^{−i}, ψ_n, I) = p(x_i | Y_i, θ_i)|_{θ_i=θ_{z_i}},   (7.5)

where

p(x_i | Y_i, θ_i) = \frac{p(Y_i | x_i) f(x_i | θ_i)}{\int p(Y_i | x_i) f(x_i | θ_i) dx_i}.   (7.6)

The posterior distributions of the specific cases can be approximated with histograms of the draws {x_i^{(r)}}_{r=1}^{R} from the posterior. However, we can adopt a Rao–Blackwellization approach (as we have done with the generic case) and obtain a lower variance approximation. In particular,

p(x_i | Y_{1:n}, I) = \int p(x_i | Y_i, θ_i) p(θ_i | Y_{1:n}, I) dθ_i ≈ \frac{1}{R} \sum_{r=1}^{R} p(x_i | Y_i, θ_i^{(r)}).   (7.7)

Note θ_i^{(r)} = θ_c^{(r)} where c = z_i^{(r)}.

^13 In passing, note the likelihood of the observations Y_{1:n} can be computed via a succession of generic distributions:

p(Y_{1:n} | I) = p(Y_1 | I) \prod_{i=2}^{n} p(Y_i | Y_{1:i−1}, I),

where

p(Y_i | Y_{1:i−1}, I) = \int p(Y_i | x_i) p(x_i | Y_{1:i−1}, I) dx_i.

Binomial data. An important case is when the data are binomial in nature:^14

p(Y_i | x_i) = Binomial(s_i | T_i, x_i),   (7.8)

where T_i is the number of trials, s_i is the number of successes, and x_i is the latent probability of success. In this case, the conditional posterior for a specific case is

p(x_i | Y_i, θ_i) = Beta(x_i | a_i + s_i, b_i + T_i − s_i),   (7.9)

which can be used in (7.7). In this setting both specific and generic posterior distributions are mixtures of beta distributions.
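A minimal sketch of the Rao–Blackwellized estimate (7.7) in the binomial case, using (7.9); the hard-coded draws of (a_i, b_i) are hypothetical placeholders for output from the sampler, and Python/SciPy is assumed.

```python
# Illustrative sketch of (7.7) and (7.9): the Rao-Blackwellized estimate of a
# specific-case density is an average of Beta(a_i + s_i, b_i + T_i - s_i) terms,
# one per retained draw of theta_i.
import numpy as np
from scipy import stats

def specific_density(x, s_i, T_i, theta_draws):
    """Average of (7.9) over posterior draws theta_draws = [(a_i, b_i), ...]."""
    dens = np.zeros_like(x, dtype=float)
    for a_i, b_i in theta_draws:
        dens += stats.beta.pdf(x, a_i + s_i, b_i + T_i - s_i)
    return dens / len(theta_draws)

x_grid = np.linspace(0.01, 0.99, 5)
theta_draws = [(2, 9), (3, 12), (2, 10)]       # hypothetical draws of (a_i, b_i)
print(specific_density(x_grid, s_i=4, T_i=20, theta_draws=theta_draws))
```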

8. Alternative prior for indirect density estimation with binomial data

We present a special case where the kernel F is a point mass located at θ_i ∈ [0, 1] and the base distribution H is uniform over the unit interval:

F(θ_i) = δ_{θ_i}   and   H = Uniform(0, 1).   (8.1)

The densities can be expressed as

f(x_i | θ_i) = δ_{θ_i}(x_i)   and   h(θ_c | I') = 1_{[0,1]}(θ_c),   (8.2)

where I' denotes this alternative model. Given (8.2), the conditional predictive distribution [see (2.24)] becomes a mixture of point masses and the uniform density:

p(x_{n+1} | ψ_n, I') = \sum_{c=1}^{m_n} \frac{d_{n,c}}{n + α} δ_{θ_c}(x_{n+1}) + \frac{α}{n + α} 1_{[0,1]}(x_{n+1}).   (8.3)

The knowledgeable reader will recognize (8.3) as the conditional predictive distribution that would be derived from a DP model (as opposed to a DPM model). This equivalence is a consequence of the point-mass kernel.

The point-mass kernel identifies x_i with θ_i. Since x_i plays no role given θ_i, it is convenient to integrate x_i out, obtaining the likelihood of θ_i:

p(Y_i | θ_i) := \int p(Y_i | x_i) f(x_i | θ_i) dx_i = p(Y_i | x_i)|_{x_i=θ_i}.   (8.4)

The sampling scheme proceeds as follows. The classifications are updated according to

p(z_i | z_{1:n}^{−i}, θ_{1:m_n^{−i}}, α, Y_{1:n}, I') ∝ \begin{cases} \frac{d_{n,c}^{−i}}{n − 1 + α} p(Y_i | θ_c) & c ∈ \{1, . . . , m_n^{−i}\} \\ \frac{α}{n − 1 + α} p(Y_i | I') & c = m_n^{−i} + 1, \end{cases}   (8.5)

where

p(Y_i | θ_c) = p(Y_i | θ_i)|_{θ_i=θ_c}   (8.6)
p(Y_i | I') = \int p(Y_i | θ_c) h(θ_c | I') dθ_c.   (8.7)

^14 With the binomial likelihood it is possible to analytically integrate out x_{1:n}, as shown in Appendix D. However, we do not follow that procedure in our estimation.

The case we are interested in involves the binomial likelihood.^15 Therefore,

p(Y_i | θ_c) = Binomial(s_i | T_i, θ_c)   (8.8)
p(Y_i | I') = \frac{1}{T_i + 1}.   (8.9)

Updating the cluster parameters conditional on the classifications is straightforward:

θ_c | Y_{1:n}^c, I' ∼ Beta(a_c, b_c),   (8.10)

where

a_c := 1 + \sum_{ℓ∈I_c} s_ℓ   and   b_c := 1 + \sum_{ℓ∈I_c} (T_ℓ − s_ℓ).   (8.11)

There is no change in the way the updates are computed for α.

When we apply our approximations to the generic and specific distributions, we obtain

p(x_{n+1} | Y_{1:n}, I') ≈ \frac{1}{R} \sum_{r=1}^{R} \left[ \sum_{c=1}^{m_n^{(r)}} \frac{d_{n,c}^{(r)}}{n + α^{(r)}} δ_{θ_c^{(r)}}(x_{n+1}) + \frac{α^{(r)}}{n + α^{(r)}} 1_{[0,1]}(x_{n+1}) \right]   (8.12)

p(x_i | Y_{1:n}, I') ≈ \frac{1}{R} \sum_{r=1}^{R} δ_{θ_i^{(r)}}(x_i).   (8.13)

The generic distribution is a weighted average of the specific distributions and the prior distribution. Also note these distributions are composed of mixtures of point masses (along with a continuous distribution for the generic case). It is possible to compute smoother approximations. We turn to that now.

Smoothing. In the binomial-data case with the alternative model it is possible to adopt Algorithm 3 in Neal (2000), which reduces estimation solely to classification (conditional on the concentration parameter).

The algorithm involves integrating out θ_c:

p(Y_i | (Y_{1:n}^{−i})^c, I') = \int p(Y_i | θ_c) p(θ_c | (Y_{1:n}^{−i})^c, I') dθ_c,   (8.14)

where (Y_{1:n}^{−i})^c denotes all the observations excluding the ith that are classified as c. In particular,

p(Y_i | (Y_{1:n}^{−i})^c, I') = Beta–Binomial(s_i | a_c^{−i}, b_c^{−i}, T_i),   (8.15)

where

a_c^{−i} := a_c − s_i   and   b_c^{−i} := b_c − (T_i − s_i).   (8.16)

The classifications are assigned according to

p(z_i | α, z_{1:n}^{−i}, Y_{1:n}, I') ∝ \begin{cases} \frac{d_{n,c}^{−i}}{n − 1 + α} p(Y_i | (Y_{1:n}^{−i})^c, I') & c ∈ \{1, . . . , m_n^{−i}\} \\ \frac{α}{n − 1 + α} p(Y_i | I') & c = m_n^{−i} + 1 \end{cases}
= \begin{cases} \frac{d_{n,c}^{−i}}{n − 1 + α} Beta–Binomial(s_i | a_c^{−i}, b_c^{−i}, T_i) & c ∈ \{1, . . . , m_n^{−i}\} \\ \frac{α}{n − 1 + α} \frac{1}{T_i + 1} & c = m_n^{−i} + 1. \end{cases}   (8.17)
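Here is an illustrative sketch of this collapsed classification step (8.17); the counts a_c^{−i}, b_c^{−i}, and d_c^{−i} are hypothetical, and SciPy's beta-binomial distribution is assumed to be available.

```python
# Illustrative sketch of the collapsed classification update (8.17), in which the
# cluster parameters are integrated out and only beta-binomial terms remain.
import numpy as np
from scipy import stats

def classify_i(s_i, T_i, alpha, d_minus, a_minus, b_minus, rng):
    """d_minus[c], a_minus[c], b_minus[c] follow (8.16), with observation i removed."""
    n_minus = sum(d_minus)
    w = [d_c / (n_minus + alpha) * stats.betabinom.pmf(s_i, T_i, a_c, b_c)
         for d_c, a_c, b_c in zip(d_minus, a_minus, b_minus)]
    w.append(alpha / (n_minus + alpha) / (T_i + 1))   # new cluster: p(Y_i | I') = 1/(T_i + 1)
    w = np.array(w) / np.sum(w)
    return rng.choice(len(w), p=w) + 1

rng = np.random.default_rng(0)
print(classify_i(s_i=4, T_i=20, alpha=1.0,
                 d_minus=[3, 2], a_minus=[3, 16], b_minus=[56, 31], rng=rng))
```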

^15 For a textbook treatment of this case see Greenberg (2013) and Geweke et al. (2011).

Draws of the concentration parameter can be made as usual.

Posterior distributions. The posterior distribution for the specific case conditional on the classifications and the data is given by

p(x_i | z_{1:n}, Y_{1:n}, I') = \frac{p(Y_i | x_i) p(x_i | (Y_{1:n}^{−i})^{z_i}, I')}{p(Y_i | (Y_{1:n}^{−i})^{z_i}, I')} = Beta(x_i | a_{z_i}, b_{z_i}),   (8.18)

since

p(Y_i | x_i) = Binomial(s_i | T_i, x_i)   (8.19)
p(x_i | (Y_{1:n}^{−i})^{z_i}, I') = Beta(x_i | a_{z_i}^{−i}, b_{z_i}^{−i})   (8.20)
p(Y_i | (Y_{1:n}^{−i})^{z_i}, I') = Beta–Binomial(s_i | a_{z_i}^{−i}, b_{z_i}^{−i}, T_i),   (8.21)

where a_{z_i} and b_{z_i} are given in (8.11) and a_{z_i}^{−i} and b_{z_i}^{−i} are given in (8.16) [where in both cases c = z_i]. Therefore,

p(x_i | Y_{1:n}, I') ≈ \frac{1}{R} \sum_{r=1}^{R} p(x_i | z_{1:n}^{(r)}, Y_{1:n}, I') = \frac{1}{R} \sum_{r=1}^{R} Beta(x_i | a_{z_i}^{(r)}, b_{z_i}^{(r)}).   (8.22)

The posterior distribution for the generic case given the concentration parameter, the classifications, and the data is

p(x_{n+1} | Y_{1:n}, α, z_{1:n}, I') = \int p(x_{n+1} | ψ_n, I') p(θ_{1:m_n} | Y_{1:n}, I') dθ_{1:m_n}
= \sum_{c=1}^{m_n} \frac{d_{n,c}}{n + α} Beta(x_{n+1} | a_c, b_c) + \frac{α}{n + α} 1_{[0,1]}(x_{n+1}),   (8.23)

where we have used

\int f(x_{n+1} | θ_c) p(θ_{1:m_n} | Y_{1:n}, I') dθ_{1:m_n} = p(θ_c | Y_{1:n}^c, I')|_{θ_c=x_{n+1}} = Beta(x_{n+1} | a_c, b_c).   (8.24)

Therefore the posterior distribution for the generic case can be approximated by

p(x_{n+1} | Y_{1:n}, I') ≈ \frac{1}{R} \sum_{r=1}^{R} p(x_{n+1} | Y_{1:n}, α^{(r)}, z_{1:n}^{(r)}, I').   (8.25)

Model comparison. The alternative model may be compared with the main model in terms of their likelihoods. For A ∈ {I, I'},

p(Y_{1:n} | A) = p(Y_1 | A) \prod_{i=2}^{n} p(Y_i | Y_{1:i−1}, A),   (8.26)

where

p(Y_i | Y_{1:i−1}, A) = \int p(Y_i | x_i) p(x_i | Y_{1:i−1}, A) dx_i.   (8.27)

Note p(Y_1 | I) = p(Y_1 | I') = \int_0^1 p(Y_1 | x_1) dx_1.

Table 1. Rat tumor data: 71 studies (rats with tumors/total number of rats).

00/17 00/18 00/18 00/19 00/19 00/19 00/19 00/20 00/20
00/20 00/20 00/20 00/20 00/20 01/20 01/20 01/20 01/20
01/19 01/19 01/18 01/18 02/25 02/24 02/23 01/10 02/20
02/20 02/20 02/20 02/20 02/20 05/49 02/19 05/46 03/27
02/17 07/49 07/47 03/20 03/20 02/13 09/48 04/20 04/20
04/20 04/20 04/20 04/20 04/20 10/50 10/48 04/19 04/19
04/19 05/22 11/46 12/49 05/20 05/20 06/23 05/19 06/22
04/14 06/20 06/20 06/20 16/52 15/47 15/46 09/24

Table 2. The thumbtack data set: 320 instances of binomial experiments with 9 trials each. The results are summarized in terms of the number of experiments that have a given number of successes.

No. of successes      0   1   2   3   4   5   6   7   8   9   Total
No. of experiments    0   3  13  18  48  47  67  54  51  19   320
Frequency (percent)   0   1   4   6  15  15  21  17  16   6   ≈ 100

9. Investigation: Part II

In this section we apply the indirect density estimation approach to a number of data sets and investigate its performance: the rat tumor data, the baseball batting data, and the thumbtack data.

Rat tumor data. The rat tumor data are composed of the results from 71 studies. The number of rats per study varied from ten to 52. The rat tumor data are described in Table 5.1 in Gelman et al. (2014) and repeated for convenience in Table 1 (although the data are displayed in a different order). The data are plotted in Figure 3. This plot brings out certain features of the data that are not evident in the table. There are 59 studies for which the total number of rats is less than or equal to 35, and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.

The posterior distribution for the generic case is shown in Figure 19. The posterior distributions for the specific cases are shown in Figure 20. This latter figure can be compared with Figure 5.4 in Gelman et al. (2014) to show the differences in the results obtained by the more general approach presented here.

Baseball batting skill. This example is inspired by the example in Efron (2010), which in turn draws on Efron and Morris (1975). We are interested in the ability of baseball players to generate hits. We do not observe this ability directly; rather, we observe the outcomes (successes and failures) of a number of trials for a number of players. In this example T_i is the number of “at-bats” and s_i is the number of “hits” for player i. See Figure 2 for the data. [The analysis is not complete.]

Thumbtack data. The thumbtack data are shown in Table 2. The posterior distribution for the generic success rate is displayed in Figure 21.

Figure 2. Baseball data: 18 players with 45 at-bats each. (Horizontal axis: number of hits, ranging from 7 to 18.)

Figure 3. Rat tumor data: 71 studies. (Horizontal axis: number of rats in study, 10 to 50; vertical axis: observed rate of tumors, 0.0 to 0.4.) Number of studies (1 to 7) proportional to area of dot. Number of rats with tumors (0 to 16) indicated by contour lines. There are 59 studies for which the total number of rats is less than or equal to 35, and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.

The posterior distribution for the generic success rate given the alternative model is shown in Figure 26.

Appendix A. Uncertainty about the predictive distribution

The predictive distribution quantifies the uncertainty about x_{n+1} given x_{1:n}. It is this uncertainty that is of interest in this paper. There is, however, a different notion of uncertainty that could be of interest. Consider a decision that must be made based on the predictive distribution. Suppose the decision can be made either with the currently available information or delayed (with some additional cost) until additional information is acquired. Uncertainty about the extent to which that additional information can change the predictive distribution is relevant to the overall decision process.^16

For example, suppose the decision can be made using the “current” predictive distribution p(x_{n+1} | x_{1:n}, I) or delayed and made with the “next” predictive distribution

p(x_{n+2} | x_{n+1}, x_{1:n}, I).   (A.1)

Variation in the “next” distribution is driven by variation in x_{n+1} from the “current” distribution, and the expectation of the “next” distribution equals the “current” distribution:

p(x_{n+2} | x_{1:n}, I) = \int p(x_{n+2} | x_{n+1}, x_{1:n}, I) p(x_{n+1} | x_{1:n}, I) dx_{n+1}
= \int p(x_{n+1}, x_{n+2} | x_{1:n}, I) dx_{n+1}
= p(x_{n+1} | x_{1:n}, I)|_{x_{n+1}=x_{n+2}},   (A.2)

where the last equality follows from the exchangeability of x_{n+1} and x_{n+2}. If x_{n+1} and x_{n+2} are independent conditional on x_{1:n}, then the “next” distribution equals its expectation and there is no variation.

For n = 0, the “next” generic distribution (conditional on α) is

p(x_2 | x_1, α, I) = \frac{p(x_1, x_2 | α, I)}{p(x_1 | α, I)} = \frac{p(x_1, x_2 | α, I)}{p(x_1 | I)}
= \left( \frac{1}{1 + α} \right) \frac{\int f(x_1 | θ_1) f(x_2 | θ_1) h(θ_1 | I) dθ_1}{p(x_1 | I)} + \left( \frac{α}{1 + α} \right) p(x_2 | I).   (A.3)

If α = ∞ then there is no variation.

Appendix B. Partitions, configurations, and adding up

Let P(n) denote the set of partitions of the set {1, . . . , n}. The elements of P(n) can be identified with (normalized) values for z_{1:n}. (We normalize z_{1:n} as follows: z_1 = 1 and z_{i+1} ≤ max[z_{1:i}] + 1.) For example, let n = 4. The partition {{1, 2, 4}, {3}} of the set {1, 2, 3, 4} can be identified with the classifications z_{1:4} = (1, 1, 2, 1). Note that the number of elements in the partition is m_n. In this example, m_4 = 2 and the set of multiplicities is {d_{4,1}, d_{4,2}} = {3, 1}.

Each element of P(n) is assigned a probability via p(z_{1:n} | α) such that

\sum_{ζ∈P(n)} p(z_{1:n}(ζ) | α) = 1,   (B.1)

^16 This is not the same as the posterior uncertainty regarding G.

where z_{1:n}(ζ) denotes the classifications that correspond to the partition ζ. The number of partitions is given by the Bell number, B_n, which grows very rapidly. For example, B_10 = 115,975 and B_20 = 51,724,158,235,372.

Here is a way to evaluate (B.1) without completely enumerating P(n). We can partition P(n) into what Casella et al. (2014) call configuration classes. Configuration classes themselves can be identified with the partitions of an integer. (Even though there are far fewer partitions of the integer n than of the set {1, . . . , n}, there remains a combinatoric explosion that makes explicit evaluation infeasible for any but small values of n. For example, the number of partitions of 100 is 190,569,292.)

Classifications with the same set of multiplicities belong to the same configuration class. Consequently, the CRP assigns the same probability to each element of a configuration class [see (2.17)]. Therefore to obtain the probability of the entire configuration class we can evaluate the probability of a representative element of each configuration class (as given by a partition of the associated integer) and multiply by the number of elements of each configuration class. Casella et al. (2014) show that the number of elements in a configuration class is given by

\binom{n}{d_{n,1}, . . . , d_{n,m_n}} \frac{1}{\prod_{i=1}^{n} \left( \sum_{c=1}^{m_n} 1_{\{i\}}(d_{n,c}) \right)!},   (B.2)

where \binom{n}{d_{n,1}, . . . , d_{n,m_n}} is the multinomial coefficient and the second factor corrects for overcounting.

Example. Let n = 4. There are 15 partitions of the set {1, 2, 3, 4}, but there are only five partitions of the integer 4: (4), (3, 1), (2, 2), (2, 1, 1), and (1, 1, 1, 1), where each of the partitions sums to 4. Each partition has an associated value of m_4. The values are 1, 2, 2, 3, and 4, respectively. The second and third partitions of the integer 4 each involve two terms (i.e., m_4 = 2). Interpreting these partitions as configuration classes, the second and third configuration classes belong to the same partition class as defined by Casella et al. (2014).^17

Figure 4 provides a visualization for the n = 4 example. [See also Figure 2 in Casella et al. (2014).] The number of elements in each of the five configuration classes is (1, 4, 3, 6, 1). Additionally assuming α = 1, the probabilities of a representative element from each configuration class are (1/4, 1/12, 1/24, 1/24, 1/24).
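These counts and probabilities can be checked by brute force; the sketch below (illustrative only, not from the paper) enumerates the normalized classification vectors for n = 4 and tallies the configuration classes.

```python
# Illustrative sketch: brute-force check of the n = 4 example, confirming the
# configuration-class sizes (1, 4, 3, 6, 1) and that the CRP probabilities (2.17)
# with alpha = 1 sum to one over all 15 set partitions.
from math import factorial
from collections import Counter

def normalized_partitions(n):
    """All classification vectors z_{1:n} with z_1 = 1 and z_{i+1} <= max(z_{1:i}) + 1."""
    parts = [(1,)]
    for _ in range(n - 1):
        parts = [z + (c,) for z in parts for c in range(1, max(z) + 2)]
    return parts

def p_crp(z, alpha):
    m = max(z)
    num = alpha ** m
    for c in range(1, m + 1):
        num *= factorial(z.count(c) - 1)
    den = 1.0
    for i in range(len(z)):
        den *= i + alpha
    return num / den

parts = normalized_partitions(4)
configs = Counter(tuple(sorted((z.count(c) for c in range(1, max(z) + 1)), reverse=True))
                  for z in parts)
print(len(parts), dict(configs))               # 15 partitions; class sizes 1, 4, 3, 6, 1
print(sum(p_crp(z, 1.0) for z in parts))       # 1.0
```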

Appendix C. Local dependence via copula

In this section we generalize the two-dimensional density model (presented in Section 4) to include local dependence, which we introduce via a copula. The Farlie–Gumbel–Morgenstern (FGM) copula is easy to work with because a flat prior for its copula parameter produces a flat prior predictive density over the unit square [0, 1]^2.

The FGM copula is given by

C(u_1, u_2 | τ) := u_1 u_2 (1 + τ (1 − u_1) (1 − u_2))   where −1 ≤ τ ≤ 1.   (C.1)

17Casella et al. (2014) provide an alternative prior over the partitions of {1, . . . , n} in which all elementsof a given partition class have equal probability.

Page 22: Introduction/media/Documents/research/... · 2016-01-14 · QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION 3 range of possible distributions, including the possibility of multi-modality.

22 MARK FISHER

Figure 4. The 15 partitions of {1, 2, 3, 4}. Each of the four rows representsa partition class: m4 = 1, 2, 3, 4. Each of the five configuration classes is indi-cated by color. (The second partition class is comprised of two configurationclasses.) Source: Wikipedia.

This is the CDF for (u_1, u_2) defined over the unit square. The joint PDF is

c(u_1, u_2 | τ) := 1 + τ (2u_1 − 1) (2u_2 − 1).   (C.2)

The marginal densities are flat: p(u_i | τ) = 1_{[0,1]}(u_i) for i = 1, 2. The correlation between u_1 and u_2 is τ/3. Thus the potential dependence is somewhat limited. Setting τ = 0 delivers independence.
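The flat marginals and the τ/3 correlation are easy to confirm numerically; the following sketch (illustrative only, using a midpoint grid rather than the paper's machinery) does so for a single value of τ.

```python
# Illustrative sketch: grid-based check that the FGM density (C.2) has flat
# marginals and correlation tau / 3 between u1 and u2.
import numpy as np

tau = 0.8
u = (np.arange(400) + 0.5) / 400                 # midpoint grid on (0, 1)
U1, U2 = np.meshgrid(u, u, indexing="ij")
c = 1.0 + tau * (2 * U1 - 1) * (2 * U2 - 1)      # FGM copula density (C.2)

marginal_u1 = c.mean(axis=1)                     # integrates out u2; equals 1 everywhere
cov = (U1 * U2 * c).mean() - 0.25                # E[u1 u2] - E[u1] E[u2]
corr = cov / (1.0 / 12.0)                        # variance of Uniform(0, 1) is 1/12
print(marginal_u1.min(), marginal_u1.max(), corr, tau / 3)
```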

The CDF for the beta distribution is the regularized incomplete beta function:

I_x(a, b) = \int_0^x Beta(t | a, b) dt.   (C.3)

Let the joint CDF for a single observation x_i = (x_{i1}, x_{i2}) be given by

F(x_i | θ_i) := C(w_{i1}, w_{i2} | τ_i),   (C.4)

where θ_i = (j_{i1}, j_{i2}, k_{i1}, k_{i2}, τ_i) and

w_{iℓ} := I_{x_{iℓ}}(j_{iℓ}, k_{iℓ} − j_{iℓ} + 1).   (C.5)

Then the joint density is given by

f(x_i | θ_i) = c(w_{i1}, w_{i2} | τ_i) \prod_{ℓ=1}^{2} Beta(x_{iℓ} | j_{iℓ}, k_{iℓ} − j_{iℓ} + 1).   (C.6)

Let the prior for θ_c = (j_{c1}, j_{c2}, k_{c1}, k_{c2}, τ_c) be given by

p(θ_c | I) = \frac{p(k_{c1} | I) p(k_{c2} | I) p(τ_c | I)}{k_{c1} k_{c2}},   (C.7)

where

p(τ_c | I) = \frac{1}{2} 1_{[−1,1]}(τ_c).   (C.8)

Let j_c = (j_{c1}, j_{c2}), k_c = (k_{c1}, k_{c2}), and

f(x_i | θ_c) = c(w_{ic1}, w_{ic2} | τ_c) \prod_{ℓ=1}^{2} Beta(x_{iℓ} | j_{cℓ}, k_{cℓ} − j_{cℓ} + 1),   (C.9)

where

w_{icℓ} := I_{x_{iℓ}}(j_{cℓ}, k_{cℓ} − j_{cℓ} + 1).   (C.10)

The expectation with respect to τ_c produces (conditional) independence between x_{i1} and x_{i2}:

p(x_i | j_c, k_c, I) = \int_{−1}^{1} f(x_i | θ_c) p(τ_c | I) dτ_c = \prod_{ℓ=1}^{2} Beta(x_{iℓ} | j_{cℓ}, k_{cℓ} − j_{cℓ} + 1).   (C.11)

Given the conditional priors for j_{c1} and j_{c2}, this prior produces a flat prior predictive for any p(k_{c1} | I) and p(k_{c2} | I):

p(x_i | I) = 1_{[0,1]^2}(x_i).   (C.12)

The posterior for θ_c given a single observation x_i is

p(θ_c | x_i, I) = \frac{f(x_i | θ_c) p(θ_c | I)}{p(x_i | I)}
= \underbrace{\left( \prod_{ℓ=1}^{2} p(k_{cℓ} | I) \right)}_{p(k_c | x_i, I)} \underbrace{\left( \prod_{ℓ=1}^{2} p(j_{cℓ} | x_{iℓ}, k_{cℓ}) \right)}_{p(j_c | x_i, k_c, I)} \underbrace{\left( \frac{c(w_{ic1}, w_{ic2} | τ_c)}{2} \right)}_{p(τ_c | x_i, j_c, k_c, I)}.   (C.13)

Thus, we can make a draw from the joint posterior by first drawing k_c from its prior distribution, then drawing j_c from its distribution conditional on k_c, and finally drawing τ_c from its conditional distribution.

Appendix D. Binomial likelihood

When the data are observations from binomial experiments, it is possible to integrate out the unobserved success rates (since the prior is a mixture of beta distributions).

There is a closed-form expression for the likelihood of θ_i = (j_i, k_i) in terms of the observations:

p(s_i | T_i, θ_i) = p(s_i | T_i, j_i, k_i) = \int_0^1 Binomial(s_i | T_i, x_i) Beta(x_i | j_i, k_i − j_i + 1) dx_i
= Beta–Binomial(s_i | j_i, k_i − j_i + 1, T_i)
= \binom{T_i}{s_i} \frac{k_i! (s_i + j_i − 1)! (T_i + k_i − s_i − j_i)!}{(j_i − 1)! (k_i − j_i)! (T_i + k_i)!}.   (D.1)

Note that

p(s_i | T_i, k_c, I) = \sum_{j_c=1}^{k_c} \frac{p(s_i | T_i, j_c, k_c)}{k_c} = \frac{1}{T_i + 1},   (D.2)

which is independent of k_c. Therefore, p(s_i | T_i, k_c, I) = p(s_i | T_i, I).

We can again use Algorithm 2 from Neal (2000) to make draws from the posterior distribution. Classification of s_i | T_i depends on

\[
p(z_i \mid z^{-i}_{1:n}, \theta_{1:m_n^{-i}}, \alpha, s_i, T_i) \propto
\begin{cases}
\dfrac{d^{-i}_{n,c}}{n - 1 + \alpha}\; p(s_i \mid T_i, \theta_c) & c \in \{1, \dots, m_n^{-i}\} \\[2ex]
\dfrac{\alpha}{n - 1 + \alpha}\; p(s_i \mid T_i, I) & c = m_n^{-i} + 1.
\end{cases} \tag{D.3}
\]
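The classification weights in (D.3) are straightforward to compute once the beta-binomial likelihood is available; here is a minimal sketch in Python/SciPy (the cluster bookkeeping, the tuple format (count, j, k), and the test values are illustrative only):

```python
import numpy as np
from scipy.stats import betabinom

def classification_weights(s_i, T_i, clusters, alpha, n):
    """Unnormalized classification weights in (D.3) for observation i.

    clusters: list of (count, j, k) for the existing clusters with observation i
    removed; n is the total number of observations.
    """
    w = [count / (n - 1 + alpha) * betabinom.pmf(s_i, T_i, j, k - j + 1)
         for count, j, k in clusters]                  # existing clusters, via (D.1)
    w.append(alpha / (n - 1 + alpha) / (T_i + 1))      # new cluster, via (D.2)
    return np.array(w)

w = classification_weights(s_i=4, T_i=10,
                           clusters=[(3, 2, 5), (7, 6, 9)], alpha=1.0, n=11)
print(w / w.sum())   # posterior classification probabilities for observation i
```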

In order to make a draw of a new cluster component, consider the following. Note that

\[
\begin{aligned}
p(j_c, k_c \mid T_i, s_i, I) &= \frac{p(s_i \mid T_i, j_c, k_c)\, p(j_c, k_c \mid I)}{p(s_i \mid T_i, I)} \\
&= \left\{ \frac{(T_i + 1)\, \text{Beta-Binomial}(s_i \mid j_c,\, k_c - j_c + 1,\, T_i)}{k_c} \right\} p(k_c \mid I) \\
&= \text{Beta-Binomial}(j_c - 1 \mid s_i + 1,\, T_i - s_i + 1,\, k_c - 1)\; p(k_c \mid I),
\end{aligned} \tag{D.4}
\]

where

\[
(j_c - 1) \mid (T_i, s_i, k_c) \sim \text{Beta-Binomial}(s_i + 1,\, T_i - s_i + 1,\, k_c - 1), \tag{D.5}
\]

with the proviso that jc = 1 if kc = 1. Note that a single observation by itself does not identify kc.
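A sketch of drawing a new cluster component from (D.4)-(D.5) follows (Python/NumPy; the prior on kc is again a placeholder): draw kc from its prior and then draw jc − 1 from the beta-binomial by compounding a beta draw with a binomial draw.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_new_component(s_i, T_i):
    """Draw (j_c, k_c) from p(j_c, k_c | T_i, s_i, I) as in (D.4)-(D.5)."""
    k_c = int(rng.geometric(0.02))        # placeholder prior p(k_c | I)
    if k_c == 1:
        return 1, k_c                     # proviso: j_c = 1 when k_c = 1
    # (j_c - 1) ~ Beta-Binomial(s_i + 1, T_i - s_i + 1, k_c - 1), drawn by compounding.
    p = rng.beta(s_i + 1, T_i - s_i + 1)
    j_c = 1 + int(rng.binomial(k_c - 1, p))
    return j_c, k_c

print([draw_new_component(s_i=4, T_i=10) for _ in range(5)])
```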

Appendix E. Sharing, shrinkage, and pooling

One may interpret the main prior in terms of partial sharing of the parameters. Whenever a parameter is shared among cases, the associated coefficients are shrunk toward a common value. Complete sharing, therefore, implies global shrinkage, while partial sharing implies local shrinkage, which allows for multiple modes to exist simultaneously.

Gelman et al. (2014) and Gelman and Hill (2007) discuss three types of pooling: no pooling, complete pooling, and partial pooling. The no-pooling model corresponds to our no-sharing prior and the partial-pooling model corresponds to our one-component complete sharing prior (global shrinkage). The complete-pooling model is a special case of our one-component complete sharing prior with the added restriction that all of the xi are the same (complete local shrinkage).

Table 3. Sharing, shrinkage, and pooling. Sharing is controlled by the concentration parameter α. Complete sharing produces global shrinkage (to a single cluster). Local shrinkage is controlled by the bandwidth of the kernel. Complete local shrinkage identifies the cases in a given cluster (i.e., all cases in a given cluster have the same value). The Dirichlet Process (DP) and Dirichlet Process Mixture (DPM) are Bayesian nonparametric priors.

                                    Sharing
Local Shrinkage    complete            partial        none
complete           complete pooling    DP             no pooling
partial            partial pooling     DPM            no pooling
none               no pooling          no pooling     no pooling

We can obtain this restricted model by letting the prior for kc put all its weight on kc = ∞. For large values of kc, the normal approximation to Beta(xi|jc, kc − jc + 1) is given by N(xi|µc, σc²) where
\[
\mu_c = \frac{j_c - 1}{k_c - 1} \quad \text{and} \quad \sigma_c^2 = \frac{(j_c - 1)(k_c - j_c)}{(k_c - 1)^3} = \frac{\mu_c (1 - \mu_c)}{k_c - 1}, \tag{E.1}
\]
for 1 < jc < kc. The normal approximation shows that as kc → ∞, Beta(xi|jc, kc − jc + 1) converges to a point mass located at µc. At the same time the prior distribution for µc converges to the uniform distribution on the unit interval. Thus the limiting case is equivalent to the specification given in (8.2) in which θc corresponds to µc. This latter prior is a special case of the DPM prior known as the Dirichlet process (DP) prior.
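The quality of the approximation in (E.1) is easy to check numerically; a brief sketch (Python/SciPy, with arbitrary values of jc and kc):

```python
import numpy as np
from scipy.stats import beta, norm

j_c, k_c = 200, 500                      # arbitrary values with 1 < j_c < k_c
mu = (j_c - 1) / (k_c - 1)               # (E.1)
var = mu * (1 - mu) / (k_c - 1)

x = np.linspace(mu - 2 * np.sqrt(var), mu + 2 * np.sqrt(var), 5)
print(beta.pdf(x, j_c, k_c - j_c + 1))   # Beta(x | j_c, k_c - j_c + 1)
print(norm.pdf(x, mu, np.sqrt(var)))     # nearly the same for large k_c
```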

See Table 3 for the complete set of relationships. (The case of no pooling via complete sharing can be obtained by letting kc = 1.)


Figure 5. Galaxy data: quasi-Bernstein predictive density with support over the interval [5, 40] and a rug plot of the data.

Figure 6. Galaxy data: Distribution of the number of clusters in the mixture.


Figure 7. Galaxy data: Posterior distribution of λ = α/(1 + α). The mean of 0.55 is indicated by the dashed vertical line. The uniform prior is indicated as well.

Figure 8. Galaxy data: Posterior distribution of kc = ac + bc − 1. The prior distribution is shown (as a smooth curve) for reference.


Figure 9. Buffalo snowfall data: quasi-Bernstein predictive density with support over the interval [0, 150] (snowfall in inches) and a rug plot of the data.

Figure 10. Buffalo snowfall data: Distribution of the number of clusters in the mixture.


Figure 11. Buffalo snowfall data: Posterior distribution of λ = α/(1 + α). The mean of 0.25 is indicated by the dashed vertical line. The uniform prior is indicated as well.

Figure 12. Buffalo snowfall data: Posterior distribution of kc = ac + bc − 1. The prior distribution is shown (as a smooth curve) for reference.


Figure 13. Old Faithful data: Contours for posterior predictive density with support over [1, 5.75] × [35, 105]. Lowest contour is at the level of the uniform prior (≈ 0.003). Contour spacing above the lowest contour is ≈ 0.006. Data are shown as dots and conditional expectations are shown as thicker lines. (Axes: eruption time in minutes vs. waiting time in minutes.)

Figure 14. Old Faithful data: Marginal distributions for eruption time and waiting time computed from the joint distribution.


Figure 15. Old Faithful data: Distribution of the number of clusters in the mixture.

Figure 16. Old Faithful data: Posterior distribution of λ = α/(1 + α). The mean of 0.45 is indicated by the dashed vertical line. The uniform prior is indicated as well.


Figure 17. Old Faithful data: Posterior distribution of kc1 = ac1 + bc1 − 1. The prior distribution is shown (as a smooth curve) for reference.

Figure 18. Old Faithful data: Posterior distribution of kc2 = ac2 + bc2 − 1. The prior distribution is shown (as a smooth curve) for reference.


Figure 19. Posterior distribution for generic rat tumor rate. (Axes: tumor rate vs. probability density.)

Figure 20. Posterior medians and 95% highest posterior density regions of rat tumor rates. Darker lines indicate multiple observations. Compare with Figure 5.4 in Gelman et al. (2014). (Axes: observed tumor rate vs. 95% HPD intervals and posterior medians.)


References

Casella, G., E. Moreno, and F. J. Giron (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis. Accepted for publication. Available online.

Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.

Efron, B. and C. Morris (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association 70, 311–319.

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2014). Bayesian data analysis (Third ed.). CRC Press.

Gelman, A. and J. Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Geweke, J., G. Koop, and H. van Dijk (Eds.) (2011). The Oxford Handbook of Bayesian Econometrics. Oxford University Press.

Greenberg, E. (2013). Introduction to Bayesian Econometrics (Second ed.). Cambridge University Press.

Kottas, A. (2006). Dirichlet process mixtures of beta distributions, with applications to density and intensity estimation. In Workshop on Learning with Nonparametric Bayesian Methods, 23rd International Conference on Machine Learning (ICML).

Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. The Annals of Statistics 24 (3), 911–930.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249–265.

Petrone, S. (1999a). Bayesian density estimation using Bernstein polynomials. The Canadian Journal of Statistics 27, 105–126.

Petrone, S. (1999b). Random Bernstein polynomials. Scandinavian Journal of Statistics 26, 373–393.

Petrone, S. and L. Wasserman (2002). Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 63, 79–100.

Teh, Y. W. (2010). Dirichlet process. In Encyclopedia of Machine Learning. Springer.

Trippa, L., P. Bulla, and S. Petrone (2011). Extended Bernstein prior via reinforced urn processes. Annals of the Institute of Statistical Mathematics 63, 481–496.

Zhao, Y., M. C. Ausín, and M. P. Wiper (2013). Bayesian multivariate Bernstein polynomial density estimation. Statistics and Econometrics Series 11, Working Paper 13-12, Universidad Carlos III de Madrid.

Federal Reserve Bank of Atlanta, Research Department, 1000 Peachtree Street N.E., Atlanta, GA 30309–4470

E-mail address: [email protected]

URL: http://www.markfisher.net


Figure 21. Thumbtack data: Posterior distribution for the generic probability of success.

Figure 22. Thumbtack data: Posterior distribution for λ = α/(1 + α).


Figure 23. Thumbtack data: Posterior distribution for the number of clusters.

Figure 24. Thumbtack data: Posterior distribution for kc.


Figure 25. Thumbtack data: Posterior distributions for the specific probabilities of success, computed for each exchangeable group with a common number of successes, 1 through 9.


Figure 26. Thumbtack data: Posterior distribution for the generic success rate given the alternative model compared with the distribution given the main model.