CONSISTENCY OF BAYES ESTIMATORS OF A BINARY REGRESSION FUNCTION

MARC CORAM AND STEVEN P. LALLEY

Date: December 9, 2004.
Abstract. When do nonparametric Bayesian procedures "overfit"? To shed light on this question, we consider a binary-regression problem in detail and establish frequentist consistency for a large class of Bayes procedures based on certain hierarchical priors, called uniform mixture priors. These are defined as follows: let ν be any probability distribution on the nonnegative integers. To sample a function f from the prior π^ν, first sample m from ν and then sample f uniformly from the set of step functions from [0,1] into [0,1] that have exactly m jumps (i.e., sample all m jump locations and m + 1 function values independently and uniformly). The main result states that, with only one exception, if a data stream is generated according to any fixed, measurable binary-regression function f₀, consistency obtains: i.e., for any ν with infinite support, the posterior of π^ν concentrates on any L¹ neighborhood of f₀. The only exception is that if f₀ is identically 1/2, so that all class-label information is pure noise, inconsistency occurs if the tail of ν is too long. Qualitatively, this is the same as the finding of Diaconis and Freedman for a class of related priors. However, because the uniform mixture priors have randomly located jumps, they are more flexible and presumably more "prone" to overfitting. Solution of a large-deviations problem is central to the consistency proof.
1. Introduction
1.1. Consistency of Bayes Procedures. It has been known since the work of Freedman [8] that Bayesian procedures may fail to be "consistent" in the frequentist sense: For estimating a probability density on the natural numbers, Freedman exhibited a prior that assigns positive mass to every open set of possible densities, but for which the posterior is consistent only at a set of the first category. Freedman's example is neither pathological nor rare: for other instances, see [7], [4], [10], and the references therein.

Frequentist consistency of a Bayes procedure here will mean that, in a suitable topology, the posterior probability of each neighborhood of the "true parameter" tends to 1. The choice of topology may be critical: For consistency in the weak topology on measures it is generally enough that the prior should place positive mass on every Kullback-Leibler neighborhood of the true parameter [21], but for consistency in stronger topologies, more stringent requirements on the prior are needed – see, for example, [1], [9], [22].
Roughly, these demand not only that the prior charge Kullback-Leibler neighborhoods of the true parameter, but also that it not be overly diffuse, as this can lead to "overfitting". Unfortunately, it appears that in certain nonparametric function estimation problems, the general formulation of this latter requirement for consistency in [1] is far too stringent, as it rules out large classes of useful priors for which the corresponding Bayes procedures are in fact consistent.
1.2. Binary Regression. The purpose of this paper is to examine in detail the consistency properties of Bayes procedures based on certain hierarchical priors in a nonparametric regression problem. For mathematical simplicity (such as it is – the reader will be the judge), we choose to work in the setting of binary regression, with covariates valued in the unit interval [0,1]. Consistency of Bayes procedures in binary regression has been studied previously by Diaconis and Freedman [5], [6] for a class of priors – suggested by de Finetti – that are supported by the set of step functions with discontinuities at dyadic rationals. The use of such priors may be quite reasonable in circumstances where the covariate is actually an encoding (via binary expansion) of an infinite sequence of binary covariates. However, in applications where the numerical value of the covariate represents a real physical variable, the restriction to step functions with discontinuities only at dyadics is highly unnatural; and simulations show that when the regression function is continuous, the concentration of the posterior may be quite slow.
Coram [3] has proposed a natural class of priors on step functions, which we shall call uniform mixture priors, that are at once mathematically natural, allow computationally efficient simulation of posteriors, and appear to have much more favorable concentration properties for data generated by continuous binary regression functions than do the Diaconis-Freedman priors. These priors π^ν, like those of Diaconis and Freedman, are hierarchical priors parametrized by probability distributions ν on the nonnegative integers. A random step function with distribution π^ν can be obtained as follows: (1) Choose a random integer M with distribution ν. (2) Given that M = m, choose m points u_i at random in [0,1] according to the uniform distribution: these are the points of discontinuity of the step function. (3) Given M = m and the discontinuities u_i, choose the m + 1 step heights w_j by sampling again from the uniform distribution. The uniform sampling in steps (2)-(3) allows for easy and highly efficient Metropolis-Hastings simulations of posteriors; the uniform distribution could be replaced by other distributions in either step, at the expense of some efficiency in posterior simulations (and our main theoretical results could easily be extended to such priors), but we see no compelling reason to discuss such generalizations in detail.
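Steps (1)-(3) translate directly into code. The following is a minimal sketch in Python, not the authors' implementation: the hierarchy prior is truncated to a finite probability vector nu_probs (an illustrative simplification of ours), and f is any vectorized map from [0,1] to [0,1].

    import numpy as np

    def sample_prior_step_function(nu_probs, rng):
        # Draw a step function from the uniform mixture prior pi^nu:
        # (1) M ~ nu; (2) M split points i.i.d. uniform; (3) M+1 heights i.i.d. uniform.
        m = rng.choice(len(nu_probs), p=nu_probs)
        u = np.sort(rng.uniform(size=m))
        w = rng.uniform(size=m + 1)
        return u, w

    def evaluate_step(u, w, x):
        # The step function takes the value w[i] on the cell J_i(u).
        return w[np.searchsorted(u, x)]

    def sample_data(f, n, rng):
        # Data stream under P_f: X_i ~ Uniform[0,1]; Y_i | X_i = x ~ Bernoulli(f(x)).
        x = rng.uniform(size=n)
        y = (rng.uniform(size=n) <= f(x)).astype(int)
        return x, y

    rng = np.random.default_rng(0)
    u, w = sample_prior_step_function(np.full(10, 0.1), rng)   # nu = uniform on {0,...,9}
    x, y = sample_data(lambda t: evaluate_step(u, w, t), 200, rng)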
Let f be a binary regression function on [0,1], that is, a Borel-measurable function f : [0,1] → [0,1]. We shall assume that under P_f the data (X_i, Y_i) are i.i.d. random vectors, with X_i uniformly distributed on [0,1] and Y_i, given X_i = x, Bernoulli-f(x). (Our main result would also hold if the covariate distribution were not uniform but any other distribution giving positive mass to all intervals of positive length.) Let Q^ν = ∫ P_f dπ^ν, and denote by Q^ν(· | F_n) the posterior distribution on step functions given the first n observations (X_i, Y_i). The main result of the paper is as follows.
Theorem 1. Assume that the hierarchy prior ν is not supported by a finite subset of the integers. Then for every binary regression function f ≢ 1/2, the Q^ν-Bayes procedure is L¹-consistent at f, that is, for every ε > 0,

(1) lim_{n→∞} P_f{ Q^ν({g : ‖g − f‖₁ > ε} | F_n) > ε } = 0.
The restriction f ≢ 1/2 arises for precisely the same reason as in [6], namely, that this exceptional function is the prior mean of the regression function. See [6] for further discussion.
Theorem 1 implies that the uniform mixture priors enjoy the same consistency as do the Diaconis-Freedman priors [6]. This is not exactly unexpected, but neither should it be considered a priori obvious: as the proof will show, there are substantial differences between the uniform mixture priors and those of Diaconis and Freedman. As noted above, the possibility of inconsistency arises because the posteriors may favor models that are "over-fit". Since the uniform mixture priors allow the step-function discontinuities to arrange themselves in favorable (but atypical for uniform samples) configurations vis-a-vis the data, the danger of over-fitting would seem, at least a priori, greater than for the Diaconis-Freedman priors. In fact, the bulk of the proof (sections 4-5) will be devoted to showing that there is no excessive accumulation of posterior mass in the end zone (the set of step functions where the number of discontinuities grows linearly with the number of data points), where over-fitting occurs.
It should be noted that the sufficient conditions for consistency given in [1] can be specialized to the uniform mixture priors. Unfortunately, these sufficient conditions require that the hierarchy prior ν satisfy

Σ_{k≥m} ν_k ≤ m^{−mC}

for some C > 0 (see [3]). Such a severe restriction on the tail of the hierarchy prior certainly prevents the accumulation of posterior mass in the end zone, but at the possible cost of having the posterior favor models that are under-fit. Preliminary analysis seems to indicate that when the true regression function is smooth, more rapid posterior concentration takes place when the hierarchy prior has a rather long tail.
1.3. Overfitting and Large Deviations Problems. The problem of excessive accumulation of posterior mass in the end zone is, as it turns out, naturally tied up with a large-deviations problem connected to the model: this is the most interesting mathematical feature of the paper. (A similar large-deviations problem occurs in [6], but there it reduces easily to the classical Cramér LD theorem for sums of i.i.d. random variables.) Roughly, we will show in section 4 that as the complexity m of the model (here, the number of discontinuities in the step function) and the sample size n tend to ∞ in such a way that m/n → α > 0, the posterior mass of the set of all step functions with complexity m decays at a precise exponential rate ψ(α) in n. The mathematical issues involved in this program are reminiscent of those encountered in rigorous statistical mechanics, specifically in connection with the existence of thermodynamic limits (see, for example, [20], chapters 2-3): the connection is that the log-likelihood serves as a kind of Hamiltonian on configurations of step function discontinuities. Consistency (1) follows from the fact (proved in section 5) that ψ(α) is uniquely maximized at α = 0, as this, in essence, implies that the posterior is concentrated on models of small complexity relative to the sample size. That posterior concentration in the small-complexity regime must occur in L¹-neighborhoods of the true regression function follows by routine arguments – see section 3.
We expect (and hope to show in a subsequent paper) that for large classes of hierarchical priors, in a variety of problems, the critical determinant of the consistency of Bayes procedures will prove to be the rate functions in associated large deviations problems. The template of the analysis is as follows: Let

(2) π = Σ_{m=0}^∞ ν_m π_m

be a hierarchical prior obtained by mixing priors π_m of "complexity" m. Let Q and Q_m be the probability distributions on the space of data sequences gotten by mixing with respect to π and π_m, respectively; and let Q(· | F_n) and Q_m(· | F_n) be the corresponding posterior distributions given the information in the σ-field F_n. Then by Bayes' formula,

(3) Q(· | F_n) = { Σ_{m=0}^∞ ν_m Z_{m,n} Q_m(· | F_n) } / { Σ_{m=0}^∞ ν_m Z_{m,n} },

where Z_{m,n} are the "predictive probabilities" for the data in F_n based on the model Q_m (see section 2.2 for more detail in the binary regression problem). This formula makes apparent that the relative sizes of the predictive probabilities Z_{m,n} determine where the mass in the posterior Q(· | F_n) is concentrated. The large deviations problem is to show that as m, n → ∞ in such a way that m/n → α,

(4) Z_{m,n}^{1/n} → exp{ψ(α)}

in P_f-probability, for an appropriate nonrandom rate function ψ(α), and to show that ψ(α) is uniquely maximized at α = 0. This, when true, will imply that most of the posterior mass will be concentrated on "models" π_m with small complexity m relative to the sample size n, where overfitting does not occur.
1.4. Choice of Topology. The use of the L¹-metric (equivalently, any L^p-metric, 0 ≤ p < ∞) in measuring posterior concentration, as in (1), although in many ways natural, may not always be appropriate. Posterior concentration relative to the L¹-metric justifies confidence that, for a new random sample of individuals with covariates uniformly distributed on [0,1], the responses will be reasonably well-predicted by a regression function sampled from the posterior; but it would not justify similar confidence for a random sample of individuals all with covariate (say) x = .47. For this, posterior concentration in the sup-norm metric would be required. We do not yet know if consistency holds in the sup-norm metric, for either the uniform mixture priors or the Diaconis-Freedman priors, even for smooth f; but we conjecture that it does. We hope to return to this issue in a subsequent paper.
2. Preliminaries
2.1. Data. A (binary) regression function is a Borel measurable function f : [0,1] → [0,1], or, more generally, f : J → [0,1] where J is an interval. For each binary regression function f, let P_f be a probability measure on a measurable space supporting a "data stream" {(X_n, Y_n)}_{n≥1} such that under P_f the random variables X_n are i.i.d. Uniform-[0,1] and, conditional on σ({X_n}_{n≥1}), the random variables Y_n are independent Bernoullis with conditional means

(5) E_f(Y_n | σ({X_m}_{m≥1})) = f(X_n).

(In several arguments below it will be necessary to consider alternative distributions F for the covariates X_n. In such cases we shall adopt the convention of adding the subscript F to relevant quantities; thus, for instance, P_{f,F} would denote a probability distribution under which the covariates X_n are i.i.d. F, and the conditional distribution of the responses Y_n is the same as under P_f.) We shall assume when necessary that probability spaces support additional independent streams of uniform and exponential r.v.s (and thus also Poisson processes), so that auxiliary randomization is possible. Generic data sets (values of the first n pairs (x_i, y_i)) will be denoted (x,y), or (x,y)_n to emphasize the sample size; the corresponding random vectors will be denoted by the matching upper case letters (X,Y). For any data set (x,y) and any interval J ⊂ [0,1], the number of successes (y_i = 1), the number of failures (y_i = 0), and the total number of data points with covariate x_i ∈ J will be denoted by N^S(J), N^F(J), and N(J) = N^S(J) + N^F(J).
In certain comparison arguments, it will be convenient to have data streams for different regression functions defined on a common probability space (Ω, F, P). This may be accomplished by the usual device: Let {X_n}_{n≥1} and {V_n}_{n≥1} be independent, identically distributed Uniform-[0,1] random variables, and set

(6) Y_n^f = 1{V_n ≤ f(X_n)}.
2.2. Priors on Regression Functions. The prior distributions on regression functions considered in this paper are probability measures on the set of step functions with finitely many discontinuities. Points of discontinuity, or "split points", of step functions will be denoted by u_i, and step heights by w_i. Each vector u = (u_1, u_2, …, u_m) of split points induces a partition of the unit interval into m + 1 subintervals (or "cells") J_i = J_i(u). Denote by π_u the probability measure on step functions with discontinuities u_i that makes the step height random variables W_i (that is, the values w_i on the intervals J_i(u)) independent and uniformly distributed on [0,1]. For each nonnegative integer m, define π_m to be the uniform mixture of the measures π_u over all split point vectors u of length m, that is,

(7) π_m = ∫_{u∈(0,1)^m} π_u du.

It is, of course, possible to mix against probability distributions G on [0,1] other than the uniform, and in some arguments it will be necessary for us to do so. The priors of primary interest – those considered in Theorem 1 – are mixtures of the measures π_m against hierarchy priors ν on the nonnegative integers:

(8) π^ν = Σ_{m=0}^∞ ν_m π_m.

Each of the probability measures π_u, π_m, and π^ν induces a corresponding probability measure on the space of data sequences by mixing:

(9) Q_u = ∫ P_f dπ_u(f),
(10) Q_m = ∫ P_f dπ_m(f), and
(11) Q^ν = ∫ P_f dπ^ν(f).

Observe that Q_m is the uniform mixture of the measures Q_u over split point vectors u of size m, and Q^ν is the ν-mixture of the measures Q_m.
For any data sample (x,y), the posterior distribution Q(· | (x,y)) under any of the measures Q_u, Q_m, or Q^ν is the conditional distribution on the set of step functions given that (X,Y) = (x,y). The posterior distribution Q_u(· | (x,y)) can be explicitly calculated: it is the distribution that makes the step height r.v.s W_i independent, with Beta-(N_i^S, N_i^F) distributions, where N_i^S = N^S(J_i(u)) and N_i^F = N^F(J_i(u)) are the success/failure counts in the intervals J_i of the partition induced by u. Thus, the joint density of the step heights (relative to product Lebesgue measure on the cube [0,1]^{m+1}) is

(12) q_u(w | (x,y)) = Z_u(x,y)^{−1} ∏_{i=0}^m w_i^{N_i^S} (1 − w_i)^{N_i^F},

where the normalizing constant Z_u(x,y), henceforth called the Q_u-predictive probability for the data sample (x,y), is given by

(13) Z_u(x,y) = ∫_{w∈[0,1]^{m+1}} ∏_{i=0}^m w_i^{N_i^S} (1 − w_i)^{N_i^F} dw = ∏_{i=0}^m B(N_i^S, N_i^F), where

(14) B(m,n) = { (m + n + 1) C(m + n, m) }^{−1}.

(This is not the usual convention for the arguments of the beta function, but will save us from a needless proliferation of +1s.)
The posterior distributionsQm(· | (x,y)) and corresponding
predictive probabilities Zm(x,y) are relatedto Qu(· | (x,y)) and
Zu(x,y) as follows:
(15) Qm(· | (x,y)) ={
∫
u∈[0,1]mQu(· | (x,y))Zu(x,y) d(u)
}
/
Zm(x,y)
where
Zm(x,y) =
∫
u∈(0,1)mZu(x,y) d(u)(16)
=
∫
u∈(0,1)m
m∏
i=0
B(NSi , NFi ) d(u).
(Note: The dependence of the integrand on u, via the values of
the suc-cess/failure counts NSi , N
Fi , is suppressed.) In general the last integral can-
not be evaluated in closed form, unlike the integral (13) that
defines theQu−predictive probabilities. This, as we shall see in
sections 4-5, will makethe mathematical analysis of the posteriors
considerable more difficult thanthe corresponding analysis for
Diaconis-Freedman priors.
Note for future reference (sec. 5) that the predictive probabilities Z_m are related to likelihood ratios dQ_m/dP_f: In particular, when f ≡ p is constant,

(17) Z_m((X,Y)_n) = p^{N^S} (1 − p)^{N^F} (dQ_m/dP_p)|_{F_n},

where F_n is the σ-algebra generated by the first n data points, and N^S, N^F are the numbers of successes and failures in the entire data set (X,Y)_n.
Finally, the posterior distribution Q^ν(· | (x,y)), whose asymptotic behavior is the main concern of this paper, is related to the posterior distributions Q_m(· | (x,y)) by Bayes' formula:

(18) Q^ν(· | (x,y)) = { Σ_{m=0}^∞ ν_m Z_m(x,y) Q_m(· | (x,y)) } / { Σ_{m=0}^∞ ν_m Z_m(x,y) }.
The goal of sections 4-5 will be to show that for large samples (X,Y)_n, under P_f, the predictive probabilities Z_{αn}((X,Y)_n) are of smaller exponential magnitude for α > 0 than for α = 0. This will imply that the posterior concentrates in the region m ≪ n, where the number of split points is small compared to the number of data points.

Caution: Note that π_m and π_u have different meanings, as do Z_m and Z_u, and Q_u and Q_m. The reader should have no difficulty discerning the proper meaning, by context or careful font analysis.
2.3. Transformations. Part of the analysis to follow will depend crucially on the invariance under monotone transformations of the predictive probabilities (16). We have assumed that the covariates X_n are uniformly distributed on [0,1]; and, in constructing the priors π_m and π^ν, we have used uniform mixtures on the locations of the split points u_i. This is only for convenience: clearly, the covariate space could be relabelled by any homeomorphism without changing the nature of the estimation problem. Thus, if the data sample (x,y) were changed to (G^{−1}x, y), where G is a continuous, strictly increasing c.d.f., and if G-mixtures rather than uniform mixtures were used in building the priors, then the predictive probabilities would be unchanged:

(19) Z_{m,G}(G^{−1}x, y) = Z_m(x,y),

where Z_{m,G} denotes the predictive probability for the transformed data relative to priors built using G-mixtures. (This follows easily from the transformation formula for integrals.)

Two instances will be useful below. First, if (x,y)_n is a data sample of size n, with covariates x_i ∈ [0,1], and if G is the c.d.f. of the uniform distribution on the interval [0, n] (so that G^{−1} is just multiplication by n), then the transformation has the effect of standardizing the spacings between data points and between split points. This transformation will be used in section 5. Second, if (X,Y) is a data stream with covariates X_i uniformly distributed on [0,1], then the subsequence obtained by keeping only those pairs (X_i, Y_i) such that X_i ∈ J, for some interval J, has the joint distribution of a data stream in which the covariates are i.i.d. uniform on J, and is independent of the corresponding subsequence gotten by keeping only those (X_i, Y_i) such that X_i ∈ J^c. This will be of crucial importance in section 4, where it will be used to show that the predictive probability Z_{[αn]}((X,Y)_n) approximately splits as a product of two independent predictive probabilities, one for each of the intervals [0, .5] and [.5, 1].
2.4. Beta Function and Beta Distributions. Because the posterior distributions (12) of the step height random variables are Beta distributions, certain elementary properties of these distributions and the corresponding normalizing constants B(m,n) will play an important role in the analysis. The behavior of the Beta function for large arguments is well understood, and easily deduced from Stirling's formula. Similarly, the asymptotic behavior of the Beta distributions follows from the fact that these are the distributions of uniform order statistics:

Beta Concentration Property. For each ε > 0 there exists k(ε) < ∞ such that for all index pairs (m,n) with weight m + n > k(ε), (a) the Beta-(m,n) distribution puts all but at most ε of its mass within ε of m/(m+n); and (b) the normalization constant B(m,n) satisfies

(20) | (m + n)^{−1} log B(m,n) + H(m/(m+n)) | < ε,

where H(x) is the Shannon entropy, defined by

(21) H(x) = −x log x − (1 − x) log(1 − x).

Note that the binomial coefficient in the expression (14) is bounded above by 2^{m+n}, so it follows that B(m,n) ≥ 4^{−m−n}. Thus, by equation (16), for any data sample (x,y) of size n,

(22) Z_{m,G}(x,y) ≥ 4^{−n}.

Some of the arguments in section 4 will require an estimate of the effect on the integral (16) of adding another split point. This breaks one of the intervals J_i into two, leaving all of the others unchanged, and so the effect on the integrand in (16) is that one of the factors B(N_i^S, N_i^F) is replaced by a product of two factors B(N_L^S, N_L^F) B(N_R^S, N_R^F), where the cell counts satisfy

N_L^S + N_R^S = N_i^S and N_L^F + N_R^F = N_i^F.

The following inequality shows that the multiplicative error made in this replacement is bounded by the overall sample size:

(23) B(N_i^S, N_i^F) / { B(N_L^S, N_L^F) B(N_R^S, N_R^F) } ≤ N_i^S + N_i^F.
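Both the Stirling-type approximation (20) and the splitting inequality (23) are easy to check numerically with exact integer arithmetic. A small self-contained check (ours, not from the paper):

    from math import comb, log

    def log_B(m, n):
        # The paper's convention (14): B(m, n) = 1 / ((m + n + 1) * C(m + n, m)).
        return -log((m + n + 1) * comb(m + n, m))

    def H(x):
        # Shannon entropy (21), with H(0) = H(1) = 0.
        return 0.0 if x in (0.0, 1.0) else -x * log(x) - (1 - x) * log(1 - x)

    # (20): log B(m, n) / (m + n) approaches -H(m / (m + n)) as m + n grows
    for m, n in [(5, 5), (50, 150), (500, 1500)]:
        print(m, n, log_B(m, n) / (m + n), -H(m / (m + n)))

    # (23): B(ns, nf) <= (ns + nf) * B(nsl, nfl) * B(nsr, nfr), over all splits of the counts
    ns, nf = 30, 20
    for nsl in range(ns + 1):
        for nfl in range(nf + 1):
            assert log_B(ns, nf) <= log(ns + nf) + log_B(nsl, nfl) + log_B(ns - nsl, nf - nfl) + 1e-9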
2.5. The Entropy Functional. We will show, in section 3 below, that in the "Middle Zone" where the number of split points is large but small compared to the number n of data points, the predictive probability decays at a precise exponential rate as n → ∞. The rate is the negative of the entropy functional H(f), defined by

(24) H(f) = ∫_0^1 H(f(x)) dx,

where H(x), for x ∈ (0,1), is the Shannon entropy defined by (21) above. The Shannon entropy function H(x) is uniformly continuous and strictly concave on [0,1], with second derivative bounded away from 0; it is strictly positive except at the endpoints, 0 and 1; and it attains a maximum value of log 2 at x = 1/2. The entropy functional H(f) enjoys similar properties:
Entropy Continuity Property. For each ε > 0 there exists δ > 0 so that

(25) ‖f − g‖₁ < δ ⟹ |H(f) − H(g)| < ε.

Entropy Concavity Property. Let f and g be binary regression functions such that g is an averaged version of f in the following sense: There exist finitely many pairwise disjoint Borel sets B_i such that {x : g(x) ≠ f(x)} = ∪_i B_i, and for each i such that |B_i| > 0,

(26) g(x) = ∫_{B_i} f(y) dy / |B_i| for all x ∈ B_i.

Then
(27) H(g) − H(f) ≥ −( max_{0<x<1} H″(x) / 2 ) ‖f − g‖₂² ≥ 0.
For each split-point vector u, let f̄_u denote the step function whose value on each interval J_i(u) is the mean value ∫_{J_i} f / |J_i| of f on that interval. Then by the Concavity Property,

(28) H(f̄_u) ≥ H(f),

with strict inequality unless f = f̄_u a.e. Moreover, the difference is small if and only if f and f̄_u are close in L¹. This will be the case if all intervals J_i of the partition are small:
Lemma 1. For each binary regression function f and each ε > 0 there exists δ > 0 such that if |J_i| < δ for every interval J_i in the partition induced by u, then

(29) ‖f − f̄_u‖₁ < ε.

Proof. First, observe that the assertion is elementary for continuous regression functions, since continuity implies uniform continuity on [0,1]. Second, recall that continuous functions are dense in L¹[0,1], by Lusin's theorem; thus, for each regression function f and any η > 0 there exists a continuous function g : [0,1] → [0,1] such that ‖f − g‖₁ < η. It then follows by the elementary inequality |∫ h| ≤ ∫ |h| that for any vector u of split points, ‖f̄_u − ḡ_u‖₁ < η. Finally, use η = ε/3, and choose δ so that for the continuous function g and any u that induces a partition whose intervals are all of length < δ,

‖g − ḡ_u‖₁ < η.

Then by the triangle inequality for L¹,

‖f − f̄_u‖₁ ≤ ‖g − f‖₁ + ‖g − ḡ_u‖₁ + ‖f̄_u − ḡ_u‖₁ ≤ 3η = ε. □
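The Concavity Property and Lemma 1 can also be seen numerically: averaging f over the cells of a random partition raises the entropy functional, and the gap closes as the cells shrink. A hedged illustration (the grid discretization and the example f are ours):

    import numpy as np

    def H_scalar(p):
        # Shannon entropy (21), with 0 log 0 = 0.
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    def entropy_functional(f_vals):
        # H(f) = integral of H(f(x)) dx, approximated on a uniform grid over [0,1].
        return H_scalar(f_vals).mean()

    def cell_average(f_vals, grid, u):
        # f-bar_u: replace f by its mean value on each cell J_i(u).
        cells = np.searchsorted(np.sort(u), grid)
        out = np.empty_like(f_vals)
        for i in np.unique(cells):
            out[cells == i] = f_vals[cells == i].mean()
        return out

    rng = np.random.default_rng(1)
    grid = np.linspace(0.0, 1.0, 10001)
    f_vals = 0.5 + 0.4 * np.sin(2 * np.pi * grid)      # a smooth regression function
    for m in (2, 8, 32, 128):
        fbar = cell_average(f_vals, grid, rng.uniform(size=m))
        # (28): H(f-bar_u) >= H(f); by Lemma 1 the gap closes as the cells shrink
        print(m, entropy_functional(fbar), entropy_functional(f_vals))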
2.6. Empirical Distributions under P_f. Ultimately, the Consistency Theorem follows from the Law of Large Numbers for functionals of the data sequence (X,Y). In its simplest form, this states that for any fixed interval J, (a) the fraction N^S(J)/N(J) of successes among the data points falling in J converges (almost surely under P_f) to the mean value of f on J as the sample size n → ∞; and (b) the fraction N(J)/n of the entire sample that falls in J converges to |J| (the Lebesgue measure of J). In section 3, we will need the following weak uniform version of this statement.

Proposition 2. For any ε > 0 there exists κ = κ(ε) such that

(30) limsup_{n→∞} P_f{ |{x : sup_{|J|≥κ/n, x∈J} |N(J)/n − |J|| ≥ ε|J|}| ≥ ε } < ε and

(31) limsup_{n→∞} P_f{ |{x : sup_{|J|≥κ/n, x∈J} |N^S(J)/n − ∫_J f(x) dx| ≥ ε|J|}| ≥ ε } < ε.
(Here |J| denotes the Lebesgue measure of J.)

Proof. We shall outline the proof of (30) and leave that of (31) to the reader. First, note that the event in (30) involves only the covariates X_i, which are uniformly distributed on [0,1] regardless of the choice of f, and thus have the same distribution as under P. We claim that for each δ > 0 there exist C = C(δ) and n(δ) sufficiently large that for any sample size n ≥ n(δ),

(32) P{ sup_{1≥t≥C/n} |N([0,t]) − nt| / nt ≥ δ } < δ.

This follows by a routine Poissonization argument: If the sample size n is changed to a random sample size Λ, where Λ is independent of the data sequence (X,Y) and has the Poisson distribution with mean αn, then under P the process N([0,t]) is a Poisson process (in t) of rate nα. By the usual SLLN for the Poisson process, if C = C(δ) is sufficiently large then

P{ sup_{1≥t≥C/n} |N([0,t]) − αnt| / nt ≥ δ } < δ.

By choosing parameter values α₋ < 1 < α₊, one may bracket the original data sample of size n by Poissonized samples, and by choosing α₋ and α₊ sufficiently near 1, deduce the claim (32).

Next, observe that if the covariates X_i are "rotated" by an amount x (that is, if each X_i is replaced by X_i + x mod 1), their joint distribution is unchanged, as the uniform distribution on the circle is rotation-invariant. Therefore, (32) implies that for each x ∈ [0, 1 − C/n],

(33) P{ sup_{1−x≥t≥C/n} |N([x, x+t]) − nt| / nt ≥ δ } < δ.

Similarly, because the uniform distribution is invariant under the reflection mapping x ↦ 1 − x, for all x ≥ C/n,

(34) P{ sup_{x≥t≥C/n} |N([x−t, x]) − nt| / nt ≥ δ } < δ.

Let B = B(C, δ) be the set of all x for which the event in (33) occurs, and B′ the set of x for which the event in (34) occurs. Then by Fubini's theorem, the expected Lebesgue measures of the sets B and B′ are < δ, and hence, by Markov's inequality,

(35) P{|B| ≥ √δ} < √δ and P{|B′| ≥ √δ} < √δ.

The assertions (30)-(31) now follow, because for any point x that lies in an interval J of length at least κ/n for which |N(J)/n − |J|| ≥ ε|J| (and not within distance κ/n of the endpoints 0, 1), it must be the case that either

|N([x−t, x])/n − t| ≥ εt/4 or |N([x, x+t])/n − t| ≥ εt/4

for some t ≥ εκ/(4n). □
3. Beginning and Middle Zones
Following [5], we designate three asymptotic "zones" where the predictive probabilities Z_m((X,Y)_n) decay at different exponential rates. These are determined by the relative sizes of m, the number of discontinuities of the step functions, and n, the sample size. The end zone is the set of pairs (m,n) such that m/n ≥ ε; this zone will be analyzed in sections 4-5, where we shall prove that the asymptotic decay of Z_m((X,Y)_n) is faster than in the middle zone, where K ≤ m ≤ εn for a large constant K. The beginning zone is the set of pairs (m,n) for which m ≤ K for some large K. A regression function cannot be arbitrarily well-approximated by step functions with a bounded number of discontinuities unless it is itself a step function, and so, as we will see, the asymptotic decay of Z_m((X,Y)_n) is generally faster in the beginning zone than in the middle zone.

In this section we analyze the beginning and middle zones, using the Beta Concentration Property, Lemma 1, and Proposition 2. In the beginning and middle zones, the number m of split points is small compared to the number n of data points, and so for typical split-point vectors u, most intervals in the partition induced by u will, with high probability, contain a large number of data points. Consequently, the law of large numbers applies in these intervals: together with the Beta Concentration Property, it ensures that the Q_u-posterior is concentrated in an L¹-neighborhood of f̄_u, and that the Q_u-predictive probability is roughly exp{−nH(f̄_u)}. The next proposition makes this precise.
Proposition 3. For each δ > 0 there exists ε > 0 such that the following is true: For all sufficiently large n, the P_f-probability is at least 1 − δ that for all m ≤ εn and all split-point vectors u of size m,

(36) Q_u({g : ‖g − f̄_u‖₁ ≥ δ} | (X,Y)_n) < δ and
(37) |n^{−1} log Z_u((X,Y)_n) + H(f̄_u)| < δ.

Proof. Let J_i = J_i(u) be the intervals in the partition induced by u. Fix κ = κ(δ) as in Proposition 2. If ε is sufficiently small, then for any split-point vector u of size m ≤ εn, the union of those J_i of length ≤ κ/n will have Lebesgue measure < δ: this follows by a trivial counting argument. Let B_u be the union of those J_i that are either of length ≤ κ/n or are such that either

(38) |N_i/n − |J_i|| ≥ δ|J_i| or |N_i^S/n − ∫_{J_i} f| ≥ δ|J_i|,

where N_i^S, N_i^F are the success/failure counts in the interval J_i(u) = J_i, and N_i = N_i^S + N_i^F. By Proposition 2, the P_f-probability of the event G^c that there exists a split-point vector u of size m ≤ εn for which the Lebesgue measure of B_u exceeds 2δ is less than ε, for all large n. But on the complementary event G, the inequality (36) must hold (with possibly different values of δ), by the Beta Concentration Property.

For the proof of (37), recall that by (13),

(39) n^{−1} log Z_u((X,Y)_n) = n^{−1} Σ_{i=0}^m log B(N_i^S, N_i^F),

where B(k,l) is the Beta function (using our convention for the arguments). By the Stirling approximation (20), each term of the sum for which N_i is large is well-approximated by −N_i H(N_i^S/N_i); and for each index i such that J_i ⊄ B_u, this in turn is well approximated by −n|J_i| H(f̄_u(J_i)), where f̄_u(J_i) is the average of f on the interval J_i. If B_u were empty, then (37) would follow directly.

By Proposition 2, P_f(G^c) < ε for all sufficiently large n. On the complementary event G, the Lebesgue measure of the set B_u of "bad" intervals J_i is < 2δ. Because the intervals J_i not contained in B_u must have approximately the expected frequency n|J_i| of data points, by (38), the number of data points in B_u cannot exceed 4δn, on the event G. Since 1 ≥ B(k,l) ≥ 4^{−k−l}, it follows that the summands in (39) for which J_i ⊂ B_u cannot contribute more than 4δ log 4 to the right side. The assertion (37) now follows (with a larger value of δ). □
Corollary 4. For each ε > 0 there exist δ > 0 and K < ∞ such that the following is true: For all sufficiently large n, the P_f-probability is at least 1 − δ that for all K ≤ m ≤ εn,

(40) Q_m({g : ‖g − f‖₁ ≥ δ} | (X,Y)_n) < δ and
(41) |n^{−1} log Z_m((X,Y)_n) + H(f)| < δ.

Proof. For large m (say, m ≥ K), most split-point vectors u (as measured by the uniform distribution on [0,1]^m) are such that all intervals J_i(u) in the induced partition are short – this follows, for instance, from the Glivenko-Cantelli theorem – and so, by Lemma 1, ‖f − f̄_u‖₁ will be small. Thus, if

B_m(α) := {u ∈ [0,1]^m : ‖f − f̄_u‖₁ ≥ α},

then B_m(α) has Lebesgue measure < β for all m ≥ K(β), for some K(β) < ∞. Inequality (36) of Proposition 3 implies that for each u in the complementary set B_m^c(α), the Q_u-posterior distribution is concentrated on a small L¹-neighborhood of f, provided α is small. Thus, to prove (40), it must be shown that the contribution to the Q_m-posterior (15) from split-point vectors u ∈ B_m(α) is negligible. For this, it suffices to show that the predictive probabilities Z_u((X,Y)_n) are not larger for u ∈ B_m(α) than for u ∈ B_m^c(α).

By Entropy Continuity, if α > 0 is sufficiently small then for all u ∈ B_m^c(α),

|H(f) − H(f̄_u)| < η.

Hence, by inequality (37) of Proposition 3, n^{−1} log Z_u((X,Y)_n) must be within 2η of −H(f) for all u ∈ B_m^c(α). On the other hand, by the Entropy Concavity Property (27), H(f̄_u) ≥ H(f) for all u, and in particular, for all u ∈ B_m(Cη),

H(f̄_u) > H(f) + 4η,

provided C > 0 is appropriately chosen. Consequently, by (37),

n^{−1} log Z_u((X,Y)_n) < −H(f) − 2η

for u ∈ B_m(Cη). Therefore, the primary contribution to the integral (15) must come from u ∈ B_m^c(Cη). This proves (40). Assertion (41) also follows, in view of the representation (16) for the predictive probability Z_m((X,Y)_n). □
The exponential decay rate of the predictive probabilities in the beginning zone depends on whether or not the true regression function f is a step function. If not, the decay is faster than in the middle zone; if so, the decay matches that in the middle zone, but the posterior concentrates in a neighborhood of f.

Corollary 5. If the regression function f is a step function with k discontinuities in (0,1), then for each m ≥ k and all ε > 0, the inequalities (40) and (41) hold with P_f-probability tending to 1 as the sample size n → ∞. If f is not a step function with fewer than K + 1 discontinuities, then there exists ε > 0 such that with P_f-probability → 1 as n → ∞,

(42) max_{m≤K} Z_m((X,Y)_n) < exp{−nH(f) − nε}.

Proof. If f is not a step function with fewer than K + 1 discontinuities, then by the Entropy Concavity Property, there exists ε > 0 so that H(f̄_u) is bounded below by H(f) + ε for all split-point vectors u of length m ≤ K. Hence, (42) follows from (37), by the same argument as in the proof of Corollary 4.

Suppose then that f is a step function with k discontinuities, that is, f = f̄_{u*} for some split-point vector u* of length k. For any other split-point vector u, the entropy H(f̄_u) can be no smaller than H(f), by the Entropy Concavity Property, and so (37) implies that for any m, n^{−1} log Z_m((X,Y)_n) cannot asymptotically exceed −H(f). But since f is a step function with k discontinuities, any open L¹ neighborhood of f has positive π_m-probability; consequently, by Entropy Continuity and (37), n^{−1} log Z_m((X,Y)_n) must asymptotically be at least −H(f). Thus, (41) holds with P_f-probability → 1 as n → ∞. Finally, (40) follows by the same argument as in the proof of Corollary 4. □
Corollaries 5 and 4 imply that, with P_f-probability → 1 as n → ∞, the Q^ν-posterior in the beginning and middle zones concentrates near f, and that the total posterior mass in the beginning and middle zones decays at the exponential rate H(f) as n → ∞. Thus, to complete the proof of Theorem 1, it suffices to show that the posterior mass in the end zone m ≥ δn decays at an exponential rate > H(f). This will be the agenda for the remainder of the article: see Proposition 6 below.
4. The End Zone
For the Diaconis-Freedman priors, the log-predictive probabilities simplify neatly as sums of independent random variables, and so their asymptotic behavior drops out easily from the usual WLLN. No such simplification is possible in our case: the integral in the second line of (16) does not admit further reduction, as an integral conspires to separate the log from the product inside. Thus, the analysis of the posterior in the End Zone will necessarily be somewhat more roundabout than in the Diaconis-Freedman case. The main objective is the following.

Proposition 6. For any Borel measurable regression function f ≢ 1/2 and all ε > 0, there exists a constant δ = δ(ε, f) > 0 such that

(43) lim_{n→∞} P_f{ sup_{m≥εn} log Z_m((X,Y)_n) ≥ n(−H(f) − δ) } = 0.

The key to proving this will be to establish that the predictive probabilities decay exponentially in n at a precise rate, depending on α > 0, for m/n → α > 0. (In fact, only a "Poissonized" version of this will be proved.) See Proposition 11 below for a precise statement.
4.1. Preliminaries: Comparison and Poissonization. Comparison arguments will be based on the following simple observation.

Lemma 7. Adding more data points (x_i, y_i) to the sample (x,y) decreases the value of Z_m(x,y).

Proof. For each fixed pair (u, w), adding data points to the sample increases (some of) the cell counts N_i^S, N_i^F, and therefore decreases the integrand in (13). □

Two "Poissonizations" will be used, one for the data sample, the other for the sample of split points. Let Λ(t) and M(t) be independent standard Poisson counting processes of intensity 1, jointly independent of the data stream (X,Y). Replacing the sample (X,Y)_n of fixed size n by a "Poissonized" sample (X,Y)_{Λ(n)} of size Λ(n) has the effect of making the success/failure counts in disjoint intervals independent random variables with Poisson distributions.

Lemma 8. For each ε > 0, the probability that

(44) Z_m((X,Y)_{Λ(n+εn)}) ≤ Z_m((X,Y)_n) ≤ Z_m((X,Y)_{Λ(n−εn)})

for all m approaches 1 as n → ∞.
Proof. For any ε > 0, P{Λ(n − εn) ≤ n ≤ Λ((1 + ε)n)} → 1 as n → ∞, by the weak law of large numbers. On this event, the inequalities (44) must hold, by Lemma 7. □

Poissonization of the number of split points entails mixing the priors π_m according to a Poisson hyper-prior. For any λ > 0, let π*_λ be the Poisson-λ mixture of the priors π_m, and let Q*_λ be the corresponding induced measure on data sequences (equivalently, Q*_λ is the Poisson-λ mixture of the measures Q_m). Then the Q*_λ-predictive probability for a data set (x,y) is given by

(45) Z*_λ(x,y) := Σ_{k=0}^∞ (λ^k e^{−λ} / k!) Z_k(x,y).
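Since (16) expresses Z_m as the expectation of Z_u over a uniformly distributed split vector u, a crude Monte Carlo average gives an unbiased (if slow) estimate, and (45) is then a Poisson-weighted sum. The sketch below is our own illustrative device, not the paper's computational method; it reuses log_Zu from the sketch in section 2.2 and assumes λ > 0:

    import numpy as np
    from math import lgamma, log

    def log_Zm_mc(m, x, y, rng, reps=500):
        # Monte Carlo estimate of log Z_m in (16): Z_m = E_u[Z_u], u ~ Uniform[0,1]^m.
        logs = np.array([log_Zu(rng.uniform(size=m), x, y) for _ in range(reps)])
        a = logs.max()
        return a + log(np.exp(logs - a).mean())   # log-mean-exp, avoids underflow

    def log_Zstar(lam, x, y, rng, kmax=None):
        # Poissonized predictive probability (45), truncating the Poisson-lam mixture.
        kmax = kmax if kmax is not None else int(lam + 10 * lam ** 0.5) + 10
        terms = []
        for k in range(kmax + 1):
            log_pois = -lam + k * log(lam) - lgamma(k + 1)   # log Poisson(lam) pmf at k
            terms.append(log_pois + log_Zm_mc(k, x, y, rng))
        terms = np.array(terms)
        a = terms.max()
        return a + log(np.exp(terms - a).sum())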
In some of the arguments to follow, an alternative representation of these Poissonized predictive probabilities as a conditional expectation will be useful. Thus, assume that on the underlying probability space (Ω, F, P) (or (Ω, F, P_f)) are defined i.i.d. uniform-[0,1] r.v.s U_n that are jointly independent of the data stream and the Poisson processes Λ, M. Then

(46) Z*_λ((X,Y^f)_{Λ(n)}) = E(β | (X,Y^f)_{Λ(n)})

where

(47) β = β(U_{M(λ)}; (X,Y^f)_{Λ(n)}) := ∏_{i=0}^{M(λ)} B(N_i^S, N_i^F)

and N_i^S, N_i^F are the success/failure cell counts for the data (X,Y^f)_{Λ(n)} relative to the partition induced by the split point sample U_{M(λ)}. Alternatively, if the regression function f is fixed,

(48) Z*_λ((X,Y)_{Λ(n)}) = E_f(β | (X,Y)_{Λ(n)}).

The effect of Poissonization on the number of split points is a bit more subtle than the effect on data, because there is no simple a priori relation between neighboring predictive probabilities Z_m(x,y) and Z_{m+1}(x,y). However, because the Poisson distribution with mean αn assigns mass at least C/√n to the value [αn] (where C = C(α) > 0 is continuous in α), the following is true.
Lemma 9. For each α > 0, with C = C(α) > 0 as above, for all n ≥ 1 and all data sets (x,y),

(49) Z_{[αn]}(x,y) ≤ C^{−1} √n Z*_{αn}(x,y).

Lemma 10. For each ε > 0 and A < ∞ there exists δ > 0 such that if µ, λ ≤ A and |µ − λ| ≤ δ, then for all n ≥ 1 and all data sets (x,y) of size n,

(50) Z*_{µn}(x,y) / Z*_{λn}(x,y) ≤ e^{nε}.
Proof. The inequality (22) implies that Z*_{λn}(x,y) ≥ 4^{−n}. Chernoff's large deviation inequality implies that if M has the Poisson distribution with mean λn, where λ ≤ A, then

P{M ≥ κn} ≤ e^{−γn},

where γ → ∞ as κ → ∞. Since Z_k(x,y) ≤ 1, it follows that the contribution to the sum (45) from terms indexed by k ≥ κn is of smaller exponential order of magnitude than that from terms indexed by k < κn, provided γ > log 4.

Consider the Poisson distributions with means µn, λn ≤ An: these are mutually absolutely continuous, and the likelihood ratio at the integer value k is

(µ/λ)^k e^{nλ−nµ}.

If k ≤ κn and |µ − λ| is sufficiently small then this likelihood ratio is less than e^{nε}. By the result of the preceding paragraph, only values of k ≤ κn contribute substantially to the expectations; thus, the assertion follows. □
4.2. Exponential Decay. The asymptotic behavior of the doubly Poissonized predictive probabilities is spelled out in the following proposition, whose proof will be the goal of sections 4.4-4.6 and section 5 below.

Proposition 11. For each Borel measurable regression function f and each α > 0 there exists a constant ψ_f(α) such that, as n → ∞,

(51) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) → ψ_f(α) in P_f-probability.

The function ψ_f(α) satisfies

(52) ψ_f(α) = ∫_0^1 ψ(f(x), α) dx,

where ψ(p, α) = ψ_p(α) is the corresponding limit for the constant regression function f ≡ p. The function ψ(p, α) is jointly continuous in p, α and satisfies

(53) lim_{α→∞} max_{p∈[0,1]} |ψ_p(α) + log 2| = 0 and
(54) ψ_p(α) < −H(p).

Note that the entropy inequality (54) extends to all regression functions f: that is, p may be replaced by f on both sides of (54). This follows from the integral formulas that define ψ_f(α) and H(f). The fact that this inequality is strict is crucially important to the consistency theorem. It will also require a rather elaborate argument: see section 5 below.

The case f ≡ p, where the regression function is constant, will prove to be the crucial one. In this case, the existence of the limit (51) is somewhat reminiscent of the existence of "thermodynamic limits" in formal statistical
mechanics (see [20], Ch. 3). Unfortunately, Proposition 11 cannot be reduced to the results of [20], as (i) the data sequence enters conditionally (thus functioning as a "random environment"); and, more importantly, (ii) the hypothesis of "tempered interaction" needed in [20] cannot be verified here. The limit (51) is also related to the "conditional LDP" of Chi [2], but again cannot be deduced from the results of that paper, because the log-predictive probability cannot be expressed as a continuous functional of the empirical distribution of split point/data point pairs.
4.3. Proof of Proposition 6. Before proceeding with the long and somewhat arduous proof of Proposition 11, we show how it implies Proposition 6. In the process, we shall establish the asymptotic behavior (53) of the rate function.
Lemma 12. For every δ > 0 there exists α_δ < ∞ such that for every α ≥ α_δ and every regression function f, with P_f-probability tending to one as n → ∞,

(55) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) ≤ −log 2 + δ and
(56) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) ≥ −log 2 − δ.

Proof. Consider the spacings ξ_j between successive covariates of the Poissonized sample (X,Y)_{Λ(n)}: these are independent exponentials with mean 1/n. For ε > 0, let G = G_{δ,ε} be the event that at least (1 − δ)n of the spacings ξ_j are larger than ε/n. Call these spacings "fat". Since the spacings are independent exponentials with mean 1/n, the Glivenko-Cantelli theorem implies that there exists δ = δ(ε) → 0 as ε → 0 such that

lim_{n→∞} P(G_{δ,ε} ∩ {|Λ(n) − n| < εn}) = 1.

By elementary large deviations estimates for the Poisson process, given G, the probability that a random sample of M(αn) split points leaves more than 2δn of the fat spacings without a split point is less than exp{−nγ}, where γ = γ(α, ε, δ) → ∞ as α → ∞. But on the complement
of this event, at least (1 − 4δ)n of the intervals induced by the split points have exactly one data point. Thus, on the event G ∩ {|Λ(n) − n| < εn},

2^{−n+4δn} 4^{−4δn−εn} ≤ ∏_i B(N_i^S, N_i^F) ≤ 2^{−n+4δn}.

Observe that these inequalities hold regardless of the assignment y of values to the response variables. Thus, taking conditional expectations given the data (X,Y)_{Λ(n)}, we obtain

(60) (1 − e^{−nγ}) 2^{−n−4δn−2εn} ≤ Z*_{αn}((X,Y)_{Λ(n)}) ≤ 2^{−n+4δn} + e^{−nγ}.

Since γ can be made arbitrarily large by making α large, the assertions (55)-(56) follow. □
Proof of Proposition 6. Since H(f) < log 2 for every regression function f ≢ 1/2, Lemma 12 implies that to prove (43) it suffices to replace the supremum over m ≥ εn by the supremum over m ∈ [εn, ε^{−1}n]. Now for m in this range, the bound (49) is available; since log n is negligible compared to n, (49) implies that

sup_{εn≤m≤ε^{−1}n} log Z_m((X,Y)_n)

may be replaced by

sup_{ε≤α≤ε^{−1}} log Z*_{αn}((X,Y)_{Λ(n)})

in (43). Lemma 10 implies that this last supremum may be replaced by a maximum over a finite set of values α, and now (43) follows from assertions (51), (52), and (54) of Proposition 11. □
4.4. Constant Regression Functions. The shortest and simplest route to the convergence (51) is via subadditivity (more precisely, approximate subadditivity). Assume that f ≡ p is constant, and that the constant p ≠ 0, 1. In this case, the integrals (16) defining the predictive probabilities have a nearly "self-similar" structure: after Poissonization of the sample size and the number of split points, the integral in (16) almost factors perfectly into the product of two integrals, one over the data and split points in [0, 1/2], the other over (1/2, 1], of the same form (but on a different scale – see (19)). Unfortunately, this factorization is not exact, as the partition of the unit interval induced by the split points u_i includes an interval that straddles the demarcation point 1/2. However, the error can be controlled, and so the convergence (51) can be deduced from a simple approximate subadditive WLLN (Proposition 23 of the Appendix).
Lemma 14. Fix α > 0, and write ζ_n = log Z*_{αn}((X,Y)_{Λ(n)}). For each pair m, n ∈ N of positive integers there exist random variables ζ′_{m,m+n}, ζ″_{n,m+n}, and R_{m,n} such that

(a) ζ′_{m,m+n} and ζ″_{n,m+n} are independent;
(b) ζ_m and ζ′_{m,m+n} have the same law;
(c) ζ_n and ζ″_{n,m+n} have the same law;
(d) the random variables {R_{m,n}}_{m,n≥1} are identically distributed;
(e) E|R_{1,1}| < ∞;

and such that

(61) ζ_{m+n} ≥ ζ′_{m,m+n} + ζ″_{n,m+n} − R_{m,n}.

Together with the approximate subadditive WLLN (Proposition 23 of the Appendix), Lemma 14 yields:

Corollary 15. For each α > 0 and each p ∈ [0,1] there exists a constant ψ_p(α) such that, under P_p, as n → ∞,

(62) n^{−1} ζ_n → ψ_p(α) in probability.

Proof of Lemma 14. Split the covariate interval at b = m/(m + n), and write the integrand β as a product β′β″ of the factors B(N_i^S, N_i^F) indexed by the cells to the left and to the right of b, controlling the cell that straddles b as in the proof of (63) below. The conditional expectations of these products (given the data and the values of U′, U″) have the same distributions as exp{ζ_m} and exp{ζ_n}, respectively. Thus,

exp{ζ_{m+n}} ≥ δ exp{ζ′_m} exp{ζ″_n} 4^{−N**},

where ζ′_m and ζ″_n are independent, with the same distributions as ζ_m and ζ_n, respectively, and N** is Poisson with mean 2. Taking logarithms and setting R_{m,n} := log(1/δ) + N** log 4 gives (61). □
Remark. There is a similar (and in some respects simpler) approximate subadditivity relation among the distributions of the random variables ζ_n: For each pair m, n ≥ 1 of positive integers, there exist independent random variables ξ′_{m,m+n}, ξ″_{n,m+n} whose distributions are the same as those of ζ_m, ζ_n, respectively, such that

(63) ζ_{m+n} ≤ ξ′_{m,m+n} + ξ″_{n,m+n} + log Λ(m + n).

Corollary 15 can also be deduced from (63), but this requires a more sophisticated almost-subadditive WLLN than is proved in the Appendix, because the remainders log Λ(m + n) are not uniformly L¹ bounded, as they are in (61).
Proof of (63). Consider the effect on the integral (16) of adding a split point at b = m/(m + n): This breaks one of the intervals J_i into two, leaving all of the others unchanged, and so the effect on the integrand in (16) is that one of the factors B(N_i^S, N_i^F) is replaced by a product of two factors B(N_L^S, N_L^F) B(N_R^S, N_R^F). By (23), the multiplicative error in this replacement is bounded above by Λ(m + n). After the replacement, the factors in the integrand β = ∏ B(N_i^S, N_i^F) may be partitioned neatly into those indexed by intervals left of b and those indexed by intervals right of b: thus,

β = β′β″,

where β′, β″ are independent and have the same distributions as the products β occurring as integrands in the expectations defining exp{ζ_m} and exp{ζ_n}, respectively. Thus,

exp{ζ_{m+n}} ≤ exp{ξ′_m} exp{ξ″_n} Λ(m + n),

where ξ′_m = ξ′_{m,m+n} and ξ″_n = ξ″_{n,m+n} are independent and distributed as ζ_m and ζ_n, respectively. □
4.5. Piecewise Constant Regression Functions. The next step is to extend the convergence (51) to piecewise constant regression functions f. For ease of exposition, we shall restrict attention to step functions with a single discontinuity in (0,1); the general case involves no new ideas. Thus, assume that

f(x) = p_L for x ≤ b, f(x) = p_R for x > b, and p_L ≠ p_R.
Fix α > 0, and set

(64) Z*_n := Z*_{αn}((X,Y)_{Λ(n)}).

Lemma 16. With P_f-probability approaching one as n → ∞,

(65) Z*_n ≥ Z′_n Z″_n / n² and
(66) Z*_n ≤ 2n Z′_n Z″_n,

where for each n the random variables Z′_n, Z″_n are independent, with the same distributions as

(67) Z′_n =_L Z*_{αnb}((X,Y)_{Λ(bn)}) under P_{p_L}; Z″_n =_L Z*_{αn−αnb}((X,Y)_{Λ(n−nb)}) under P_{p_R}.
Proof. Consider the effect on Z*_n of placing an additional split point at b: this would divide the interval straddling b into two non-overlapping intervals L, R (for "left" and "right"), and so in the integrand β := ∏ B(N_i^S, N_i^F) the single factor B(N_*^S, N_*^F) representing the interval straddling b would be replaced by a product of two factors B(N_L^S, N_L^F) and B(N_R^S, N_R^F). As in the proof of the subadditivity inequality (63) in section 4.4, the factors of this modified product separate into those indexed by subintervals of [0, b] and those indexed by subintervals of [b, 1]; thus, the modified product has the form β′β″, where β′ and β″ are the products of the factors indexed by intervals to the left and right, respectively, of b. Denote by Z′_n and Z″_n the conditional expectations of β′ and β″ (given the data). These are independent random variables, and by the scaling relation (19) their distributions satisfy (67). By inequality (23), the multiplicative error in making the replacement is at most Λ(n); since the event Λ(n) ≥ 2n has probability tending to 0 as n → ∞, inequality (66) follows.

The reverse inequality (65) follows by a related argument. Let G be the event that the data sample (X,Y)_{Λ(n)} contains no points with covariate X_i ∈ [b, b + n^{−2}]. Since the covariates are generated by a Poisson point process with intensity n, the probability of G^c is approximately n^{−1}. Consider the integral (over all samples of split points) that defines Z*_n: this integral exceeds its restriction to the event A that there is a split point in [b, b + n^{−2}]. The conditional probability of A (given the data) is approximately αn^{−1}, and thus larger than n^{−2} for large n. On the event G ∩ A,

β = β′β″

holds exactly, as the split point in [b, b + n^{−2}] produces exactly the same bins as if the split point were placed at b. Moreover, conditioning on the event A does not affect the joint distribution (conditional on the data) of β′, β″ when G holds. Thus, the conditional expectation of the product, given A and the data, equals Z′_n Z″_n on the event G. □
Taking nth roots on each side of (65) and (66) and appealing to Corollary 15 now yields the following.

Corollary 17. If the regression function is piecewise constant, with only finitely many discontinuities, then the convergence (51) holds.
4.6. Thinning. Extension of the preceding corollary to arbitrary Borel measurable regression functions will be based on thinning arguments. Recall that if points of a Poisson point process of intensity λ(x) are randomly removed with location-dependent probability ρ(x), then the resulting "thinned" point process is again Poisson, with intensity λ(x) − ρ(x)λ(x). This principle may be applied to both the success (y = 1) and failure (y = 0) point processes in a Poissonized data sample. Because thinning at location-dependent rates may change the distribution of the covariates, it will be necessary to deal with data sequences with non-uniform covariate distribution. Thus, let (X,Y) be a data sample of random size with Poisson-λ distribution under the measure P_{f,F} (here f is the regression function, F is the distribution of the covariate sequence X_j). If successes (x, y = 1) are removed from the sample with probability ρ₁(x) and failures (x, y = 0) are removed with probability ρ₀(x), then the resulting sample will be a data sample of random size with the Poisson-µ distribution from P_{g,G}, where the mean µ, the regression function g, and the covariate distribution G satisfy

(68) µ g(x) G(dx) = (1 − ρ₁(x)) λ f(x) F(dx) and
     µ (1 − g(x)) G(dx) = (1 − ρ₀(x)) λ (1 − f(x)) F(dx).

By the monotonicity principle (Lemma 7), the predictive probability of the thinned sample will be no smaller than that of the original sample. Thus, thinning allows comparison of predictive probabilities for data generated by two different measures P_{f,F} and P_{g,G}. The first and easiest consequence is the continuity of the rate function.
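Thinning is straightforward to implement; the sketch below (names ours, not the authors') removes successes and failures at location-dependent rates, which by (68) carries a Poissonized P_{f,F}-sample to a Poissonized P_{g,G}-sample.

    import numpy as np

    def thin_sample(x, y, rho0, rho1, rng):
        # Remove failures (y = 0) with probability rho0(x) and successes (y = 1)
        # with probability rho1(x); the retained points form the thinned sample.
        v = rng.uniform(size=len(x))
        drop = np.where(y == 1, v < rho1(x), v < rho0(x))
        return x[~drop], y[~drop]

For instance, removing failures with a constant probability ε (rho0 ≡ ε, rho1 ≡ 0) realizes the comparison used in the proof of Lemma 18 below.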
Lemma 18. The rate function ψ_p(α) is jointly continuous in p, α.

Proof. Corollary 15 and Lemma 10 imply that the functions α ↦ ψ_p(α) are uniformly continuous in α. Continuity in p and joint continuity in p, α are now obtained by thinning. Let (X,Y) be a random sample of size Λ(n) ∼ Poisson-n from a data stream distributed according to P_p (that is, f ≡ p and F is the uniform-[0,1] distribution). Let (X,Y)′ be the sample obtained by randomly removing failures from the sample (X,Y), with probability ε. Then (X,Y)′ has the same distribution as a random sample of size Λ(n − εqn) (here q = 1 − p) from a data stream distributed according to P_{p′}, where p′ = p/(1 − εq). By the monotonicity principle (Lemma 7),

Z*_{αn}((X,Y)) ≤ Z*_{αn}((X,Y)′).

Taking nth roots and appealing to Corollary 15 shows that

ψ(p, α) ≤ (1 − εq) ψ(p/(1 − εq), α/(1 − εq)).

A similar inequality in the opposite direction can be obtained by reversing the roles of p and p/(1 − εq). The continuity in p of ψ(p, α) now follows from the continuity in α, and the joint continuity follows from the uniform continuity in α. □
Proposition 19. The convergence (51) holds for every Borel measurable regression function f.

Proof. By Corollary 17 above, the convergence holds for all piecewise constant regression functions with only finitely many discontinuities. The general case will be deduced from this by another thinning argument.

If f : [0,1] → [0,1] is measurable, then for each ε > 0 there exists a piecewise constant g : [0,1] → [0,1] (with only finitely many discontinuities) such that ‖f − g‖₁ < ε. If ε is small, then |f − g| must be small except on a set B of small Lebesgue measure; moreover, g may be chosen so that g = 1 wherever f is near 1, and g = 0 wherever f is near 0 (except on B). For such choices of ε and g there will exist removal rate functions ρ₀(x) and ρ₁(x) so that equation (68) holds with F = the uniform distribution on [0,1], G = the uniform distribution on [0,1] − B, and

|λ/µ − 1| < δ(ε)

for some constants δ(ε) → 0 as ε → 0. (Note: Requiring G to be the uniform distribution on [0,1] − B forces complete thinning in B, that is, ρ₀ = ρ₁ = 1 in B.) Thus, a Poissonized data sample distributed according to P_f may be thinned so as to yield a Poissonized data sample distributed according to P_{g,G}, in such a way that the overall thinning rate is arbitrarily small. It follows, by the monotonicity principle, that the Poissonized predictive probabilities for data distributed according to P_f are majorized by those for data distributed according to P_{g,G}, with a slightly smaller rate.

Now consider data (X,Y) distributed according to P_{g,G}: Since g is piecewise constant and G is a uniform distribution, the transformed data (G(X),Y) will be distributed as P_h, where h is again piecewise constant. Moreover, since the removed set B has small Lebesgue measure, the function h is close to the function g in the Skorohod topology, and so by Lemma 18, ψ_h ≈ ψ_g ≈ ψ_f. Because the convergence (51) has been established for piecewise constant regression functions h, it now follows from the monotonicity principle that

P_f{ n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) > ψ_f(α) + δ } → 0

for every δ > 0. This proves the upper (and for us, the more important) half of (51). The lower half may be proved by a similar thinning argument in the reverse direction. □
5. Proof of the Entropy Inequality (54)
This requires a change of perspective. Up to now, we have adopted the point of view that the covariates X_j and the split points U_i are generated by Poisson point processes in the unit interval of intensities n and αn, respectively. However, the transformation formula (19) implies that the predictive probabilities, and hence also their Poissonized versions, are unchanged if the covariates and the split points are rescaled by a common factor n. The rescaled covariates X̂_j := nX_j and split points Û_i := nU_i are then generated by Poisson point processes of intensities 1 and α on the interval [0, n]. Consequently, versions of all the random variables Z*_{[αn]}((X,Y)_{Λ(n)}) may be constructed from two independent Poisson processes of intensities 1 and α on the whole real line. The advantage of this new point of view is the possibility of deducing the large-n asymptotics from the Ergodic Theorem.
5.1. Reformulation of the inequality. To avoid cluttered notation, we shall henceforth drop the hats from the rescaled covariates and split points. Thus, assume that under both P = Pp and Q,
··· < X−1 < X0 < 0 < X1 < ··· and ··· < U−1 < U0 < 0 < U1 < ···
are the points of independent Poisson point processes X and U of intensities 1 and α, respectively, and let {Wi}i∈Z be a stream of uniform-[0,1] random variables independent of the point processes X, U. Denote by N(t) the number of occurrences in the Poisson point process X during the interval [0, t], and set Ji = (Ui, Ui+1]. Let {Yj}j∈Z be Bernoulli r.v.s distributed according to the following laws:
(A) Under P, the random variables Yj are i.i.d. Bernoulli-p, jointly independent of the Poisson point processes U, X.
(B) Under Q, the random variables Yj are conditionally independent, given X, U, W, with conditional distributions Yj ∼ Bernoulli-Wi, where i is the index of the interval Ji containing Xj.
Under Q the sequence {Yn}n∈Z is an ergodic, stationary sequence; for reasons that we shall explain below, we shall refer to this process as the rechargeable Polya urn. The distribution of (X,Y) ∩ [0, t] under Q is, after rescaling of the covariates by the factor t, the same as that of a data sample of random size Λ(t) under the Poisson mixture Q∗αt defined in section 4.1 above.
For (extended) integers −∞ ≤ m ≤ n define σ-algebras
F^{X,Y}_{m,n} = σ({Xj, Yj}_{m≤j≤n}) and F^Y_{m,n} = σ({Yj}_{m≤j≤n}).
If m, n are both finite then the restrictions of the measures P, Q to F^{X,Y}_{m,n} (and therefore also to F^Y_{m,n}) are mutually absolutely continuous. The Radon-Nikodym derivative on the smaller σ-algebra F^Y_{1,n} is just
(69) \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}} = \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)}
where
q(y_1, y_2, \dots, y_n) := Q\{Y_j = y_j \;\forall\; 1 \le j \le n\} and
p(y_1, y_2, \dots, y_n) := P\{Y_j = y_j \;\forall\; 1 \le j \le n\} = p^{\sum_{j=1}^n y_j} (1-p)^{n - \sum_{j=1}^n y_j}.
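In code, p is an explicit product, while the full sequence likelihood q requires integrating out the split points and has no comparably simple product form. Conditionally on the split points, however, the draws falling in a single interval J_i form an ordinary Polya urn, whose marginal likelihood is a Beta integral. A sketch of both pieces (helper names are my own, under the setup above):

```python
import numpy as np
from math import lgamma, log

def log_p(y, p):
    """log-likelihood of y under P: i.i.d. Bernoulli-p."""
    y = np.asarray(y)
    s = int(y.sum())
    return s * log(p) + (len(y) - s) * log(1 - p)

def log_q_one_interval(y):
    """log-likelihood under Q of the draws in a single interval J_i:
    conditionally i.i.d. Bernoulli-W_i with W_i uniform on [0, 1], so the
    marginal is the Beta integral k!(m-k)!/(m+1)! for k ones among m draws."""
    m, k = len(y), int(np.asarray(y).sum())
    return lgamma(k + 1) + lgamma(m - k + 1) - lgamma(m + 2)
```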
The Radon-Nikodym derivative on the larger σ-algebra F^{X,Y}_{1,n} cannot be so simply expressed, but it is closely related to the Poissonized predictive probability Z∗αn((X,Y)n) defined by (45). Define
(70) \hat{Z}_n := p(Y_1, Y_2, \dots, Y_n) \left( \frac{dQ}{dP} \right)_{\mathcal{F}^{X,Y}_{1,n}};
then by (17), the random variable Ẑ_{N(n)} has the same distribution, under P, as does the Poissonized predictive probability (45) under Pf, for any f. Hence, the convergence (51) must also hold for the random variables Ẑn:
Corollary 20. Under P, as n → ∞,
(71) n^{-1} \log \hat{Z}_n \xrightarrow{L^1} \psi_p(\alpha).
Therefore, to prove the entropy inequality (54) it suffices to prove that
(72) \lim_{n \to \infty} n^{-1} E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}} < 0.
Proof. The first assertion follows directly from (51) of Proposition 11. Thus, to prove the entropy inequality ψp(α) < −H(p) it suffices, in view of (70), to prove that (72) holds when the σ-algebra F^Y_{1,n} is replaced by F^{X,Y}_{1,n}. But the former is a sub-σ-algebra of the latter; since log is a concave function, Jensen's inequality implies that
E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^{X,Y}_{1,n}} \le E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}}.
□
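For the record, the Jensen step can be written out as follows: under P, the density on the smaller σ-algebra is the conditional expectation of the density on the larger one, so concavity of the logarithm yields

```latex
\begin{aligned}
E_P \log \left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}}
  &= E_P\, E_P\!\left[\log \left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}} \,\Big|\, \mathcal F^{Y}_{1,n}\right] \\
  &\le E_P \log E_P\!\left[\left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}} \,\Big|\, \mathcal F^{Y}_{1,n}\right]
   = E_P \log \left(\frac{dQ}{dP}\right)_{\mathcal F^{Y}_{1,n}}.
\end{aligned}
```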
5.2. The relative Shannon-McMillan-Breiman theorem. The existence of the limit (71) is closely related to the relative Shannon-McMillan-Breiman theorem studied by several authors [19], [14], [15], [18] in an entertaining series of papers not entirely devoid of errors and misconceptions. The sequence Y1, Y2, . . . is, under either measure P or Q, an ergodic stationary sequence of Bernoulli random variables. Thus, by the usual Shannon-McMillan-Breiman theorem [23], as n → ∞,
n^{-1} \log q(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } Q} -h_Q \quad \text{and} \quad n^{-1} \log p(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } P} -H(p)
where hQ is the Kolmogorov-Sinai entropy of the sequence Yj under Q. In general, of course, the almost sure convergence holds only for the probability measure indicated – see, for instance, [14] for an example where the first convergence fails under the alternative measure P. The relative Shannon-McMillan-Breiman theorem of [19] gives conditions under which the difference of the two averages
n^{-1} \log \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)} = n^{-1} \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}}
converges under P. In the case at hand, unfortunately, these conditions are not of much use: they essentially require the user to verify that
n^{-1} \log q(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } P} C
for some constant C. Thus, it appears that [19] does not provide a shortcut to the convergence (51).
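As a quick empirical check of the second convergence (an illustration, not part of the paper's argument), one can average the log-likelihood of a long i.i.d. Bernoulli-p sample and compare it with −H(p):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.3, 200_000
Y = (rng.uniform(size=n) < p).astype(int)

# Running average of the log-likelihood; should approach -H(p).
log_lik = np.where(Y == 1, np.log(p), np.log(1 - p))
H = -(p * np.log(p) + (1 - p) * np.log(1 - p))
print(log_lik.mean(), -H)   # the two numbers should nearly agree
```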
5.3. The rechargeable Polya urn. In the ordinary Polya urn scheme, balls are drawn at random from an urn, one at a time; after each draw, the ball drawn is returned to the urn along with another of the same color. If initially the urn contains one red and one blue ball, then the limiting fraction Θ of red balls is uniformly distributed on the unit interval. The Polya urn is connected with Bayesian statistics in the following way: the conditional distribution of the sequence of draws given the value of Θ is that of i.i.d. Bernoulli-Θ random variables.
The rechargeable Polya urn is a simple variant of the scheme described above, differing only in that before each draw, with probability r > 0, the urn is emptied and then reseeded with one red and one blue ball. Unlike the usual Polya urn, the rechargeable Polya urn is recurrent, that is, if Vn := (Rn, Bn) denotes the composition of the urn after n draws, then Vn is a positive recurrent Markov chain on the state space N × N. Consequently, {Vn} may be extended to n ∈ Z in such a way that the resulting process is stationary. Let Yn denote the binary sequence recording the results of the successive draws (1 = blue, 0 = red). Clearly, this sequence has the same law as does the sequence Y1, Y2, . . . under the probability measure Q (with r = 1 − exp{−1/α}).
Lemma 21. For any
ε > 0 there exists m such that the following is true: For any finite sequence y−k, y−k+1, . . . , y0, the conditional distribution of Ym+1, Ym+2, . . . , Y2m given that Yi = yi for all −k ≤ i ≤ 0 differs from the Q-unconditional distribution by less than ε in total variation norm.
Proof. It is enough to show that the conditional distribution of Ym+1, . . . , Y2m given the composition V0 of the urn before the first draw differs from the unconditional distribution by less than ε. Let T be the time of the first regeneration (emptying of the urn) after time 0; then conditional on T = n, for any n ≤ m, and on V0, the distribution of Ym+1, . . . , Y2m does not depend on the value of V0. Thus, if m is sufficiently large that the probability
of having at least one regeneration event between the first and mth draws exceeds 1 − ε, then the conditional distribution given V0 differs from the unconditional distribution by less than ε in total variation norm. □
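The urn itself takes only a few lines to simulate. In the sketch below (function and variable names are my own), recharging with probability r before each draw reproduces the law of the draws under Q when r = 1 − exp{−1/α}:

```python
import numpy as np

rng = np.random.default_rng(4)

def rechargeable_polya(n_draws, r, rng):
    """Simulate n_draws from the rechargeable Polya urn: with probability r
    before each draw, the urn is reset to one red and one blue ball."""
    red, blue = 1, 1
    draws = np.empty(n_draws, dtype=int)
    for t in range(n_draws):
        if rng.uniform() < r:          # recharge: empty and reseed the urn
            red, blue = 1, 1
        if rng.uniform() < blue / (red + blue):
            draws[t] = 1               # blue drawn (recorded as 1)
            blue += 1                  # return it along with another blue
        else:
            draws[t] = 0
            red += 1
    return draws

alpha = 0.7
r = 1 - np.exp(-1 / alpha)             # matches the measure Q of section 5.1
Y = rechargeable_polya(100_000, r, rng)
print(Y.mean())                        # close to 1/2, by red/blue symmetry
```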
The construction of the sequence Y = Y1, Y2, . . . using the rechargeable Polya urn shows that this sequence behaves as a “factor” of a denumerable-state Markov chain (in terminology more familiar to statisticians, the sequence Y follows a “hidden Markov model”). Note that the original specification of the measure Q, in section 5.1 above, exhibits Y as a factor of the Harris-recurrent Markov chain obtained by adjoining to the state variable the current value of W. It does not appear that Yn can be represented as a function of a finite-state Markov chain; if it could, then results of Kaijser [12] would imply the existence of the limit
\lim_{n \to \infty} n^{-1} \log q(Y_1, Y_2, \dots, Y_n)
almost surely under P, and exhibit it as the top Lyapunov exponent of a sequence of random matrix products. Unfortunately, little is known about the asymptotic behavior of random operator products (see [13] and references therein for the state of the art), and so it does not appear that the inequality (72) can be obtained by an infinite-state extension of Kaijser's result.
5.4. Proof of (72). Since it is not necessary to establish the convergence of the integrands on the left side of (72), we shall not attempt to do so. Instead, we will proceed from the identity
(73) n^{-1} E_P \log \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)} = n^{-1} \sum_{k=0}^{n-1} E_P \log \frac{q(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}{p(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}.
Because the random variables Yi are i.i.d. Bernoulli-p under P, the conditional probabilities p(y_{k+1} | y_1, y_2, . . . , y_k) must coincide with the unconditional probabilities p(y_{k+1}). Thus, the usual information inequality (Jensen's inequality), in the form E_f log(g(X)/f(X)) < 0 for distinct probability densities f, g, implies that for each k,
(74) E_P \log \frac{q(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}{p(Y_{k+1})} \le 0,
with the inequality strict unless the Q-conditional distribution of Y_{k+1} given the past coincides with the Bernoulli-p distribution. Moreover, the left side of (74) will remain bounded away from 0 as long as the conditional distribution remains bounded away from the Bernoulli-p distribution (in any reasonable metric, e.g., the total variation distance). Thus, to complete the proof of (72) it suffices to establish the following lemma.
Lemma 22. There is no sequence of integers kn → ∞ along which
(75) ‖q(· | Y_1, Y_2, \dots, Y_{k_n}) − p(·)‖_{TV} −→ 0
in P-probability.
Proof. This is based on the fact that the sequence of draws Y1, Y2, . . . produced by the rechargeable Polya urn is not a Bernoulli sequence, that is, the Q- and P-distributions of the sequence Y1, Y2, . . . are distinct. Denote by qk the Q-conditional probability that Y_{k+1} = 1 given the values Y1, Y2, . . . , Yk. Suppose that q_{k_n} → p in P-probability; then by summing over successive values of the last l variables, it follows that q_{k_n − l} → p in P-probability for each fixed l ∈ N. We will show that this leads to a contradiction.
Consider the following method of generating binary random variables Y1, Y2, . . . , Y2m: first generate i.i.d. Bernoulli-p random variables Yj for −k ≤ j ≤ 0; then, conditional on their values, generate Y1 according to q_{k+1}; then, conditional on Y1, generate Y2 according to q_{k+2}; and so on. By the hypothesis of the preceding paragraph, there is a sequence kn → ∞ such that for any fixed m the joint distribution of Y1, Y2, . . . , Y2m converges to the product-Bernoulli-p distribution. But this contradicts the mixing property of the rechargeable Polya urn asserted by Lemma 21 above. □
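The failure of independence that drives this proof is visible in simulation. In the self-contained sketch below (my own illustration, with the same urn dynamics as above), consecutive draws are positively correlated, so the pair frequency P{Y_k = 1, Y_{k+1} = 1} exceeds the square of the marginal frequency:

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_sequence(n, r, rng):
    """Draws from the rechargeable Polya urn (1 = blue, 0 = red)."""
    red = blue = 1
    out = np.empty(n, dtype=int)
    for t in range(n):
        if rng.uniform() < r:          # recharge the urn
            red = blue = 1
        if rng.uniform() < blue / (red + blue):
            out[t], blue = 1, blue + 1
        else:
            out[t], red = 0, red + 1
    return out

Y = draw_sequence(1_000_000, r=0.2, rng=rng)
p1 = Y.mean()                      # marginal frequency of 1s (about 1/2)
p11 = (Y[1:] & Y[:-1]).mean()      # frequency of consecutive (1, 1) pairs
print(p11, p1 ** 2)                # p11 exceeds p1^2: the draws are dependent
```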
6. Appendix: An Almost Subadditive WLLN
The purpose of this appendix is to prove the simple variant of the Subadditive Ergodic Theorem required in section 4. For the original subadditive ergodic theorem of Kingman, see [16], and for another variant that is useful in applications to percolation theory see [17]. There are two novelties in our version: (a) the subadditivity relation is only approximate, with a random error; and (b) there is no measure-preserving transformation related to the sequence Sn.
Proposition 23. Let Sn be real random variables. Suppose that for each pair m, n ≥ 1 of positive integers there exist random variables S′_{m,m+n}, S′′_{n,m+n} and a nonnegative random variable R_{m,n} such that
(a) S′_{m,m+n} and S′′_{n,m+n} are independent;
(b) S′_{m,m+n} has the same distribution as Sm;
(c) S′′_{n,m+n} has the same distribution as Sn;
(d) the random variables {R_{m,n}}_{m,n≥1} are identically distributed;
(e) ER_{1,1} < ∞; and
(f) the approximate subadditivity relation
(76) S_{m+n} \le S'_{m,m+n} + S''_{n,m+n} + R_{m,n}
holds. If, in addition, the random variables Sn/n are uniformly integrable, then with γ := lim inf_{n→∞} ESn/n,
(77) S_n/n \longrightarrow \gamma in probability and in L1.
Proof of Proposition 23. Since the random variables S′_{m,m+n} and S′′_{n,m+n} are independent, with the same distributions as Sm and Sn, respectively, Kolmogorov's consistency theorem implies that the probability space may be enlarged so as to support additional random variables permitting recursion on the inequality (76). Here the simplest recursive strategy works: from a starting value n = km + r, reduce by m at each step. This leads to an inequality of the form
(78) S_{km+r} \le S^0_r + \sum_{j=1}^{k} S^j_m + \sum_{j=1}^{k} R_j
where the random variables {S^j_m}_{j≥1} are i.i.d., each with the same distribution as Sm, and the random variables Rj are identically distributed (but not necessarily independent), each with the law of R_{1,1} := R.
The weak law (77) is easily deduced from the inequality (78). Note first that the special case of (76) with m = 1, together with hypothesis (d), implies that ESn ≤ nER + nES1 < ∞ for every n ≥ 1, and so γ < ∞. Assume for definiteness that γ > −∞; the case γ = −∞ may be treated by a similar argument. Divide each side of (78) by km; as k → ∞,
\frac{S^0_r}{km} \xrightarrow{P} 0 \quad \text{and} \quad \frac{1}{km} \sum_{j=1}^{k} S^j_m \xrightarrow{P} \frac{ES_m}{m},
the latter by the usual WLLN. The WLLN need not apply to the sum Σ Rj, since the terms are not necessarily independent; however, since all of the terms are nonnegative and have the same expectation ER < ∞, Markov's inequality implies that for any ε > 0,
P\Big\{ \frac{1}{km} \sum_{j=1}^{k} R_j \ge \varepsilon \Big\} \le \frac{ER}{m \varepsilon}.
Thus, letting m → ∞ through a subsequence along which ESm/m → γ, we find that for any ε > 0,
\lim_{n \to \infty} P\{S_n \ge n\gamma + n\varepsilon\} = 0.
Since the r.v.s Sn/n are uniformly integrable, this implies that
\lim_{n \to \infty} P\{S_n \le n\gamma - n\varepsilon\} = 0,
because otherwise lim inf ESn/n < γ. This proves that Sn/n → γ in probability; in view of the uniform integrability of the sequence Sn/n, convergence in L1 follows. □
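Proposition 23 is easy to test numerically. In the toy instance below (an illustration I am supplying, not an example from the paper), S_n is the log of the operator norm of a product of i.i.d. Gaussian 2×2 matrices; splitting the product at m gives independent copies of S_m and S_n satisfying (76) with R ≡ 0, and S_n/n settles near the constant γ, the top Lyapunov exponent.

```python
import numpy as np

rng = np.random.default_rng(6)

def S(n, rng):
    """S_n = log of the operator norm of a product of n i.i.d. Gaussian
    2x2 matrices.  Submultiplicativity of the norm gives
    S_{m+n} <= S'_m + S''_n with independent summands, i.e. (76) with R = 0."""
    P, log_norm = np.eye(2), 0.0
    for _ in range(n):
        P = rng.normal(size=(2, 2)) @ P
        s = np.linalg.norm(P, 2)     # spectral norm; rescale to avoid overflow
        log_norm += np.log(s)
        P /= s
    return log_norm

for n in (100, 1_000, 10_000):
    print(n, S(n, rng) / n)          # S_n / n settles near the Lyapunov exponent
```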
Remark. Numerous variants of this proposition are true, and may be established by more careful recursions. Among these are WLLNs for random variables satisfying hypotheses such as those given in (63) above, where the remainders log Λ(m + n) are not identically distributed, but whose growth
is sublinear in m + n. For hints as to how such results may be approached, see [11].
References
[1] Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561.
[2] Chi, Z. (2001). Stochastic sub-additivity approach to the conditional large deviation principle. Ann. Probab. 29 1303–1328.
[3] Coram, M. (2002). Nonparametric Bayesian Classification. Ph.D. thesis, Stanford University.
[4] Diaconis, P. and Freedman, D. (1986). On inconsistent Bayes estimates of location. Ann. Statist. 14 68–87.
[5] Diaconis, P. and Freedman, D. A. (1993). Nonparametric binary regression: A Bayesian approach. Ann. Statist. 21 2108–2137.
[6] Diaconis, P. and Freedman, D. A. (1995). Nonparametric binary regression with random covariates. Probability and Mathematical Statistics 15 243–273.
[7] Freedman, D. and Diaconis, P. (1983). On inconsistent Bayes estimates in the discrete case. Ann. Statist. 11 1109–1118.
[8] Freedman, D. A. (1963). On the asymptotic behavior of Bayes' estimates in the discrete case. Ann. Math. Statist. 34 1386–1403.
[9] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
[10] Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian nonparametrics. Springer Series in Statistics, Springer-Verlag, New York.
[11] Hammersley, J. M. (1962). Generalization of the fundamental theorem on sub-additive functions. Proc. Cambridge Philos. Soc. 58 235–238.
[12] Kaijser, T. (1975). A limit theorem for partially observed Markov chains. Ann. Probab. 3 677–696.
[13] Karlsson, A. and Margulis, G. A. (1999). A multiplicative ergodic theorem and nonpositively curved spaces. Comm. Math. Phys. 208 107–123.
[14] Kieffer, J. C. (1973). A counterexample to Perez's generalization of the Shannon-McMillan theorem. Ann. Probab. 1 362–364.
[15] Kieffer, J. C. (1976). Correction to: “A counterexample to Perez's generalization of the Shannon-McMillan theorem” (Ann. Probab. 1 (1973), 362–364). Ann. Probab. 4 153–154.
[16] Kingman, J. F. C. (1973). Subadditive ergodic theory. Ann. Probab. 1 883–909. With discussion by D. L. Burkholder, Daryl Daley, H. Kesten, P. Ney, Frank Spitzer and J. M. Hammersley, and a reply by the author.
[17] Liggett, T. M. (1985). An improved subadditive ergodic theorem. Ann. Probab. 13 1279–1285.
[18] Perez, A. (1964). Extensions of Shannon-McMillan's limit theorem to more general stochastic processes. In Trans. Third Prague Conf. Information Theory, Statist. Decision Functions, Random Processes (Liblice, 1962). Publ. House Czech. Acad. Sci., Prague, 545–574.
[19] Perez, A. (1980). On Shannon-McMillan's limit theorem for pairs of stationary random processes. Kybernetika (Prague) 16 301–314.
[20] Ruelle, D. (1999). Statistical mechanics. World Scientific Publishing Co. Inc., River Edge, NJ. Rigorous results, Reprint of the 1989 edition.
[21] Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 4 10–26.
[22] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29 687–714.
[23] Walters, P. (1982). An introduction to ergodic theory, vol. 79 of Graduate Texts in Mathematics. Springer-Verlag, New York.
University of Chicago, Department of Statistics, 5734 University Avenue, Chicago IL 60637
E-mail address: [email protected]

University of Chicago, Department of Statistics, 5734 University Avenue, Chicago IL 60637
E-mail address: [email protected]