CONSISTENCY OF BAYES ESTIMATORS OF A BINARY REGRESSION FUNCTION

MARC CORAM AND STEVEN P. LALLEY

Date: December 9, 2004.
Abstract. When do nonparametric Bayesian procedures "overfit"? To shed light on this question, we consider a binary-regression problem in detail and establish frequentist consistency for a large class of Bayes procedures based on certain hierarchical priors, called uniform mixture priors. These are defined as follows: let ν be any probability distribution on the nonnegative integers. To sample a function f from the prior π^ν, first sample m from ν and then sample f uniformly from the set of step functions from [0,1] into [0,1] that have exactly m jumps (i.e., sample all m jump locations and m + 1 function values independently and uniformly). The main result states that, with only one exception, if a data stream is generated according to any fixed, measurable binary-regression function f₀, consistency obtains: i.e., for any ν with infinite support, the posterior of π^ν concentrates on any L¹ neighborhood of f₀. The only exception is that if f₀ is identically 1/2, so that all class-label information is pure noise, inconsistency occurs if the tail of ν is too long. Qualitatively, this is the same as the finding of Diaconis and Freedman for a class of related priors. However, because the uniform mixture priors have randomly located jumps, they are more flexible and presumably more "prone" to overfitting. Solution of a large-deviations problem is central to the consistency proof.
1. Introduction
1.1. Consistency of Bayes Procedures. It has been known since the work of Freedman [8] that Bayesian procedures may fail to be "consistent" in the frequentist sense: For estimating a probability density on the natural numbers, Freedman exhibited a prior that assigns positive mass to every open set of possible densities, but for which the posterior is consistent only at a set of the first category. Freedman's example is neither pathological nor rare: for other instances, see [7], [4], [10], and the references therein.

Frequentist consistency of a Bayes procedure here will mean that, in a suitable topology, the posterior probability of each neighborhood of the "true parameter" tends to 1. The choice of topology may be critical: For consistency in the weak topology on measures it is generally enough that the prior should place positive mass on every Kullback-Leibler neighborhood of the true parameter [21], but for consistency in stronger topologies, more stringent requirements on the prior are needed – see, for example, [1], [9], [22].
Roughly, these demand not only that the prior charge Kullback-Leibler neighborhoods of the true parameter, but also that it not be overly diffuse, as this can lead to "overfitting". Unfortunately, it appears that in certain nonparametric function estimation problems, the general formulation of this latter requirement for consistency in [1] is far too stringent, as it rules out large classes of useful priors for which the corresponding Bayes procedures are in fact consistent.
1.2. Binary Regression. The purpose of this paper is to examine in detail the consistency properties of Bayes procedures based on certain hierarchical priors in a nonparametric regression problem. For mathematical simplicity (such as it is – the reader will be the judge), we choose to work in the setting of binary regression, with covariates valued in the unit interval [0,1]. Consistency of Bayes procedures in binary regression has been studied previously by Diaconis and Freedman [5], [6] for a class of priors – suggested by de Finetti – that are supported by the set of step functions with discontinuities at dyadic rationals. The use of such priors may be quite reasonable in circumstances where the covariate is actually an encoding (via binary expansion) of an infinite sequence of binary covariates. However, in applications where the numerical value of the covariate represents a real physical variable, the restriction to step functions with discontinuities only at dyadics is highly unnatural; and simulations show that when the regression function is continuous, the concentration of the posterior may be quite slow.
Coram [3] has proposed a natural class of priors on step functions, which we shall call uniform mixture priors, that are at once mathematically natural, allow computationally efficient simulation of posteriors, and appear to have much more favorable concentration properties for data generated by continuous binary regression functions than do the Diaconis-Freedman priors. These priors π^ν, like those of Diaconis and Freedman, are hierarchical priors parametrized by probability distributions ν on the nonnegative integers. A random step function with distribution π^ν can be obtained as follows: (1) Choose a random integer M with distribution ν. (2) Given that M = m, choose m points u_i at random in [0,1] according to the uniform distribution: these are the points of discontinuity of the step function. (3) Given M = m and the discontinuities u_i, choose the m + 1 step heights w_j by sampling again from the uniform distribution. The uniform sampling in steps (2)-(3) allows for easy and highly efficient Metropolis-Hastings simulations of posteriors; the uniform distribution could be replaced by other distributions in either step, at the expense of some efficiency in posterior simulations (and our main theoretical results could easily be extended to such priors), but we see no compelling reason to discuss such generalizations in detail.
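Steps (1)-(3) translate directly into code. The following is a minimal sketch in Python, not the authors' implementation: the hierarchy prior is truncated to a finite probability vector nu_probs (an illustrative simplification of ours), and f is any vectorized map from [0,1] to [0,1].

    import numpy as np

    def sample_prior_step_function(nu_probs, rng):
        # Draw a step function from the uniform mixture prior pi^nu:
        # (1) M ~ nu; (2) M split points i.i.d. uniform; (3) M+1 heights i.i.d. uniform.
        m = rng.choice(len(nu_probs), p=nu_probs)
        u = np.sort(rng.uniform(size=m))
        w = rng.uniform(size=m + 1)
        return u, w

    def evaluate_step(u, w, x):
        # The step function takes the value w[i] on the cell J_i(u).
        return w[np.searchsorted(u, x)]

    def sample_data(f, n, rng):
        # Data stream under P_f: X_i ~ Uniform[0,1]; Y_i | X_i = x ~ Bernoulli(f(x)).
        x = rng.uniform(size=n)
        y = (rng.uniform(size=n) <= f(x)).astype(int)
        return x, y

    rng = np.random.default_rng(0)
    u, w = sample_prior_step_function(np.full(10, 0.1), rng)   # nu = uniform on {0,...,9}
    x, y = sample_data(lambda t: evaluate_step(u, w, t), 200, rng)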
Let f be a binary regression function on [0,1], that is, a Borel-measurable function f : [0,1] → [0,1]. We shall assume that under P_f the data (X_i, Y_i) are i.i.d. random vectors, with X_i uniformly distributed on [0,1] and Y_i, given X_i = x, Bernoulli-f(x). (Our main result would also hold if the covariate distribution were not uniform but any other distribution giving positive mass to all intervals of positive length.) Let Q^ν = ∫ P_f dπ^ν, and denote by Q^ν(· | F_n) the posterior distribution on step functions given the first n observations (X_i, Y_i). The main result of the paper is as follows.
Theorem 1. Assume that the hierarchy prior ν is not supported by a finite subset of the integers. Then for every binary regression function f ≢ 1/2, the Q^ν-Bayes procedure is L¹-consistent at f, that is, for every ε > 0,

(1) lim_{n→∞} P_f{ Q^ν({g : ‖g − f‖₁ > ε} | F_n) > ε } = 0.
The restriction f ≢ 1/2 arises for precisely the same reason as in [6], namely, that this exceptional function is the prior mean of the regression function. See [6] for further discussion.
Theorem 1 implies that the uniform mixture priors enjoy the same consistency as do the Diaconis-Freedman priors [6]. This is not exactly unexpected, but neither should it be considered a priori obvious: as the proof will show, there are substantial differences between the uniform mixture priors and those of Diaconis and Freedman. As noted above, the possibility of inconsistency arises because the posteriors may favor models that are "over-fit". Since the uniform mixture priors allow the step-function discontinuities to arrange themselves in favorable (but atypical for uniform samples) configurations vis-a-vis the data, the danger of over-fitting would seem, at least a priori, greater than for the Diaconis-Freedman priors. In fact, the bulk of the proof (sections 4-5) will be devoted to showing that there is no excessive accumulation of posterior mass in the end zone (the set of step functions where the number of discontinuities grows linearly with the number of data points), where over-fitting occurs.
It should be noted that the sufficient conditions for consistency given in [1] can be specialized to the uniform mixture priors. Unfortunately, these sufficient conditions require that the hierarchy prior ν satisfy

Σ_{k≥m} ν_k ≤ m^{−mC}

for some C > 0 (see [3]). Such a severe restriction on the tail of the hierarchy prior certainly prevents the accumulation of posterior mass in the end zone, but at the possible cost of having the posterior favor models that are under-fit. Preliminary analysis seems to indicate that when the true regression function is smooth, more rapid posterior concentration takes place when the hierarchy prior has a rather long tail.
1.3. Overfitting and Large Deviations Problems. The problem of excessive accumulation of posterior mass in the end zone is, as it turns out, naturally tied up with a large-deviations problem connected to the model: this is the most interesting mathematical feature of the paper. (A similar large-deviations problem occurs in [6], but there it reduces easily to the classical Cramér LD theorem for sums of i.i.d. random variables.) Roughly, we will show in section 4 that as the complexity m of the model (here, the number of discontinuities in the step function) and the sample size n tend to ∞ in such a way that m/n → α > 0, the posterior mass of the set of all step functions with complexity m decays at a precise exponential rate ψ(α) in n. The mathematical issues involved in this program are reminiscent of those encountered in rigorous statistical mechanics, specifically in connection with the existence of thermodynamic limits (see, for example, [20], chapters 2-3): the connection is that the log-likelihood serves as a kind of Hamiltonian on configurations of step function discontinuities. Consistency (1) follows from the fact (proved in section 5) that ψ(α) is uniquely maximized at α = 0, as this, in essence, implies that the posterior is concentrated on models of small complexity relative to the sample size. That posterior concentration in the small-complexity regime must occur in L¹-neighborhoods of the true regression function follows by routine arguments – see section 3.
We expect (and hope to show in a subsequent paper) that for large classes of hierarchical priors, in a variety of problems, the critical determinant of the consistency of Bayes procedures will prove to be the rate functions in associated large deviations problems. The template of the analysis is as follows: Let

(2) π = Σ_{m=0}^∞ ν_m π_m

be a hierarchical prior obtained by mixing priors π_m of "complexity" m. Let Q and Q_m be the probability distributions on the space of data sequences gotten by mixing with respect to π and π_m, respectively; and let Q(· | F_n) and Q_m(· | F_n) be the corresponding posterior distributions given the information in the σ-field F_n. Then by Bayes' formula,

(3) Q(· | F_n) = { Σ_{m=0}^∞ ν_m Z_{m,n} Q_m(· | F_n) } / { Σ_{m=0}^∞ ν_m Z_{m,n} },

where Z_{m,n} are the "predictive probabilities" for the data in F_n based on the model Q_m (see section 2.2 for more detail in the binary regression problem). This formula makes apparent that the relative sizes of the predictive probabilities Z_{m,n} determine where the mass in the posterior Q(· | F_n) is concentrated. The large deviations problem is to show that as m, n → ∞ in such a way that m/n → α,

(4) Z_{m,n}^{1/n} → exp{ψ(α)}

in P_f-probability, for an appropriate nonrandom rate function ψ(α), and to show that ψ(α) is uniquely maximized at α = 0. This, when true, will imply that most of the posterior mass will be concentrated on "models" π_m with small complexity m relative to the sample size n, where overfitting does not occur.
1.4. Choice of Topology. The use of the L¹-metric (equivalently, any L^p-metric, 0 ≤ p < ∞) in measuring posterior concentration, as in (1), although in many ways natural, may not always be appropriate. Posterior concentration relative to the L¹-metric justifies confidence that, for a new random sample of individuals with covariates uniformly distributed on [0,1], the responses will be reasonably well-predicted by a regression function sampled from the posterior; but it would not justify similar confidence for a random sample of individuals all with covariate (say) x = .47. For this, posterior concentration in the sup-norm metric would be required. We do not yet know if consistency holds in the sup-norm metric, for either the uniform mixture priors or the Diaconis-Freedman priors, even for smooth f; but we conjecture that it does. We hope to return to this issue in a subsequent paper.
2. Preliminaries
2.1. Data. A (binary) regression function is a Borel measurable function f : [0,1] → [0,1], or, more generally, f : J → [0,1] where J is an interval. For each binary regression function f, let P_f be a probability measure on a measurable space supporting a "data stream" {(X_n, Y_n)}_{n≥1} such that under P_f the random variables X_n are i.i.d. Uniform-[0,1] and, conditional on σ({X_n}_{n≥1}), the random variables Y_n are independent Bernoullis with conditional means

(5) E_f(Y_n | σ({X_m}_{m≥1})) = f(X_n).

(In several arguments below it will be necessary to consider alternative distributions F for the covariates X_n. In such cases we shall adopt the convention of adding the subscript F to relevant quantities; thus, for instance, P_{f,F} would denote a probability distribution under which the covariates X_n are i.i.d. F, and the conditional distribution of the responses Y_n is the same as under P_f.) We shall assume when necessary that probability spaces support additional independent streams of uniform and exponential r.v.s (and thus also Poisson processes), so that auxiliary randomization is possible. Generic data sets (values of the first n pairs (x_i, y_i)) will be denoted (x,y), or (x,y)_n to emphasize the sample size; the corresponding random vectors will be denoted by the matching upper case letters (X,Y). For any data set (x,y) and any interval J ⊂ [0,1], the number of successes (y_i = 1), the number of failures (y_i = 0), and the total number of data points with covariate x_i ∈ J will be denoted by N^S(J), N^F(J), and N(J) = N^S(J) + N^F(J).
In certain comparison arguments, it will be convenient to have data streams for different regression functions defined on a common probability space (Ω, F, P). This may be accomplished by the usual device: Let {X_n}_{n≥1} and {V_n}_{n≥1} be independent, identically distributed Uniform-[0,1] random variables, and set

(6) Y_n^f = 1{V_n ≤ f(X_n)}.
2.2. Priors on Regression Functions. The prior distributions on regression functions considered in this paper are probability measures on the set of step functions with finitely many discontinuities. Points of discontinuity, or "split points", of step functions will be denoted by u_i, and step heights by w_i. Each vector u = (u_1, u_2, …, u_m) of split points induces a partition of the unit interval into m + 1 subintervals (or "cells") J_i = J_i(u). Denote by π_u the probability measure on step functions with discontinuities u_i that makes the step height random variables W_i (that is, the values w_i on the intervals J_i(u)) independent and uniformly distributed on [0,1]. For each nonnegative integer m, define π_m to be the uniform mixture of the measures π_u over all split point vectors u of length m, that is,

(7) π_m = ∫_{u∈(0,1)^m} π_u du.

It is, of course, possible to mix against probability distributions G on [0,1] other than the uniform, and in some arguments it will be necessary for us to do so. The priors of primary interest – those considered in Theorem 1 – are mixtures of the measures π_m against hierarchy priors ν on the nonnegative integers:

(8) π^ν = Σ_{m=0}^∞ ν_m π_m.

Each of the probability measures π_u, π_m, and π^ν induces a corresponding probability measure on the space of data sequences by mixing:

(9) Q_u = ∫ P_f dπ_u(f),
(10) Q_m = ∫ P_f dπ_m(f), and
(11) Q^ν = ∫ P_f dπ^ν(f).

Observe that Q_m is the uniform mixture of the measures Q_u over split point vectors u of size m, and Q^ν is the ν-mixture of the measures Q_m.
For any data sample (x,y), the posterior distribution Q(· | (x,y)) under any of the measures Q_u, Q_m, or Q^ν is the conditional distribution on the set of step functions given that (X,Y) = (x,y). The posterior distribution Q_u(· | (x,y)) can be explicitly calculated: it is the distribution that makes the step height r.v.s W_i independent, with Beta-(N_i^S, N_i^F) distributions, where N_i^S = N^S(J_i(u)) and N_i^F = N^F(J_i(u)) are the success/failure counts in the intervals J_i of the partition induced by u. Thus, the joint density of the step heights (relative to product Lebesgue measure on the cube [0,1]^{m+1}) is

(12) q_u(w | (x,y)) = Z_u(x,y)^{−1} ∏_{i=0}^m w_i^{N_i^S} (1 − w_i)^{N_i^F},

where the normalizing constant Z_u(x,y), henceforth called the Q_u-predictive probability for the data sample (x,y), is given by

(13) Z_u(x,y) = ∫_{w∈[0,1]^{m+1}} ∏_{i=0}^m w_i^{N_i^S} (1 − w_i)^{N_i^F} dw = ∏_{i=0}^m B(N_i^S, N_i^F), where

(14) B(m,n) = { (m + n + 1) C(m + n, m) }^{−1}.

(This is not the usual convention for the arguments of the beta function, but will save us from a needless proliferation of +1s.)
The posterior distributionsQm(· | (x,y)) and corresponding
predictive probabilities Zm(x,y) are relatedto Qu(· | (x,y)) and
Zu(x,y) as follows:
(15) Qm(· | (x,y)) ={
∫
u∈[0,1]mQu(· | (x,y))Zu(x,y) d(u)
}
/
Zm(x,y)
where
Zm(x,y) =
∫
u∈(0,1)mZu(x,y) d(u)(16)
=
∫
u∈(0,1)m
m∏
i=0
B(NSi , NFi ) d(u).
(Note: The dependence of the integrand on u, via the values of
the suc-cess/failure counts NSi , N
Fi , is suppressed.) In general the last integral can-
not be evaluated in closed form, unlike the integral (13) that
defines theQu−predictive probabilities. This, as we shall see in
sections 4-5, will makethe mathematical analysis of the posteriors
considerable more difficult thanthe corresponding analysis for
Diaconis-Freedman priors.
Note for future reference (sec. 5) that the predictive probabilities Z_m are related to likelihood ratios dQ_m/dP_f: In particular, when f ≡ p is constant,

(17) Z_m((X,Y)_n) = p^{N^S} (1 − p)^{N^F} (dQ_m/dP_p)|_{F_n},

where F_n is the σ-algebra generated by the first n data points, and N^S, N^F are the numbers of successes and failures in the entire data set (X,Y)_n.
Finally, the posterior distribution Q^ν(· | (x,y)), whose asymptotic behavior is the main concern of this paper, is related to the posterior distributions Q_m(· | (x,y)) by Bayes' formula:

(18) Q^ν(· | (x,y)) = { Σ_{m=0}^∞ ν_m Z_m(x,y) Q_m(· | (x,y)) } / { Σ_{m=0}^∞ ν_m Z_m(x,y) }.
The goal of sections 4-5 will be to show that for large samples (X,Y)_n, under P_f, the predictive probabilities Z_{αn}((X,Y)_n) are of smaller exponential magnitude for α > 0 than for α = 0. This will imply that the posterior concentrates in the region m ≪ n, where the number of split points is small compared to the number of data points.

Caution: Note that π_m and π_u have different meanings, as do Z_m and Z_u, and Q_u and Q_m. The reader should have no difficulty discerning the proper meaning, by context or careful font analysis.
2.3. Transformations. Part of the analysis to follow will depend crucially on the invariance under monotone transformations of the predictive probabilities (16). We have assumed that the covariates X_n are uniformly distributed on [0,1]; and, in constructing the priors π_m and π^ν, we have used uniform mixtures on the locations of the split points u_i. This is only for convenience: clearly, the covariate space could be relabelled by any homeomorphism without changing the nature of the estimation problem. Thus, if the data sample (x,y) were changed to (G^{−1}x, y), where G is a continuous, strictly increasing c.d.f., and if G-mixtures rather than uniform mixtures were used in building the priors, then the predictive probabilities would be unchanged:

(19) Z_{m,G}(G^{−1}x, y) = Z_m(x,y),

where Z_{m,G} denotes the predictive probability for the transformed data relative to priors built using G-mixtures. (This follows easily from the transformation formula for integrals.)

Two instances will be useful below. First, if (x,y)_n is a data sample of size n, with covariates x_i ∈ [0,1], and if G is the c.d.f. of the uniform distribution on the interval [0, n] (so that G^{−1} is just multiplication by n), then the transformation has the effect of standardizing the spacings between data points and between split points. This transformation will be used in section 5. Second, if (X,Y) is a data stream with covariates X_i uniformly distributed on [0,1], then the subsequence obtained by keeping only those pairs (X_i, Y_i) such that X_i ∈ J, for some interval J, has the joint distribution of a data stream in which the covariates are i.i.d. uniform on J, and is independent of the corresponding subsequence gotten by keeping only those (X_i, Y_i) such that X_i ∈ J^c. This will be of crucial importance in section 4, where it will be used to show that the predictive probability Z_{[αn]}((X,Y)_n) approximately splits as a product of two independent predictive probabilities, one for each of the intervals [0, .5] and [.5, 1].
2.4. Beta Function and Beta Distributions. Because the posterior distributions (12) of the step height random variables are Beta distributions, certain elementary properties of these distributions and the corresponding normalizing constants B(m,n) will play an important role in the analysis. The behavior of the Beta function for large arguments is well understood, and easily deduced from Stirling's formula. Similarly, the asymptotic behavior of the Beta distributions follows from the fact that these are the distributions of uniform order statistics:

Beta Concentration Property. For each ε > 0 there exists k(ε) < ∞ such that for all index pairs (m,n) with weight m + n > k(ε), (a) the Beta-(m,n) distribution puts all but at most ε of its mass within ε of m/(m+n); and (b) the normalization constant B(m,n) satisfies

(20) | (m + n)^{−1} log B(m,n) + H(m/(m+n)) | < ε,

where H(x) is the Shannon entropy, defined by

(21) H(x) = −x log x − (1 − x) log(1 − x).

Note that the binomial coefficient in the expression (14) is bounded above by 2^{m+n}, so it follows that B(m,n) ≥ 4^{−m−n}. Thus, by equation (16), for any data sample (x,y) of size n,

(22) Z_{m,G}(x,y) ≥ 4^{−n}.

Some of the arguments in section 4 will require an estimate of the effect on the integral (16) of adding another split point. This breaks one of the intervals J_i into two, leaving all of the others unchanged, and so the effect on the integrand in (16) is that one of the factors B(N_i^S, N_i^F) is replaced by a product of two factors B(N_L^S, N_L^F) B(N_R^S, N_R^F), where the cell counts satisfy

N_L^S + N_R^S = N_i^S and N_L^F + N_R^F = N_i^F.

The following inequality shows that the multiplicative error made in this replacement is bounded by the overall sample size:

(23) B(N_i^S, N_i^F) / { B(N_L^S, N_L^F) B(N_R^S, N_R^F) } ≤ N_i^S + N_i^F.
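Both the Stirling-type approximation (20) and the splitting inequality (23) are easy to check numerically with exact integer arithmetic. A small self-contained check (ours, not from the paper):

    from math import comb, log

    def log_B(m, n):
        # The paper's convention (14): B(m, n) = 1 / ((m + n + 1) * C(m + n, m)).
        return -log((m + n + 1) * comb(m + n, m))

    def H(x):
        # Shannon entropy (21), with H(0) = H(1) = 0.
        return 0.0 if x in (0.0, 1.0) else -x * log(x) - (1 - x) * log(1 - x)

    # (20): log B(m, n) / (m + n) approaches -H(m / (m + n)) as m + n grows
    for m, n in [(5, 5), (50, 150), (500, 1500)]:
        print(m, n, log_B(m, n) / (m + n), -H(m / (m + n)))

    # (23): B(ns, nf) <= (ns + nf) * B(nsl, nfl) * B(nsr, nfr), over all splits of the counts
    ns, nf = 30, 20
    for nsl in range(ns + 1):
        for nfl in range(nf + 1):
            assert log_B(ns, nf) <= log(ns + nf) + log_B(nsl, nfl) + log_B(ns - nsl, nf - nfl) + 1e-9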
2.5. The Entropy Functional. We will show, in section 3 below, that in the "Middle Zone" where the number of split points is large but small compared to the number n of data points, the predictive probability decays at a precise exponential rate as n → ∞. The rate is the negative of the entropy functional H(f), defined by

(24) H(f) = ∫_0^1 H(f(x)) dx,

where H(x), for x ∈ (0,1), is the Shannon entropy defined by (21) above. The Shannon entropy function H(x) is uniformly continuous and strictly concave on [0,1], with second derivative bounded away from 0; it is strictly positive except at the endpoints, 0 and 1; and it attains a maximum value of log 2 at x = 1/2. The entropy functional H(f) enjoys similar properties:
Entropy Continuity Property. For each ε > 0 there exists δ > 0 so that

(25) ‖f − g‖₁ < δ ⟹ |H(f) − H(g)| < ε.

Entropy Concavity Property. Let f and g be binary regression functions such that g is an averaged version of f in the following sense: There exist finitely many pairwise disjoint Borel sets B_i such that {x : g(x) ≠ f(x)} = ∪_i B_i, and for each i such that |B_i| > 0,

(26) g(x) = ∫_{B_i} f(y) dy / |B_i| for all x ∈ B_i.

Then
(27) H(g) − H(f) ≥ −( max_{0<x<1} H″(x) / 2 ) ‖f − g‖₂² ≥ 0.
For each split-point vector u, let f̄_u denote the step function whose value on each interval J_i(u) is the mean value ∫_{J_i} f / |J_i| of f on that interval. Then by the Concavity Property,

(28) H(f̄_u) ≥ H(f),

with strict inequality unless f = f̄_u a.e. Moreover, the difference is small if and only if f and f̄_u are close in L¹. This will be the case if all intervals J_i of the partition are small:
Lemma 1. For each binary regression function f and each ε > 0 there exists δ > 0 such that if |J_i| < δ for every interval J_i in the partition induced by u, then

(29) ‖f − f̄_u‖₁ < ε.

Proof. First, observe that the assertion is elementary for continuous regression functions, since continuity implies uniform continuity on [0,1]. Second, recall that continuous functions are dense in L¹[0,1], by Lusin's theorem; thus, for each regression function f and any η > 0 there exists a continuous function g : [0,1] → [0,1] such that ‖f − g‖₁ < η. It then follows by the elementary inequality |∫ h| ≤ ∫ |h| that for any vector u of split points, ‖f̄_u − ḡ_u‖₁ < η. Finally, use η = ε/3, and choose δ so that for the continuous function g and any u that induces a partition whose intervals are all of length < δ,

‖g − ḡ_u‖₁ < η.

Then by the triangle inequality for L¹,

‖f − f̄_u‖₁ ≤ ‖g − f‖₁ + ‖g − ḡ_u‖₁ + ‖f̄_u − ḡ_u‖₁ ≤ 3η = ε. □
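The Concavity Property and Lemma 1 can also be seen numerically: averaging f over the cells of a random partition raises the entropy functional, and the gap closes as the cells shrink. A hedged illustration (the grid discretization and the example f are ours):

    import numpy as np

    def H_scalar(p):
        # Shannon entropy (21), with 0 log 0 = 0.
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    def entropy_functional(f_vals):
        # H(f) = integral of H(f(x)) dx, approximated on a uniform grid over [0,1].
        return H_scalar(f_vals).mean()

    def cell_average(f_vals, grid, u):
        # f-bar_u: replace f by its mean value on each cell J_i(u).
        cells = np.searchsorted(np.sort(u), grid)
        out = np.empty_like(f_vals)
        for i in np.unique(cells):
            out[cells == i] = f_vals[cells == i].mean()
        return out

    rng = np.random.default_rng(1)
    grid = np.linspace(0.0, 1.0, 10001)
    f_vals = 0.5 + 0.4 * np.sin(2 * np.pi * grid)      # a smooth regression function
    for m in (2, 8, 32, 128):
        fbar = cell_average(f_vals, grid, rng.uniform(size=m))
        # (28): H(f-bar_u) >= H(f); by Lemma 1 the gap closes as the cells shrink
        print(m, entropy_functional(fbar), entropy_functional(f_vals))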
2.6. Empirical Distributions under P_f. Ultimately, the Consistency Theorem follows from the Law of Large Numbers for functionals of the data sequence (X,Y). In its simplest form, this states that for any fixed interval J, (a) the fraction N^S(J)/N(J) of successes among the data points falling in J converges (almost surely under P_f) to the mean value of f on J as the sample size n → ∞; and (b) the fraction N(J)/n of the entire sample that falls in J converges to |J| (the Lebesgue measure of J). In section 3, we will need the following weak uniform version of this statement.

Proposition 2. For any ε > 0 there exists κ = κ(ε) such that

(30) limsup_{n→∞} P_f{ |{x : sup_{|J|≥κ/n, x∈J} |N(J)/n − |J|| ≥ ε|J|}| ≥ ε } < ε and

(31) limsup_{n→∞} P_f{ |{x : sup_{|J|≥κ/n, x∈J} |N^S(J)/n − ∫_J f(x) dx| ≥ ε|J|}| ≥ ε } < ε.
(Here |J| denotes the Lebesgue measure of J.)

Proof. We shall outline the proof of (30) and leave that of (31) to the reader. First, note that the event in (30) involves only the covariates X_i, which are uniformly distributed on [0,1] regardless of the choice of f, and thus have the same distribution as under P. We claim that for each δ > 0 there exist C = C(δ) and n(δ) sufficiently large that for any sample size n ≥ n(δ),

(32) P{ sup_{1≥t≥C/n} |N([0,t]) − nt| / nt ≥ δ } < δ.

This follows by a routine Poissonization argument: If the sample size n is changed to a random sample size Λ, where Λ is independent of the data sequence (X,Y) and has the Poisson distribution with mean αn, then under P the process N([0,t]) is a Poisson process (in t) of rate nα. By the usual SLLN for the Poisson process, if C = C(δ) is sufficiently large then

P{ sup_{1≥t≥C/n} |N([0,t]) − αnt| / nt ≥ δ } < δ.

By choosing parameter values α₋ < 1 < α₊, one may bracket the original data sample of size n by Poissonized samples, and by choosing α₋ and α₊ sufficiently near 1, deduce the claim (32).

Next, observe that if the covariates X_i are "rotated" by an amount x (that is, if each X_i is replaced by X_i + x mod 1), their joint distribution is unchanged, as the uniform distribution on the circle is rotation-invariant. Therefore, (32) implies that for each x ∈ [0, 1 − C/n],

(33) P{ sup_{1−x≥t≥C/n} |N([x, x+t]) − nt| / nt ≥ δ } < δ.

Similarly, because the uniform distribution is invariant under the reflection mapping x ↦ 1 − x, for all x ≥ C/n,

(34) P{ sup_{x≥t≥C/n} |N([x−t, x]) − nt| / nt ≥ δ } < δ.

Let B = B(C, δ) be the set of all x for which the event in (33) occurs, and B′ the set of x for which the event in (34) occurs. Then by Fubini's theorem, the expected Lebesgue measures of the sets B and B′ are < δ, and hence, by Markov's inequality,

(35) P{|B| ≥ √δ} < √δ and P{|B′| ≥ √δ} < √δ.

The assertions (30)-(31) now follow, because for any point x that lies in an interval J of length at least κ/n for which |N(J)/n − |J|| ≥ ε|J| (and not within distance κ/n of the endpoints 0, 1), it must be the case that either

|N([x−t, x])/n − t| ≥ εt/4 or |N([x, x+t])/n − t| ≥ εt/4

for some t ≥ εκ/(4n). □
3. Beginning and Middle Zones
Following [5], we designate three asymptotic "zones" where the predictive probabilities Z_m((X,Y)_n) decay at different exponential rates. These are determined by the relative sizes of m, the number of discontinuities of the step functions, and n, the sample size. The end zone is the set of pairs (m,n) such that m/n ≥ ε; this zone will be analyzed in sections 4-5, where we shall prove that the asymptotic decay of Z_m((X,Y)_n) is faster than in the middle zone, where K ≤ m ≤ εn for a large constant K. The beginning zone is the set of pairs (m,n) for which m ≤ K for some large K. A regression function cannot be arbitrarily well-approximated by step functions with a bounded number of discontinuities unless it is itself a step function, and so, as we will see, the asymptotic decay of Z_m((X,Y)_n) is generally faster in the beginning zone than in the middle zone.

In this section we analyze the beginning and middle zones, using the Beta Concentration Property, Lemma 1, and Proposition 2. In the beginning and middle zones, the number m of split points is small compared to the number n of data points, and so for typical split-point vectors u, most intervals in the partition induced by u will, with high probability, contain a large number of data points. Consequently, the law of large numbers applies in these intervals: together with the Beta Concentration Property, it ensures that the Q_u-posterior is concentrated in an L¹-neighborhood of f̄_u, and that the Q_u-predictive probability is roughly exp{−nH(f̄_u)}. The next proposition makes this precise.
Proposition 3. For each δ > 0 there exists ε > 0 such that the following is true: For all sufficiently large n, the P_f-probability is at least 1 − δ that for all m ≤ εn and all split-point vectors u of size m,

(36) Q_u({g : ‖g − f̄_u‖₁ ≥ δ} | (X,Y)_n) < δ and
(37) |n^{−1} log Z_u((X,Y)_n) + H(f̄_u)| < δ.

Proof. Let J_i = J_i(u) be the intervals in the partition induced by u. Fix κ = κ(δ) as in Proposition 2. If ε is sufficiently small, then for any split-point vector u of size m ≤ εn, the union of those J_i of length ≤ κ/n will have Lebesgue measure < δ: this follows by a trivial counting argument. Let B_u be the union of those J_i that are either of length ≤ κ/n or are such that either

(38) |N_i/n − |J_i|| ≥ δ|J_i| or |N_i^S/n − ∫_{J_i} f| ≥ δ|J_i|,

where N_i^S, N_i^F are the success/failure counts in the interval J_i(u) = J_i, and N_i = N_i^S + N_i^F. By Proposition 2, the P_f-probability of the event G^c that there exists a split-point vector u of size m ≤ εn for which the Lebesgue measure of B_u exceeds 2δ is less than ε, for all large n. But on the complementary event G, the inequality (36) must hold (with possibly different values of δ), by the Beta Concentration Property.

For the proof of (37), recall that by (13),

(39) n^{−1} log Z_u((X,Y)_n) = n^{−1} Σ_{i=0}^m log B(N_i^S, N_i^F),

where B(k,l) is the Beta function (using our convention for the arguments). By the Stirling approximation (20), each term of the sum for which N_i is large is well-approximated by −N_i H(N_i^S/N_i); and for each index i such that J_i ⊄ B_u, this in turn is well approximated by −n|J_i| H(f̄_u(J_i)), where f̄_u(J_i) is the average of f on the interval J_i. If B_u were empty, then (37) would follow directly.

By Proposition 2, P_f(G^c) < ε for all sufficiently large n. On the complementary event G, the Lebesgue measure of the set B_u of "bad" intervals J_i is < 2δ. Because the intervals J_i not contained in B_u must have approximately the expected frequency n|J_i| of data points, by (38), the number of data points in B_u cannot exceed 4δn, on the event G. Since 1 ≥ B(k,l) ≥ 4^{−k−l}, it follows that the summands in (39) for which J_i ⊂ B_u cannot contribute more than 4δ log 4 to the right side. The assertion (37) now follows (with a larger value of δ). □
Corollary 4. For each ε > 0 there exist δ > 0 and K < ∞ such that the following is true: For all sufficiently large n, the P_f-probability is at least 1 − δ that for all K ≤ m ≤ εn,

(40) Q_m({g : ‖g − f‖₁ ≥ δ} | (X,Y)_n) < δ and
(41) |n^{−1} log Z_m((X,Y)_n) + H(f)| < δ.

Proof. For large m (say, m ≥ K), most split-point vectors u (as measured by the uniform distribution on [0,1]^m) are such that all intervals J_i(u) in the induced partition are short – this follows, for instance, from the Glivenko-Cantelli theorem – and so, by Lemma 1, ‖f − f̄_u‖₁ will be small. Thus, if

B_m(α) := {u ∈ [0,1]^m : ‖f − f̄_u‖₁ ≥ α},

then B_m(α) has Lebesgue measure < β for all m ≥ K(β), for some K(β) < ∞. Inequality (36) of Proposition 3 implies that for each u in the complementary set B_m^c(α), the Q_u-posterior distribution is concentrated on a small L¹-neighborhood of f, provided α is small. Thus, to prove (40), it must be shown that the contribution to the Q_m-posterior (15) from split-point vectors u ∈ B_m(α) is negligible. For this, it suffices to show that the predictive probabilities Z_u((X,Y)_n) are not larger for u ∈ B_m(α) than for u ∈ B_m^c(α).

By Entropy Continuity, if α > 0 is sufficiently small then for all u ∈ B_m^c(α),

|H(f) − H(f̄_u)| < η.

Hence, by inequality (37) of Proposition 3, n^{−1} log Z_u((X,Y)_n) must be within 2η of −H(f) for all u ∈ B_m^c(α). On the other hand, by the Entropy Concavity Property (27), H(f̄_u) ≥ H(f) for all u, and in particular, for all u ∈ B_m(Cη),

H(f̄_u) > H(f) + 4η,

provided C > 0 is appropriately chosen. Consequently, by (37),

n^{−1} log Z_u((X,Y)_n) < −H(f) − 2η

for u ∈ B_m(Cη). Therefore, the primary contribution to the integral (15) must come from u ∈ B_m^c(Cη). This proves (40). Assertion (41) also follows, in view of the representation (16) for the predictive probability Z_m((X,Y)_n). □
The exponential decay rate of the predictive probabilities in the beginning zone depends on whether or not the true regression function f is a step function. If not, the decay is faster than in the middle zone; if so, the decay matches that in the middle zone, but the posterior concentrates in a neighborhood of f.

Corollary 5. If the regression function f is a step function with k discontinuities in (0,1), then for each m ≥ k and all ε > 0, the inequalities (40) and (41) hold with P_f-probability tending to 1 as the sample size n → ∞. If f is not a step function with fewer than K + 1 discontinuities, then there exists ε > 0 such that with P_f-probability → 1 as n → ∞,

(42) max_{m≤K} Z_m((X,Y)_n) < exp{−nH(f) − nε}.

Proof. If f is not a step function with fewer than K + 1 discontinuities, then by the Entropy Concavity Property, there exists ε > 0 so that H(f̄_u) is bounded below by H(f) + ε for all split-point vectors u of length m ≤ K. Hence, (42) follows from (37), by the same argument as in the proof of Corollary 4.

Suppose then that f is a step function with k discontinuities, that is, f = f̄_{u*} for some split-point vector u* of length k. For any other split-point vector u, the entropy H(f̄_u) can be no smaller than H(f), by the Entropy Concavity Property, and so (37) implies that for any m, n^{−1} log Z_m((X,Y)_n) cannot asymptotically exceed −H(f). But since f is a step function with k discontinuities, any open L¹ neighborhood of f has positive π_m-probability; consequently, by Entropy Continuity and (37), n^{−1} log Z_m((X,Y)_n) must asymptotically be at least −H(f). Thus, (41) holds with P_f-probability → 1 as n → ∞. Finally, (40) follows by the same argument as in the proof of Corollary 4. □
Corollaries 5 and 4 imply that, with P_f-probability → 1 as n → ∞, the Q^ν-posterior in the beginning and middle zones concentrates near f, and that the total posterior mass in the beginning and middle zones decays at the exponential rate H(f) as n → ∞. Thus, to complete the proof of Theorem 1, it suffices to show that the posterior mass in the end zone m ≥ δn decays at an exponential rate > H(f). This will be the agenda for the remainder of the article: see Proposition 6 below.
4. The End Zone
For the Diaconis-Freedman priors, the log-predictive probabilities simplify neatly as sums of independent random variables, and so their asymptotic behavior drops out easily from the usual WLLN. No such simplification is possible in our case: the integral in the second line of (16) does not admit further reduction, as an integral conspires to separate the log from the product inside. Thus, the analysis of the posterior in the End Zone will necessarily be somewhat more roundabout than in the Diaconis-Freedman case. The main objective is the following.

Proposition 6. For any Borel measurable regression function f ≢ 1/2 and all ε > 0, there exists a constant δ = δ(ε, f) > 0 such that

(43) lim_{n→∞} P_f{ sup_{m≥εn} log Z_m((X,Y)_n) ≥ n(−H(f) − δ) } = 0.

The key to proving this will be to establish that the predictive probabilities decay exponentially in n at a precise rate, depending on α > 0, for m/n → α > 0. (In fact, only a "Poissonized" version of this will be proved.) See Proposition 11 below for a precise statement.
4.1. Preliminaries: Comparison and Poissonization. Comparison arguments will be based on the following simple observation.

Lemma 7. Adding more data points (x_i, y_i) to the sample (x,y) decreases the value of Z_m(x,y).

Proof. For each fixed pair (u, w), adding data points to the sample increases (some of) the cell counts N_i^S, N_i^F, and therefore decreases the integrand in (13). □

Two "Poissonizations" will be used, one for the data sample, the other for the sample of split points. Let Λ(t) and M(t) be independent standard Poisson counting processes of intensity 1, jointly independent of the data stream (X,Y). Replacing the sample (X,Y)_n of fixed size n by a "Poissonized" sample (X,Y)_{Λ(n)} of size Λ(n) has the effect of making the success/failure counts in disjoint intervals independent random variables with Poisson distributions.

Lemma 8. For each ε > 0, the probability that

(44) Z_m((X,Y)_{Λ(n+εn)}) ≤ Z_m((X,Y)_n) ≤ Z_m((X,Y)_{Λ(n−εn)})

for all m approaches 1 as n → ∞.
Proof. For any ε > 0, P{Λ(n − εn) ≤ n ≤ Λ((1 + ε)n)} → 1 as n → ∞, by the weak law of large numbers. On this event, the inequalities (44) must hold, by Lemma 7. □

Poissonization of the number of split points entails mixing the priors π_m according to a Poisson hyper-prior. For any λ > 0, let π*_λ be the Poisson-λ mixture of the priors π_m, and let Q*_λ be the corresponding induced measure on data sequences (equivalently, Q*_λ is the Poisson-λ mixture of the measures Q_m). Then the Q*_λ-predictive probability for a data set (x,y) is given by

(45) Z*_λ(x,y) := Σ_{k=0}^∞ (λ^k e^{−λ} / k!) Z_k(x,y).
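Since (16) expresses Z_m as the expectation of Z_u over a uniformly distributed split vector u, a crude Monte Carlo average gives an unbiased (if slow) estimate, and (45) is then a Poisson-weighted sum. The sketch below is our own illustrative device, not the paper's computational method; it reuses log_Zu from the sketch in section 2.2 and assumes λ > 0:

    import numpy as np
    from math import lgamma, log

    def log_Zm_mc(m, x, y, rng, reps=500):
        # Monte Carlo estimate of log Z_m in (16): Z_m = E_u[Z_u], u ~ Uniform[0,1]^m.
        logs = np.array([log_Zu(rng.uniform(size=m), x, y) for _ in range(reps)])
        a = logs.max()
        return a + log(np.exp(logs - a).mean())   # log-mean-exp, avoids underflow

    def log_Zstar(lam, x, y, rng, kmax=None):
        # Poissonized predictive probability (45), truncating the Poisson-lam mixture.
        kmax = kmax if kmax is not None else int(lam + 10 * lam ** 0.5) + 10
        terms = []
        for k in range(kmax + 1):
            log_pois = -lam + k * log(lam) - lgamma(k + 1)   # log Poisson(lam) pmf at k
            terms.append(log_pois + log_Zm_mc(k, x, y, rng))
        terms = np.array(terms)
        a = terms.max()
        return a + log(np.exp(terms - a).sum())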
In some of the arguments to follow, an alternative representation of these Poissonized predictive probabilities as a conditional expectation will be useful. Thus, assume that on the underlying probability space (Ω, F, P) (or (Ω, F, P_f)) are defined i.i.d. uniform-[0,1] r.v.s U_n that are jointly independent of the data stream and the Poisson processes Λ, M. Then

(46) Z*_λ((X,Y^f)_{Λ(n)}) = E(β | (X,Y^f)_{Λ(n)})

where

(47) β = β(U_{M(λ)}; (X,Y^f)_{Λ(n)}) := ∏_{i=0}^{M(λ)} B(N_i^S, N_i^F)

and N_i^S, N_i^F are the success/failure cell counts for the data (X,Y^f)_{Λ(n)} relative to the partition induced by the split point sample U_{M(λ)}. Alternatively, if the regression function f is fixed,

(48) Z*_λ((X,Y)_{Λ(n)}) = E_f(β | (X,Y)_{Λ(n)}).

The effect of Poissonization on the number of split points is a bit more subtle than the effect on data, because there is no simple a priori relation between neighboring predictive probabilities Z_m(x,y) and Z_{m+1}(x,y). However, because the Poisson distribution with mean αn assigns mass at least C/√n to the value [αn] (where C = C(α) > 0 is continuous in α), the following is true.
Lemma 9. For each α > 0, with C = C(α) > 0 as above, for all n ≥ 1 and all data sets (x,y),

(49) Z_{[αn]}(x,y) ≤ C^{−1} √n Z*_{αn}(x,y).

Lemma 10. For each ε > 0 and A < ∞ there exists δ > 0 such that if µ, λ ≤ A and |µ − λ| ≤ δ, then for all n ≥ 1 and all data sets (x,y) of size n,

(50) Z*_{µn}(x,y) / Z*_{λn}(x,y) ≤ e^{nε}.
Proof. The inequality (22) implies that Z*_{λn}(x,y) ≥ 4^{−n}. Chernoff's large deviation inequality implies that if M has the Poisson distribution with mean λn, where λ ≤ A, then

P{M ≥ κn} ≤ e^{−γn},

where γ → ∞ as κ → ∞. Since Z_k(x,y) ≤ 1, it follows that the contribution to the sum (45) from terms indexed by k ≥ κn is of smaller exponential order of magnitude than that from terms indexed by k < κn, provided γ > log 4.

Consider the Poisson distributions with means µn, λn ≤ An: these are mutually absolutely continuous, and the likelihood ratio at the integer value k is

(µ/λ)^k e^{nλ−nµ}.

If k ≤ κn and |µ − λ| is sufficiently small then this likelihood ratio is less than e^{nε}. By the result of the preceding paragraph, only values of k ≤ κn contribute substantially to the expectations; thus, the assertion follows. □
4.2. Exponential Decay. The asymptotic behavior of the doubly Poissonized predictive probabilities is spelled out in the following proposition, whose proof will be the goal of sections 4.4-4.6 and section 5 below.

Proposition 11. For each Borel measurable regression function f and each α > 0 there exists a constant ψ_f(α) such that, as n → ∞,

(51) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) → ψ_f(α) in P_f-probability.

The function ψ_f(α) satisfies

(52) ψ_f(α) = ∫_0^1 ψ(f(x), α) dx,

where ψ(p, α) = ψ_p(α) is the corresponding limit for the constant regression function f ≡ p. The function ψ(p, α) is jointly continuous in p, α and satisfies

(53) lim_{α→∞} max_{p∈[0,1]} |ψ_p(α) + log 2| = 0 and
(54) ψ_p(α) < −H(p).

Note that the entropy inequality (54) extends to all regression functions f: that is, p may be replaced by f on both sides of (54). This follows from the integral formulas that define ψ_f(α) and H(f). The fact that this inequality is strict is crucially important to the consistency theorem. It will also require a rather elaborate argument: see section 5 below.

The case f ≡ p, where the regression function is constant, will prove to be the crucial one. In this case, the existence of the limit (51) is somewhat reminiscent of the existence of "thermodynamic limits" in formal statistical
mechanics (see [20], Ch. 3). Unfortunately, Proposition 11 cannot be reduced to the results of [20], as (i) the data sequence enters conditionally (thus functioning as a "random environment"); and, more importantly, (ii) the hypothesis of "tempered interaction" needed in [20] cannot be verified here. The limit (51) is also related to the "conditional LDP" of Chi [2], but again cannot be deduced from the results of that paper, because the log-predictive probability cannot be expressed as a continuous functional of the empirical distribution of split point/data point pairs.
4.3. Proof of Proposition 6. Before proceeding with the long and somewhat arduous proof of Proposition 11, we show how it implies Proposition 6. In the process, we shall establish the asymptotic behavior (53) of the rate function.
Lemma 12. For every δ > 0 there exists α_δ < ∞ such that for every α ≥ α_δ and every regression function f, with P_f-probability tending to one as n → ∞,

(55) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) ≤ −log 2 + δ and
(56) n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) ≥ −log 2 − δ.

Proof. Consider the spacings ξ_j between successive covariates of the Poissonized sample (X,Y)_{Λ(n)}: these are independent exponentials with mean 1/n. For ε > 0, let G = G_{δ,ε} be the event that at least (1 − δ)n of the spacings ξ_j are larger than ε/n. Call these spacings "fat". Since the spacings are independent exponentials with mean 1/n, the Glivenko-Cantelli theorem implies that there exists δ = δ(ε) → 0 as ε → 0 such that

lim_{n→∞} P(G_{δ,ε} ∩ {|Λ(n) − n| < εn}) = 1.

By elementary large deviations estimates for the Poisson process, given G, the probability that a random sample of M(αn) split points leaves more than 2δn of the fat spacings without a split point is less than exp{−nγ}, where γ = γ(α, ε, δ) → ∞ as α → ∞. But on the complement
of this event, at least (1 − 4δ)n of the intervals induced by the split points have exactly one data point. Thus, on the event G ∩ {|Λ(n) − n| < εn},

2^{−n+4δn} 4^{−4δn−εn} ≤ ∏_i B(N_i^S, N_i^F) ≤ 2^{−n+4δn}.

Observe that these inequalities hold regardless of the assignment y of values to the response variables. Thus, taking conditional expectations given the data (X,Y)_{Λ(n)}, we obtain

(60) (1 − e^{−nγ}) 2^{−n−4δn−2εn} ≤ Z*_{αn}((X,Y)_{Λ(n)}) ≤ 2^{−n+4δn} + e^{−nγ}.

Since γ can be made arbitrarily large by making α large, the assertions (55)-(56) follow. □
Proof of Proposition 6. Since H(f) < log 2 for every regression function f ≢ 1/2, Lemma 12 implies that to prove (43) it suffices to replace the supremum over m ≥ εn by the supremum over m ∈ [εn, ε^{−1}n]. Now for m in this range, the bound (49) is available; since log n is negligible compared to n, (49) implies that

sup_{εn≤m≤ε^{−1}n} log Z_m((X,Y)_n)

may be replaced by

sup_{ε≤α≤ε^{−1}} log Z*_{αn}((X,Y)_{Λ(n)})

in (43). Lemma 10 implies that this last supremum may be replaced by a maximum over a finite set of values α, and now (43) follows from assertions (51), (52), and (54) of Proposition 11. □
4.4. Constant Regression Functions. The shortest and simplest route to the convergence (51) is via subadditivity (more precisely, approximate subadditivity). Assume that f ≡ p is constant, and that the constant p ≠ 0, 1. In this case, the integrals (16) defining the predictive probabilities have a nearly "self-similar" structure: after Poissonization of the sample size and the number of split points, the integral in (16) almost factors perfectly into the product of two integrals, one over the data and split points in [0, 1/2], the other over (1/2, 1], of the same form (but on a different scale – see (19)). Unfortunately, this factorization is not exact, as the partition of the unit interval induced by the split points u_i includes an interval that straddles the demarcation point 1/2. However, the error can be controlled, and so the convergence (51) can be deduced from a simple approximate subadditive WLLN (Proposition 23 of the Appendix).
Lemma 14. Fix α > 0, and write ζ_n = log Z*_{αn}((X,Y)_{Λ(n)}). For each pair m, n ∈ N of positive integers there exist random variables ζ′_{m,m+n}, ζ″_{n,m+n}, and R_{m,n} such that

(a) ζ′_{m,m+n} and ζ″_{n,m+n} are independent;
(b) ζ_m and ζ′_{m,m+n} have the same law;
(c) ζ_n and ζ″_{n,m+n} have the same law;
(d) the random variables {R_{m,n}}_{m,n≥1} are identically distributed;
(e) E|R_{1,1}| < ∞;

and such that

(61) ζ_{m+n} ≥ ζ′_{m,m+n} + ζ″_{n,m+n} − R_{m,n}.

Together with the approximate subadditive WLLN (Proposition 23 of the Appendix), Lemma 14 yields:

Corollary 15. For each α > 0 and each p ∈ [0,1] there exists a constant ψ_p(α) such that, under P_p, as n → ∞,

(62) n^{−1} ζ_n → ψ_p(α) in probability.

Proof of Lemma 14. Split the covariate interval at b = m/(m + n), and write the integrand β as a product β′β″ of the factors B(N_i^S, N_i^F) indexed by the cells to the left and to the right of b, controlling the cell that straddles b as in the proof of (63) below. The conditional expectations of these products (given the data and the values of U′, U″) have the same distributions as exp{ζ_m} and exp{ζ_n}, respectively. Thus,

exp{ζ_{m+n}} ≥ δ exp{ζ′_m} exp{ζ″_n} 4^{−N**},

where ζ′_m and ζ″_n are independent, with the same distributions as ζ_m and ζ_n, respectively, and N** is Poisson with mean 2. Taking logarithms and setting R_{m,n} := log(1/δ) + N** log 4 gives (61). □
Remark. There is a similar (and in some respects simpler) approximate subadditivity relation among the distributions of the random variables ζ_n: For each pair m, n ≥ 1 of positive integers, there exist independent random variables ξ′_{m,m+n}, ξ″_{n,m+n} whose distributions are the same as those of ζ_m, ζ_n, respectively, such that

(63) ζ_{m+n} ≤ ξ′_{m,m+n} + ξ″_{n,m+n} + log Λ(m + n).

Corollary 15 can also be deduced from (63), but this requires a more sophisticated almost-subadditive WLLN than is proved in the Appendix, because the remainders log Λ(m + n) are not uniformly L¹ bounded, as they are in (61).
Proof of (63). Consider the effect on the integral (16) of adding a split point at b = m/(m + n): This breaks one of the intervals J_i into two, leaving all of the others unchanged, and so the effect on the integrand in (16) is that one of the factors B(N_i^S, N_i^F) is replaced by a product of two factors B(N_L^S, N_L^F) B(N_R^S, N_R^F). By (23), the multiplicative error in this replacement is bounded above by Λ(m + n). After the replacement, the factors in the integrand β = ∏ B(N_i^S, N_i^F) may be partitioned neatly into those indexed by intervals left of b and those indexed by intervals right of b: thus,

β = β′β″,

where β′, β″ are independent and have the same distributions as the products β occurring as integrands in the expectations defining exp{ζ_m} and exp{ζ_n}, respectively. Thus,

exp{ζ_{m+n}} ≤ exp{ξ′_m} exp{ξ″_n} Λ(m + n),

where ξ′_m = ξ′_{m,m+n} and ξ″_n = ξ″_{n,m+n} are independent and distributed as ζ_m and ζ_n, respectively. □
4.5. Piecewise Constant Regression Functions. The next step is to extend the convergence (51) to piecewise constant regression functions f. For ease of exposition, we shall restrict attention to step functions with a single discontinuity in (0,1); the general case involves no new ideas. Thus, assume that

f(x) = p_L for x ≤ b, f(x) = p_R for x > b, and p_L ≠ p_R.
Fix α > 0, and set

(64) Z*_n := Z*_{αn}((X,Y)_{Λ(n)}).

Lemma 16. With P_f-probability approaching one as n → ∞,

(65) Z*_n ≥ Z′_n Z″_n / n² and
(66) Z*_n ≤ 2n Z′_n Z″_n,

where for each n the random variables Z′_n, Z″_n are independent, with the same distributions as

(67) Z′_n =_L Z*_{αnb}((X,Y)_{Λ(bn)}) under P_{p_L}; Z″_n =_L Z*_{αn−αnb}((X,Y)_{Λ(n−nb)}) under P_{p_R}.
Proof. Consider the effect on Z*_n of placing an additional split point at b: this would divide the interval straddling b into two non-overlapping intervals L, R (for "left" and "right"), and so in the integrand β := ∏ B(N_i^S, N_i^F) the single factor B(N_*^S, N_*^F) representing the interval straddling b would be replaced by a product of two factors B(N_L^S, N_L^F) and B(N_R^S, N_R^F). As in the proof of the subadditivity inequality (63) in section 4.4, the factors of this modified product separate into those indexed by subintervals of [0, b] and those indexed by subintervals of [b, 1]; thus, the modified product has the form β′β″, where β′ and β″ are the products of the factors indexed by intervals to the left and right, respectively, of b. Denote by Z′_n and Z″_n the conditional expectations of β′ and β″ (given the data). These are independent random variables, and by the scaling relation (19) their distributions satisfy (67). By inequality (23), the multiplicative error in making the replacement is at most Λ(n); since the event Λ(n) ≥ 2n has probability tending to 0 as n → ∞, inequality (66) follows.

The reverse inequality (65) follows by a related argument. Let G be the event that the data sample (X,Y)_{Λ(n)} contains no points with covariate X_i ∈ [b, b + n^{−2}]. Since the covariates are generated by a Poisson point process with intensity n, the probability of G^c is approximately n^{−1}. Consider the integral (over all samples of split points) that defines Z*_n: this integral exceeds its restriction to the event A that there is a split point in [b, b + n^{−2}]. The conditional probability of A (given the data) is approximately αn^{−1}, and thus larger than n^{−2} for large n. On the event G ∩ A,

β = β′β″

holds exactly, as the split point in [b, b + n^{−2}] produces exactly the same bins as if the split point were placed at b. Moreover, conditioning on the event A does not affect the joint distribution (conditional on the data) of β′, β″ when G holds. Thus, the conditional expectation of the product, given A and the data, equals Z′_n Z″_n on the event G. □
Taking nth roots on each side of (65) and (66) and appealing to Corollary 15 now yields the following.

Corollary 17. If the regression function is piecewise constant, with only finitely many discontinuities, then the convergence (51) holds.
4.6. Thinning. Extension of the preceding corollary to arbitrary Borel measurable regression functions will be based on thinning arguments. Recall that if points of a Poisson point process of intensity λ(x) are randomly removed with location-dependent probability ρ(x), then the resulting "thinned" point process is again Poisson, with intensity λ(x) − ρ(x)λ(x). This principle may be applied to both the success (y = 1) and failure (y = 0) point processes in a Poissonized data sample. Because thinning at location-dependent rates may change the distribution of the covariates, it will be necessary to deal with data sequences with non-uniform covariate distribution. Thus, let (X,Y) be a data sample of random size with Poisson-λ distribution under the measure P_{f,F} (here f is the regression function, F is the distribution of the covariate sequence X_j). If successes (x, y = 1) are removed from the sample with probability ρ₁(x) and failures (x, y = 0) are removed with probability ρ₀(x), then the resulting sample will be a data sample of random size with the Poisson-µ distribution from P_{g,G}, where the mean µ, the regression function g, and the covariate distribution G satisfy

(68) µ g(x) G(dx) = (1 − ρ₁(x)) λ f(x) F(dx) and
     µ (1 − g(x)) G(dx) = (1 − ρ₀(x)) λ (1 − f(x)) F(dx).

By the monotonicity principle (Lemma 7), the predictive probability of the thinned sample will be no smaller than that of the original sample. Thus, thinning allows comparison of predictive probabilities for data generated by two different measures P_{f,F} and P_{g,G}. The first and easiest consequence is the continuity of the rate function.
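Thinning is straightforward to implement; the sketch below (names ours, not the authors') removes successes and failures at location-dependent rates, which by (68) carries a Poissonized P_{f,F}-sample to a Poissonized P_{g,G}-sample.

    import numpy as np

    def thin_sample(x, y, rho0, rho1, rng):
        # Remove failures (y = 0) with probability rho0(x) and successes (y = 1)
        # with probability rho1(x); the retained points form the thinned sample.
        v = rng.uniform(size=len(x))
        drop = np.where(y == 1, v < rho1(x), v < rho0(x))
        return x[~drop], y[~drop]

For instance, removing failures with a constant probability ε (rho0 ≡ ε, rho1 ≡ 0) realizes the comparison used in the proof of Lemma 18 below.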
Lemma 18. The rate function ψ_p(α) is jointly continuous in p, α.

Proof. Corollary 15 and Lemma 10 imply that the functions α ↦ ψ_p(α) are uniformly continuous in α. Continuity in p and joint continuity in p, α are now obtained by thinning. Let (X,Y) be a random sample of size Λ(n) ∼ Poisson-n from a data stream distributed according to P_p (that is, f ≡ p and F is the uniform-[0,1] distribution). Let (X,Y)′ be the sample obtained by randomly removing failures from the sample (X,Y), with probability ε. Then (X,Y)′ has the same distribution as a random sample of size Λ(n − εqn) (here q = 1 − p) from a data stream distributed according to P_{p′}, where p′ = p/(1 − εq). By the monotonicity principle (Lemma 7),

Z*_{αn}((X,Y)) ≤ Z*_{αn}((X,Y)′).

Taking nth roots and appealing to Corollary 15 shows that

ψ(p, α) ≤ (1 − εq) ψ(p/(1 − εq), α/(1 − εq)).

A similar inequality in the opposite direction can be obtained by reversing the roles of p and p/(1 − εq). The continuity in p of ψ(p, α) now follows from the continuity in α, and the joint continuity follows from the uniform continuity in α. □
Proposition 19. The convergence (51) holds for every Borel measurable regression function f.

Proof. By Corollary 17 above, the convergence holds for all piecewise constant regression functions with only finitely many discontinuities. The general case will be deduced from this by another thinning argument.

If f : [0,1] → [0,1] is measurable, then for each ε > 0 there exists a piecewise constant g : [0,1] → [0,1] (with only finitely many discontinuities) such that ‖f − g‖₁ < ε. If ε is small, then |f − g| must be small except on a set B of small Lebesgue measure; moreover, g may be chosen so that g = 1 wherever f is near 1, and g = 0 wherever f is near 0 (except on B). For such choices of ε and g there will exist removal rate functions ρ₀(x) and ρ₁(x) so that equation (68) holds with F = the uniform distribution on [0,1], G = the uniform distribution on [0,1] − B, and

|λ/µ − 1| < δ(ε)

for some constants δ(ε) → 0 as ε → 0. (Note: Requiring G to be the uniform distribution on [0,1] − B forces complete thinning in B, that is, ρ₀ = ρ₁ = 1 in B.) Thus, a Poissonized data sample distributed according to P_f may be thinned so as to yield a Poissonized data sample distributed according to P_{g,G}, in such a way that the overall thinning rate is arbitrarily small. It follows, by the monotonicity principle, that the Poissonized predictive probabilities for data distributed according to P_f are majorized by those for data distributed according to P_{g,G}, with a slightly smaller rate.

Now consider data (X,Y) distributed according to P_{g,G}: Since g is piecewise constant and G is a uniform distribution, the transformed data (G(X),Y) will be distributed as P_h, where h is again piecewise constant. Moreover, since the removed set B has small Lebesgue measure, the function h is close to the function g in the Skorohod topology, and so by Lemma 18, ψ_h ≈ ψ_g ≈ ψ_f. Because the convergence (51) has been established for piecewise constant regression functions h, it now follows from the monotonicity principle that

P_f{ n^{−1} log Z*_{αn}((X,Y)_{Λ(n)}) > ψ_f(α) + δ } → 0

for every δ > 0. This proves the upper (and for us, the more important) half of (51). The lower half may be proved by a similar thinning argument in the reverse direction. □
5. Proof of the Entropy Inequality (54)
This requires a change of perspective. Up to now, we have adopted the point of view that the covariates X_j and the split points U_i are generated by Poisson point processes in the unit interval of intensities n and αn, respectively. However, the transformation formula (19) implies that the predictive probabilities, and hence also their Poissonized versions, are unchanged if the covariates and the split points are rescaled by a common factor n. The rescaled covariates X̂_j := nX_j and split points Û_i := nU_i are then generated by Poisson point processes of intensities 1 and α on the interval [0, n]. Consequently, versions of all the random variables Z*_{[αn]}((X,Y)_{Λ(n)}) may be constructed from two independent Poisson processes of intensities 1 and α on the whole real line. The advantage of this new point of view is the possibility of deducing the large-n asymptotics from the Ergodic Theorem.
5.1. Reformulation of the inequality. To avoid cluttered notation, we shall henceforth drop the hats from the rescaled covariates and split points. Thus, assume that under both P = Pp and Q,
··· < X−1 < X0 < 0 < X1 < ··· and ··· < U−1 < U0 < 0 < U1 < ···
are the points of independent Poisson point processes X and U of intensities 1 and α, respectively, and let {Wi}i∈Z be a stream of uniform-[0,1] random variables independent of the point processes X, U. Denote by N(t) the number of occurrences in the Poisson point process X during the interval [0, t], and set Ji = (Ui, Ui+1]. Let {Yj}j∈Z be Bernoulli r.v.s distributed according to the following laws:
(A) Under P, the random variables Yj are i.i.d. Bernoulli-p, jointly independent of the Poisson point processes U, X.
(B) Under Q, the random variables Yj are conditionally independent, given X, U, W, with conditional distributions Yj ∼ Bernoulli-Wi, where i is the index of the interval Ji containing Xj.
Under Q the sequence {Yn}n∈Z is an ergodic, stationary sequence; for reasons that we shall explain below, we shall refer to this process as the rechargeable Polya urn. The distribution of (X,Y) ∩ [0, t] under Q is, after rescaling of the covariates by the factor t, the same as that of a data sample of random size Λ(t) under the Poisson mixture Q∗αt defined in section 4.1 above.
For (extended) integers −∞ ≤ m ≤ n define σ-algebras
F^{X,Y}_{m,n} = σ({Xj, Yj}_{m≤j≤n}) and F^Y_{m,n} = σ({Yj}_{m≤j≤n}).
If m, n are both finite then the restrictions of the measures P, Q to F^{X,Y}_{m,n} (and therefore also to F^Y_{m,n}) are mutually absolutely continuous. The Radon-Nikodym derivative on the smaller σ-algebra F^Y_{1,n} is just
(69) \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}} = \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)}
where
q(y_1, y_2, \dots, y_n) := Q\{Y_j = y_j \;\forall\; 1 \le j \le n\} and
p(y_1, y_2, \dots, y_n) := P\{Y_j = y_j \;\forall\; 1 \le j \le n\} = p^{\sum_{j=1}^n y_j} (1-p)^{n - \sum_{j=1}^n y_j}.
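In code, p is an explicit product, while the full sequence likelihood q requires integrating out the split points and has no comparably simple product form. Conditionally on the split points, however, the draws falling in a single interval J_i form an ordinary Polya urn, whose marginal likelihood is a Beta integral. A sketch of both pieces (helper names are my own, under the setup above):

```python
import numpy as np
from math import lgamma, log

def log_p(y, p):
    """log-likelihood of y under P: i.i.d. Bernoulli-p."""
    y = np.asarray(y)
    s = int(y.sum())
    return s * log(p) + (len(y) - s) * log(1 - p)

def log_q_one_interval(y):
    """log-likelihood under Q of the draws in a single interval J_i:
    conditionally i.i.d. Bernoulli-W_i with W_i uniform on [0, 1], so the
    marginal is the Beta integral k!(m-k)!/(m+1)! for k ones among m draws."""
    m, k = len(y), int(np.asarray(y).sum())
    return lgamma(k + 1) + lgamma(m - k + 1) - lgamma(m + 2)
```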
The Radon-Nikodym derivative on the larger σ-algebra F^{X,Y}_{1,n} cannot be so simply expressed, but it is closely related to the Poissonized predictive probability Z∗αn((X,Y)n) defined by (45). Define
(70) \hat{Z}_n := p(Y_1, Y_2, \dots, Y_n) \left( \frac{dQ}{dP} \right)_{\mathcal{F}^{X,Y}_{1,n}};
then by (17), the random variable Ẑ_{N(n)} has the same distribution, under P, as does the Poissonized predictive probability (45) under Pf, for any f. Hence, the convergence (51) must also hold for the random variables Ẑn:
Corollary 20. Under P, as n → ∞,
(71) n^{-1} \log \hat{Z}_n \xrightarrow{L^1} \psi_p(\alpha).
Therefore, to prove the entropy inequality (54) it suffices to prove that
(72) \lim_{n \to \infty} n^{-1} E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}} < 0.
Proof. The first assertion follows directly from (51) of Proposition 11. Thus, to prove the entropy inequality ψp(α) < −H(p) it suffices, in view of (70), to prove that (72) holds when the σ-algebra F^Y_{1,n} is replaced by F^{X,Y}_{1,n}. But the former is a sub-σ-algebra of the latter; since log is a concave function, Jensen's inequality implies that
E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^{X,Y}_{1,n}} \le E_P \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}}.
□
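For the record, the Jensen step can be written out as follows: under P, the density on the smaller σ-algebra is the conditional expectation of the density on the larger one, so concavity of the logarithm yields

```latex
\begin{aligned}
E_P \log \left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}}
  &= E_P\, E_P\!\left[\log \left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}} \,\Big|\, \mathcal F^{Y}_{1,n}\right] \\
  &\le E_P \log E_P\!\left[\left(\frac{dQ}{dP}\right)_{\mathcal F^{X,Y}_{1,n}} \,\Big|\, \mathcal F^{Y}_{1,n}\right]
   = E_P \log \left(\frac{dQ}{dP}\right)_{\mathcal F^{Y}_{1,n}}.
\end{aligned}
```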
5.2. The relative Shannon-McMillan-Breiman theorem. The existence of the limit (71) is closely related to the relative Shannon-McMillan-Breiman theorem studied by several authors [19], [14], [15], [18] in an entertaining series of papers not entirely devoid of errors and misconceptions. The sequence Y1, Y2, . . . is, under either measure P or Q, an ergodic stationary sequence of Bernoulli random variables. Thus, by the usual Shannon-McMillan-Breiman theorem [23], as n → ∞,
n^{-1} \log q(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } Q} -h_Q \quad \text{and} \quad n^{-1} \log p(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } P} -H(p)
where hQ is the Kolmogorov-Sinai entropy of the sequence Yj under Q. In general, of course, the almost sure convergence holds only for the probability measure indicated – see, for instance, [14] for an example where the first convergence fails under the alternative measure P. The relative Shannon-McMillan-Breiman theorem of [19] gives conditions under which the difference of the two averages
n^{-1} \log \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)} = n^{-1} \log \left( \frac{dQ}{dP} \right)_{\mathcal{F}^Y_{1,n}}
converges under P. In the case at hand, unfortunately, these conditions are not of much use: they essentially require the user to verify that
n^{-1} \log q(Y_1, Y_2, \dots, Y_n) \xrightarrow{\text{a.s. } P} C
for some constant C. Thus, it appears that [19] does not provide a shortcut to the convergence (51).
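As a quick empirical check of the second convergence (an illustration, not part of the paper's argument), one can average the log-likelihood of a long i.i.d. Bernoulli-p sample and compare it with −H(p):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.3, 200_000
Y = (rng.uniform(size=n) < p).astype(int)

# Running average of the log-likelihood; should approach -H(p).
log_lik = np.where(Y == 1, np.log(p), np.log(1 - p))
H = -(p * np.log(p) + (1 - p) * np.log(1 - p))
print(log_lik.mean(), -H)   # the two numbers should nearly agree
```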
5.3. The rechargeable Polya urn. In the ordinary Polya urn scheme, balls are drawn at random from an urn, one at a time; after each draw, the ball drawn is returned to the urn along with another of the same color. If initially the urn contains one red and one blue ball, then the limiting fraction Θ of red balls is uniformly distributed on the unit interval. The Polya urn is connected with Bayesian statistics in the following way: the conditional distribution of the sequence of draws given the value of Θ is that of i.i.d. Bernoulli-Θ random variables.
The rechargeable Polya urn is a simple variant of the scheme described above, differing only in that before each draw, with probability r > 0, the urn is emptied and then reseeded with one red and one blue ball. Unlike the usual Polya urn, the rechargeable Polya urn is recurrent, that is, if Vn := (Rn, Bn) denotes the composition of the urn after n draws, then Vn is a positive recurrent Markov chain on the state space N × N. Consequently, {Vn} may be extended to n ∈ Z in such a way that the resulting process is stationary. Let Yn denote the binary sequence recording the results of the successive draws (1 = blue, 0 = red). Clearly, this sequence has the same law as does the sequence Y1, Y2, . . . under the probability measure Q (with r = 1 − exp{−1/α}).
Lemma 21. For any
ε > 0 there exists m such that the following is true: For any finite sequence y−k, y−k+1, . . . , y0, the conditional distribution of Ym+1, Ym+2, . . . , Y2m given that Yi = yi for all −k ≤ i ≤ 0 differs from the Q-unconditional distribution by less than ε in total variation norm.
Proof. It is enough to show that the conditional distribution of Ym+1, . . . , Y2m given the composition V0 of the urn before the first draw differs from the unconditional distribution by less than ε. Let T be the time of the first regeneration (emptying of the urn) after time 0; then conditional on T = n, for any n ≤ m, and on V0, the distribution of Ym+1, . . . , Y2m does not depend on the value of V0. Thus, if m is sufficiently large that the probability
of having at least one regeneration event between the first and mth draws exceeds 1 − ε, then the conditional distribution given V0 differs from the unconditional distribution by less than ε in total variation norm. □
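The urn itself takes only a few lines to simulate. In the sketch below (function and variable names are my own), recharging with probability r before each draw reproduces the law of the draws under Q when r = 1 − exp{−1/α}:

```python
import numpy as np

rng = np.random.default_rng(4)

def rechargeable_polya(n_draws, r, rng):
    """Simulate n_draws from the rechargeable Polya urn: with probability r
    before each draw, the urn is reset to one red and one blue ball."""
    red, blue = 1, 1
    draws = np.empty(n_draws, dtype=int)
    for t in range(n_draws):
        if rng.uniform() < r:          # recharge: empty and reseed the urn
            red, blue = 1, 1
        if rng.uniform() < blue / (red + blue):
            draws[t] = 1               # blue drawn (recorded as 1)
            blue += 1                  # return it along with another blue
        else:
            draws[t] = 0
            red += 1
    return draws

alpha = 0.7
r = 1 - np.exp(-1 / alpha)             # matches the measure Q of section 5.1
Y = rechargeable_polya(100_000, r, rng)
print(Y.mean())                        # close to 1/2, by red/blue symmetry
```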
The construction of the sequence Y = Y1, Y2, . . . using the rechargeable Polya urn shows that this sequence behaves as a “factor” of a denumerable-state Markov chain (in terminology more familiar to statisticians, the sequence Y follows a “hidden Markov model”). Note that the original specification of the measure Q, in section 5.1 above, exhibits Y as a factor of the Harris-recurrent Markov chain obtained by adjoining to the state variable the current value of W. It does not appear that Yn can be represented as a function of a finite-state Markov chain; if it could, then results of Kaijser [12] would imply the existence of the limit
\lim_{n \to \infty} n^{-1} \log q(Y_1, Y_2, \dots, Y_n)
almost surely under P, and exhibit it as the top Lyapunov exponent of a sequence of random matrix products. Unfortunately, little is known about the asymptotic behavior of random operator products (see [13] and references therein for the state of the art), and so it does not appear that the inequality (72) can be obtained by an infinite-state extension of Kaijser's result.
5.4. Proof of (72). Since it is not necessary to establish the convergence of the integrands on the left side of (72), we shall not attempt to do so. Instead, we will proceed from the identity
(73) n^{-1} E_P \log \frac{q(Y_1, Y_2, \dots, Y_n)}{p(Y_1, Y_2, \dots, Y_n)} = n^{-1} \sum_{k=0}^{n-1} E_P \log \frac{q(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}{p(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}.
Because the random variables Yi are i.i.d. Bernoulli-p under P, the conditional probabilities p(y_{k+1} | y_1, y_2, . . . , y_k) must coincide with the unconditional probabilities p(y_{k+1}). Thus, the usual information inequality (Jensen's inequality), in the form E_f log(g(X)/f(X)) < 0 for distinct probability densities f, g, implies that for each k,
(74) E_P \log \frac{q(Y_{k+1} \mid Y_1, Y_2, \dots, Y_k)}{p(Y_{k+1})} \le 0,
with the inequality strict unless the Q-conditional distribution of Y_{k+1} given the past coincides with the Bernoulli-p distribution. Moreover, the left side of (74) will remain bounded away from 0 as long as the conditional distribution remains bounded away from the Bernoulli-p distribution (in any reasonable metric, e.g., the total variation distance). Thus, to complete the proof of (72) it suffices to establish the following lemma.
Lemma 22. There is no sequence of integers kn → ∞ along which
(75) ‖q(· | Y_1, Y_2, \dots, Y_{k_n}) − p(·)‖_{TV} −→ 0
in P-probability.
Proof. This is based on the fact that the sequence of draws Y1, Y2, . . . produced by the rechargeable Polya urn is not a Bernoulli sequence, that is, the Q- and P-distributions of the sequence Y1, Y2, . . . are distinct. Denote by qk the Q-conditional probability that Y_{k+1} = 1 given the values Y1, Y2, . . . , Yk. Suppose that q_{k_n} → p in P-probability; then by summing over successive values of the last l variables, it follows that q_{k_n − l} → p in P-probability for each fixed l ∈ N. We will show that this leads to a contradiction.
Consider the following method of generating binary random variables Y1, Y2, . . . , Y2m: first generate i.i.d. Bernoulli-p random variables Yj for −k ≤ j ≤ 0; then, conditional on their values, generate Y1 according to q_{k+1}; then, conditional on Y1, generate Y2 according to q_{k+2}; and so on. By the hypothesis of the preceding paragraph, there is a sequence kn → ∞ such that for any fixed m the joint distribution of Y1, Y2, . . . , Y2m converges to the product-Bernoulli-p distribution. But this contradicts the mixing property of the rechargeable Polya urn asserted by Lemma 21 above. □
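The failure of independence that drives this proof is visible in simulation. In the self-contained sketch below (my own illustration, with the same urn dynamics as above), consecutive draws are positively correlated, so the pair frequency P{Y_k = 1, Y_{k+1} = 1} exceeds the square of the marginal frequency:

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_sequence(n, r, rng):
    """Draws from the rechargeable Polya urn (1 = blue, 0 = red)."""
    red = blue = 1
    out = np.empty(n, dtype=int)
    for t in range(n):
        if rng.uniform() < r:          # recharge the urn
            red = blue = 1
        if rng.uniform() < blue / (red + blue):
            out[t], blue = 1, blue + 1
        else:
            out[t], red = 0, red + 1
    return out

Y = draw_sequence(1_000_000, r=0.2, rng=rng)
p1 = Y.mean()                      # marginal frequency of 1s (about 1/2)
p11 = (Y[1:] & Y[:-1]).mean()      # frequency of consecutive (1, 1) pairs
print(p11, p1 ** 2)                # p11 exceeds p1^2: the draws are dependent
```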
6. Appendix: An Almost Subadditive WLLN
The purpose of this appendix is to prove the simple variant of the Subadditive Ergodic Theorem required in section 4. For the original subadditive ergodic theorem of Kingman, see [16], and for another variant that is useful in applications to percolation theory see [17]. There are two novelties in our version: (a) the subadditivity relation is only approximate, with a random error; and (b) there is no measure-preserving transformation related to the sequence Sn.
Proposition 23. Let Sn be real random variables. Suppose that for each pair m, n ≥ 1 of positive integers there exist random variables S′_{m,m+n}, S′′_{n,m+n} and a nonnegative random variable R_{m,n} such that
(a) S′_{m,m+n} and S′′_{n,m+n} are independent;
(b) S′_{m,m+n} has the same distribution as Sm;
(c) S′′_{n,m+n} has the same distribution as Sn;
(d) the random variables {R_{m,n}}_{m,n≥1} are identically distributed;
(e) ER_{1,1} < ∞; and
(f) the approximate subadditivity relation
(76) S_{m+n} \le S'_{m,m+n} + S''_{n,m+n} + R_{m,n}
holds. If, in addition, the random variables Sn/n are uniformly integrable, then with γ := lim inf_{n→∞} ESn/n,
(77) S_n/n \longrightarrow \gamma in probability and in L1.
Proof of Proposition 23. Since the random variables S′_{m,m+n} and S′′_{n,m+n} are independent, with the same distributions as Sm and Sn, respectively, Kolmogorov's consistency theorem implies that the probability space may be enlarged so as to support additional random variables permitting recursion on the inequality (76). Here the simplest recursive strategy works: from a starting value n = km + r, reduce by m at each step. This leads to an inequality of the form
(78) S_{km+r} \le S^0_r + \sum_{j=1}^{k} S^j_m + \sum_{j=1}^{k} R_j
where the random variables {S^j_m}_{j≥1} are i.i.d., each with the same distribution as Sm, and the random variables Rj are identically distributed (but not necessarily independent), each with the law of R_{1,1} := R.
The weak law (77) is easily deduced from the inequality (78). Note first that the special case of (76) with m = 1, together with hypothesis (d), implies that ESn ≤ nER + nES1 < ∞ for every n ≥ 1, and so γ < ∞. Assume for definiteness that γ > −∞; the case γ = −∞ may be treated by a similar argument. Divide each side of (78) by km; as k → ∞,
\frac{S^0_r}{km} \xrightarrow{P} 0 \quad \text{and} \quad \frac{1}{km} \sum_{j=1}^{k} S^j_m \xrightarrow{P} \frac{ES_m}{m},
the latter by the usual WLLN. The WLLN need not apply to the sum Σ Rj, since the terms are not necessarily independent; however, since all of the terms are nonnegative and have the same expectation ER < ∞, Markov's inequality implies that for any ε > 0,
P\Big\{ \frac{1}{km} \sum_{j=1}^{k} R_j \ge \varepsilon \Big\} \le \frac{ER}{m \varepsilon}.
Thus, letting m → ∞ through a subsequence along which ESm/m → γ, we find that for any ε > 0,
\lim_{n \to \infty} P\{S_n \ge n\gamma + n\varepsilon\} = 0.
Since the r.v.s Sn/n are uniformly integrable, this implies that
\lim_{n \to \infty} P\{S_n \le n\gamma - n\varepsilon\} = 0,
because otherwise lim inf ESn/n < γ. This proves that Sn/n → γ in probability; in view of the uniform integrability of the sequence Sn/n, convergence in L1 follows. □
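Proposition 23 is easy to test numerically. In the toy instance below (an illustration I am supplying, not an example from the paper), S_n is the log of the operator norm of a product of i.i.d. Gaussian 2×2 matrices; splitting the product at m gives independent copies of S_m and S_n satisfying (76) with R ≡ 0, and S_n/n settles near the constant γ, the top Lyapunov exponent.

```python
import numpy as np

rng = np.random.default_rng(6)

def S(n, rng):
    """S_n = log of the operator norm of a product of n i.i.d. Gaussian
    2x2 matrices.  Submultiplicativity of the norm gives
    S_{m+n} <= S'_m + S''_n with independent summands, i.e. (76) with R = 0."""
    P, log_norm = np.eye(2), 0.0
    for _ in range(n):
        P = rng.normal(size=(2, 2)) @ P
        s = np.linalg.norm(P, 2)     # spectral norm; rescale to avoid overflow
        log_norm += np.log(s)
        P /= s
    return log_norm

for n in (100, 1_000, 10_000):
    print(n, S(n, rng) / n)          # S_n / n settles near the Lyapunov exponent
```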
Remark. Numerous variants of this proposition are true, and may be established by more careful recursions. Among these are WLLNs for random variables satisfying hypotheses such as those given in (63) above, where the remainders log Λ(m + n) are not identically distributed, but whose growth
is sublinear in m + n. For hints as to how such results may be approached, see [11].
References
[1] Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561.
[2] Chi, Z. (2001). Stochastic sub-additivity approach to the conditional large deviation principle. Ann. Probab. 29 1303–1328.
[3] Coram, M. (2002). Nonparametric Bayesian Classification. Ph.D. thesis, Stanford University.
[4] Diaconis, P. and Freedman, D. (1986). On inconsistent Bayes estimates of location. Ann. Statist. 14 68–87.
[5] Diaconis, P. and Freedman, D. A. (1993). Nonparametric binary regression: A Bayesian approach. Ann. Statist. 21 2108–2137.
[6] Diaconis, P. and Freedman, D. A. (1995). Nonparametric binary regression with random covariates. Probability and Mathematical Statistics 15 243–273.
[7] Freedman, D. and Diaconis, P. (1983). On inconsistent Bayes estimates in the discrete case. Ann. Statist. 11 1109–1118.
[8] Freedman, D. A. (1963). On the asymptotic behavior of Bayes' estimates in the discrete case. Ann. Math. Statist. 34 1386–1403.
[9] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
[10] Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian nonparametrics. Springer Series in Statistics, Springer-Verlag, New York.
[11] Hammersley, J. M. (1962). Generalization of the fundamental theorem on sub-additive functions. Proc. Cambridge Philos. Soc. 58 235–238.
[12] Kaijser, T. (1975). A limit theorem for partially observed Markov chains. Ann. Probab. 3 677–696.
[13] Karlsson, A. and Margulis, G. A. (1999). A multiplicative ergodic theorem and nonpositively curved spaces. Comm. Math. Phys. 208 107–123.
[14] Kieffer, J. C. (1973). A counterexample to Perez's generalization of the Shannon-McMillan theorem. Ann. Probab. 1 362–364.
[15] Kieffer, J. C. (1976). Correction to: “A counterexample to Perez's generalization of the Shannon-McMillan theorem” (Ann. Probab. 1 (1973), 362–364). Ann. Probab. 4 153–154.
[16] Kingman, J. F. C. (1973). Subadditive ergodic theory. Ann. Probab. 1 883–909. With discussion by D. L. Burkholder, Daryl Daley, H. Kesten, P. Ney, Frank Spitzer and J. M. Hammersley, and a reply by the author.
[17] Liggett, T. M. (1985). An improved subadditive ergodic theorem. Ann. Probab. 13 1279–1285.
[18] Perez, A. (1964). Extensions of Shannon-McMillan's limit theorem to more general stochastic processes. In Trans. Third Prague Conf. Information Theory, Statist. Decision Functions, Random Processes (Liblice, 1962). Publ. House Czech. Acad. Sci., Prague, 545–574.
[19] Perez, A. (1980). On Shannon-McMillan's limit theorem for pairs of stationary random processes. Kybernetika (Prague) 16 301–314.
[20] Ruelle, D. (1999). Statistical mechanics. World Scientific Publishing Co. Inc., River Edge, NJ. Rigorous results, Reprint of the 1989 edition.
[21] Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 4 10–26.
[22] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29 687–714.
[23] Walters, P. (1982). An introduction to ergodic theory, vol. 79 of Graduate Texts in Mathematics. Springer-Verlag, New York.
University of Chicago, Department of Statistics, 5734 University Avenue, Chicago IL 60637
E-mail address: [email protected]

University of Chicago, Department of Statistics, 5734 University Avenue, Chicago IL 60637
E-mail address: [email protected]