Class Notes of Stat 8112

1 Bayes estimators

Here are three methods of estimating parameters: (1) MLE; (2) moment method; (3) Bayes method.

An example of the Bayes argument: let X ∼ F(x|θ), θ ∈ H. We want to estimate g(θ) ∈ R. Suppose t(X) is an estimator and look at MSE_t(θ) = E_θ(t(X) − g(θ))². The problem is that MSE_t(θ) depends on θ, so minimizing it at one point may cost at other points. The Bayes idea is to average MSE_t(θ) over θ and then minimize over t. Thus we pretend to have a distribution for θ, say π, and look at

H(t) = E(t(X) − g(θ))²,

where E now refers to the joint distribution of X and θ; that is,

E(t(X) − g(θ))² = ∫∫ (t(x) − g(θ))² F(dx|θ) π(dθ).   (1.1)

Next pick t(X) to minimize H(t). The minimizer is called the Bayes estimator.

LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability space and H is the set of functions from R to R. Then

min_{h∈H} E(Z − h(W))² = E(Z − E(Z|W))².

That is, the minimizer above is E(Z|W).

Proof. Note that

E(Z − h(W))² = E(Z − E(Z|W) + E(Z|W) − h(W))²
= E(Z − E(Z|W))² + E(E(Z|W) − h(W))² + 2E{(Z − E(Z|W))(E(Z|W) − h(W))}.

Conditioning on W, we see that the cross term is zero. Thus

E(Z − h(W))² = E(Z − E(Z|W))² + E(E(Z|W) − h(W))².
One can see that µ→ µ0 as τ0 → 0 and µ→ X as τ0 → ∞.
Example. Let X ∼ N_p(θ, I_p) and π(θ) = N_p(0, τI_p), τ > 0. The Bayes estimator is the posterior mean E(θ|X). We claim that

(X, θ) ∼ N( (0, 0), [ (1+τ)I_p, τI_p ; τI_p, τI_p ] ).

Indeed, by the prior we know that Eθ = 0 and Cov(θ) = τI_p. By a conditioning argument one can verify that E(Xθ′) = τI_p and E(XX′) = (1+τ)I_p. So by the conditional distribution of normal random variables,

θ|X ∼ N( (τ/(1+τ)) X, (τ/(1+τ)) I_p ).

So the Bayes estimator is t0(X) = τX/(1+τ).
The usual MVUE is t1(X) = X. It follows that

MSE_{t1}(θ) = E‖X − θ‖² = p,

which doesn't depend on θ. Now

MSE_{t0}(θ) = E_θ‖ τX/(1+τ) − θ ‖² = E_θ‖ (τ/(1+τ))(X − θ) − θ/(1+τ) ‖²
= (τ/(1+τ))² p + ‖θ‖²/(1+τ)²,   (1.2)

which goes to p as τ → ∞. What happens to the prior when τ → ∞? It "converges to" Lebesgue measure, the "uniform distribution over the real line," which carries no information (recall the information of X ∼ N_p(0, (1+τ)I_p) is p/(1+τ), and that of θ_i, 1 ≤ i ≤ p, i.i.d. uniform over [−τ, τ] is p/(2τ²); both go to zero and the latter goes faster than the former). This explains why the MSE of the former and the limiting MSE are identical.
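Formula (1.2) can be checked by simulation. The sketch below, with p, τ, θ and the seed chosen arbitrarily for illustration, compares a Monte Carlo estimate of the risk of t0(X) = τX/(1+τ) against the closed form (τ/(1+τ))²p + ‖θ‖²/(1+τ)².

```python
import numpy as np

rng = np.random.default_rng(0)
p, tau = 4, 2.0
theta = np.ones(p)  # an arbitrary fixed parameter point

# Monte Carlo estimate of E_theta || tau*X/(1+tau) - theta ||^2
n_sim = 200_000
X = theta + rng.standard_normal((n_sim, p))       # X ~ N_p(theta, I_p)
t0 = tau / (1 + tau) * X                          # Bayes estimator
mse_mc = np.mean(np.sum((t0 - theta) ** 2, axis=1))

# closed form (1.2)
mse_exact = (tau / (1 + tau)) ** 2 * p + np.sum(theta ** 2) / (1 + tau) ** 2
print(mse_mc, mse_exact)
```

With these values the closed form equals (2/3)²·4 + 4/9 = 20/9, and the Monte Carlo average agrees to a few decimal places.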
Now let

M = inf_t sup_θ E_θ‖t(X) − θ‖²,

called the minimax risk, and any estimator t* with

M = sup_θ E_θ‖t*(X) − θ‖²

is called a minimax estimator.
THEOREM 2 For any p ≥ 1, t0(X) = X is minimax.

Proof. First,

M = inf_t sup_θ E_θ‖t(X) − θ‖² ≤ sup_θ E_θ‖X − θ‖² = p.

Second,

M = inf_t sup_θ E_θ‖t(X) − θ‖² ≥ inf_t ∫ E_θ‖t(X) − θ‖² π_τ(dθ) ≥ (τ/(1+τ))² p

for any τ > 0, where π_τ(θ) = N(0, τI_p) and the last step is from (1.2). Then M ≥ p by letting τ → ∞. The above says that M = sup_θ E_θ‖X − θ‖² = p.
Question. Does there exist another estimator t1(X) such that
Eθ‖t1(X) − θ‖2 ≤ Eθ‖X − θ‖2
for all θ, with strict inequality for some θ? If so, we say t0(X) = X is inadmissible (because it can be matched at every θ by some estimator t and strictly beaten at some θ). Here is the answer:
(i) For p = 1, there is no such estimator. This was proved by Blyth in 1951.
(ii) For p = 2, there is no such estimator. This result was shown by Stein in 1961.
(iii) When p ≥ 3, Stein (1956) showed that there is such an estimator, which is called the James–Stein estimator.
Recall the density function of N(µ, σ²) is φ(x) = (√(2π) σ)⁻¹ exp(−(x − µ)²/(2σ²)).
LEMMA 1.3 (Stein's lemma). Let Y ∼ N(µ, σ²) and let g(y) be a function such that g(b) − g(a) = ∫_a^b g′(y) dy for all a and b and some function g′(y). If E|g′(Y)| < ∞, then

E g(Y)(Y − µ) = σ² E g′(Y).
Proof. Let φ(y) = (1/(√(2π) σ)) exp(−(y − µ)²/(2σ²)), the density of N(µ, σ²). Then φ′(y) = −φ(y)(y − µ)/σ². Thus

φ(y) = ∫_y^∞ ((z − µ)/σ²) φ(z) dz = −∫_{−∞}^y ((z − µ)/σ²) φ(z) dz.

We then have that

E g′(Y) = ∫_{−∞}^∞ g′(y) φ(y) dy
= ∫_0^∞ g′(y) [ ∫_y^∞ ((z − µ)/σ²) φ(z) dz ] dy − ∫_{−∞}^0 g′(y) [ ∫_{−∞}^y ((z − µ)/σ²) φ(z) dz ] dy
= ∫_0^∞ ((z − µ)/σ²) φ(z) [ ∫_0^z g′(y) dy ] dz − ∫_{−∞}^0 ((z − µ)/σ²) φ(z) [ ∫_z^0 g′(y) dy ] dz
= ( ∫_0^∞ + ∫_{−∞}^0 ) ((z − µ)/σ²) φ(z) (g(z) − g(0)) dz
= (1/σ²) ∫_{−∞}^∞ (z − µ) g(z) φ(z) dz = (1/σ²) E(Y − µ)g(Y).

Fubini's theorem is used in the third step; that the mean of a centered normal random variable is zero is used in the fifth step.
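Stein's identity is easy to sanity-check by Monte Carlo. The sketch below uses the test function g(y) = y³ (so g′(y) = 3y²) with µ, σ and the seed chosen arbitrarily; both sides should agree with the exact value 3µ²σ² + 3σ⁴ = 60 for µ = 1, σ = 2.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
Y = mu + sigma * rng.standard_normal(1_000_000)   # Y ~ N(mu, sigma^2)

g = Y ** 3                                        # g(y) = y^3, g'(y) = 3 y^2
lhs = np.mean(g * (Y - mu))                       # E g(Y)(Y - mu)
rhs = sigma ** 2 * np.mean(3 * Y ** 2)            # sigma^2 E g'(Y)
print(lhs, rhs)
```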
Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) − g(a) = ∫_a^b h(x) dx for any a < b. This does not mean that g′(x) = h(x) for all x. Actually, in this case g(x) is differentiable almost everywhere and g′(x) = h(x) a.e. under Lebesgue measure. For example, let h(x) = 1 if x is irrational and h(x) = 0 if x is rational. Then g(x) = x = ∫_0^x h(t) dt for any x ∈ R, and g′(x) = h(x) a.e. The following facts are from real analysis:
Fundamental Theorem of Calculus. If f(x) is absolutely continuous on [a, b], then f(x) is differentiable almost everywhere and

f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].

Another fact. Let f(x) be differentiable everywhere on [a, b] with f′(x) integrable over [a, b]. Then

f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].
Remark 2. What drives Stein's lemma is integration by parts: suppose g(x) is differentiable; then

E g(Y)(Y − µ) = ∫_R g(y)(y − µ)φ(y) dy = −σ² ∫_{−∞}^∞ g(y)φ′(y) dy
= σ² lim_{y→−∞} g(y)φ(y) − σ² lim_{y→+∞} g(y)φ(y) + σ² ∫_{−∞}^∞ g′(y)φ(y) dy.

It is reasonable to assume the two limits are zero. The last term is exactly σ² E g′(Y).
THEOREM 3 Let X ∼ N_p(θ, I_p) for some p ≥ 3. Define

δ_c(X) = ( 1 − c(p − 2)/‖X‖² ) X.

Then

E‖δ_c(X) − θ‖² = p − (p − 2)² E[ c(2 − c)/‖X‖² ].

Proof. Let g_i(x) = c(p − 2)x_i/‖x‖² and g(x) = (g_1(x), · · · , g_p(x)) for x = (x_1, x_2, · · · , x_p).
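The risk identity of Theorem 3 (and the resulting dominance of X when 0 < c < 2) can be checked numerically. In the sketch below, p, c, θ and the seed are arbitrary illustrative choices; both the simulated risk of δ_c and the right-hand side of the identity are estimated from the same samples.

```python
import numpy as np

rng = np.random.default_rng(2)
p, c = 5, 1.0                       # p >= 3; c = 1 gives the James-Stein form
theta = np.full(p, 0.5)
X = theta + rng.standard_normal((400_000, p))     # X ~ N_p(theta, I_p)
nrm2 = np.sum(X ** 2, axis=1)

delta = (1 - c * (p - 2) / nrm2[:, None]) * X     # shrinkage estimator
risk_mc = np.mean(np.sum((delta - theta) ** 2, axis=1))
risk_id = p - (p - 2) ** 2 * np.mean(c * (2 - c) / nrm2)
print(risk_mc, risk_id)
```

Both numbers land well below p = 5, which is the constant risk of the MLE X.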
THEOREM 5 (Neyman–Pearson lemma). Let X be a sample with pdf or pmf f(x|θ). Consider testing H0: θ = θ0 vs H1: θ = θ1, using a test with rejection region R that satisfies

x ∈ R if f(x|θ1) > k f(x|θ0) and x ∈ Rᶜ if f(x|θ1) < k f(x|θ0)   (4.1)

for some k ≥ 0, and

α = P_{θ0}(X ∈ R).   (4.2)
Then
(i) (Sufficiency). Any test satisfying (4.1) and (4.2) is a UMP level α test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then every UMP level α test is a size α test (satisfies (4.2)), and every UMP level α test satisfies (4.1) except on a set A with P_θ(X ∈ A) = 0 for θ = θ0 and θ1.
Proof. We will prove the theorem only for the case that f(x|θ) is continuous.
Thus the assertion (4.3) follows from (4.2) and the fact that k ≥ 0.
(ii) Suppose the test satisfying (4.1) and (4.2) has power function β(θ). By (i) the test is UMP. Let a second UMP level α test have power function β′(θ). Then β(θ1) = β′(θ1). Since k > 0, (4.4) implies β′(θ0) ≥ β(θ0) = α. So β′(θ0) = α.
Given a UMP level α test with power function β′(θ), by the part of (ii) just proved, β(θ1) = β′(θ1) and β(θ0) = β′(θ0) = α. Then (4.4) implies that the expectation there is zero. Being always nonnegative, (φ(x) − φ′(x))(f(x|θ1) − k f(x|θ0)) = 0 except on a set A of Lebesgue measure zero, which leads to (4.1) (by considering whether f(x|θ1) > k f(x|θ0) or not). Since X has a density, P_θ(X ∈ A) = 0 for θ = θ0 and θ = θ1.
Example. Let X ∼ Bin(2, θ). Consider H0: θ = 1/2 vs Ha: θ = 3/4. Note that

f(x|θ1) = C(2, x) (3/4)^x (1/4)^{2−x} > k C(2, x) (1/2)^x (1/2)^{2−x} = k f(x|θ0)

is equivalent to 3^x > 4k.
(i) The case k ≥ 9/4 and the case k < 1/4 correspond to R = ∅ and Ω, respectively. The corresponding UMP levels are α = 0 and α = 1.
(ii) If 1/4 ≤ k < 3/4, then R = {1, 2} and α = P(Bin(2, 1/2) ∈ {1, 2}) = 3/4.
(iii) If 3/4 ≤ k < 9/4, then R = {2}. The level is α = P(Bin(2, 1/2) = 2) = 1/4.
This example says that if we firmly want a size α test, then there are only two such α's: α = 1/4 and α = 3/4.
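The attainable sizes in this example can be verified directly by summing the Bin(2, 1/2) pmf over the two nontrivial rejection regions:

```python
from math import comb

def size(R, n=2, p=0.5):
    # P(X in R) under H0 for X ~ Bin(n, p)
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in R)

alpha_12 = size({1, 2})   # R = {1, 2}, for 1/4 <= k < 3/4
alpha_2 = size({2})       # R = {2},   for 3/4 <= k < 9/4
print(alpha_12, alpha_2)  # 0.75 and 0.25
```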
Example. Let X1, · · · , Xn be a random sample from N(θ, σ²) with σ known. Look at the test H0: θ = θ0 vs H1: θ = θ1, where θ0 > θ1. The density function is f(x|θ, σ) = (1/(√(2π) σ)) exp(−(x − θ)²/(2σ²)). Now f(x|θ1) > k f(x|θ0) is equivalent to X̄ < c for some constant c. Set α = P_{θ0}(X̄ < c). One can check easily that c = θ0 + σ z_α/√n, where z_α denotes the lower α-quantile of N(0, 1). So the UMP rejection region is

R = { X̄ < θ0 + (σ/√n) z_α }.
LEMMA 4.1 Let X be a random variable and let f(x) and g(x) be non-decreasing real functions. Then

E(f(X)g(X)) ≥ Ef(X) · Eg(X),

provided all three expectations are finite.

Proof. Let µ be the distribution of X. Note that (f(x) − f(y))(g(x) − g(y)) ≥ 0 for any x and y. Then

0 ≤ ∫∫ (f(x) − f(y))(g(x) − g(y)) µ(dx)µ(dy)
= 2 ∫ f(x)g(x) µ(dx) − 2 ∫∫ f(x)g(y) µ(dx)µ(dy)
= 2[ E(f(X)g(X)) − Ef(X) · Eg(X) ].

The desired result follows.
Let f(x|θ), θ ∈ R, be a pmf or pdf with common support, that is, the set {x : f(x|θ) > 0} is identical for every θ. This family of distributions is said to have monotone likelihood ratio (MLR) with respect to a statistic T(x) if

f(x|θ2)/f(x|θ1), on the support, is a non-decreasing function of T(x) for any θ2 > θ1.   (4.5)
LEMMA 4.2 Let X be a random vector with X ∼ f(x|θ), θ ∈ R, where f(x|θ) is a pmf or pdf with common support. Suppose f(x|θ) has MLR with respect to a statistic T(x). Let

φ(x) = 1 when T(x) > C,  γ when T(x) = C,  0 when T(x) < C   (4.6)

for some C ∈ R, where γ ∈ [0, 1]. Then

E_{θ1}φ(X) ≤ E_{θ2}φ(X)

for any θ1 < θ2.

Proof. Thanks to MLR, for any θ1 < θ2 there exists a nondecreasing function g(t) such that f(x|θ2)/f(x|θ1) = g(T(x)). Thus

E_{θ2}φ(X) = ∫ φ(x) (f(x|θ2)/f(x|θ1)) f(x|θ1) dx = E_{θ1}( g(T(X)) · h(T(X)) ),

where

h(t) = 1 when t > C,  γ when t = C,  0 when t < C.

Since both g(t) and h(t) are non-decreasing, by the positive correlation inequality (Lemma 4.1),

E_{θ2}φ(X) ≥ E_{θ1} g(T(X)) · E_{θ1} h(T(X)) = E_{θ1}φ(X),

using E_{θ1} g(T(X)) = ∫ f(x|θ2) dx = 1.
Remark. All the above statements of weak convergence of random variables Xn can be replaced by their distributions µn := L(Xn). So "Xn =⇒ X" is equivalent to "µn =⇒ µ".

THEOREM 11 (Lévy's continuity theorem). Let {µn; n = 0, 1, 2, · · · } be a sequence of probability measures on R^k with characteristic functions φn. We have that
(i) if µn =⇒ µ0, then φn(t) → φ0(t) for every t ∈ R^k;
(ii) if φn(t) converges pointwise to a limit φ0(t) that is continuous at 0, then the associated sequence of distributions {µn} is tight and converges weakly to the measure µ0 with characteristic function φ0.

Remark. The condition that "φ0(t) is continuous at 0" is essential. Let Xn ∼ N(0, n), n ≥ 1. Then P(Xn ≤ x) → 1/2 for any x, which means that Xn doesn't converge weakly. Note that φn(t) = e^{−nt²/2} → 0 as n → ∞ while φn(0) = 1. So the limit φ0(t) is not continuous at 0. If one ignored the continuity condition at zero, the conclusion that Xn converges weakly would wrongly follow.
Proof of Theorem 11. (i) is trivial.
(ii) Because marginal tightness implies joint tightness, W.L.O.G. assume that µn is a probability measure on R, generated by Xn. Note that for every x and δ > 0,

I{|δx| > 2} ≤ 2( 1 − sin(δx)/(δx) ) = (1/δ) ∫_{−δ}^δ (1 − cos tx) dt.

Replace x by Xn, take expectations, and use Fubini's theorem to obtain that

P(|Xn| > 2/δ) ≤ (1/δ) ∫_{−δ}^δ (1 − E e^{itXn}) dt = (1/δ) ∫_{−δ}^δ (1 − φn(t)) dt.

By the given condition,

lim sup_n P(|Xn| > 2/δ) ≤ (1/δ) ∫_{−δ}^δ (1 − φ0(t)) dt.

The right hand side goes to zero as δ ↓ 0. So tightness follows.
To prove the weak convergence, we have to show that ∫ f(x) dµn → ∫ f(x) dµ0 for every bounded continuous function f(x). It is equivalent to show that for any subsequence there is a further subsequence, say {nk}, such that ∫ f(x) dµ_{nk} → ∫ f(x) dµ0.
For any subsequence, by Lemma 5.6 (the Helly selection principle) and tightness, there is a further subsequence, say {µ_{nk}}, converging weakly to some limit. By the proof of (i), the c.f. of that limit is φ0. We know that a c.f. uniquely determines a distribution, so every subsequential limit is the same measure µ0. Hence ∫ f(x) dµ_{nk} → ∫ f(x) dµ0.
THEOREM 12 (Central Limit Theorem). Let X1, X2, · · · , Xn be a sequence of i.i.d. random variables with mean zero and variance one. Let X̄n = (X1 + · · · + Xn)/n. Then √n X̄n =⇒ N(0, 1).

Proof. Let φ(t) be the c.f. of X1. Since EX1 = 0 and EX1² = 1, we have that φ′(0) = iEX1 = 0 and φ″(0) = i²EX1² = −1. By Taylor's expansion,

φ(t/√n) = 1 − t²/(2n) + o(1/n)

as n → ∞. Hence

E e^{it√n X̄n} = φ(t/√n)^n = ( 1 − t²/(2n) + o(1/n) )^n → e^{−t²/2}

as n → ∞. By Lévy's continuity theorem, the desired conclusion follows.
To deal with multivariate random variables, we have the next tool.
THEOREM 13 (Cramér–Wold device). Let Xn and X be random variables taking values in R^k. Then

Xn =⇒ X if and only if tᵀXn =⇒ tᵀX for all t ∈ R^k.

Proof. By Lévy's continuity theorem (Theorem 11), Xn =⇒ X if and only if E exp(itᵀXn) → E exp(itᵀX) for each t ∈ R^k, which is equivalent to E exp(iu(tᵀXn)) → E exp(iu(tᵀX)) for any real number u and each t. This is the same as saying that the c.f. of tᵀXn converges to that of tᵀX for each t ∈ R^k, which is equivalent to tᵀXn =⇒ tᵀX for all t ∈ R^k.
As an application, we have the multivariate analogue of the one-dimensional CLT.
THEOREM 14 Let X1, X2, · · · be a sequence of i.i.d. random vectors in R^k with mean vector µ = EX1 and covariance matrix Σ = E(X1 − µ)(X1 − µ)ᵀ. Then

(1/√n) Σ_{i=1}^n (Xi − µ) = √n(X̄n − µ) =⇒ N_k(0, Σ),

where X̄n is the sample mean.

Proof. By the Cramér–Wold device, we need to show that

(1/√n) Σ_{i=1}^n (tᵀXi − tᵀµ) =⇒ N(0, tᵀΣt)

for any t ∈ R^k. Note that tᵀXi − tᵀµ, i = 1, 2, · · · , are i.i.d. real random variables. By the one-dimensional CLT,

(1/√n) Σ_{i=1}^n (tᵀXi − tᵀµ) =⇒ N(0, Var(tᵀX1)).

This leads to the desired conclusion since Var(tᵀX1) = tᵀΣt.
We state the following theorem without proof; the spirit of the proof is similar to that of Theorem 12. It is called the Lindeberg–Feller theorem.

THEOREM 15 Let {kn; n ≥ 1} be a sequence of positive integers with kn → +∞ as n → ∞. For each n, let Y_{n,1}, · · · , Y_{n,kn} be independent random vectors with finite variances such that

Σ_{i=1}^{kn} Cov(Y_{n,i}) → Σ  and  Σ_{i=1}^{kn} E[ ‖Y_{n,i}‖² I{‖Y_{n,i}‖ > ε} ] → 0

for every ε > 0. Then Σ_{i=1}^{kn} (Y_{n,i} − EY_{n,i}) =⇒ N(0, Σ).
The Central Limit Theorem holds not only for independent random variables; it is also true for some dependent random variables with special structure.

Let Y1, Y2, · · · be a sequence of random variables, and let {Fn} be a non-decreasing sequence of σ-algebras (Fn ⊂ F_{n+1} for all n ≥ 0) with Fn ⊃ σ(Y1, · · · , Yn), where F0 = {∅, Ω}. When E(Y_{n+1}|Fn) = 0 for every n ≥ 0, we say {Yn; n ≥ 1} are martingale differences.
THEOREM 16 For each n ≥ 1, let {X_{ni}, 1 ≤ i ≤ kn} be a sequence of martingale differences relative to nested σ-algebras F_{ni}, 1 ≤ i ≤ kn (that is, F_{ni} ⊂ F_{nj} for i < j), such that
1. {max_i |X_{ni}|; n ≥ 1} is uniformly integrable;
2. E(max_i |X_{ni}|) → 0;
3. Σ_i X_{ni}² → 1 in probability.
Then Sn = Σ_i X_{ni} → N(0, 1) in distribution.

Example. Let {Xi; i ≥ 1} be i.i.d. with mean zero and variance one. One can verify that the required conditions in Theorem 16 hold for X_{ni} = Xi/√n, i = 1, 2, · · · , n. Therefore we recover the classical CLT that Σ_{i=1}^n Xi/√n =⇒ N(0, 1). Actually,

P(max_i |X_{ni}| ≥ ε) ≤ nP(|X1| ≥ √n ε) ≤ (n/(nε²)) E[ |X1|² I{|X1| ≥ √n ε} ] → 0

and

E(max_i |X_{ni}|)² ≤ E[ Σ_{i=1}^n Xi²/n ] = 1.

This shows that max_i |X_{ni}| → 0 in probability and that {max_i |X_{ni}|; n ≥ 1} is uniformly integrable (being bounded in L²). So conditions 1 and 2 hold; condition 3 holds by the law of large numbers.
LEMMA 5.8 Let {Un; n ≥ 1} and {Tn; n ≥ 1} be two sequences of random variables such that
1. Un → a in probability, where a is a constant;
2. {Tn} is uniformly integrable;
3. {UnTn} is uniformly integrable;
4. ETn → 1.
Then E(TnUn) → a.

Proof. Write TnUn = Tn(Un − a) + aTn. Then E(TnUn) = ETn(Un − a) + aETn. Evidently, {Tn(Un − a)} is uniformly integrable by the given conditions. Since {Tn} is uniformly integrable, it is not difficult to verify that Tn(Un − a) → 0 in probability, hence ETn(Un − a) → 0. The result then follows from ETn → 1.
By Taylor expansion, we have the equality

e^{ix} = (1 + ix) e^{−x²/2 + r(x)}   (5.6)

for some function r(x) = O(x³) as x → 0. In the set-up of Theorem 16, we have e^{itSn} = TnUn, where

Tn = Π_j (1 + itX_{nj})  and  Un = exp( −(t²/2) Σ_j X_{nj}² + Σ_j r(tX_{nj}) ).   (5.7)

We next obtain a general result on the CLT by using Lemma 5.8.
THEOREM 17 Let {X_{nj}, 1 ≤ j ≤ kn, n ≥ 1} be a triangular array of random variables. Suppose
1. ETn → 1;
2. {Tn} is uniformly integrable;
3. Σ_j X_{nj}² → 1 in probability;
4. E(max_j |X_{nj}|) → 0.
Then Sn = Σ_j X_{nj} converges to N(0, 1) in distribution.

Proof. Since |TnUn| = |e^{itSn}| = 1, we know that {TnUn} is uniformly integrable. By Lemma 5.8, it is enough to show that Un → e^{−t²/2} in probability. Reviewing (5.7) and condition 3, it suffices to show that Σ_j r(tX_{nj}) → 0 in probability. The assertion r(x) = O(x³) says that there exist A > 0 and δ > 0 such that |r(x)| ≤ A|x|³ for |x| < δ. Then

P( |Σ_j r(tX_{nj})| > ε ) ≤ P( max_j |X_{nj}| ≥ δ/|t| ) + P( |Σ_j r(tX_{nj})| > ε, max_j |X_{nj}| < δ/|t| )
≤ (|t|/δ) E(max_j |X_{nj}|) + P( A|t|³ max_j |X_{nj}| · Σ_j X_{nj}² > ε ) → 0

as n → ∞ by the given conditions.
Proof of Theorem 16. Define Z_{n1} = X_{n1} and

Z_{nj} = X_{nj} I( Σ_{k=1}^{j−1} X_{nk}² ≤ 2 ),  2 ≤ j ≤ kn.

It is easy to check that {Z_{nj}} is still a martingale difference array relative to the original σ-algebras. Now define J = inf{ j ≥ 1 : Σ_{1≤k≤j} X_{nk}² > 2 } ∧ kn. Then

P(X_{nk} ≠ Z_{nk} for some k ≤ kn) = P(J ≤ kn − 1) ≤ P( Σ_{1≤k≤kn} X_{nk}² > 2 ) → 0

as n → ∞. Therefore, P(Sn ≠ Σ_{j=1}^{kn} Z_{nj}) → 0 as n → ∞. To prove the result, we only need to show

Σ_{j=1}^{kn} Z_{nj} =⇒ N(0, 1).   (5.8)

We now apply Theorem 17 to prove this. Replacing X_{nj} by Z_{nj} in (5.7), we have new Tn and Un. Let's verify the four conditions in Theorem 17 one by one. By a martingale property and iteration, ETn = 1, so condition 1 holds. Now max_j |Z_{nj}| ≤ max_j |X_{nj}| and E(max_j |X_{nj}|) → 0, so condition 4 holds. It is also easy to check that Σ_j Z_{nj}² → 1 in probability. It remains to show condition 2.

By definition, Tn = Π_{j=1}^{kn} (1 + itZ_{nj}) = Π_{1≤j≤J} (1 + itZ_{nj}). Thus

|Tn| = [ Π_{k=1}^{J−1} (1 + t²X_{nk}²) ]^{1/2} · |1 + itX_{nJ}|
≤ exp( (t²/2) Σ_{k=1}^{J−1} X_{nk}² ) (1 + |t| · |X_{nJ}|)
≤ e^{t²} (1 + |t| · max_j |X_{nj}|),

which is uniformly integrable by condition 1 in Theorem 16.
The next result is an extreme value limit theorem, which is different from the CLT. The limiting distribution is called an extreme value distribution.

THEOREM 18 Let {Xi; 1 ≤ i ≤ n} be i.i.d. N(0, 1) random variables and let Wn = max_{1≤i≤n} Xi. Then

P( Wn ≤ √(2 log n) − (log₂ n + x)/(2√(2 log n)) ) → exp( −(1/(2√π)) e^{x/2} )

for any x ∈ R, where log₂ n = log(log n).

Proof. Let tn be the right hand side of "≤" in the probability above. Then

P(Wn ≤ tn) = P(X1 ≤ tn)^n = (1 − P(X1 > tn))^n.

Since (1 − xn)^n → e^{−a} as n → ∞ if xn ∼ a/n, to prove the theorem it suffices to show that

P(X1 > tn) ∼ (1/(2√π n)) e^{x/2}.   (5.9)

Actually, we know that

P(X1 > x) ∼ (1/(√(2π) x)) e^{−x²/2}

as x → +∞. It is easy to calculate that

1/(√(2π) tn) ∼ 1/(2√(π log n))  and  tn²/2 = log n − log(√(log n)) − x/2 + o(1)

as n → ∞. This leads to (5.9).
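The key step (5.9) can be checked numerically without any simulation, since the normal tail is available exactly through the complementary error function: P(X1 > t) = (1/2) erfc(t/√2). In the sketch below, the value x = 0.3 is an arbitrary illustrative choice; the ratio of the exact tail at tn to the claimed asymptotic e^{x/2}/(2√π n) should approach 1 as n grows.

```python
import math

def tn(n, x):
    # t_n = sqrt(2 log n) - (log log n + x) / (2 sqrt(2 log n))
    L = math.log(n)
    return math.sqrt(2 * L) - (math.log(L) + x) / (2 * math.sqrt(2 * L))

def norm_tail(t):
    # exact standard normal tail P(X > t)
    return 0.5 * math.erfc(t / math.sqrt(2))

x = 0.3
ratios = []
for n in (10 ** 6, 10 ** 10):
    claimed = math.exp(x / 2) / (2 * math.sqrt(math.pi) * n)
    ratios.append(norm_tail(tn(n, x)) / claimed)
print(ratios)   # both ratios close to 1
```

The convergence is logarithmically slow, which is typical for extremes of normal samples.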
As an application of the CLT for i.i.d. random variables, we can derive the classical χ²-test as follows.

Consider a multinomial distribution with n trials and k classes, with parameter p = (p1, · · · , pk). Roll a die twenty times. Let Xi be the number of occurrences of "i dots", 1 ≤ i ≤ 6. Then (X1, X2, · · · , X6) follows a multinomial distribution with "success rate" p = (p1, · · · , p6). How do we test whether the die is fair? From an introductory course, we know that the null hypothesis is H0: p1 = · · · = p6 = 1/6 and the test statistic is

χ² = Σ_{i=1}^6 (Xi − npi)²/(npi).

We use the fact that χ² is roughly χ²(5) when n is large. We will prove this next.

In general, Xn = (X_{n,1}, · · · , X_{n,k}) follows a multinomial distribution with n trials and "success rate" p = (p1, · · · , pk). Of course, Σ_{i=1}^k X_{n,i} = n. To be more precise,

P(X_{n,1} = x_{n,1}, · · · , X_{n,k} = x_{n,k}) = ( n! / (x_{n,1}! · · · x_{n,k}!) ) p1^{x_{n,1}} · · · pk^{x_{n,k}},

where Σ_{i=1}^k x_{n,i} = n. Now we prove the following theorem.

THEOREM 19 As n → ∞,

χ²n = Σ_{i=1}^k (X_{n,i} − npi)²/(npi) =⇒ χ²(k − 1).
We need a lemma.

LEMMA 5.9 Let Y ∼ N_k(0, Σ). Then

‖Y‖² =_d Σ_{i=1}^k λi Zi²,

where λ1, · · · , λk are the eigenvalues of Σ and Z1, · · · , Zk are i.i.d. N(0, 1).

Proof. Decompose Σ = O diag(λ1, · · · , λk) Oᵀ for some orthogonal matrix O. Then the k coordinates of OᵀY are independent with mean zero and variances λi. So ‖Y‖² = ‖OᵀY‖² is equal in distribution to Σ_{i=1}^k λi Zi², where the Zi are i.i.d. N(0, 1).
Another fact we will use is that AXn =⇒ AX for any k×k matrix A if Xn =⇒ X, where
Xn and X are Rk-valued random vectors. This can be shown easily by the Cramer-Wold
device.
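Theorem 19 is easy to see empirically: for the fair-die setting (k = 6, so k − 1 = 5 degrees of freedom), the simulated χ² statistics should have mean ≈ 5 and exceed the χ²(5) 95th percentile (≈ 11.07) about 5% of the time. The trial count, replication count and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n, reps = 6, 1_000, 20_000
p = np.full(k, 1 / k)

X = rng.multinomial(n, p, size=reps)              # reps multinomial count vectors
chi2 = np.sum((X - n * p) ** 2 / (n * p), axis=1) # Pearson chi-square statistics

tail = np.mean(chi2 > 11.07)   # chi2(5) 95th percentile ~ 11.07
print(chi2.mean(), tail)       # ~5 and ~0.05
```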
Proof of the Theorem. Let Y1, · · · , Yn be i.i.d. multinomial with 1 trial and "success rate" p = (p1, · · · , pk). Then Xn = Y1 + · · · + Yn. Each Yi is a k-dimensional
over all k ≥ 1,m ≥ 0 and measurable functions f(x) defined on Rm and g(x) defined on
Rm−n+1 with |f(x)| ≤ 1 and |g(x)| ≤ 1 for all x ∈ R.
LEMMA 6.3 The following are true:
(i) α(n) = sup_{k≥1} sup_{m≥0} sup_{A∈σ(X1,··· ,Xk), B∈σ(X_{k+n},··· ,X_{k+n+m})} |P(A ∩ B) − P(A)P(B)|.
(ii) For all n ≥ 1, we have that α(n) ≤ ᾱ(n) ≤ 4α(n).

Proof. (i) For two sets A and B, let A∆B = (A\B) ∪ (B\A), their symmetric difference. Then I_{A∆B} = |I_A − I_B|. Let {Zi; i ≥ 1} be a sequence of random variables. We claim that

σ(Z1, Z2, · · · ) = { B ∈ σ(Z1, Z2, · · · ) : for any ε > 0 there exist l ≥ 1 and C ∈ σ(Z1, · · · , Zl) such that P(B∆C) < ε }.   (6.23)

If this is true, then for any ε > 0 there exist l < ∞ and C ∈ σ(Z1, · · · , Zl) such that |P(B) − P(C)| < ε, and (i) follows.

Now we prove this claim. First, note that σ(Z1, Z2, · · · ) = σ({Zi ∈ Ei}; Ei ∈ B(R); i ≥ 1). Denote by F the right hand side of (6.23). Then F contains all {Zi ∈ Ei}, Ei ∈ B(R), i ≥ 1. We only need to show that F is a σ-algebra.
It is easy to see that (a) Ω ∈ F; (b) Bᶜ ∈ F if B ∈ F, since A∆B = Aᶜ∆Bᶜ; (c) if Bi ∈ F for i ≥ 1, there exist mi < ∞ and Ci ∈ σ(Z1, · · · , Z_{mi}) such that P(Bi∆Ci) < ε/2^i for all i ≥ 1. Evidently, ∪_{i=1}^n Bi ↑ ∪_{i=1}^∞ Bi as n → ∞, so there exists n0 < ∞ such that |P(∪_{i=1}^∞ Bi) − P(∪_{i=1}^{n0} Bi)| ≤ ε/2. It is easy to check that (∪_{i=1}^{n0} Bi)∆(∪_{i=1}^{n0} Ci) ⊂ ∪_{i=1}^{n0} (Bi∆Ci). Write B = ∪_{i=1}^∞ Bi, B̃ = ∪_{i=1}^{n0} Bi and C = ∪_{i=1}^{n0} Ci. Note that B∆C ⊂ (B\B̃) ∪ (B̃∆C). These facts show that P(B∆C) < 2ε and C ∈ σ(Z1, Z2, · · · , Zl) for some l < ∞; since ε > 0 is arbitrary, ∪_i Bi ∈ F. Thus F is a σ-algebra.
(ii) Taking f and g to be indicator functions in the definition of ᾱ(n), from (i) we get α(n) ≤ ᾱ(n). By Lemma 6.1, we have that ᾱ(n) ≤ 4α(n).
Let {Xi; i ≥ 1} be a sequence of random variables. We say it is a Markov chain if P(X_{k+1} ∈ A|X1, · · · , Xk) = P(X_{k+1} ∈ A|Xk) for any Borel set A and any k ≥ 1. A Markov chain has the property that, given the present, the past and future observations are (conditionally) independent. This is stated precisely in Proposition 10.3.

Let's look at the strong mixing coefficient α(n) and recall its definition. If X1, X2, · · · is a Markov chain, for simplicity write f and g for f(X1, · · · , Xk) and g(X_{k+n}, · · · , X_{k+m}), respectively. Let f̄ = E(f|X_{k+1}) and ḡ = E(g|X_{k+n−1}). Then by Proposition 10.3, E(fg) = E(f̄ḡ), Ef = Ef̄ and Eg = Eḡ. Thus

E(fg) − (Ef)(Eg) = E(f̄ḡ) − (Ef̄)(Eḡ).

The point is that f̄ is σ(X_{k+1})-measurable and ḡ is σ(X_{k+n−1})-measurable. Therefore
THEOREM 24 (Hoeffding's theorem). For a U-statistic given by (7.1) with E[h(X1, · · · , Xm)²] < ∞,

Var(Un) = (1/C(n, m)) Σ_{k=1}^m C(m, k) C(n − m, m − k) ζk,

where ζk = Var(hk(X1, · · · , Xk)).

Proof. Consider two sets {i1, · · · , im} and {j1, · · · , jm} of m distinct integers from {1, 2, · · · , n} with exactly k integers in common. The total number of such pairs is C(n, m) C(m, k) C(n − m, m − k). Now

Var(Un) = (1/C(n, m)²) Σ Cov( h(X_{i1}, · · · , X_{im}), h(X_{j1}, · · · , X_{jm}) ),

where the sum is over all indices 1 ≤ i1 < · · · < im ≤ n and 1 ≤ j1 < · · · < jm ≤ n. If {i1, · · · , im} and {j1, · · · , jm} have no integers in common, the covariance is zero by independence. Now we classify the sum by the number of integers in common. Then
h_{k+1}). Note that Ehk = Eh_{k+1}; then Var(hk) ≤ Var(h_{k+1}). So (i) follows.
(ii) By Theorem 24,

Var(Un) = (1/C(n, m)) C(m, k) C(n − m, m − k) ζk + (1/C(n, m)) Σ_{i=k+1}^m C(m, i) C(n − m, m − i) ζi.

Since m is fixed, C(n, m) ∼ n^m/m! and C(n − m, m − i) ≤ n^{m−i} for any k + 1 ≤ i ≤ m. Thus the last term above is O(n^{−(k+1)}) as n → ∞. Now

(1/C(n, m)) C(m, k) C(n − m, m − k) ζk
= [ m! C(m, k) ζk / (n(n−1) · · · (n−m+1)) ] · [ (n−m)! / ((m−k)!(n−2m+k)!) ]
= [ k! C(m, k)² ζk / n^k ] · [ n · n · · · n / (n(n−1) · · · (n−k+1)) ] · [ (n−m) · · · (n−2m+k+1) / ((n−k) · · · (n−m+1)) ]
= [ k! C(m, k)² ζk / n^k ] · Π_{i=1}^{k−1} (1 − i/n)^{−1} · Π_{i=m}^{2m−k−1} (1 − i/n) · Π_{i=k}^{m−1} (1 − i/n)^{−1}.

Note that since m is fixed, each of the above products is equal to 1 + O(1/n). The conclusion then follows.
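Hoeffding's variance formula can be checked against a simulation. Below I pick the kernel h(x, y) = xy (an illustrative choice, m = 2) with i.i.d. N(1, 1) data, for which Eh = µ², h1(x) = µx, ζ1 = µ²σ² and ζ2 = Var(XY) = (σ² + µ²)² − µ⁴. The formula then reads Var(Un) = [2(n − 2)ζ1 + ζ2]/C(n, 2).

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, mu, sig = 30, 200_000, 1.0, 1.0
X = mu + sig * rng.standard_normal((reps, n))

# U-statistic with kernel h(x, y) = x*y, computed via sums:
# sum_{i<j} Xi*Xj = ((sum Xi)^2 - sum Xi^2) / 2
s1 = X.sum(axis=1)
s2 = (X ** 2).sum(axis=1)
Un = (s1 ** 2 - s2) / (n * (n - 1))

zeta1 = mu ** 2 * sig ** 2                    # Var(h1(X)), h1(x) = mu*x
zeta2 = (sig ** 2 + mu ** 2) ** 2 - mu ** 4   # Var(h(X, Y))
var_exact = (2 * (n - 2) * zeta1 + zeta2) / (n * (n - 1) / 2)
print(Un.mean(), Un.var(), var_exact)         # mean ~ mu^2 = 1
```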
Let X1, · · · , Xn be a sample and Tn a statistic based on this sample. The projection of Tn on random variables Y1, · · · , Yp is defined by

Tn′ = ETn + Σ_{i=1}^p ( E(Tn|Yi) − ETn ).

Suppose Tn is a symmetric function of X1, · · · , Xn. Set ψ(Xi) = E(Tn|Xi) for i = 1, 2, · · · , n. Then ψ(X1), · · · , ψ(Xn) are i.i.d. random variables with mean ETn. If Var(Tn) < ∞, then

( 1/√(n Var(ψ(X1))) ) Σ_{i=1}^n ( ψ(Xi) − ETn ) → N(0, 1)   (7.4)

in distribution as n → ∞. Now let Tn′ be the projection of Tn on X1, · · · , Xn. Then

Tn − Tn′ = (Tn − ETn) − Σ_{i=1}^n ( ψ(Xi) − ETn ).   (7.5)

We will show next that Tn − Tn′ is negligible compared to the order appearing in (7.4). Then the CLT holds for Tn by (7.4) and the Slutsky lemma.
LEMMA 7.2 Let Tn be a symmetric statistic with Var(Tn) < ∞ for every n, and let Tn′ be the projection of Tn on X1, · · · , Xn. Then ETn′ = ETn and

E(Tn − Tn′)² = Var(Tn) − Var(Tn′).

Proof. Since ETn = ETn′,

E(Tn − Tn′)² = Var(Tn) + Var(Tn′) − 2Cov(Tn, Tn′).

First, Var(Tn′) = n Var(E(Tn|X1)) by independence. Second, by the definition of Tn′,

Cov(Tn, Tn′) = Σ_{i=1}^n Cov(Tn, ψ(Xi)) = Σ_{i=1}^n E[ (Tn − ETn)(ψ(Xi) − ETn) ].   (7.6)

Now the i-th covariance above is equal to E(Tn E(Tn|Xi)) − (ETn)² = Var(E(Tn|X1)). We have already seen that Var(Tn′) = n Var(E(Tn|X1)), so Cov(Tn, Tn′) = Var(Tn′), and the proof is complete by (7.6).
THEOREM 25 Let Un be the statistic given by (7.1) with E[h(X1, · · · , Xm)²] < ∞.
(i) If ζ1 > 0, then √n(Un − EUn) → N(0, m²ζ1) in distribution.
(ii) If ζ1 = 0 and ζ2 > 0, then

n(Un − EUn) → ( m(m−1)/2 ) Σ_{j=1}^∞ λj ( χ²_{1j} − 1 ),

where the χ²_{1j} are i.i.d. χ²(1) random variables and the λj are constants satisfying Σ_{j=1}^∞ λj² = ζ2.

Proof. We will only prove (i) here; the proof of (ii) is very technical and is omitted. Let Un′ be the projection of Un on X1, · · · , Xn. Then

Un′ = Eh + (1/C(n, m)) Σ_{i=1}^n Σ_{1≤i1<···<im≤n} [ E(h(X_{i1}, · · · , X_{im})|Xi) − Eh ].

Observe that

(1/C(n, m)) Σ_{1≤i1<···<im≤n} [ E(h(X_{i1}, · · · , X_{im})|X1) − Eh ]
= (1/C(n, m)) Σ_{2≤i2<···<im≤n} ( h1(X1) − Eh )
= ( C(n−1, m−1)/C(n, m) ) ( h1(X1) − Eh ) = (m/n)( h1(X1) − Eh ),

since the terms whose index sets do not contain 1 contribute zero. Thus

Un′ = Eh + (m/n) Σ_{i=1}^n ( h1(Xi) − Eh ).

This says that Var(Un′) = (m²/n)ζ1, where ζ1 = Var(h1(X1)). By Lemma 7.1, Var(Un) = (m²/n)ζ1 + O(n⁻²). From Lemma 7.2, Var(Un − Un′) = O(n⁻²) as n → ∞. By the Chebyshev inequality, √n(Un − Un′) → 0 in probability. Since √n(Un′ − Eh) = (m/√n) Σ_{i=1}^n (h1(Xi) − Eh) =⇒ N(0, m²ζ1) by the one-dimensional CLT, the result follows from the Slutsky lemma.
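Theorem 25(i) can be illustrated with the sample variance, which is the U-statistic with kernel h(x, y) = (x − y)²/2 (m = 2). For N(0, 1) data, h1(x) = ((x)² + 1)/2 up to centering, so ζ1 = Var(X²)/4 = 1/2 and the limit variance is m²ζ1 = 2; that is, √n(S² − 1) ⇒ N(0, 2). The sizes and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 10_000
X = rng.standard_normal((reps, n))

# sample variance = U-statistic with kernel h(x, y) = (x - y)^2 / 2
Un = X.var(axis=1, ddof=1)
Z = np.sqrt(n) * (Un - 1.0)

# Theorem 25(i): limit N(0, m^2 * zeta1) = N(0, 2) here
print(Z.mean(), Z.var())
```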
8 Empirical Processes
Let X1, X2, · · · be a random sample, where Xi takes values in a metric space M. Define a probability measure µn by

µn(A) = (1/n) Σ_{i=1}^n I(Xi ∈ A),  A ⊂ M.

The measure µn is called the empirical measure. If X1 takes real values, that is, M = R, set

Fn(t) = µn((−∞, t]) = (1/n) Σ_{i=1}^n I(Xi ≤ t),  t ∈ R.

The random function {Fn(t), t ∈ R} is called the empirical distribution function. By the LLN and CLT, for each fixed t,

Fn(t) → F(t) a.s. and √n(Fn(t) − F(t)) =⇒ N(0, F(t)(1 − F(t))),

where F(t) is the cdf of X1. Among many interesting questions, here we are interested in uniform versions of the above almost sure and weak convergences. Specifically, we want to answer the following two questions:
1) Does ∆n := sup_t |Fn(t) − F(t)| → 0 a.s.?
2) For fixed n ≥ 1, regarding {Fn(t); t ∈ R} as one point in a big space, does a CLT hold?
The answers are yes for both questions. We address the first question first.
THEOREM 26 (Glivenko–Cantelli) Let X1, X2, · · · be a sequence of i.i.d. random variables with cdf F(t). Then

∆n = sup_t |Fn(t) − F(t)| → 0 a.s.

as n → ∞.

Remark. When F(t) is continuous and strictly increasing on the support of X1, the distribution of ∆n does not depend on the distribution of X1. First, the F(Xi) are i.i.d. U[0, 1] random variables. Second,

∆n = sup_t | (1/n) Σ_i I(Xi ≤ t) − F(t) | = sup_t | (1/n) Σ_i I(F(Xi) ≤ F(t)) − F(t) |
= sup_{0≤y≤1} | (1/n) Σ_i I(Yi ≤ y) − y |,

where Yi = F(Xi), 1 ≤ i ≤ n, are i.i.d. random variables uniform over [0, 1].
Proof. Set

Fn(t−) = (1/n) Σ_{i=1}^n I(Xi < t)  and  F(t−) = P(X1 < t),  t ∈ R.

By the strong law of large numbers, Fn(t) → F(t) a.s. and Fn(t−) → F(t−) a.s. for each fixed t as n → ∞. Given ε > 0, choose −∞ = t0 < t1 < · · · < tk = ∞ such that F(ti−) − F(t_{i−1}) < ε for every i. Now, for t_{i−1} ≤ t < ti,

Fn(t) − F(t) ≤ Fn(ti−) − F(ti−) + ε,
Fn(t) − F(t) ≥ Fn(t_{i−1}) − F(t_{i−1}) − ε.

This says that

∆n ≤ max_{1≤i≤k} { |Fn(ti) − F(ti)|, |Fn(ti−) − F(ti−)| } + ε → ε a.s.

as n → ∞. The conclusion follows by letting ε ↓ 0.
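The Glivenko–Cantelli phenomenon is easy to watch numerically. For U(0, 1) data the sup is attained at the order statistics, so ∆n can be computed exactly; the sample sizes and seed below are arbitrary illustrative choices, and ∆n should shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)

def delta_n(n):
    # sup_t |F_n(t) - t| for U(0, 1) data; the sup is attained at the
    # order statistics u_(1) <= ... <= u_(n)
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

d_small, d_big = delta_n(100), delta_n(100_000)
print(d_small, d_big)   # d_big is much smaller
```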
Given (t1, t2, · · · , tk) ∈ R^k, it is easy to check that √n(Fn(t1) − F(t1), · · · , Fn(tk) − F(tk)) =⇒ N_k(0, Σ), where

Σ = (σij) and σij = F(ti ∧ tj) − F(ti)F(tj),  1 ≤ i, j ≤ k.   (8.1)

Let D[−∞, ∞] be the set of all functions defined on (−∞, ∞) that are right-continuous with left limits everywhere. Equipped with the so-called Skorohod metric, it becomes a Polish space. The random element √n(Fn − F), viewed as an element of D[−∞, ∞], converges weakly to a continuous Gaussian process ξ(t) with ξ(±∞) = 0 and the covariance structure given in (8.1). This limit is usually called a Brownian bridge; it has the same distribution as G_λ ∘ F, where G_λ is the limiting process obtained when F is the cumulative distribution function of the uniform distribution over [0, 1].

THEOREM 27 (Donsker). If X1, X2, · · · are i.i.d. random variables with distribution function F, then the sequence of empirical processes √n(Fn − F) converges in distribution in the space D[−∞, ∞] to a random element G_F whose marginal distributions are zero-mean with covariance function (8.1).

Later we will see that

√n ‖Fn − F‖_∞ = √n sup_t |Fn(t) − F(t)| =⇒ sup_t |G_F(t)|,

where

P( sup_t |G_F(t)| ≥ x ) = 2 Σ_{j=1}^∞ (−1)^{j+1} e^{−2j²x²},  x ≥ 0.

Also, the DKW (Dvoretzky, Kiefer, and Wolfowitz) inequality says that

P( √n ‖Fn − F‖_∞ > x ) ≤ 2e^{−2x²},  x > 0.
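The Kolmogorov series and the DKW bound can be compared against simulation. Below, n, the replication count, the threshold x = 1 and the seed are arbitrary illustrative choices; the empirical tail probability of √n ∆n should sit near the series value, which in turn lies just below the DKW bound (the bound is exactly the first term of the alternating series).

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 1_000, 5_000
i = np.arange(1, n + 1)

stats = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))                       # U(0, 1) sample
    d = max(np.max(i / n - u), np.max(u - (i - 1) / n))    # KS statistic
    stats[r] = np.sqrt(n) * d

x = 1.0
emp = np.mean(stats > x)
kolmogorov = 2 * sum((-1) ** (j + 1) * np.exp(-2 * j * j * x * x)
                     for j in range(1, 50))
dkw = 2 * np.exp(-2 * x * x)
print(emp, kolmogorov, dkw)
```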
Let X1, X2, · · · be i.i.d. random variables taking values in (X, A) with L(X1) = P. We write µf = ∫ f(x) µ(dx). So

Pnf = (1/n) Σ_{i=1}^n f(Xi)  and  Pf = ∫_X f(x) P(dx).

By the Glivenko–Cantelli theorem,

sup_{x∈R} |Pn((−∞, x]) − P((−∞, x])| → 0 a.s.

On the other hand, if P has no point mass, then

sup_{A measurable} |Pn(A) − P(A)| = 1,

because Pn(A) = 1 and P(A) = 0 for A = {X1, X2, · · · , Xn}. We are therefore searching for classes F such that

sup_{f∈F} |Pnf − Pf| → 0 a.s.   (8.2)

The previous two examples say that (8.2) holds for F = {I(−∞, x]; x ∈ R} but does not hold for the indicators of all measurable sets. The class F is called P-Glivenko–Cantelli if (8.2) holds.

Define Gn = √n(Pn − P). Given k measurable functions f1, · · · , fk with Pfi² < ∞ for all i, one can check by the multivariate Central Limit Theorem (CLT) that

(Gnf1, · · · , Gnfk) =⇒ N_k(0, Σ),

where Σ = (σij)_{1≤i,j≤k} and σij = P(fifj) − (Pfi)(Pfj). This tells us that {Gnf; f ∈ F} satisfies a CLT in R^k for finite F. If F has infinitely many members, how do we define weak convergence? We first define a space similar to R^k. Set

l^∞(F) = { bounded functions z : F → R }

with norm ‖z‖_∞ = sup_{f∈F} |z(f)|. Then (l^∞(F), ‖·‖_∞) is a Banach space; when F has finitely many elements, say k, (l^∞(F), ‖·‖_∞) = (R^k, ‖·‖_∞).

So Gn : F → R is a random map and can be viewed as an element of the Banach space l^∞(F). We say F is P-Donsker if Gn =⇒ G, where G is a tight element of l^∞(F).
LEMMA 8.1 (Bernstein's inequality). Let X1, · · · , Xn be independent random variables with mean zero, |Xj| ≤ K for some constant K and all j, and σj² = EXj² > 0. Let Sn = Σ_{i=1}^n Xi and sn² = Σ_{j=1}^n σj². Then

P(|Sn| ≥ x) ≤ 2e^{−x²/(2(sn² + Kx))},  x > 0.

Proof. It suffices to show that

P(Sn ≥ x) ≤ e^{−x²/(2(sn² + Kx))},  x > 0.   (8.3)

First, for any λ > 0,

P(Sn ≥ x) ≤ e^{−λx} E e^{λSn} = e^{−λx} Π_{i=1}^n E e^{λXi}.

Now

E e^{λXj} = 1 + Σ_{i=2}^∞ (λ^i/i!) EXj^i ≤ 1 + (σj²λ²/2) Σ_{i=2}^∞ (λK)^{i−2} = 1 + σj²λ²/(2(1 − λK)) ≤ exp( σj²λ²/(2(1 − λK)) )

if λK < 1. Thus

P(Sn ≥ x) ≤ exp( −λx + λ²sn²/(2(1 − λK)) ).

Now (8.3) follows by choosing λ = x/(sn² + Kx).
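A numeric sanity check of Bernstein's inequality: with bounded summands, the empirical tail probability must sit below the bound. Below, the summands are Uniform(−1, 1) (mean 0, K = 1, σj² = 1/3); n, the threshold x and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps, K = 200, 50_000, 1.0
X = rng.uniform(-1, 1, size=(reps, n))   # |X_j| <= 1, var 1/3 each
Sn = X.sum(axis=1)
s2 = n / 3                               # s_n^2

x = 15.0
emp = np.mean(np.abs(Sn) >= x)
bound = 2 * np.exp(-x ** 2 / (2 * (s2 + K * x)))
print(emp, bound)                        # emp is well below bound
```

The bound is loose here (Sn is nearly normal with standard deviation √(n/3) ≈ 8.2, so the true tail is far smaller), which is typical: Bernstein's inequality trades sharpness for full generality.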
Recall Gnf = Σ_{i=1}^n (f(Xi) − Pf)/√n. The random variable Yi := (f(Xi) − Pf)/√n here corresponds to Xi in the above lemma. Note that Var(Σ_{i=1}^n Yi) = Var(f(X1)) ≤ Pf² and ‖Yi‖_∞ ≤ 2‖f‖_∞/√n. Applying the above lemma to Σ_{i=1}^n Yi, we obtain

COROLLARY 8.1 For any bounded, measurable function f,

P(|Gnf| ≥ x) ≤ 2 exp( −(1/4) x²/(Pf² + x‖f‖_∞/√n) )

for any x > 0.
Notice that
P(|Gnf| ≥ x) ≤ 2e^{−Cx} when x is large, and P(|Gnf| ≥ x) ≤ 2e^{−Cx^2} when x is small. (8.4)
Let us estimate E max_{1≤i≤m} |Yi| provided m is large and P(|Yi| ≥ x) ≤ e^{−x} for all i and x > 0. Then E|Yi|^k ≤ k! for k = 1, 2, · · · . The immediate estimate is
E max_{1≤i≤m} |Yi| ≤ ∑_{i=1}^m E|Yi| ≤ m.
By Hölder's inequality,
E max_{1≤i≤m} |Yi| ≤ (E max_{1≤i≤m} |Yi|^2)^{1/2} ≤ (∑_{i=1}^m E|Yi|^2)^{1/2} ≤ √(2m).
Let ψ(x) = e^x − 1, x ≥ 0. Following this logic, by Jensen's inequality,
ψ( E max_{1≤i≤m} |Yi|/2 ) ≤ E max_{1≤i≤m} ψ(|Yi|/2) ≤ m · max_{1≤i≤m} Eψ(|Yi|/2) ≤ 2m.
Taking the inverse ψ^{−1} of both sides, we obtain
E max_{1≤i≤m} |Yi| ≤ 2 log(1 + 2m).
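The logarithmic bound derived above can be checked numerically: standard exponential variables satisfy P(Yi ≥ x) = e^{−x} exactly, and the true value of the expected maximum is the harmonic number H_m ≈ log m + 0.577. A sketch (assuming NumPy; `max_bound` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(2)

def max_bound(m):
    """The Orlicz/Jensen bound on E max_{1<=i<=m} |Y_i|."""
    return 2.0 * np.log(2.0 * m)

# Standard exponentials have tail P(Y >= x) = e^{-x}; the bound is
# conservative by roughly a factor of 2 against H_m ~ log m.
for m in (10, 100, 1000):
    emp = rng.exponential(size=(5000, m)).max(axis=1).mean()
    print(m, emp, max_bound(m))
```

The point of the ψ-trick is precisely this log m growth: a naive union bound gives only the linear estimate m.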
This together with (8.4) leads to the following lemma.
LEMMA 8.2 For any finite class F of bounded, measurable, square-integrable functions with |F| elements, there is a universal constant C > 0 such that
C · E max_{f∈F} |Gnf| ≤ max_{f∈F} ‖f‖∞ · log(1 + |F|)/√n + max_{f∈F} ‖f‖P,2 · √(log(1 + |F|)),
where ‖f‖P,2 = (Pf^2)^{1/2}.
Proof. Define a = 24 max_f ‖f‖∞/√n and b = 24 max_f Pf^2. Define Af = (Gnf) I{|Gnf| > b/a} and Bf = (Gnf) I{|Gnf| ≤ b/a}; then Gnf = Af + Bf. It follows that
E max_f |Gnf| ≤ E max_f |Af| + E max_f |Bf|. (8.5)
For x ≥ b/a and x ≤ b/a the exponent in the bound of Corollary 8.1 is bounded above by −3x/a and −3x^2/b, respectively. Hence
P(|Af| ≥ x) ≤ P(|Gnf| ≥ x ∨ b/a) ≤ 2 exp(−3x/a),
P(|Bf| ≥ x) = P(b/a ≥ |Gnf| ≥ x) ≤ P(|Gnf| ≥ x ∧ b/a) ≤ 2 exp(−3x^2/b)
for all x ≥ 0. Let ψp(x) = exp(x^p) − 1 for x ≥ 0 and p ≥ 1. Then
Eψ1(|Af|/a) = E ∫_0^{|Af|/a} e^x dx = ∫_0^∞ P(|Af| ≥ ax) e^x dx ≤ ∫_0^∞ 2e^{−3x} e^x dx = 1.
By a similar argument we find that Eψ2(|Bf|/√b) ≤ 1. Because ψp(·) is convex for all p ≥ 1, by Jensen's inequality
ψ1( E max_f |Af|/a ) ≤ E ψ1( max_f |Af|/a ) ≤ E ∑_f ψ1(|Af|/a) ≤ |F|.
Taking the inverse function ψ1^{−1} of both sides gives E max_f |Af| ≤ a log(1 + |F|), the first term on the right hand side. Similarly, ψ2 gives E max_f |Bf| ≤ √b · √(log(1 + |F|)), the second term. The conclusion follows from (8.5).
We actually used the following lemma above.

LEMMA 8.3 Let X ≥ 0 be a random variable and f : [0,∞) → R be differentiable with ∫_0^s |f′(t)| dt < ∞ for every s > 0 and ∫_0^∞ P(X ≥ t)|f′(t)| dt < ∞. Then
Ef(X) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).

Proof. First, since ∫_0^s |f′(t)| dt < ∞ for every s > 0, we have
f(X) − f(0) = ∫_0^X f′(t) dt = ∫_0^∞ I(X ≥ t) f′(t) dt.
Then, by Fubini's theorem,
Ef(X) = E ∫_0^∞ I(X ≥ t) f′(t) dt + f(0) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).
Remark. When f(x) is differentiable, f′(x) is not necessarily Lebesgue integrable, even though it may be integrable as an improper Riemann integral. The following is an example. Let
f(x) = x^2 cos(1/x^2) if x ≠ 0, and f(0) = 0.
Then
f′(x) = 2x cos(1/x^2) + (2/x) sin(1/x^2) if x ≠ 0, and f′(0) = 0,
and f′ is not Lebesgue-integrable on [0, 1], although its improper Riemann integral exists. We only need to show g(x) := (2/x) sin(1/x^2) is not Lebesgue-integrable. Suppose it is; then, substituting t = 1/x^2,
∫_0^1 (1/x) |sin(1/x^2)| dx = (1/2) ∫_1^∞ (|sin t|/t) dt = +∞,
which yields a contradiction.
Now we describe the size of F. For any f ∈ F, define
‖f‖P,r = (P|f|^r)^{1/r}.
Let l and u be two functions; the bracket [l, u] is the set of all functions f such that l ≤ f ≤ u. An ǫ-bracket in Lr(P) is a bracket [l, u] such that ‖u − l‖P,r < ǫ. The bracketing number N[ ](ǫ, F, Lr(P)) is the minimum number of ǫ-brackets needed to cover F. The entropy with bracketing is the logarithm of the bracketing number.
8.1 Outer Measures and Expectations

Recall that X is a random variable from (Ω, G) → (R, B(R)) if X^{−1}(B) ∈ G for every set B ∈ B(R). For an arbitrary map X the inverse image may fail to be in G, particularly when G is small. For example, when G = {∅, Ω}, many maps are not random variables. In empirical processes we will deal with Z := sup_{t∈T} Xt for some index set T. If T is big, then Z may not be measurable, and it does not make sense to study expectations and probabilities of such a map directly. But there is a way to get around this.

Definition. Let X be an arbitrary map from (Ω, G, P) → (R, B(R)). Define
E∗X = inf{ EY ; Y ≥ X and Y is a measurable map : (Ω, G) → (R, B(R)) };
P∗(X ∈ A) = inf{ P(X ∈ B); B ⊃ A, B ∈ B(R), {X ∈ B} ∈ G }, A ∈ B(R).
One can show that the infimum E∗X is achieved, i.e., there exists a random variable X∗ : (Ω, G, P) → (R, B(R)) such that EX∗ = E∗X. Further, X∗ is P-almost surely unique: if there exist two such random variables X∗1 and X∗2, then P(X∗1 = X∗2) = 1. We call X∗ the measurable cover function. Obviously
(X1 + X2)∗ ≤ X∗1 + X∗2, and X∗1 ≤ X∗2 if X1 ≤ X2.
One can define the inner versions E∗ and P∗ similarly.
Let (M,d) be a metric space. A sequence of arbitrary maps Xn : (Ωn,Gn) → (M,d)
converges in distribution to a random vector X if
E∗f(Xn) → Ef(X)
for any bounded, continuous function f defined on (M,d). We still have an analogue of the
Portmanteau theorem.
THEOREM 28 The following are equivalent:
(i) E∗f(Xn) → Ef(X) for every bounded, continuous function f defined on (M,d);
(ii) E∗f(Xn) → Ef(X) for every bounded, Lipschitz function f defined on (M,d), that
is, there is a constant C > 0 such that |f(x) − f(y)| ≤ Cd(x, y) for any x, y ∈M ;
(iii) lim inf_n P∗(Xn ∈ G) ≥ P(X ∈ G) for any open set G, with P∗ the inner probability;
(iv) lim sup_n P∗(Xn ∈ F) ≤ P(X ∈ F) for any closed set F, with P∗ the outer probability;
(v) lim_n P∗(Xn ∈ H) = P(X ∈ H) for any set H such that P(X ∈ ∂H) = 0, where ∂H is the boundary of H.
Let
J[ ](δ, F, L2(P)) = ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ.
The proof of the following theorem is easy and is omitted.
THEOREM 29 Every class F of measurable functions such that N[ ](ǫ,F , L1(P )) <∞ for
every ǫ > 0 is P -Glivenko-Cantelli.
THEOREM 30 Every class F of measurable functions with J[ ](1,F , L2(P )) < ∞ is P -
Donsker.
To prove the theorem, we need some preparation.
THEOREM 31 A sequence of arbitrary maps Xn = (Xn,t, t ∈ T ) : (Ωn, Gn) → l∞(T )
converges weakly to a tight random element if and only if both of the following conditions
hold:
(i) The sequence (Xn,t1 , · · · ,Xn,tk) converges in distribution in Rk for every finite set
of points t1, · · · , tk in T ;
(ii) for every ǫ, η > 0 there exists a partition of T into finitely many sets T1, · · · , Tk
such that
lim sup_{n→∞} P∗( sup_i sup_{s,t∈Ti} |Xn,s − Xn,t| ≥ ǫ ) ≤ η.
Proof. We only prove sufficiency.
Step 1: A preparation. For each integer m ≥ 1, let T^m_1, · · · , T^m_{km} be a partition of T such that
lim sup_{n→∞} P∗( sup_j sup_{s,t∈T^m_j} |Xn,s − Xn,t| ≥ 2^{−m} ) ≤ 2^{−m}. (8.1)
Since the supremum above becomes smaller as a partition becomes more refined, w.l.o.g. assume the partitions are successive refinements as m increases. Define a semi-metric
ρm(s, t) = 0 if s, t belong to the same partitioning set T^m_j for some j, and ρm(s, t) = 1 otherwise.
Easily, by the nesting of the partitions, ρ1 ≤ ρ2 ≤ · · · . Define
ρ(s, t) = ∑_{m=1}^∞ ρm(s, t)/2^m for s, t ∈ T.
Obviously, ρ(s, t) ≤ ∑_{k=m+1}^∞ 2^{−k} ≤ 2^{−m} when s, t ∈ T^m_j for some j. So (T, ρ) is totally bounded. Let T0 be the countable ρ-dense subset constructed by choosing an arbitrary point t^m_j from every T^m_j.
Step 2: Construct the limit of Xn. For two finite subsets S = (s1, · · · , sp) and U = (s1, · · · , sp, sp+1, · · · , sq) of T, by assumption (i), there are two probability measures µp on R^p
By the same argument as in (8.3), using (8.1), we obtain
lim sup_{n→∞} |E∗f(Xn) − Ef(X)| ≤ ‖f‖∞ 2^{−m} + 2^{−m} + 2^{−m+1} ‖f‖∞.
Letting m → ∞, we have that E∗f(Xn) → Ef(X) as n → ∞.
Let F be a class of measurable functions f : X → R. Recall
a(δ) = δ/√(log N[ ](δ, F, L2(P))),
J[ ](δ, F, L2(P)) = ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ.

LEMMA 8.4 Suppose there are δ > 0 and a measurable function F > 0 such that P∗f^2 < δ^2 and |f| ≤ F for every f ∈ F. Then there exists a universal constant C > 0 such that
C · E∗P ‖Gn‖_F ≤ J[ ](δ, F, L2(P)) + √n P∗( F I{F > √n a(δ)} ).
Proof. Step 1: Truncation. Recall Gnf = (1/√n) ∑_{i=1}^n (f(Xi) − Pf). If |f| ≤ g, then
|Gnf| ≤ (1/√n) ∑_{i=1}^n (g(Xi) + Pg).
It follows that
E∗ ‖Gn( f I{F > √n a(δ)} )‖_F ≤ 2√n P( F I{F > √n a(δ)} ).
We will bound E∗ ‖Gn( f I{F ≤ √n a(δ)} )‖_F next. The bracketing numbers of the class of functions { f I{F ≤ √n a(δ)}; f ∈ F } are smaller than the bracketing numbers of the class F. To simplify notation, we assume, w.l.o.g., |f| ≤ √n a(δ) for every f ∈ F.
Step 2: Discretization of the integral. Choose an integer q0 such that 4δ ≤ 2^{−q0} ≤ 8δ. We claim there exist a nested sequence of partitions {Fqi; 1 ≤ i ≤ Nq} of F, indexed by the integers q ≥ q0, into Nq disjoint subsets, and measurable functions ∆qi ≤ 2F such that
C ∑_{q≥q0} 2^{−q} √(log Nq) ≤ ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ, (8.4)
sup_{f,g∈Fqi} |f − g| ≤ ∆qi and P∆qi^2 ≤ 2^{−2q} (8.5)
for a universal constant C > 0. For convenience, write N(ǫ) = N[ ](ǫ, F, L2(P)). Since N(ǫ) is non-increasing in ǫ,
∫_0^δ √(log N(ǫ)) dǫ ≥ ∫_0^{2^{−(q0+3)}} √(log N(ǫ)) dǫ = ∑_{q=q0+3}^∞ ∫_{2^{−(q+1)}}^{2^{−q}} √(log N(ǫ)) dǫ ≥ ∑_{q=q0+3}^∞ 2^{−(q+1)} √(log N(2^{−(q−3)})).
Re-indexing the sum, we have that
(1/16) ∑_{q=q0}^∞ 2^{−q} √(log Nq) ≤ ∫_0^δ √(log N(ǫ)) dǫ,
where Nq = N(2^{−q}). By the definition of Nq, there exists a cover {Fqi = [lqi, uqi]; 1 ≤ i ≤ Nq} of F by brackets, where lqi and uqi are functions such that P(uqi − lqi)^2 < 2^{−2q}; disjointify it to obtain a partition. Set ∆qi = uqi − lqi. Then ∆qi ≤ 2F, sup_{f,g∈Fqi} |f − g| ≤ ∆qi, and P∆qi^2 ≤ 2^{−2q}.
We can also assume, w.l.o.g., that the partitions {Fqi; 1 ≤ i ≤ Nq} are successively refined as q increases. Actually, at the q-th level we can take intersections of elements from {Fki; 1 ≤ i ≤ Nk} as k goes from q0 to q, namely ∩_{k=q0}^q Fk,ik. The total number of such intersections is no more than N̄q := Nq0 Nq0+1 · · · Nq. Since the current Fqi's become smaller, all requirements on Fqi still hold obviously, except possibly (8.4). Now we verify (8.4). Noticing √(log N̄q) ≤ ∑_{k=q0}^q √(log Nk), we get
∑_{q≥q0} 2^{−q} √(log N̄q) ≤ ∑_{q≥q0} ∑_{k=q0}^q 2^{−q} √(log Nk) = ∑_{k=q0}^∞ ∑_{q≥k} 2^{−q} √(log Nk) = 2 ∑_{k=q0}^∞ 2^{−k} √(log Nk).
So (8.4) still holds when replacing Nq by N̄q.
Step 3: Chaining. For each fixed q ≥ q0, choose a fixed element fqi from each partition set Fqi, and set
πqf = fqi, ∆qf = ∆qi, if f ∈ Fqi. (8.6)
Thus πqf and ∆qf run through a set of Nq functions as f runs through F. Without loss of generality, we can assume
∆qf ≤ ∆q−1f for any f ∈ F and q ≥ q0 + 1. (8.7)
Actually, let ∆̄qi be the measurable cover function of sup_{f,g∈Fqi} |f − g|. Then (8.5) also holds for ∆̄qi. Let Fq−1,j be the partition set at the (q−1)-th level such that Fqi ⊂ Fq−1,j. Then sup_{f,g∈Fqi} |f − g| ≤ sup_{f,g∈Fq−1,j} |f − g|, so ∆̄qi ≤ ∆̄q−1,j. The assertion (8.7) follows by replacing ∆qi with ∆̄qi.
By (8.6), P(πqf − f)^2 ≤ max_i P∆qi^2 < 2^{−2q}. Thus P ∑_q (πqf − f)^2 = ∑_q P(πqf − f)^2 < ∞, so ∑_q (πqf − f)^2 < ∞ a.s. This implies
πqf → f a.s. under P (8.8)
as q → ∞ for any f ∈ F. Define
aq = 2^{−q}/√(log Nq+1),
τ = τ(n, f) = inf{ q ≥ q0 : ∆qf > √n aq }.
The value of τ is taken to be +∞ if the above set is empty; τ is the first time ∆qf > √n aq. By construction, 2a(δ) = 2δ (log N[ ](δ, F, L2(P)))^{−1/2} ≤ aq0, since 2^{−q0} ≥ 4δ and N_{q0+1} = N(2^{−(q0+1)}) ≤ N(δ). By Step 1, we know that |∆qf| ≤ 2√n a(δ) ≤ √n aq0. This says that τ > q0.
We claim that
f − πq0f = ∑_{q=q0+1}^∞ (f − πqf) I{τ = q} + ∑_{q=q0+1}^∞ (πqf − πq−1f) I{τ ≥ q}, a.s. under P. (8.9)
In fact, write f − πq0f = (f − πq1f) + ∑_{q=q0+1}^{q1} (πqf − πq−1f) for q1 > q0. Now,
(i) if τ = ∞, the right hand side above is identical to lim_{q→∞} πqf − πq0f = f − πq0f a.s. by (8.8);
(ii) if τ = q1 < ∞, the right hand side is equal to (f − πq1f) + ∑_{q=q0+1}^{q1} (πqf − πq−1f).
Step 4: Bound the terms in the chain. Apply Gn to both sides of (8.9). For |f| ≤ g, note that |Gnf| ≤ |Gng| + 2√n Pg. By (8.6), |f − πqf| ≤ ∆qf. One obtains that
E∗ ‖∑_{q=q0+1}^∞ Gn( (f − πqf) I{τ = q} )‖_F ≤ ∑_{q=q0+1}^∞ E∗ ‖Gn( ∆qf I{τ = q} )‖_F + 2√n ∑_{q=q0+1}^∞ ‖P( ∆qf I{τ = q} )‖_F. (8.10)
Now, by (8.7), ∆qf I{τ = q} ≤ ∆q−1f I{τ = q} ≤ √n aq−1. Moreover, P(∆qf I{τ = q})^2 ≤ 2^{−2q}. By Lemma 8.2, the first sum on the right of (8.10) is bounded by a multiple of
∑_{q=q0+1}^∞ ( aq−1 log Nq + 2^{−q} √(log Nq) ).
By Hölder's inequality,
P( ∆qf I{τ = q} ) ≤ (P(∆qf)^2)^{1/2} P(τ = q)^{1/2} ≤ 2^{−q} P(∆qf > √n aq)^{1/2} ≤ 2^{−q} (√n aq)^{−1} (P(∆qf)^2)^{1/2} ≤ 2^{−2q} (√n aq)^{−1},
so the last sum in (8.10) is bounded by 2 ∑_{q=q0+1}^∞ 2^{−2q}/aq = 2 ∑_{q=q0+1}^∞ 2^{−q} √(log Nq+1). In summary,
E∗ ‖∑_{q=q0+1}^∞ Gn( (f − πqf) I{τ = q} )‖_F ≤ C ∑_{q=q0+1}^∞ 2^{−q} √(log Nq) (8.11)
for some universal constant C > 0.
Second, there are at most Nq functions πqf − πq−1f, and the indicator I{τ ≥ q} takes only two values. Because the partitions are nested, |πqf − πq−1f| I{τ ≥ q} ≤ ∆q−1f I{τ ≥ q} ≤ √n aq−1. The L2(P)-norm of πqf − πq−1f is bounded by 2^{−q+1}. Applying Lemma 8.2 again, we obtain
E∗ ‖∑_{q=q0+1}^∞ Gn( (πqf − πq−1f) I{τ ≥ q} )‖_F ≤ ∑_{q=q0+1}^∞ ( aq−1 log Nq + 2^{−q} √(log Nq) ) ≤ C ∑_{q=q0+1}^∞ 2^{−q} √(log Nq) (8.12)
for some universal constant C > 0.
At last, we consider πq0f. Because |πq0f| ≤ F ≤ a(δ)√n ≤ √n aq0 and P(πq0f)^2 ≤ δ^2 by assumption, another application of Lemma 8.2 leads to
E∗ ‖Gn πq0f‖_F ≤ aq0 log Nq0 + δ √(log Nq0).
In view of the choice of q0, this is no more than the bound in (8.12). All of the above inequalities together with (8.4) yield the desired result.
COROLLARY 8.2 For any class F of measurable functions with envelope function F, there exists a universal constant C such that
E∗P ‖Gn‖_F ≤ C · J[ ](‖F‖P,2, F, L2(P)).

Proof. Since F is covered by the single bracket [−F, F], we have N[ ](δ, F, L2(P)) = 1 for δ = 2‖F‖P,2. Review the definitions in Lemma 8.4 and choose δ = ‖F‖P,2. It follows that
a(δ) = ‖F‖P,2 / √(log N[ ](‖F‖P,2, F, L2(P))).
Now, by Markov's inequality,
√n P∗( F I(F > √n a(δ)) ) ≤ ‖F‖P,2^2 / a(δ) = ‖F‖P,2 √(log N[ ](‖F‖P,2, F, L2(P))),
which is bounded by J[ ](‖F‖P,2, F, L2(P)) since the integrand in J[ ] is non-increasing in ǫ and hence
∫_0^{‖F‖P,2} √(log N[ ](ǫ, F, L2(P))) dǫ ≥ ‖F‖P,2 √(log N[ ](‖F‖P,2, F, L2(P))).
Proof of Theorem 30. We will use Theorem 31 to prove this theorem. Part (i) is easily satisfied; now we verify (ii).
Note there is no envelope assumed on F and we do not know whether Pf^2 < δ^2. Let G = {f − g; f, g ∈ F}. From a given set of ǫ-brackets {[li, ui]} over F, we can construct 2ǫ-brackets over G by taking differences [li − uj, ui − lj] of upper and lower bounds; indeed, ‖(ui − lj) − (li − uj)‖P,2 ≤ 2ǫ. Therefore the bracketing number N[ ](2ǫ, G, L2(P)) is bounded by the square of the bracketing number N[ ](ǫ, F, L2(P)):
N[ ](2ǫ, G, L2(P)) ≤ N[ ](ǫ, F, L2(P))^2.
This says that
J[ ](ǫ, G, L2(P)) < ∞. (8.13)
For a given small δ > 0, by the definition of N[ ](δ, F, L2(P)), choose a minimal number of brackets of size δ that cover F, and use them to form a partition F = ∪i Fi. The subset of G consisting of differences f − g of functions f and g belonging to the same partitioning set consists of functions of L2(P)-norm smaller than δ. Hence, by Lemma 8.4, there exist a finite number a(δ) and a universal constant C such that
C · E∗ sup_i sup_{f,g∈Fi} |Gn(f − g)| = C E∗ sup_{h∈H} |Gnh| ≤ J[ ](δ, H, L2(P)) + 2√n P( F I(F > a(δ)√n) ) ≤ J[ ](δ, G, L2(P)) + 2√n P( F I(F > a(δ)√n) ),
since H := {f − g; f, g ∈ Fi for some i} ⊂ G. Here the envelope function F can be taken equal to the supremum of the absolute values of the upper and lower bounds of finitely many brackets that cover F, for instance a minimal set of brackets of size 1; this F is square integrable. The second term above is bounded by a(δ)^{−1} P( F^2 I(F > a(δ)√n) ), which goes to 0 as n → ∞ for fixed δ by dominated convergence. First let n → ∞, then let δ ↓ 0; the left hand side goes to 0. Combined with Markov's inequality, part (ii) of Theorem 31 follows.
Example. Let F = {ft = I(−∞, t]; t ∈ R}. Then
‖ft − fs‖P,2 = (F(t) − F(s))^{1/2} for s < t,
where F is the cdf of P. Cut the range [0, 1] of F into pieces of length less than ǫ^2. Since F is nondecreasing and right continuous with left limits, there exists a partition −∞ = t0 < t1 < · · · < tk = ∞ with F(ti−) − F(ti−1) < ǫ^2 for each i, so that [I(−∞, ti−1], I(−∞, ti)] is an ǫ-bracket for i = 1, 2, · · · , k. It follows that
N[ ](ǫ, F, L2(P)) ≤ 2/ǫ^2 for ǫ ∈ (0, 1).
Moreover, ∫_0^1 √(log(2/ǫ^2)) dǫ < ∞, so J[ ](1, F, L2(P)) < ∞. The previous classical Glivenko-Cantelli Lemma 26 and Donsker Theorem 27 follow from Theorems 29 and 30.
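The bracket construction above can be made concrete. A sketch under the assumption P = Uniform[0, 1] (so F(t) = t and ‖ft − fs‖P,2 = √(t − s) for s < t; `bracket_points` is a hypothetical helper):

```python
import numpy as np

def bracket_points(eps):
    """Partition points giving eps-brackets for {1(-inf,t]; t in R}
    under P = Uniform[0,1]: cutting [0,1] into k = ceil(1/eps^2)
    pieces of equal length gives brackets of L2(P)-size at most eps,
    and k <= 2/eps^2 pieces suffice."""
    k = int(np.ceil(1.0 / eps**2))
    return np.linspace(0.0, 1.0, k + 1), k

for eps in (0.5, 0.2, 0.1):
    ts, k = bracket_points(eps)
    print(eps, k, np.sqrt(np.diff(ts)).max())   # bracket count and max size
```

Each printed maximum bracket size stays at or below ǫ while the count stays below 2/ǫ², matching the entropy bound used above.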
Since ‖f‖ = sup_t |f(t)| is the norm in l∞, it is continuous. By the continuous mapping theorem, we have
sup_{t∈R} √n |Fn(t) − F(t)| → sup_{t∈R} |G(t)|
weakly, i.e., in distribution, as n → ∞, where G(t) is the Gaussian process we mentioned before.
Example. Let F = {fθ; θ ∈ Θ} be a collection of measurable functions with Θ ⊂ R^d bounded. Suppose Pm^r < ∞ and |fθ1(x) − fθ2(x)| ≤ m(x)‖θ1 − θ2‖ for any θ1 and θ2. Then there is a constant K, depending only on d, such that
N[ ](ǫ‖m‖P,r, F, Lr(P)) ≤ K (diam(Θ)/ǫ)^d (8.14)
for any 0 < ǫ < diam(Θ). Indeed, as long as ‖θ1 − θ2‖ < ǫ, we have l(x) := fθ1(x) − m(x)ǫ ≤ fθ2(x) ≤ fθ1(x) + m(x)ǫ =: u(x). Also, ‖u − l‖P,r = 2ǫ‖m‖P,r, so [l, u] is a 2ǫ‖m‖P,r-bracket; the factor 2 can be absorbed into K. It thus suffices to count the minimal number of balls of radius ǫ needed to cover Θ.
Note that Θ ⊂ R^d is contained in a cube with side length diam(Θ). We can cover Θ with fewer than (2 diam(Θ)/ǫ)^d cubes of side ǫ. The circumscribed balls have radius a constant multiple of ǫ and also cover Θ; intersecting these balls with Θ still covers Θ. So the claim is true. Taking r = 2, the entropy integral is finite, so F is a P-Donsker class whenever Pm^2 < ∞.
Example (Sobolev classes). For k ≥ 1, let
F = { f : [0, 1] → R; ‖f‖∞ ≤ 1 and ∫_0^1 (f^{(k)}(x))^2 dx ≤ 1 }.
Then there exists a universal constant K such that
log N[ ](ǫ, F, ‖·‖∞) ≤ K (1/ǫ)^{1/k}.
Since ‖f‖P,2 ≤ ‖f‖∞ for any P, it is easy to check that N[ ](ǫ, F, L2(P)) ≤ N[ ](ǫ, F, ‖·‖∞) for any P. So F is a P-Donsker class for any P.
Example (Bounded Variation). Let
F = { f : R → [−1, 1] of total variation bounded by 1 }.
Any function of bounded variation is the difference of two monotone increasing functions. Then for any r ≥ 1 and probability measure P,
log N[ ](ǫ, F, Lr(P)) ≤ K (1/ǫ).
Therefore F is P-Donsker for every P.
9 Consistency and Asymptotic Normality of Maximum Like-
lihood Estimators
9.1 Consistency
Let X1, X2, · · · , Xn be a random sample from a population distribution with pdf or pmf f(x|θ), where θ is an unknown parameter. Under certain conditions on f(x|θ), the MLE θ̂ will be consistent and satisfy a Central Limit Theorem, but in some cases these conclusions fail. Let us see a good example and a pathological example.
Example. Let X1, X2, · · · , Xn be i.i.d. from Exp(θ) with pdf
f(x|θ) = θe^{−θx} if x > 0, and 0 otherwise.
So EX1 = 1/θ and Var(X1) = 1/θ^2. It is easy to check that the MLE is θ̂ = 1/X̄. By the CLT,
√n (X̄ − 1/θ) =⇒ N(0, θ^{−2})
as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX1) = θ and g′(EX1) = −θ^2. By the Delta method,
√n (θ̂ − θ) =⇒ N(0, g′(µ)^2 σ^2) = N(0, θ^2)
as n → ∞, where µ = EX1 and σ^2 = Var(X1). Of course, θ̂ → θ in probability.
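The Delta-method conclusion is easy to check by simulation; a minimal sketch (assuming NumPy; variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

theta, n, reps = 2.0, 500, 20000
# X ~ Exp(theta) with density theta*e^{-theta x}, so E X = 1/theta and
# the MLE is 1/Xbar; the Delta method predicts
# sqrt(n)(1/Xbar - theta) => N(0, theta^2).
X = rng.exponential(scale=1.0 / theta, size=(reps, n))
Z = np.sqrt(n) * (1.0 / X.mean(axis=1) - theta)
print(Z.mean(), Z.std())
```

The sample mean of Z is close to 0 (up to a small finite-sample bias of order θ/√n) and its standard deviation is close to θ = 2.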
Example. Let X1, X2, · · · , Xn be i.i.d. from U[0, θ], where θ is unknown. The MLE is θ̂ = max_i Xi. First, it is easy to check that θ̂ → θ in probability. But θ̂ does not satisfy a CLT. In fact, since θ̂ ≤ θ,
P( n(θ − θ̂)/θ ≤ x ) → 1 − e^{−x} if x > 0, and 0 otherwise,
as n → ∞.
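This non-normal limit can also be checked by simulation; a sketch (assuming NumPy; names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

theta, n, reps = 3.0, 1000, 20000
# MLE for U[0, theta] is max_i X_i, and n(theta - max)/theta => Exp(1),
# since P(n(theta - max)/theta <= x) = 1 - (1 - x/n)^n -> 1 - e^{-x}.
M = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
Z = n * (theta - M) / theta
for x in (0.5, 1.0, 2.0):
    print(x, np.mean(Z <= x), 1.0 - np.exp(-x))
```

The empirical cdf of Z matches 1 − e^{−x} closely; note the n (not √n) scaling, which is why no CLT holds here.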
For some cases even the consistency, i.e., θ → θ in probability, doesn’t hold. Next, we
will study some sufficient conditions for consistency. Later we provide sufficient conditions
for CLT.
Let X1, · · · , Xn be a random sample from a density pθ with respect to a reference measure µ, that is, Pθ(X1 ∈ A) = ∫_A pθ(x) µ(dx), where θ ∈ Θ. The maximum likelihood estimator θ̂n maximizes the function h(θ) := ∑_i log pθ(Xi) over Θ, or equivalently the function
Mn(θ) = (1/n) ∑_{i=1}^n log (pθ/pθ0)(Xi),
where θ0 is the true parameter. Under suitable conditions, by the weak law of large numbers,
Mn(θ) → M(θ) := Eθ0 log (pθ/pθ0)(X1) = ∫ pθ0(x) log (pθ/pθ0)(x) µ(dx) (9.1)
in probability as n → ∞. The number −M(θ) is called the Kullback-Leibler divergence of pθ and pθ0. Let Y = pθ0(Z)/pθ(Z), where Z ∼ pθ. Then EθY = 1 and
M(θ) = −Eθ(Y log Y) ≤ −(EθY) log(EθY) = 0
by Jensen's inequality, since x log x is convex. Of course M(θ0) = 0, that is, θ0 attains the maximum of M(θ). Obviously, M(θ0) = Mn(θ0) = 0. The following theorem gives the consistency of θ̂n.
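In the exponential family of the earlier example, M(θ) has a closed form and the maximization at θ0 can be seen directly. A sketch (assuming NumPy; the formula below is derived, not quoted, from Exp(θ) densities θe^{−θx} and Eθ0 X1 = 1/θ0):

```python
import numpy as np

theta0 = 2.0

def M(theta):
    """M(theta) = E_{theta0} log(p_theta/p_theta0)(X1) for Exp(theta):
    log(theta/theta0) - (theta - theta0)*E_{theta0} X1
      = log(theta/theta0) - (theta - theta0)/theta0."""
    return np.log(theta / theta0) - (theta - theta0) / theta0

grid = np.linspace(0.5, 5.0, 1000)
vals = M(grid)
print(grid[np.argmax(vals)], vals.max())  # maximizer near theta0, max near 0
```

The maximum value 0 is attained at θ = θ0 and M(θ) < 0 elsewhere, which is exactly the separation condition exploited by the consistency theorem below.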
THEOREM 32 Suppose sup_{θ∈Θ} |Mn(θ) − M(θ)| →P 0 and sup_{θ: d(θ,θ0)≥ǫ} M(θ) < M(θ0) for any ǫ > 0. Then θ̂n → θ0 in probability.

Proof. For any ǫ > 0, we need to show that
P(d(θ̂n, θ0) ≥ ǫ) → 0 (9.2)
as n → ∞. By the condition, there exists δ > 0 such that
sup_{θ: d(θ,θ0)≥ǫ} M(θ) < M(θ0) − δ.
Thus, if d(θn, θ0) ≥ ǫ, then M(θn) < M(θ0) − δ. Note that
Since Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability, and Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ), the left hand
side goes to one.
Case 2. Suppose Ψn(θ) is nondecreasing satisfying Ψn(θn) = oP (1). Then
P (|θn − θ0| ≥ ǫ) ≤ P (θn > θ0 + ǫ) + P (θn < θ0 − ǫ)
≤ P (Ψn(θn) ≥ Ψn(θ0 + ǫ)) + P (Ψn(θn) ≤ Ψn(θ0 − ǫ)).
Now, Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability. This together with Ψn(θn) = oP (1) shows that
P (|θn − θ0| ≥ ǫ) → 0 as n→ ∞.
Example. Let X1, · · · , Xn be a random sample from Exp(θ), that is, the density function is pθ(x) = θ^{−1} exp(−x/θ) I(x ≥ 0) for θ > 0. Under this parametrization EX1 = θ, and under the true model the MLE θ̂n = X̄n → θ0 in probability. Let us verify that this conclusion can indeed be deduced from Theorem 33.
Actually, ψθ(x) = (∂/∂θ) log pθ(x) = −θ^{−1} + xθ^{−2} for x ≥ 0. Thus Ψn(θ) = −θ^{−1} + X̄n θ^{−2}, which converges to Ψ(θ) = Eθ0(−θ^{−1} + X1θ^{−2}) = θ^{−2}(θ0 − θ), positive or negative according to whether θ is smaller or bigger than θ0. Also, Ψn(θ̂n) = 0 at θ̂n = X̄n. Applying Theorem 33 to −Ψn and −Ψ, we obtain the consistency result.
9.2 Asymptotic Normality
Now we study the central limit theorems for the MLE.
Now we illustrate the idea of showing the normality of the MLE. Recall (9.3). Do a Taylor expansion of ψθ(Xi) around θ0:
ψ_{θ̂n}(Xi) ≈ ψθ0(Xi) + (θ̂n − θ0) ψ′θ0(Xi) + (1/2)(θ̂n − θ0)^2 ψ′′θ0(Xi),
where ψ′θ and ψ′′θ denote the first and second derivatives of ψθ in θ. We will use the following notation:
Pf = ∫ f(x) P(dx) for any real function f(x), and Pn = (1/n) ∑_{i=1}^n δXi.
Then Pnf = (1/n) ∑_{i=1}^n f(Xi). Thus,
0 = Ψn(θ̂n) ≈ Pnψθ0 + (θ̂n − θ0) Pnψ′θ0 + (1/2)(θ̂n − θ0)^2 Pnψ′′θ0.
Reorganize it in the following form:
√n(θ̂n − θ0) ≈ −√n(Pnψθ0) / ( Pnψ′θ0 + (1/2)(θ̂n − θ0) Pnψ′′θ0 ).
Recall the Fisher information
I(θ0) = Eθ0( ∂ log pθ(X1)/∂θ |_{θ=θ0} )^2 = Eθ0(ψθ0(X1))^2 = −Eθ0( ∂^2 log pθ(X1)/∂θ^2 |_{θ=θ0} ) = −Eθ0(ψ′θ0(X1)).
By the CLT and the LLN, √n(Pnψθ0) =⇒ N(0, I(θ0)), Pnψ′θ0 → −I(θ0) and Pnψ′′θ0 → Eθ0 ψ′′θ0(X1) in probability. This illustrates that
√n(θ̂n − θ0) =⇒ N(0, I(θ0)^{−1})
as n → ∞. The next two theorems will make these steps rigorous.
Let g(θ) = Eθ0(mθ(X1)) = ∫ mθ(x) pθ0(x) µ(dx). We need the following condition:
g(θ) = g(θ0) + (1/2)(θ − θ0)^T Vθ0 (θ − θ0) + o(‖θ − θ0‖^2), (9.1)
where
Vθ0 = ( Eθ0( ∂^2 mθ(X1)/∂θi∂θj |_{θ=θ0} ) )_{1≤i,j≤d}, θ = (θ1, · · · , θd) ∈ R^d.
THEOREM 34 For each θ in an open subset of Euclidean space, let x 7→ mθ(x) be a measurable function such that θ 7→ mθ(x) is differentiable at θ0 for P-almost every x with derivative ṁθ0(x), and such that, for every θ1 and θ2 in a neighborhood of θ0 and a measurable function n(x) with En(X1)^2 < ∞,
|mθ1(x) − mθ2(x)| ≤ n(x) ‖θ1 − θ2‖.
Furthermore, assume the map θ 7→ Emθ(X1) has the expansion (9.1). If Pnm_{θ̂n} ≥ sup_θ Pnmθ − oP(n^{−1}) and θ̂n →P θ0, then
√n(θ̂n − θ0) = −Vθ0^{−1} (1/√n) ∑_{i=1}^n ṁθ0(Xi) + oP(1).
In particular, √n(θ̂n − θ0) =⇒ N(0, Vθ0^{−1} E(ṁθ0 ṁθ0^T) Vθ0^{−1}) as n → ∞.
A statistical model {pθ; θ ∈ Θ} is called differentiable in quadratic mean at θ0 if there exists a measurable vector-valued function lθ0 such that, as θ → θ0,
∫ [ √pθ − √pθ0 − (1/2)(θ − θ0)^T lθ0 √pθ0 ]^2 dµ = o(‖θ − θ0‖^2).
THEOREM 35 Suppose that the model {Pθ : θ ∈ Θ} is differentiable in quadratic mean at an inner point θ0 of Θ ⊂ R^k. Furthermore, suppose that there exists a measurable function l(x) with Eθ0 l(X1)^2 < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0,
|log pθ1(x) − log pθ2(x)| ≤ l(x) ‖θ1 − θ2‖.
If the Fisher information matrix Iθ0 is non-singular and θ̂n is consistent, then
√n(θ̂n − θ0) = Iθ0^{−1} (1/√n) ∑_{i=1}^n lθ0(Xi) + oP(1).
In particular, √n(θ̂n − θ0) =⇒ N(0, Iθ0^{−1}) as n → ∞, where
Iθ0 = −( E ∂^2 log pθ(X1)/∂θi∂θj |_{θ=θ0} )_{1≤i,j≤k}.
We need some preparations for proving the above theorems.
Given functions x 7→ mθ(x), θ ∈ R^d, we need conditions that ensure that, for a given sequence rn → ∞ and any sequence hn = O∗P(1),
Gn( rn(m_{θ0+hn/rn} − mθ0) − hn^T ṁθ0 ) →P 0. (9.2)
LEMMA 9.1 For each θ in an open subset of Euclidean space, let mθ(x) be measurable as a function of x for each θ and, as a function of θ, differentiable at θ0 for almost every x (w.r.t. P) with derivative ṁθ0(x); suppose moreover there is a measurable function m with Pm^2 < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0,
|mθ1(x) − mθ2(x)| ≤ m(x) ‖θ1 − θ2‖.
Then (9.2) holds for every random sequence hn that is bounded in probability.

Proof. Because hn is bounded in probability, to show (9.2) it is enough, w.l.o.g., to show
sup_{‖θ‖≤1} |Gn( rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0 )| →P 0
as n → ∞. Define
Fn = { fθ := rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0 ; ‖θ‖ ≤ 1 }.
Then
|fθ1(x) − fθ2(x)| ≤ 2m(x) ‖θ1 − θ2‖
for any θ1 and θ2 in the unit ball of R^d, by the Lipschitz condition (which also gives |ṁθ0| ≤ m). Further, set
Hn = sup_{‖θ‖≤1} |rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0|.
Then Hn is an envelope of Fn, and Hn → 0 as n → ∞ by the definition of the derivative. Since Hn ≤ 2m, by the Dominated Convergence Theorem, δn := (PHn^2)^{1/2} → 0. Thus, by Corollary 8.2 and (8.14),
E∗P ‖Gn‖_{Fn} ≤ C · J[ ](‖Hn‖P,2, Fn, L2(P)) ≤ C ∫_0^{δn} √(log(Kǫ^{−d})) dǫ → 0
as n → ∞. The desired conclusion follows.
We need some preparation before proving Theorem 34.
Let Pn be the empirical distribution of a random sample of size n from a distribution P, and, for every θ in a metric space (Θ, d), let mθ(x) be a measurable function. Let θ̂n (nearly) maximize the criterion function Pnmθ. The point θ0 is the truth, that is, the maximizer of θ 7→ Pmθ over Θ. Recall Gn = √n(Pn − P).
THEOREM 36 (Rate of Convergence) Assume that for fixed constants C and α > β, for every n and every sufficiently small δ > 0,
sup_{d(θ,θ0)>δ} P(mθ − mθ0) ≤ −Cδ^α,
E∗ sup_{d(θ,θ0)<δ} |Gn(mθ − mθ0)| ≤ Cδ^β.
If the sequence θ̂n satisfies Pnm_{θ̂n} ≥ Pnmθ0 − OP(n^{α/(2β−2α)}) and converges in outer probability to θ0, then n^{1/(2α−2β)} d(θ̂n, θ0) = O∗P(1).
Proof. Set rn = n^{1/(2α−2β)} and write Pnm_{θ̂n} ≥ Pnmθ0 − Rn with 0 ≤ Rn = OP(n^{α/(2β−2α)}) = OP(rn^{−α}).
Partition the parameter set into the shells Sj,n = {θ : 2^{j−1} < rn d(θ, θ0) ≤ 2^j}, j ranging over the integers. If rn d(θ̂n, θ0) ≥ 2^M for a given M, then θ̂n is in one of the shells Sj,n with j ≥ M, and for that j, sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −Rn. It follows that
P∗(rn d(θ̂n, θ0) ≥ 2^M) ≤ ∑_{j≥M, 2^j≤ǫrn} P∗( sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −K rn^{−α} ) + P∗(2 d(θ̂n, θ0) ≥ ǫ) + P(rn^α Rn ≥ K). (9.3)
The middle term on the right goes to zero; the last term can be made arbitrarily small by choosing K large. We only need to show the sum is arbitrarily small for M large enough as n → ∞. By the first condition,
sup_{θ∈Sj,n} P(mθ − mθ0) ≤ −C 2^{(j−1)α} rn^{−α}.
For M such that (1/2) C 2^{(M−1)α} ≥ K, by the fact that sup_s(fs + gs) ≤ sup_s fs + sup_s gs,
P∗( sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −K rn^{−α} ) ≤ P∗( sup_{θ∈Sj,n} |Gn(mθ − mθ0)| ≥ C√n 2^{(j−1)α}/(2 rn^α) ).
Therefore, the sum in (9.3) is bounded by
∑_{j≥M, 2^j≤ǫrn} P∗( sup_{θ∈Sj,n} |Gn(mθ − mθ0)| ≥ C√n 2^{(j−1)α}/(2 rn^α) ) ≤ ∑_{j≥M} (2^j/rn)^β · 2 rn^α/(√n 2^{(j−1)α})
by Markov's inequality, the second condition, and the definition of rn. The right hand side is a constant multiple of ∑_{j≥M} 2^{j(β−α)} and hence goes to zero as M → ∞.
COROLLARY 9.1 For each θ in an open subset of Euclidean space, let mθ(x) be a measurable function such that there exists m(x) with Pm^2 < ∞ satisfying
|mθ1(x) − mθ2(x)| ≤ m(x) ‖θ1 − θ2‖.
Furthermore, assume
Pmθ = Pmθ0 + (1/2)(θ − θ0)^T Vθ0 (θ − θ0) + o(‖θ − θ0‖^2) (9.4)
with Vθ0 nonsingular. If Pnm_{θ̂n} ≥ Pnmθ0 − OP(n^{−1}) and θ̂n →P θ0, then √n(θ̂n − θ0) = OP(1).

Proof. (9.4) implies that the first condition of Theorem 36 holds with α = 2, since Vθ0 is nonsingular (and negative definite at the maximizer θ0). We now apply Corollary 8.2 to the class of functions F = {mθ − mθ0; ‖θ − θ0‖ < δ} to see the second condition is valid with β = 1. This class has envelope function F = δm, and
E∗ sup_{‖θ−θ0‖<δ} |Gn(mθ − mθ0)| ≤ C ∫_0^{δ‖m‖P,2} √(log N[ ](ǫ, F, L2(P))) dǫ ≤ C ∫_0^{δ‖m‖P,2} √(log(K(δ/ǫ)^d)) dǫ = C1 δ
by (8.14) with diam(Θ) a constant multiple of δ, where C1 depends on ‖m‖P,2.
Proof of Theorem 34. By Lemma 9.1 with rn = √n,
Gn( √n(m_{θ0+hn/√n} − mθ0) − hn^T ṁθ0 ) →P 0. (9.5)
Expanding P(m_{θ0+hn/√n} − mθ0) by condition (9.1), we obtain
nPn(m_{θ0+hn/√n} − mθ0) = (1/2) hn^T Vθ0 hn + hn^T Gnṁθ0 + oP(1)
for every sequence hn bounded in probability. By Corollary 9.1, √n(θ̂n − θ0) is bounded in probability (i.e., tight). Taking hn = √n(θ̂n − θ0) and hn = −Vθ0^{−1} Gnṁθ0, we then obtain the Taylor expansions of Pnm_{θ̂n} and Pnm_{θ0 − Vθ0^{−1} Gnṁθ0/√n} as follows:
nPn(m_{θ̂n} − mθ0) = (1/2) hn^T Vθ0 hn + hn^T Gnṁθ0 + oP(1),
nPn(m_{θ0 − Vθ0^{−1} Gnṁθ0/√n} − mθ0) = −(1/2) (Gnṁθ0)^T Vθ0^{−1} Gnṁθ0 + oP(1),
where the second identity is obtained through a bit of algebra. By the definition of θ̂n, the left hand side of the first equation is at least that of the second one up to oP(1); so are the right hand sides. Taking the difference and completing the square, we have
(1/2) (hn + Vθ0^{−1} Gnṁθ0)^T Vθ0 (hn + Vθ0^{−1} Gnṁθ0) + oP(1) ≥ 0.
Since Vθ0 is strictly negative-definite, the quadratic form must converge to zero in probability; so does ‖hn + Vθ0^{−1} Gnṁθ0‖. That is, hn = −Vθ0^{−1} Gnṁθ0 + oP(1).
10 Appendix

Let A be a collection of subsets of Ω and let B be generated by A, that is, B = σ(A). Let P be a probability measure on (Ω, B).

LEMMA 10.1 Suppose A has the following properties: (i) Ω ∈ A; (ii) A^c ∈ A if A ∈ A; and (iii) ∪_{i=1}^m Ai ∈ A if Ai ∈ A for all 1 ≤ i ≤ m. Then, for any B ∈ B and ǫ > 0, there exists A ∈ A such that P(B∆A) < ǫ.

Proof. Let B′ be the set of B ∈ B satisfying the conclusion. Obviously, A ⊂ B′ ⊂ B. It is enough to verify that B′ is a σ-algebra.
It is easy to see that (i) Ω ∈ B′ and (ii) B^c ∈ B′ if B ∈ B′, since A∆B = A^c∆B^c. (iii) If Bi ∈ B′ for i ≥ 1, there exist Ai ∈ A such that P(Bi∆Ai) < ǫ/2^{i+1} for all i ≥ 1. Evidently, ∪_{i=1}^n Bi ↑ ∪_{i=1}^∞ Bi as n → ∞. Therefore there exists n0 < ∞ such that |P(∪_{i=1}^∞ Bi) − P(∪_{i=1}^{n0} Bi)| ≤ ǫ/2. It is easy to check that (∪_{i=1}^{n0} Bi)∆(∪_{i=1}^{n0} Ai) ⊂ ∪_{i=1}^{n0}(Bi∆Ai). Write B = ∪_{i=1}^∞ Bi, B̃ = ∪_{i=1}^{n0} Bi and A = ∪_{i=1}^{n0} Ai. Then A ∈ A. Note B∆A ⊂ (B\B̃) ∪ (B̃∆A). The above facts show that P(B∆A) ≤ ǫ/2 + ∑_{i=1}^{n0} ǫ/2^{i+1} < ǫ. Thus B′ is a σ-algebra.
LEMMA 10.2 Let X1, X2, · · · , Xm, m ≥ 1, be random variables defined on (Ω, F, P). Let f(x1, · · · , xm) be a real measurable function with E|f(X1, · · · , Xm)|^p < ∞ for some p ≥ 1. Then there exist {fn(X1, · · · , Xm); n ≥ 1} such that
(i) fn(X1, · · · , Xm) → f(X1, · · · , Xm) a.s.;
(ii) fn(X1, · · · , Xm) → f(X1, · · · , Xm) in Lp(Ω, F, P);
(iii) for each n ≥ 1, fn(X1, · · · , Xm) = ∑_{i=1}^{kn} ci gi1(X1) · · · gim(Xm) for some kn < ∞, constants ci, and gij(Xj) = I_{Ai,j}(Xj) for some sets Ai,j ∈ B(R), for all 1 ≤ i ≤ kn and 1 ≤ j ≤ m.

Proof. To save notation, we write f = f(x1, · · · , xm). Since E|f I(|f| ≤ C) − f|^p → 0 as C → ∞, choose Ck such that E|f I(|f| ≤ Ck) − f|^p ≤ 1/k^2 for k = 1, 2, · · · . We will show that there exists a function gk = gk(x1, · · · , xm) of the form in (iii) such that E|f I(|f| ≤ Ck) − gk|^p ≤ 1/k^2 for all k ≥ 1. Then, by Minkowski's inequality, E|f − gk|^p ≤ 2^p/k^2 for all k ≥ 1, and assertion (ii) follows. Also, this implies E(∑_{k≥1} |f − gk|^p) < ∞, hence ∑_{k≥1} |f − gk|^p < ∞ a.s., and we obtain (i). Therefore, to prove this lemma we may assume w.l.o.g. that f is bounded, and we need to show that there exist {fn = fn(x1, · · · , xm); n ≥ 1} of the form in (iii) such that
E|f − fn|^p ≤ 1/n^2 (10.6)
for all n ≥ 1.
Since f is bounded, for any n ≥ 1 there exists hn such that
sup_{x∈R^m} |f(x) − hn(x)| < 1/(2n^2), (10.7)
where hn is a simple function, i.e., hn(x) = ∑_{i=1}^{kn} ci I{x ∈ Bi} for some kn < ∞, constants ci, and sets Bi ∈ B(R^m).
Now set X = (X1, · · · , Xm) ∈ R^m and let µ be the distribution of X under P. Let A be the set of all finite unions of sets in A1 := { ∏_{l=1}^m Al ∈ B(R^m); Al ∈ B(R) }. By the construction of B(R^m), we know that B(R^m) = σ(A). It is not difficult to verify that A satisfies the conditions in Lemma 10.1. Thus there exist Ei ∈ A such that
∫ |I_{Bi}(x) − I_{Ei}(x)|^p dµ = µ(Bi∆Ei) < 1/(2cn^2)^p and Ei = ∪_{j=1}^{ki} ∏_{l=1}^m Ai,j,l,
where c = 1 + ∑_{i=1}^{kn} |ci| and Ai,j,l ∈ B(R) for all i, j and l. Now, since ‖·‖p is a norm, we have
‖hn(X) − ∑_{i=1}^{kn} ci I_{Ei}(X)‖p ≤ ∑_{i=1}^{kn} |ci| · ‖I_{Ei}(X) − I_{Bi}(X)‖p ≤ 1/(2n^2). (10.8)
Note that the intersection of any finite number of product sets ∏_{l=1}^m Ai,j,l is still in A1. By the inclusion-exclusion formula, I_{Ei}(X) is a finite linear combination of indicators I_F(X) with F ∈ A1. Thus, fn(X) := ∑_{i=1}^{kn}