Class Notes of Stat 8112

1 Bayes estimators

Here are three methods of estimating parameters: (1) MLE; (2) moment method; (3) Bayes method.

An example of the Bayes argument: let X ∼ F(x|θ), θ ∈ H. We want to estimate g(θ) ∈ R. Suppose t(X) is an estimator and look at MSE_t(θ) = E_θ(t(X) − g(θ))². The problem is that MSE_t(θ) depends on θ, so minimizing it at one point may cost at other points. The Bayes idea is to average MSE_t(θ) over θ and then minimize over t. Thus we pretend to have a distribution for θ, say π, and look at

H(t) = E(t(X) − g(θ))²,

where E now refers to the joint distribution of X and θ; that is,

E(t(X) − g(θ))² = ∫∫ (t(x) − g(θ))² F(dx|θ) π(dθ).   (1.1)

Next pick t(X) to minimize H(t). The minimizer is called the Bayes estimator.

LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability space and H is the set of functions from R to R. Then

min_{h∈H} E(Z − h(W))² = E(Z − E(Z|W))².

That is, the minimizer above is E(Z|W).

Proof. Note that

E(Z − h(W))² = E(Z − E(Z|W) + E(Z|W) − h(W))²
= E(Z − E(Z|W))² + E(E(Z|W) − h(W))² + 2E{(Z − E(Z|W))(E(Z|W) − h(W))}.

Conditioning on W, we see that the cross term is zero. Thus

E(Z − h(W))² = E(Z − E(Z|W))² + E(E(Z|W) − h(W))².
One can see that µ→ µ0 as τ0 → 0 and µ→ X as τ0 → ∞.
Example. Let X ∼ N_p(θ, I_p) and π(θ) = N_p(0, τI_p), τ > 0. The Bayes estimator is the posterior mean E(θ|X). We claim that

(X, θ) ∼ N( (0, 0), [ (1+τ)I_p, τI_p ; τI_p, τI_p ] ).

Indeed, by the prior we know that Eθ = 0 and Cov(θ) = τI_p. By a conditioning argument one can verify that E(Xθ′) = τI_p and E(XX′) = (1+τ)I_p. So by the conditional distribution of normal random variables,

θ|X ∼ N( (τ/(1+τ)) X, (τ/(1+τ)) I_p ).

So the Bayes estimator is t0(X) = τX/(1+τ).
The usual MVUE is t1(X) = X. It follows that

MSE_{t1}(θ) = E‖X − θ‖² = p,

which doesn't depend on θ. Now

MSE_{t0}(θ) = E_θ‖ τX/(1+τ) − θ ‖² = E_θ‖ (τ/(1+τ))(X − θ) − θ/(1+τ) ‖²
= (τ/(1+τ))² p + ‖θ‖²/(1+τ)²,   (1.2)

which goes to p as τ → ∞. What happens to the prior when τ → ∞? It "converges to" Lebesgue measure, the "uniform distribution over the real line," which carries no information (recall the information of X ∼ N_p(0, (1+τ)I_p) is p/(1+τ), and that of θ_i, 1 ≤ i ≤ p, i.i.d. uniform over [−τ, τ] is p/(2τ²); both go to zero and the latter goes faster than the former). This explains why the MSE of the former and the limiting MSE are identical.
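Formula (1.2) can be checked by simulation. The sketch below, with p, τ, θ and the seed chosen arbitrarily for illustration, compares a Monte Carlo estimate of the risk of t0(X) = τX/(1+τ) against the closed form (τ/(1+τ))²p + ‖θ‖²/(1+τ)².

```python
import numpy as np

rng = np.random.default_rng(0)
p, tau = 4, 2.0
theta = np.ones(p)  # an arbitrary fixed parameter point

# Monte Carlo estimate of E_theta || tau*X/(1+tau) - theta ||^2
n_sim = 200_000
X = theta + rng.standard_normal((n_sim, p))       # X ~ N_p(theta, I_p)
t0 = tau / (1 + tau) * X                          # Bayes estimator
mse_mc = np.mean(np.sum((t0 - theta) ** 2, axis=1))

# closed form (1.2)
mse_exact = (tau / (1 + tau)) ** 2 * p + np.sum(theta ** 2) / (1 + tau) ** 2
print(mse_mc, mse_exact)
```

With these values the closed form equals (2/3)²·4 + 4/9 = 20/9, and the Monte Carlo average agrees to a few decimal places.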
Now let

M = inf_t sup_θ E_θ‖t(X) − θ‖²,

called the minimax risk, and any estimator t* with

M = sup_θ E_θ‖t*(X) − θ‖²

is called a minimax estimator.
THEOREM 2 For any p ≥ 1, t0(X) = X is minimax.

Proof. First,

M = inf_t sup_θ E_θ‖t(X) − θ‖² ≤ sup_θ E_θ‖X − θ‖² = p.

Second,

M = inf_t sup_θ E_θ‖t(X) − θ‖² ≥ inf_t ∫ E_θ‖t(X) − θ‖² π_τ(dθ) ≥ (τ/(1+τ))² p

for any τ > 0, where π_τ(θ) = N(0, τI_p) and the last step is from (1.2). Then M ≥ p by letting τ → ∞. The above says that M = sup_θ E_θ‖X − θ‖² = p.
Question. Does there exist another estimator t1(X) such that
Eθ‖t1(X) − θ‖2 ≤ Eθ‖X − θ‖2
for all θ, with strict inequality for some θ? If so, we say t0(X) = X is inadmissible (because it can be matched at every θ by some estimator t and strictly beaten at some θ). Here is the answer:
(i) For p = 1, there is no such estimator. This was proved by Blyth in 1951.
(ii) For p = 2, there is no such estimator. This result was shown by Stein in 1961.
(iii) When p ≥ 3, Stein (1956) showed that there is such an estimator, which is called the James–Stein estimator.
Recall the density function of N(µ, σ²) is φ(x) = (√(2π) σ)⁻¹ exp(−(x − µ)²/(2σ²)).
LEMMA 1.3 (Stein's lemma). Let Y ∼ N(µ, σ²) and let g(y) be a function such that g(b) − g(a) = ∫_a^b g′(y) dy for all a and b and some function g′(y). If E|g′(Y)| < ∞, then

E g(Y)(Y − µ) = σ² E g′(Y).
Proof. Let φ(y) = (1/(√(2π) σ)) exp(−(y − µ)²/(2σ²)), the density of N(µ, σ²). Then φ′(y) = −φ(y)(y − µ)/σ². Thus

φ(y) = ∫_y^∞ ((z − µ)/σ²) φ(z) dz = −∫_{−∞}^y ((z − µ)/σ²) φ(z) dz.

We then have that

E g′(Y) = ∫_{−∞}^∞ g′(y) φ(y) dy
= ∫_0^∞ g′(y) [ ∫_y^∞ ((z − µ)/σ²) φ(z) dz ] dy − ∫_{−∞}^0 g′(y) [ ∫_{−∞}^y ((z − µ)/σ²) φ(z) dz ] dy
= ∫_0^∞ ((z − µ)/σ²) φ(z) [ ∫_0^z g′(y) dy ] dz − ∫_{−∞}^0 ((z − µ)/σ²) φ(z) [ ∫_z^0 g′(y) dy ] dz
= ( ∫_0^∞ + ∫_{−∞}^0 ) ((z − µ)/σ²) φ(z) (g(z) − g(0)) dz
= (1/σ²) ∫_{−∞}^∞ (z − µ) g(z) φ(z) dz = (1/σ²) E(Y − µ)g(Y).

Fubini's theorem is used in the third step; that the mean of a centered normal random variable is zero is used in the fifth step.
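Stein's identity is easy to sanity-check by Monte Carlo. The sketch below uses the test function g(y) = y³ (so g′(y) = 3y²) with µ, σ and the seed chosen arbitrarily; both sides should agree with the exact value 3µ²σ² + 3σ⁴ = 60 for µ = 1, σ = 2.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
Y = mu + sigma * rng.standard_normal(1_000_000)   # Y ~ N(mu, sigma^2)

g = Y ** 3                                        # g(y) = y^3, g'(y) = 3 y^2
lhs = np.mean(g * (Y - mu))                       # E g(Y)(Y - mu)
rhs = sigma ** 2 * np.mean(3 * Y ** 2)            # sigma^2 E g'(Y)
print(lhs, rhs)
```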
Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) − g(a) = ∫_a^b h(x) dx for any a < b. This does not mean that g′(x) = h(x) for all x. Actually, in this case g(x) is differentiable almost everywhere and g′(x) = h(x) a.e. under Lebesgue measure. For example, let h(x) = 1 if x is irrational and h(x) = 0 if x is rational. Then g(x) = x = ∫_0^x h(t) dt for any x ∈ R, and g′(x) = h(x) a.e. The following facts are from real analysis:
Fundamental Theorem of Calculus. If f(x) is absolutely continuous on [a, b], then f(x) is differentiable almost everywhere and

f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].

Another fact. Let f(x) be differentiable everywhere on [a, b] with f′(x) integrable over [a, b]. Then

f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].
Remark 2. What drives Stein's lemma is integration by parts: suppose g(x) is differentiable; then

E g(Y)(Y − µ) = ∫_R g(y)(y − µ)φ(y) dy = −σ² ∫_{−∞}^∞ g(y)φ′(y) dy
= σ² lim_{y→−∞} g(y)φ(y) − σ² lim_{y→+∞} g(y)φ(y) + σ² ∫_{−∞}^∞ g′(y)φ(y) dy.

It is reasonable to assume the two limits are zero. The last term is exactly σ² E g′(Y).
THEOREM 3 Let X ∼ N_p(θ, I_p) for some p ≥ 3. Define

δ_c(X) = ( 1 − c(p − 2)/‖X‖² ) X.

Then

E‖δ_c(X) − θ‖² = p − (p − 2)² E[ c(2 − c)/‖X‖² ].

Proof. Let g_i(x) = c(p − 2)x_i/‖x‖² and g(x) = (g_1(x), · · · , g_p(x)) for x = (x_1, x_2, · · · , x_p).
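The risk identity of Theorem 3 (and the resulting dominance of X when 0 < c < 2) can be checked numerically. In the sketch below, p, c, θ and the seed are arbitrary illustrative choices; both the simulated risk of δ_c and the right-hand side of the identity are estimated from the same samples.

```python
import numpy as np

rng = np.random.default_rng(2)
p, c = 5, 1.0                       # p >= 3; c = 1 gives the James-Stein form
theta = np.full(p, 0.5)
X = theta + rng.standard_normal((400_000, p))     # X ~ N_p(theta, I_p)
nrm2 = np.sum(X ** 2, axis=1)

delta = (1 - c * (p - 2) / nrm2[:, None]) * X     # shrinkage estimator
risk_mc = np.mean(np.sum((delta - theta) ** 2, axis=1))
risk_id = p - (p - 2) ** 2 * np.mean(c * (2 - c) / nrm2)
print(risk_mc, risk_id)
```

Both numbers land well below p = 5, which is the constant risk of the MLE X.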
THEOREM 5 (Neyman–Pearson lemma). Let X be a sample with pdf or pmf f(x|θ). Consider testing H0: θ = θ0 vs H1: θ = θ1, using a test with rejection region R that satisfies

x ∈ R if f(x|θ1) > k f(x|θ0) and x ∈ Rᶜ if f(x|θ1) < k f(x|θ0)   (4.1)

for some k ≥ 0, and

α = P_{θ0}(X ∈ R).   (4.2)
Then
(i) (Sufficiency). Any test satisfying (4.1) and (4.2) is a UMP level α test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then every UMP level α test is a size α test (satisfies (4.2)), and every UMP level α test satisfies (4.1) except on a set A with P_θ(X ∈ A) = 0 for θ = θ0 and θ1.
Proof. We will prove the theorem only for the case that f(x|θ) is continuous.
Thus the assertion (4.3) follows from (4.2) and the fact that k ≥ 0.
(ii) Suppose the test satisfying (4.1) and (4.2) has power function β(θ). By (i) the test is UMP. Let a second UMP level α test have power function β′(θ). Then β(θ1) = β′(θ1). Since k > 0, (4.4) implies β′(θ0) ≥ β(θ0) = α. So β′(θ0) = α.
Given a UMP level α test with power function β′(θ), by the part of (ii) just proved, β(θ1) = β′(θ1) and β(θ0) = β′(θ0) = α. Then (4.4) implies that the expectation there is zero. Being always nonnegative, (φ(x) − φ′(x))(f(x|θ1) − k f(x|θ0)) = 0 except on a set A of Lebesgue measure zero, which leads to (4.1) (by considering whether f(x|θ1) > k f(x|θ0) or not). Since X has a density, P_θ(X ∈ A) = 0 for θ = θ0 and θ = θ1.
Example. Let X ∼ Bin(2, θ). Consider H0: θ = 1/2 vs Ha: θ = 3/4. Note that

f(x|θ1) = C(2, x) (3/4)^x (1/4)^{2−x} > k C(2, x) (1/2)^x (1/2)^{2−x} = k f(x|θ0)

is equivalent to 3^x > 4k.
(i) The case k ≥ 9/4 and the case k < 1/4 correspond to R = ∅ and Ω, respectively. The corresponding UMP levels are α = 0 and α = 1.
(ii) If 1/4 ≤ k < 3/4, then R = {1, 2} and α = P(Bin(2, 1/2) ∈ {1, 2}) = 3/4.
(iii) If 3/4 ≤ k < 9/4, then R = {2}. The level is α = P(Bin(2, 1/2) = 2) = 1/4.
This example says that if we firmly want a size α test, then there are only two such α's: α = 1/4 and α = 3/4.
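The attainable sizes in this example can be verified directly by summing the Bin(2, 1/2) pmf over the two nontrivial rejection regions:

```python
from math import comb

def size(R, n=2, p=0.5):
    # P(X in R) under H0 for X ~ Bin(n, p)
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in R)

alpha_12 = size({1, 2})   # R = {1, 2}, for 1/4 <= k < 3/4
alpha_2 = size({2})       # R = {2},   for 3/4 <= k < 9/4
print(alpha_12, alpha_2)  # 0.75 and 0.25
```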
Example. Let X1, · · · , Xn be a random sample from N(θ, σ²) with σ known. Look at the test H0: θ = θ0 vs H1: θ = θ1, where θ0 > θ1. The density function is f(x|θ, σ) = (1/(√(2π) σ)) exp(−(x − θ)²/(2σ²)). Now f(x|θ1) > k f(x|θ0) is equivalent to X̄ < c for some constant c. Set α = P_{θ0}(X̄ < c). One can check easily that c = θ0 + σ z_α/√n, where z_α denotes the lower α-quantile of N(0, 1). So the UMP rejection region is

R = { X̄ < θ0 + (σ/√n) z_α }.
LEMMA 4.1 Let X be a random variable and let f(x) and g(x) be non-decreasing real functions. Then

E(f(X)g(X)) ≥ Ef(X) · Eg(X),

provided all three expectations are finite.

Proof. Let µ be the distribution of X. Note that (f(x) − f(y))(g(x) − g(y)) ≥ 0 for any x and y. Then

0 ≤ ∫∫ (f(x) − f(y))(g(x) − g(y)) µ(dx)µ(dy)
= 2 ∫ f(x)g(x) µ(dx) − 2 ∫∫ f(x)g(y) µ(dx)µ(dy)
= 2[ E(f(X)g(X)) − Ef(X) · Eg(X) ].

The desired result follows.
Let f(x|θ), θ ∈ R, be a pmf or pdf with common support, that is, the set {x : f(x|θ) > 0} is identical for every θ. This family of distributions is said to have monotone likelihood ratio (MLR) with respect to a statistic T(x) if

f(x|θ2)/f(x|θ1), on the support, is a non-decreasing function of T(x) for any θ2 > θ1.   (4.5)
LEMMA 4.2 Let X be a random vector with X ∼ f(x|θ), θ ∈ R, where f(x|θ) is a pmf or pdf with common support. Suppose f(x|θ) has MLR with respect to a statistic T(x). Let

φ(x) = 1 when T(x) > C,  γ when T(x) = C,  0 when T(x) < C   (4.6)

for some C ∈ R, where γ ∈ [0, 1]. Then

E_{θ1}φ(X) ≤ E_{θ2}φ(X)

for any θ1 < θ2.

Proof. Thanks to MLR, for any θ1 < θ2 there exists a nondecreasing function g(t) such that f(x|θ2)/f(x|θ1) = g(T(x)). Thus

E_{θ2}φ(X) = ∫ φ(x) (f(x|θ2)/f(x|θ1)) f(x|θ1) dx = E_{θ1}( g(T(X)) · h(T(X)) ),

where

h(t) = 1 when t > C,  γ when t = C,  0 when t < C.

Since both g(t) and h(t) are non-decreasing, by the positive correlation inequality (Lemma 4.1),

E_{θ2}φ(X) ≥ E_{θ1} g(T(X)) · E_{θ1} h(T(X)) = E_{θ1}φ(X),

using E_{θ1} g(T(X)) = ∫ f(x|θ2) dx = 1.
Remark. All the above statements of weak convergence of random variables Xn can be replaced by their distributions µn := L(Xn). So "Xn =⇒ X" is equivalent to "µn =⇒ µ".

THEOREM 11 (Lévy's continuity theorem). Let {µn; n = 0, 1, 2, · · · } be a sequence of probability measures on R^k with characteristic functions φn. We have that
(i) if µn =⇒ µ0, then φn(t) → φ0(t) for every t ∈ R^k;
(ii) if φn(t) converges pointwise to a limit φ0(t) that is continuous at 0, then the associated sequence of distributions {µn} is tight and converges weakly to the measure µ0 with characteristic function φ0.

Remark. The condition that "φ0(t) is continuous at 0" is essential. Let Xn ∼ N(0, n), n ≥ 1. Then P(Xn ≤ x) → 1/2 for any x, which means that Xn doesn't converge weakly. Note that φn(t) = e^{−nt²/2} → 0 as n → ∞ while φn(0) = 1. So the limit φ0(t) is not continuous at 0. If one ignored the continuity condition at zero, the conclusion that Xn converges weakly would wrongly follow.
Proof of Theorem 11. (i) is trivial.
(ii) Because marginal tightness implies joint tightness, W.L.O.G. assume that µn is a probability measure on R, generated by Xn. Note that for every x and δ > 0,

I{|δx| > 2} ≤ 2( 1 − sin(δx)/(δx) ) = (1/δ) ∫_{−δ}^δ (1 − cos tx) dt.

Replace x by Xn, take expectations, and use Fubini's theorem to obtain that

P(|Xn| > 2/δ) ≤ (1/δ) ∫_{−δ}^δ (1 − E e^{itXn}) dt = (1/δ) ∫_{−δ}^δ (1 − φn(t)) dt.

By the given condition,

lim sup_n P(|Xn| > 2/δ) ≤ (1/δ) ∫_{−δ}^δ (1 − φ0(t)) dt.

The right hand side goes to zero as δ ↓ 0. So tightness follows.
To prove the weak convergence, we have to show that ∫ f(x) dµn → ∫ f(x) dµ0 for every bounded continuous function f(x). It is equivalent to show that for any subsequence there is a further subsequence, say {nk}, such that ∫ f(x) dµ_{nk} → ∫ f(x) dµ0.
For any subsequence, by Lemma 5.6 (the Helly selection principle) and tightness, there is a further subsequence, say {µ_{nk}}, converging weakly to some limit. By the proof of (i), the c.f. of that limit is φ0. We know that a c.f. uniquely determines a distribution, so every subsequential limit is the same measure µ0. Hence ∫ f(x) dµ_{nk} → ∫ f(x) dµ0.
THEOREM 12 (Central Limit Theorem). Let X1, X2, · · · , Xn be a sequence of i.i.d. random variables with mean zero and variance one. Let X̄n = (X1 + · · · + Xn)/n. Then √n X̄n =⇒ N(0, 1).

Proof. Let φ(t) be the c.f. of X1. Since EX1 = 0 and EX1² = 1, we have that φ′(0) = iEX1 = 0 and φ″(0) = i²EX1² = −1. By Taylor's expansion,

φ(t/√n) = 1 − t²/(2n) + o(1/n)

as n → ∞. Hence

E e^{it√n X̄n} = φ(t/√n)^n = ( 1 − t²/(2n) + o(1/n) )^n → e^{−t²/2}

as n → ∞. By Lévy's continuity theorem, the desired conclusion follows.
To deal with multivariate random variables, we have the next tool.
THEOREM 13 (Cramér–Wold device). Let Xn and X be random variables taking values in R^k. Then

Xn =⇒ X if and only if tᵀXn =⇒ tᵀX for all t ∈ R^k.

Proof. By Lévy's continuity theorem (Theorem 11), Xn =⇒ X if and only if E exp(itᵀXn) → E exp(itᵀX) for each t ∈ R^k, which is equivalent to E exp(iu(tᵀXn)) → E exp(iu(tᵀX)) for any real number u and each t. This is the same as saying that the c.f. of tᵀXn converges to that of tᵀX for each t ∈ R^k, which is equivalent to tᵀXn =⇒ tᵀX for all t ∈ R^k.
As an application, we have the multivariate analogue of the one-dimensional CLT.
THEOREM 14 Let X1, X2, · · · be a sequence of i.i.d. random vectors in R^k with mean vector µ = EX1 and covariance matrix Σ = E(X1 − µ)(X1 − µ)ᵀ. Then

(1/√n) Σ_{i=1}^n (Xi − µ) = √n(X̄n − µ) =⇒ N_k(0, Σ),

where X̄n is the sample mean.

Proof. By the Cramér–Wold device, we need to show that

(1/√n) Σ_{i=1}^n (tᵀXi − tᵀµ) =⇒ N(0, tᵀΣt)

for any t ∈ R^k. Note that tᵀXi − tᵀµ, i = 1, 2, · · · , are i.i.d. real random variables. By the one-dimensional CLT,

(1/√n) Σ_{i=1}^n (tᵀXi − tᵀµ) =⇒ N(0, Var(tᵀX1)).

This leads to the desired conclusion since Var(tᵀX1) = tᵀΣt.
We state the following theorem without proof; the spirit of the proof is similar to that of Theorem 12. It is called the Lindeberg–Feller theorem.

THEOREM 15 Let {kn; n ≥ 1} be a sequence of positive integers with kn → +∞ as n → ∞. For each n, let Y_{n,1}, · · · , Y_{n,kn} be independent random vectors with finite variances such that

Σ_{i=1}^{kn} Cov(Y_{n,i}) → Σ  and  Σ_{i=1}^{kn} E[ ‖Y_{n,i}‖² I{‖Y_{n,i}‖ > ε} ] → 0

for every ε > 0. Then Σ_{i=1}^{kn} (Y_{n,i} − EY_{n,i}) =⇒ N(0, Σ).
The Central Limit Theorem holds not only for independent random variables; it is also true for some dependent random variables with special structure.

Let Y1, Y2, · · · be a sequence of random variables, and let {Fn} be a non-decreasing sequence of σ-algebras (Fn ⊂ F_{n+1} for all n ≥ 0) with Fn ⊃ σ(Y1, · · · , Yn), where F0 = {∅, Ω}. When E(Y_{n+1}|Fn) = 0 for every n ≥ 0, we say {Yn; n ≥ 1} are martingale differences.
THEOREM 16 For each n ≥ 1, let {X_{ni}, 1 ≤ i ≤ kn} be a sequence of martingale differences relative to nested σ-algebras F_{ni}, 1 ≤ i ≤ kn (that is, F_{ni} ⊂ F_{nj} for i < j), such that
1. {max_i |X_{ni}|; n ≥ 1} is uniformly integrable;
2. E(max_i |X_{ni}|) → 0;
3. Σ_i X_{ni}² → 1 in probability.
Then Sn = Σ_i X_{ni} → N(0, 1) in distribution.

Example. Let {Xi; i ≥ 1} be i.i.d. with mean zero and variance one. One can verify that the required conditions in Theorem 16 hold for X_{ni} = Xi/√n, i = 1, 2, · · · , n. Therefore we recover the classical CLT that Σ_{i=1}^n Xi/√n =⇒ N(0, 1). Actually,

P(max_i |X_{ni}| ≥ ε) ≤ nP(|X1| ≥ √n ε) ≤ (n/(nε²)) E[ |X1|² I{|X1| ≥ √n ε} ] → 0

and

E(max_i |X_{ni}|)² ≤ E[ Σ_{i=1}^n Xi²/n ] = 1.

This shows that max_i |X_{ni}| → 0 in probability and that {max_i |X_{ni}|; n ≥ 1} is uniformly integrable (being bounded in L²). So conditions 1 and 2 hold; condition 3 holds by the law of large numbers.
LEMMA 5.8 Let {Un; n ≥ 1} and {Tn; n ≥ 1} be two sequences of random variables such that
1. Un → a in probability, where a is a constant;
2. {Tn} is uniformly integrable;
3. {UnTn} is uniformly integrable;
4. ETn → 1.
Then E(TnUn) → a.

Proof. Write TnUn = Tn(Un − a) + aTn. Then E(TnUn) = ETn(Un − a) + aETn. Evidently, {Tn(Un − a)} is uniformly integrable by the given conditions. Since {Tn} is uniformly integrable, it is not difficult to verify that Tn(Un − a) → 0 in probability, hence ETn(Un − a) → 0. The result then follows from ETn → 1.
By Taylor expansion, we have the equality

e^{ix} = (1 + ix) e^{−x²/2 + r(x)}   (5.6)

for some function r(x) = O(x³) as x → 0. In the set-up of Theorem 16, we have e^{itSn} = TnUn, where

Tn = Π_j (1 + itX_{nj})  and  Un = exp( −(t²/2) Σ_j X_{nj}² + Σ_j r(tX_{nj}) ).   (5.7)

We next obtain a general result on the CLT by using Lemma 5.8.
THEOREM 17 Let {X_{nj}, 1 ≤ j ≤ kn, n ≥ 1} be a triangular array of random variables. Suppose
1. ETn → 1;
2. {Tn} is uniformly integrable;
3. Σ_j X_{nj}² → 1 in probability;
4. E(max_j |X_{nj}|) → 0.
Then Sn = Σ_j X_{nj} converges to N(0, 1) in distribution.

Proof. Since |TnUn| = |e^{itSn}| = 1, we know that {TnUn} is uniformly integrable. By Lemma 5.8, it is enough to show that Un → e^{−t²/2} in probability. Reviewing (5.7) and condition 3, it suffices to show that Σ_j r(tX_{nj}) → 0 in probability. The assertion r(x) = O(x³) says that there exist A > 0 and δ > 0 such that |r(x)| ≤ A|x|³ for |x| < δ. Then

P( |Σ_j r(tX_{nj})| > ε ) ≤ P( max_j |X_{nj}| ≥ δ/|t| ) + P( |Σ_j r(tX_{nj})| > ε, max_j |X_{nj}| < δ/|t| )
≤ (|t|/δ) E(max_j |X_{nj}|) + P( A|t|³ max_j |X_{nj}| · Σ_j X_{nj}² > ε ) → 0

as n → ∞ by the given conditions.
Proof of Theorem 16. Define Z_{n1} = X_{n1} and

Z_{nj} = X_{nj} I( Σ_{k=1}^{j−1} X_{nk}² ≤ 2 ),  2 ≤ j ≤ kn.

It is easy to check that {Z_{nj}} is still a martingale difference array relative to the original σ-algebras. Now define J = inf{ j ≥ 1 : Σ_{1≤k≤j} X_{nk}² > 2 } ∧ kn. Then

P(X_{nk} ≠ Z_{nk} for some k ≤ kn) = P(J ≤ kn − 1) ≤ P( Σ_{1≤k≤kn} X_{nk}² > 2 ) → 0

as n → ∞. Therefore, P(Sn ≠ Σ_{j=1}^{kn} Z_{nj}) → 0 as n → ∞. To prove the result, we only need to show

Σ_{j=1}^{kn} Z_{nj} =⇒ N(0, 1).   (5.8)

We now apply Theorem 17 to prove this. Replacing X_{nj} by Z_{nj} in (5.7), we have new Tn and Un. Let's verify the four conditions in Theorem 17 one by one. By a martingale property and iteration, ETn = 1, so condition 1 holds. Now max_j |Z_{nj}| ≤ max_j |X_{nj}| and E(max_j |X_{nj}|) → 0, so condition 4 holds. It is also easy to check that Σ_j Z_{nj}² → 1 in probability. It remains to show condition 2.

By definition, Tn = Π_{j=1}^{kn} (1 + itZ_{nj}) = Π_{1≤j≤J} (1 + itZ_{nj}). Thus

|Tn| = [ Π_{k=1}^{J−1} (1 + t²X_{nk}²) ]^{1/2} · |1 + itX_{nJ}|
≤ exp( (t²/2) Σ_{k=1}^{J−1} X_{nk}² ) (1 + |t| · |X_{nJ}|)
≤ e^{t²} (1 + |t| · max_j |X_{nj}|),

which is uniformly integrable by condition 1 in Theorem 16.
The next result is an extreme value limit theorem, which is different from the CLT. The limiting distribution is called an extreme value distribution.

THEOREM 18 Let {Xi; 1 ≤ i ≤ n} be i.i.d. N(0, 1) random variables and let Wn = max_{1≤i≤n} Xi. Then

P( Wn ≤ √(2 log n) − (log₂ n + x)/(2√(2 log n)) ) → exp( −(1/(2√π)) e^{x/2} )

for any x ∈ R, where log₂ n = log(log n).

Proof. Let tn be the right hand side of "≤" in the probability above. Then

P(Wn ≤ tn) = P(X1 ≤ tn)^n = (1 − P(X1 > tn))^n.

Since (1 − xn)^n → e^{−a} as n → ∞ if xn ∼ a/n, to prove the theorem it suffices to show that

P(X1 > tn) ∼ (1/(2√π n)) e^{x/2}.   (5.9)

Actually, we know that

P(X1 > x) ∼ (1/(√(2π) x)) e^{−x²/2}

as x → +∞. It is easy to calculate that

1/(√(2π) tn) ∼ 1/(2√(π log n))  and  tn²/2 = log n − log(√(log n)) − x/2 + o(1)

as n → ∞. This leads to (5.9).
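The key step (5.9) can be checked numerically without any simulation, since the normal tail is available exactly through the complementary error function: P(X1 > t) = (1/2) erfc(t/√2). In the sketch below, the value x = 0.3 is an arbitrary illustrative choice; the ratio of the exact tail at tn to the claimed asymptotic e^{x/2}/(2√π n) should approach 1 as n grows.

```python
import math

def tn(n, x):
    # t_n = sqrt(2 log n) - (log log n + x) / (2 sqrt(2 log n))
    L = math.log(n)
    return math.sqrt(2 * L) - (math.log(L) + x) / (2 * math.sqrt(2 * L))

def norm_tail(t):
    # exact standard normal tail P(X > t)
    return 0.5 * math.erfc(t / math.sqrt(2))

x = 0.3
ratios = []
for n in (10 ** 6, 10 ** 10):
    claimed = math.exp(x / 2) / (2 * math.sqrt(math.pi) * n)
    ratios.append(norm_tail(tn(n, x)) / claimed)
print(ratios)   # both ratios close to 1
```

The convergence is logarithmically slow, which is typical for extremes of normal samples.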
As an application of the CLT for i.i.d. random variables, we can derive the classical χ²-test as follows.

Consider a multinomial distribution with n trials and k classes, with parameter p = (p1, · · · , pk). Roll a die twenty times. Let Xi be the number of occurrences of "i dots", 1 ≤ i ≤ 6. Then (X1, X2, · · · , X6) follows a multinomial distribution with "success rate" p = (p1, · · · , p6). How do we test whether the die is fair? From an introductory course, we know that the null hypothesis is H0: p1 = · · · = p6 = 1/6 and the test statistic is

χ² = Σ_{i=1}^6 (Xi − npi)²/(npi).

We use the fact that χ² is roughly χ²(5) when n is large. We will prove this next.

In general, Xn = (X_{n,1}, · · · , X_{n,k}) follows a multinomial distribution with n trials and "success rate" p = (p1, · · · , pk). Of course, Σ_{i=1}^k X_{n,i} = n. To be more precise,

P(X_{n,1} = x_{n,1}, · · · , X_{n,k} = x_{n,k}) = ( n! / (x_{n,1}! · · · x_{n,k}!) ) p1^{x_{n,1}} · · · pk^{x_{n,k}},

where Σ_{i=1}^k x_{n,i} = n. Now we prove the following theorem.

THEOREM 19 As n → ∞,

χ²n = Σ_{i=1}^k (X_{n,i} − npi)²/(npi) =⇒ χ²(k − 1).
We need a lemma.

LEMMA 5.9 Let Y ∼ N_k(0, Σ). Then

‖Y‖² =_d Σ_{i=1}^k λi Zi²,

where λ1, · · · , λk are the eigenvalues of Σ and Z1, · · · , Zk are i.i.d. N(0, 1).

Proof. Decompose Σ = O diag(λ1, · · · , λk) Oᵀ for some orthogonal matrix O. Then the k coordinates of OᵀY are independent with mean zero and variances λi. So ‖Y‖² = ‖OᵀY‖² is equal in distribution to Σ_{i=1}^k λi Zi², where the Zi are i.i.d. N(0, 1).
Another fact we will use is that AXn =⇒ AX for any k×k matrix A if Xn =⇒ X, where
Xn and X are Rk-valued random vectors. This can be shown easily by the Cramer-Wold
device.
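Theorem 19 is easy to see empirically: for the fair-die setting (k = 6, so k − 1 = 5 degrees of freedom), the simulated χ² statistics should have mean ≈ 5 and exceed the χ²(5) 95th percentile (≈ 11.07) about 5% of the time. The trial count, replication count and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n, reps = 6, 1_000, 20_000
p = np.full(k, 1 / k)

X = rng.multinomial(n, p, size=reps)              # reps multinomial count vectors
chi2 = np.sum((X - n * p) ** 2 / (n * p), axis=1) # Pearson chi-square statistics

tail = np.mean(chi2 > 11.07)   # chi2(5) 95th percentile ~ 11.07
print(chi2.mean(), tail)       # ~5 and ~0.05
```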
Proof of the Theorem. Let Y1, · · · , Yn be i.i.d. multinomial with 1 trial and "success rate" p = (p1, · · · , pk). Then Xn = Y1 + · · · + Yn. Each Yi is a k-dimensional
over all k ≥ 1,m ≥ 0 and measurable functions f(x) defined on Rm and g(x) defined on
Rm−n+1 with |f(x)| ≤ 1 and |g(x)| ≤ 1 for all x ∈ R.
LEMMA 6.3 The following are true:
(i) α(n) = sup_{k≥1} sup_{m≥0} sup_{A∈σ(X1,··· ,Xk), B∈σ(X_{k+n},··· ,X_{k+n+m})} |P(A ∩ B) − P(A)P(B)|.
(ii) For all n ≥ 1, we have that α(n) ≤ ᾱ(n) ≤ 4α(n).

Proof. (i) For two sets A and B, let A∆B = (A\B) ∪ (B\A), their symmetric difference. Then I_{A∆B} = |I_A − I_B|. Let {Zi; i ≥ 1} be a sequence of random variables. We claim that

σ(Z1, Z2, · · · ) = { B ∈ σ(Z1, Z2, · · · ) : for any ε > 0 there exist l ≥ 1 and C ∈ σ(Z1, · · · , Zl) such that P(B∆C) < ε }.   (6.23)

If this is true, then for any ε > 0 there exist l < ∞ and C ∈ σ(Z1, · · · , Zl) such that |P(B) − P(C)| < ε, and (i) follows.

Now we prove this claim. First, note that σ(Z1, Z2, · · · ) = σ({Zi ∈ Ei}; Ei ∈ B(R); i ≥ 1). Denote by F the right hand side of (6.23). Then F contains all {Zi ∈ Ei}, Ei ∈ B(R), i ≥ 1. We only need to show that F is a σ-algebra.
It is easy to see that (a) Ω ∈ F; (b) Bᶜ ∈ F if B ∈ F, since A∆B = Aᶜ∆Bᶜ; (c) if Bi ∈ F for i ≥ 1, there exist mi < ∞ and Ci ∈ σ(Z1, · · · , Z_{mi}) such that P(Bi∆Ci) < ε/2^i for all i ≥ 1. Evidently, ∪_{i=1}^n Bi ↑ ∪_{i=1}^∞ Bi as n → ∞, so there exists n0 < ∞ such that |P(∪_{i=1}^∞ Bi) − P(∪_{i=1}^{n0} Bi)| ≤ ε/2. It is easy to check that (∪_{i=1}^{n0} Bi)∆(∪_{i=1}^{n0} Ci) ⊂ ∪_{i=1}^{n0} (Bi∆Ci). Write B = ∪_{i=1}^∞ Bi, B̃ = ∪_{i=1}^{n0} Bi and C = ∪_{i=1}^{n0} Ci. Note that B∆C ⊂ (B\B̃) ∪ (B̃∆C). These facts show that P(B∆C) < 2ε and C ∈ σ(Z1, Z2, · · · , Zl) for some l < ∞; since ε > 0 is arbitrary, ∪_i Bi ∈ F. Thus F is a σ-algebra.
(ii) Taking f and g to be indicator functions in the definition of ᾱ(n), from (i) we get α(n) ≤ ᾱ(n). By Lemma 6.1, we have that ᾱ(n) ≤ 4α(n).
Let {Xi; i ≥ 1} be a sequence of random variables. We say it is a Markov chain if P(X_{k+1} ∈ A|X1, · · · , Xk) = P(X_{k+1} ∈ A|Xk) for any Borel set A and any k ≥ 1. A Markov chain has the property that, given the present, the past and future observations are (conditionally) independent. This is stated precisely in Proposition 10.3.

Let's look at the strong mixing coefficient α(n) and recall its definition. If X1, X2, · · · is a Markov chain, for simplicity write f and g for f(X1, · · · , Xk) and g(X_{k+n}, · · · , X_{k+m}), respectively. Let f̄ = E(f|X_{k+1}) and ḡ = E(g|X_{k+n−1}). Then by Proposition 10.3, E(fg) = E(f̄ḡ), Ef = Ef̄ and Eg = Eḡ. Thus

E(fg) − (Ef)(Eg) = E(f̄ḡ) − (Ef̄)(Eḡ).

The point is that f̄ is σ(X_{k+1})-measurable and ḡ is σ(X_{k+n−1})-measurable. Therefore
THEOREM 24 (Hoeffding's theorem). For a U-statistic given by (7.1) with E[h(X1, · · · , Xm)²] < ∞,

Var(Un) = (1/C(n, m)) Σ_{k=1}^m C(m, k) C(n − m, m − k) ζk,

where ζk = Var(hk(X1, · · · , Xk)).

Proof. Consider two sets {i1, · · · , im} and {j1, · · · , jm} of m distinct integers from {1, 2, · · · , n} with exactly k integers in common. The total number of such pairs is C(n, m) C(m, k) C(n − m, m − k). Now

Var(Un) = (1/C(n, m)²) Σ Cov( h(X_{i1}, · · · , X_{im}), h(X_{j1}, · · · , X_{jm}) ),

where the sum is over all indices 1 ≤ i1 < · · · < im ≤ n and 1 ≤ j1 < · · · < jm ≤ n. If {i1, · · · , im} and {j1, · · · , jm} have no integers in common, the covariance is zero by independence. Now we classify the sum by the number of integers in common. Then
h_{k+1}). Note that Ehk = Eh_{k+1}; then Var(hk) ≤ Var(h_{k+1}). So (i) follows.
(ii) By Theorem 24,

Var(Un) = (1/C(n, m)) C(m, k) C(n − m, m − k) ζk + (1/C(n, m)) Σ_{i=k+1}^m C(m, i) C(n − m, m − i) ζi.

Since m is fixed, C(n, m) ∼ n^m/m! and C(n − m, m − i) ≤ n^{m−i} for any k + 1 ≤ i ≤ m. Thus the last term above is O(n^{−(k+1)}) as n → ∞. Now

(1/C(n, m)) C(m, k) C(n − m, m − k) ζk
= [ m! C(m, k) ζk / (n(n−1) · · · (n−m+1)) ] · [ (n−m)! / ((m−k)!(n−2m+k)!) ]
= [ k! C(m, k)² ζk / n^k ] · [ n · n · · · n / (n(n−1) · · · (n−k+1)) ] · [ (n−m) · · · (n−2m+k+1) / ((n−k) · · · (n−m+1)) ]
= [ k! C(m, k)² ζk / n^k ] · Π_{i=1}^{k−1} (1 − i/n)^{−1} · Π_{i=m}^{2m−k−1} (1 − i/n) · Π_{i=k}^{m−1} (1 − i/n)^{−1}.

Note that since m is fixed, each of the above products is equal to 1 + O(1/n). The conclusion then follows.
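Hoeffding's variance formula can be checked against a simulation. Below I pick the kernel h(x, y) = xy (an illustrative choice, m = 2) with i.i.d. N(1, 1) data, for which Eh = µ², h1(x) = µx, ζ1 = µ²σ² and ζ2 = Var(XY) = (σ² + µ²)² − µ⁴. The formula then reads Var(Un) = [2(n − 2)ζ1 + ζ2]/C(n, 2).

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, mu, sig = 30, 200_000, 1.0, 1.0
X = mu + sig * rng.standard_normal((reps, n))

# U-statistic with kernel h(x, y) = x*y, computed via sums:
# sum_{i<j} Xi*Xj = ((sum Xi)^2 - sum Xi^2) / 2
s1 = X.sum(axis=1)
s2 = (X ** 2).sum(axis=1)
Un = (s1 ** 2 - s2) / (n * (n - 1))

zeta1 = mu ** 2 * sig ** 2                    # Var(h1(X)), h1(x) = mu*x
zeta2 = (sig ** 2 + mu ** 2) ** 2 - mu ** 4   # Var(h(X, Y))
var_exact = (2 * (n - 2) * zeta1 + zeta2) / (n * (n - 1) / 2)
print(Un.mean(), Un.var(), var_exact)         # mean ~ mu^2 = 1
```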
Let X1, · · · , Xn be a sample and Tn a statistic based on this sample. The projection of Tn on random variables Y1, · · · , Yp is defined by

Tn′ = ETn + Σ_{i=1}^p ( E(Tn|Yi) − ETn ).

Suppose Tn is a symmetric function of X1, · · · , Xn. Set ψ(Xi) = E(Tn|Xi) for i = 1, 2, · · · , n. Then ψ(X1), · · · , ψ(Xn) are i.i.d. random variables with mean ETn. If Var(Tn) < ∞, then

( 1/√(n Var(ψ(X1))) ) Σ_{i=1}^n ( ψ(Xi) − ETn ) → N(0, 1)   (7.4)

in distribution as n → ∞. Now let Tn′ be the projection of Tn on X1, · · · , Xn. Then

Tn − Tn′ = (Tn − ETn) − Σ_{i=1}^n ( ψ(Xi) − ETn ).   (7.5)

We will show next that Tn − Tn′ is negligible compared to the order appearing in (7.4). Then the CLT holds for Tn by (7.4) and the Slutsky lemma.
LEMMA 7.2 Let Tn be a symmetric statistic with Var(Tn) < ∞ for every n, and let Tn′ be the projection of Tn on X1, · · · , Xn. Then ETn′ = ETn and

E(Tn − Tn′)² = Var(Tn) − Var(Tn′).

Proof. Since ETn = ETn′,

E(Tn − Tn′)² = Var(Tn) + Var(Tn′) − 2Cov(Tn, Tn′).

First, Var(Tn′) = n Var(E(Tn|X1)) by independence. Second, by the definition of Tn′,

Cov(Tn, Tn′) = Σ_{i=1}^n Cov(Tn, ψ(Xi)) = Σ_{i=1}^n E[ (Tn − ETn)(ψ(Xi) − ETn) ].   (7.6)

Now the i-th covariance above is equal to E(Tn E(Tn|Xi)) − (ETn)² = Var(E(Tn|X1)). We have already seen that Var(Tn′) = n Var(E(Tn|X1)), so Cov(Tn, Tn′) = Var(Tn′), and the proof is complete by (7.6).
THEOREM 25 Let Un be the statistic given by (7.1) with E[h(X1, · · · , Xm)²] < ∞.
(i) If ζ1 > 0, then √n(Un − EUn) → N(0, m²ζ1) in distribution.
(ii) If ζ1 = 0 and ζ2 > 0, then

n(Un − EUn) → ( m(m−1)/2 ) Σ_{j=1}^∞ λj ( χ²_{1j} − 1 ),

where the χ²_{1j} are i.i.d. χ²(1) random variables and the λj are constants satisfying Σ_{j=1}^∞ λj² = ζ2.

Proof. We will only prove (i) here; the proof of (ii) is very technical and is omitted. Let Un′ be the projection of Un on X1, · · · , Xn. Then

Un′ = Eh + (1/C(n, m)) Σ_{i=1}^n Σ_{1≤i1<···<im≤n} [ E(h(X_{i1}, · · · , X_{im})|Xi) − Eh ].

Observe that

(1/C(n, m)) Σ_{1≤i1<···<im≤n} [ E(h(X_{i1}, · · · , X_{im})|X1) − Eh ]
= (1/C(n, m)) Σ_{2≤i2<···<im≤n} ( h1(X1) − Eh )
= ( C(n−1, m−1)/C(n, m) ) ( h1(X1) − Eh ) = (m/n)( h1(X1) − Eh ),

since the terms whose index sets do not contain 1 contribute zero. Thus

Un′ = Eh + (m/n) Σ_{i=1}^n ( h1(Xi) − Eh ).

This says that Var(Un′) = (m²/n)ζ1, where ζ1 = Var(h1(X1)). By Lemma 7.1, Var(Un) = (m²/n)ζ1 + O(n⁻²). From Lemma 7.2, Var(Un − Un′) = O(n⁻²) as n → ∞. By the Chebyshev inequality, √n(Un − Un′) → 0 in probability. Since √n(Un′ − Eh) = (m/√n) Σ_{i=1}^n (h1(Xi) − Eh) =⇒ N(0, m²ζ1) by the one-dimensional CLT, the result follows from the Slutsky lemma.
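Theorem 25(i) can be illustrated with the sample variance, which is the U-statistic with kernel h(x, y) = (x − y)²/2 (m = 2). For N(0, 1) data, h1(x) = ((x)² + 1)/2 up to centering, so ζ1 = Var(X²)/4 = 1/2 and the limit variance is m²ζ1 = 2; that is, √n(S² − 1) ⇒ N(0, 2). The sizes and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 10_000
X = rng.standard_normal((reps, n))

# sample variance = U-statistic with kernel h(x, y) = (x - y)^2 / 2
Un = X.var(axis=1, ddof=1)
Z = np.sqrt(n) * (Un - 1.0)

# Theorem 25(i): limit N(0, m^2 * zeta1) = N(0, 2) here
print(Z.mean(), Z.var())
```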
8 Empirical Processes
Let X1, X2, · · · be a random sample, where Xi takes values in a metric space M. Define a probability measure µn by

µn(A) = (1/n) Σ_{i=1}^n I(Xi ∈ A),  A ⊂ M.

The measure µn is called the empirical measure. If X1 takes real values, that is, M = R, set

Fn(t) = µn((−∞, t]) = (1/n) Σ_{i=1}^n I(Xi ≤ t),  t ∈ R.

The random function {Fn(t), t ∈ R} is called the empirical distribution function. By the LLN and CLT, for each fixed t,

Fn(t) → F(t) a.s. and √n(Fn(t) − F(t)) =⇒ N(0, F(t)(1 − F(t))),

where F(t) is the cdf of X1. Among many interesting questions, here we are interested in uniform versions of the above almost sure and weak convergences. Specifically, we want to answer the following two questions:
1) Does ∆n := sup_t |Fn(t) − F(t)| → 0 a.s.?
2) For fixed n ≥ 1, regarding {Fn(t); t ∈ R} as one point in a big space, does a CLT hold?
The answers are yes for both questions. We address the first question first.
THEOREM 26 (Glivenko–Cantelli) Let X1, X2, · · · be a sequence of i.i.d. random variables with cdf F(t). Then

∆n = sup_t |Fn(t) − F(t)| → 0 a.s.

as n → ∞.

Remark. When F(t) is continuous and strictly increasing on the support of X1, the distribution of ∆n does not depend on the distribution of X1. First, the F(Xi) are i.i.d. U[0, 1] random variables. Second,

∆n = sup_t | (1/n) Σ_i I(Xi ≤ t) − F(t) | = sup_t | (1/n) Σ_i I(F(Xi) ≤ F(t)) − F(t) |
= sup_{0≤y≤1} | (1/n) Σ_i I(Yi ≤ y) − y |,

where Yi = F(Xi), 1 ≤ i ≤ n, are i.i.d. random variables uniform over [0, 1].
Proof. Set

Fn(t−) = (1/n) Σ_{i=1}^n I(Xi < t)  and  F(t−) = P(X1 < t),  t ∈ R.

By the strong law of large numbers, Fn(t) → F(t) a.s. and Fn(t−) → F(t−) a.s. for each fixed t as n → ∞. Given ε > 0, choose −∞ = t0 < t1 < · · · < tk = ∞ such that F(ti−) − F(t_{i−1}) < ε for every i. Now, for t_{i−1} ≤ t < ti,

Fn(t) − F(t) ≤ Fn(ti−) − F(ti−) + ε,
Fn(t) − F(t) ≥ Fn(t_{i−1}) − F(t_{i−1}) − ε.

This says that

∆n ≤ max_{1≤i≤k} { |Fn(ti) − F(ti)|, |Fn(ti−) − F(ti−)| } + ε → ε a.s.

as n → ∞. The conclusion follows by letting ε ↓ 0.
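The Glivenko–Cantelli phenomenon is easy to watch numerically. For U(0, 1) data the sup is attained at the order statistics, so ∆n can be computed exactly; the sample sizes and seed below are arbitrary illustrative choices, and ∆n should shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)

def delta_n(n):
    # sup_t |F_n(t) - t| for U(0, 1) data; the sup is attained at the
    # order statistics u_(1) <= ... <= u_(n)
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

d_small, d_big = delta_n(100), delta_n(100_000)
print(d_small, d_big)   # d_big is much smaller
```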
Given (t1, t2, · · · , tk) ∈ R^k, it is easy to check that √n(Fn(t1) − F(t1), · · · , Fn(tk) − F(tk)) =⇒ N_k(0, Σ), where

Σ = (σij) and σij = F(ti ∧ tj) − F(ti)F(tj),  1 ≤ i, j ≤ k.   (8.1)

Let D[−∞, ∞] be the set of all functions defined on (−∞, ∞) that are right-continuous with left limits everywhere. Equipped with the so-called Skorohod metric, it becomes a Polish space. The random element √n(Fn − F), viewed as an element of D[−∞, ∞], converges weakly to a continuous Gaussian process ξ(t) with ξ(±∞) = 0 and the covariance structure given in (8.1). This limit is usually called a Brownian bridge; it has the same distribution as G_λ ∘ F, where G_λ is the limiting process obtained when F is the cumulative distribution function of the uniform distribution over [0, 1].

THEOREM 27 (Donsker). If X1, X2, · · · are i.i.d. random variables with distribution function F, then the sequence of empirical processes √n(Fn − F) converges in distribution in the space D[−∞, ∞] to a random element G_F whose marginal distributions are zero-mean with covariance function (8.1).

Later we will see that

√n ‖Fn − F‖_∞ = √n sup_t |Fn(t) − F(t)| =⇒ sup_t |G_F(t)|,

where

P( sup_t |G_F(t)| ≥ x ) = 2 Σ_{j=1}^∞ (−1)^{j+1} e^{−2j²x²},  x ≥ 0.

Also, the DKW (Dvoretzky, Kiefer, and Wolfowitz) inequality says that

P( √n ‖Fn − F‖_∞ > x ) ≤ 2e^{−2x²},  x > 0.
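The Kolmogorov series and the DKW bound can be compared against simulation. Below, n, the replication count, the threshold x = 1 and the seed are arbitrary illustrative choices; the empirical tail probability of √n ∆n should sit near the series value, which in turn lies just below the DKW bound (the bound is exactly the first term of the alternating series).

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 1_000, 5_000
i = np.arange(1, n + 1)

stats = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))                       # U(0, 1) sample
    d = max(np.max(i / n - u), np.max(u - (i - 1) / n))    # KS statistic
    stats[r] = np.sqrt(n) * d

x = 1.0
emp = np.mean(stats > x)
kolmogorov = 2 * sum((-1) ** (j + 1) * np.exp(-2 * j * j * x * x)
                     for j in range(1, 50))
dkw = 2 * np.exp(-2 * x * x)
print(emp, kolmogorov, dkw)
```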
Let X1, X2, · · · be i.i.d. random variables taking values in (X, A) with L(X1) = P. We write µf = ∫ f(x) µ(dx). So

Pnf = (1/n) Σ_{i=1}^n f(Xi)  and  Pf = ∫_X f(x) P(dx).

By the Glivenko–Cantelli theorem,

sup_{x∈R} |Pn((−∞, x]) − P((−∞, x])| → 0 a.s.

On the other hand, if P has no point mass, then

sup_{A measurable} |Pn(A) − P(A)| = 1,

because Pn(A) = 1 and P(A) = 0 for A = {X1, X2, · · · , Xn}. We are therefore searching for classes F such that

sup_{f∈F} |Pnf − Pf| → 0 a.s.   (8.2)

The previous two examples say that (8.2) holds for F = {I(−∞, x]; x ∈ R} but does not hold for the indicators of all measurable sets. The class F is called P-Glivenko–Cantelli if (8.2) holds.

Define Gn = √n(Pn − P). Given k measurable functions f1, · · · , fk with Pfi² < ∞ for all i, one can check by the multivariate Central Limit Theorem (CLT) that

(Gnf1, · · · , Gnfk) =⇒ N_k(0, Σ),

where Σ = (σij)_{1≤i,j≤k} and σij = P(fifj) − (Pfi)(Pfj). This tells us that {Gnf; f ∈ F} satisfies a CLT in R^k for finite F. If F has infinitely many members, how do we define weak convergence? We first define a space similar to R^k. Set

l^∞(F) = { bounded functions z : F → R }

with norm ‖z‖_∞ = sup_{f∈F} |z(f)|. Then (l^∞(F), ‖·‖_∞) is a Banach space; when F has finitely many elements, say k, (l^∞(F), ‖·‖_∞) = (R^k, ‖·‖_∞).

So Gn : F → R is a random map and can be viewed as an element of the Banach space l^∞(F). We say F is P-Donsker if Gn =⇒ G, where G is a tight element of l^∞(F).
LEMMA 8.1 (Bernstein's inequality). Let X1, · · · , Xn be independent random variables with mean zero, |Xj| ≤ K for some constant K and all j, and σj² = EXj² > 0. Let Sn = Σ_{i=1}^n Xi and sn² = Σ_{j=1}^n σj². Then

P(|Sn| ≥ x) ≤ 2e^{−x²/(2(sn² + Kx))},  x > 0.

Proof. It suffices to show that

P(Sn ≥ x) ≤ e^{−x²/(2(sn² + Kx))},  x > 0.   (8.3)

First, for any λ > 0,

P(Sn ≥ x) ≤ e^{−λx} E e^{λSn} = e^{−λx} Π_{i=1}^n E e^{λXi}.

Now

E e^{λXj} = 1 + Σ_{i=2}^∞ (λ^i/i!) EXj^i ≤ 1 + (σj²λ²/2) Σ_{i=2}^∞ (λK)^{i−2} = 1 + σj²λ²/(2(1 − λK)) ≤ exp( σj²λ²/(2(1 − λK)) )

if λK < 1. Thus

P(Sn ≥ x) ≤ exp( −λx + λ²sn²/(2(1 − λK)) ).

Now (8.3) follows by choosing λ = x/(sn² + Kx).
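A numeric sanity check of Bernstein's inequality: with bounded summands, the empirical tail probability must sit below the bound. Below, the summands are Uniform(−1, 1) (mean 0, K = 1, σj² = 1/3); n, the threshold x and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps, K = 200, 50_000, 1.0
X = rng.uniform(-1, 1, size=(reps, n))   # |X_j| <= 1, var 1/3 each
Sn = X.sum(axis=1)
s2 = n / 3                               # s_n^2

x = 15.0
emp = np.mean(np.abs(Sn) >= x)
bound = 2 * np.exp(-x ** 2 / (2 * (s2 + K * x)))
print(emp, bound)                        # emp is well below bound
```

The bound is loose here (Sn is nearly normal with standard deviation √(n/3) ≈ 8.2, so the true tail is far smaller), which is typical: Bernstein's inequality trades sharpness for full generality.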
Recall Gnf = Σ_{i=1}^n (f(Xi) − Pf)/√n. The random variable Yi := (f(Xi) − Pf)/√n here corresponds to Xi in the above lemma. Note that Var(Σ_{i=1}^n Yi) = Var(f(X1)) ≤ Pf² and ‖Yi‖_∞ ≤ 2‖f‖_∞/√n. Applying the above lemma to Σ_{i=1}^n Yi, we obtain

COROLLARY 8.1 For any bounded, measurable function f,

P(|Gnf| ≥ x) ≤ 2 exp( −(1/4) x²/(Pf² + x‖f‖_∞/√n) )

for any x > 0.
Notice that
P(|Gnf| ≥ x) ≤ 2e^{−Cx} when x is large, and P(|Gnf| ≥ x) ≤ 2e^{−Cx^2} when x is small. (8.4)
Let us estimate E max_{1≤i≤m} |Yi| provided m is large and P(|Yi| ≥ x) ≤ e^{−x} for all i and x > 0. Then E|Yi|^k ≤ k! for k = 1, 2, · · · . The immediate estimate is
E max_{1≤i≤m} |Yi| ≤ ∑_{i=1}^m E|Yi| ≤ m.
By Hölder's inequality,
E max_{1≤i≤m} |Yi| ≤ (E max_{1≤i≤m} |Yi|^2)^{1/2} ≤ (∑_{i=1}^m E|Yi|^2)^{1/2} ≤ √(2m).
Let ψ(x) = e^x − 1, x ≥ 0. Following this logic, by Jensen's inequality,
ψ( E max_{1≤i≤m} |Yi|/2 ) ≤ E max_{1≤i≤m} ψ(|Yi|/2) ≤ m · max_{1≤i≤m} Eψ(|Yi|/2) ≤ 2m.
Taking the inverse ψ^{−1} of both sides, we obtain
E max_{1≤i≤m} |Yi| ≤ 2 log(1 + 2m).
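The logarithmic bound derived above can be checked numerically: standard exponential variables satisfy P(Yi ≥ x) = e^{−x} exactly, and the true value of the expected maximum is the harmonic number H_m ≈ log m + 0.577. A sketch (assuming NumPy; `max_bound` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(2)

def max_bound(m):
    """The Orlicz/Jensen bound on E max_{1<=i<=m} |Y_i|."""
    return 2.0 * np.log(2.0 * m)

# Standard exponentials have tail P(Y >= x) = e^{-x}; the bound is
# conservative by roughly a factor of 2 against H_m ~ log m.
for m in (10, 100, 1000):
    emp = rng.exponential(size=(5000, m)).max(axis=1).mean()
    print(m, emp, max_bound(m))
```

The point of the ψ-trick is precisely this log m growth: a naive union bound gives only the linear estimate m.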
This together with (8.4) leads to the following lemma.
LEMMA 8.2 For any finite class F of bounded, measurable, square-integrable functions with |F| elements, there is a universal constant C > 0 such that
C · E max_{f∈F} |Gnf| ≤ max_{f∈F} ‖f‖∞ · log(1 + |F|)/√n + max_{f∈F} ‖f‖P,2 · √(log(1 + |F|)),
where ‖f‖P,2 = (Pf^2)^{1/2}.
Proof. Define a = 24 max_f ‖f‖∞/√n and b = 24 max_f Pf^2. Define Af = (Gnf) I{|Gnf| > b/a} and Bf = (Gnf) I{|Gnf| ≤ b/a}; then Gnf = Af + Bf. It follows that
E max_f |Gnf| ≤ E max_f |Af| + E max_f |Bf|. (8.5)
For x ≥ b/a and x ≤ b/a the exponent in the bound of Corollary 8.1 is bounded above by −3x/a and −3x^2/b, respectively. Hence
P(|Af| ≥ x) ≤ P(|Gnf| ≥ x ∨ b/a) ≤ 2 exp(−3x/a),
P(|Bf| ≥ x) = P(b/a ≥ |Gnf| ≥ x) ≤ P(|Gnf| ≥ x ∧ b/a) ≤ 2 exp(−3x^2/b)
for all x ≥ 0. Let ψp(x) = exp(x^p) − 1 for x ≥ 0 and p ≥ 1. Then
Eψ1(|Af|/a) = E ∫_0^{|Af|/a} e^x dx = ∫_0^∞ P(|Af| ≥ ax) e^x dx ≤ ∫_0^∞ 2e^{−3x} e^x dx = 1.
By a similar argument we find that Eψ2(|Bf|/√b) ≤ 1. Because ψp(·) is convex for all p ≥ 1, by Jensen's inequality
ψ1( E max_f |Af|/a ) ≤ E ψ1( max_f |Af|/a ) ≤ E ∑_f ψ1(|Af|/a) ≤ |F|.
Taking the inverse function ψ1^{−1} of both sides gives E max_f |Af| ≤ a log(1 + |F|), the first term on the right hand side. Similarly, ψ2 gives E max_f |Bf| ≤ √b · √(log(1 + |F|)), the second term. The conclusion follows from (8.5).
We actually used the following lemma above.

LEMMA 8.3 Let X ≥ 0 be a random variable and f : [0,∞) → R be differentiable with ∫_0^s |f′(t)| dt < ∞ for every s > 0 and ∫_0^∞ P(X ≥ t)|f′(t)| dt < ∞. Then
Ef(X) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).

Proof. First, since ∫_0^s |f′(t)| dt < ∞ for every s > 0, we have
f(X) − f(0) = ∫_0^X f′(t) dt = ∫_0^∞ I(X ≥ t) f′(t) dt.
Then, by Fubini's theorem,
Ef(X) = E ∫_0^∞ I(X ≥ t) f′(t) dt + f(0) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).
Remark. When f(x) is differentiable, f′(x) is not necessarily Lebesgue integrable, even though it may be integrable as an improper Riemann integral. The following is an example. Let
f(x) = x^2 cos(1/x^2) if x ≠ 0, and f(0) = 0.
Then
f′(x) = 2x cos(1/x^2) + (2/x) sin(1/x^2) if x ≠ 0, and f′(0) = 0,
and f′ is not Lebesgue-integrable on [0, 1], although its improper Riemann integral exists. We only need to show g(x) := (2/x) sin(1/x^2) is not Lebesgue-integrable. Suppose it is; then, substituting t = 1/x^2,
∫_0^1 (1/x) |sin(1/x^2)| dx = (1/2) ∫_1^∞ (|sin t|/t) dt = +∞,
which yields a contradiction.
Now we describe the size of F. For any f ∈ F, define
‖f‖P,r = (P|f|^r)^{1/r}.
Let l and u be two functions; the bracket [l, u] is the set of all functions f such that l ≤ f ≤ u. An ǫ-bracket in Lr(P) is a bracket [l, u] such that ‖u − l‖P,r < ǫ. The bracketing number N[ ](ǫ, F, Lr(P)) is the minimum number of ǫ-brackets needed to cover F. The entropy with bracketing is the logarithm of the bracketing number.
8.1 Outer Measures and Expectations

Recall that X is a random variable from (Ω, G) → (R, B(R)) if X^{−1}(B) ∈ G for every set B ∈ B(R). For an arbitrary map X the inverse image may fail to be in G, particularly when G is small. For example, when G = {∅, Ω}, many maps are not random variables. In empirical processes we will deal with Z := sup_{t∈T} Xt for some index set T. If T is big, then Z may not be measurable, and it does not make sense to study expectations and probabilities of such a map directly. But there is a way to get around this.

Definition. Let X be an arbitrary map from (Ω, G, P) → (R, B(R)). Define
E∗X = inf{ EY ; Y ≥ X and Y is a measurable map : (Ω, G) → (R, B(R)) };
P∗(X ∈ A) = inf{ P(X ∈ B); B ⊃ A, B ∈ B(R), {X ∈ B} ∈ G }, A ∈ B(R).
One can show that the infimum E∗X is achieved, i.e., there exists a random variable X∗ : (Ω, G, P) → (R, B(R)) such that EX∗ = E∗X. Further, X∗ is P-almost surely unique: if there exist two such random variables X∗1 and X∗2, then P(X∗1 = X∗2) = 1. We call X∗ the measurable cover function. Obviously
(X1 + X2)∗ ≤ X∗1 + X∗2, and X∗1 ≤ X∗2 if X1 ≤ X2.
One can define the inner versions E∗ and P∗ similarly.
Let (M,d) be a metric space. A sequence of arbitrary maps Xn : (Ωn,Gn) → (M,d)
converges in distribution to a random vector X if
E∗f(Xn) → Ef(X)
for any bounded, continuous function f defined on (M,d). We still have an analogue of the
Portmanteau theorem.
THEOREM 28 The following are equivalent:
(i) E∗f(Xn) → Ef(X) for every bounded, continuous function f defined on (M,d);
(ii) E∗f(Xn) → Ef(X) for every bounded, Lipschitz function f defined on (M,d), that
is, there is a constant C > 0 such that |f(x) − f(y)| ≤ Cd(x, y) for any x, y ∈M ;
(iii) lim inf_n P∗(Xn ∈ G) ≥ P(X ∈ G) for any open set G, with P∗ the inner probability;
(iv) lim sup_n P∗(Xn ∈ F) ≤ P(X ∈ F) for any closed set F, with P∗ the outer probability;
(v) lim_n P∗(Xn ∈ H) = P(X ∈ H) for any set H such that P(X ∈ ∂H) = 0, where ∂H is the boundary of H.
Let
J[ ](δ, F, L2(P)) = ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ.
The proof of the following theorem is easy and is omitted.
THEOREM 29 Every class F of measurable functions such that N[ ](ǫ,F , L1(P )) <∞ for
every ǫ > 0 is P -Glivenko-Cantelli.
THEOREM 30 Every class F of measurable functions with J[ ](1,F , L2(P )) < ∞ is P -
Donsker.
To prove the theorem, we need some preparation.
THEOREM 31 A sequence of arbitrary maps Xn = (Xn,t, t ∈ T ) : (Ωn, Gn) → l∞(T )
converges weakly to a tight random element if and only if both of the following conditions
hold:
(i) The sequence (Xn,t1 , · · · ,Xn,tk) converges in distribution in Rk for every finite set
of points t1, · · · , tk in T ;
(ii) for every ǫ, η > 0 there exists a partition of T into finitely many sets T1, · · · , Tk
such that
lim sup_{n→∞} P∗( sup_i sup_{s,t∈Ti} |Xn,s − Xn,t| ≥ ǫ ) ≤ η.
Proof. We only prove sufficiency.
Step 1: A preparation. For each integer m ≥ 1, let T^m_1, · · · , T^m_{km} be a partition of T such that
lim sup_{n→∞} P∗( sup_j sup_{s,t∈T^m_j} |Xn,s − Xn,t| ≥ 2^{−m} ) ≤ 2^{−m}. (8.1)
Since the supremum above becomes smaller as a partition becomes more refined, w.l.o.g. assume the partitions are successive refinements as m increases. Define a semi-metric
ρm(s, t) = 0 if s, t belong to the same partitioning set T^m_j for some j, and ρm(s, t) = 1 otherwise.
Easily, by the nesting of the partitions, ρ1 ≤ ρ2 ≤ · · · . Define
ρ(s, t) = ∑_{m=1}^∞ ρm(s, t)/2^m for s, t ∈ T.
Obviously, ρ(s, t) ≤ ∑_{k=m+1}^∞ 2^{−k} ≤ 2^{−m} when s, t ∈ T^m_j for some j. So (T, ρ) is totally bounded. Let T0 be the countable ρ-dense subset constructed by choosing an arbitrary point t^m_j from every T^m_j.
Step 2: Construct the limit of Xn. For two finite subsets S = (s1, · · · , sp) and U = (s1, · · · , sp, sp+1, · · · , sq) of T, by assumption (i), there are two probability measures µp on R^p
By the same argument as in (8.3), using (8.1), we obtain
lim sup_{n→∞} |E∗f(Xn) − Ef(X)| ≤ ‖f‖∞ 2^{−m} + 2^{−m} + 2^{−m+1} ‖f‖∞.
Letting m → ∞, we have that E∗f(Xn) → Ef(X) as n → ∞.
Let F be a class of measurable functions f : X → R. Recall
a(δ) = δ/√(log N[ ](δ, F, L2(P))),
J[ ](δ, F, L2(P)) = ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ.

LEMMA 8.4 Suppose there are δ > 0 and a measurable function F > 0 such that P∗f^2 < δ^2 and |f| ≤ F for every f ∈ F. Then there exists a universal constant C > 0 such that
C · E∗P ‖Gn‖_F ≤ J[ ](δ, F, L2(P)) + √n P∗( F I{F > √n a(δ)} ).
Proof. Step 1: Truncation. Recall Gnf = (1/√n) ∑_{i=1}^n (f(Xi) − Pf). If |f| ≤ g, then
|Gnf| ≤ (1/√n) ∑_{i=1}^n (g(Xi) + Pg).
It follows that
E∗ ‖Gn( f I{F > √n a(δ)} )‖_F ≤ 2√n P( F I{F > √n a(δ)} ).
We will bound E∗ ‖Gn( f I{F ≤ √n a(δ)} )‖_F next. The bracketing numbers of the class of functions { f I{F ≤ √n a(δ)}; f ∈ F } are smaller than the bracketing numbers of the class F. To simplify notation, we assume, w.l.o.g., |f| ≤ √n a(δ) for every f ∈ F.
Step 2: Discretization of the integral. Choose an integer q0 such that 4δ ≤ 2^{−q0} ≤ 8δ. We claim there exist a nested sequence of partitions {Fqi; 1 ≤ i ≤ Nq} of F, indexed by the integers q ≥ q0, into Nq disjoint subsets, and measurable functions ∆qi ≤ 2F such that
C ∑_{q≥q0} 2^{−q} √(log Nq) ≤ ∫_0^δ √(log N[ ](ǫ, F, L2(P))) dǫ, (8.4)
sup_{f,g∈Fqi} |f − g| ≤ ∆qi and P∆qi^2 ≤ 2^{−2q} (8.5)
for a universal constant C > 0. For convenience, write N(ǫ) = N[ ](ǫ, F, L2(P)). Since N(ǫ) is non-increasing in ǫ,
∫_0^δ √(log N(ǫ)) dǫ ≥ ∫_0^{2^{−(q0+3)}} √(log N(ǫ)) dǫ = ∑_{q=q0+3}^∞ ∫_{2^{−(q+1)}}^{2^{−q}} √(log N(ǫ)) dǫ ≥ ∑_{q=q0+3}^∞ 2^{−(q+1)} √(log N(2^{−(q−3)})).
Re-indexing the sum, we have that
(1/16) ∑_{q=q0}^∞ 2^{−q} √(log Nq) ≤ ∫_0^δ √(log N(ǫ)) dǫ,
where Nq = N(2^{−q}). By the definition of Nq, there exists a cover {Fqi = [lqi, uqi]; 1 ≤ i ≤ Nq} of F by brackets, where lqi and uqi are functions such that P(uqi − lqi)^2 < 2^{−2q}; disjointify it to obtain a partition. Set ∆qi = uqi − lqi. Then ∆qi ≤ 2F, sup_{f,g∈Fqi} |f − g| ≤ ∆qi, and P∆qi^2 ≤ 2^{−2q}.
We can also assume, w.l.o.g., that the partitions {Fqi; 1 ≤ i ≤ Nq} are successively refined as q increases. Actually, at the q-th level we can take intersections of elements from {Fki; 1 ≤ i ≤ Nk} as k goes from q0 to q, namely ∩_{k=q0}^q Fk,ik. The total number of such intersections is no more than N̄q := Nq0 Nq0+1 · · · Nq. Since the current Fqi's become smaller, all requirements on Fqi still hold obviously, except possibly (8.4). Now we verify (8.4). Noticing √(log N̄q) ≤ ∑_{k=q0}^q √(log Nk), we get
∑_{q≥q0} 2^{−q} √(log N̄q) ≤ ∑_{q≥q0} ∑_{k=q0}^q 2^{−q} √(log Nk) = ∑_{k=q0}^∞ ∑_{q≥k} 2^{−q} √(log Nk) = 2 ∑_{k=q0}^∞ 2^{−k} √(log Nk).
So (8.4) still holds when replacing Nq by N̄q.
Step 3: Chaining. For each fixed q ≥ q0, choose a fixed element fqi from each partition set Fqi, and set
πqf = fqi, ∆qf = ∆qi, if f ∈ Fqi. (8.6)
Thus πqf and ∆qf run through a set of Nq functions as f runs through F. Without loss of generality, we can assume
∆qf ≤ ∆q−1f for any f ∈ F and q ≥ q0 + 1. (8.7)
Actually, let ∆̄qi be the measurable cover function of sup_{f,g∈Fqi} |f − g|. Then (8.5) also holds for ∆̄qi. Let Fq−1,j be the partition set at the (q−1)-th level such that Fqi ⊂ Fq−1,j. Then sup_{f,g∈Fqi} |f − g| ≤ sup_{f,g∈Fq−1,j} |f − g|, so ∆̄qi ≤ ∆̄q−1,j. The assertion (8.7) follows by replacing ∆qi with ∆̄qi.
By (8.6), P(πqf − f)^2 ≤ max_i P∆qi^2 < 2^{−2q}. Thus P ∑_q (πqf − f)^2 = ∑_q P(πqf − f)^2 < ∞, so ∑_q (πqf − f)^2 < ∞ a.s. This implies
πqf → f a.s. under P (8.8)
as q → ∞ for any f ∈ F. Define
aq = 2^{−q}/√(log Nq+1),
τ = τ(n, f) = inf{ q ≥ q0 : ∆qf > √n aq }.
The value of τ is taken to be +∞ if the above set is empty; τ is the first time ∆qf > √n aq. By construction, 2a(δ) = 2δ (log N[ ](δ, F, L2(P)))^{−1/2} ≤ aq0, since 2^{−q0} ≥ 4δ and N_{q0+1} = N(2^{−(q0+1)}) ≤ N(δ). By Step 1, we know that |∆qf| ≤ 2√n a(δ) ≤ √n aq0. This says that τ > q0.
We claim that
f − πq0f = ∑_{q=q0+1}^∞ (f − πqf) I{τ = q} + ∑_{q=q0+1}^∞ (πqf − πq−1f) I{τ ≥ q}, a.s. under P. (8.9)
In fact, write f − πq0f = (f − πq1f) + ∑_{q=q0+1}^{q1} (πqf − πq−1f) for q1 > q0. Now,
(i) if τ = ∞, the right hand side above is identical to lim_{q→∞} πqf − πq0f = f − πq0f a.s. by (8.8);
(ii) if τ = q1 < ∞, the right hand side is equal to (f − πq1f) + ∑_{q=q0+1}^{q1} (πqf − πq−1f).
Step 4: Bound the terms in the chain. Apply Gn to both sides of (8.9). For |f| ≤ g, note that |Gnf| ≤ |Gng| + 2√n Pg. By (8.6), |f − πqf| ≤ ∆qf. One obtains that
E∗ ‖∑_{q=q0+1}^∞ Gn( (f − πqf) I{τ = q} )‖_F ≤ ∑_{q=q0+1}^∞ E∗ ‖Gn( ∆qf I{τ = q} )‖_F + 2√n ∑_{q=q0+1}^∞ ‖P( ∆qf I{τ = q} )‖_F. (8.10)
Now, by (8.7), ∆qf I{τ = q} ≤ ∆q−1f I{τ = q} ≤ √n aq−1. Moreover, P(∆qf I{τ = q})^2 ≤ 2^{−2q}. By Lemma 8.2, the first sum on the right of (8.10) is bounded by a multiple of
∑_{q=q0+1}^∞ ( aq−1 log Nq + 2^{−q} √(log Nq) ).
By Hölder's inequality,
P( ∆qf I{τ = q} ) ≤ (P(∆qf)^2)^{1/2} P(τ = q)^{1/2} ≤ 2^{−q} P(∆qf > √n aq)^{1/2} ≤ 2^{−q} (√n aq)^{−1} (P(∆qf)^2)^{1/2} ≤ 2^{−2q} (√n aq)^{−1},
so the last sum in (8.10) is bounded by 2 ∑_{q=q0+1}^∞ 2^{−2q}/aq = 2 ∑_{q=q0+1}^∞ 2^{−q} √(log Nq+1). In summary,
E∗ ‖∑_{q=q0+1}^∞ Gn( (f − πqf) I{τ = q} )‖_F ≤ C ∑_{q=q0+1}^∞ 2^{−q} √(log Nq) (8.11)
for some universal constant C > 0.
Second, there are at most Nq functions πqf − πq−1f, and the indicator I{τ ≥ q} takes only two values. Because the partitions are nested, |πqf − πq−1f| I{τ ≥ q} ≤ ∆q−1f I{τ ≥ q} ≤ √n aq−1. The L2(P)-norm of πqf − πq−1f is bounded by 2^{−q+1}. Applying Lemma 8.2 again, we obtain
E∗ ‖∑_{q=q0+1}^∞ Gn( (πqf − πq−1f) I{τ ≥ q} )‖_F ≤ ∑_{q=q0+1}^∞ ( aq−1 log Nq + 2^{−q} √(log Nq) ) ≤ C ∑_{q=q0+1}^∞ 2^{−q} √(log Nq) (8.12)
for some universal constant C > 0.
At last, we consider πq0f. Because |πq0f| ≤ F ≤ a(δ)√n ≤ √n aq0 and P(πq0f)^2 ≤ δ^2 by assumption, another application of Lemma 8.2 leads to
E∗ ‖Gn πq0f‖_F ≤ aq0 log Nq0 + δ √(log Nq0).
In view of the choice of q0, this is no more than the bound in (8.12). All of the above inequalities together with (8.4) yield the desired result.
COROLLARY 8.2 For any class F of measurable functions with envelope function F, there exists a universal constant C such that
E∗P ‖Gn‖_F ≤ C · J[ ](‖F‖P,2, F, L2(P)).

Proof. Since F is covered by the single bracket [−F, F], we have N[ ](δ, F, L2(P)) = 1 for δ = 2‖F‖P,2. Review the definitions in Lemma 8.4 and choose δ = ‖F‖P,2. It follows that
a(δ) = ‖F‖P,2 / √(log N[ ](‖F‖P,2, F, L2(P))).
Now, by Markov's inequality,
√n P∗( F I(F > √n a(δ)) ) ≤ ‖F‖P,2^2 / a(δ) = ‖F‖P,2 √(log N[ ](‖F‖P,2, F, L2(P))),
which is bounded by J[ ](‖F‖P,2, F, L2(P)) since the integrand in J[ ] is non-increasing in ǫ and hence
∫_0^{‖F‖P,2} √(log N[ ](ǫ, F, L2(P))) dǫ ≥ ‖F‖P,2 √(log N[ ](‖F‖P,2, F, L2(P))).
Proof of Theorem 30. We will use Theorem 31 to prove this theorem. Part (i) is easily satisfied; now we verify (ii).
Note there is no envelope assumed on F and we do not know whether Pf^2 < δ^2. Let G = {f − g; f, g ∈ F}. From a given set of ǫ-brackets {[li, ui]} over F, we can construct 2ǫ-brackets over G by taking differences [li − uj, ui − lj] of upper and lower bounds; indeed, ‖(ui − lj) − (li − uj)‖P,2 ≤ 2ǫ. Therefore the bracketing number N[ ](2ǫ, G, L2(P)) is bounded by the square of the bracketing number N[ ](ǫ, F, L2(P)):
N[ ](2ǫ, G, L2(P)) ≤ N[ ](ǫ, F, L2(P))^2.
This says that
J[ ](ǫ, G, L2(P)) < ∞. (8.13)
For a given small δ > 0, by the definition of N[ ](δ, F, L2(P)), choose a minimal number of brackets of size δ that cover F, and use them to form a partition F = ∪i Fi. The subset of G consisting of differences f − g of functions f and g belonging to the same partitioning set consists of functions of L2(P)-norm smaller than δ. Hence, by Lemma 8.4, there exist a finite number a(δ) and a universal constant C such that
C · E∗ sup_i sup_{f,g∈Fi} |Gn(f − g)| = C E∗ sup_{h∈H} |Gnh| ≤ J[ ](δ, H, L2(P)) + 2√n P( F I(F > a(δ)√n) ) ≤ J[ ](δ, G, L2(P)) + 2√n P( F I(F > a(δ)√n) ),
since H := {f − g; f, g ∈ Fi for some i} ⊂ G. Here the envelope function F can be taken equal to the supremum of the absolute values of the upper and lower bounds of finitely many brackets that cover F, for instance a minimal set of brackets of size 1; this F is square integrable. The second term above is bounded by a(δ)^{−1} P( F^2 I(F > a(δ)√n) ), which goes to 0 as n → ∞ for fixed δ by dominated convergence. First let n → ∞, then let δ ↓ 0; the left hand side goes to 0. Combined with Markov's inequality, part (ii) of Theorem 31 follows.
Example. Let F = {ft = I(−∞, t]; t ∈ R}. Then
‖ft − fs‖P,2 = (F(t) − F(s))^{1/2} for s < t,
where F is the cdf of P. Cut the range [0, 1] of F into pieces of length less than ǫ^2. Since F is nondecreasing and right continuous with left limits, there exists a partition −∞ = t0 < t1 < · · · < tk = ∞ with F(ti−) − F(ti−1) < ǫ^2 for each i, so that [I(−∞, ti−1], I(−∞, ti)] is an ǫ-bracket for i = 1, 2, · · · , k. It follows that
N[ ](ǫ, F, L2(P)) ≤ 2/ǫ^2 for ǫ ∈ (0, 1).
Moreover, ∫_0^1 √(log(2/ǫ^2)) dǫ < ∞, so J[ ](1, F, L2(P)) < ∞. The previous classical Glivenko-Cantelli Lemma 26 and Donsker Theorem 27 follow from Theorems 29 and 30.
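The bracket construction above can be made concrete. A sketch under the assumption P = Uniform[0, 1] (so F(t) = t and ‖ft − fs‖P,2 = √(t − s) for s < t; `bracket_points` is a hypothetical helper):

```python
import numpy as np

def bracket_points(eps):
    """Partition points giving eps-brackets for {1(-inf,t]; t in R}
    under P = Uniform[0,1]: cutting [0,1] into k = ceil(1/eps^2)
    pieces of equal length gives brackets of L2(P)-size at most eps,
    and k <= 2/eps^2 pieces suffice."""
    k = int(np.ceil(1.0 / eps**2))
    return np.linspace(0.0, 1.0, k + 1), k

for eps in (0.5, 0.2, 0.1):
    ts, k = bracket_points(eps)
    print(eps, k, np.sqrt(np.diff(ts)).max())   # bracket count and max size
```

Each printed maximum bracket size stays at or below ǫ while the count stays below 2/ǫ², matching the entropy bound used above.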
Since ‖f‖ = sup_t |f(t)| is the norm in l∞, it is continuous. By the continuous mapping theorem, we have
sup_{t∈R} √n |Fn(t) − F(t)| → sup_{t∈R} |G(t)|
weakly, i.e., in distribution, as n → ∞, where G(t) is the Gaussian process we mentioned before.
Example. Let F = {fθ; θ ∈ Θ} be a collection of measurable functions with Θ ⊂ R^d bounded. Suppose Pm^r < ∞ and |fθ1(x) − fθ2(x)| ≤ m(x)‖θ1 − θ2‖ for any θ1 and θ2. Then there is a constant K, depending only on d, such that
N[ ](ǫ‖m‖P,r, F, Lr(P)) ≤ K (diam(Θ)/ǫ)^d (8.14)
for any 0 < ǫ < diam(Θ). Indeed, as long as ‖θ1 − θ2‖ < ǫ, we have l(x) := fθ1(x) − m(x)ǫ ≤ fθ2(x) ≤ fθ1(x) + m(x)ǫ =: u(x). Also, ‖u − l‖P,r = 2ǫ‖m‖P,r, so [l, u] is a 2ǫ‖m‖P,r-bracket; the factor 2 can be absorbed into K. It thus suffices to count the minimal number of balls of radius ǫ needed to cover Θ.
Note that Θ ⊂ R^d is contained in a cube with side length diam(Θ). We can cover Θ with fewer than (2 diam(Θ)/ǫ)^d cubes of side ǫ. The circumscribed balls have radius a constant multiple of ǫ and also cover Θ; intersecting these balls with Θ still covers Θ. So the claim is true. Taking r = 2, the entropy integral is finite, so F is a P-Donsker class whenever Pm^2 < ∞.
Example (Sobolev classes). For k ≥ 1, let
F = { f : [0, 1] → R; ‖f‖∞ ≤ 1 and ∫_0^1 (f^{(k)}(x))^2 dx ≤ 1 }.
Then there exists a universal constant K such that
log N[ ](ǫ, F, ‖·‖∞) ≤ K (1/ǫ)^{1/k}.
Since ‖f‖P,2 ≤ ‖f‖∞ for any P, it is easy to check that N[ ](ǫ, F, L2(P)) ≤ N[ ](ǫ, F, ‖·‖∞) for any P. So F is a P-Donsker class for any P.
Example (Bounded Variation). Let
F = { f : R → [−1, 1] of total variation bounded by 1 }.
Any function of bounded variation is the difference of two monotone increasing functions. Then for any r ≥ 1 and probability measure P,
log N[ ](ǫ, F, Lr(P)) ≤ K (1/ǫ).
Therefore F is P-Donsker for every P.
9 Consistency and Asymptotic Normality of Maximum Like-
lihood Estimators
9.1 Consistency
Let X1, X2, · · · , Xn be a random sample from a population distribution with pdf or pmf f(x|θ), where θ is an unknown parameter. Under certain conditions on f(x|θ), the MLE θ̂ will be consistent and satisfy a Central Limit Theorem, but in some cases these conclusions fail. Let us see a good example and a pathological example.
Example. Let X1, X2, · · · , Xn be i.i.d. from Exp(θ) with pdf
f(x|θ) = θe^{−θx} if x > 0, and 0 otherwise.
So EX1 = 1/θ and Var(X1) = 1/θ^2. It is easy to check that the MLE is θ̂ = 1/X̄. By the CLT,
√n (X̄ − 1/θ) =⇒ N(0, θ^{−2})
as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX1) = θ and g′(EX1) = −θ^2. By the Delta method,
√n (θ̂ − θ) =⇒ N(0, g′(µ)^2 σ^2) = N(0, θ^2)
as n → ∞, where µ = EX1 and σ^2 = Var(X1). Of course, θ̂ → θ in probability.
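The Delta-method conclusion is easy to check by simulation; a minimal sketch (assuming NumPy; variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

theta, n, reps = 2.0, 500, 20000
# X ~ Exp(theta) with density theta*e^{-theta x}, so E X = 1/theta and
# the MLE is 1/Xbar; the Delta method predicts
# sqrt(n)(1/Xbar - theta) => N(0, theta^2).
X = rng.exponential(scale=1.0 / theta, size=(reps, n))
Z = np.sqrt(n) * (1.0 / X.mean(axis=1) - theta)
print(Z.mean(), Z.std())
```

The sample mean of Z is close to 0 (up to a small finite-sample bias of order θ/√n) and its standard deviation is close to θ = 2.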
Example. Let X1, X2, · · · , Xn be i.i.d. from U[0, θ], where θ is unknown. The MLE is θ̂ = max_i Xi. First, it is easy to check that θ̂ → θ in probability. But θ̂ does not satisfy a CLT. In fact, since θ̂ ≤ θ,
P( n(θ − θ̂)/θ ≤ x ) → 1 − e^{−x} if x > 0, and 0 otherwise,
as n → ∞.
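This non-normal limit can also be checked by simulation; a sketch (assuming NumPy; names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

theta, n, reps = 3.0, 1000, 20000
# MLE for U[0, theta] is max_i X_i, and n(theta - max)/theta => Exp(1),
# since P(n(theta - max)/theta <= x) = 1 - (1 - x/n)^n -> 1 - e^{-x}.
M = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
Z = n * (theta - M) / theta
for x in (0.5, 1.0, 2.0):
    print(x, np.mean(Z <= x), 1.0 - np.exp(-x))
```

The empirical cdf of Z matches 1 − e^{−x} closely; note the n (not √n) scaling, which is why no CLT holds here.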
For some cases even the consistency, i.e., θ → θ in probability, doesn’t hold. Next, we
will study some sufficient conditions for consistency. Later we provide sufficient conditions
for CLT.
Let X1, · · · , Xn be a random sample from a density pθ with respect to a reference measure µ, that is, Pθ(X1 ∈ A) = ∫_A pθ(x) µ(dx), where θ ∈ Θ. The maximum likelihood estimator θ̂n maximizes the function h(θ) := ∑_i log pθ(Xi) over Θ, or equivalently the function
Mn(θ) = (1/n) ∑_{i=1}^n log (pθ/pθ0)(Xi),
where θ0 is the true parameter. Under suitable conditions, by the weak law of large numbers,
Mn(θ) → M(θ) := Eθ0 log (pθ/pθ0)(X1) = ∫ pθ0(x) log (pθ/pθ0)(x) µ(dx) (9.1)
in probability as n → ∞. The number −M(θ) is called the Kullback-Leibler divergence of pθ and pθ0. Let Y = pθ0(Z)/pθ(Z), where Z ∼ pθ. Then EθY = 1 and
M(θ) = −Eθ(Y log Y) ≤ −(EθY) log(EθY) = 0
by Jensen's inequality, since x log x is convex. Of course M(θ0) = 0, that is, θ0 attains the maximum of M(θ). Obviously, M(θ0) = Mn(θ0) = 0. The following theorem gives the consistency of θ̂n.
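In the exponential family of the earlier example, M(θ) has a closed form and the maximization at θ0 can be seen directly. A sketch (assuming NumPy; the formula below is derived, not quoted, from Exp(θ) densities θe^{−θx} and Eθ0 X1 = 1/θ0):

```python
import numpy as np

theta0 = 2.0

def M(theta):
    """M(theta) = E_{theta0} log(p_theta/p_theta0)(X1) for Exp(theta):
    log(theta/theta0) - (theta - theta0)*E_{theta0} X1
      = log(theta/theta0) - (theta - theta0)/theta0."""
    return np.log(theta / theta0) - (theta - theta0) / theta0

grid = np.linspace(0.5, 5.0, 1000)
vals = M(grid)
print(grid[np.argmax(vals)], vals.max())  # maximizer near theta0, max near 0
```

The maximum value 0 is attained at θ = θ0 and M(θ) < 0 elsewhere, which is exactly the separation condition exploited by the consistency theorem below.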
THEOREM 32 Suppose sup_{θ∈Θ} |Mn(θ) − M(θ)| →P 0 and sup_{θ: d(θ,θ0)≥ǫ} M(θ) < M(θ0) for any ǫ > 0. Then θ̂n → θ0 in probability.

Proof. For any ǫ > 0, we need to show that
P(d(θ̂n, θ0) ≥ ǫ) → 0 (9.2)
as n → ∞. By the condition, there exists δ > 0 such that
sup_{θ: d(θ,θ0)≥ǫ} M(θ) < M(θ0) − δ.
Thus, if d(θn, θ0) ≥ ǫ, then M(θn) < M(θ0) − δ. Note that
Since Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability, and Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ), the left hand
side goes to one.
Case 2. Suppose Ψn(θ) is nondecreasing satisfying Ψn(θn) = oP (1). Then
P (|θn − θ0| ≥ ǫ) ≤ P (θn > θ0 + ǫ) + P (θn < θ0 − ǫ)
≤ P (Ψn(θn) ≥ Ψn(θ0 + ǫ)) + P (Ψn(θn) ≤ Ψn(θ0 − ǫ)).
Now, Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability. This together with Ψn(θn) = oP (1) shows that
P (|θn − θ0| ≥ ǫ) → 0 as n→ ∞.
Example. Let X1, · · · , Xn be a random sample from Exp(θ), that is, the density function is pθ(x) = θ^{−1} exp(−x/θ) I(x ≥ 0) for θ > 0. Under this parametrization EX1 = θ, and under the true model the MLE θ̂n = X̄n → θ0 in probability. Let us verify that this conclusion can indeed be deduced from Theorem 33.
Actually, ψθ(x) = (∂/∂θ) log pθ(x) = −θ^{−1} + xθ^{−2} for x ≥ 0. Thus Ψn(θ) = −θ^{−1} + X̄n θ^{−2}, which converges to Ψ(θ) = Eθ0(−θ^{−1} + X1θ^{−2}) = θ^{−2}(θ0 − θ), positive or negative according to whether θ is smaller or bigger than θ0. Also, Ψn(θ̂n) = 0 at θ̂n = X̄n. Applying Theorem 33 to −Ψn and −Ψ, we obtain the consistency result.
9.2 Asymptotic Normality
Now we study the central limit theorems for the MLE.
Now we illustrate the idea of showing the normality of the MLE. Recall (9.3). Do a Taylor expansion of ψθ(Xi) around θ0:
ψ_{θ̂n}(Xi) ≈ ψθ0(Xi) + (θ̂n − θ0) ψ′θ0(Xi) + (1/2)(θ̂n − θ0)^2 ψ′′θ0(Xi),
where ψ′θ and ψ′′θ denote the first and second derivatives of ψθ in θ. We will use the following notation:
Pf = ∫ f(x) P(dx) for any real function f(x), and Pn = (1/n) ∑_{i=1}^n δXi.
Then Pnf = (1/n) ∑_{i=1}^n f(Xi). Thus,
0 = Ψn(θ̂n) ≈ Pnψθ0 + (θ̂n − θ0) Pnψ′θ0 + (1/2)(θ̂n − θ0)^2 Pnψ′′θ0.
Reorganize it in the following form:
√n(θ̂n − θ0) ≈ −√n(Pnψθ0) / ( Pnψ′θ0 + (1/2)(θ̂n − θ0) Pnψ′′θ0 ).
Recall the Fisher information
I(θ0) = Eθ0( ∂ log pθ(X1)/∂θ |_{θ=θ0} )^2 = Eθ0(ψθ0(X1))^2 = −Eθ0( ∂^2 log pθ(X1)/∂θ^2 |_{θ=θ0} ) = −Eθ0(ψ′θ0(X1)).
By the CLT and the LLN, √n(Pnψθ0) =⇒ N(0, I(θ0)), Pnψ′θ0 → −I(θ0) and Pnψ′′θ0 → Eθ0 ψ′′θ0(X1) in probability. This illustrates that
√n(θ̂n − θ0) =⇒ N(0, I(θ0)^{−1})
as n → ∞. The next two theorems will make these steps rigorous.
Let g(θ) = Eθ0(mθ(X1)) = ∫ mθ(x) pθ0(x) µ(dx). We need the following condition:
g(θ) = g(θ0) + (1/2)(θ − θ0)^T Vθ0 (θ − θ0) + o(‖θ − θ0‖^2), (9.1)
where
Vθ0 = ( Eθ0( ∂^2 mθ(X1)/∂θi∂θj |_{θ=θ0} ) )_{1≤i,j≤d}, θ = (θ1, · · · , θd) ∈ R^d.
THEOREM 34 For each θ in an open subset of Euclidean space, let x 7→ mθ(x) be a measurable function such that θ 7→ mθ(x) is differentiable at θ0 for P-almost every x with derivative ṁθ0(x), and such that, for every θ1 and θ2 in a neighborhood of θ0 and a measurable function n(x) with En(X1)^2 < ∞,
|mθ1(x) − mθ2(x)| ≤ n(x) ‖θ1 − θ2‖.
Furthermore, assume the map θ 7→ Emθ(X1) has the expansion (9.1). If Pnm_{θ̂n} ≥ sup_θ Pnmθ − oP(n^{−1}) and θ̂n →P θ0, then
√n(θ̂n − θ0) = −Vθ0^{−1} (1/√n) ∑_{i=1}^n ṁθ0(Xi) + oP(1).
In particular, √n(θ̂n − θ0) =⇒ N(0, Vθ0^{−1} E(ṁθ0 ṁθ0^T) Vθ0^{−1}) as n → ∞.
A statistical model {pθ; θ ∈ Θ} is called differentiable in quadratic mean at θ0 if there exists a measurable vector-valued function lθ0 such that, as θ → θ0,
∫ [ √pθ − √pθ0 − (1/2)(θ − θ0)^T lθ0 √pθ0 ]^2 dµ = o(‖θ − θ0‖^2).
THEOREM 35 Suppose that the model {Pθ : θ ∈ Θ} is differentiable in quadratic mean at an inner point θ0 of Θ ⊂ R^k. Furthermore, suppose that there exists a measurable function l(x) with Eθ0 l(X1)^2 < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0,
|log pθ1(x) − log pθ2(x)| ≤ l(x) ‖θ1 − θ2‖.
If the Fisher information matrix Iθ0 is non-singular and θ̂n is consistent, then
√n(θ̂n − θ0) = Iθ0^{−1} (1/√n) ∑_{i=1}^n lθ0(Xi) + oP(1).
In particular, √n(θ̂n − θ0) =⇒ N(0, Iθ0^{−1}) as n → ∞, where
Iθ0 = −( E ∂^2 log pθ(X1)/∂θi∂θj |_{θ=θ0} )_{1≤i,j≤k}.
We need some preparations for proving the above theorems.
Given functions x 7→ mθ(x), θ ∈ R^d, we need conditions that ensure that, for a given sequence rn → ∞ and any sequence hn = O∗P(1),
Gn( rn(m_{θ0+hn/rn} − mθ0) − hn^T ṁθ0 ) →P 0. (9.2)
LEMMA 9.1 For each θ in an open subset of Euclidean space, let mθ(x) be measurable as a function of x for each θ and, as a function of θ, differentiable at θ0 for almost every x (w.r.t. P) with derivative ṁθ0(x); suppose moreover there is a measurable function m with Pm^2 < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0,
|mθ1(x) − mθ2(x)| ≤ m(x) ‖θ1 − θ2‖.
Then (9.2) holds for every random sequence hn that is bounded in probability.

Proof. Because hn is bounded in probability, to show (9.2) it is enough, w.l.o.g., to show
sup_{‖θ‖≤1} |Gn( rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0 )| →P 0
as n → ∞. Define
Fn = { fθ := rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0 ; ‖θ‖ ≤ 1 }.
Then
|fθ1(x) − fθ2(x)| ≤ 2m(x) ‖θ1 − θ2‖
for any θ1 and θ2 in the unit ball of R^d, by the Lipschitz condition (which also gives |ṁθ0| ≤ m). Further, set
Hn = sup_{‖θ‖≤1} |rn(m_{θ0+θ/rn} − mθ0) − θ^T ṁθ0|.
Then Hn is an envelope of Fn, and Hn → 0 as n → ∞ by the definition of the derivative. Since Hn ≤ 2m, by the Dominated Convergence Theorem, δn := (PHn^2)^{1/2} → 0. Thus, by Corollary 8.2 and (8.14),
E∗P ‖Gn‖_{Fn} ≤ C · J[ ](‖Hn‖P,2, Fn, L2(P)) ≤ C ∫_0^{δn} √(log(Kǫ^{−d})) dǫ → 0
as n → ∞. The desired conclusion follows.
We need some preparation before proving Theorem 34.
Let Pn be the empirical distribution of a random sample of size n from a distribution P, and, for every θ in a metric space (Θ, d), let mθ(x) be a measurable function. Let θ̂n (nearly) maximize the criterion function Pnmθ. The point θ0 is the truth, that is, the maximizer of θ 7→ Pmθ over Θ. Recall Gn = √n(Pn − P).
THEOREM 36 (Rate of Convergence) Assume that for fixed constants C and α > β, for every n and every sufficiently small δ > 0,
sup_{d(θ,θ0)>δ} P(mθ − mθ0) ≤ −Cδ^α,
E∗ sup_{d(θ,θ0)<δ} |Gn(mθ − mθ0)| ≤ Cδ^β.
If the sequence θ̂n satisfies Pnm_{θ̂n} ≥ Pnmθ0 − OP(n^{α/(2β−2α)}) and converges in outer probability to θ0, then n^{1/(2α−2β)} d(θ̂n, θ0) = O∗P(1).
Proof. Set rn = n^{1/(2α−2β)} and write Pnm_{θ̂n} ≥ Pnmθ0 − Rn with 0 ≤ Rn = OP(n^{α/(2β−2α)}) = OP(rn^{−α}).
Partition the parameter set into the shells Sj,n = {θ : 2^{j−1} < rn d(θ, θ0) ≤ 2^j}, j ranging over the integers. If rn d(θ̂n, θ0) ≥ 2^M for a given M, then θ̂n is in one of the shells Sj,n with j ≥ M, and for that j, sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −Rn. It follows that
P∗(rn d(θ̂n, θ0) ≥ 2^M) ≤ ∑_{j≥M, 2^j≤ǫrn} P∗( sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −K rn^{−α} ) + P∗(2 d(θ̂n, θ0) ≥ ǫ) + P(rn^α Rn ≥ K). (9.3)
The middle term on the right goes to zero; the last term can be made arbitrarily small by choosing K large. We only need to show the sum is arbitrarily small for M large enough as n → ∞. By the first condition,
sup_{θ∈Sj,n} P(mθ − mθ0) ≤ −C 2^{(j−1)α} rn^{−α}.
For M such that (1/2) C 2^{(M−1)α} ≥ K, by the fact that sup_s(fs + gs) ≤ sup_s fs + sup_s gs,
P∗( sup_{θ∈Sj,n} (Pnmθ − Pnmθ0) ≥ −K rn^{−α} ) ≤ P∗( sup_{θ∈Sj,n} |Gn(mθ − mθ0)| ≥ C√n 2^{(j−1)α}/(2 rn^α) ).
Therefore, the sum in (9.3) is bounded by
∑_{j≥M, 2^j≤ǫrn} P∗( sup_{θ∈Sj,n} |Gn(mθ − mθ0)| ≥ C√n 2^{(j−1)α}/(2 rn^α) ) ≤ ∑_{j≥M} (2^j/rn)^β · 2 rn^α/(√n 2^{(j−1)α})
by Markov's inequality, the second condition, and the definition of rn. The right hand side is a constant multiple of ∑_{j≥M} 2^{j(β−α)} and hence goes to zero as M → ∞.
COROLLARY 9.1 For each θ in an open subset of Euclidean space, let mθ(x) be a measurable function such that there exists m(x) with Pm^2 < ∞ satisfying
|mθ1(x) − mθ2(x)| ≤ m(x) ‖θ1 − θ2‖.
Furthermore, assume
Pmθ = Pmθ0 + (1/2)(θ − θ0)^T Vθ0 (θ − θ0) + o(‖θ − θ0‖^2) (9.4)
with Vθ0 nonsingular. If Pnm_{θ̂n} ≥ Pnmθ0 − OP(n^{−1}) and θ̂n →P θ0, then √n(θ̂n − θ0) = OP(1).

Proof. (9.4) implies that the first condition of Theorem 36 holds with α = 2, since Vθ0 is nonsingular (and negative definite at the maximizer θ0). We now apply Corollary 8.2 to the class of functions F = {mθ − mθ0; ‖θ − θ0‖ < δ} to see the second condition is valid with β = 1. This class has envelope function F = δm, and
E∗ sup_{‖θ−θ0‖<δ} |Gn(mθ − mθ0)| ≤ C ∫_0^{δ‖m‖P,2} √(log N[ ](ǫ, F, L2(P))) dǫ ≤ C ∫_0^{δ‖m‖P,2} √(log(K(δ/ǫ)^d)) dǫ = C1 δ
by (8.14) with diam(Θ) a constant multiple of δ, where C1 depends on ‖m‖P,2.
Proof of Theorem 34. By Lemma 9.1 with rn = √n,
Gn( √n(m_{θ0+hn/√n} − mθ0) − hn^T ṁθ0 ) →P 0. (9.5)
Expanding P(m_{θ0+hn/√n} − mθ0) by condition (9.1), we obtain
nPn(m_{θ0+hn/√n} − mθ0) = (1/2) hn^T Vθ0 hn + hn^T Gnṁθ0 + oP(1)
for every sequence hn bounded in probability. By Corollary 9.1, √n(θ̂n − θ0) is bounded in probability (i.e., tight). Taking hn = √n(θ̂n − θ0) and hn = −Vθ0^{−1} Gnṁθ0, we then obtain the Taylor expansions of Pnm_{θ̂n} and Pnm_{θ0 − Vθ0^{−1} Gnṁθ0/√n} as follows:
nPn(m_{θ̂n} − mθ0) = (1/2) hn^T Vθ0 hn + hn^T Gnṁθ0 + oP(1),
nPn(m_{θ0 − Vθ0^{−1} Gnṁθ0/√n} − mθ0) = −(1/2) (Gnṁθ0)^T Vθ0^{−1} Gnṁθ0 + oP(1),
where the second identity is obtained through a bit of algebra. By the definition of θ̂n, the left hand side of the first equation is at least that of the second one up to oP(1); so are the right hand sides. Taking the difference and completing the square, we have
(1/2) (hn + Vθ0^{−1} Gnṁθ0)^T Vθ0 (hn + Vθ0^{−1} Gnṁθ0) + oP(1) ≥ 0.
Since Vθ0 is strictly negative-definite, the quadratic form must converge to zero in probability; so does ‖hn + Vθ0^{−1} Gnṁθ0‖. That is, hn = −Vθ0^{−1} Gnṁθ0 + oP(1).
10 Appendix

Let A be a collection of subsets of Ω and let B be generated by A, that is, B = σ(A). Let P be a probability measure on (Ω, B).

LEMMA 10.1 Suppose A has the following properties: (i) Ω ∈ A; (ii) A^c ∈ A if A ∈ A; and (iii) ∪_{i=1}^m Ai ∈ A if Ai ∈ A for all 1 ≤ i ≤ m. Then, for any B ∈ B and ǫ > 0, there exists A ∈ A such that P(B∆A) < ǫ.

Proof. Let B′ be the set of B ∈ B satisfying the conclusion. Obviously, A ⊂ B′ ⊂ B. It is enough to verify that B′ is a σ-algebra.
It is easy to see that (i) Ω ∈ B′ and (ii) B^c ∈ B′ if B ∈ B′, since A∆B = A^c∆B^c. (iii) If Bi ∈ B′ for i ≥ 1, there exist Ai ∈ A such that P(Bi∆Ai) < ǫ/2^{i+1} for all i ≥ 1. Evidently, ∪_{i=1}^n Bi ↑ ∪_{i=1}^∞ Bi as n → ∞. Therefore there exists n0 < ∞ such that |P(∪_{i=1}^∞ Bi) − P(∪_{i=1}^{n0} Bi)| ≤ ǫ/2. It is easy to check that (∪_{i=1}^{n0} Bi)∆(∪_{i=1}^{n0} Ai) ⊂ ∪_{i=1}^{n0}(Bi∆Ai). Write B = ∪_{i=1}^∞ Bi, B̃ = ∪_{i=1}^{n0} Bi and A = ∪_{i=1}^{n0} Ai. Then A ∈ A. Note B∆A ⊂ (B\B̃) ∪ (B̃∆A). The above facts show that P(B∆A) ≤ ǫ/2 + ∑_{i=1}^{n0} ǫ/2^{i+1} < ǫ. Thus B′ is a σ-algebra.
LEMMA 10.2 Let X1, X2, · · · , Xm, m ≥ 1, be random variables defined on (Ω, F, P). Let f(x1, · · · , xm) be a real measurable function with E|f(X1, · · · , Xm)|^p < ∞ for some p ≥ 1. Then there exist {fn(X1, · · · , Xm); n ≥ 1} such that
(i) fn(X1, · · · , Xm) → f(X1, · · · , Xm) a.s.;
(ii) fn(X1, · · · , Xm) → f(X1, · · · , Xm) in Lp(Ω, F, P);
(iii) for each n ≥ 1, fn(X1, · · · , Xm) = ∑_{i=1}^{kn} ci gi1(X1) · · · gim(Xm) for some kn < ∞, constants ci, and gij(Xj) = I_{Ai,j}(Xj) for some sets Ai,j ∈ B(R), for all 1 ≤ i ≤ kn and 1 ≤ j ≤ m.

Proof. To save notation, we write f = f(x1, · · · , xm). Since E|f I(|f| ≤ C) − f|^p → 0 as C → ∞, choose Ck such that E|f I(|f| ≤ Ck) − f|^p ≤ 1/k^2 for k = 1, 2, · · · . We will show that there exists a function gk = gk(x1, · · · , xm) of the form in (iii) such that E|f I(|f| ≤ Ck) − gk|^p ≤ 1/k^2 for all k ≥ 1. Then, by Minkowski's inequality, E|f − gk|^p ≤ 2^p/k^2 for all k ≥ 1, and assertion (ii) follows. Also, this implies E(∑_{k≥1} |f − gk|^p) < ∞, hence ∑_{k≥1} |f − gk|^p < ∞ a.s., and we obtain (i). Therefore, to prove this lemma we may assume w.l.o.g. that f is bounded, and we need to show that there exist {fn = fn(x1, · · · , xm); n ≥ 1} of the form in (iii) such that
E|f − fn|^p ≤ 1/n^2 (10.6)
for all n ≥ 1.
Since f is bounded, for any n ≥ 1 there exists hn such that
sup_{x∈R^m} |f(x) − hn(x)| < 1/(2n^2), (10.7)
where hn is a simple function, i.e., hn(x) = ∑_{i=1}^{kn} ci I{x ∈ Bi} for some kn < ∞, constants ci, and sets Bi ∈ B(R^m).
Now set X = (X1, · · · , Xm) ∈ R^m and let µ be the distribution of X under P. Let A be the set of all finite unions of sets in A1 := { ∏_{l=1}^m Al ∈ B(R^m); Al ∈ B(R) }. By the construction of B(R^m), we know that B(R^m) = σ(A). It is not difficult to verify that A satisfies the conditions in Lemma 10.1. Thus there exist Ei ∈ A such that
∫ |I_{Bi}(x) − I_{Ei}(x)|^p dµ = µ(Bi∆Ei) < 1/(2cn^2)^p and Ei = ∪_{j=1}^{ki} ∏_{l=1}^m Ai,j,l,
where c = 1 + ∑_{i=1}^{kn} |ci| and Ai,j,l ∈ B(R) for all i, j and l. Now, since ‖·‖p is a norm, we have
‖hn(X) − ∑_{i=1}^{kn} ci I_{Ei}(X)‖p ≤ ∑_{i=1}^{kn} |ci| · ‖I_{Ei}(X) − I_{Bi}(X)‖p ≤ 1/(2n^2). (10.8)
Note that the intersection of any finite number of product sets ∏_{l=1}^m Ai,j,l is still in A1. By the inclusion-exclusion formula, I_{Ei}(X) is a finite linear combination of indicators I_F(X) with F ∈ A1. Thus, fn(X) := ∑_{i=1}^{kn}