
STAT:5100 (22S:193) Statistical Inference I
Week 13

Luke Tierney

University of Iowa

Fall 2015


Monday, November 16, 2015

Recap

• Normal populations

• Order statistics

• Marginal CDF and density of an order statistic

• Little “o” and big “O” notation


Monday, November 16, 2015 Order Statistics

Example

• A N(µ, σ²) population has both mean µ and median µ.
• Either the sample mean $\bar X_n$ or the sample median $\tilde X_n$ could be used to estimate µ.
• Which would produce a better estimate?
• We can explore this question using both simulation and theory; a simulation sketch follows below.
• Some R code: http://www.stat.uiowa.edu/~luke/classes/193/median.R.
• The standard deviation of the sample median seems to satisfy
$$\mathrm{SD}(\tilde X_n) \approx \frac{1.25\,\sigma}{\sqrt{n}}$$
• Many statistics have sampling distributions that follow this square root relationship.
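The median.R script linked above is not reproduced here; the following is a minimal sketch in the same spirit (the sample size, replication count, and seed are arbitrary choices):

```r
## Compare the sampling variability of the mean and the median for N(0,1)
## data; a sketch in the spirit of median.R with arbitrary n, nrep, seed.
set.seed(42)
n <- 100        # sample size
nrep <- 10000   # number of simulated samples
means   <- replicate(nrep, mean(rnorm(n)))
medians <- replicate(nrep, median(rnorm(n)))
sd(means)               # close to 1 / sqrt(n) = 0.1
sd(medians)             # close to 1.25 / sqrt(n) = 0.125
sd(medians) / sd(means) # ratio near 1.25
```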


Monday, November 16, 2015 Order Statistics

Example (continued)

• Suppose we are considering estimating µ with $\bar X$ using a sample of size n.
• What would be the equivalent sample size $n_E$ we would need to achieve the same accuracy using the median?
• We need to solve the equation
$$\mathrm{Var}(\bar X_n) = \mathrm{Var}(\tilde X_{n_E})$$
or
$$\frac{\sigma^2}{n} \approx \frac{(1.25)^2\,\sigma^2}{n_E}$$
• The solution is $n_E \approx (1.25)^2 n = 1.5625\,n$.
• The ratio $n/n_E \approx 1/1.5625 = 0.64$ is called the relative efficiency of $\tilde X$ to $\bar X$.
• $\tilde X$ is less efficient than $\bar X$ if the data really are normally distributed.
• But $\tilde X$ is much more robust to outliers than $\bar X$.


Monday, November 16, 2015 Approximations and Limits

Approximations and Limits

• If we can't say anything precise about a sampling distribution, we often look for approximations.
• Approximations are usually stated as limit results.
• This is common in mathematics, for example:
  • The statement "f is differentiable at x∗" means
  $$\lim_{x \to x^*} \frac{f(x) - f(x^*)}{x - x^*} = f'(x^*)$$
  • This can also be expressed as
  $$f(x) = f(x^*) + f'(x^*)(x - x^*) + o(x - x^*)$$
  as x → x∗.
  • This suggests the linear approximation
  $$f(x) \approx f(x^*) + f'(x^*)(x - x^*)$$
  for x close to x∗; a numeric illustration follows below.
• Care and experience are needed in interpreting "≈" and "close to."
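To make the o(x − x∗) term concrete, here is a small numeric check; the choice f(x) = exp(x) and x∗ = 0 is arbitrary:

```r
## The linear approximation error of f(x) = exp(x) at x* = 0 shrinks
## faster than the step h itself; f and x* are arbitrary choices.
f <- exp
xstar <- 0
for (h in 10^-(1:5)) {
  err <- f(xstar + h) - (f(xstar) + exp(xstar) * h)  # f'(x) = exp(x)
  cat(sprintf("h = %.0e  error = %.3e  error/h = %.3e\n", h, err, err / h))
}
## error/h -> 0 as h -> 0, i.e. the error is o(h).
```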


Monday, November 16, 2015 Approximations and Limits

• We will look at two kinds of convergence results:
  • Convergence of sequences of random variables
  • Convergence of sequences of probability distributions
• This will help with questions like:
  • Should $\bar X_n$ be close to µ for large samples?
  • Can the probability distribution of the error $\bar X_n - \mu$ be approximated by a normal distribution?
• These will be formalized as limits as n → ∞:
  • $\bar X_n$ converges to µ.
  • The distribution of
  $$\frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
  converges to a standard normal distribution.


Monday, November 16, 2015 Approximations and Limits

• We will develop tools to help us answer questions like:
  • If Xn → X and Yn → Y, does this imply that Xn + Yn → X + Y?
  • If f is continuous and Xn → X, does this imply f(Xn) → f(X)?
  • If the distribution of
  $$\frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
  converges to a N(0, 1) distribution and Sn → σ, can we conclude that the distribution of
  $$\frac{\bar X_n - \mu}{S_n/\sqrt{n}}$$
  also converges to a N(0, 1) distribution?


Monday, November 16, 2015 Convergence of Sequences of Random Variables

Convergence of Sequences of Random Variables

Examples

• Suppose we want to use a statistic Tn to estimate a parameter θ.
• To decide whether this makes sense, a minimal requirement might be that Tn → θ as n → ∞.
• This property is known as consistency.
• The Weak Law of Large Numbers is an example of such a result.
• In showing that an approximate distribution for $\sqrt{n}(\bar X_n - \mu)/\sigma$ can also be used as an approximate distribution for $\sqrt{n}(\bar X_n - \mu)/S_n$, a useful step is to show that
$$\frac{\bar X_n - \mu}{S_n/\sqrt{n}} - \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}}\left(\frac{\sigma}{S_n} - 1\right) \to 0$$


Monday, November 16, 2015 Convergence of Sequences of Random Variables

• Some simulations: http://www.stat.uiowa.edu/~luke/classes/193/convergence.R.
• X1, X2, . . . are i.i.d. Bernoulli(p) and
$$\hat P_n = \frac{\sum_{i=1}^n X_i}{n}.$$
• Almost all sample paths converge to p; a sketch of this sample-path picture follows below.
• This means
$$P(\hat P_n \to p) = P(\{s \in S : \hat P_n(s) \to p\}) = 1.$$
• This is called almost sure convergence.
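The convergence.R script linked above is not reproduced here; this is a minimal sketch of the sample-path picture (p, n, and the number of paths are arbitrary choices):

```r
## Running proportions of successes for several Bernoulli(p) sequences;
## a sketch in the spirit of convergence.R with arbitrary p, n, npaths.
set.seed(1)
p <- 0.3; n <- 5000; npaths <- 8
plot(NULL, xlim = c(1, n), ylim = c(0, 1), log = "x",
     xlab = "n", ylab = expression(hat(P)[n]))
for (j in 1:npaths)
  lines(cumsum(rbinom(n, size = 1, prob = p)) / seq_len(n))
abline(h = p, lty = 2)  # almost every path settles down at p
```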


Monday, November 16, 2015 Almost Sure Convergence

Almost Sure Convergence

Definition

A sequence X1, X2, . . . of random variables converges almost surely to a random variable X if
$$P\left(\lim_{n\to\infty} X_n = X\right) = 1$$
or
$$P\left(\left\{s \in S : \lim_{n\to\infty} X_n(s) = X(s)\right\}\right) = 1$$

Notation:
$$X_n \overset{\text{a.s.}}{\to} X$$
$$X_n \to X \quad \text{a.s.}$$
$$\operatorname{P\text{-}lim}_{n\to\infty} X_n = X$$


Monday, November 16, 2015 Almost Sure Convergence

Example

Theorem (Strong Law of Large Numbers)

Let X1, X2, . . . be i.i.d. with E[|X1|] < ∞ and µ = E[X1]. Then
$$\bar X_n \to \mu$$
almost surely.


Monday, November 16, 2015 Almost Sure Convergence

Example

• Let Z be a standard normal random variable.
• Define Zn as
$$Z_n = \frac{i}{n} \quad \text{if} \quad \frac{i - 0.5}{n} \le Z < \frac{i + 0.5}{n}$$
for all integers i.
• Using the notation {b} for the closest integer to b:
$$Z_n = \frac{1}{n}\{nZ\}.$$
• Then Zn → Z almost surely; a numeric illustration follows below:
  • For any number z define $z_n = \frac{1}{n}\{nz\}$.
  • Then $|z - z_n| \le \frac{1}{n} \to 0$; i.e. zn → z.
  • Therefore $Z_n(s) = \frac{1}{n}\{nZ(s)\} \to Z(s)$ for all s ∈ S.
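A quick numeric illustration of the discretization (the single simulated draw below is arbitrary):

```r
## Discretize one standard normal draw on a grid of width 1/n; the
## discretized values Zn = {nZ}/n converge to Z. The draw is arbitrary.
set.seed(7)
Z <- rnorm(1)
for (n in c(1, 10, 100, 1000)) {
  Zn <- round(n * Z) / n  # {b} = closest integer to b
  cat(sprintf("n = %4d  Zn = %.6f  |Zn - Z| = %.2e\n", n, Zn, abs(Zn - Z)))
}
## |Zn - Z| <= 1/n for every realization, so Zn -> Z.
```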


Wednesday, November 18, 2015

Recap

• Normal populations

• Approximations and limits — motivations from calculus

• Almost sure convergence

• Strong Law of Large Numbers


Wednesday, November 18, 2015 Almost Sure Convergence

Theorem
Suppose Xn → X almost surely and f is continuous. Then f(Xn) → f(X) almost surely.

Proof.

• Let A = {s ∈ S : Xn(s) → X(s)}.
• Since f is continuous, for any s ∈ A
$$f(X_n(s)) \to f(X(s))$$
• Therefore
$$\{s \in S : f(X_n(s)) \to f(X(s))\} \supset A$$
• So
$$P(f(X_n) \to f(X)) \ge P(A) = 1.$$


Wednesday, November 18, 2015 Almost Sure Convergence

Example

• Suppose X1, X2, . . . are independent draws from a population with finite mean µ and finite variance σ².
• The sample variance can be written as
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n-1}\left[\sum_{i=1}^n (X_i - \mu)^2 - n(\bar X_n - \mu)^2\right] = \frac{n}{n-1}\left[\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - (\bar X_n - \mu)^2\right] = \frac{n}{n-1}\,U_n$$


Wednesday, November 18, 2015 Almost Sure Convergence

Example (continued)

• By the Strong Law of Large Numbers
$$\bar X_n \to \mu \quad \text{a.s.}$$
$$\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 \to \sigma^2 \quad \text{a.s.}$$
• Therefore
$$U_n = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - (\bar X_n - \mu)^2 \to \sigma^2 \quad \text{a.s.}$$
• So
$$S_n^2 = \frac{n}{n-1}\,U_n \to \sigma^2 \quad \text{a.s.}$$
• Since the square root is continuous, we also have
$$S_n = \sqrt{S_n^2} \to \sqrt{\sigma^2} = \sigma \quad \text{a.s.}$$
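A small simulation consistent with this result; the Exponential(1) population and the sample sizes are arbitrary choices:

```r
## Sn approaches sigma as n grows, here for an Exponential(1) population
## (sigma = 1); the population and the sample sizes are arbitrary.
set.seed(3)
for (n in c(10, 100, 10000, 1000000)) {
  x <- rexp(n)  # mean 1, variance 1
  cat(sprintf("n = %7d  Sn = %.4f\n", n, sd(x)))
}
## Sn settles down near sigma = 1.
```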


Wednesday, November 18, 2015 Almost Sure Convergence

• Almost sure convergence, if you can show that you have it, is the easiest form of convergence to work with.
• But almost sure convergence can be difficult to verify.
• It is also more than we need for many useful results.
• It is useful to look for other notions of convergence that may be easier to verify.
• One alternative is convergence in probability.


Wednesday, November 18, 2015 Convergence in Probability

Convergence in Probability

Definition

A sequence of random variables X1, X2, . . . converges in probability to a random variable X if for every ε > 0
$$\lim_{n\to\infty} P(|X_n - X| \ge \varepsilon) = 0$$
or
$$\lim_{n\to\infty} P(|X_n - X| < \varepsilon) = 1$$

Notation:
$$X_n \overset{P}{\to} X$$
$$\operatorname{plim}_{n\to\infty} X_n = X$$


Wednesday, November 18, 2015 Convergence in Probability

Examples

• Weak Law of Large Numbers: $\bar X_n \overset{P}{\to} \mu$.
• Suppose U1, U2, . . . are independent Uniform[0, 1] random variables.
• Let Xn = max{U1, . . . , Un} and let X ≡ 1.
• Then for any ε > 0
$$P(|X_n - X| \ge \varepsilon) = P(X_n \le 1 - \varepsilon) = \begin{cases} (1 - \varepsilon)^n & \text{if } \varepsilon < 1 \\ 0 & \text{otherwise} \end{cases} \;\to 0$$
• So $X_n \overset{P}{\to} 1$; a quick numeric check follows below.
• It is also true that Xn → 1 almost surely.
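A check of the simulated rate against the exact formula; ε, the sample sizes, and the replication count are arbitrary choices:

```r
## P(|Xn - 1| >= eps) = (1 - eps)^n for Xn = max of n Uniform[0,1] draws;
## compare simulation to the exact value. eps, n values, nrep are arbitrary.
set.seed(11)
eps <- 0.05
for (n in c(10, 50, 100, 200)) {
  sim <- mean(replicate(5000, max(runif(n)) <= 1 - eps))
  cat(sprintf("n = %3d  simulated = %.4f  exact = %.4f\n",
              n, sim, (1 - eps)^n))
}
```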


Wednesday, November 18, 2015 Convergence in Probability

• If Xn → X almost surely, then $X_n \overset{P}{\to} X$:
  • For any ε > 0
  $$P(|X_n - X| \ge \varepsilon) = E\left[\mathbf{1}_{\{|X_n - X| \ge \varepsilon\}}\right].$$
  • For any s ∈ S where Xn(s) → X(s) we have
  $$\mathbf{1}_{\{|X_n(s) - X(s)| \ge \varepsilon\}} \to 0.$$
  • So $\mathbf{1}_{\{|X_n - X| \ge \varepsilon\}} \to 0$ almost surely.
  • By the dominated convergence theorem this implies that
  $$P(|X_n - X| \ge \varepsilon) \to 0.$$


Wednesday, November 18, 2015 Convergence in Probability

• It is possible to have convergence in probability but not almost sure convergence.
• Let X1, X2, . . . be independent Bernoulli random variables with P(Xn = 1) = 1/n.
• For any ε > 0 with ε ≤ 1
$$P(|X_n| \ge \varepsilon) = \frac{1}{n} \to 0.$$
• So $X_n \overset{P}{\to} 0$.
• But for every n
$$P(\text{all of } X_n, X_{n+1}, \ldots \text{ are zero}) = \prod_{k=n}^{\infty}\left(1 - \frac{1}{k}\right) \le \exp\left\{-\sum_{k=n}^{\infty} \frac{1}{k}\right\} = 0$$
• So with probability one the sequence X1, X2, . . . contains infinitely many ones and cannot converge almost surely to zero; the sketch below shows this empirically.
• So almost sure convergence is stronger than convergence in probability.
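An empirical look at this behavior (the horizon values N are arbitrary): however far out we look, new ones keep appearing.

```r
## Simulate independent Xk ~ Bernoulli(1/k) and record the last index
## k <= N with Xk = 1; it keeps growing with the horizon N (arbitrary Ns).
set.seed(5)
for (N in c(100, 1000, 10000, 100000, 1000000)) {
  x <- rbinom(N, size = 1, prob = 1 / seq_len(N))
  cat(sprintf("N = %7d  last index with Xk = 1: %d\n",
              N, max(which(x == 1))))
}
```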


Wednesday, November 18, 2015 A Sufficient Condition

A Sufficient Condition

Suppose for some p ≥ 1 we have $E[|X_n|^p] < \infty$ for all n, $E[|X|^p] < \infty$, and
$$\lim_{n\to\infty} E[|X_n - X|^p] = 0.$$
Then $X_n \overset{P}{\to} X$.

Proof.
By Markov's inequality,
$$P(|X_n - X| \ge \varepsilon) = P(|X_n - X|^p \ge \varepsilon^p) \le \frac{E[|X_n - X|^p]}{\varepsilon^p} \to 0$$
for any ε > 0.


Wednesday, November 18, 2015 A Sufficient Condition

• If $E[|X_n - X|^p] \to 0$ then Xn is said to converge to X in Lp.
• Usually we use this for p = 2.
• This is called convergence in mean square.
• If the limit is a constant a then the sufficient condition becomes
$$E[(X_n - a)^2] = \mathrm{Var}(X_n) + (E[X_n] - a)^2 \to 0.$$
• This convergence holds if and only if both
$$\mathrm{Var}(X_n) \to 0 \quad\text{and}\quad E[X_n] \to a.$$
• If Xn is used to estimate a then:
  • $E[X_n] - a$ is called the bias of Xn;
  • $E[(X_n - a)^2]$ is the Mean Squared Error (MSE); a numeric check of its decomposition follows below.
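The bias-variance decomposition of the MSE is easy to check numerically; here the divisor-n variance estimator for N(0, 1) data serves as an arbitrary example of a biased estimator of a = σ² = 1:

```r
## Numeric check of MSE = Var + bias^2 for the divisor-n variance
## estimator of N(0,1) data, a biased estimator of a = 1.
set.seed(9)
n <- 20
est <- replicate(100000, { x <- rnorm(n); mean((x - mean(x))^2) })
mse    <- mean((est - 1)^2)
decomp <- var(est) + (mean(est) - 1)^2
c(mse = mse, var_plus_bias2 = decomp)  # agree up to Monte Carlo error
```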


Wednesday, November 18, 2015 Weak Law of Large Numbers

Weak Law of Large Numbers

Theorem
Let X1, X2, . . . be i.i.d. with mean µ and finite variance σ². Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then
$$\bar X_n \overset{P}{\to} \mu.$$

Proof.
$$E[(\bar X_n - \mu)^2] = \mathrm{Var}(\bar X_n) = \frac{\sigma^2}{n} \to 0$$

This is sometimes called (weak) consistency of $\bar X_n$ for µ.


Wednesday, November 18, 2015 Distance and Convergence

Distance and Convergence

• One way to develop a notion of convergence for complicated objects, like random variables, is to define a distance between two objects.
• A distance is a function d(x, y) with these properties:
  • d(x, y) ≥ 0 for all x, y.
  • d(x, y) = d(y, x) for all x, y.
  • d(x, y) = 0 if and only if x = y.
  • d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z.
• A distance is also called a metric.
• A metric space is a set together with a distance.
• Convergence xn → x in a metric space means d(xn, x) → 0.


Wednesday, November 18, 2015 Distance and Convergence

Examples

• Lp convergence corresponds to convergence with respect to the distance
$$d(X_n, X) = E[|X_n - X|^p]^{1/p}.$$
• To satisfy the requirement that d(X, Y) = 0 implies X = Y we need to work in terms of equivalence classes of almost surely equal random variables.
• Convergence in probability also corresponds to convergence with respect to a distance; one possible distance is the Ky Fan distance
$$d(X, Y) = E[\min\{|X - Y|, 1\}];$$
a Monte Carlo sketch of this distance follows below.
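The Ky Fan distance rarely has a closed form, but it is simple to estimate by Monte Carlo; the pair of variables below is an arbitrary example:

```r
## Monte Carlo estimate of the Ky Fan distance E[min(|X - Y|, 1)] for
## X ~ N(0,1) and Y = X + 0.1 Z with Z ~ N(0,1); an arbitrary example.
set.seed(13)
nrep <- 100000
x <- rnorm(nrep)
y <- x + 0.1 * rnorm(nrep)
mean(pmin(abs(x - y), 1))  # small: X and Y are close in probability
```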


Wednesday, November 18, 2015 Distances for Probabilities

Distances for Probabilities

• One possible distance between two probabilities P and Q is the total variation distance
$$d_{TV}(P, Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|.$$
• If P and Q are both continuous with densities f and g then
$$d_{TV}(P, Q) = \frac{1}{2}\int |f(x) - g(x)|\,dx;$$
a numeric evaluation of this formula follows below.
• An analogous result holds if P and Q are both discrete.
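The density formula is straightforward to evaluate numerically; the pair N(0, 1) and N(1, 1) below is an arbitrary example:

```r
## Total variation distance between N(0,1) and N(1,1) via the density
## formula (1/2) * integral of |f - g|; the pair of normals is arbitrary.
f <- function(x) dnorm(x)
g <- function(x) dnorm(x, mean = 1)
0.5 * integrate(function(x) abs(f(x) - g(x)), -Inf, Inf)$value
## about 0.383, matching the closed form 2 * pnorm(0.5) - 1 for a unit shift
```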


Wednesday, November 18, 2015 Distances for Probabilities

• If Fn and F are probability distributions with densities or PMFs fn and f, and fn(x) → f(x) for all x, then
$$d_{TV}(F_n, F) \to 0.$$
• This is known as Scheffé's Theorem.
• If P is continuous and Q is discrete, then
$$d_{TV}(P, Q) = 1.$$
• So total variation distance cannot be used to help with approximating continuous distributions with discrete ones, or vice versa.


Friday, November 20, 2015

Recap

• Almost sure convergence

• Strong Law of Large Numbers

• Convergence in probability

• Weak Law of Large Numbers

• Lp convergence

• Distances and convergence


Friday, November 20, 2015 Distances for Probabilities

• A distance among cumulative distribution functions is the Kolmogorov distance:
$$d_K(F, G) = \sup_{x \in \mathbb{R}} |F(x) - G(x)|$$
• This is a useful distance for continuous distributions or for discrete distributions with a common support.
• It is useful for capturing convergence of a sequence of discrete distributions to a continuous distribution.
• For general discrete distributions it has some undesirable features.


Friday, November 20, 2015 Distances for Probabilities

Example

• Let Fy(x) be the CDF of a random variable that equals y with probability one:
$$F_y(x) = \begin{cases} 1 & \text{if } x \ge y \\ 0 & \text{if } x < y. \end{cases}$$
• Let yn = 1/n.
• Then $d_K(F_{y_n}, F_0) = 1$ for all n.


Friday, November 20, 2015 Distances for Probabilities

• An alternative distance among CDFs is the Lévy distance:
$$d_L(F, G) = \inf\{\varepsilon > 0 : F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon \text{ for all } x \in \mathbb{R}\}$$
• Another way of defining this distance:
  • Think of placing a square parallel to the axes with side ε in a gap between F and G.
  • dL is the largest ε that will fit.
• The Lévy distance between a N(0, 1) and a N(1, 1) distribution is approximately 0.28; a numeric computation follows below.
• For point mass distributions Fx and Fy the Lévy distance is $d_L(F_x, F_y) = |x - y|$ when |x − y| ≤ 1 (the Lévy distance never exceeds 1).
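A numeric computation of the Lévy distance on a grid, consistent with the value quoted above (grid range, resolution, and bisection depth are arbitrary choices):

```r
## Approximate the Levy distance between N(0,1) and N(1,1): bisect on the
## smallest eps with F(x - eps) - eps <= G(x) <= F(x + eps) + eps on a grid.
Fcdf <- function(x) pnorm(x)
Gcdf <- function(x) pnorm(x, mean = 1)
x <- seq(-8, 8, by = 0.001)
holds <- function(eps)
  all(Fcdf(x - eps) - eps <= Gcdf(x) & Gcdf(x) <= Fcdf(x + eps) + eps)
lo <- 0; hi <- 1
for (i in 1:40) {  # bisection on eps
  mid <- (lo + hi) / 2
  if (holds(mid)) hi <- mid else lo <- mid
}
hi  # roughly 0.28, as quoted above
```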


Friday, November 20, 2015 Distances for Probabilities

• Two useful results: Suppose Xn ∼ Fn and X ∼ F. Then
  • $d_L(F_n, F) \to 0$ if and only if Fn(x) → F(x) for all x where F is continuous.
  • $d_L(F_n, F) \to 0$ if and only if
  $$E[g(X_n)] \to E[g(X)]$$
  for all bounded, continuous functions g.


Friday, November 20, 2015 Convergence in Distribution

Convergence in Distribution

Definition

A sequence of random variables X1, X2, . . . converges in distribution to a random variable X if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$$
for all x where FX is continuous.

• This is different: it is really about distributions, not random variables.
• This is also called weak convergence of distributions.
• It corresponds to convergence in the Lévy distance.

Notation:
$$X_n \overset{D}{\to} X$$
$$X_n \Rightarrow X$$
$$\mathcal{L}(X_n) \to \mathcal{L}(X)$$


Friday, November 20, 2015 Convergence in Distribution

Example

• Suppose X ∼ N(0, 1) and let $X_n = (-1)^n X$.
• Then Xn ∼ N(0, 1) for all n by symmetry, so Xn → X in distribution.
• But Xn does not converge to X almost surely or in probability.


Friday, November 20, 2015 Convergence in Distribution

Theorem
A sequence of random variables X1, X2, . . . converges to a random variable X in distribution if and only if
$$P(X_n \in A) \to P(X \in A)$$
for every open set A with P(X ∈ ∂A) = 0, where ∂A is the boundary of A.

Theorem
A sequence of random variables X1, X2, . . . converges to a random variable X in distribution if and only if
$$E[g(X_n)] \to E[g(X)]$$
for all bounded, continuous functions g.

Theorem
Suppose Xn has MGF Mn, n = 1, 2, . . ., X has MGF M, M is finite in a neighborhood of the origin, and Mn(t) → M(t) for all t in a neighborhood of the origin. Then $X_n \overset{D}{\to} X$.


Friday, November 20, 2015 Convergence in Distribution

Theorem
If $X_n \overset{P}{\to} X$ then $X_n \overset{D}{\to} X$.

Theorem
If c is a constant and $X_n \overset{D}{\to} c$ then $X_n \overset{P}{\to} c$.


Friday, November 20, 2015 Convergence in Distribution

Example

• Suppose U1, U2, . . . are independent Uniform[0, 1] random variables.
• Let Xn = max{U1, . . . , Un}.
• Then for 0 < x < 1
$$F_{X_n}(x) = x^n \to 0.$$
• For x ≤ 0 we have $F_{X_n}(x) = 0$, and $F_{X_n}(x) = 1$ for x ≥ 1.
• So $F_{X_n}(x) \to F_X(x)$ for all x, where FX is the CDF of X ≡ 1.
• So $X_n \overset{D}{\to} X$.


Friday, November 20, 2015 Convergence in Distribution

Example (continued)

• Now suppose Yn = min{U1, . . . , Un} and Y ≡ 0.
• The CDF of Yn is $F_{Y_n}(y) = 1 - (1 - y)^n$ for 0 ≤ y ≤ 1.
• For y > 0 we have $F_{Y_n}(y) \to 1$.
• But for y = 0 we have $F_{Y_n}(0) = 0$ for all n.
• So $F_{Y_n}(y) \to F_Y(y)$ for all y except y = 0, where FY is not continuous.
• So $Y_n \overset{D}{\to} Y$.
• $Y_n \overset{P}{\to} 0$ as well (also almost surely).


Friday, November 20, 2015 Convergence in Distribution

Example (continued)

• At what rate does Yn → 0?
• The mean of Yn is
$$E[Y_n] = \int_0^\infty (1 - F_{Y_n}(t))\,dt = \int_0^1 (1 - t)^n\,dt = \frac{1}{n+1} = O(n^{-1}).$$
• What happens to the distribution of Vn = nYn?
• For 0 ≤ v ≤ n the CDF of Vn is
$$F_{V_n}(v) = P(V_n \le v) = P(Y_n \le v/n) = 1 - (1 - v/n)^n \to 1 - e^{-v}$$
• So Vn converges in distribution to V ∼ Exponential(1); a quick simulation follows below.
• The distribution of Yn = Vn/n is approximately Exponential(λ = n).
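A minimal simulation consistent with this limit (n and the replication count are arbitrary choices):

```r
## Compare the simulated distribution of Vn = n * min(U1, ..., Un) with
## its Exponential(1) limit; n and the replication count are arbitrary.
set.seed(17)
n <- 50
vn <- replicate(10000, n * min(runif(n)))
qqplot(qexp(ppoints(500)), vn,
       xlab = "Exp(1) quantiles", ylab = expression(V[n]))
abline(0, 1, lty = 2)  # points fall near the line y = x
```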


Friday, November 20, 2015 Convergence in Distribution

Example

• Let $\hat P_n$ be the sample proportion of successes in n Bernoulli(p) trials.
• What can we say about the distribution of $\hat P_n$ for large n?
• It is useful to look at the standardized version
$$Z_n = \frac{\hat P_n - p}{\sqrt{p(1-p)/n}}.$$
• Some simulations: http://www.stat.uiowa.edu/~luke/classes/193/convergence.R
• The sample paths do not converge.
• Their probability distributions do converge.
• The limiting distribution is the standard normal distribution.
• This suggests that the distribution of $\hat P_n$ for large n is approximately
$$N\left(p, \frac{p(1-p)}{n}\right).$$
• This is an example of the Central Limit Theorem; a simulation sketch follows below.
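Again in the spirit of the linked convergence.R script, a minimal sketch of the distributional convergence (p, n, and the replication count are arbitrary choices):

```r
## Standardized sample proportions against the N(0,1) density; a sketch
## in the spirit of convergence.R with arbitrary p, n, nrep.
set.seed(19)
p <- 0.3; n <- 400; nrep <- 10000
phat <- rbinom(nrep, size = n, prob = p) / n
z <- (phat - p) / sqrt(p * (1 - p) / n)
hist(z, breaks = 50, freq = FALSE, main = "", xlab = expression(Z[n]))
curve(dnorm(x), add = TRUE, lwd = 2)  # the standard normal limit
```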


Friday, November 20, 2015 Central Limit Theorem

Theorem (Central Limit Theorem)

Let X1, X2, . . . be i.i.d. from a population with an MGF that is finite near the origin. Then X1 has finite mean µ and finite variance σ². Let
$$Z_n = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma}$$
and let Z ∼ N(0, 1). Then Zn → Z in distribution, i.e.
$$P(Z_n \le z) \to \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du$$
for all z.


Friday, November 20, 2015 Central Limit Theorem

• If we only assume $E[X_1^2] < \infty$ then the theorem is still true; the proof works with characteristic functions.
• Independence and identical distribution can be weakened somewhat.
