ACTL2002/ACTL5101 Probability and Statistics: Week 5 Video Lecture Notes
Katja Ignatieva
School of Risk and Actuarial Studies, Australian School of Business, University of New South Wales
Contents:
- Parameter estimation
- Estimator I: the method of moments
- Estimator II: maximum likelihood estimator
- Estimator III: Bayesian estimator
- Convergence of series: Chebyshev's Inequality; convergence concepts; application of strong convergence: Law of Large Numbers; application of weak convergence: Central Limit Theorem; applications of convergence in distribution: Normal Approximation to the Binomial and to the Poisson
- Summary
Parameter estimation

Definition of an Estimator
Problem of statistical estimation: a population has some characteristics that can be described by a r.v. X with density f_X(·|θ).
The density has an unknown parameter (or set of parameters) θ.
We observe values of the random sample X_1, X_2, …, X_n from the population f_X(·|θ). Denote the observed sample values by x_1, x_2, …, x_n.
We then estimate the parameter (or some function of the parameter) based on this random sample.
Any statistic, i.e., a function T(X_1, X_2, …, X_n) of observable random variables whose values are used to estimate τ(θ), where τ(·) is some function of the parameter θ, is called an estimator of τ(θ).
A value θ̂ of the statistic, evaluated at the observed sample values x_1, x_2, …, x_n, is called a (point) estimate.
For example:
- T(X_1, X_2, …, X_n) = X̄_n = (1/n)·∑_{j=1}^n X_j is an estimator;
- θ̂ = 0.23 is a point estimate.
Note that θ can be a vector; the estimator is then a set of equations.
Estimator I: the method of moments

The Method of Moments
An example of an estimator: the method of moments estimator (MME).
Let X_1, X_2, …, X_n be a random sample from the population with density f_X(·|θ), which we assume has k parameters, say θ = [θ_1, θ_2, …, θ_k]^⊤.
The method of moments procedure for estimating τ(θ) is:
1. Equate the first k sample moments to the corresponding k population moments;
2. Equate the k population moments to the parameters of the distribution;
3. Solve the resulting system of simultaneous equations.
The method of moments point estimates θ̂ are the values of the estimator corresponding to the data set.
Example & exercise
Example: MME & Binomial distribution
Suppose X_1, X_2, …, X_n is a random sample from a Bin(m, p) distribution with known number of trials (written m here, to avoid confusion with the sample size n).
Question: Use the method of moments to find a point estimator of θ = p.
1. Solution: equate the population moment to the sample moment:
E[X] = (1/n)·∑_{j=1}^n x_j = x̄.
2. Equate the population moment to the parameter (using week 2 results):
E[X] = m·p.
3. Solving gives the method of moments estimator:
x̄ = m·p  ⇒  p̂ = x̄/m.
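As a quick numerical check, the MME above can be sketched in a few lines of Python (the data are simulated, and m = 10, p = 0.3 are illustrative choices, not values from the notes):

```python
import random

def mme_binomial_p(sample, m):
    """Method of moments estimate of p for Bin(m, p) with known m:
    equate E[X] = m*p to the sample mean, giving p_hat = xbar / m."""
    xbar = sum(sample) / len(sample)
    return xbar / m

random.seed(1)
m, p = 10, 0.3
# simulate each Bin(m, p) draw as a sum of m Bernoulli(p) trials
sample = [sum(random.random() < p for _ in range(m)) for _ in range(5000)]
p_hat = mme_binomial_p(sample, m)
print(round(p_hat, 3))  # close to the true p = 0.3
```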
Exercise: MME & Normal distribution
Suppose X_1, X_2, …, X_n is a random sample from a N(μ, σ²) distribution.
Question: Use the method of moments to find point estimators of μ and σ².
1. Solution: equate the population moments to the sample moments:
E[X] (population moment) = (1/n)·∑_{j=1}^n x_j = x̄ (sample moment);
E[X²] (population moment) = (1/n)·∑_{j=1}^n x_j² (sample moment).
2. Equate the population moments to the parameters (using week 2 results):
E[X] = μ  and  E[X²] = Var(X) + (E[X])² = σ² + μ².
3. Solving gives the method of moments estimators:
μ̂ = x̄  and  σ̂² = (1/n)·∑_{j=1}^n x_j² − x̄² = (1/n)·∑_{j=1}^n (x_j − x̄)².
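The normal MME can be sketched similarly (simulated data; μ = 5 and σ = 2 are illustrative choices):

```python
import random

def mme_normal(sample):
    """MME for N(mu, sigma^2): mu_hat = xbar,
    sigma2_hat = (mean of x^2) - xbar^2."""
    n = len(sample)
    xbar = sum(sample) / n
    m2 = sum(x * x for x in sample) / n
    return xbar, m2 - xbar ** 2

random.seed(2)
sample = [random.gauss(5.0, 2.0) for _ in range(20000)]
mu_hat, s2_hat = mme_normal(sample)
print(round(mu_hat, 2), round(s2_hat, 2))  # near mu = 5, sigma^2 = 4
```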
Estimator II: maximum likelihood estimator

Maximum Likelihood function
Another example of an estimator (the most commonly used one) is the maximum likelihood estimator.
First, we need to define the likelihood function.
If x_1, x_2, …, x_n are drawn from a population with parameter θ (where θ could be a vector of parameters), then the likelihood function is given by:
L(θ; x_1, x_2, …, x_n) = f_{X_1,X_2,…,X_n}(x_1, x_2, …, x_n|θ),
where f_{X_1,X_2,…,X_n}(x_1, x_2, …, x_n) is the joint probability density of the random variables X_1, X_2, …, X_n.
Maximum Likelihood Estimation
Let L(θ) = L(θ; x_1, x_2, …, x_n) be the likelihood function for X_1, X_2, …, X_n.
The set of parameters θ̂ = θ̂(x_1, x_2, …, x_n) (note: a function of the observed values) that maximizes L(θ) is the maximum likelihood estimate of θ.
The random variable θ̂(X_1, X_2, …, X_n) is called the maximum likelihood estimator.
When X_1, X_2, …, X_n is a random sample from f_X(x|θ), the likelihood function is (using the i.i.d. property):
L(θ; x_1, x_2, …, x_n) = ∏_{j=1}^n f_X(x_j|θ),
which is just the product of the densities evaluated at each of the observations in the random sample.
If the likelihood function contains k parameters, so that:
L(θ_1, θ_2, …, θ_k; x) = f_X(x_1|θ) · f_X(x_2|θ) · … · f_X(x_n|θ),
then (under certain regularity conditions) the point where the likelihood is a maximum is a solution of the k equations:
∂L(θ_1, θ_2, …, θ_k; x)/∂θ_1 = 0,  ∂L(θ; x)/∂θ_2 = 0,  …,  ∂L(θ; x)/∂θ_k = 0.
Normally the solutions to this system of equations give the global maximum, but to be sure you should check the second derivative (or Hessian) conditions and the boundary conditions for a global maximum.
Consider the case of estimating two parameters, say θ_1 and θ_2.
Define the gradient vector:
D(L) = [∂L/∂θ_1, ∂L/∂θ_2]^⊤,
and define the Hessian matrix:
H(L) = [ ∂²L/∂θ_1²,  ∂²L/∂θ_1∂θ_2 ;  ∂²L/∂θ_1∂θ_2,  ∂²L/∂θ_2² ].
From calculus we know that the maximizing choice of θ_1 and θ_2 should satisfy not only:
D(L) = 0,
but also that H should be negative definite, which means:
[h_1, h_2] · H(L) · [h_1, h_2]^⊤ < 0,
for all [h_1, h_2] ≠ 0.
Log-Likelihood function
Generally, maximizing the log-likelihood function is easier.
Not surprisingly, we define the log-likelihood function as:
ℓ(θ; x_1, x_2, …, x_n) = log(L(θ; x_1, x_2, …, x_n)).
Maximizing the log-likelihood function gives the same parameter estimates as maximizing the likelihood function, because the log is a monotonically increasing function.
MLE procedure
The general procedure to find the ML estimator is:
1. Determine the likelihood function L(θ_1, θ_2, …, θ_k; x);
2. Take the logarithm to obtain the log-likelihood function ℓ(θ; x);
3. Take the first order conditions ∂ℓ/∂θ_i = 0, for i = 1, …, k, and solve for the parameter estimates;
4. Check the second order conditions to ensure a maximum.
Example: MLE and Poisson
1. Suppose X_1, X_2, …, X_n are i.i.d. Poisson(λ). The likelihood function is given by:
L(λ; x) = ∏_{j=1}^n f_X(x_j|λ) = (e^{−λ}·λ^{x_1}/x_1!) · (e^{−λ}·λ^{x_2}/x_2!) · … · (e^{−λ}·λ^{x_n}/x_n!)
= e^{−λ·n} · (λ^{x_1}/x_1!) · (λ^{x_2}/x_2!) · … · (λ^{x_n}/x_n!).
2. Taking the log of both sides, we get:
ℓ(λ; x) = −λ·n + log(λ)·∑_{k=1}^n x_k − ∑_{k=1}^n log(x_k!).
Or, equivalently, using the log-likelihood function directly:
ℓ(λ; x) = ∑_{j=1}^n log(f_X(x_j|λ)) = ∑_{j=1}^n (−λ + x_j·log(λ) − log(x_j!)).
Now we need to maximize this log-likelihood function with respect to the parameter λ.
3. Taking the first order condition (FOC) with respect to λ, we have:
∂ℓ(λ)/∂λ = 0  ⇒  −n + (1/λ)·∑_{k=1}^n x_k = 0.
This gives the maximum likelihood estimate (MLE):
λ̂ = (1/n)·∑_{k=1}^n x_k = x̄,
which equals the sample mean.
4. Check the second derivative condition to ensure a global maximum.
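A minimal sketch of the Poisson result (simulated data via inverse-transform sampling; λ = 2.5 is an illustrative choice): it checks both that λ̂ = x̄ and that x̄ beats nearby candidate values of λ in log-likelihood:

```python
import random, math

def poisson_loglik(lam, xs):
    """l(lambda; x) = -n*lambda + log(lambda)*sum(x) - sum(log(x!))."""
    n = len(xs)
    return -lam * n + math.log(lam) * sum(xs) - sum(math.lgamma(x + 1) for x in xs)

def sample_poisson(lam, rng):
    # inverse-transform sampling for Poisson(lam)
    u, k, p, cum = rng.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > cum:
        k += 1
        p *= lam / k
        cum += p
    return k

rng = random.Random(3)
xs = [sample_poisson(2.5, rng) for _ in range(4000)]
lam_hat = sum(xs) / len(xs)          # MLE = sample mean
# the sample mean beats nearby candidate values of lambda
best = max([lam_hat - 0.1, lam_hat, lam_hat + 0.1],
           key=lambda l: poisson_loglik(l, xs))
print(round(lam_hat, 2), best == lam_hat)
```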
Exercise: MLE and Normal
Suppose X_1, X_2, …, X_n are i.i.d. Normal(μ, σ²), where both parameters are unknown.
The p.d.f. is given by:
f_X(x) = (1/(√(2π)·σ))·exp(−(1/2)·((x−μ)/σ)²).
1. Thus the likelihood function is given by:
L(μ, σ; x) = ∏_{k=1}^n (1/(√(2π)·σ))·exp(−(1/2)·((x_k−μ)/σ)²).
Question: Find the MLE of μ and σ².
2. Solution: the log-likelihood function is:
ℓ(μ, σ; x) = ∑_{k=1}^n log((1/(√(2π)·σ))·exp(−(1/2)·((x_k−μ)/σ)²))
(*) = −n·log(σ) − (n/2)·log(2π) − (1/(2σ²))·∑_{k=1}^n (x_k − μ)²,
(*) using log(1/a) = log(a^{−1}) = −log(a), with a = σ, and log(1/√b) = log(b^{−0.5}) = −0.5·log(b), with b = 2π.
Take the derivatives w.r.t. μ and σ and set them equal to zero.
3./4. Then we obtain:
∂ℓ(μ, σ; x)/∂μ = (1/σ²)·∑_{k=1}^n (x_k − μ) = 0
⇒ ∑_{k=1}^n x_k − n·μ = 0
⇒ μ̂ = x̄;
∂ℓ(μ, σ; x)/∂σ = −n/σ + (∑_{k=1}^n (x_k − μ)²)/σ³ = 0
⇒ n = (∑_{k=1}^n (x_k − μ)²)/σ²
⇒ σ̂² = (1/n)·∑_{k=1}^n (x_k − x̄)².
See §9.7 and §9.8 of W+ (7th ed.) for further details.
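The closed-form normal MLE can be sanity-checked against small perturbations of the log-likelihood (simulated data; μ = 1 and σ = 3 are illustrative choices):

```python
import random, math

def normal_loglik(mu, sigma, xs):
    # l(mu, sigma; x) = -n*log(sigma) - (n/2)*log(2*pi) - sum((x-mu)^2)/(2*sigma^2)
    n = len(xs)
    return (-n * math.log(sigma) - n / 2 * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

random.seed(4)
xs = [random.gauss(1.0, 3.0) for _ in range(10000)]
n = len(xs)
mu_hat = sum(xs) / n
s2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # note: divisor n, not n - 1
# the closed-form MLE beats small perturbations in either parameter
base = normal_loglik(mu_hat, math.sqrt(s2_hat), xs)
perturbed = max(normal_loglik(mu_hat + d, math.sqrt(s2_hat) + e, xs)
                for d in (-0.05, 0.0, 0.05) for e in (-0.05, 0.0, 0.05)
                if (d, e) != (0.0, 0.0))
print(base > perturbed)  # True: (mu_hat, s2_hat) is the maximizer
```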
Example: MME & MLE and Gamma
You may not always obtain closed-form solutions for the parameter estimates with the maximum likelihood method.
An example of such a problem is estimating the parameters of the Gamma distribution using MLE.
As we will see on the next slides, MLE yields a closed-form solution for one parameter estimate, but not for the second.
To find the MLE one must then compute the estimates numerically by solving a non-linear equation, which can be done with an iterative numerical approximation (e.g. Newton-Raphson).
Application: surrendering mortgages, see Excel.
In such cases an initial value may be needed, so another means of estimating may be used first, such as the method of moments; that estimate is then used as the starting value.
Question: Consider X_1, X_2, …, X_n i.i.d. Gamma(λ, α); find the MME of the Gamma distribution. Recall:
f_X(x) = (λ^α/Γ(α))·x^{α−1}·e^{−λ·x};  E[X^r] = Γ(α+r)/(λ^r·Γ(α));
M_X(t) = E[e^{tX}] = (λ/(λ−t))^α;  Var(X) = α/λ².
1. Solution: equate sample moments to population moments:
μ_1 = M_X^{(1)}(t)|_{t=0} = E[X] = x̄  and  μ_2 = M_X^{(2)}(t)|_{t=0} = E[X²] = (1/n)·∑_{i=1}^n x_i².
2. Equate the population moments to the parameters:
μ_1 = α/λ  and  μ_2 = α·(α+1)/λ² = (α/λ)·((α+1)/λ) = μ_1·(μ_1 + 1/λ).
3. Therefore, the method of moments estimates are given by:
μ_2/μ_1 = μ_1 + 1/λ̂  ⇒  λ̂ = μ_1/(μ_2 − μ_1²);
α̂ = μ_1·λ̂  ⇒  α̂ = μ_1²/(μ_2 − μ_1²).
So the estimators are:
λ̂ = x̄/σ̂²  and  α̂ = x̄²/σ̂²,
using (step 1) μ_1 = x̄ and μ_2 = (1/n)·∑_{i=1}^n x_i², so that μ_2 − μ_1² = (1/n)·∑_{i=1}^n x_i² − x̄² = σ̂².
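A sketch of the Gamma MME (simulated data via Python's `random.gammavariate`, which uses the shape/scale parametrization, so scale = 1/λ; α = 2 and λ = 0.5 are illustrative choices):

```python
import random

def mme_gamma(xs):
    """MME for Gamma(lambda, alpha) with density (lambda^a/Gamma(a)) x^{a-1} e^{-lambda x}:
    lambda_hat = xbar / s2, alpha_hat = xbar^2 / s2, with s2 the biased sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - xbar ** 2
    return xbar / s2, xbar ** 2 / s2

random.seed(5)
alpha, lam = 2.0, 0.5
xs = [random.gammavariate(alpha, 1 / lam) for _ in range(20000)]
lam_hat, alpha_hat = mme_gamma(xs)
print(round(lam_hat, 2), round(alpha_hat, 2))  # near lambda = 0.5, alpha = 2
```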
Question: Find the ML estimates.
1. Solution: X_1, X_2, …, X_n are i.i.d. Gamma(λ, α), so the likelihood function is:
L(λ, α; x) = ∏_{i=1}^n (1/Γ(α))·λ^α·x_i^{α−1}·e^{−λ·x_i}.
2. The log-likelihood function is then:
ℓ(λ, α; x) = −n·log(Γ(α)) + n·α·log(λ) + (α−1)·∑_{i=1}^n log(x_i) − λ·∑_{i=1}^n x_i.
3. Maximizing this:
∂ℓ(λ, α; x)/∂α = −n·(∂Γ(α)/∂α)/Γ(α) + n·log(λ) + ∑_{i=1}^n log(x_i) = 0;
∂ℓ(λ, α; x)/∂λ = n·α/λ − ∑_{i=1}^n x_i = 0.
The second equation is easy to solve:
λ̂ = n·α̂/∑_{i=1}^n x_i,
but numerical (iterative) techniques are needed to solve the first equation.
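The Newton-Raphson step described above can be sketched as follows. Substituting λ = α/x̄ from the second FOC into the first gives log(α) − ψ(α) = log(x̄) − (1/n)·∑ log(x_i), which is solved iteratively for α, started at the MME value. The digamma function ψ is approximated here by a finite difference of `math.lgamma` to stay stdlib-only; a production version would use `scipy.special.digamma`:

```python
import random, math

def digamma(a, h=1e-5):
    # numerical derivative of log Gamma (adequate for this sketch)
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def gamma_mle(xs, tol=1e-8):
    """MLE for Gamma(lambda, alpha): lambda_hat = alpha_hat / xbar from the
    second FOC; alpha solves log(alpha) - psi(alpha) = log(xbar) - mean(log x),
    found here by Newton-Raphson started at the MME value."""
    n = len(xs)
    xbar = sum(xs) / n
    c = math.log(xbar) - sum(math.log(x) for x in xs) / n
    s2 = sum(x * x for x in xs) / n - xbar ** 2
    a = xbar ** 2 / s2                      # MME starting value
    for _ in range(100):
        g = math.log(a) - digamma(a) - c
        gprime = 1 / a - (digamma(a + 1e-4) - digamma(a - 1e-4)) / 2e-4
        step = g / gprime
        a -= step
        if abs(step) < tol:
            break
    return a / xbar, a                      # (lambda_hat, alpha_hat)

random.seed(6)
xs = [random.gammavariate(3.0, 2.0) for _ in range(20000)]  # alpha = 3, lambda = 0.5
lam_hat, alpha_hat = gamma_mle(xs)
print(round(lam_hat, 2), round(alpha_hat, 2))
```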
Example: MLE and Uniform
Suppose X_1, X_2, …, X_n are i.i.d. U[0, θ], i.e., f_X(x) = 1/θ for 0 ≤ x ≤ θ, and zero otherwise. Here the range of x depends on the parameter θ.
The likelihood function can be expressed as:
L(θ; x) = (1/θ)^n · ∏_{k=1}^n I_{0≤x_k≤θ},
where I_{0≤x_k≤θ} is an indicator function taking the value 1 if x_k ∈ [0, θ] and zero otherwise.
Question: How do we find the maximum of this likelihood function?
[Figure: the likelihood L(θ; x) as a function of θ for a sample with order statistics x_(1), …, x_(n): it is zero for θ < x_(n), jumps up at θ = x_(n), and decreases thereafter.]
Solution: the non-linearity of the indicator function means we cannot use calculus to maximize this function, i.e., we cannot set the FOC equal to zero.
You can maximize it by looking at its properties:
- ∏_{k=1}^n I_{0≤x_k≤θ} can only take the values 0 and 1; note that it takes the value 0 if θ < x_(n) and 1 otherwise;
- (1/θ)^n is a decreasing function in θ;
- hence the function is maximized at the lowest value of θ for which ∏_{k=1}^n I_{0≤x_k≤θ} = 1, i.e.:
θ̂ = max{x_1, x_2, …, x_n} = x_(n).
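A two-line check of the uniform result (simulated data; θ = 4.2 is an illustrative choice). Note that x_(n) always sits just below the true θ, so this MLE is biased downward:

```python
import random

random.seed(7)
theta = 4.2
xs = [random.uniform(0, theta) for _ in range(1000)]
theta_hat = max(xs)   # MLE for U[0, theta] is the sample maximum x_(n)
print(round(theta_hat, 3))  # slightly below the true theta = 4.2
```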
Sampling distribution and the bootstrap
We might be interested not only in the point estimate, but in the whole distribution of the MLE (parameter uncertainty!).
However, we have no closed-form solution for the MLE estimates. How can we obtain their sampling distribution? Use bootstrapping:
Step 1: Generate k samples from Gamma(λ̂, α̂).
Step 2: Estimate λ and α for each of these k samples using MLE.
Step 3: The empirical joint cumulative distribution function of these k parameter estimates is an approximation to the sampling distribution of the MLE estimates.
Quantification of risk: produce histograms of the estimates.
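The three bootstrap steps can be sketched as follows. To stay self-contained, the re-estimation in step 2 uses the MME in place of the MLE (the bootstrap logic is identical); the fitted values λ̂ = 0.5, α̂ = 2 and the sizes n, k are illustrative:

```python
import random

def mme_gamma(xs):
    # method of moments: lambda_hat = xbar/s2, alpha_hat = xbar^2/s2
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - xbar ** 2
    return xbar / s2, xbar ** 2 / s2

random.seed(8)
lam_fit, alpha_fit = 0.5, 2.0      # fitted parameters (assumed, for illustration)
n, k = 500, 250                    # sample size and number of bootstrap samples
boot = []
for _ in range(k):
    # Step 1: generate a sample from Gamma(lam_fit, alpha_fit)
    xs = [random.gammavariate(alpha_fit, 1 / lam_fit) for _ in range(n)]
    # Step 2: re-estimate the parameters from the bootstrap sample
    boot.append(mme_gamma(xs))
# Step 3: the empirical distribution of the k estimates approximates the sampling distribution
lam_hats = sorted(b[0] for b in boot)
print(round(lam_hats[12], 3), round(lam_hats[-13], 3))  # ~90% interval for lambda_hat
```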
Sampling distribution and bootstrap, k = 250, see Excel.
[Figure: empirical approximations of the sampling distributions — the empirical c.d.f. F_α̂(α) of the α̂ estimates (left) and F_λ̂(λ) of the λ̂ estimates (right), shown for five repetitions of the bootstrap.]
Estimator III: Bayesian estimator

Introduction
We have seen:
- The method of moments estimator. Idea: the first k moments of the fitted distribution and of the sample are the same.
- The maximum likelihood estimator. Idea: the probability of the sample, given the assumed class of distributions, is highest for this set of parameters.
Warning: Bayesian estimation can be hard to understand, partly due to the non-standard notation used for Bayesian estimates.
Pure Bayesian interpretation: suppose you have, a priori, a belief about a distribution (the prior);
then you observe data ⇒ more information about the distribution.
Example, frequentist interpretation: let X_i ∼ Ber(θ) indicate whether individual i lodges a claim with the insurer:
- ∑_{i=1}^T X_i = Y ∼ Bin(T, θ) is the number of car accidents;
- the probability of an insured having a car accident depends on adverse selection;
- a new insurer does not know the amount of adverse selection in its pool;
- now let θ ∈ Θ, with Θ ∼ Beta(a, b), be the distribution of the risk among individuals (i.e., representing adverse selection);
- use this for estimating the parameter ⇒ what is our prior for θ?
This is called empirical Bayes.
Similar idea: Bayesian updating, in the case of time-varying parameters:
- prior: last year's estimated claim distribution;
- data: this year's claims;
- posterior: revised estimated claim distribution.
Bayesian estimation

Notation for Bayesian estimation
Under this approach, we assume that Θ is a random quantity with density π(θ), called the prior density. (This is the usual notation, rather than f_Θ(θ).)
A sample X = x (= [x_1, x_2, …, x_T]^⊤) is taken from the population, and the prior density is updated using the information drawn from this sample by applying Bayes' rule. This updated prior is called the posterior density: the conditional density of Θ given the sample X = x, written π(θ|x) (= f_{Θ|X}(θ|x)).
So we are using a conditional r.v., Θ|X, associated with the multivariate distribution of Θ and X (look back at the lecture notes for week 3).
Use, for example, the posterior mean E[Θ|X = x] as the Bayesian estimator.
Bayesian estimation, theory
First, let us define a loss function L(θ̂; θ) for an estimator T of τ(θ), with:
L(θ̂; θ) ≥ 0, for every θ̂;
L(θ̂; θ) = 0, when θ̂ = θ.
Interpretation of the loss function: for reasonable loss functions, a lower value of the loss function ⇒ a better estimator.
Example Bayesian estimation: Bernoulli-Beta
Let X_1, X_2, …, X_T be i.i.d. Bernoulli(Θ), i.e., (X_i|Θ = θ) ∼ Bernoulli(θ).
Assume the prior density of Θ is Beta(a, b), so that:
π(θ) = (Γ(a+b)/(Γ(a)·Γ(b))) · θ^{a−1} · (1−θ)^{b−1}.
We know that the conditional density of our data (the density conditional on the true value of θ) is f_{X|Θ}(x|θ) = θ^s·(1−θ)^{T−s}, where s = ∑_{i=1}^T x_i.
1. Combining prior and likelihood via Bayes' rule, the posterior density is Beta(a+s, b+T−s):
π(θ|x) = (Γ(a+b+T)/(Γ(a+s)·Γ(b+T−s))) · θ^{(a+s)−1} · (1−θ)^{(b+T−s)−1}.
2. The mean of the r.v. with the above posterior density is then:
θ̂_B = E[Θ|X = x] = E[Ξ], with Ξ ∼ Beta(a+s, b+T−s), i.e., θ̂_B = (a+s)/(a+b+T),
which gives the Bayesian estimator of Θ.
Note that we can write the Bayesian estimator as a weighted average of the prior mean (which is a/(a+b)) and the sample mean (which is s/T) as follows:
θ̂_B = E[Θ|X = x] = (T/(a+b+T)) · (s/T) + ((a+b)/(a+b+T)) · (a/(a+b)),
where T/(a+b+T) is the weight on the sample mean s/T, and (a+b)/(a+b+T) is the weight on the prior mean a/(a+b).
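A sketch of the Bernoulli-Beta update (simulated data; a = 2, b = 3, θ = 0.7 are illustrative choices), verifying that the posterior mean equals the weighted average of prior mean and sample mean:

```python
import random

random.seed(9)
a, b = 2.0, 3.0                 # Beta(a, b) prior
theta_true, T = 0.7, 200
xs = [1 if random.random() < theta_true else 0 for _ in range(T)]
s = sum(xs)
# posterior is Beta(a + s, b + T - s); its mean is the Bayesian estimator
theta_bayes = (a + s) / (a + b + T)
# equivalently: weighted average of prior mean and sample mean
w_sample = T / (a + b + T)
theta_check = w_sample * (s / T) + (1 - w_sample) * (a / (a + b))
print(round(theta_bayes, 4), abs(theta_bayes - theta_check) < 1e-12)
```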
Exercise: Normal-Normal
Let X_1, X_2, …, X_T be i.i.d. Normal(Θ, σ_2²), i.e., (X_i|Θ = θ) ∼ Normal(θ, σ_2²).
Assume the prior density of Θ is Normal(m, σ_1²), so that:
π(θ) = (1/(√(2π)·σ_1))·exp(−(θ−m)²/(2·σ_1²)).
Question: Find the Bayesian estimator for θ.
Solution: We know that the conditional density of our data is given by the likelihood function:
f_{X|Θ}(x|θ) = ∏_{j=1}^T (1/(√(2π)·σ_2))·exp(−(x_j−θ)²/(2·σ_2²))
= (1/(√(2π)·σ_2)^T)·exp(−∑_{j=1}^T (x_j−θ)²/(2·σ_2²)).
1. Posterior density:
π(θ|x) ∝ f_{X|Θ}(x|θ)·π(θ) ∝ exp(−∑_{j=1}^T (x_j−θ)²/(2·σ_2²)) · exp(−(θ−m)²/(2·σ_1²))
= exp(−∑_{j=1}^T (x_j−θ)²/(2·σ_2²) − (θ−m)²/(2·σ_1²))
= exp(−∑_{j=1}^T (x_j² + θ² − 2·θ·x_j)/(2·σ_2²) − (θ² + m² − 2·θ·m)/(2·σ_1²))
= exp(−[σ_2²·(θ² + m² − 2·θ·m) + σ_1²·∑_{j=1}^T (x_j² + θ² − 2·θ·x_j)]/(2·σ_2²·σ_1²))
(*) ∝ exp(−[θ²·(σ_2² + T·σ_1²) − 2·θ·(m·σ_2² + T·x̄·σ_1²)]/(2·σ_2²·σ_1²))
= exp(−[θ² − 2·θ·(m·σ_2² + T·x̄·σ_1²)/(σ_2² + T·σ_1²)]/(2·σ_2²·σ_1²/(σ_2² + T·σ_1²)))
(**) ∝ exp(−(θ − (m·σ_2² + T·x̄·σ_1²)/(σ_2² + T·σ_1²))²/(2·σ_2²·σ_1²/(σ_2² + T·σ_1²))).
(*): the factor exp(−(σ_2²·m² + σ_1²·∑_{j=1}^T x_j²)/(2·σ_2²·σ_1²)) and (**): the factor exp(((m·σ_2² + T·x̄·σ_1²)/(σ_2² + T·σ_1²))²/(2·σ_2²·σ_1²/(σ_2² + T·σ_1²))) are constants given x.
1. Thus θ|X is normally distributed with mean (m·σ_2² + T·x̄·σ_1²)/(σ_2² + T·σ_1²) and variance σ_2²·σ_1²/(σ_2² + T·σ_1²). Note that we can rewrite this as:
mean: ((1/σ_1²)/(1/σ_1² + T/σ_2²))·m + ((T/σ_2²)/(1/σ_1² + T/σ_2²))·x̄,  and variance: (1/σ_1² + T/σ_2²)^{−1}.
2. The Bayesian estimator under both the mean squared error loss function and the absolute error loss function is:
θ̂_B = ((1/σ_1²)/(1/σ_1² + T/σ_2²))·m + ((T/σ_2²)/(1/σ_1² + T/σ_2²))·x̄.
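A sketch of the Normal-Normal update in its precision-weight form (simulated data; all parameter values are illustrative choices):

```python
import random

random.seed(10)
m, sigma1 = 0.0, 2.0        # prior Normal(m, sigma1^2)
theta, sigma2 = 1.5, 3.0    # data Normal(theta, sigma2^2)
T = 400
xs = [random.gauss(theta, sigma2) for _ in range(T)]
xbar = sum(xs) / T
# precision-weighted posterior mean and variance
w_prior = (1 / sigma1 ** 2) / (1 / sigma1 ** 2 + T / sigma2 ** 2)
post_mean = w_prior * m + (1 - w_prior) * xbar
post_var = 1 / (1 / sigma1 ** 2 + T / sigma2 ** 2)
print(round(post_mean, 3), round(post_var, 5))
```

With a lot of data (T large), the weight on the prior shrinks toward zero and the posterior mean approaches the sample mean.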
Convergence of series

Chebyshev's Inequality
Chebyshev’s InequalityThe Chebyshev’s inequality, states that for any randomvariable X with mean µ and variance σ2, the followingprobability inequality holds for all ε > 0:
Pr (|X − µ| > ε) ≤ σ2
ε2.
Note that this applies to all distributions, hence alsonon-symmetric ones! This implies that:
Pr (X − µ > ε) ≤ σ2
ε2≥ Pr (X − µ < −ε) .
Interesting example: set ε = k · σ then:
Pr (|X − µ| > k · σ) ≤ 1
k2.
This provides us with an upper bound of the probability thatX deviates more than k standard deviations of its mean.1049/1074
Application: Chebyshev’s Inequality
The distribution of fire insurance claims does not have aspecial distribution.
We do know that the mean claim size in the portfolio is $50million with a standard deviation of $150 million.
Question: What is an upper bound for the probability thatthe claim size is larger than $500 million?
Solution: We have:
Pr (X − µ > k · σ) ≤Pr (|X − µ| > k · σ)
= Pr (|X − 50| > k · 150)
≤ 1
k2=
1
9.
Thus, Pr (X > 500) ≤ 1/9.1050/1074
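The bound can be checked empirically for a non-symmetric distribution (an Exponential(1), so μ = σ = 1; the simulation set-up is illustrative): the observed tail frequency sits well below 1/k²:

```python
import random

random.seed(11)
# an asymmetric distribution: exponential with mean 1 (so sigma = 1)
n = 200000
xs = [random.expovariate(1.0) for _ in range(n)]
mu, sigma, k = 1.0, 1.0, 3.0
freq = sum(abs(x - mu) > k * sigma for x in xs) / n
print(round(freq, 4), freq <= 1 / k ** 2)  # empirical tail vs Chebyshev bound 1/9
```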
Convergence concepts
Suppose X_1, X_2, … form a sequence of r.v.'s. Example: X_i is the sample variance using the first i observations.
X_n is said to converge almost surely (a.s.) to the random variable X as n → ∞ if and only if:
Pr(ω : X_n(ω) → X(ω), as n → ∞) = 1,
and we write X_n →^{a.s.} X, as n → ∞.
This is sometimes called strong convergence. It means that beyond some point in the sequence the difference will always be less than some positive ε, but that point depends on the outcome ω and is therefore random.
Application of strong convergence: Law of Large Numbers
The Law of Large Numbers
Suppose X_1, X_2, …, X_n are independent random variables with common mean E[X_k] = μ and common variance Var(X_k) = σ², for k = 1, 2, …, n. Define the sequence of sample means as:
X̄_n = (1/n)·∑_{k=1}^n X_k.
Then, according to the law of large numbers, for any ε > 0, we have:
lim_{n→∞} Pr(|X̄_n − μ| > ε) ≤ lim_{n→∞} Var(X̄_n)/ε² = lim_{n→∞} σ²/(n·ε²) = 0.
Proof, special case X_k ∼ N(μ, σ²): then X̄_n − μ ∼ N(0, σ²/n), and as n → ∞ we have lim_{n→∞} σ²/n = 0.
General case: whenever the second moment exists, use Chebyshev's inequality with Var(X̄_n) = σ²/n → 0.
The law of large numbers (LLN) is sometimes written as:
Pr(|X̄_n − μ| > ε) → 0, as n → ∞.
The result above is sometimes called the weak law of large numbers, and we sometimes write X̄_n →^p μ, because this is the same concept as convergence in probability to a constant.
However, there is also what we call the strong law of large numbers, which states that the sample mean converges almost surely to μ:
X̄_n →^{a.s.} μ, as n → ∞.
This is an important result in probability and statistics!
Intuitively, the law of large numbers states that the sample mean X̄_n converges to the true value μ.
How accurate the estimate is will depend on: I) how large the sample size is; II) the variance σ².
Application of LLN: Monte Carlo Integration
Suppose we wish to calculate:
I(g) = ∫_0^1 g(x) dx,
where elementary techniques of integration will not work.
Using the Monte Carlo method, we generate U[0, 1] variables, say X_1, X_2, …, X_n, and compute:
Î_n(g) = (1/n)·∑_{k=1}^n g(X_k),
where Î_n(g) denotes the approximation of I(g). We have: Î_n(g) →^{a.s.} I(g), as n → ∞.
Proof: next slide.
Proof: using the law of large numbers, we have Î_n(g) = (1/n)·∑_{k=1}^n g(X_k) →^{a.s.} E[g(X)], which is:
E[g(X)] = ∫_0^1 g(x)·1 dx = ∫_0^1 g(x) dx = I(g).
Try this in Excel using the integral of the standard normal density. How good is your approximation for 100 (1,000, 10,000, 100,000 and 1,000,000) random numbers?
This method is called Monte Carlo integration.
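A sketch of the suggested experiment (in Python rather than Excel): Monte Carlo integration of the standard normal density over [0, 1], whose true value is Φ(1) − Φ(0) ≈ 0.3413:

```python
import random, math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

random.seed(12)
n = 200000
approx = sum(phi(random.random()) for _ in range(n)) / n
print(round(approx, 3))  # close to Phi(1) - Phi(0) = 0.3413
```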
Application of LLN: Pooling of Risks in Insurance
Individuals may be faced with large and unpredictable losses. Insurance may help reduce the financial consequences of such losses by pooling individual risks. This is based on the LLN.
Suppose X_1, X_2, …, X_n are the amounts of losses faced by n different individuals who are homogeneous enough to have a common distribution, and these individuals pool together and each agrees to pay:
X̄_n = (1/n)·∑_{k=1}^n X_k.
Then the LLN tells us that the amount each person will end up paying becomes more predictable as the size of the group increases. In effect, this amount will become closer to μ, the average loss each individual expects.
Application of weak convergence: Central Limit Theorem
Central Limit Theorem
Suppose X_1, X_2, …, X_n are independent, identically distributed random variables with finite mean μ and finite variance σ². As before, denote the sample mean by X̄_n.
Then the central limit theorem states:
(X̄_n − μ)/(σ/√n) →^d N(0, 1), as n → ∞.
This holds for all r.v.'s with finite mean and variance, not only normal r.v.'s!
Proof & rewriting of the CLT: see the next slides.
Rewriting the Central Limit Theorem
We can write this result as:
lim_{n→∞} Pr((X̄_n − μ)/(σ/√n) ≤ x) = Φ(x),
for all x, where Φ(·) denotes the c.d.f. of a standard normal r.v.
Intuitively, for large n, the random variable:
Z_n = (X̄_n − μ)/(σ/√n)
is approximately standard normally distributed.
The central limit theorem is usually expressed in terms of the standardized sums S_n = ∑_{k=1}^n X_k. Then the CLT applies to the random variable:
Z_n = (S_n − n·μ)/(√n·σ) →^d N(0, 1), as n → ∞.
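A sketch of the CLT in its standardized-sum form (Exponential(1) summands, so μ = σ = 1; n and the number of replications are illustrative choices): the standardized sums have mean ≈ 0, standard deviation ≈ 1, and roughly symmetric tails even though the summands are heavily skewed:

```python
import random, statistics

random.seed(13)
n, reps = 50, 5000
# summands are Exponential(1): mu = 1, sigma = 1 -- far from normal
zs = [(sum(random.expovariate(1.0) for _ in range(n)) - n * 1.0) / (n ** 0.5 * 1.0)
      for _ in range(reps)]
print(round(statistics.mean(zs), 2), round(statistics.stdev(zs), 2))  # near 0 and 1
frac_below_0 = sum(z <= 0 for z in zs) / reps
print(round(frac_below_0, 2))  # near Phi(0) = 0.5 (slightly above, from the skew at n = 50)
```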
Proof of the Central Limit Theorem
Let X_1, X_2, … be a sequence of independent r.v.'s with mean μ and variance σ², and denote S_n = ∑_{i=1}^n X_i. Prove that:
Z_n = (S_n − n·μ)/(σ·√n)
converges to the standard normal distribution.
General procedure to prove X_n →^d X:
1. Find the m.g.f. of X: M_X(t);
2. Find the m.g.f. of X_n: M_{X_n}(t);
3. Take the limit n → ∞ of the m.g.f. of X_n, lim_{n→∞} M_{X_n}(t), and rewrite it; this should equal M_X(t).
Note: the expansions for log and exp are useful here (see F&T page 2)!
1. Proof: consider the case with μ = 0 and assume the m.g.f. of X_i exists. For Z ∼ N(0, 1) we have: M_Z(t) = exp(t²/2).
2. Recall Sn = Σ_{i=1}^n Xi. The m.g.f. of Zn = Sn/(σ·√n) = Σ_{i=1}^n Xi/(σ·√n) is obtained by:

MZn(t) =* MSn( t/(σ·√n) ) =** ( MXi( t/(σ·√n) ) )^n

* using Ma·X(t) = MX(a·t); ** using that Sn is the sum of n i.i.d. random variables Xi, thus M_{Σ_{i=1}^n Xi}(t) = (MXi(t))^n.

Note that we only assumed that:
MXi(t) = f(t, σ²);
E[Xi] = µ;
Var(Xi) = σ² < ∞,

hence this holds for any distribution of Xi with mean µ and finite variance!
Note: lim_{n→∞} b·n^{−c} = 0, for b ∈ R and c > 0.

Recall from week 1: 1) An m.g.f. uniquely defines a distribution; 2) The m.g.f. is a function of all moments.

Consider the Taylor series around zero for any M(t):

M(t) = Σ_{i=0}^∞ (t^i / i!) · M^(i)(t)|_{t=0}      (where M^(i)(t)|_{t=0} is the i-th moment)
     = M(0) + t·M^(1)(t)|_{t=0} + (1/2)·t²·M^(2)(t)|_{t=0} + O(t³),

where O(t³) covers all terms ck·t^k, with ck ∈ R for k ≥ 3.

We have M(0) = E[e^{0·X}] = 1 and, because we assumed that E[Xi] = 0:

M^(1)_{Xi}(t)|_{t=0} = E[Xi] = 0, and M^(2)_{Xi}(t)|_{t=0} = E[Xi²] = Var(Xi) + (E[Xi])² = σ².
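To make these moment identities concrete, a small numerical sketch (my addition): for a fair ±1 coin flip the m.g.f. is M(t) = cosh(t), and central differences recover M(0) = 1, M^(1)(0) = E[X] = 0, and M^(2)(0) = E[X²] = σ² = 1.

```python
import math

def M(t):
    """m.g.f. of X with Pr(X = 1) = Pr(X = -1) = 1/2: E[e^{tX}] = cosh(t)."""
    return math.cosh(t)

h = 1e-4
M0   = M(0.0)                                 # M(0) = E[e^0] = 1
M1_0 = (M(h) - M(-h)) / (2 * h)               # central difference ~ M'(0) = E[X] = 0
M2_0 = (M(h) - 2 * M(0.0) + M(-h)) / h**2     # ~ M''(0) = E[X^2] = sigma^2 = 1

print(M0, M1_0, M2_0)
```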
3. Proof continues on the next slide.
Now we can combine the results from the previous two slides:

lim_{n→∞} MZn(t) = lim_{n→∞} ( MXi( t/(σ·√n) ) )^n
                 = lim_{n→∞} ( Σ_{i=0}^∞ (t/(σ·√n))^i / i! · M^(i)_{Xi}(t)|_{t=0} )^n
                 = lim_{n→∞} ( 1 + 0 + (1/2)·(t/(σ·√n))²·σ² + O( (1/n)^{3/2} ) )^n

⇒ lim_{n→∞} log(MZn(t)) = lim_{n→∞} n · log( 1 + t²/(2n) + O( (1/n)^{3/2} ) )
                         =* lim_{n→∞} n · ( t²/(2n) + O( (1/n)^{3/2} ) )
                         = t²/2,

since n · ( O((1/n)^{3/2}) + O((1/n)²) ) = O((1/n)^{1/2}) → 0 as n → ∞. Hence lim_{n→∞} MZn(t) = exp(t²/2) = MZ(t), the m.g.f. of a standard normal.

* using log(1 + a) = Σ_{i=1}^∞ (−1)^{i+1}·a^i / i = a + O(a²), with a = t²/(2n) + O((1/n)^{3/2}).
Application of the CLT: An insurer offers builder's risk insurance. It has 400 contracts per year and has offered the product for 9 years. The sample mean of a claim is $10 million and the sample standard deviation is $25 million.

Question: What is the probability that in a year the total claim size is larger than $5 billion?

Solution: Using the CLT (why is σ ≈ the sample s.d.?):

(X̄n − µ) / (σ/√n) →d N(0, 1), as n → ∞
⇒ X̄n ∼ N( µ, (σ/√n)² )   (approximately)
⇒ n·X̄n ∼ N( n·µ, n·σ² )
⇒ 0.9772 = Pr( 400·X̄400 ≤ 400·10 million + 2·20·25 million ).

Thus, Pr( 400·X̄400 > $5 billion ) = 1 − 0.9772 = 0.0228.
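The arithmetic above can be reproduced directly (a sketch of the slide's numbers; µ and σ are the sample estimates from the 9 years of data):

```python
import math

mu, sigma, n = 10e6, 25e6, 400        # sample mean/s.d. per claim; contracts per year
total_mean = n * mu                    # 400 * $10m = $4 billion
total_sd = math.sqrt(n) * sigma        # 20 * $25m  = $500 million
z = (5e9 - total_mean) / total_sd      # standardized threshold: ($5b - $4b)/$0.5b = 2

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_exceed = 1 - phi(z)
print(f"z = {z:.2f}, Pr(total > $5 billion) = {p_exceed:.4f}")
```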
Application of convergence in distribution: Normal Approximation to the Binomial
Normal Approximation to the Binomial

From week 2 we know: a Binomial random variable is the sum of Bernoulli random variables. Let Xk ∼ Bernoulli(p). Then:

S = X1 + X2 + . . . + Xn

has a Binomial(n, p) distribution.

Applying the Central Limit Theorem, S must be approximately normal with mean E[S] = n·p and variance Var(S) = n·p·q, so that approximately for large n we have:

(S − n·p) / √(n·p·q) ∼ N(0, 1).

Question: What is the probability that X = 60 if X ∼ Bin(1000, 0.06)? Not in the Binomial tables!
In practice, for large n and for p around 0.5 (in particular np > 5 and np(1 − p) > 5, or n > 30), we can approximate the binomial probabilities with the Normal distribution.

Use µ = n·p and σ² = n·p·(1 − p).

Continuity correction for the binomial: note that a Binomial random variable X takes integer values k = 0, 1, 2, . . ., but the Normal distribution is continuous, so for the value:

Pr(X = k),

we require the Normal approximation:

Pr( ((k − 1/2) − µ)/σ < Z < ((k + 1/2) − µ)/σ ),

and similarly for the probability Pr(X ≤ k).
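Applying this to the earlier question Pr(X = 60) for X ∼ Bin(1000, 0.06), a sketch comparing the exact p.m.f. (out of range of standard tables) with the continuity-corrected normal value:

```python
import math

n, p, k = 1000, 0.06, 60
mu = n * p                            # 60
sigma = math.sqrt(n * p * (1 - p))    # sqrt(56.4) ~ 7.51

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Exact binomial probability Pr(X = 60):
exact = math.comb(n, k) * p**k * (1 - p)**(n - k)

# Continuity-corrected normal approximation Pr(k - 1/2 < S < k + 1/2):
approx = phi((k + 0.5 - mu) / sigma) - phi((k - 0.5 - mu) / sigma)

print(f"exact {exact:.5f} vs normal approx {approx:.5f}")
```

Both values are about 0.053; the correction matters because without it the normal approximation of a single point mass would be zero.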
Figure (normal approximation to the Binomial): four panels compare the Binomial(n, 0.1) p.m.f. with the approximating normal p.d.f.: n = 5 with N(0.5, 0.45); n = 10 with N(1, 0.9); n = 30 with N(3, 2.7); n = 200 with N(20, 18). The approximation improves as n grows.
Application of convergence in distribution: Normal Approximation to the Poisson
Normal approximation to the Poisson

Approximation of the Poisson by the Normal for large values of λ.

Let Xn be a sequence of Poisson random variables with increasing parameters λ1, λ2, . . . such that λn → ∞.

We have:
E[Xn] = λn
Var(Xn) = λn

Standardize the random variable (i.e., subtract the mean and divide by the standard deviation):

Zn = (Xn − E[Xn]) / √Var(Xn) = (Xn − λn) / √λn →d Z ∼ N(0, 1).

Proof: see next slides.
1. We have the m.g.f. of Z: MZ(t) = exp(t²/2).

2. Next, we need to find the m.g.f. of Zn. We know (week 2):

MXn(t) = exp( λn·(e^t − 1) ).

Thus, using the calculation rules for m.g.f., we have:

MZn(t) = M_{(Xn − λn)/√λn}(t) = M_{Xn/√λn − √λn}(t)
       =* exp(−√λn·t) · MXn( t/√λn )
       = exp(−√λn·t) · exp( λn·(e^{t/√λn} − 1) )
       = exp( −√λn·t + λn·(e^{t/√λn} − 1) )

* using Ma·X+b(t) = exp(b·t) · MX(a·t).
3. Find the limit of MZn(t) and prove that it equals MZ(t):

lim_{n→∞} MZn(t) = lim_{n→∞} exp( −√λn·t + λn·(e^{t/√λn} − 1) )

⇒ lim_{n→∞} log(MZn(t)) = lim_{n→∞} −t·√λn + λn·( e^{t/√λn} − 1 )
                         =* lim_{n→∞} −t·√λn + λn·( 1 + t/√λn + (1/2!)·(t/√λn)² + (1/3!)·(t/√λn)³ + . . . − 1 )
                         = lim_{n→∞} (1/2!)·t² + O( 1/√λn ) = t²/2

⇒ lim_{n→∞} MZn(t) = exp(t²/2) = MZ(t).

* using the exponential expansion: e^a = Σ_{i=0}^∞ a^i/i!, with a = t/√λn.
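A numerical sketch (my addition) of the result just proved: for λ = 100, the exact Poisson cdf is already close to the continuity-corrected normal cdf.

```python
import math

lam, k = 100, 110

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Exact Pr(X <= k) for X ~ Poisson(lam), accumulating p.m.f. terms recursively:
term = math.exp(-lam)        # p(0) = e^{-lam}
cdf = term
for i in range(1, k + 1):
    term *= lam / i          # p(i) = p(i-1) * lam / i
    cdf += term

# Normal approximation with continuity correction: Z = (X - lam)/sqrt(lam).
approx = phi((k + 0.5 - lam) / math.sqrt(lam))

print(f"exact {cdf:.4f} vs normal approx {approx:.4f}")
```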
Figure (normal approximation to the Poisson): four panels compare the Poisson(λ) p.m.f. with the approximating normal p.d.f.: λ = 0.1 with N(0.1, 0.1); λ = 1 with N(1, 1); λ = 10 with N(10, 10); λ = 100 with N(100, 100). The approximation improves as λ grows.
Summary
Parameter estimators

Method of moments:
1. Equate the first k sample moments to the corresponding k population moments;
2. Express the k population moments in terms of the parameters of the distribution;
3. Solve the resulting system of simultaneous equations.

Maximum likelihood:
1. Determine the likelihood function L(θ1, θ2, . . . , θk; x);
2. Determine the log-likelihood function;
3. Set the partial derivatives of the log-likelihood with respect to θ1, θ2, . . . , θk to zero (⇒ global/local minimum/maximum);
4. Check whether the second derivative is negative (maximum) and check the boundary conditions.

Bayesian:
1. Find the posterior density using (1) (difficult/tedious integral!) or (2);
2. Compute the Bayesian estimator under a given loss function.
LLN & CLT

Law of large numbers: Let X1, . . . , Xn be independent random variables with equal mean E[Xk] = µ and variance Var(Xk) = σ² for k = 1, . . . , n. Then for all ε > 0 we have:

Pr( |X̄n − µ| > ε ) → 0, as n → ∞.

Central limit theorem: Let X1, . . . , Xn be independent and identically distributed random variables with mean E[Xk] = µ and variance Var(Xk) = σ² for k = 1, . . . , n, then: