INTRODUCTION TO PROBABILITY THEORY AND STATISTICS

HEINRICH MATZINGER
Georgia Tech

E-mail: [email protected]

October 7, 2014

Contents

1 Definition and basic properties
   1.1 Events
   1.2 Frequencies
   1.3 Definition of probability
   1.4 Direct consequences
   1.5 Some inequalities

2 Conditional probability and independence
   2.1 Law of total probability
   2.2 Bayes' rule

3 Expectation

4 Dispersion, average fluctuation and standard deviation
   4.1 Matzinger's rule of thumb

5 Calculation with the Variance
   5.1 Getting the big picture with the help of Matzinger's rule of thumb

6 Covariance and correlation
   6.1 Correlation

7 Chebyshev's and Markov's inequalities

8 Combinatorics

9 Important discrete random variables
   9.1 Bernoulli variable
   9.2 Binomial random variable
   9.3 Geometric random variable

10 Continuous random variables

11 Normal random variables

12 Distribution functions

13 Expectation and variance for continuous random variables

14 Central limit theorem

15 Statistical testing
   15.1 Looking up probabilities for the standard normal in a table
   15.2 Two sample testing

16 Statistical estimation
   16.1 An example
   16.2 Estimation of variance and standard deviation
   16.3 Maximum Likelihood estimation
   16.4 Estimation of parameter for geometric random variables

17 Linear Regression
   17.1 The case where the exact linear model is known
   17.2 When α and β are not known
   17.3 Where the formulas for the estimates of α and β come from
   17.4 Expectation and variance of β
   17.5 How precise are our estimates
   17.6 Multiple factors and/or polynomial regression
   17.7 Other applications

1 Definition and basic properties

1.1 Events

Imagine that we throw a die which has 4 sides. The outcome of this experiment will be one of the four numbers 1, 2, 3 or 4. The set of all possible outcomes in this case is:

Ω = {1, 2, 3, 4}.

Ω is called the outcome space or sample space. Before doing the experiment we don't know what the outcome will be. Each possible outcome has a certain probability to occur. This die-experiment is a random experiment.

We can use our die to make bets. Somebody might bet that the number will be even. We throw the die: if the number we see is 2 or 4, we say that the event "even" has occurred or has been observed. We can identify the event "even" with the set {2, 4}. This might seem a little bit abstract, but by identifying the event with a set, events become easier to handle: sets are well known mathematical objects, whilst the events as we know them from everyday language are not.

In a similar way one might bet that the outcome is a number greater than or equal to 3. This event is realized when we observe a 3 or a 4. The event "greater or equal 3" can thus be viewed as the set {3, 4}. Another example is the event "odd". This is the set {1, 3}.

With this way of looking at things, events are simply subsets of Ω. Take another example: a coin with a side 0 and a side 1. The outcome space or sample space in that case is:

Ω = {0, 1}.

The events are the subsets of Ω; in this case there are 4 of them:

∅, {0}, {1}, {0, 1}.
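Since events are just subsets, the events of a finite sample space can be enumerated with a few lines of code. The sketch below (an illustration added here, not part of the original notes) lists the power set of Ω = {0, 1} and recovers the 4 events named above.

```python
from itertools import combinations

def events(omega):
    """All events (i.e. subsets) of a finite sample space omega."""
    faces = sorted(omega)
    return [set(c) for r in range(len(faces) + 1)
                   for c in combinations(faces, r)]

# The coin with sides 0 and 1 has 4 events: the empty set, {0}, {1}, {0, 1}.
print(len(events({0, 1})))      # 4
# The 4-sided die has 2^4 = 16 events.
print(len(events({1, 2, 3, 4})))  # 16
```

In general, a sample space with s outcomes has 2^s events, one per subset.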

Example 1.1 It might at first seem very surprising that events can be viewed as sets. Consider for example the following sets: the set of bicycles which belong to a Georgia Tech student, the set of all skyscrapers in Atlanta, the set of all one dollar bills which are currently in the US. Let us give a couple of events: the event that after Christmas the unemployment rate is lower than now, the event that our favorite pet dies from a heart attack, the event that I go down with the flu next week. At first, it seems that events are something very different from sets. Let us see in a real world example how mathematicians view events as sets.

Assume that we are interested in where the American economy is going to stand in exactly one year from now. More specifically, we look at unemployment and inflation and wonder if they will be above or below their current level. To describe the situation which we encounter in a year from now, we introduce a two digit variable Z = XY. Let X be equal to one if unemployment is higher in a year than its current level. If it is lower, let X be equal to 0. Similarly, let Y be equal to one if inflation is higher in a year from now. If it is lower, let Y be equal to zero. The possible outcomes for Z are:

{00, 01, 10, 11}.

This is the situation of a random experiment, where the outcome is one of the four possible numbers 00, 01, 10, 11. We don't know what the outcome will be. But each possibility can occur with a certain probability. Let A be the event that unemployment is higher in a year. This corresponds to the outcomes 10 and 11. We thus identify the event A with the set:

{10, 11}.


Let B be the event that inflation is higher in a year from now. This corresponds to the outcomes 01 and 11. We thus view the event B as the set:

{01, 11}.

Recall that the intersection A ∩ B of two sets A and B is the set consisting of all elements contained in both A and B. In our example, the intersection of A and B is equal to A ∩ B = {11}. Let C designate the event that unemployment goes up and that inflation goes up at the same time. This corresponds to the outcome 11. Thus, C is identified with the set {11}. In other words, C = A ∩ B. The general rule which we must remember is:

For any events A and B, if C designates the event that A and B both occur at the same time, then C = A ∩ B.

Let D be the event that unemployment or inflation will be up in a year from now. (By "or" we mean that at least one of them is up.) This corresponds to the outcomes 01, 10, 11. Thus D gets identified with the set:

D = {01, 10, 11}.

Recall that the union of two sets A and B is defined to be the set consisting of all elements which are in A or in B. We see in our example that D = A ∪ B. This is true in general. We must thus remember the following rule:

For any events A and B, if D designates the event that A or B occurs, then D = A ∪ B.
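These two rules can be tried out directly with Python's built-in set type. The sketch below (added here for illustration) encodes the economy example, with `&` playing the role of ∩ and `|` the role of ∪.

```python
# Sample space for the two-digit variable Z = XY from the example above.
omega = {"00", "01", "10", "11"}

A = {"10", "11"}   # unemployment is higher in a year
B = {"01", "11"}   # inflation is higher in a year

C = A & B          # "A and B both occur"  ->  intersection
D = A | B          # "A or B occurs"       ->  union

print(C == {"11"})                 # True
print(D == {"01", "10", "11"})     # True
```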

1.2 Frequencies

Assume that we have a six sided die. In this case the outcome space is

Ω = {1, 2, 3, 4, 5, 6}.

The event "even" in this case is the set

{2, 4, 6}

whilst the event "odd" is equal to

{1, 3, 5}.

Instead of throwing the die only once, we throw it several times. As a result, instead of just a number, we get a sequence of numbers. When throwing the six-sided die I obtained the sequence:

1, 4, 3, 5, 2, 6, 3, 4, 5, 3, . . .

When repeating the same experiment, which consists in throwing the die a couple of times, we are likely to obtain another sequence. The sequence we observe is a random sequence.


In this example we observe one 3 within the first 5 trials and three 3's within the first 10 trials. We write

n_3

for the number of times we observe a 3 among the first n trials. In our example thus: for n = 5 we have n_3 = 1, whilst for n = 10 we find n_3 = 3.

Let A be an event. We denote by n_A the number of times A occurred up to time n. Take for example A to be the event "even". In the above sequence, within the first 5 trials we obtained 2 even numbers. Thus for n = 5 we have that n_A = 2. Within the first 10 trials we found 4 even numbers. Thus, for n = 10 we have n_A = 4. The proportion of even numbers n_A/n for the first 5 trials is equal to 2/5 = 40%. For the first 10 trials, this proportion is 4/10 = 40%.
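The counts n_A and the proportions n_A/n from this example are easy to reproduce in code. The helper `n_event` below is a small illustration added here, not something from the notes.

```python
throws = [1, 4, 3, 5, 2, 6, 3, 4, 5, 3]  # the observed sequence from the text

def n_event(event, throws, n):
    """Number of times the event (a set of faces) occurred in the first n trials."""
    return sum(1 for x in throws[:n] if x in event)

even = {2, 4, 6}
print(n_event({3}, throws, 5), n_event({3}, throws, 10))             # 1 3
print(n_event(even, throws, 5) / 5, n_event(even, throws, 10) / 10)  # 0.4 0.4
```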

1.3 Definition of probability

The basic definition of probability which we use is based on frequencies. For our definition of probability we need an "assumption" about the world surrounding us: Let A designate an event. When we repeat the same random experiment independently many times, we observe that in the long run the proportion of times A occurs tends to stabilize. Whenever we repeat this experiment, the proportion n_A/n in the long run tends to be the same number. A more mathematical way of formulating this is to say that n_A/n converges to a number only depending on A, as n tends to infinity. This is our basic assumption.

Assumption As we keep repeating the same random experiment under the same conditions and such that each trial is independent of the previous ones, we find that the proportion n_A/n tends to a number which only depends on A, as n → ∞.

We are now ready to give our definition of probability:

Definition 1.1 Let A be an event. Assume that we repeat the same random experiment under exactly the same conditions independently many times. Let n_A designate the number of times the event A occurred within the first n repeats of the experiment. We define the probability of the event A to be the real number:

P(A) := lim_{n→∞} n_A/n.

Thus, P(A) designates the probability of the event A. Take for example a four-sided perfectly symmetric die. Because of symmetry, each side must have the same probability. In the long run we will see a fourth of the times a 1, a fourth of the times a 2, a fourth of the times a 3 and a fourth of the times a 4. Thus, for the symmetric die the probability of each side is 0.25.


1.4 Direct consequences

From our definition of probability there are several useful facts which follow immediately:

1. For any event A, we have that P(A) ≥ 0.

2. For any event A, we have that P(A) ≤ 1.

3. Let Ω designate the sample space. Then:

P(Ω) = 1.

Let us prove these elementary facts:

1. By definition n_A/n ≥ 0. However, the limit of a sequence which is ≥ 0 is also ≥ 0. Since P(A) is by definition equal to the limit of the sequence n_A/n, we find that P(A) ≥ 0.

2. By definition n_A ≤ n. It follows that n_A/n ≤ 1. The limit of a sequence which is always less than or equal to one must also be less than or equal to one. Thus, P(A) = lim_{n→∞} n_A/n ≤ 1.

3. By definition n_Ω = n. Thus:

P(Ω) = lim_{n→∞} n_Ω/n = lim_{n→∞} n/n = lim_{n→∞} 1 = 1.

The next two theorems are essential for solving many problems:

Theorem 1.1 Let A and B be disjoint events. Then:

P(A ∪ B) = P(A) + P(B).

Proof. Let C be the event C = A ∪ B. C is the event that A or B has occurred. Because A and B are disjoint, we have that A and B cannot occur at the same time. Thus, when we count up to time n how many times C has occurred, we find that this is exactly equal to the number of times A has occurred plus the number of times B has occurred. In other words,

n_C = n_A + n_B. (1.1)

From this it follows that:

P(C) = lim_{n→∞} n_C/n = lim_{n→∞} (n_A + n_B)/n = lim_{n→∞} (n_A/n + n_B/n).

We know that the sum of limits is equal to the limit of the sum. Applying this to the right side of the last equality above yields:

lim_{n→∞} (n_A/n + n_B/n) = lim_{n→∞} n_A/n + lim_{n→∞} n_B/n = P(A) + P(B).


This finishes the proof that

P(C) = P(A ∪ B) = P(A) + P(B).

Let us give an example which might help us understand why equation 1.1 holds. Imagine we are using a 6-sided die. Let A be the event that we observe a 2 or a 3. Thus A = {2, 3}. Let B be the event that we observe a 1 or a 5. Thus, B = {1, 5}. The two events A and B are disjoint: it is not possible to observe A and B at the same time, since A ∩ B = ∅. Assume that we throw the die 10 times and obtain the sequence of numbers:

1, 3, 4, 6, 3, 4, 2, 5, 1, 2.

We have seen the event A four times: at the second, fifth, seventh and tenth trials. The event B is observed at the first trial and at the eighth and ninth trials. C = A ∪ B = {1, 2, 3, 5} is observed at the trials number 2, 5, 7, 10 and 1, 8, 9. We thus find in this case that n_A = 4, n_B = 3 and n_C = 7, which confirms equation 1.1.

Example 1.2 Assume that we are throwing a fair coin with sides 0 and 1. Let X_i designate the number which we obtain when we flip the coin for the i-th time. Let A be the event that we observe right at the beginning the number 111. In other words:

A = {X_1 = 1, X_2 = 1, X_3 = 1}.

Let B designate the event that we observe the number 101 when we read our random sequence starting from the second trial. Thus:

B = {X_2 = 1, X_3 = 0, X_4 = 1}.

Assume that we want to calculate the probability to observe that at least one of the two events A or B holds. In other words, we want to calculate the probability of the event C = A ∪ B. Note that A and B cannot both occur at the same time. The reason is that for A to hold it is necessary that X_3 = 1, and for B to hold it is necessary that X_3 = 0. X_3, however, cannot be equal to 0 and to 1 at the same time. Thus, A and B are disjoint events, so we are allowed to use theorem 1.1. We find, applying theorem 1.1, that:

P(A ∪ B) = P(A) + P(B).

With a fair coin, each 3-digit number has the same probability. There are 8 3-digit numbers, so each one has probability 1/8. It follows that P(A) = 1/8 and P(B) = 1/8. Thus

P(A ∪ B) = 1/8 + 1/8 = 1/4 = 25%.


The next theorem is useful for any pair of events A and B, and not just disjoint events:

Theorem 1.2 Let A and B be two events. Then:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Let C = A ∪ B. Let D = B − A, that is, D consists of all the elements that are in B but not in A. We have by definition that C = D ∪ A and that D and A are disjoint. Thus we can apply theorem 1.1 and find:

P(C) = P(A) + P(D). (1.2)

Furthermore, (A ∩ B) and D are disjoint, and we have B = (A ∩ B) ∪ D. We can thus apply theorem 1.1 and find that:

P(B) = P(A ∩ B) + P(D). (1.3)

Subtracting equation 1.3 from equation 1.2 yields:

P(C) − P(B) = P(A) − P(A ∩ B).

By adding P(B) on both sides of the last equation, we find:

P(C) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This finishes this proof.

Problem 1.1 Let a and b designate two genes. Let the probability that a randomly picked person in the US has gene a be 20%. Let the probability for gene b be 30%. Finally, let the probability that the person has both genes at the same time be 10%. What is the probability to have at least one of the two genes?

Let us explain how we solve the above problem: Let A, resp. B, designate the event that the randomly picked person has gene a, resp. b. We know that:

• P(A) = 20%

• P(B) = 30%

• P(A ∩ B) = 10%

The event to have at least one gene is the event A ∪ B. By theorem 1.2 we have that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Thus in our case: P(A ∪ B) = 20% + 30% − 10% = 40%. This solves the above problem. In many situations we will be considering the union of three or more events. The next theorem gives the formula for three events:


Theorem 1.3 Let A, B and C be three events. Then we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Proof. We already know the formula for the probability of the union of two events, so we are going to use this formula. Let D denote the union D := B ∪ C. Then we find

A ∪ B ∪ C = A ∪ D

and hence

P(A ∪ B ∪ C) = P(A ∪ D). (1.4)

By theorem 1.2, the right side of the last equation above is equal to:

P(A ∪ D) = P(A) + P(D) − P(A ∩ D) = P(A) + P(B ∪ C) − P(A ∩ (B ∪ C)). (1.5)

Note that by theorem 1.2 we have:

P(B ∪ C) = P(B) + P(C) − P(B ∩ C).

We have:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and hence

P(A ∩ (B ∪ C)) = P((A ∩ B) ∪ (A ∩ C)). (1.6)

But the right side of the last equation above is the probability of the union of two events, and hence theorem 1.2 applies:

P((A ∩ B) ∪ (A ∩ C)) = P(A ∩ B) + P(A ∩ C) − P((A ∩ B) ∩ (A ∩ C)) = P(A ∩ B) + P(A ∩ C) − P(A ∩ B ∩ C). (1.7)

Combining now equations 1.4, 1.5, 1.6 and 1.7, we find

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
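For a fair die, the probability of an event is simply its size divided by 6, so theorem 1.3 can be checked mechanically by enumeration. The sketch below (with events chosen arbitrarily for illustration, not taken from the notes) confirms that both sides agree.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def p(event):
    """Probability of an event under a fair die: |event| / |omega|."""
    return Fraction(len(event), len(omega))

A, B, C = {1, 2}, {2, 3}, {1, 3, 4}

lhs = p(A | B | C)
rhs = (p(A) + p(B) + p(C)
       - p(A & B) - p(A & C) - p(B & C)
       + p(A & B & C))
print(lhs, rhs)  # 2/3 2/3
```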

Often it is easier to calculate the probability of a complement than the probability of the event itself. In such a situation, the following theorem is useful:

Theorem 1.4 Let A be an event and let A^c denote its complement. Then:

P(A) = 1 − P(A^c).

Proof. Note that the events A and A^c are disjoint. Furthermore, by definition A ∪ A^c = Ω. Recall that for the sample space Ω, we have that P(Ω) = 1. We can thus apply theorem 1.1 and find that:

1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c).

This implies that

P(A) = 1 − P(A^c),

which finishes this proof.
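As a small illustration of theorem 1.4 (the scenario is added here and is not taken from the notes): the probability to see at least one 1 in three throws of a fair 4-sided die is easiest to compute through the complement "no 1 at all", giving 1 − (3/4)^3 = 0.578125. The sketch below checks this by enumerating all equally likely outcomes.

```python
from itertools import product

# All 4^3 = 64 equally likely outcomes of three throws of a fair 4-sided die.
outcomes = list(product([1, 2, 3, 4], repeat=3))

p_no_one = sum(1 not in o for o in outcomes) / len(outcomes)        # P(A^c)
p_at_least_one = sum(1 in o for o in outcomes) / len(outcomes)      # P(A)

print(p_at_least_one)      # 0.578125
print(1 - p_no_one)        # 0.578125, as theorem 1.4 predicts
```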


1.5 Some inequalities

Theorem 1.5 Let A and B be two events. Then:

P(A ∪ B) ≤ P(A) + P(B).

Proof. We know by theorem 1.2 that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Since P(A ∩ B) ≥ 0, we have that

P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B).

It follows that

P(A ∪ B) ≤ P(A) + P(B).

For several events a similar theorem holds:

Theorem 1.6 Let A_1, . . . , A_n be a collection of n events. Then

P(A_1 ∪ A_2 ∪ . . . ∪ A_n) ≤ P(A_1) + . . . + P(A_n).

Proof. By induction.

Another often used inequality is:

Theorem 1.7 Let A ⊂ B. Then:

P(A) ≤ P(B).

Proof. If A ⊂ B, then for every n we have that

n_A ≤ n_B,

hence also

n_A/n ≤ n_B/n.

Thus:

lim_{n→∞} n_A/n ≤ lim_{n→∞} n_B/n.

Hence

P(A) ≤ P(B).


2 Conditional probability and independence

Imagine the following situation: in a population there are two illnesses a and b. We assume that 20% suffer from b, 15% suffer from a, whilst 10% suffer from both. Let A be the event that a person suffers from a and let B be the event that a person suffers from b. If a patient comes to a doctor and says that he suffers from illness b, how likely is he to have illness a also? (We assume that the patient has been tested for b but not yet tested for a.) We note that half the population group suffering from b also suffers from a. Hence, when the doctor meets such a patient suffering from b, there is a chance of 1 out of 2 that the person also suffers from a. This is called the conditional probability of A given B and denoted by P(A|B). The formula we used is 10%/20% = P(A ∩ B)/P(B).

Definition 2.1 Let A, B be two events. Then we define the probability of A conditional on the event B, and write P(A|B), for the number:

P(A|B) := P(A ∩ B)/P(B).

Definition 2.2 Let A, B be two events. We say that A and B are independent of each other iff

P(A ∩ B) = P(A) · P(B).

Note that A and B are independent of each other if and only if P(A|B) = P(A) (provided P(B) > 0). In other words, A and B are independent of each other if and only if the realization of one of the events does not affect the conditional probability of the other. Assume that we perform two random experiments independently of each other, in the sense that the two experiments do not interact. That is, the experiments have no influence on each other. Let A denote an event related to the first experiment, and let B denote an event related to the second experiment. We saw in class that in this situation the equation P(A ∩ B) = P(A) · P(B) must hold. And thus, A and B are independent in the sense of the above definition. To show this we used an argument where we simulated the two random experiments by picking marbles from two bags. There are also many cases where events related to the same experiment are independent in the sense of the above definition. For example, for a fair die, the events A = {1, 2} and B = {2, 4, 6} are independent. There can also be more than two independent events at a time:

Definition 2.3 Let A_1, A_2, . . . , A_n be a finite collection of events. We say that A_1, A_2, . . . , A_n are all independent of each other iff

P(∩_{i∈I} A_i) = ∏_{i∈I} P(A_i)

for every subset I ⊂ {1, 2, . . . , n}.

The next example is very important for the test on Wednesday.


Example 2.1 Assume we flip the same coin independently three times. Let the coin be biased, so that side 1 has probability 60% and side 0 has probability 40%. What is the probability to observe the number 101? (By this we mean: what is the probability to first get a 1, then a 0 and eventually, at the third trial, a 1 again?) To solve this problem let A_1, resp. A_3, be the event that at the first, resp. third, trial we get a one. Let A_2 be the event that at the second trial we get a zero. Observing a 101 is thus equal to the event A := A_1 ∩ A_2 ∩ A_3. Because the trials are performed in an "independent" manner, it follows that the events A_1, A_2, A_3 are independent of each other. Thus we have that:

P(A_1 ∩ A_2 ∩ A_3) = P(A_1) · P(A_2) · P(A_3).

We have that:

P(A_1) = 60%, P(A_2) = 40%, P(A_3) = 60%.

It follows that:

P(A_1 ∩ A_2 ∩ A_3) = 60% · 40% · 60% = 0.144.
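A quick Monte Carlo check of this computation (a sketch added for illustration, not part of the notes): simulate many triples of biased flips and count how often 101 appears.

```python
import random

random.seed(1)  # fixed seed for reproducibility

def flip():
    """One flip of the biased coin: 1 with probability 0.6, else 0."""
    return 1 if random.random() < 0.6 else 0

n = 100_000
hits = sum(1 for _ in range(n)
           if (flip(), flip(), flip()) == (1, 0, 1))
print(hits / n)  # close to 0.6 * 0.4 * 0.6 = 0.144
```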

2.1 Law of total probability

Lemma 2.1 Let A and B be two events. Then

P(A) = P(A ∩ B) + P(A ∩ B^c). (2.1)

Furthermore, if B and B^c both have probabilities that are not equal to zero, then

P(A) = P(A|B) · P(B) + P(A|B^c) · P(B^c). (2.2)

Proof. Let D be the event D := A ∩ B. Let E be the event E := A ∩ B^c. Then we have that D and E are disjoint. Furthermore, A = E ∪ D, so that by theorem 1.1 we find:

P(A) = P(E ∪ D) = P(E) + P(D).

Replacing D and E by A ∩ B and A ∩ B^c yields equation 2.1. We can use 2.1 to find

P(A) = [P(A ∩ B) · P(B)]/P(B) + [P(A ∩ B^c) · P(B^c)]/P(B^c).

The right side of the last equality above is equal to

P(A|B) · P(B) + P(A|B^c) · P(B^c),

which finishes the proof of equation 2.2.

Let us give an example which should show that intuitively this law is very clear. Assume that in a town 90% of women are blond but only 20% of men. Assume we choose a person at random from this town. Each person is equally likely to be drawn. Let W be the event that the person is a woman and B be the event that the person is blond. The law of total probability can be written as

P(B) = P(B|W) · P(W) + P(B|W^c) · P(W^c). (2.3)

In our case, the conditional probability of "blond" conditional on "woman" is P(B|W) = 0.9. On the other hand, W^c is the event to draw a male, and P(B|W^c) is the conditional probability to be blond given that the person is a man. In our case, P(B|W^c) = 0.2. So, when we put the numerical values into equation 2.3, we find

P(B) = 0.9 · P(W) + 0.2 · P(W^c). (2.4)

Here P(W) is the probability that the chosen person is a woman. This is then the percentage of women in this population. Similarly, P(W^c) is the proportion of men. In other words, equation 2.4 can be read as follows: the total proportion of blonds in the population is the weighted average between the proportion of blonds among the female and the male population.

2.2 Bayes' rule

Bayes' rule is useful when one would like to calculate the conditional probability of A given B, but one is given the opposite, that is, the probability of B given A. Let us next state Bayes' rule:

Lemma 2.2 Let A and B be events both having non-zero probabilities. Then

P(A|B) = P(B|A) · P(A) / P(B). (2.5)

Proof. By definition of conditional probability we have P(B|A) = P(B ∩ A)/P(A). We are now going to plug this expression into the right side of equation 2.5. We find:

P(B|A) · P(A) / P(B) = [P(A ∩ B)/P(A)] · P(A) / P(B) = P(A ∩ B)/P(B) = P(A|B),

which establishes equation 2.5.

Let us give an example. Assume that 30% of men are interested in car races, but only 10% of women are. If I know that a person is interested in car races, what is the probability that it is a man? Again, imagine that we pick a person at random in the population. Let M be the event that the person is a man and C the event that she/he is interested in car races. We know P(C|M) = 0.3 and P(C|M^c) = 0.1. Now by Bayes' rule, the conditional probability that the person is a man given that he/she is interested in car races is:

P(M|C) = P(C|M) · P(M) / P(C). (2.6)

We have that P(C) = P(C|M) · P(M) + P(C|M^c) · P(M^c), which we can plug into 2.6 to find

P(M|C) = P(C|M) · P(M) / [P(C|M) · P(M) + P(C|M^c) · P(M^c)].

In the present numerical example, we find

P(M|C) = 0.3 · P(M) / [0.3 · P(M) + 0.1 · P(M^c)],

where P(M) represents the proportion of men in the population, whilst P(M^c) represents the proportion of women.

3 Expectation

Imagine a firm which every year makes a profit. It is not known in advance what the profit of the firm is going to be. This means that the profit is random: we can assign to each possible outcome a certain probability. Assume that from year to year the probabilities for the profit of our firm do not change. Assume also that from one year to the next the profits are independent. What is the long term average yearly profit equal to? For this let us look at a specific model. Assume the firm could make 1, 2, 3 or 4 million profit with the following probabilities:

x          1     2     3     4
P(X = x)   0.1   0.4   0.3   0.2

(The model here is not very realistic since there are only a few possible outcomes. We chose it merely to be able to illustrate our point.) Let X_i denote the profit in year i. Hence, we have that X, X_1, X_2, . . . are i.i.d. random variables. To calculate the long term average yearly profit, consider the following. In 10% of the years in the long run we get 1 million. If we take a period of n years, where n is large, we thus find that in about 0.1n years we make 1 million. In 40% of the years we make 2 millions in the long run. Hence, in a period of n years, this means that in about 0.4n years we make 2 millions. This corresponds to an amount of money equal to about 0.4n times 2 millions. Similarly, for n large, the money made during the years where we earned 3 millions is about 3 · 0.3n, whilst for the years where we made 4 millions we get 4 · 0.2n. The average yearly profit during this n year period is thus about

1 · 0.1 + 2 · 0.4 + 3 · 0.3 + 4 · 0.2 = 1 · P(X = 1) + 2 · P(X = 2) + 3 · P(X = 3) + 4 · P(X = 4) = 2.6.

Hence, in the long run the yearly average profit is 2.6 millions. This long term average is called expected value or expectation and is denoted by E[X]. Let us formalize this concept: in general, if X denotes the outcome of a random experiment, then we call X a random variable.
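The expected-value computation from the table can be written as a one-liner:

```python
# Value -> probability, as in the profit table above.
profits = {1: 0.1, 2: 0.4, 3: 0.3, 4: 0.2}

expectation = sum(x * p for x, p in profits.items())
print(round(expectation, 10))  # 2.6
```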


Definition 3.1 Let us consider a random experiment with a finite number of possible outcomes, where the sample space is

Ω = {x_1, x_2, . . . , x_s}.

(In the profit example above, we would have Ω = {1, 2, 3, 4}.) Let X denote the outcome of this random experiment. For x ∈ Ω, let p_x denote the probability that the outcome of our random experiment is x. That is:

p_x := P(X = x).

(In the last example above, we have for example p_1 = 0.1 and p_2 = 0.4, . . .) We define the expected value E[X]:

E[X] := Σ_{x∈Ω} x · p_x.

In other words, to calculate the expected value of a random variable, we simply multiply each possible value by its probability and then take the sum over all possible outcomes. Let us see yet another example for expectation.

Example 3.1 Let X denote the value which we obtain when we throw a fair coin with side 0 and side 1. Then we find that:

E[X] = 0.5 × 1 + 0.5 × 0 = 0.5.

When we keep repeating the same random experiment independently and under the same conditions, we will see that in the long run the average value which we observe converges to the expectation. This is what we saw in the firm/profit example above. Let us formalize this. This fact is actually a theorem, called the Law of Large Numbers. It goes as follows:

Theorem 3.1 Assume we repeat the same random experiment under the same conditions independently many times. Let X_i denote the random variable which is the outcome of the i-th experiment. Then:

lim_{n→∞} (X_1 + X_2 + . . . + X_n)/n = E[X_1]. (3.1)

This simply means that in the long run, the average is going to be equal to the expectation.

Proof. Let Ω denote the sample space of the random variables X_i:

Ω = {x_1, x_2, . . . , x_s}.

By regrouping the same terms together, we find:

X_1 + X_2 + . . . + X_n = x_1 n_{x_1} + x_2 n_{x_2} + . . . + x_s n_{x_s}.

(Remember that n_{x_i} denotes the number of times we observe the value x_i in the finite sequence X_1, X_2, . . . , X_n.) Thus:

lim_{n→∞} (X_1 + X_2 + . . . + X_n)/n = lim_{n→∞} (x_1 n_{x_1}/n + . . . + x_s n_{x_s}/n).

By definition,

P(X_1 = x_i) = lim_{n→∞} n_{x_i}/n.

Since the limit of a sum is the sum of the limits, we find

lim_{n→∞} (x_1 n_{x_1}/n + . . . + x_s n_{x_s}/n) = x_1 lim_{n→∞} n_{x_1}/n + . . . + x_s lim_{n→∞} n_{x_s}/n = x_1 P(X = x_1) + . . . + x_s P(X = x_s) = E[X_1].
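The law of large numbers can be watched in action: sampling many years of profits from the table in section 3 gives an average close to E[X] = 2.6. The sketch below (an illustration added here) uses Python's `random.choices` to draw the sample.

```python
import random

random.seed(2)  # fixed seed for reproducibility

values, probs = [1, 2, 3, 4], [0.1, 0.4, 0.3, 0.2]

n = 100_000
sample = random.choices(values, weights=probs, k=n)
print(sum(sample) / n)  # close to E[X] = 2.6
```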

So, we can now generalize our firm profit example. Imagine for this that the profit a firm makes every month is random. Imagine also that the earnings from month to month are independent of each other and also have the same "probabilities". In this case we can view the sequence of monthly earnings as a sequence of repeats of the same random experiment. Because of theorem 3.1, in the long run the average monthly income will be equal to the expectation. Let us next give a few useful lemmas in connection with expectation. The first lemma deals with the situation where we take an i.i.d. sequence of random outcomes X_1, X_2, X_3, . . . and multiply each one of them by a constant a. Let Y_i denote the number X_i multiplied by a: hence Y_i := aX_i. Then the long term average of the X_i's multiplied by a is equal to the long term average of the Y_i's. Let us state this fact in a formal way:

Lemma 3.1 Let X denote the outcome of a random experiment. (Thus X is a so-calledrandom variable.) Let a be a real (non-random) number. Then:

E[aX] = aE[X].

Proof. Let us repeat the same experiment independently many times. Let Xi denote the outcome of the i-th trial. Let Yi be equal to Yi := aXi. Then by the law of large numbers, we have that

lim_{n→∞} (Y1 + . . . + Yn)/n = E[Y1] = E[aX1].

However:

lim_{n→∞} (Y1 + . . . + Yn)/n = lim_{n→∞} (aX1 + . . . + aXn)/n = lim_{n→∞} a·(X1 + . . . + Xn)/n = a·lim_{n→∞} (X1 + . . . + Xn)/n = a·E[X1].


This proves that E[aX1] = aE[X1] and finishes this proof. The next lemma is extremely important when dealing with the expectation of sums of random variables. It states that the sum of the expectations is equal to the expectation of the sum. We can think of a simple real life example which shows why this should be true. Imagine that Matzinger is the owner of two firms (wishful thinking since Matzinger is a poor professor). Let Xi denote the profit made by his first firm in year i. Let Yi denote the profit made by his second firm in year i. We assume that from year to year the "probabilities" do not change for both firms and the profits are independent (from year to year). In other words X, X1, X2, X3, . . . are i.i.d. variables and so are Y, Y1, Y2, . . .. Let Zi denote the total profit Matzinger makes in year i, so that Zi = Xi + Yi. Now obviously the long term average yearly profit of Matzinger is the long term average yearly profit from the first firm plus the long term average yearly profit from the second firm. In mathematical writing this gives:

E[X + Y ] = E[X] + E[Y ].

As a matter of fact, E[X + Y] denotes the long term average profit of Matzinger. On the other hand, E[X] denotes the average profit of the first firm, whilst E[Y] denotes the average profit of the second firm. Let us next formalize all of this:

Lemma 3.2 Let X, Y denote the outcomes of two random experiments. Then:

E[X + Y ] = E[X] + E[Y ].

Proof. Let us repeat the two random experiments independently many times. Let Xi denote the outcome of the i-th trial of the first random experiment. Let Yi be equal to the outcome of the i-th trial of the second random experiment. For all i ∈ N, let Zi := Xi + Yi. Then by the law of large numbers, we have that:

lim_{n→∞} (Z1 + . . . + Zn)/n = E[Z1] = E[X1 + Y1].

However:

lim_{n→∞} (Z1 + . . . + Zn)/n = lim_{n→∞} (X1 + Y1 + X2 + Y2 + . . . + Xn + Yn)/n
= lim_{n→∞} ( (X1 + . . . + Xn) + (Y1 + . . . + Yn) )/n
= lim_{n→∞} (X1 + . . . + Xn)/n + lim_{n→∞} (Y1 + . . . + Yn)/n = E[X1] + E[Y1].

This proves that E[X1 + Y1] = E[X1] + E[Y1] and finishes this proof. It is very important to note that the above lemma does not require X and Y to be independent of each other.


Lemma 3.3 Let X, Y denote the outcomes of two independent random experiments. Then:

E[X · Y ] = E[X] · E[Y ].

Proof. We assume that X takes values in a countable set ΩX, whilst Y takes on values from the countable set ΩY. We have that

E[XY] = Σ_{x∈ΩX, y∈ΩY} x·y·P(X = x, Y = y).     (3.2)

By independence of X and Y, we have that P(X = x, Y = y) = P(X = x)·P(Y = y). Plugging the last equality into 3.2, we find

E[XY] = Σ_{x∈ΩX, y∈ΩY} x·y·P(X = x)·P(Y = y) = ( Σ_{x∈ΩX} x·P(X = x) )·( Σ_{y∈ΩY} y·P(Y = y) ) = E[X]·E[Y].

So we have proven that E[XY ] = E[X] · E[Y ].
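These three lemmas can be sanity-checked on small discrete distributions. The two dice below are invented for illustration; the expectations are computed directly from the probability tables:

```python
from itertools import product

# Two independent dice: X uniform on {1,...,6}, Y uniform on {1,...,4}.
px = {x: 1 / 6 for x in range(1, 7)}
py = {y: 1 / 4 for y in range(1, 5)}

def E(dist):
    """Expectation of a discrete distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

EX, EY = E(px), E(py)

# Lemma 3.1: E[aX] = a E[X]
a = 3.0
E_aX = sum(a * x * p for x, p in px.items())
assert abs(E_aX - a * EX) < 1e-12

# Lemma 3.2: E[X + Y] = E[X] + E[Y]; Lemma 3.3: E[XY] = E[X] E[Y] (independent)
E_sum = sum((x + y) * px[x] * py[y] for x, y in product(px, py))
E_prod = sum(x * y * px[x] * py[y] for x, y in product(px, py))
assert abs(E_sum - (EX + EY)) < 1e-12
assert abs(E_prod - EX * EY) < 1e-12
print(EX, EY, E_sum, E_prod)
```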

4 Dispersion, average fluctuation and standard deviation

In some problems we are only interested in the expectation of a random variable. For example, consider insurance policies for mobile telephones sold by a big phone company. Say Xi is the amount which will be paid during the coming year to the i-th customer due to his/her phone breaking down. It seems reasonable to assume that the Xi's are independent of each other. (We assume no phone viruses.) We also assume that they all follow the same random model. So, by the Law of Large Numbers we have that for n large, the average is approximately equal to the expectation:

(X1 + X2 + . . . + Xn)/n ≈ E[Xi].

Hence, when n is "really large", there is no risk involved for the phone company: they know how much they will have to pay in total. On a per customer basis, they will have to spend an amount very close to E[X1]. In other words, they only need one real number from the probability model for the claims: the expectation E[Xi]. Now, in many other applications knowing only the expected value will not be enough: we will also need a measure of the dispersion. This means that we will also want to know how much on average the variables fluctuate from their long term average E[X1].
Let us give an example. Matzinger as a child used to walk with his mother every day on the shores of Lake Geneva. Now, there is a place where there is a scale to measure the height of the water. So, hydrologists measure the water level and then analyze this data. Assume that Xi denotes the water level on a specific day in year i. (We assume that we always measure on the same day of the year, for example on the first of January.) For the current discussion we assume that the model does not change


over time (no global warming). We furthermore assume that from one year to the next the values areindependent. Say the random model would be given as follows:

x          4    5    6    7    8    9
P(X = x)   1/6  1/6  1/6  1/6  1/6  1/6

How much does the water level fluctuate on average from year to year? Note that the long term average, that is the expectation, is equal to

E[Xi] = 4·(1/6) + 5·(1/6) + 6·(1/6) + 7·(1/6) + 8·(1/6) + 9·(1/6) = 6.5

Now, when the water level is 6 or 7, then we are 0.5 away from the long term average of 6.5. In such a year i, we will say that the fluctuation fi is 0.5. In other words, we measure for each year i how far we are from E[Xi]. This "observed fluctuation" in year i is then equal to

fi := |Xi − 6.5| = |Xi − E[Xi]|.

In our model, fi = 0.5 happens with a probability of 1/3, that is, in the long run, in one third of the years. When the water level is either 8 or 5, then we are 1.5 away from the long term average of 6.5. This also has a probability of 1/3. Finally, with water levels of 4 or 9, we are 2.5 away from the long term average, and again this will happen in a third of the years in the long run. So, if this model holds, the long term average fluctuation will always tend to be about

E[fi] = E[|Xi − E[Xi]|] = 2.5·(1/3) + 1.5·(1/3) + 0.5·(1/3) = 1.5

after many years. To understand why, simply consider the fluctuations f1, f2, f3, . . .. By the Law of Large Numbers applied to them, we get that for n large, the average fluctuation is approximately equal to its expectation:

(f1 + f2 + . . . + fn)/n ≈ E[fi] = E[|Xi − E[Xi]|].     (4.1)

So, no matter what, after many years we will always know what the average fluctuation is approximately equal to: the expression on the right side of 4.1.

The real number

Long term average fluctuation = E[|Xi − E[Xi]|] (4.2)

is a measure of the dispersion (around the expectation) in our model. It should be obvious why this "dispersion" is important: if it is small, the people of Geneva will be safe. If it is big, they will often have to deal with flooding. So, in some sense, we can view the value given in 4.2 as a measure of risk: if the dispersion is 0, then there is no risk and the random number is not random but always equal to the fixed value E[X1]!

In modern statistics, however, one most often considers a number which represents the same idea but can be slightly different from 4.2. The number we will use most often is not the average fluctuation, but instead the square root of the average squared fluctuation. This number is called the standard deviation of a random variable. We usually denote it by σ, so

σX := √(E[(X − E[X])²]).


The long term average squared fluctuation of a random variable X is also called the variance, and will be denoted by VAR[X], so that

VAR[X] := E[(X − E[X])²].

With this definition the standard deviation is simply the square root of the variance:

σX = √(VAR[X]).

In most cases, σX and our other measure of dispersion given by E[|X − E[X]|] are almost equal.

Let us go back to our example. The variance is the average squared fluctuation. We thus get:

VAR[Xi] = E[fi²] = 2.5²·(1/3) + 1.5²·(1/3) + 0.5²·(1/3) ≈ 2.92

and hence the standard deviation is

σXi = √(VAR[Xi]) = √2.92 ≈ 1.71.

So, we see the average fluctuation size was E[|Xi − E[Xi]|] = 1.5, whilst the standard deviation is (only) about 14% bigger.
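The numbers in this water-level example can be reproduced directly from the probability table (a short sketch):

```python
# Water-level model from the table: levels 4..9, each with probability 1/6.
levels = [4, 5, 6, 7, 8, 9]
p = 1 / 6

mean = sum(x * p for x in levels)                    # E[X] = 6.5
avg_fluct = sum(abs(x - mean) * p for x in levels)   # E[|X - E[X]|] = 1.5
variance = sum((x - mean) ** 2 * p for x in levels)  # VAR[X] = 35/12 ≈ 2.92
std_dev = variance ** 0.5                            # σ ≈ 1.71

print(mean, avg_fluct, round(variance, 2), round(std_dev, 2))
```

Note how close the average fluctuation (1.5) and the standard deviation (≈1.71) are, as the text says.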

Now, the standard deviation is most often used for determining the order of magnitude of a "random imprecision". So, we don't care about knowing that number absolutely exactly: instead we just want the order of magnitude. In other words, in most applications, E[|Xi − E[Xi]|] and σXi are sufficiently close to each other that it does not matter which one of the two we take! But it will turn out that the standard deviation allows for certain calculations which the other measure of dispersion in 4.2 does not allow for. So, we will work more often with the standard deviation than with the other.

4.1 Matzinger's rule of thumb

A rule of thumb is that most variables most of the time take values not further than two standard deviations from their expected values. We could thus write in a loose way:

X ≈ E[X] ± 2σX .

To understand where this rule comes from, simply think of the following: for example, the average American household income is around $70,000. How many households make more than twice that much, that is, above $140,000? Certainly not a very large portion of the population. Now, in our case the argument is not about the "average", but about the "average fluctuation". Still, it is an average. So, what is true for averages should also be true for an average of fluctuations. We will see below Chebyshev's inequality, which covers the "worst possible scenario". The probability


for any random variable to be further than 2 standard deviations from its expected value can be as much as 25% but never more:

P(|Z − E[Z]| ≥ 2σZ) ≤ 0.25.

The above inequality holds for any random variable, so it represents in some sense the "worst case". It will be proven in our section on Chebyshev's inequality. For normal variables, the probability to be further than two standard deviations is much smaller: it is about 0.05. Now, we will see in the section on the central limit theorem that a sum of many independent random contributions is approximately normal as soon as they follow about the same model. Now, 0.05 is much smaller than 0.25. In real life, in many cases, one will be in between these two possibilities. This rule of thumb is extremely useful when analyzing data and trying to get the big picture!

5 Calculation with the Variance

Let X be the outcome of a random experiment. We define the variance of X to be equal to:

VAR[X] := E[(X − E[X])²].

The square root of the variance is called the standard deviation:

σX := √(VAR[X]).

The standard deviation is a measure of the typical order of magnitude of how far the value we get after doing the experiment once is from E[X].

Lemma 5.1 Let a be a non-random number and X the outcome of a random experiment. Then:

VAR[aX] = a²·VAR[X].

Proof. We have:

VAR[aX] = E[(aX − E[aX])²] = E[(aX − aE[X])²] = E[a²·(X − E[X])²] = a²·E[(X − E[X])²] = a²·VAR[X],

which finishes the proof that VAR[aX] = a²·VAR[X].

Lemma 5.2 Let X be the outcome of a random experiment (in other words, a random variable). Then:

VAR[X] = E[X²] − (E[X])².

Proof. We have that

E[(X − E[X])²] = E[X² − 2X·E[X] + E[X]²] = E[X²] − 2E[X·E[X]] + E[E[X]²].     (5.1)

Now E[X] is a constant and constants can be taken out of the expectation. This implies that

E[X·E[X]] = E[X]·E[X] = E[X]².     (5.2)

On the other hand, the expectation of a constant is the constant itself. Thus, since E[X]² is a constant, we find:

E[E[X]²] = E[X]².     (5.3)

Using equations 5.2 and 5.3 with 5.1 we find

E[(X − E[X])²] = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².

This finishes the proof that VAR[X] = E[X²] − E[X]².

Lemma 5.3 Let X and Y be the outcomes of two random experiments which are independent of each other. Then:

VAR[X + Y] = VAR[X] + VAR[Y].

Proof. We have:

VAR[X + Y] = E[((X + Y) − E[X + Y])²] = E[(X + Y − E[X] − E[Y])²]
= E[((X − E[X]) + (Y − E[Y]))²]
= E[(X − E[X])² + 2(X − E[X])(Y − E[Y]) + (Y − E[Y])²]
= E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²].

Since X and Y are independent, (X − E[X]) is also independent of (Y − E[Y]). Thus, we can use Lemma 3.3, which says that the expectation of a product equals the product of the expectations in case the variables are independent. We find:

E[(X − E[X])(Y − E[Y])] = E[X − E[X]]·E[Y − E[Y]].

Furthermore:

E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0.

Thus

E[(X − E[X])(Y − E[Y])] = 0.

Applying this to the above formula for VAR[X + Y], we get:

VAR[X + Y] = E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²]
= E[(X − E[X])²] + E[(Y − E[Y])²] = VAR[X] + VAR[Y].

This finishes our proof.


5.1 Getting the big picture with the help of Matzinger's rule of thumb

We mentioned that most of the time, any random variable takes values no further than two times its standard deviation from its expectation. We can apply this, together with our calculation rules for the variance, to understand how insurance, hedging investments, and even statistical estimation work. Let X1, X2, . . . be a sequence of random variables which all follow the same model and are independent of each other. Let Z be the sum of n such variables:

Z = X1 + X2 + . . . + Xn

We find that

E[Z] = E[X1 + X2 + . . . + Xn] = E[X1] + E[X2] + . . . + E[Xn] = nE[X1]

Similarly, we can use the fact that the variance of a sum of independent variables is the sum of the variances to find:

VAR[Z] = VAR[X1 + X2 + . . . + Xn] = VAR[X1] + VAR[X2] + . . . + VAR[Xn] = n·VAR[X1].

Using the last equation above with the fact that the standard deviation is the square root of the variance, we find:

σZ = √(VAR[Z]) = √n·√(VAR[X1]) = √n·σX1.

In other words: the expectation of a sum of n independent variables grows like n times a constant, but the standard deviation grows only like the square root of n times a constant! This is everything you need to know to understand how insurance and other risk reducers work. Let us see different examples of what these random numbers Xi could represent:

• Say you are an insurance company specializing in providing life insurance. Let Xi be the claim in the current year of the i-th client. You have n clients, so the total claim which you as a company will have to pay is Z = X1 + X2 + . . . + Xn.

• You buy houses which you flip and then try to sell at a profit. You have bought houses all over the US. Assuming the economy and the real estate market stay very stable, we can assume that the selling prices will be independent of each other. So, let Xi represent the profit (or loss) for the i-th house which you are currently renovating. This profit or loss is random, since you don't know exactly what it will be until you sell. Assume that you currently have n houses which you are renovating. Then Z = X1 + . . . + Xn is your total profit or loss with the n houses you are currently holding. This is a random variable since its outcome is not known in advance.


So, again, it is all based on the following two equations, which hold when the Xi's are independent and follow the same model:

σ_{X1+...+Xn} = σX1·√n

E[X1 + . . . + Xn] = n·E[X1]

So for example with n = 1000000, we get

σZ = 1000·σX1

whilst

E[Z] = 1000000·E[X1],

so σZ becomes negligible compared to E[Z]. So, if we think that most of the time a variable is within two standard deviations of its expectation, we find

Z ≈ 1000000·E[X1] ± 2·1000·σX1,

so, compared to the order of magnitude of Z, the fluctuation becomes almost negligible!
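The √n scaling can be seen in a small simulation. The claims model below (each client claims 100 with probability 0.1, else 0, so E[X1] = 10 and σX1 = 30) is invented for illustration:

```python
import random
import statistics

# Illustration of E[Z] = n E[X1] and sigma_Z = sqrt(n) sigma_X1.
random.seed(1)

def total_claim(n):
    """Z = X1 + ... + Xn for n i.i.d. clients (claim 100 w.p. 0.1, else 0)."""
    return sum(100 if random.random() < 0.1 else 0 for _ in range(n))

for n in (100, 400, 1600):
    samples = [total_claim(n) for _ in range(2000)]
    # Quadrupling n quadruples the mean but only doubles the spread.
    print(n, round(statistics.mean(samples)), round(statistics.stdev(samples)))
```

For n = 100, 400, 1600 the means come out near 1000, 4000, 16000 while the standard deviations come out near 300, 600, 1200: the relative fluctuation shrinks like 1/√n.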

6 Covariance and correlation

Two random variables are dependent when their joint distribution is not simply the product of their marginal distributions. But the degree of dependence can vary from strong dependence to loose dependence. One measure of the degree of dependence of random variables is the covariance. For random variables X and Y we define the covariance as follows:

COV [X, Y ] = E[(X − E[X])(Y − E[Y ])]

Lemma:

For random variables X and Y there is also another equivalent formula for the covariance:

COV [X, Y ] = E[XY ] − E[X]E[Y ]

Proof:

E[(X − E[X])(Y − E[Y])] = E[XY − Y·E[X] − X·E[Y] + E[X]E[Y]]

= E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]

= E[XY] − E[X]E[Y]

Lemma:

For independent random variables X and Y,

COV [X, Y ] = 0


Proof:

COV [X, Y ] = E[XY ] − E[X]E[Y ]

For independent X and Y, E[XY] = E[X]E[Y]. Hence COV[X, Y] = 0.

Lemma:

COV [X, X] = V AR[X]

Proof:

COV [X, X] = E[X2] − E[X]2 = V AR[X]

Lemma:

Assume that a is a constant and let X and Y be two random variables. Then

COV [X + a, Y ] = COV [X, Y ]

Proof:

COV [X + a, Y ] = E[(X + a − E[X + a])(Y − E[Y ])]

= E[Y X + Y a − Y E[X + a] − XE[Y ] − aE[Y ] + E[Y ]E[X + a]]

= E[XY ] + aE[Y ] − E[Y ]E[X + a] − E[X]E[Y ] − aE[Y ] + E[Y ]E[X + a]

= E[XY ] − E[X]E[Y ]

= COV [X, Y ]

Lemma:

Let a be a constant and let X and Y be random variables. Then

COV [aX, Y ] = aCOV [X, Y ]

Proof:

COV [aX, Y ] = E[(aX − E[aX])(Y − E[Y ])]

= E[aXY − Y E[aX] − aXE[Y ] + E[aX]E[Y ]]

= aE[XY ] − aE[X]E[Y ] − aE[X]E[Y ] + aE[X]E[Y ]

= aE[XY ] − aE[X]E[Y ]

= a(E[XY ] − E[X]E[Y ])

= aCOV [X, Y ]


Lemma:

For any random variables X, Y and Z we have:

COV [Z + X, Y ] = COV [Z, Y ] + COV [X, Y ]

Proof:

COV [Z + X, Y ] = E[(X + Z − E[X + Z])(Y − E[Y ])]

= E[Y X + Y Z − Y E[X + Z] − XE[Y ] − ZE[Y ] + E[X + Z]E[Y ]]

= E[Y X] + E[Y Z] − E[Y ]E[X + Z] − E[X]E[Y ] − E[Z]E[Y ] + E[X + Z]E[Y ]

since E[A + B] = E[A] + E[B], we get

= E[Y X] + E[Y Z] − E[Y ]E[X] − E[Y ]E[Z] − E[X]E[Y ] − E[Z]E[Y ] + E[X]E[Y ] + E[Z]E[Y ]

= E[Y X] + E[Y Z] − E[X]E[Y ] − E[Z]E[Y ]

= E[Y X] − E[X]E[Y ] + E[Y Z] − E[Z]E[Y ]

= COV [X, Y ] + COV [Z, Y ]

Note that COV [X, Y ] = COV [Y, X].

6.1 Correlation

We define the correlation as follows:

COR[X, Y] = COV[X, Y] / √(VAR[X]·VAR[Y])

One can prove that the correlation is always between −1 and 1. When the variables are independent, the correlation is zero. The correlation is one when Y can be written as Y = a + bX, where a, b are constants such that b > 0. If the correlation is −1, then the variable Y can be written as Y = a + bX, where b is a negative constant and a is any constant. An important property of the correlation is that when we multiply a variable by a positive constant, the correlation does not change: COR[aX, Y] = COR[X, Y]. This implies that a change of units does not affect the correlation.
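The covariance identities and the unit-invariance of correlation can be checked on a small joint distribution (a sketch; the joint table below is invented for illustration):

```python
from math import sqrt

# Joint distribution of (X, Y) on {0,1} x {0,1} (made-up probabilities, sum 1).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def E(f):
    """Expectation of f(x, y) under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - EX * EY                 # COV = E[XY] - E[X]E[Y]
var_x = E(lambda x, y: x * x) - EX ** 2
var_y = E(lambda x, y: y * y) - EY ** 2
cor = cov / sqrt(var_x * var_y)

# Rescaling X by a > 0 scales the covariance but not the correlation.
a = 7.0
cov_a = E(lambda x, y: a * x * y) - a * EX * EY
cor_a = cov_a / sqrt(a * a * var_x * var_y)
assert abs(cov_a - a * cov) < 1e-12   # COV[aX, Y] = a COV[X, Y]
assert abs(cor_a - cor) < 1e-12       # COR[aX, Y] = COR[X, Y]
print(round(cov, 3), round(cor, 3))
```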

7 Chebyshev’s and Markov’s inequalities

Let us first explain the Markov inequality with an example. For this, assume the dividend paid (per share) next year to be a random variable X. Let the expected amount of money paid be equal to E[X] = 2 dollars. Then the probability that the dividend pays more than 100 dollars cannot be more than 2/100 = E[X]/100, since otherwise the expectation would have to be bigger than 2. In other words, for a random variable X which cannot take negative values, the probability that the random variable is at least a is at most E[X]/a. This is the content of the next lemma:


Lemma 7.1 Assume that a > 0 is a constant and let X be a random variable taking on only non-negative values, i.e. P(X ≥ 0) = 1. Then,

P(X ≥ a) ≤ E[X]/a.

Proof. To simplify the notation, we assume that the variable takes on only integer values. The result remains valid otherwise. We have that

E[X] = 0·P(X = 0) + 1·P(X = 1) + 2·P(X = 2) + 3·P(X = 3) + . . .     (7.1)

Note that the sum on the right side of the above equation contains only non-negative terms. If we leave out some of these terms, the value can only decrease or stay equal. We are going to keep just the terms x·P(X = x) for x greater than or equal to a. This way equation 7.1 becomes

E[X] ≥ x_a·P(X = x_a) + (x_a + 1)·P(X = x_a + 1) + (x_a + 2)·P(X = x_a + 2) + . . .     (7.2)

where x_a denotes the smallest natural number which is larger than or equal to a. Note that x_a + i ≥ a for any natural number i. With this we obtain that the right side of 7.2 is larger than or equal to

a·(P(X = x_a) + P(X = x_a + 1) + P(X = x_a + 2) + . . .) = a·P(X ≥ a),

and hence

E[X] ≥ a·P(X ≥ a).

The last inequality above implies:

P(X ≥ a) ≤ E[X]/a.

The inequality given in the last lemma is called the Markov inequality. It is very useful: in many real world situations it is difficult to estimate all the probabilities (the probability distribution) of a random variable. However, it might be easier to estimate the expectation, since that is just one number. If we know the expectation of a random variable, we can at least get upper bounds on the probability of being far away from the expectation.

Let us next present the Chebyshev inequality:

Lemma 7.2 If X is a random variable with expectation E[X] and variance VAR[X], and a > 0 is a non-random number, then

P(|X − E[X]| ≥ a) ≤ VAR[X]/a².


Proof.

Note that |X − E[X]| ≥ a implies (X − E[X])² ≥ a² and vice versa. Hence,

P(|X − E[X]| ≥ a) = P((X − E[X])² ≥ a²).     (7.3)

If Y = (X − E[X])², then Y is a non-negative variable, and

P(|X − E[X]| ≥ a) = P(Y ≥ a²).     (7.4)

Since Y is non-negative, the Markov inequality applies. Hence,

P(Y ≥ a²) ≤ E[Y]/a² = E[(X − E[X])²]/a² = VAR[X]/a².

Using the last chain of inequalities above with equalities 7.3 and 7.4, we find

P(|X − E[X]| ≥ a) ≤ VAR[X]/a².

Let us consider one more example. Assume the total expected claim at the end of next year for an insurance company is 1′000′000$. What is the risk that the insurance company has to pay more than 5′000′000$ as total claim at the end of next year? The answer goes as follows: let Z be the total claim at the end of next year. By the Markov inequality, we find

P(Z ≥ 5′000′000) ≤ E[Z]/5′000′000 = 1/5 = 20%.

Hence, we know that the probability of having to pay more than five million is at most 20%. To derive this, the only information needed was the expectation of Z. When the standard deviation is also available, one can usually get better bounds using the Chebyshev inequality. Assume in the example above that the expected total claim is as before, but let the standard deviation of the total claim be one million. Then we have VAR[Z] = (1′000′000)².

Note that for Z to be above 5′000′000 we need Z − E[Z] to be above 4′000′000. Hence,

P(Z ≥ 5′000′000) = P(Z − E[Z] ≥ 4′000′000) ≤ P(|Z − E[Z]| ≥ 4′000′000).

Using Chebyshev, we get

P(|Z − 1′000′000| ≥ 4′000′000) ≤ VAR[Z]/(4′000′000)² = 1/16 = 0.0625.

It follows that the probability that the total claim is above five million is less than 6.25 percent. This is a lot less than the bound we found using Markov's inequality.
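Both bounds can be checked empirically. The exponential claims model below (mean one million, so VAR = mean², matching the example's numbers) is invented to make the tails concrete:

```python
import random

random.seed(2)

# Non-negative "total claim" model: exponential with mean 1,000,000.
mean = 1_000_000
samples = [random.expovariate(1 / mean) for _ in range(200_000)]

a = 5_000_000
p_exceed = sum(z >= a for z in samples) / len(samples)

markov_bound = mean / a               # E[Z]/a = 0.2
var = mean ** 2                       # for this model, VAR[Z] = mean^2
cheb_bound = var / (a - mean) ** 2    # VAR[Z]/(4,000,000)^2 = 0.0625

# Both bounds hold; for this model the true probability is much smaller still.
assert p_exceed <= markov_bound
assert p_exceed <= cheb_bound
print(round(p_exceed, 4), markov_bound, cheb_bound)
```

This also illustrates the point in the text: the bounds are worst-case guarantees, so the actual tail probability can be far below them.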


8 Combinatorics

Theorem 8.1 Let

Ω = {x1, x2, . . . , xs}

denote the state space of a random experiment. Let each possible outcome have the same probability. Let E ⊂ Ω be an event. Then,

P(E) = (number of outcomes in E)/(total number of outcomes) = |E|/s.

Proof. We know that

P(Ω) = 1.

Now

P(Ω) = P(X ∈ {x1, . . . , xs}) = P({X = x1} ∪ . . . ∪ {X = xs}) = Σ_{t=1,...,s} P(X = xt).

Since all the outcomes have equal probability, we have that

Σ_{t=1,...,s} P(X = xt) = s·P(X = x1).

Thus,

P(X = x1) = 1/s.

Now if

E = {y1, . . . , yj} ⊂ Ω,

we find that:

P(E) = P(X ∈ E) = P({X = y1} ∪ . . . ∪ {X = yj}) = Σ_{i=1}^{j} P(X = yi) = j/s,

which finishes the proof. Next we present one of the main principles used in combinatorics:

Lemma 8.1 Let m1, m2, . . . , mr denote a given finite sequence of natural numbers. Assume that we have to make a sequence of r choices. At the s-th choice, assume that we have ms possibilities to choose from. Then the total number of possibilities is:

m1 · m2 · . . . · mr

Why this lemma holds can best be understood by thinking of a tree where, at each node reached after the first s − 1 choices, we have ms new branches.


Example 8.1 Assume we first throw a coin with a side 0 and a side 1. Then we throw a four sided die. Eventually we throw the coin again. For example, we could get the number 031. How many different numbers are there which we could get? The answer is: first we have two possibilities, for the second "choice" we have four, and eventually we have again two. Thus, m1 = 2, m2 = 4, m3 = 2. This implies that the total number of possibilities is:

m1 · m2 · m3 = 2 · 4 · 2 = 16.

Recall that the product of all natural numbers which are less than or equal to k is denoted by k!. The number k! is called k-factorial.

Lemma 8.2 There are k! possibilities to put k different objects in a linear order. Thus there are k! permutations of k elements.

To see why the last lemma holds, we use Lemma 8.1. To place k different objects in a row, we first choose the first object which we will place down. For this we have k possibilities. For the second object, there remain k − 1 objects to choose from. For the third, there are k − 2 possibilities to choose from. And so on and so forth. This then gives that the total number of possibilities is equal to k·(k − 1)·. . .·2·1.

Lemma 8.3 There are

n!/(n − k)!

possibilities to pick k out of n different objects, when the order in which we pick them matters.

For the first object, we have n possibilities. For the second object we pick, we have n − 1 remaining objects to choose from. For the last object which we pick (that is, the k-th which we pick), we have n − k + 1 remaining objects to choose from. Thus the total number of possibilities is equal to:

n·(n − 1)·. . .·(n − k + 1),

which is equal to

n!/(n − k)!.

The number n!/(n − k)! is also equal to the number of words of length k written with an n-letter alphabet, when we require that the words never contain the same letter twice.

Lemma 8.4 There are

n!/(k!(n − k)!)

subsets of size k in a set of size n.


The reason why the last lemma holds is the following: there are k! ways of putting a given subset of size k into different orders. Thus, there are k! times more ways to pick k elements in order than there are subsets of size k.

Lemma 8.5 There are

2^n

subsets of any size in a set of size n.

The reason why the last lemma above holds is the following: we can identify the subsets with binary vectors with n entries. For example, let n = 5. Let the set we consider be {1, 2, 3, 4, 5}. Take the binary vector:

(1, 1, 1, 0, 0).

This vector corresponds to the subset containing the first three elements of the set, thus to the subset

{1, 2, 3}.

So, for every non-zero entry in the vector we pick the corresponding element of the set. It is clear that this correspondence between subsets of a set of size n and binary vectors of dimension n is one-to-one. Thus, there are as many subsets as there are binary vectors of length n. The total number of binary vectors of dimension n, however, is 2^n.
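The counting lemmas of this section can be verified directly with Python's standard library (a short sketch with n = 5, k = 2):

```python
from itertools import chain, combinations, permutations
from math import comb, factorial

n, k = 5, 2
items = list(range(1, n + 1))

# Lemma 8.2: k! orderings of k objects.
assert len(list(permutations(range(k)))) == factorial(k)

# Lemma 8.3: n!/(n-k)! ordered picks of k out of n.
assert len(list(permutations(items, k))) == factorial(n) // factorial(n - k)

# Lemma 8.4: n!/(k!(n-k)!) subsets of size k.
assert len(list(combinations(items, k))) == comb(n, k)

# Lemma 8.5: 2^n subsets of any size.
all_subsets = list(chain.from_iterable(combinations(items, r) for r in range(n + 1)))
assert len(all_subsets) == 2 ** n

print(factorial(k), factorial(n) // factorial(n - k), comb(n, k), 2 ** n)
```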

9 Important discrete random variables

9.1 Bernoulli variable

Let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. Let X designate the random number we obtain when we flip this coin. Thus, with probability p the random variable X takes on the value 1 and with probability 1 − p it takes on the value 0. The random variable X is called a Bernoulli variable with parameter p. It is named after the famous Swiss mathematician Bernoulli. For a Bernoulli variable X with parameter p we have:

• E[X] = p.

• V AR[X] = p(1 − p).

Let us show this:

E[X] = 1·p + 0·(1 − p) = p.

For the variance we find:

V AR[X] = E[X2] − (E[X])2 = 12 · p + 02 · (1 − p) − (E[X])2 = p − p2 = p(1 − p).


9.2 Binomial random variable

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently n times and count the number of 1's observed. The number Z of 1's observed after n coin-tosses is equal to

Z := X1 + X2 + . . . + Xn,

where Xi designates the result of the i-th toss. (Hence the Xi's are independent Bernoulli variables with parameter p.) The random variable Z is called a binomial variable with parameters n and p. For the binomial random variable we find:

• E[Z] = np

• V AR[Z] = np(1 − p)

• For k ≤ n, we have: P(Z = k) = (n choose k)·p^k·(1 − p)^{n−k}.

Let us show the above statements:

E[Z] = E[X1 + . . . + Xn] = E[X1] + . . . + E[Xn] = n·E[X1] = n·p.

Also:

VAR[Z] = VAR[X1 + . . . + Xn] = VAR[X1] + . . . + VAR[Xn] = n·VAR[X1] = n·p(1 − p).

Let us next calculate the probability P(Z = k). We start with an example. Take n = 3 and k = 2. We want to calculate the probability of observing exactly two ones among the first three coin tosses. To observe exactly two ones out of three successive trials there are exactly three possibilities:

• Let A be the event: X1 = 1, X2 = 1, X3 = 0

• Let B be the event: X1 = 1, X2 = 0, X3 = 1

• Let C be the event: X1 = 0, X2 = 1, X3 = 1.

Each of these possibilities has probability p2(1 − p). As a matter of fact, since the trialsare independent we have for example:

P (X1 = 1, X2 = 1, X3 = 0) = P (X1 = 1)P (X2 = 1)P (X3 = 0) = p2(1 − p).

The three different possibilities are disjoint of each other. Thus,

P (Z = 2) = P (A ∪ B ∪ C) = P (A) + P (B) + P (C) = 3p2(1 − p).

Here 3 is the number of realizations where we have exactly two ones within the first three coin tosses. This is equal to the number of different ways there are to choose two objects out of three items. In other words, the number three stands in our formula for "3 choose 2".
We can now generalize to n trials and a number k ≤ n. There are "n choose k" possible outcomes for which among the first n coin tosses there appear exactly k ones. Each of these outcomes has probability:

p^k·(1 − p)^{n−k}.

This then gives:

P(Z = k) = (n choose k)·p^k·(1 − p)^{n−k}.
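The binomial formula can be checked against a brute-force enumeration of all 0/1 toss sequences (a sketch; n = 5 and p = 0.3 are arbitrary choices):

```python
from itertools import product
from math import comb

n, p = 5, 0.3

for k in range(n + 1):
    formula = comb(n, k) * p ** k * (1 - p) ** (n - k)
    # Brute force: add up the probability of every sequence with exactly k ones.
    brute = sum(
        p ** sum(seq) * (1 - p) ** (n - sum(seq))
        for seq in product((0, 1), repeat=n)
        if sum(seq) == k
    )
    assert abs(formula - brute) < 1e-12

print("binomial formula matches enumeration for n =", n)
```

The enumeration is exactly the argument in the text: each sequence with k ones has probability p^k (1 − p)^{n−k}, and there are "n choose k" of them.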

9.3 Geometric random variable

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently many times. Let Xi designate the result of the i-th coin-toss. Let T designate the number of trials it takes until we first observe a 1. For example, if we have:

X1 = 0, X2 = 0, X3 = 1

we would have that T = 3. If we observe on the other hand:

X1 = 0, X2 = 1

we have that T = 2. T is a random variable. As we are going to show, we have:

• For k > 0, we have P (T = k) = p(1 − p)k−1.

• E[T ] = 1/p

• V AR[T ] = (1 − p)/p2

A random variable T for which P(T = k) = p(1 − p)^{k−1} for all k ∈ N is called a geometric random variable with parameter p. Let us next prove the above statements. For T to be equal to k, we need to observe k − 1 times a zero followed by a one. Thus:

P (T = k) = P (X1 = 0, X2 = 0, . . . , Xk−1 = 0, Xk = 1) =

P (X1 = 0) · P (X2 = 0) · . . . · P (Xk−1 = 0) · P (Xk = 1) = (1 − p)k−1p.

Let us calculate the expectation of T. We find:

E[T] = Σ_{k=1}^{∞} k·p(1 − p)^{k−1}.

Let f(x) be the function:

x ↦ f(x) = Σ_{k=1}^{∞} k·x^{k−1}.

We have that

f(x) = Σ_{k=1}^{∞} d(x^k)/dx = d( Σ_{k=1}^{∞} x^k )/dx = d( x/(1 − x) )/dx = 1/(1 − x) + x/(1 − x)² = 1/(1 − x)².     (9.1)

This shows that

Σ_{k=1}^{∞} k(1 − p)^{k−1} = f(1 − p) = 1/p².

Thus,

E[T] = p·( Σ_{k=1}^{∞} k(1 − p)^{k−1} ) = p·(1/p²) = 1/p.

Let us next calculate the variance of a geometric random variable. We find:

E[T^2] = Σ_{k=1}^∞ k^2 p(1 − p)^{k−1}.

Let g(.) be the map:

x ↦ g(x) = Σ_{k=1}^∞ k^2 x^{k−1}.

We find:

g(x) = Σ_{k=1}^∞ k d(x^k)/dx = d( x Σ_{k=1}^∞ k x^{k−1} )/dx.

Using equation 9.1, we find:

g(x) = d( x/(1 − x)^2 )/dx = (1 + x)/(1 − x)^3.

This implies that

E[T^2] = p g(1 − p) = (2 − p)/p^2.

Now,

VAR[T] = E[T^2] − (E[T])^2 = (2 − p)/p^2 − (1/p)^2 = (1 − p)/p^2.
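The formulas E[T] = 1/p and VAR[T] = (1 − p)/p^2 can be verified by simulating the coin-tossing experiment directly. A minimal sketch; the helper name and parameter values are my own choices:

```python
import random

def sample_geometric(p, rng):
    # Toss a p-coin until the first 1 appears; return the number of tosses T.
    t = 1
    while rng.random() >= p:
        t += 1
    return t

rng = random.Random(1)
p = 0.25
samples = [sample_geometric(p, rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean)  # should be close to 1/p = 4
print(var)   # should be close to (1 - p)/p^2 = 12
```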

10 Continuous random variables

So far we have only been studying discrete random variables. Let us see how continuous random variables are defined.


Definition 10.1 Let X be a number "generated by a random experiment". (Such a random number is also called a random variable.) X is a continuous random variable if there exists a non-negative piecewise continuous function

f : R → R+, x ↦ f(x),

such that for any interval I = [i1, i2] ⊂ R we have that:

P(X ∈ I) = ∫_I f(x) dx.

The function f(.) is called the density function of X or simply the density of X.

Note that the notation ∫_I f(x) dx stands for:

∫_I f(x) dx = ∫_{i1}^{i2} f(x) dx.   (10.1)

Recall also that integrals like the one appearing in equation 10.1 are defined to be equal to the area under the curve f(.) and above the interval I.

Remark 10.1 Let f(.) be a piecewise continuous function from R into R. Then there exists a continuous random variable X such that f(.) is the density of X, if and only if all of the following conditions are satisfied:

1. f is everywhere non-negative.

2. ∫_R f(x) dx = 1.

Let us next give some important examples of continuous random variables:

• The uniform variable on the interval I = [i1, i2], where i1 < i2. The density f(.) is equal to 1/(i2 − i1) everywhere in the interval I. Anywhere outside the interval I, f(.) is equal to zero.

• The standard normal variable has density:

f(x) := (1/√(2π)) e^{−x^2/2}.

A standard normal random variable is often denoted by N(0, 1).

• Let µ ∈ R, σ > 0 be given numbers. The density of the normal variable with expectation µ and standard deviation σ is defined to be equal to:

f(x) := (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)}.


11 Normal random variables

The distribution of a normal random variable is given by the probability density:

f(x) = (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)}.

Hence there are two parameters which determine a normal distribution: µ and σ. We write N(µ, σ) for a normal variable with parameters µ and σ.
If we analyze the density function f(x), we see that for any value a, we have f(µ + a) = f(µ − a). Hence the function f(.) is symmetric about the point µ. This implies that the expected value has to be µ:

E[N (µ, σ)] = µ.

One could also show this by verifying that

E[N(µ, σ)] = ∫ x (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)} dx = µ.

By integration by parts, one can show that

∫ (x − µ)^2 (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)} dx = σ^2

and hence the variance is σ^2:

VAR[N(µ, σ)] = σ^2.

Note that the function f(x) decreases as we move away from µ: the shape is a bell curve with maximum at µ and width σ. (Go onto the Internet to see the graph of a normal density plotted.)

Let us give next a few very useful facts about normal variables:

• Let a and b be two constants such that a ≠ 0. Let X be a normal variable with parameters µX and σX. Let Y be the random variable defined by an affine transformation of X in the following way: Y := aX + b. Then Y is also normal. The parameters of Y are

E[Y] = µY = aµX + b

and

σY = |a|σX.

This we obtain simply from the fact that these parameters are the expectation and standard deviation of their respective variables.

• Let X and Y be normal variables independent of each other. Let Z := X + Y. Then Z is also normal. The same result is true for sums of more than two independent normal variables.


• If X is normal, then

Z := (X − E[X]) / √(VAR[X])

is a standard normal.

For the last point above, note that for any random variable X (not necessarily normal), if Z = (X − E[X])/σX, then Z has expectation zero and standard deviation 1. This is a simple, straightforward calculation:

E[Z] = E[ (X − E[X]) / σX ] = (1/σX) (E[X] − E[E[X]]),   (11.1)

but since E[E[X]] = E[X], equality 11.1 implies that E[Z] = 0. Also

VAR[Z] = VAR[ (X − E[X]) / σX ] = VAR[X]/σX^2 = 1.

Now if X is normal, then we saw that Z = (X − E[X])/σX is also normal, since Z is just obtained from X by multiplying and adding constants. But Z has expectation 0 and standard deviation 1, and hence it is standard normal.
One can use normal variables to model financial processes and many other things. Let us consider an example. Assume that a portfolio consists of three stocks. Let Xi denote the value of stock number i one year from now. We assume that the values of the three stocks in the portfolio are all independent of each other and normally distributed, so that µi = E[Xi] and σi = √(VAR[Xi]) for i = 1, 2, 3. Let

µ1 = 100, µ2 = 110, µ3 = 120

and let

σ1 = 10, σ2 = 20, σ3 = 20.

The value of the portfolio after one year is Z = X1 + X2 + X3 and E[Z] = E[X1] + E[X2] + E[X3] = 330.
Question: What is the probability that the value of the portfolio after a year is above 360?
Answer: We have that

VAR[Z] = VAR[X1] + VAR[X2] + VAR[X3] = 100 + 400 + 400 = 900

and hence

σZ = √(VAR[Z]) = 30.

We are now going to calculate

P(Z ≥ 360).

For this we want to transform the probability into a probability involving a standard normal, since for the standard normal we have tables available. We find

P(Z ≥ 360) = P( (Z − E[Z])/σZ ≥ (360 − E[Z])/σZ ).   (11.2)


Note that

(360 − E[Z])/σZ = 1

and also (Z − E[Z])/σZ is standard normal. Using this in equation 11.2, we find that the probability that the portfolio after a year is above 360 is equal to

P(Z ≥ 360) = P(N(0, 1) ≥ 1) = 1 − Φ(1),

where Φ(1) = P(N(0, 1) ≤ 1) = 0.8413 can be found in a table for the standard normal.
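Instead of a printed table, Φ can be evaluated with the error function, which gives the portfolio probability directly. A sketch; the helper `phi` is my naming, not from the notes:

```python
import math

def phi(z):
    # Standard normal distribution function: Phi(z) = P(N(0,1) <= z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu = 100 + 110 + 120                       # E[Z] = 330
sigma = math.sqrt(10**2 + 20**2 + 20**2)   # sqrt(900) = 30

# P(Z >= 360) = P(N(0,1) >= (360 - 330)/30) = 1 - Phi(1)
p_above = 1 - phi((360 - mu) / sigma)
print(round(p_above, 4))  # 1 - 0.8413 = 0.1587
```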

12 Distribution functions

Definition 12.1 Let X be a random variable. The distribution function FX : R → R of X is defined in the following way:

FX(s) := P (X ≤ s)

for all s ∈ R.

Let us next mention a few properties of the distribution function:

• FX is an increasing function. This means that for any two numbers s < t in R, wehave that FX(s) ≤ FX(t).

• lims→∞ FX(s) = 1

• lims→−∞ FX(s) = 0

We leave the proof of the three facts above to the reader.
Imagine next that X is a continuous random variable with density function fX. Then we have, for all s ∈ R, that:

FX(s) = P(X ≤ s) = ∫_{−∞}^{s} fX(t) dt.

Differentiating both sides of the above equation with respect to s, we find that:

dFX(s)/ds = fX(s).

In other words, for a continuous random variable X, the derivative of the distribution function is equal to the density of X. Hence, in this case, the distribution function is differentiable and thus also continuous. Another implication is: the distribution function uniquely determines the density function fX. This implies that the distribution function uniquely determines all the probabilities of events which can be defined in terms of X.

Assume next that the random variable X has a finite state space:

ΩX = {s1, s2, . . . , sr}


such that s1 < s2 < . . . < sr. Then the distribution function FX is a step function. Left of s1, FX is equal to zero. Right of sr, it is equal to one. Between si and si+1, that is on the interval [si, si+1[, the distribution function is constantly equal to:

Σ_{j≤i} P(X = sj).

(This holds for all i between 1 and r − 1.)
To sum up: for continuous random variables the distribution functions are differentiable functions, whilst for discrete random variables the distribution functions are step functions. Let us next show how we can use the distribution function to simulate random variables. The situation is the following: our computer can generate a uniform random variable U in the interval [0, 1]. (This is a random variable with density equal to 1 on [0, 1] and 0 everywhere else.) We want to generate a random variable with a given probability density function fX, using U. We do this in the following manner: we plug the random number U into the map invFX. (Here invFX designates the inverse map of FX(.).) The next lemma says that this method really produces a random variable with the desired density function.

Lemma 12.1 Let fX denote the density function of a continuous random variable and let FX designate its distribution function. Let Y designate the random variable obtained by plugging the uniform random variable U into the inverse distribution function:

Y := invFX(U).

Then the density of Y is equal to fX.

Proof. FX(.) is an increasing function. Thus for any number s we have that

Y ≤ s

is equivalent to

FX(Y) ≤ FX(s).

Hence:

P(Y ≤ s) = P(FX(Y) ≤ FX(s)).

Now, FX(Y) = U, thus

P(Y ≤ s) = P(U ≤ FX(s)).

We know that FX(s) ∈ [0, 1]. Using the fact that U has density function equal to one on the interval [0, 1], we find:

P(U ≤ FX(s)) = ∫_0^{FX(s)} 1 dt = FX(s).

Thus

P(Y ≤ s) = FX(s).


This shows that the distribution function FY of Y is equal to FX. Taking the derivative with respect to s of both FY(s) and FX(s) yields:

fY(s) = fX(s).

Hence, X and Y have the same density function. This finishes the proof.
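Lemma 12.1 is exactly the "inverse transform" method used in practice. A sketch for a case where invFX has a closed form, the exponential density fX(s) = λe^{−λs} on s ≥ 0 (this specific density is my example, not from the notes):

```python
import math
import random

def inv_F_exponential(u, lam):
    # Inverse of F(s) = 1 - exp(-lam * s), the exponential distribution function.
    return -math.log(1 - u) / lam

rng = random.Random(42)
lam = 2.0
samples = [inv_F_exponential(rng.random(), lam) for _ in range(200_000)]

# The empirical distribution function should match F(s) = 1 - exp(-lam * s).
s = 0.5
empirical = sum(x <= s for x in samples) / len(samples)
print(empirical, 1 - math.exp(-lam * s))
```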

13 Expectation and variance for continuous random variables

Definition 13.1 Let X be a continuous random variable with density function fX(.). Then we define the expectation E[X] of X to be:

E[X] := ∫_{−∞}^{∞} s fX(s) ds.

Next we are going to prove that the law of large numbers also holds for continuous random variables.

Theorem 13.1 Let X1, X2, . . . be a sequence of i.i.d. continuous random variables, all with the same density function fX(.). Then,

lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1].

Proof. Let ∆ > 0 be a fixed number. Let us approximate the continuous variables Xi by discrete variables X∆i. For this we let X∆i be the largest integer multiple of ∆ which is still smaller than or equal to Xi. In this way, we always get that

|X∆i − Xi| < ∆.

This implies that:

| (X1 + X2 + . . . + Xn)/n − (X∆1 + X∆2 + . . . + X∆n)/n | < ∆.

However, the variables X∆i are discrete. So for them the law of large numbers has already been proven, and we find:

lim_{n→∞} (X∆1 + X∆2 + . . . + X∆n)/n = E[X∆1].   (13.1)

We have that

E[X∆i] = Σ_{z∈Z} z∆ · P(X∆i = z∆).


However, by definition:

P(X∆i = z∆) = P(Xi ∈ [z∆, (z + 1)∆[).

The expression on the right side of the last equality is equal to

∫_{z∆}^{(z+1)∆} fX(s) ds.

Thus

E[X∆i] = Σ_{z∈Z} z∆ ∫_{z∆}^{(z+1)∆} fX(s) ds.

As ∆ tends to zero, the expression on the right side of the last equality above tends to

∫_{−∞}^{∞} s fX(s) ds.

This implies that by taking ∆ fixed and sufficiently small, we have that, for large enough n, the fraction

(X1 + X2 + . . . + Xn)/n

is as close as we want to

∫_{−∞}^{∞} s fX(s) ds.

This implies that (X1 + X2 + . . . + Xn)/n actually converges to

∫_{−∞}^{∞} s fX(s) ds.
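Theorem 13.1 can also be observed numerically: averages of i.i.d. Uniform[0,1] samples approach E[X1] = ∫ s · 1 ds = 1/2. A small sketch, with arbitrarily chosen sample sizes:

```python
import random

rng = random.Random(7)

def running_average(n):
    # Average of n i.i.d. Uniform[0,1] draws; should approach E[X1] = 1/2.
    return sum(rng.random() for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, running_average(n))
```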

The linearity of expectation holds in the same way as for discrete random variables. This is the content of the next lemma.

Lemma 13.1 Let X and Y be two continuous random variables and let a be a number. Then

E[X + Y] = E[X] + E[Y]

and

E[aX] = aE[X].

Proof. The proof goes like in the discrete case: the only thing used for the proof in the discrete case is the law of large numbers. Since the law of large numbers also holds in the continuous case, exactly the same proof works for the continuous case.


14 Central limit theorem

The Central Limit Theorem (CLT) is one of the most important theorems in probability. Roughly speaking, it says that if we build the sum of many independent random variables, no matter what these little contributions are, we will always get approximately a normal distribution. This is very important in everyday life, because oftentimes you have situations where a lot of little independent "things" add up, so you end up observing something which is approximately a normal random variable. For example, when you make a measurement you are most of the time in this situation, provided you don't make one big measurement error. In that case, you have a lot of little imprecisions which add up to give you your measurement error. Most of the time, these imprecisions can be seen as close to being independent of each other. This then implies: unless you make one big error, you will always end up having your measurement error close to a normal variable.
Let X1, X2, X3, . . . be a sequence of independent, identically distributed random variables. (This means that they are the outcomes of the same random experiment repeated several times independently.) Let µ denote the expectation µ := E[X1] and let σ denote the standard deviation σ := √(VAR[X1]). Let Z denote the sum

Z := X1 + X2 + X3 + . . . + Xn.

Then, by the calculation rules we learned for expectation and variance it follows that:

E[Z] = nµ

and the standard deviation σZ of Z is equal to:

σZ = √n σ.

When you subtract from a random variable its mean and divide by the standard deviation, you always get a new variable with zero expectation and variance equal to one. Thus the "standardized" sum

(Z − nµ)/(√n σ)

has expectation zero and standard deviation 1. The central limit theorem says that, on top of this, for large n the expression

(Z − nµ)/(√n σ)

is close to being a standard normal variable. Let us now formulate the central limit theorem:

Theorem 14.1 Let

X1, X2, X3, . . .

be a sequence of independent, identically distributed random variables. Then we have that, for large n, the normalized sum

Y := (X1 + . . . + Xn − nµ)/(σ√n)

is close to being a standard normal random variable.

This version of the Central Limit Theorem is not yet very precise. As a matter of fact, what does "close to being a standard normal random variable" mean? We certainly understand what it means for two points to be close to each other. But we have not yet discussed the concept of closeness for random variables. Let us do this by using the example of a six-sided die. Assume that we have a six-sided die which is not perfectly symmetric. For i ∈ {1, 2, . . . , 6}, let pi denote the probability of side i:

P(X = i) = pi,

where X denotes the number which we get when we throw this die once. A perfectly symmetric die would have all the probabilities pi equal to 1/6. Say our die is not exactly symmetric, but close to a perfectly symmetric die. What does this mean? It means that for all i ∈ {1, 2, . . . , 6} we have that pi is close to 1/6.
For the die example we have a finite number of outcomes. For a continuous random variable, on the other hand, we are interested in the probabilities of intervals. By this I mean that we are interested, for a given interval I, in the probability that the random experiment gives a result in I. If X denotes our continuous random variable, this means that we are interested in probabilities of the type:

P (X ∈ I).

We are now ready to explain what we mean by "two continuous random variables X and Y have their probability laws close to each other". By X and Y are close (have probability laws which are close to each other) we mean: for each interval I we have that the real number P(Y ∈ I) is close to the real number P(X ∈ I). For the interval I = [i1, i2] with i1 < i2, we have that

P(X ∈ I) = P(X ≤ i2) − P(X < i1).

It follows that if we know all the probabilities for semi-infinite intervals, we can determine the probabilities of type P(X ∈ I). Thus, for two continuous random variables X and Y to be close to each other (with respect to their probability law), it is enough to ask that for all x ∈ R the real number P(X ≤ x) is close to the real number P(Y ≤ x).
Now that we have clarified the concept of closeness in distribution for continuous random variables, we are ready to formulate the CLT in a more precise way. Saying that

Z := (X1 + . . . + Xn − nµ)/(σ√n)


is close to a standard normal random variable N(0, 1) means that for every z ∈ R we have that

P(Z ≤ z)

is close to

P(N(0, 1) ≤ z).

In other words, as n goes to infinity, P(Z ≤ z) converges to P(N(0, 1) ≤ z). Let us give a more precise version of the CLT than what we have stated so far:

Theorem 14.2 Let

X1, X2, X3, . . .

be a sequence of independent, identically distributed random variables. Let E[X1] = µ and √(VAR[X1]) = σ. Then, for any z ∈ R, we have that:

lim_{n→∞} P( (X1 + . . . + Xn − nµ)/(σ√n) ≤ z ) = P(N(0, 1) ≤ z).
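Theorem 14.2 can be illustrated by simulation: standardize sums of i.i.d. Uniform[0,1] variables (µ = 1/2, σ = 1/√12) and compare their empirical distribution function with Φ. A sketch, with n and the repetition count chosen arbitrarily:

```python
import math
import random

def phi(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

rng = random.Random(3)
n, reps = 30, 100_000
mu, sigma = 0.5, 1 / math.sqrt(12)

def standardized_sum():
    # (X1 + ... + Xn - n*mu) / (sigma * sqrt(n)) for uniform X_i
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

zs = [standardized_sum() for _ in range(reps)]
for z in (-1.0, 0.0, 1.0):
    empirical = sum(v <= z for v in zs) / reps
    print(z, empirical, phi(z))  # the two numbers should nearly agree
```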

15 Statistical testing

Let us first give an example. Assume that you read in the newspaper that 50% of the population in Atlanta smokes. You don't believe that number, so you start a survey. You ask 100 randomly chosen people, and find that 70 out of the hundred smoke. Now you want to know if the result of your survey constitutes strong evidence against the 50% claimed by the newspaper.
If the true percentage of the population of Atlanta which smokes were 50%, you would expect to find in your survey a number closer to 50 people. However, it could be that although the true percentage is 50%, you still observe a figure as high as 70, just by chance. So the procedure is the following: determine the probability of getting 70 people or more in your survey who smoke, given that the percentage really is 50%. If that probability is very small, you decide to reject the idea that 50% of the population in Atlanta smokes. In general one takes a fixed level α > 0 and rejects the idea one wants to test if the probability is smaller than α. Most of the time statisticians work with α equal to 0.05 or 0.1. So, if the probability of getting 70 people or more in our survey who smoke (computed given that 50% of the population smokes) is smaller than α = 0.05, then statisticians will say: we reject the hypothesis that 50% of the population in Atlanta smokes. We do this on the confidence level α = 0.05, based on the evidence of our survey.
How do we calculate the probability of observing 70 or more people in our survey who smoke if the percentage really were 50% of the Atlanta population? For this it is important how we choose the people for our survey. The correct way to choose them is the following: take a complete list of the inhabitants of Atlanta, number them, and choose 100 of them with replacement and with equal probability. This means that a person could appear twice.
Let Xi be equal to one if the i-th person chosen is a smoker. Then, if we choose the people


following the procedure above, we find that the Xi's are i.i.d. and that P(Xi = 1) = p, where p designates the true percentage of people in Atlanta who smoke. Then also E[Xi] = p. The total number of people in our survey who smoke, Z, can now be expressed as

Z := X1 + X2 + . . . + X100.

Let P50%(.) designate the probability given that the true percentage of smokers really is 50%. Testing whether 50% of Atlanta smokes can now be described as follows:

• Calculate the probability:

P50%(X1 + . . . + X100 ≥ 70).

• If the above probability is smaller than α = 0.05, we reject the hypothesis that 50% of the population in Atlanta smokes (we reject it on the α = 0.05 level). Otherwise, we keep the hypothesis. When we keep the hypothesis, this means that the result of our survey does not constitute strong evidence against the hypothesis: the result of the survey does not "contradict" it.

Note that we could also have done the test on the α = 0.1 level. In that case we would reject the hypothesis if the probability is smaller than 0.1.
Next we explain how we can approximately calculate the probability P50%(Z ≥ 70), using the CLT. Simply note that, by basic algebra, the inequality

Z ≥ 70

is equivalent to

Z − nµ ≥ 70 − nµ,

which is itself equivalent to:

(Z − nµ)/(σ√n) ≥ (70 − nµ)/(σ√n).

Equivalent inequalities must also have the same probability. Hence:

P50%(Z ≥ 70) = P50%(Z − nµ ≥ 70 − nµ) = P50%( (Z − nµ)/(σ√n) ≥ (70 − nµ)/(σ√n) ).   (15.1)

By the CLT we have that

(Z − nµ)/(σ√n)

is close to being a standard normal random variable N(0, 1). Thus, the probability on the right side of equation 15.1 is approximately equal to

P( N(0, 1) ≥ (70 − nµ)/(σ√n) ).   (15.2)

If the probability in expression 15.2 is smaller than 0.05, then we reject the hypothesis that 50% of the Atlanta population smokes (on the α = 0.05 level). We can look up the probability that the standard normal N(0, 1) is smaller than the number (70 − nµ)/(σ√n) in a table. We have tables for the standard normal variable N(0, 1).
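For the survey example the numbers work out as follows: under the 50% hypothesis, µ = E[Xi] = 0.5 and σ = √(0.5 · 0.5) = 0.5, so (70 − nµ)/(σ√n) = 20/5 = 4. A sketch of the calculation (`phi` is my helper for Φ, computed with the error function rather than a table):

```python
import math

def phi(z):
    # Phi(z) = P(N(0,1) <= z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 100
mu, sigma = 0.5, 0.5   # under the hypothesis p = 50%: E[X_i] = 0.5, sd = 0.5
z = (70 - n * mu) / (sigma * math.sqrt(n))
p_value = 1 - phi(z)

print(z)        # (70 - 50) / 5 = 4.0
print(p_value)  # about 3.2e-05, far below alpha = 0.05 -> reject
```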


15.1 Looking up probabilities for the standard normal in a table

Let z ∈ R. Let φ(z) denote the probability that a standard normal variable is less than or equal to z. Thus:

φ(z) := P(N(0, 1) ≤ z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x^2/2} dx.

For example, let z > 0 be a number. Say we want to find the probability

P(N(0, 1) ≥ z).   (15.3)

The table for the standard normal gives the values of φ(z) for z > 0, so we have to try to express probability 15.3 in terms of φ(z). For this, note that:

P(N(0, 1) ≥ z) = 1 − P(N(0, 1) < z).

Furthermore, P(N(0, 1) < z) is equal to P(N(0, 1) ≤ z) = φ(z). Thus we find that:

P(N(0, 1) ≥ z) = 1 − φ(z).

Let us next explain how, if z < 0, we can find the probability:

P (N (0, 1) ≤ z).

Note that N (0, 1) is symmetric around the origin. Thus,

P (N (0, 1) ≤ z) = P (N (0, 1) ≥ |z|).

This brings us back to the previously studied case. We find

P (N (0, 1) ≤ z) = 1 − φ(|z|).

Finally, let z > 0 again. What is the probability

P(−z ≤ N(0, 1) ≤ z)

equal to? For this problem, note that

P (−z ≤ N (0, 1) ≤ z) = 1 − P (N (0, 1) ≥ z) − P (N (0, 1) ≤ −z).

Thus, we find that:

P (−z ≤ N (0, 1) ≤ z) = 1 − (1 − φ(z)) − (1 − φ(z)) = 2φ(z) − 1.
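The three table-lookup identities of this subsection can be checked mechanically once φ is available through the error function. A sketch:

```python
import math

def phi(z):
    # phi(z) = P(N(0,1) <= z), computed via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.5
right_tail = 1 - phi(z)        # P(N(0,1) >= z)
left_tail = 1 - phi(abs(-z))   # P(N(0,1) <= -z), by symmetry
central = 2 * phi(z) - 1       # P(-z <= N(0,1) <= z)

print(right_tail, left_tail, central)
```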


15.2 Two sample testing

Let us give an example to introduce this subject. Assume that we are testing a new fuel for a certain type of rocket. We would like to know if the new fuel gives a different initial velocity to the rocket. The initial velocity with the old fuel is denoted by µX, whilst µY is the initial velocity with the new fuel. We fire the rocket five times with the old fuel and measure the initial velocity each time. We find:

X1 = 100, X2 = 102, X3 = 97, X4 = 100, X5 = 101 (15.4)

(here Xi denotes the initial velocity measured whilst firing the rocket for the i-th time with the old fuel). Then we fire the rocket five times with the new fuel. Every time we measure the initial velocity. We find

Y1 = 101, Y2 = 103, Y3 = 99, Y4 = 102, Y5 = 100 (15.5)

We calculate the averages:

X̄ := (X1 + X2 + X3 + X4 + X5)/5 = 100

and

Ȳ := (Y1 + Y2 + Y3 + Y4 + Y5)/5 = 101.

When we measure the initial velocities, we find different values even when we use the same fuel. The reason is that our measurement instruments are not very precise, so we get the true value plus a measurement error. The model is as follows:

Xi = µX + ϵXi

and

Yi = µY + ϵYi.

Furthermore, ϵX1, ϵX2, . . . are i.i.d. random errors, and so are ϵY1, ϵY2, . . .. We assume that the measurement instrument is well calibrated, so that

E[ϵXi] = E[ϵYi] = 0

for all i = 1, 2, . . .. Here µX and µY are unknown constants (in our example µX is the initial speed when we use the old fuel, whilst µY is the initial speed when we use the new fuel). We find that

E[Xi] = E[µX + ϵXi] = E[µX] + E[ϵXi] = µX + 0 = µX,

and similarly

E[Yi] = µY.


So our testing problem can be described as follows: we want to figure out, based on our data 15.4 and 15.5, whether the second fuel gives a different initial speed than the old fuel. We observed

Ȳ − X̄ = 1 > 0.

This means that in the second sample, obtained with the new fuel, the initial speed is higher by one unit on average than the initial speed in the first sample, obtained with the old fuel. But is this evidence enough to conclude that the new fuel provides higher initial speed, or could this difference just be due to the measurement errors? As a matter of fact, since we make measurement errors, it could be that even if the second fuel does not provide higher initial speed (i.e. µX = µY), due to the random errors and "bad luck" the second average is higher than the first. In our present setting we can never be absolutely sure, but we try to see if there is statistically significant evidence for arguing that µX and µY are not equal.
The exact method to do this depends on whether we know the standard deviation of the errors or not, and on whether it is identical for the two samples. We will need the expectation and standard deviation of the means. This is what we calculate in the next paragraph.

Expectation and standard deviation of the means. Let the standard deviations of the errors be denoted by

σX := √(VAR[ϵXi]),  σY := √(VAR[ϵYi]).

Let Z := Ȳ − X̄. We find that the standard deviation of Z is given by

σZ = σ_{Ȳ−X̄} = √(VAR[Ȳ − X̄]) = √(VAR[Ȳ] + VAR[X̄]),   (15.6)

where the last equality above was obtained using the facts that the X's and the Y's are independent of each other, and that the variance of a sum of independent variables is equal to the sum of the variances. Now

VAR[X̄] = VAR[ (X1 + . . . + Xn)/n ] = VAR[X1 + . . . + Xn]/n^2 = (VAR[X1] + VAR[X2] + . . . + VAR[Xn])/n^2 = nVAR[X1]/n^2 = VAR[X1]/n = σX^2/n

and similarly

VAR[Ȳ] = σY^2/n.

Using this in equation 15.6, we find

σZ = σ_{Ȳ−X̄} = √( σX^2/n + σY^2/n ).   (15.7)


If σX = σY (which should be the case when we use the same measurement instrument), then equation 15.7 can be rewritten as

σ_{Ȳ−X̄} = √( σ^2/n + σ^2/n ) = σ√(2/n),   (15.8)

where σ = σX = σY. If the two samples had different sizes, we would find, by a similar calculation,

σZ = √( σX^2/n1 + σY^2/n2 ),   (15.9)

where n1 is the size of the first sample and n2 is the size of the second sample. Furthermore, we have for the expectation:

E[Ȳ − X̄] = E[Ȳ] − E[X̄] = E[ (Y1 + Y2 + . . . + Yn)/n ] − E[ (X1 + . . . + Xn)/n ] = E[Y1 + . . . + Yn]/n − E[X1 + . . . + Xn]/n = (E[Y1] + E[Y2] + . . . + E[Yn])/n − (E[X1] + E[X2] + . . . + E[Xn])/n = E[Y1] − E[X1] = µY − µX.

To summarize, we found that

E[Ȳ − X̄] = µY − µX.   (15.10)

A simplified method. Let us first explain a rough method, to convey the idea in a simple way. Up to a small detail, this method is the same as what is really used in practice.

At this stage we are ready to explain how we could proceed to find out whether we have strong evidence against the hypothesis µY − µX = 0. We are going to use the rule of thumb which says that in most cases, for most variables, the values we typically observe are within a distance of at most 2 times the standard deviation from the expected value. We apply this rule to Z = Ȳ − X̄. If there were no difference between the new and old fuel, then µY − µX would be equal to zero and hence E[Z] = 0 (see equation 15.10). We can then check if the value we observe for Z is within 2 times the standard deviation σZ. Thus in our case we check if the value 1 is within 2 times the standard deviation σZ. This is the same as checking whether

(Ȳ − X̄)/σZ   (15.11)

is not more than 2 in absolute value. If it is more, we would think that µY is probably not equal to µX. In that case, we say that we reject the hypothesis that µX = µY. The expression 15.11 is called the test statistic. What we did here is check if the value taken by the test statistic is within the interval [−cr, cr], where we took cr = 2. The number cr is


called the critical value for our test. If we do not know σX and σY, we estimate them and replace them by their estimates in formulas 15.7, 15.8, 15.9. (To see how to estimate a standard deviation, go to subsection 16.2.) We then use that value for the test statistic instead of 15.11.
The method described here differs from the one really used only as far as the critical value is concerned. However, even with the way the test is usually done in practice, the critical value will not be very far from 2. Let us next explain in detail the different methods used in practice. They depend on whether the standard deviation is known or not. Also, to perform a statistical test in a precise way, we need to specify the level of confidence for the test. The higher the level of confidence, the bigger the critical value will be. Let us see the details in the next paragraphs.

The case with identical, known standard deviation. Assume that the standard deviations σX and σY are known to us and identical. This is typically the case when the measurement instruments used for both samples are identical. In this case, we denote by σ the value σ = σX = σY. If we often work with the same measurement instruments, we will know the "typical size" of the measurement error, hence we will know σ from experience. Assume here that the measurement errors are normal. Then the test statistic

(Ȳ − X̄)/σZ = (Y1 + . . . + Yn − X1 − . . . − Xn)/(nσZ)   (15.12)

is also normal. As a matter of fact, as can be seen in 15.12, the test statistic can be written as a sum of independent normal variables divided by a constant. We know that sums of independent normal variables are again normal. Furthermore, dividing a normal by a constant gives you a normal again. If µX = µY, then the expectation of the test statistic is zero:

E[ (Ȳ − X̄)/σZ ] = E[Ȳ − X̄]/σZ = (µY − µX)/σZ = 0.

Similarly, the variance of the test statistic is one. This can be seen from:

VAR[ (Ȳ − X̄)/σZ ] = VAR[Ȳ − X̄]/σZ^2 = 1.

Hence, if µX = µY, then the test statistic is a normal variable with expectation 0 and variance 1. In other words, the test statistic is a standard normal variable. So in this case the critical value cr at a confidence level p is the number cr > 0 satisfying

P(−cr ≤ N(0, 1) ≤ cr) = p.

By symmetry around the origin, this implies (see subsection 15.1) that

φ(cr) = (1 + p)/2,   (15.13)

where φ(x) = P(N(0, 1) ≤ x) is the distribution function of a standard normal. Which value satisfies equation 15.13 can be found in a table for standard normal variables. For


example, for p = 95%, the corresponding critical value is cr = 1.96.
Let us get back to our example with the rocket. Assume that the typical measurement error when we make one measurement is 3. In other words, let σ = 3. Assume we want to test on the 95%-level whether there is a statistically significant difference between the means in our samples. That is, we want to test the hypothesis µY = µX against the hypothesis µY ≠ µX on the 95%-level. For this we simply need to check if the test statistic lies between −cr and +cr. The test statistic is (Ȳ − X̄)/σZ. The constant σZ = σ_{Ȳ−X̄} has been calculated in 15.8, where it was found that:

σZ = σ√(2/n).

Hence, with our values σ = 3 and n = 5, the test statistic takes on the value

(Ȳ − X̄)/σZ = (Ȳ − X̄)/(σ√(2/n)) = 1/(3√0.4) ≈ 1/1.9 ≈ 0.53.

The value of the test statistic lies within the interval [−cr, cr], with cr = 1.96. Hence in this situation we cannot reject the hypothesis that µX = µY on the 95%-confidence level. In other words, we do not have enough statistical evidence to reject the idea that µX = µY. This means that our data does not seem to imply that the new fuel is better or worse than the old. Note that this does not necessarily mean that µX and µY must be identical. It could be that the difference is so small that it gets masked by our measurement errors.
The way the test was done is called a two-sided test. If we were interested just in knowing whether the new fuel is better, we would do a one-sided test. (It could be that a company might change to a new fuel, but only if it is proven to be better. In that case, the only interesting thing is to know if the new fuel is better, not whether it is different.) In the case of a one-sided test, the confidence interval would be (−∞, cr], where the critical value cr is determined by

P (N (0, 1) ≤ cr) = Φ(cr) = p

on the confidence level p. Here as before Φ(.) designates the distribution function of astandard normal variable.In this example, we assumed the measurement errors to be normal. If this is not the case,but we have many measurements, the above method still applies due to the Central LimitTheorem.
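As a small sketch (not part of the original notes; the function name is ours), the two-sided and one-sided critical values can be computed with Python's standard library, where `NormalDist().inv_cdf` plays the role of the inverse of Φ:

```python
from statistics import NormalDist

def critical_value(p, two_sided=True):
    """Critical value cr for a standard normal test statistic at confidence level p.

    Two-sided: P(-cr <= N(0,1) <= cr) = p, i.e. Phi(cr) = (1 + p)/2 (formula 15.13).
    One-sided: P(N(0,1) <= cr) = p,       i.e. Phi(cr) = p.
    """
    q = (1 + p) / 2 if two_sided else p
    return NormalDist().inv_cdf(q)

print(round(critical_value(0.95), 2))                   # 1.96
print(round(critical_value(0.95, two_sided=False), 2))  # 1.64
```

Note that the one-sided critical value (1.64) is smaller than the two-sided one (1.96), since all of the rejection probability sits in one tail.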

Case when the standard deviations are known but not equal, or the sample sizes differ. If the standard deviations in the two samples are different (because of different measurement instruments, maybe), then all of the above remains the same, except that we use a different formula for σZ, namely formula 15.7. The same goes when the samples have different sizes; the formula used in that case is 15.9.


The case with unknown, but equal standard deviation. Assume that σ = σX = σY, but σ is unknown to us. Then instead of σZ we use an estimate for σZ. For this note that

σZ = √(σX²/n + σY²/n). (15.14)

We will estimate σX² and σY² and plug the estimated values into formula 15.14 instead of the real values. The estimates we use (see subsection 16.2) are

sX² := ((X1 − X̄)² + (X2 − X̄)² + . . . + (Xn − X̄)²)/(n − 1)

for σX² and

sY² := ((Y1 − Ȳ)² + (Y2 − Ȳ)² + . . . + (Yn − Ȳ)²)/(n − 1)

for σY². This then gives as estimate for σZ the following expression:

√((sX² + sY²)/n).

Our test statistic is obtained by replacing σZ by its estimate in the previously used test statistic. Hence the test statistic for the case of unknown standard deviation is:

(Ȳ − X̄)/√((sX² + sY²)/n). (15.15)

The distribution of the test statistic is no longer normal; it is slightly modified. One can prove that when µX = µY and the measurement errors are normal, the test statistic has a Student t-distribution with 2n − 2 degrees of freedom. So our testing procedure is almost as before, only that we have to find the critical value cr in a different table: this time in a table for the t-distribution with 2n − 2 degrees of freedom. So, if we test on the confidence level p, then cr is defined to be the number such that

P (−cr ≤ T2n−2 ≤ cr) = p.

We reject the hypothesis µX = µY on the level p if the test statistic 15.15 takes a value outside [−cr, cr].

Let us get back to our rocket example. Say we want to test µX = µY on the level p = 95%. We find

sX² = (0 + 2² + 3² + 0 + 1)/4 = 14/4 = 3.5

and

sY² = (0 + 2² + 2² + 1 + 1)/4 = 10/4 = 2.5.

With n = 5 the test statistic takes on the value

(Ȳ − X̄)/√((sX² + sY²)/n) = 1/√(6/5) = 1/√1.2 ≈ 0.9.


Now we have to look at the table for the t-distribution with 2n − 2 = 8 degrees of freedom. Upon reading the table we find the critical value for a two-sided test on the 95% level to be cr ≈ 2.31. We see that the value taken by the test statistic is well within the interval [−cr, cr], and hence we cannot reject the hypothesis that the new fuel has no effect. More precisely, our data does not contain significant evidence on the 95%-level that there is a difference between the two fuels, i.e. that µX ≠ µY.
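The whole two-sample procedure can be sketched in a few lines of Python. The sample values below are hypothetical (they are not from the text), but any two samples of size 5 with means 10 and 11 and sample variances 3.5 and 2.5 reproduce the numbers above:

```python
import math

def two_sample_t_stat(xs, ys):
    """(Ybar - Xbar) / sqrt((sX^2 + sY^2)/n) for two samples of equal size n (formula 15.15)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s2_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s2_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return (y_bar - x_bar) / math.sqrt((s2_x + s2_y) / n)

X = [10, 12, 7, 10, 11]   # hypothetical old-fuel sample: mean 10, sX^2 = 3.5
Y = [11, 13, 9, 12, 10]   # hypothetical new-fuel sample: mean 11, sY^2 = 2.5
t = two_sample_t_stat(X, Y)
print(round(t, 2))  # 0.91, well inside [-2.31, 2.31] (t-table, 8 d.f.): do not reject
```

The statistic 1/√1.2 ≈ 0.91 is far from the critical value, which matches the conclusion of the example.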

16 Statistical estimation

16.1 An example

Imagine that we want to measure the distance d between two points y and z. Every time we repeat the measurement we make a measurement error. In order to improve the precision we make several measurements and then take the average value measured. Let Xi designate measurement number i and ϵi error number i. We have that:

Xi = d + ϵi.

We assume that the measurement errors are i.i.d. such that

E[ϵi] = 0

and

VAR[ϵi] = σ².

The standard deviation σ of the measurement instrument is supposed to be known to us. Imagine that we make 4 measurements and find, in meters, the four values:

100, 102, 99, 101.

We see that the distance d must be around 100 meters. However, the exact value of the distance d remains unknown to us, since each of the four measurements above contains an error. So, we can only estimate what the true distance is equal to. Typically we take the average of the measurements as estimate for d. We write d̂ for our estimate of d. In the case we decide to use the average of our measurements as estimate for d, we have that:

d̂ = (X1 + X2 + X3 + X4)/4.

The advantage of taking four measurements of the same distance instead of only one is that the probability of a large error is reduced. The errors in the different measurements tend to even each other out when we compute the average. As a matter of fact, assume we make n measurements and then take the average. In this case:

d̂ := (X1 + . . . + Xn)/n.


We find:

E[d̂] = (1/n)(E[X1] + . . . + E[Xn]) = (1/n)(nE[X1]) = E[X1] = E[d + ϵ1] = d + E[ϵ1] = d + 0 = d.

An estimator whose expectation equals the true value we want to estimate is called an unbiased estimator. Let us calculate:

VAR[d̂] = VAR[(X1 + . . . + Xn)/n] = (1/n²)(VAR[X1] + . . . + VAR[Xn]) = (1/n²)(nVAR[X1]) = VAR[X1]/n.

Thus, the standard deviation of d̂ is equal to

√(VAR[X1]/n) = σ/√n.

The standard deviation of the average d̂ is thus √n times smaller than the standard deviation of the error when we make one measurement. This justifies taking several measurements and computing the average, since it reduces the size of a typical error by a factor √n.

When we make a measurement and give an estimate of what the distance is, it is important that we know the order of magnitude of the error. Imagine for example that the order of magnitude of the error is 100 meters. The situation would then be: "our estimate of the distance is 101 meters, and the precision of this estimate is plus/minus 100 meters". In this case our estimate of the distance is almost useless because of the huge imprecision. This is why we always try to give the "precision" of the estimate. Since the errors are random, theoretically even very large errors are always possible. Very large errors, however, have small probability. Hence one tries to give an upper bound on the size of the error which holds with a given probability. Typically one uses the probabilities 95% or 99%. The type of statement one wishes to make is, for example: our estimate for the distance is 101 meters; furthermore, with 95% probability the true distance is within 2 meters of our estimate. In this case the interval [99, 103] is called the 95% confidence interval for d. With 95% probability, d should lie within this interval. More precisely, we look for a real number a > 0 such that:

P(d̂ − a ≤ d ≤ d̂ + a) = 95%,

or equivalently:

P(−a ≤ d̂ − d ≤ a) = 95%.

Hence we are looking for a number a such that:

95% = P(−a ≤ (X1 + . . . + Xn)/n − d ≤ a) = P(−a ≤ ((d + ϵ1) + . . . + (d + ϵn) − nd)/n ≤ a) = P(−a ≤ (ϵ1 + . . . + ϵn)/n ≤ a).


Now, we either assume that the errors ϵi are normal, or that n is big enough so that the sum ϵ1 + . . . + ϵn is approximately normal due to the central limit theorem. Dividing the sum ϵ1 + . . . + ϵn by σ√n, we get (approximately) a standard normal variable. This then gives:

95% = P(−a√n/σ ≤ (ϵ1 + . . . + ϵn)/(σ√n) ≤ a√n/σ) ≈ P(−a√n/σ ≤ N(0, 1) ≤ a√n/σ).

We thus find the number b > 0 from the table for the standard normal random variable such that:

95% = P(−b ≤ N(0, 1) ≤ b).

Hence:

95% = φ(b) − (1 − φ(b)) = 2φ(b) − 1,

where φ(·) designates the distribution function of the standard normal variable. Then we find a > 0 by solving:

b = a√n/σ.

The confidence interval on the 95% level is then:

[d̂ − a, d̂ + a].

This means that although we don't know the exact value of d, we can say that with 95% probability d lies in the interval [d̂ − a, d̂ + a].
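The recipe above can be sketched with Python's standard library. The measurements are the four values from the example; the value σ = 2 is an assumption for illustration (the text does not specify σ), and the function name is ours:

```python
from statistics import NormalDist

def mean_confidence_interval(xs, sigma, level=0.95):
    """[d_hat - a, d_hat + a] with a = b*sigma/sqrt(n), where P(-b <= N(0,1) <= b) = level."""
    n = len(xs)
    d_hat = sum(xs) / n
    b = NormalDist().inv_cdf((1 + level) / 2)   # solves 2*phi(b) - 1 = level
    a = b * sigma / n ** 0.5
    return d_hat - a, d_hat + a

lo, hi = mean_confidence_interval([100, 102, 99, 101], sigma=2.0)  # sigma assumed
print(round(lo, 2), round(hi, 2))  # 98.54 102.46
```

With n = 4 and σ = 2 we get a = 1.96 · 2/√4 = 1.96, so the interval is the average 100.5 plus/minus 1.96 meters.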

16.2 Estimation of variance and standard deviation

Assume that we are in the same situation as in the previous subsection. The only difference is that instead of trying to determine the distance, we want to find out how precise our measurement instrument is. In other words, we try to determine the standard deviation σ = √(VAR[ϵi]). For this we make several measurements of the distance between two points y and z. We choose the points so that we know the distance d between them. Again, if Xi designates the i-th measurement, we have Xi = d + ϵi. Define the random variable Zi in the following way:

Zi := (Xi − d)² = ϵi².

Thus:

E[Zi] = VAR[ϵi].

We have argued that if we have a number of independent copies of the same random variable, a good way to estimate the expectation is to take the average. Thus to estimate the expectation E[Zi], we take the average:

Ê[Zi] := (Z1 + . . . + Zn)/n.


In other words, as an estimate for VAR[ϵi] = E[Zi] = σ², we take:

(Z1 + . . . + Zn)/n = ((X1 − d)² + . . . + (Xn − d)²)/n.

The estimate for σ is then simply the square root of the estimate for the variance. Thus, our estimator for σ = √(VAR[ϵi]) is:

σ̂ = √(((X1 − d)² + . . . + (Xn − d)²)/n).

If the distance d should not be known, we simply take an estimate d̂ instead of d. In that case our estimate for σ is

σ̂ = √(((X1 − d̂)² + . . . + (Xn − d̂)²)/(n − 1)),

where

d̂ := (X1 + . . . + Xn)/n.

(Note that in the case that d is unknown, instead of dividing by n we usually divide by n − 1. This is a little detail which I am not going to explain. For large n it is not important, since then n/(n − 1) is close to 1.)
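Both estimators can be sketched in one small function (the function name is ours; the data are the four measurements from subsection 16.1):

```python
def estimate_sigma(xs, d=None):
    """Estimate sigma: divide by n when the true distance d is known, by n - 1 otherwise."""
    n = len(xs)
    if d is None:
        d = sum(xs) / n   # use the average d_hat when the true distance is unknown
        denom = n - 1
    else:
        denom = n
    return (sum((x - d) ** 2 for x in xs) / denom) ** 0.5

print(round(estimate_sigma([100, 102, 99, 101]), 2))         # 1.29  (d unknown)
print(round(estimate_sigma([100, 102, 99, 101], d=100), 2))  # 1.22  (d = 100 known)
```

In the first call the deviations are taken from the average 100.5 and the sum of squares is divided by n − 1 = 3; in the second, from the known distance 100 and divided by n = 4.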

16.3 Maximum Likelihood estimation

Imagine the following situation: we have two 6-sided dice. Let X designate the number we obtain when we throw the first die. Let Y designate the number we obtain when we throw the second one. Assume that the first die is regular whilst the second is skewed. We have:

(P(X = 1), P(X = 2), . . . , P(X = 6)) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

(Note that 1/6 ≈ 0.17.) Assume furthermore that:

(P(Y = 1), . . . , P(Y = 6)) = (0.01, 0.3, 0.2, 0.1, 0.1, 0.29).

Imagine that we are playing the following game: I choose from a bag one of the two dice. Then I throw it and get a number between 1 and 6. I don't tell you which die I used, but I tell you the number obtained. You have to guess which die I used based on the number which I tell you. (This guessing is what statisticians call estimating.) For example, say I tell you that I obtained the number 1. With the first die, the probability to obtain a 1 is 1/6 ≈ 0.17, whilst with the second die it is 0.01. The probability to obtain a 1 is thus much smaller with the second die. Having obtained a one thus makes us think "that it is likelier" that the die used is the first die. Our guess will thus be the first die. Of course you could be wrong, but based on what you know, the first die appears to be "likelier".


If, on the other hand, after throwing the die we obtain a 2, we guess that it was the second die which got used. The reason is that with the second die a 2 has probability 0.3, which is larger than the probability of seeing a 2 with the first die. Again, our guess might be wrong, but when we observe a 2, the second die seems "likelier". The method of guessing described here is called maximum likelihood estimation. It consists of guessing (estimating) the possibility which makes the observed result most likely. In other words, we choose the possibility for which the probability of the observed outcome is highest.

Let us look at this in a slightly more abstract way. Let I designate the first die and II the second. For x = 1, 2, . . . , 6, let P(x, I) designate the probability that the number we obtain by throwing the first die equals x. Thus:

P(x, I) := P(X = x).

Let P(x, II) designate the probability that the number we obtain by throwing the second die equals x. Thus:

P(x, II) := P(Y = x).

For example, P(1, I) is the probability that the first die gives a 1 and P(1, II) is the probability that the second die gives a 1, whilst P(2, II) designates the probability that the second die gives a 2. Let θ be a (non-random) variable which can take one out of two values: I or II. Statisticians call θ the parameter. In this example, guessing which die we are using is the same as trying to figure out whether θ equals I or II. We consider the probability function P(·, ·) with two entries:

(x, θ) ↦ P(x, θ).

Formally, what we did can be described as follows: given that we observe an outcome x, we take the θ which maximizes P(x, θ) as our guess for which die was used. Our maximum likelihood estimate θ̂ of θ is the θ maximizing P(x, θ), where x is the observed outcome. This is a general method and can be used in many different settings. Let us give another example of maximum likelihood estimation, based on the same principle.
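The guessing rule θ̂ = argmax P(x, θ) can be sketched in a few lines of Python (the function name is ours; the probabilities are those of the two dice above):

```python
def ml_guess(x):
    """Maximum likelihood guess: the theta (die I or II) maximizing P(x, theta)."""
    p_I = {k: 1 / 6 for k in range(1, 7)}                            # fair die
    p_II = dict(zip(range(1, 7), [0.01, 0.3, 0.2, 0.1, 0.1, 0.29]))  # skewed die
    return "I" if p_I[x] > p_II[x] else "II"

print(ml_guess(1))  # I  : 1/6 ~ 0.17 beats 0.01
print(ml_guess(2))  # II : 0.3 beats 1/6
```

For each outcome x we simply compare the two probabilities P(x, I) and P(x, II) and report the die with the larger one.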

16.4 Estimation of parameter for geometric random variables

Let T1, T2, . . . be a sequence of i.i.d. geometric random variables with parameter p > 0. Assume that p is unknown. We want to estimate p (in other words, we want to try to guess what p is approximately equal to). Say we observe:

(T1, T2, T3, T4, T5) = (6, 7, 5, 8, 8).

Based on this evidence, what should our estimate p̂ for p be? (Hence, what should our guess for the unknown p be?) We can use the maximum likelihood method. For this, the estimate p̂ is the p ∈ [0, 1] for which the probability to observe

(6, 7, 5, 8, 8)


is maximal. Since we assumed the Ti's to be independent, we find that

P((T1, T2, T3, T4, T5) = (6, 7, 5, 8, 8)) (16.1)

is equal to

P(T1 = 6) · P(T2 = 7) · . . . · P(T5 = 8).

For a geometric random variable T with parameter p we have that:

P(T = k) = p(1 − p)^(k−1).

Thus the probability 16.1 is equal to:

p(1 − p)^5 · p(1 − p)^6 · . . . · p(1 − p)^7 = exp(ln(p) + 5 ln(1 − p) + . . . + ln(p) + 7 ln(1 − p)). (16.2)

We want to find the p maximizing the last expression. This is the same as maximizing the expression

ln(p) + 5 ln(1 − p) + . . . + ln(p) + 7 ln(1 − p),

since exp(·) is an increasing function. To find the maximum, we take the derivative with respect to p and set it equal to 0. This gives:

0 = d(ln(p) + 5 ln(1 − p) + . . . + ln(p) + 7 ln(1 − p))/dp = 1/p − 5/(1 − p) + . . . + 1/p − 7/(1 − p).

The last equality leads to:

n(1 − p) = (5 + . . . + 7)p,

where n designates the number of observations. (In the special example considered here, n = 5.) We find:

n = (6 + . . . + 8)p = p(T1 + T2 + . . . + Tn)

and hence:

1/p = (6 + 7 + 5 + 8 + 8)/5 = (T1 + T2 + . . . + Tn)/n. (16.3)

Our estimate p̂ of p is the p which maximizes expression 16.2. This is the p which satisfies equation 16.3. Thus our estimate is:

p̂ := ((6 + 7 + 5 + 8 + 8)/5)^(−1) = ((T1 + T2 + . . . + Tn)/n)^(−1) = 5/34 ≈ 0.15.

17 Linear Regression

17.1 The case where the exact linear model is known

Imagine a situation where you have a chain of shops. The shops can have different sizes, and the profit seems to be, to some extent, a function of the size. The chain owns n shops.


Let xi denote the size of the i-th shop and Yi its profit. Now you assume that there are two constants α and β so that the following relationship holds:

Yi = α + βxi + ϵi,

where we assume that ϵ1, ϵ2, . . . are i.i.d. random variables with expectation zero:

E[ϵi] = 0.

Let σ denote the standard deviation of the variables ϵi. Often it will also be assumed that the variables ϵi are normal. Now, we have that the expected profit is equal to

E[Yi] = E[α + βxi + ϵi] = E[α] + E[βxi] + E[ϵi] = α + βxi.

In other words, the expected profit is a linear function of the size: E[Y] = α + βx, where x is the size and Y is the profit of a shop. So, if you draw a curve representing expected profit as a function of size, you get a straight line.

Say for your chain of shops you have the relationship Yi = 3 + 4xi + ϵi. So, in this case α = 3 and β = 4. Say you own a shop of size 5. Then for that shop, the expected profit given that the size is 5 would be E[Y | x = 5] = 3 + 4 · 5 = 23. Here we denote by E[Y | x = 5] the expectation given that the size is 5. Now, why would the profit of that shop be random? Very simple: it could be that this one shop with size 5 is going to open next month. So, nobody knows in advance the exact profit. One can forecast it, maybe give a confidence interval, but nobody knows in advance the exact value! Hence, the profit behaves like a random variable. If you are told to predict (estimate) what the profit will be, you will give the expected value α + β · 5 = 23. Of course, this requires that you know the constants α and β.

Now, if you also know the standard deviation σ of ϵ, then you can give a confidence interval. First, using Matzinger's rule of thumb, you could simply say that most probably the profit of the shop will be within two standard deviations of the expected profit. In our case, we could thus say that typically the profit is 23 ± 2σ, and hence most likely between 23 − 2σ and 23 + 2σ. If for example σ = 3, then most likely the profit for our shop will be between 17 and 29. Now, this is using a rule of thumb which says that random variables typically take values not further than twice the standard deviation from their expectation. But this is not very precise. So, we could actually give a confidence interval, that is, an interval such that with, for example, 95% probability the profit will be in that interval.
If we assume that the errors are normal, then ϵ/σ is standard normal. Hence (Y − α − βx)/σ = ϵ/σ is standard normal, and so

P(−c ≤ (Y − α − βx)/σ ≤ c) = P(−c ≤ N(0, 1) ≤ c).

The above allows us to give an interval (think of a confidence interval) such that the profit of the new shop will be in that interval with a given probability. For example, for 95%-confidence the interval is going to be

[α + β · 5 − σc0.95, α + β · 5 + σc0.95] = [23 − σc0.95, 23 + σc0.95],

where c0.95 denotes the constant such that a standard normal lies between ± that constant with probability 0.95. We have seen how to calculate such a constant.

Imagine next a situation where α and β are known, and σ is not known. Then we want to estimate σ based on our data. Note that Yi = α + βxi + ϵi and hence:

ϵi = Yi − α − βxi. (17.1)

When the data (Yi, xi) is known and α, β are known as well, then we can figure out the values of the ϵi's using formula 17.1. Note that σ designates the standard deviation of the errors ϵi. But in previous chapters we have learned how to estimate a standard deviation. So, this is what we are going to do, using the ϵi's to estimate the standard deviation σ:

σ̂ := √((ϵ1² + . . . + ϵn²)/n). (17.2)

Let us give an example. Say we have five shops and, as before, α = 3 and β = 4. The data for the shops is given in the table below:

xi: 1, 2, 3, 4, 6
Yi: 8, 10, 17, 17, 27

Now, for example, ϵ1 = Y1 − 3 − 4x1 = 8 − 3 − 4 = 1. So for each i = 1, 2, . . . , 5 we can calculate the corresponding ϵi. We get the values:

xi: 1, 2, 3, 4, 6
ϵi = Yi − α − βxi: 1, −1, 2, −2, 0

So, our estimate of the standard deviation σ becomes:

σ̂ := √((1² + (−1)² + 2² + (−2)² + 0²)/5) = √2 ≈ 1.41. (17.3)

We can now use Matzinger's rule of thumb, which says that a random variable, most of the time, takes values not further than two standard deviations from its expectation. So, that tells us that for our shop the profit should be within 23 ± 2σ̂ ≈ 23 ± 2.83. So, typically the profit would be in the interval

[20.17, 25.83].

The above interval just gives a rough idea of which area the profit will most likely be in. For a more precise approach with an explicit confidence level p, we would take the interval

[23 − cpσ, 23 + cpσ], (17.4)

where cp is the constant such that a standard normal is with probability p between −cp and +cp:

p = P(−cp ≤ N(0, 1) ≤ cp).

Now, if we do not know the standard deviation, we can replace the true standard deviation by its estimate σ̂. The coefficient cp from the normal table then has to be replaced by the corresponding coefficient tp from the Student t table (with the appropriate number of degrees of freedom). So, the confidence interval when we have to estimate the standard deviation becomes:

[α + βx0 − tpσ̂, α + βx0 + tpσ̂], (17.5)

where x0 is the size of the shop (x0 = 5 in our example).

17.2 When α and β are not known

If α and β are not known, then we estimate them using least squares. The estimates α̂ and β̂ are given by the following two equations:

ȳ = α̂ + β̂x̄

and

β̂ := ∑i(xi − x̄)yi / ∑i(xi − x̄)²,

where the sums run over i = 1, . . . , n. Plug the value of β̂ from the second equation into the first to calculate α̂.

Now, in principle, all the things we did in the last subsection, where α and β were known, will be done here. The difference is mainly that instead of α and β we use the estimates α̂ and β̂, and then act as if the estimates were the true values. (For the confidence interval there will be a small adjustment.) In other words, given some real data (xi, Yi) for i = 1, 2, . . . , n, you can estimate α and β. Then forget that your estimates α̂ and β̂ are only estimates: act as if they were the true α and β and do everything we did in the section above. In this way you can figure out how to estimate the standard deviation, get a confidence interval, and so on and so forth. Let us summarize:

1. To estimate the expected profit of a new shop of size x0, we used α + βx0 in the previous section. Now, however, α and β are not known. So, we simply take the estimates for α and β and act as if they were the true values. Our estimate for the expected profit of a shop of size x0, when α and β are not known, is:

Ê[Y | x0] := α̂ + β̂x0.

2. To estimate the standard deviation, we had used the ϵi's, which are equal to ϵi = Yi − α − βxi. Now α and β are not known, so we replace them by their respective estimates. So our estimated random errors are

ϵ̂i := Yi − α̂ − β̂xi.

For estimating the standard deviation σ, we now simply replace ϵi by the estimate ϵ̂i in formula 17.2 (dividing by n − 2 instead of n, since two parameters were estimated). Hence, the estimated σ is defined to be:

σ̂ := √((ϵ̂1² + ϵ̂2² + . . . + ϵ̂n²)/(n − 2)). (17.6)

3. Let us see how we give a "rough" confidence interval using Matzinger's rule of thumb. (That rule of thumb is: variables mostly take values not further than two times the standard deviation from their expectation.) So, in the formula α + βx0 ± 2σ we simply replace α, β, σ by their respective estimates; the "rough" confidence interval for the profit of a shop of size x0 would then be

[α̂ + β̂x0 − 2σ̂, α̂ + β̂x0 + 2σ̂],

where σ̂ is our estimate given in 17.6.


4. For an exact confidence interval we take the same as in 17.5, but again replacing α, β and σ by their respective estimates (for estimating σ we take 17.6). Also, there is an additional factor equal to

√(1 + 1/n + (x0 − x̄)²/∑i(xi − x̄)²).

This factor is needed because we have additional uncertainty: we do not know α + βx0, but only have an estimate for it. For large n, this factor becomes close to 1. So, all this being said, our confidence interval on the confidence level p is

[α̂ + β̂x0 − tpσ̂√(1 + 1/n + (x0 − x̄)²/∑i(xi − x̄)²), α̂ + β̂x0 + tpσ̂√(1 + 1/n + (x0 − x̄)²/∑i(xi − x̄)²)],

where tp is the corresponding coefficient from the Student t table.
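The whole estimation step can be sketched directly from the two defining equations, using the shop data from subsection 17.1 (which were generated with α = 3 and β = 4, so the estimates should land nearby; the function name is ours):

```python
def least_squares(xs, ys):
    """alpha-hat and beta-hat from  ybar = alpha + beta*xbar  and
    beta = sum((xi - xbar)*yi) / sum((xi - xbar)**2)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = sum((x - x_bar) * y for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar   # plug beta into the first equation
    return alpha, beta

a, b = least_squares([1, 2, 3, 4, 6], [8, 10, 17, 17, 27])
print(round(a, 2), round(b, 2))  # 3.65 3.8 -- close to the true alpha = 3, beta = 4
```

The estimates are not exactly 3 and 4 because of the random errors ϵi in the data, which is exactly the point of the next subsections.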

17.3 Where the formulas for the estimates of α and β come from

So a typical situation is that we have data:

x1, x2, . . . , xn
y1, y2, . . . , yn

We can assume that these points were generated by a model like the one described at the beginning of this section:

yi = α + βxi + ϵi

for all i = 1, 2, . . . , n, where α, β do not depend on i. Again, ϵ1, ϵ2, . . . are i.i.d. with expectation 0 and standard deviation σ. Typically, α, β and σ are not known to us. So how can we figure them out? When we have many data points, we want to try to find a straight line which is close to all the points. Consider any straight line y = a + bx. We could try to find such a line so that the sum of the distances to all the points (xi, yi) is small. This would correspond to searching for a straight line which minimizes:

∑i |yi − a − bxi|,

where the sum runs over i = 1, . . . , n.

Note that in the above sum, the yi's and the xi's are given numbers, so we only need to find a and b minimizing the above expression. Now, absolute values are a mess to calculate with. So, instead, we will take the sum of the squared distances:

d²(a, b) := ∑i (yi − a − bxi)²


and find a and b minimizing d²(a, b). This will yield very nice explicit formulas. To find those formulas we simply take the derivative with respect to a and with respect to b and set them equal to 0. This yields:

d(∑i (yi − a − bxi)²)/da = −2 ∑i (yi − a − bxi).

Setting the expression on the right side of the last equation above equal to 0, we find:

ȳ = a + bx̄,

where

ȳ := (y1 + . . . + yn)/n and x̄ := (x1 + . . . + xn)/n.

Then we take the derivative with respect to b and set it equal to 0:

d(∑i (yi − a − bxi)²)/db = −2 ∑i xi(yi − a − bxi).

So, setting the expression on the right side of the last equation above equal to 0 yields:

∑i xiyi − a ∑i xi − b ∑i xi² = 0. (17.7)

Take a change of variables: let x′i := xi − x̄. A shift in the x-coordinate does not change the slope b. Also, the squared distances are not affected by a shift in the x-coordinates. So the same formula as 17.7 must hold for the values x′i. Hence, formula 17.7 is equivalent to

∑i x′iyi − a ∑i x′i − b ∑i (x′i)² = 0.

Note however that ∑i x′i = 0. Hence, we have

∑i x′iyi − b ∑i (x′i)² = 0,

which implies that

b = ∑i x′iyi / ∑i (x′i)² = ∑i (xi − x̄)yi / ∑i (xi − x̄)².

We have now found a system of two equations for a and b, which determines which straight line y = a + bx gets closest to the data points (x1, y1), (x2, y2), . . . , (xn, yn).


By closest, we mean that the sum of the squared vertical distances between the points and the line should be minimal. So, the system of two equations is:

ȳ = a + bx̄ (17.8)

b = ∑i (xi − x̄)yi / ∑i (xi − x̄)². (17.9)

Solving the above system of two equations in a and b yields the straight line y = a + bx which is closest (in our sense of "sum of squared distances") to our points. We will use these values of a and b, which minimize the sum of squared distances, as our estimates for α and β. An explanation of why this is a good idea can be found below in the subsection entitled "How precise are our estimates". So, we have that the estimates α̂ and β̂ are the unique solution to 17.8 and 17.9. Hence, they are given by the following two equations:

ȳ = α̂ + β̂x̄

and

β̂ := ∑i (xi − x̄)yi / ∑i (xi − x̄)².

17.4 Expectation and variance of β

We can calculate the expectation of our estimate β̂. We have seen that the estimate β̂ is given by:

β̂ = ∑i (xi − x̄)yi / ∑i (xi − x̄)².

We are going to take the variance on both sides of the last equation above, and use the fact that the xi's are constants and not random, while the yi's are independent. Recall that a constant which multiplies a random variable can be taken out of the variance after squaring. This leads to

VAR[β̂] = VAR[∑i (xi − x̄)yi / ∑i (xi − x̄)²] = ∑i (xi − x̄)² VAR[yi] / (∑i (xi − x̄)²)² = VAR[yi] / ∑i (xi − x̄)² = σ² / ∑i (xi − x̄)².

So, we finally get:

VAR[β̂] = σ² / ∑i (xi − x̄)²,

and

σβ̂ = σ / √(∑i (xi − x̄)²). (17.10)

Next we want to calculate the expectation of the estimate β̂. Recall that the error terms ϵi have zero expectation, E[ϵi] = 0, and hence

E[Yi] = E[α + βxi + ϵi] = E[α] + E[βxi] + E[ϵi] = α + βxi.


We are now ready to calculate the expectation of our estimate:

E[β̂] = E[∑i (xi − x̄)yi / ∑i (xi − x̄)²] = ∑i (xi − x̄)E[yi] / ∑i (xi − x̄)² = ∑i (xi − x̄)(α + βxi) / ∑i (xi − x̄)² = α ∑i (xi − x̄) / ∑i (xi − x̄)² + β ∑i (xi − x̄)xi / ∑i (xi − x̄)² = β,

since ∑i (xi − x̄) = 0 and ∑i (xi − x̄)xi = ∑i (xi − x̄)².

In other words, the expectation of the estimator β̂ is β itself: the estimator is unbiased. This has a very important application. The estimate β̂ is a random quantity, since it depends on the ϵi's, which we have assumed to be random. Now, for any random variable Z, we measure the "approximate average distance from its expectation" (= dispersion) by the standard deviation of the variable. So, how far β̂ typically is from E[β̂] = β when we keep repeating the experiment is given by σβ̂. But the distance between β̂ and β is the estimation error of our estimate. So, in other words, the average size of the estimation error (when we estimate β) is given by σβ̂, for which we have a closed expression in equation 17.10 above.
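These two facts — E[β̂] = β and σβ̂ = σ/√(∑i(xi − x̄)²) — can be checked by simulation. The sketch below uses the shop sizes from our example as design points; σ = 1 and the random seed are assumptions for the demo:

```python
import math
import random

random.seed(0)
alpha, beta, sigma = 3.0, 4.0, 1.0      # sigma = 1 is an assumption for the demo
xs = [1, 2, 3, 4, 6]                    # shop sizes from subsection 17.1
x_bar = sum(xs) / len(xs)
sxx = sum((x - x_bar) ** 2 for x in xs)

def beta_hat():
    """One simulated data set y_i = alpha + beta*x_i + eps_i and its slope estimate."""
    ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]
    return sum((x - x_bar) * y for x, y in zip(xs, ys)) / sxx

estimates = [beta_hat() for _ in range(20000)]
mean = sum(estimates) / len(estimates)
sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))
print(round(mean, 1))                                   # 4.0 = beta: unbiased
print(round(sd, 2), round(sigma / math.sqrt(sxx), 2))   # simulated vs. formula 17.10
```

The average of the 20000 simulated slopes lands on β = 4, and their empirical standard deviation matches σ/√(∑i(xi − x̄)²) ≈ 0.26 from formula 17.10.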


17.5 How precise are our estimates

17.6 Multiple factors and or polynomial regression

17.7 Other applications
