Page 1: Chapter 4

Chapter 4

Probability: Studying Randomness

Page 2: Chapter 4

Randomness and Probability

• Random: Process where the outcome in a particular trial is not known in advance, although a distribution of outcomes may be known for a long series of repetitions

• Probability: The proportion of time a particular outcome will occur in a long series of repetitions of a random process

• Independence: When the outcome of one trial does not affect the probabilities of outcomes of subsequent trials

Page 3: Chapter 4

Probability Models

• Probability Model:
  – Listing of possible outcomes
  – Probability corresponding to each outcome

• Sample Space (S): Set of all possible outcomes of a random process

• Event: Outcome or set of outcomes of a random process (subset of S)

• Venn Diagram: Graphic description of a sample space and events

Page 4: Chapter 4

Rules of Probability

• The probability of an event A, denoted P(A), must lie between 0 and 1 (0 ≤ P(A) ≤ 1)

• For the sample space S, P(S) = 1

• Disjoint events have no common outcomes. For 2 disjoint events A and B, P(A or B) = P(A) + P(B)

• The complement of an event A is the event that A does not occur, denoted Aᶜ. P(A) + P(Aᶜ) = 1

• The probability of any event A is the sum of the probabilities of the individual outcomes that make up the event when the sample space is finite
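
The slides state these rules without a worked example. The following is a minimal Python sketch that checks them numerically; the fair six-sided die sample space is my own illustration, not from the slides.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die; each outcome has probability 1/6.
S = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    """P(event) = sum of the probabilities of the individual outcomes in the event."""
    return sum(p[o] for o in event)

A = {1, 2}            # event "roll a 1 or a 2"
B = {5, 6}            # event "roll a 5 or a 6" (disjoint from A)

print(prob(S))                          # P(S) = 1
print(prob(A | B), prob(A) + prob(B))   # disjoint events: both are 2/3
print(prob(A) + prob(S - A))            # complement rule: P(A) + P(A^c) = 1
```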

Page 5: Chapter 4

Assigning Probabilities to Events

• Assign probabilities to each individual outcome and add up the probabilities of all outcomes comprising the event

• When each outcome is equally likely, count the number of outcomes corresponding to the event and divide by the total number of outcomes

• Multiplication Rule: A and B are independent events if knowledge that one occurred does not affect the probability that the other has occurred. If A and B are independent, then P(A and B) = P(A)P(B) (see the sketch after this list)

• Multiplication rule extends to any finite number of events
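
A short sketch of the multiplication rule for independent events. The two-die example is hypothetical (not from the slides); independence is verified by counting equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Two independent rolls of a fair die: 36 equally likely ordered pairs.
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] % 2 == 0}   # first roll is even
B = {o for o in outcomes if o[1] >= 5}       # second roll is a 5 or 6

p_A = Fraction(len(A), len(outcomes))        # 1/2
p_B = Fraction(len(B), len(outcomes))        # 1/3
p_AB = Fraction(len(A & B), len(outcomes))   # P(A and B) by counting

# Independence: P(A and B) = P(A)P(B)
print(p_AB, p_A * p_B)                       # both 1/6
```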

Page 6: Chapter 4

Example - Casualties at Gettysburg

• Results from Battle of Gettysburg

                        Counts              Proportions
                     North     South      North     South
Killed                3155      2592     0.0331    0.0334
Wounded              14525     12709     0.1523    0.1640
Captured/Missing      5365     12227     0.0563    0.1578
Safe Survival        72324     49972     0.7584    0.6448
Total                95369     77500     1.0000    1.0000

Killed, Wounded, and Captured/Missing are considered casualties. What is the probability that a randomly selected Northern soldier was a casualty? A Southern soldier? Obtain the distribution across armies.
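
One way to work the question, as a short Python sketch using only the counts from the table above (the code itself is not part of the original slides).

```python
# Counts from the table above; casualties = Killed + Wounded + Captured/Missing.
north = {"Killed": 3155, "Wounded": 14525, "Captured/Missing": 5365, "Safe": 72324}
south = {"Killed": 2592, "Wounded": 12709, "Captured/Missing": 12227, "Safe": 49972}

def p_casualty(army):
    casualties = army["Killed"] + army["Wounded"] + army["Captured/Missing"]
    return casualties / sum(army.values())

print(f"P(casualty | North) = {p_casualty(north):.4f}")   # about 0.2416
print(f"P(casualty | South) = {p_casualty(south):.4f}")   # about 0.3552
```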

Page 7: Chapter 4

Random Variables

• Random Variable (RV): Variable that takes on the value of a numeric outcome of a random process

• Discrete RV: Can take on a finite (or countably infinite) set of possible outcomes

• Probability Distribution: List of values a random variable can take on and their corresponding probabilities
  – Individual probabilities must lie between 0 and 1
  – Probabilities sum to 1

• Notation:
  – Random variable: X
  – Values X can take on: x1, x2, …, xk
  – Probabilities: P(X=x1) = p1, …, P(X=xk) = pk

Page 8: Chapter 4

Example: Wars Begun by Year (1482-1939)

• Distribution of Numbers of wars started by year

• X = # of wars started in a randomly selected year

• Levels: x1=0, x2=1, x3=2, x4=3, x5=4

• Probability Distribution:

#Wars   Probability
0       0.5284
1       0.3231
2       0.1070
3       0.0328
4       0.0087

[Histogram: number of wars started per year, 1482-1939 (x-axis: number of wars, 0-4 and more; y-axis: number of years)]

Page 9: Chapter 4

Masters Golf Tournament 1st Round Scores

Score   Frequency   Probability
63          1       0.000288
64          2       0.000576
65          6       0.001728
66         16       0.004608
67         46       0.013249
68         67       0.019297
69        151       0.043491
70        238       0.068548
71        337       0.097062
72        428       0.123272
73        467       0.134505
74        498       0.143433
75        397       0.114343
76        293       0.084389
77        203       0.058468
78        125       0.036002
79         78       0.022465
80         50       0.014401
81         28       0.008065
82         17       0.004896
83          7       0.002016
84          7       0.002016
85          4       0.001152
86          3       0.000864
87          1       0.000288
88          2       0.000576

[Histogram: first-round score frequencies (x-axis: Score; y-axis: Frequency)]

Page 10: Chapter 4

Continuous Random Variables

• Variable can take on any value along a continuous range of numbers (interval)

• Probability distribution is described by a smooth density curve

• Probabilities of ranges of values for X correspond to areas under the density curve
  – Curve must lie on or above the horizontal axis
  – Total area under the curve is 1

• Special case: Normal distributions
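
A small standard-library Python sketch (mine, not from the slides) showing that a probability for a continuous random variable is an area under the density curve, using the standard normal density as the special case mentioned above. The helper names (normal_pdf, area_under_density, normal_cdf) are my own.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density curve of a Normal(mu, sigma) random variable."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_density(a, b, n=100_000):
    """Approximate P(a < X < b) as the area under the density curve (midpoint sum)."""
    width = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * width) * width for i in range(n))

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

print(area_under_density(-1, 1))        # about 0.6827
print(normal_cdf(1) - normal_cdf(-1))   # same probability, about 0.6827
print(area_under_density(-10, 10))      # total area is about 1
```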

Page 11: Chapter 4

Means and Variances of Random Variables

• Mean: Long-run average a random variable will take on (also the balance point of the probability distribution)

• Expected Value is another term for the mean; however, we do not really expect that a realization of X will necessarily be close to its mean. Notation: E(X)

• Mean of a discrete random variable:

E(X) = μ_X = x1p1 + x2p2 + … + xkpk = Σ xipi

Page 12: Chapter 4

Examples - Wars & Masters Golf

Wars:

#Wars   Probability   x*p
0       0.5284        0.0000
1       0.3231        0.3231
2       0.1070        0.2140
3       0.0328        0.0983
4       0.0087        0.0349
Sum     1.0000        0.6703

μ = 0.67

Masters Round 1 scores:

Score   prob       x*p
63      0.000288    0.0181
64      0.000576    0.0369
65      0.001728    0.1123
66      0.004608    0.3041
67      0.013249    0.8877
68      0.019297    1.3122
69      0.043491    3.0009
70      0.068548    4.7984
71      0.097062    6.8914
72      0.123272    8.8756
73      0.134505    9.8188
74      0.143433   10.6141
75      0.114343    8.5757
76      0.084389    6.4136
77      0.058468    4.5020
78      0.036002    2.8082
79      0.022465    1.7748
80      0.014401    1.1521
81      0.008065    0.6532
82      0.004896    0.4015
83      0.002016    0.1673
84      0.002016    0.1694
85      0.001152    0.0979
86      0.000864    0.0743
87      0.000288    0.0251
88      0.000576    0.0507
Sum     1          73.54

μ = 73.54
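
A minimal sketch reproducing the wars calculation above in Python; the code is illustrative and not part of the original slides.

```python
# Distribution of X = number of wars started in a randomly selected year (from the slide).
wars = {0: 0.5284, 1: 0.3231, 2: 0.1070, 3: 0.0328, 4: 0.0087}

# E(X) = sum of x * P(X = x)
mu = sum(x * p for x, p in wars.items())
print(round(mu, 4))   # 0.6703, i.e. mu is about 0.67 wars per year
```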

Page 13: Chapter 4

Statistical Estimation/Law of Large Numbers

• In practice we won't know μ but will want to estimate it

• We can select a sample of individuals and observe the sample mean x̄

• By selecting a large enough sample size we can be very confident that our sample mean will be arbitrarily close to the true parameter value

• Margin of error measures the upper bound (with a high level of confidence) on our sampling error. It decreases as the sample size increases
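
A rough simulation of the law of large numbers using the wars distribution from the earlier slide; the sample sizes and seed below are arbitrary choices of mine, not from the slides.

```python
import random

# Draw larger and larger samples from the wars distribution and watch the
# sample mean approach the true mean of about 0.67.
values = [0, 1, 2, 3, 4]
probs = [0.5284, 0.3231, 0.1070, 0.0328, 0.0087]

random.seed(1)
for n in (10, 100, 10_000, 1_000_000):
    sample = random.choices(values, weights=probs, k=n)
    print(n, sum(sample) / n)   # the sample mean x-bar closes in on 0.67
```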

Page 14: Chapter 4

Rules for Means

• Linear Transformations: a + bX (where a and b are constants): E(a+bX) = μ_{a+bX} = a + bμ_X

• Sums of random variables: X + Y (where X and Y are random variables): E(X+Y) = μ_{X+Y} = μ_X + μ_Y

• Linear Functions of Random Variables: E(a1X1 + … + anXn) = a1μ1 + … + anμn, where E(Xi) = μi
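
A quick numerical check of the linear-transformation rule using the wars distribution from the earlier slide; the constants a = 10 and b = 2 are arbitrary choices for illustration.

```python
# Check E(a + bX) = a + b*E(X) on the wars distribution from the earlier slide.
wars = {0: 0.5284, 1: 0.3231, 2: 0.1070, 3: 0.0328, 4: 0.0087}

def expect(f, dist):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

a, b = 10, 2
lhs = expect(lambda x: a + b * x, wars)    # E(a + bX), computed outcome by outcome
rhs = a + b * expect(lambda x: x, wars)    # a + b*E(X)
print(round(lhs, 4), round(rhs, 4))        # both 11.3406
```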

Page 15: Chapter 4

Example: Masters Golf Tournament

• Mean by Round (Note ordering):

μ1 = 73.54    μ2 = 73.07    μ3 = 73.76    μ4 = 73.91

Mean Score per hole (18) for round 1:

E((1/18)X1) = (1/18)μ1 = (1/18)(73.54) = 4.09

Mean Score versus par (72) for round 1:

E(X1 - 72) = μ1 - 72 = 73.54 - 72 = +1.54 (1.54 over par)

Mean Difference (Round 1 - Round 4):

E(X1 - X4) = μ1 - μ4 = 73.54 - 73.91 = -0.37

Mean Total Score:

E(X1+X2+X3+X4) = μ1 + μ2 + μ3 + μ4 = 73.54 + 73.07 + 73.76 + 73.91 = 294.28 (6.28 over par)

Page 16: Chapter 4

Variance of a Random Variable

• Variance: Measure of the spread of the probability distribution. Average squared deviation from the mean

• Standard Deviation: (Positive) Square Root of Variance

V(X) = σ_X² = (x1 - μ_X)²p1 + (x2 - μ_X)²p2 + … + (xk - μ_X)²pk = Σ (xi - μ_X)²pi

V(X) = E(X²) - μ_X²   (useful when X takes on integer values)

Rules for Variances (X, Y RVs; a, b constants)

V(a + bX) = σ²_{a+bX} = b²σ_X²

V(aX + bY) = σ²_{aX+bY} = a²σ_X² + b²σ_Y² + 2abρσ_Xσ_Y

where ρ is the correlation between X and Y

Page 17: Chapter 4

Variance of a Random Variable

V(a + bX) = σ²_{a+bX} = b²σ_X²

V(aX + bY) = σ²_{aX+bY} = a²σ_X² + b²σ_Y² + 2abρσ_Xσ_Y

where ρ is the correlation between X and Y

Special Cases:

• X and Y are independent (outcome of one does not alter the distribution of the other): ρ = 0, last term drops out

• a = b = 1 and ρ = 0: V(X+Y) = σ_X² + σ_Y²

• a = 1, b = -1 and ρ = 0: V(X-Y) = σ_X² + σ_Y²

• a = b = 1 and ρ ≠ 0: V(X+Y) = σ_X² + σ_Y² + 2ρσ_Xσ_Y

• a = 1, b = -1 and ρ ≠ 0: V(X-Y) = σ_X² + σ_Y² - 2ρσ_Xσ_Y

Page 18: Chapter 4

Wars & Masters (Round 1) Golf Scores

Masters Round 1 scores:

Score   prob       (x-μ)²     ((x-μ)²)p
63      0.000288   111.0916   0.031996
64      0.000576    91.0116   0.052426
65      0.001728    72.9316   0.126034
66      0.004608    56.8516   0.261989
67      0.013249    42.7716   0.566674
68      0.019297    30.6916   0.592263
69      0.043491    20.6116   0.896415
70      0.068548    12.5316   0.859021
71      0.097062     6.4516   0.626207
72      0.123272     2.3716   0.292352
73      0.134505     0.2916   0.039222
74      0.143433     0.2116   0.03035
75      0.114343     2.1316   0.243734
76      0.084389     6.0516   0.510691
77      0.058468    11.9716   0.699952
78      0.036002    19.8916   0.716143
79      0.022465    29.8116   0.669731
80      0.014401    41.7316   0.600974
81      0.008065    55.6516   0.448803
82      0.004896    71.5716   0.350437
83      0.002016    89.4916   0.180427
84      0.002016   109.4116   0.220588
85      0.001152   131.3316   0.151304
86      0.000864   155.2516   0.134146
87      0.000288   181.1716   0.052181
88      0.000576   209.0916   0.120444
Sum     1                     9.474503

Masters Round 1: σ² = 9.47

Wars:

Wars (x)   Prob     (x-μ)     (x-μ)²    ((x-μ)²)*p
0          0.5284   -0.6703    0.4493   0.2374
1          0.3231    0.3297    0.1087   0.0351
2          0.1070    1.3297    1.7681   0.1892
3          0.0328    2.3297    5.4275   0.1780
4          0.0087    3.3297   11.0869   0.0965
Sum        1.0000                       0.7362

Wars: σ² = 0.7362    σ = 0.8580
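
A short sketch reproducing the wars variance calculation (and the E(X²) - μ² shortcut) in Python; the code is illustrative only, not part of the slides.

```python
# Variance and standard deviation of the wars distribution (matches the table above).
wars = {0: 0.5284, 1: 0.3231, 2: 0.1070, 3: 0.0328, 4: 0.0087}

mu = sum(x * p for x, p in wars.items())                    # about 0.6703
var = sum((x - mu) ** 2 * p for x, p in wars.items())       # about 0.7362
sd = var ** 0.5                                             # about 0.858

# Shortcut form: V(X) = E(X^2) - mu^2
var_shortcut = sum(x ** 2 * p for x, p in wars.items()) - mu ** 2

print(round(var, 4), round(sd, 4), round(var_shortcut, 4))
```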

Page 19: Chapter 4

Masters Scores (Rounds 1 & 4)

μ1 = 73.54    μ4 = 73.91    σ1² = 9.48    σ4² = 11.95    ρ = 0.24

• Variance of Round 1 scores vs Par: V(X1 - 72) = σ1² = 9.48

• Variance of Sum and Difference of Round 1 and Round 4 Scores:

Sum: V(X1 + X4) = σ²_{X1+X4} = σ1² + σ4² + 2ρσ1σ4 = 9.48 + 11.95 + 2(0.24)√((9.48)(11.95)) = 9.48 + 11.95 + 5.11 = 26.54

σ_{X1+X4} = √26.54 = 5.15

Difference: V(X1 - X4) = σ²_{X1-X4} = σ1² + σ4² - 2ρσ1σ4 = 9.48 + 11.95 - 2(0.24)√((9.48)(11.95)) = 9.48 + 11.95 - 5.11 = 16.32

σ_{X1-X4} = √16.32 = 4.04
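
A minimal check of these sum and difference calculations in Python, using only the variances and correlation quoted on the slide; the code itself is not part of the original.

```python
import math

# Round 1 and Round 4 variances and their correlation, as given on the slide.
var1, var4, rho = 9.48, 11.95, 0.24
cov_term = 2 * rho * math.sqrt(var1) * math.sqrt(var4)   # 2*rho*sigma1*sigma4, about 5.11

var_sum = var1 + var4 + cov_term    # V(X1 + X4), about 26.54
var_diff = var1 + var4 - cov_term   # V(X1 - X4), about 16.32

print(round(var_sum, 2), round(math.sqrt(var_sum), 2))     # 26.54 5.15
print(round(var_diff, 2), round(math.sqrt(var_diff), 2))   # 16.32 4.04
```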

Page 20: Chapter 4

General Rules of Probability

• Union of a set of events: Event that any (at least one) of the events occur

• Disjoint events: Events that share no common sample points. If A, B, and C are pairwise disjoint, the probability of their union is P(A) + P(B) + P(C)

• Intersection of two (or more) events: The event that both (all) events occur

• Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)

• Conditional Probability: The probability B occurs given A has occurred: P(B|A)

• Multiplication Rule (generalized to conditional probability): P(A and B) = P(A)P(B|A) = P(B)P(A|B)
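
A small sketch (my own two-dice example, not from the slides) checking the addition rule for non-disjoint events and the generalized multiplication rule.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair die rolls.
outcomes = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if o[0] == 6}       # first die shows a 6
B = {o for o in outcomes if sum(o) >= 10}    # total is at least 10 (overlaps A)

# Addition rule for events that are not disjoint.
print(prob(A | B), prob(A) + prob(B) - prob(A & B))   # both 1/4

# Conditional probability and the generalized multiplication rule.
p_B_given_A = prob(A & B) / prob(A)                   # 1/2
print(prob(A & B), prob(A) * p_B_given_A)             # both 1/12
```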

Page 21: Chapter 4

Conditional Probability

• Generally we are interested in the case where one event precedes the other temporally (though this is not necessary)

• When P(A) > 0 (otherwise the definition is trivial):

P(B|A) = P(A and B) / P(A)        P(A|B) = P(A and B) / P(B)

• Contingency Table: Table that cross-classifies individuals or probabilities across 2 or more event classifications

• Tree Diagram: Graphical description of cross-classification of 2 or more events

Page 22: Chapter 4

John Snow London Cholera Death Study

• 2 Water Companies (Let D be the event of death):
  – Southwark & Vauxhall (S): 264913 customers, 3702 deaths
  – Lambeth (L): 171363 customers, 407 deaths
  – Overall: 436276 customers, 4109 deaths

P(D) = 4109/436276 = .0094   (94 per 10000 people)

P(D|S) = 3702/264913 = .0140   (140 per 10000 people)

P(D|L) = 407/171363 = .0024   (24 per 10000 people)

Note that the probability of death is almost 6 times higher for S&V customers than for Lambeth customers (this was important in showing how cholera spread)

Page 23: Chapter 4

John Snow London Cholera Death Study

                        Cholera Death
Water Company     Yes              No                 Total
S&V               3702 (.0085)     261211 (.5987)     264913 (.6072)
Lambeth            407 (.0009)     170956 (.3919)     171363 (.3928)
Total             4109 (.0094)     432167 (.9906)     436276 (1.0000)

Contingency Table with joint probabilities (in body of table) and marginal probabilities (on edge of table)
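
A short Python sketch that rebuilds the joint and marginal probabilities from the counts above and applies the conditional-probability definition; illustrative only, not part of the slides.

```python
# Joint counts from the contingency table above.
counts = {
    ("S&V", "death"): 3702,    ("S&V", "no death"): 261211,
    ("Lambeth", "death"): 407, ("Lambeth", "no death"): 170956,
}
total = sum(counts.values())                     # 436276

# Joint probabilities (body of the table) and marginal probabilities (edges).
joint = {k: v / total for k, v in counts.items()}
p_death = joint[("S&V", "death")] + joint[("Lambeth", "death")]   # about .0094
p_sv = joint[("S&V", "death")] + joint[("S&V", "no death")]       # about .6072

# Conditional probability: P(D | S&V) = P(D and S&V) / P(S&V), about .0140
p_death_given_sv = joint[("S&V", "death")] / p_sv
print(round(p_death, 4), round(p_sv, 4), round(p_death_given_sv, 4))
```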

Page 24: Chapter 4

John Snow London Cholera Death Study

Water Company                Death                Joint probability

S&V (.6072)       ── D  (.0140)   →   .6072 × .0140 = .0085
                  └─ Dᶜ (.9860)   →   .6072 × .9860 = .5987

Lambeth (.3928)   ── D  (.0024)   →   .3928 × .0024 = .0009
                  └─ Dᶜ (.9976)   →   .3928 × .9976 = .3919

Tree Diagram obtaining joint probabilities by multiplication rule

Page 25: Chapter 4

Example: Florida Lotto

• You select 6 distinct numbers from 1 to 53 (no replacement)

• State randomly draws 6 numbers from 1 to 53

• Probability you match all 6 numbers:

– First state draw: P(match 1st) = 6/53

– Given you match 1st, you have 5 left and state has 52 left: P(match 2nd given matched 1st) = 5/52

– Process continues: P(match 3rd given 1&2) = 4/51

– P(match 4th given 1&2&3) = 3/50

– P(match 5th given 1&2&3&4) = 2/49

– P(match 6th given 1&2&3&4&5) = 1/48

Multiplication rule:

P(match all) = (6/53)(5/52)(4/51)(3/50)(2/49)(1/48) = 1/22,957,480
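
A quick check of this product in Python; the second line confirms the same answer by counting combinations with math.comb (the code is illustrative, not part of the slides).

```python
from fractions import Fraction
from math import comb

# Multiply the conditional probabilities: 6/53 * 5/52 * 4/51 * 3/50 * 2/49 * 1/48.
p = Fraction(1)
for mine, remaining in zip(range(6, 0, -1), range(53, 47, -1)):
    p *= Fraction(mine, remaining)

print(p)                          # 1/22957480
print(Fraction(1, comb(53, 6)))   # same answer: one winning set out of C(53, 6)
```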

Page 26: Chapter 4

Bayes’s Rule - Updating Probabilities

• Let A1,…,Ak be a set of events that partition a sample space such that (mutually exclusive and exhaustive):
  – each set has known P(Ai) > 0 (each event can occur)
  – for any 2 sets Ai and Aj, P(Ai and Aj) = 0 (events are disjoint)
  – P(A1) + … + P(Ak) = 1 (each outcome belongs to one of the events)

• If C is an event such that:
  – 0 < P(C) < 1 (C can occur, but will not necessarily occur)
  – We know the probability C will occur given each event Ai: P(C|Ai)

• Then we can compute probability of Ai given C occurred:

P(Ai|C) = P(C|Ai)P(Ai) / [P(C|A1)P(A1) + … + P(C|Ak)P(Ak)] = P(Ai and C) / P(C)

Page 27: Chapter 4

Northern Army at Gettysburg

Regiment       Label   Initial #   Casualties   P(Ai)    P(C|Ai)   P(C|Ai)*P(Ai)   P(Ai|C)
I Corps        A1      10022       6059         0.1051   0.6046    0.0635          0.2630
II Corps       A2      12884       4369         0.1351   0.3391    0.0458          0.1896
III Corps      A3      11924       4211         0.1250   0.3532    0.0442          0.1828
V Corps        A4      12509       2187         0.1312   0.1748    0.0229          0.0949
VI Corps       A5      15555        242         0.1631   0.0156    0.0025          0.0105
XI Corps       A6       9839       3801         0.1032   0.3863    0.0399          0.1650
XII Corps      A7       8589       1082         0.0901   0.1260    0.0113          0.0470
Cav Corps      A8      11501        852         0.1206   0.0741    0.0089          0.0370
Arty Reserve   A9       2546        242         0.0267   0.0951    0.0025          0.0105
Sum                    95369      23045         1                  0.2416 = P(C)   1.0002

• Regiments: partition of soldiers (A1,…,A9). Casualty: event C

• P(Ai) = (size of regiment) / (total soldiers) = (Column 3)/95369

• P(C|Ai) = (# casualties) / (regiment size) = (Col 4)/(Col 3)

• P(C|Ai) P(Ai) = P(Ai and C) = (Col 5)*(Col 6)

• P(C) = sum(Col 7)

• P(Ai|C) = P(Ai and C) / P(C) = (Col 7)/.2416
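
A sketch applying Bayes's rule to the regiment data above; the code is illustrative, not part of the original slides (small rounding differences vs the table are expected).

```python
# Regiment sizes and casualty counts from the table above.
regiments = {
    "I Corps": (10022, 6059),   "II Corps": (12884, 4369),  "III Corps": (11924, 4211),
    "V Corps": (12509, 2187),   "VI Corps": (15555, 242),   "XI Corps": (9839, 3801),
    "XII Corps": (8589, 1082),  "Cav Corps": (11501, 852),  "Arty Reserve": (2546, 242),
}
total = sum(size for size, _ in regiments.values())                     # 95369

prior = {r: size / total for r, (size, _) in regiments.items()}         # P(Ai)
likelihood = {r: cas / size for r, (size, cas) in regiments.items()}    # P(C|Ai)

# Law of total probability: P(C) = sum of P(C|Ai)*P(Ai)
p_c = sum(likelihood[r] * prior[r] for r in regiments)                  # about 0.2416

# Bayes's rule: P(Ai|C) = P(C|Ai)*P(Ai) / P(C)
posterior = {r: likelihood[r] * prior[r] / p_c for r in regiments}
print(round(p_c, 4), round(posterior["I Corps"], 4))                    # about 0.2416 and 0.2629
```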

Page 28: Chapter 4

Independent Events

• Two events A and B are independent if P(B|A) = P(B) and P(A|B) = P(A); otherwise they are dependent (not independent).

• Cholera Example:

P(D) = .0094    P(D|S) = .0140    P(D|L) = .0024

Not independent (which firm would you prefer?)

• Union Army Example:

P(C) = .2416 P(C|A1)=.6046 P(C|A5)=.0156

Not independent: almost 40 times higher risk for A1 (I Corps) than for A5 (VI Corps)