Conditional distributions
Will Monroe, July 26, 2017
with materials by Mehran Sahami and Chris Piech
Independence of discrete random variables
Two random variables are independent if knowing the value of one tells you nothing about the value of the other (for all values!).
X ⊥ Y iff ∀x, y:
P(X=x, Y=y) = P(X=x) P(Y=y)
- or -
p_{X,Y}(x, y) = p_X(x) p_Y(y)
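This factorization can be checked mechanically: compute the marginals from a joint PMF and test the product rule at every (x, y) pair. The joint table below is a made-up example in which X and Y happen to be independent.

```python
# Check independence of two discrete RVs by testing whether the joint PMF
# factors into the product of its marginals at every (x, y) pair.
# The joint PMF is a hypothetical example, not data from the slides.
import itertools

p_joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

# Marginals: p_X(x) sums over y; p_Y(y) sums over x
p_x, p_y = {}, {}
for (x, y), p in p_joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

independent = all(
    abs(p_joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
    for x, y in itertools.product(p_x, p_y)
)
print(independent)  # True: e.g. 0.12 = 0.4 * 0.3
```

Changing any single cell (while keeping the table summing to 1) breaks the factorization, and `independent` becomes `False`.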
Independence of continuous random variables
Two random variables are independent if knowing the value of one tells you nothing about the value of the other (for all values!).
X ⊥ Y iff ∀x, y:
f_{X,Y}(x, y) = f_X(x) f_Y(y)
- or -
F_{X,Y}(x, y) = F_X(x) F_Y(y)
- or -
f_{X,Y}(x, y) = g(x) h(y)
Review: Sum of independent binomials
X ~ Bin(n, p): number of heads in first n flips
Y ~ Bin(m, p): number of heads in next m flips
X + Y ~ Bin(n + m, p)

More generally, if all X_i are independent:
X_i ~ Bin(n_i, p) ⇒ Σ_{i=1}^N X_i ~ Bin(Σ_{i=1}^N n_i, p)
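A quick numerical check of this fact, assuming Python with only the standard library: convolve the PMFs of Bin(n, p) and Bin(m, p) and compare against the PMF of Bin(n + m, p). The values n = 6, m = 9, p = 0.3 are arbitrary.

```python
# Verify Bin(n, p) + Bin(m, p) = Bin(n+m, p) for independent binomials
# sharing the same p, by convolving their PMFs.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p = 6, 9, 0.3
conv = [
    sum(binom_pmf(j, n, p) * binom_pmf(k - j, m, p)
        for j in range(max(0, k - m), min(n, k) + 1))
    for k in range(n + m + 1)
]
direct = [binom_pmf(k, n + m, p) for k in range(n + m + 1)]
max_err = max(abs(a - b) for a, b in zip(conv, direct))
print(max_err)  # ~0, up to floating-point error
```

Note this only works because both binomials share the same p; the sum of binomials with different success probabilities is not binomial.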
Review: Sum of independent Poissons
X ~ Poi(λ₁): number of chips in first cookie (λ₁ chips/cookie)
Y ~ Poi(λ₂): number of chips in second cookie (λ₂ chips/cookie)
X + Y ~ Poi(λ₁ + λ₂)

More generally, if all X_i are independent:
X_i ~ Poi(λ_i) ⇒ Σ_{i=1}^N X_i ~ Poi(Σ_{i=1}^N λ_i)
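The Poisson version admits the same style of check. A sketch, with arbitrary rates λ₁ = 2.0 and λ₂ = 3.5, comparing the convolution of the two PMFs to Poi(λ₁ + λ₂) term by term:

```python
# Verify X + Y ~ Poi(λ1 + λ2) for independent Poissons by convolving
# the PMFs of Poi(λ1) and Poi(λ2).
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam1, lam2 = 2.0, 3.5
errs = []
for n in range(20):
    conv = sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2)
               for k in range(n + 1))
    errs.append(abs(conv - poisson_pmf(n, lam1 + lam2)))
max_err = max(errs)
print(max_err)  # ~0, up to floating-point error
```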
Review: Convolution
A convolution is the distribution of the sum of two independent random variables.
f_{X+Y}(a) = ∫_{−∞}^{∞} f_X(a−y) f_Y(y) dy
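The integral can be approximated on a grid. A minimal sketch for X, Y ~ Uni(0, 1): the sum of two independent Uni(0, 1) variables has the triangular density f(a) = a on [0, 1] and 2 − a on [1, 2], so the Riemann sum of the convolution integral should recover, e.g., f(0.5) = 0.5 and f(1.5) = 0.5.

```python
# Approximate f_{X+Y}(a) = ∫ f_X(a−y) f_Y(y) dy by a midpoint Riemann sum,
# for X, Y ~ Uni(0, 1). Since f_Y(y) = 1 on [0, 1], the integral reduces
# to integrating f_X(a − y) over y in [0, 1].
def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(a, n=10000):
    step = 1.0 / n
    return sum(uniform_pdf(a - (i + 0.5) * step) * step for i in range(n))

approx = convolve_at(0.5)   # triangular density gives f(0.5) = 0.5
approx2 = convolve_at(1.5)  # triangular density gives f(1.5) = 0.5
print(approx, approx2)
```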
Review: Sum of independent normals
X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²)
X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)

More generally, if all X_i are independent:
X_i ~ N(μ_i, σ_i²) ⇒ Σ_{i=1}^N X_i ~ N(Σ_{i=1}^N μ_i, Σ_{i=1}^N σ_i²)
Virus infections
150 computers in a dorm:
50 Macs (each independently infected with probability 0.1)
100 PCs (each independently infected with probability 0.4)
What is P(≥ 40 machines infected)?

M: # infected Macs, M ~ Bin(50, 0.1) ≈ X ~ N(5, 4.5)
P: # infected PCs, P ~ Bin(100, 0.4) ≈ Y ~ N(40, 24)
W = X + Y ~ N(5 + 40, 4.5 + 24) = N(45, 28.5)
P(M + P ≥ 40) ≈ P(X + Y ≥ 39.5)   (continuity correction)
P(W ≥ 39.5) = P((W − 45)/√28.5 ≥ (39.5 − 45)/√28.5) ≈ 1 − Φ(−1.03) ≈ 0.8485
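The slide's computation can be reproduced in a few lines, assuming Python with the standard library (the standard normal CDF is expressed via `math.erf`):

```python
# Normal approximation for P(≥ 40 machines infected):
# W = X + Y ~ N(45, 28.5), with a continuity correction at 39.5.
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 5 + 40       # sum of means
var = 4.5 + 24    # sum of variances
z = (39.5 - mu) / sqrt(var)
answer = 1 - norm_cdf(z)
print(answer)  # ≈ 0.8485, matching the slide
```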
Review: Conditional probability
The conditional probability P(E | F) is the probability that E happens, given that F has happened. F is the new sample space.
P(E|F) = P(EF) / P(F)
[Venn diagram: events E and F overlapping (region EF) inside sample space S]
Discrete conditional distributions
The value of a random variable, conditioned on the value of some other random variable, has a probability distribution.
p_{X|Y}(x|y) = P(X=x, Y=y) / P(Y=y) = p_{X,Y}(x, y) / p_Y(y)
Conditionals from a joint PMF
p_{R|Y}(1|3) = P(R=1 | Y=3) = P(R=1, Y=3) / P(Y=3) = 0.19 / 0.50 = 0.38
[joint PMF table of R (values 0–2) and Y (values 1–5)]
Conditionals from a joint PMF
p_{R|Y}(r|y) = P(R=r, Y=y) / P(Y=y)
[joint PMF table of R (values 0–2) and Y (values 1–5), shown before and after conditioning]
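The table lookup amounts to dividing one column of the joint PMF by its column sum. A sketch with a hypothetical joint table (the slides' full table isn't reproduced here; the values are chosen so that P(R=1, Y=3) = 0.19 and P(Y=3) = 0.50, matching the worked number):

```python
# Compute a conditional PMF p_{R|Y}(r|y) from a joint PMF table.
# The table entries are hypothetical, constructed to match the slide's
# worked ratio 0.19 / 0.50.
p_joint = {
    (0, 3): 0.21, (1, 3): 0.19, (2, 3): 0.10,   # P(Y=3) = 0.50
    (0, 4): 0.25, (1, 4): 0.15, (2, 4): 0.10,   # P(Y=4) = 0.50
}

def conditional_r_given_y(y):
    p_y = sum(p for (r, yy), p in p_joint.items() if yy == y)
    return {r: p / p_y for (r, yy), p in p_joint.items() if yy == y}

cond = conditional_r_given_y(3)
print(cond[1])  # 0.19 / 0.50 = 0.38
```

Note that each conditional PMF sums to 1: conditioning renormalizes the column into a valid distribution.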
More web server hits
Your web server gets X requests from humans and Y requests from bots in a day, independently.
X ~ Poi(λ₁), Y ~ Poi(λ₂), so X + Y ~ Poi(λ₁ + λ₂)
P(X=k | X+Y=n) = P(X=k, Y=n−k) / P(X+Y=n)
= P(X=k) P(Y=n−k) / P(X+Y=n)   (independence)
= [e^{−λ₁} λ₁^k / k!] · [e^{−λ₂} λ₂^{n−k} / (n−k)!] · [n! / (e^{−(λ₁+λ₂)} (λ₁+λ₂)^n)]
= [n! / (k! (n−k)!)] · λ₁^k λ₂^{n−k} / (λ₁+λ₂)^n
= (n choose k) (λ₁/(λ₁+λ₂))^k (λ₂/(λ₁+λ₂))^{n−k}
→ (X | X+Y=n) ~ Bin(n, λ₁/(λ₁+λ₂))
Continuous conditional distributions
The value of a random variable, conditioned on the value of some other random variable, has a probability distribution.
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
Ratios of continuous probabilities
The probability of an exact value for a continuous random variable is 0.
But ratios of these probabilities are still well-defined!
P(X=a) / P(X=b) = f_X(a) / f_X(b)
Defining the undefined
P(X=a) / P(X=b) = P(X≈a) / P(X≈b)
= lim_{ε→0} P(a−ε ≤ X ≤ a+ε) / P(b−ε ≤ X ≤ b+ε)
= lim_{ε→0} [∫_{a−ε}^{a+ε} f_X(x) dx] / [∫_{b−ε}^{b+ε} f_X(x) dx]
= 2ε f_X(a) / (2ε f_X(b)) = f_X(a) / f_X(b)
(each integral is ≈ 2ε times the density at the interval's center when ε is small)
[figure: density curve with 2ε-wide strips of height f_X(a) and f_X(b)]
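This limit argument is easy to watch numerically: as ε shrinks, the ratio of interval probabilities converges to the ratio of densities. A sketch using the standard normal as the example density (CDF via `math.erf`):

```python
# Compare the ratio of small-interval probabilities to the ratio of
# densities for the standard normal, at a = 0 and b = 1.
from math import erf, exp, pi, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def norm_pdf(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

a, b = 0.0, 1.0
density_ratio = norm_pdf(a) / norm_pdf(b)
eps = 1e-5
interval_ratio = ((norm_cdf(a + eps) - norm_cdf(a - eps))
                  / (norm_cdf(b + eps) - norm_cdf(b - eps)))
print(density_ratio, interval_ratio)  # both ≈ e^{1/2} ≈ 1.6487
```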
Conditioning on a continuous RV
f_{X|Y}(x|y) = P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y) = f_{X,Y}(x, y) / f_Y(y)
Mixing discrete and continuous
P(a₁ ≤ X ≤ a₂, b₁ ≤ N ≤ b₂) = ∫_{a₁}^{a₂} Σ_{n=b₁}^{b₂} f_{X,N}(x, n) dx

p_{N|X}(n|x) = f_{X,N}(x, n) / f_X(x)
f_{X|N}(x|n) = f_{X,N}(x, n) / p_N(n)
Discrete + Continuous Bayes
f_{X|N}(x|n) = p_{N|X}(n|x) f_X(x) / p_N(n)
p_{N|X}(n|x) = f_{X|N}(x|n) p_N(n) / f_X(x)
Break time!
The probability of a probability
Beta random variable
A beta random variable models the probability of a trial's success, given previous trials. The PDF/CDF let you compute probabilities of probabilities!
X ~ Beta(a, b)
f_X(x) = C x^{a−1} (1−x)^{b−1} if 0 < x < 1; 0 otherwise
Estimating an unknown probability
You roll a loaded die N times, get A sixes (and N − A non-sixes).
What's the probability that the die is loaded such that sixes come up less than 1/4 of the time?

X: probability of getting a six
A: number of sixes in N rolls
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
with A | X ~ Bin(N, X) and X ~ Uni(0, 1) ("I know nothing")
= [1 / P(A=a)] · (N choose a) x^a (1−x)^{N−a} · 1
= C x^a (1−x)^{N−a}
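The posterior C x^a (1−x)^{N−a} is Beta(a+1, N−a+1), so the slide's question becomes a Beta CDF evaluation. A sketch with made-up data (N = 30 rolls, a = 10 sixes); for integer parameters the Beta CDF equals a binomial tail probability (I_t(α, β) = P(Bin(α+β−1, t) ≥ α)), so no numerical integration is needed:

```python
# P(X < 1/4 | data) under a Uni(0, 1) prior: the posterior is
# Beta(a+1, N−a+1), and its CDF at t is a binomial tail probability.
from math import comb

def beta_cdf(t, alpha, beta):
    # Regularized incomplete beta for integer alpha, beta
    n = alpha + beta - 1
    return sum(comb(n, k) * t**k * (1 - t)**(n - k)
               for k in range(alpha, n + 1))

N, a = 30, 10   # hypothetical example data
prob = beta_cdf(0.25, a + 1, N - a + 1)  # P(X < 1/4 | A = a)
print(prob)
```

As a sanity check, `beta_cdf(t, 1, 1)` returns `t`, i.e. Beta(1, 1) really is Uni(0, 1).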
Beta: Fact sheet
X ~ Beta(a, b)
a: number of successes + 1; b: number of failures + 1; x: probability of success
PDF: f_X(x) = C x^{a−1} (1−x)^{b−1} if 0 < x < 1; 0 otherwise
expectation: E[X] = a / (a + b)
variance: Var(X) = ab / ((a + b)² (a + b + 1))
Beta takes many forms
X ~ Beta(1, 1):
f_X(x) = C x^{1−1} (1−x)^{1−1} = C x⁰ (1−x)⁰ = C if 0 < x < 1
∫₀¹ C dx = 1 ⇒ C = 1
⇒ Beta(1, 1) = Uni(0, 1)

Conjugate distribution
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
X ~ Beta(1, 1): "prior"
P(A=a | X=x): "likelihood"
P(A=a): "normalizing constant"
X | A ~ Beta(a + 1, N − a + 1): "posterior"
Subjective priors
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
X ~ Beta(1, 1) "prior" → X | A ~ Beta(a + 1, N − a + 1) "posterior"

How did we decide on Beta(1, 1) for the prior?
Beta(1, 1): "we haven't seen any rolls yet."
Beta(4, 1): "we've seen 3 sixes and 0 non-sixes."
Beta(2, 6): "we've seen 1 six and 5 non-sixes."
Beta prior = "imaginary" previous trials
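The pseudo-count view makes the posterior update trivial to write down: data just add to the prior's counts. A minimal sketch (the observed counts are made up for illustration):

```python
# "Beta prior = imaginary previous trials": starting from Beta(a0, b0)
# and observing s successes and f failures, the posterior is
# Beta(a0 + s, b0 + f).
def beta_posterior(prior, successes, failures):
    a0, b0 = prior
    return (a0 + successes, b0 + failures)

# Beta(2, 6) prior ("we've seen 1 six and 5 non-sixes"), then we observe
# 3 sixes and 9 non-sixes:
posterior = beta_posterior((2, 6), successes=3, failures=9)
print(posterior)  # (5, 15)
```

With no data the posterior equals the prior, which is why Beta(1, 1) is the natural "I know nothing" starting point.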
Advanced: Dirichlet distribution
Beta is the distribution ("conjugate prior") for the p in the Bernoulli and binomial.
Dirichlet is the distribution for the p₁, p₂, … in the multinomial.
X₁, X₂, … ~ Dir(a₁, a₂, …)
f_{X₁,X₂,…}(x₁, x₂, …) = C x₁^{a₁−1} x₂^{a₂−1} ⋯ if 0 < x₁, x₂, … < 1 and x₁ + x₂ + ⋯ = 1; 0 otherwise
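A Dirichlet sample is a whole probability vector. One standard construction (not specific to these slides) draws independent Gamma(a_i, 1) variables and normalizes them; this needs only the standard library:

```python
# Draw one sample from Dir(a1, ..., ak) by normalizing independent
# Gamma(ai, 1) draws — a standard construction.
import random

def dirichlet_sample(alphas, seed=0):
    rng = random.Random(seed)
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

sample = dirichlet_sample([2.0, 3.0, 5.0])
print(sample)  # three probabilities that sum to 1
```

The sampled vector satisfies exactly the support condition in the density: each component lies in (0, 1) and the components sum to 1.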
Frequentists vs. Bayesians
image: Eric Kilby
Frequentist
A probability is the (real or theoretical) result of a number of experiments.
All probabilities are based on objective experiences.
Bayesian
A probability is a belief.
All probabilities are based on subjective priors.
(It’s not really a debate anymore—real statisticians / data scientists / machine learning practitioners can and do think both ways!)