Conditional distributions
Will Monroe, July 26, 2017
with materials by Mehran Sahami and Chris Piech
Independence of discrete random variables
Two random variables are independent if knowing the value of one tells you nothing about the value of the other (for all values!).
X ⊥ Y iff ∀x, y:
P(X=x, Y=y) = P(X=x) P(Y=y)
- or -
p_{X,Y}(x, y) = p_X(x) p_Y(y)
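This factorization can be checked mechanically: compute the marginals from a joint PMF and test the product rule at every (x, y) pair. The joint table below is a made-up example in which X and Y happen to be independent.

```python
# Check independence of two discrete RVs by testing whether the joint PMF
# factors into the product of its marginals at every (x, y) pair.
# The joint PMF is a hypothetical example, not data from the slides.
import itertools

p_joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

# Marginals: p_X(x) sums over y; p_Y(y) sums over x
p_x, p_y = {}, {}
for (x, y), p in p_joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

independent = all(
    abs(p_joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
    for x, y in itertools.product(p_x, p_y)
)
print(independent)  # True: e.g. 0.12 = 0.4 * 0.3
```

Changing any single cell (while keeping the table summing to 1) breaks the factorization, and `independent` becomes `False`.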
Independence of continuous random variables
Two random variables are independent if knowing the value of one tells you nothing about the value of the other (for all values!).
X ⊥ Y iff ∀x, y:
f_{X,Y}(x, y) = f_X(x) f_Y(y)
- or -
F_{X,Y}(x, y) = F_X(x) F_Y(y)
- or -
f_{X,Y}(x, y) = g(x) h(y)
Review: Sum of independent binomials
X ~ Bin(n, p): number of heads in first n flips
Y ~ Bin(m, p): number of heads in next m flips
X + Y ~ Bin(n + m, p)

More generally, if all X_i are independent:
X_i ~ Bin(n_i, p) ⇒ Σ_{i=1}^N X_i ~ Bin(Σ_{i=1}^N n_i, p)
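A quick numerical check of this fact, assuming Python with only the standard library: convolve the PMFs of Bin(n, p) and Bin(m, p) and compare against the PMF of Bin(n + m, p). The values n = 6, m = 9, p = 0.3 are arbitrary.

```python
# Verify Bin(n, p) + Bin(m, p) = Bin(n+m, p) for independent binomials
# sharing the same p, by convolving their PMFs.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p = 6, 9, 0.3
conv = [
    sum(binom_pmf(j, n, p) * binom_pmf(k - j, m, p)
        for j in range(max(0, k - m), min(n, k) + 1))
    for k in range(n + m + 1)
]
direct = [binom_pmf(k, n + m, p) for k in range(n + m + 1)]
max_err = max(abs(a - b) for a, b in zip(conv, direct))
print(max_err)  # ~0, up to floating-point error
```

Note this only works because both binomials share the same p; the sum of binomials with different success probabilities is not binomial.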
Review: Sum of independent Poissons
X ~ Poi(λ₁): number of chips in first cookie (λ₁ chips/cookie)
Y ~ Poi(λ₂): number of chips in second cookie (λ₂ chips/cookie)
X + Y ~ Poi(λ₁ + λ₂)

More generally, if all X_i are independent:
X_i ~ Poi(λ_i) ⇒ Σ_{i=1}^N X_i ~ Poi(Σ_{i=1}^N λ_i)
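The Poisson version admits the same style of check. A sketch, with arbitrary rates λ₁ = 2.0 and λ₂ = 3.5, comparing the convolution of the two PMFs to Poi(λ₁ + λ₂) term by term:

```python
# Verify X + Y ~ Poi(λ1 + λ2) for independent Poissons by convolving
# the PMFs of Poi(λ1) and Poi(λ2).
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam1, lam2 = 2.0, 3.5
errs = []
for n in range(20):
    conv = sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2)
               for k in range(n + 1))
    errs.append(abs(conv - poisson_pmf(n, lam1 + lam2)))
max_err = max(errs)
print(max_err)  # ~0, up to floating-point error
```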
Review: Convolution
A convolution is the distribution of the sum of two independent random variables.
f_{X+Y}(a) = ∫_{−∞}^{∞} f_X(a−y) f_Y(y) dy
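The integral can be approximated on a grid. A minimal sketch for X, Y ~ Uni(0, 1): the sum of two independent Uni(0, 1) variables has the triangular density f(a) = a on [0, 1] and 2 − a on [1, 2], so the Riemann sum of the convolution integral should recover, e.g., f(0.5) = 0.5 and f(1.5) = 0.5.

```python
# Approximate f_{X+Y}(a) = ∫ f_X(a−y) f_Y(y) dy by a midpoint Riemann sum,
# for X, Y ~ Uni(0, 1). Since f_Y(y) = 1 on [0, 1], the integral reduces
# to integrating f_X(a − y) over y in [0, 1].
def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(a, n=10000):
    step = 1.0 / n
    return sum(uniform_pdf(a - (i + 0.5) * step) * step for i in range(n))

approx = convolve_at(0.5)   # triangular density gives f(0.5) = 0.5
approx2 = convolve_at(1.5)  # triangular density gives f(1.5) = 0.5
print(approx, approx2)
```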
Review: Sum of independent normals
X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²)
X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)

More generally, if all X_i are independent:
X_i ~ N(μ_i, σ_i²) ⇒ Σ_{i=1}^N X_i ~ N(Σ_{i=1}^N μ_i, Σ_{i=1}^N σ_i²)
Virus infections
150 computers in a dorm:
50 Macs (each independently infected with probability 0.1)
100 PCs (each independently infected with probability 0.4)
What is P(≥ 40 machines infected)?

M: # infected Macs, M ~ Bin(50, 0.1) ≈ X ~ N(5, 4.5)
P: # infected PCs, P ~ Bin(100, 0.4) ≈ Y ~ N(40, 24)
W = X + Y ~ N(5 + 40, 4.5 + 24) = N(45, 28.5)
P(M + P ≥ 40) ≈ P(X + Y ≥ 39.5)   (continuity correction)
P(W ≥ 39.5) = P((W − 45)/√28.5 ≥ (39.5 − 45)/√28.5) ≈ 1 − Φ(−1.03) ≈ 0.8485
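The slide's computation can be reproduced in a few lines, assuming Python with the standard library (the standard normal CDF is expressed via `math.erf`):

```python
# Normal approximation for P(≥ 40 machines infected):
# W = X + Y ~ N(45, 28.5), with a continuity correction at 39.5.
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 5 + 40       # sum of means
var = 4.5 + 24    # sum of variances
z = (39.5 - mu) / sqrt(var)
answer = 1 - norm_cdf(z)
print(answer)  # ≈ 0.8485, matching the slide
```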
Review: Conditional probability
The conditional probability P(E | F) is the probability that E happens, given that F has happened. F is the new sample space.
P(E|F) = P(EF) / P(F)
[Venn diagram: events E and F overlapping (region EF) inside sample space S]
Discrete conditional distributions
The value of a random variable, conditioned on the value of some other random variable, has a probability distribution.
p_{X|Y}(x|y) = P(X=x, Y=y) / P(Y=y) = p_{X,Y}(x, y) / p_Y(y)
Conditionals from a joint PMF
p_{R|Y}(1|3) = P(R=1 | Y=3) = P(R=1, Y=3) / P(Y=3) = 0.19 / 0.50 = 0.38
[joint PMF table of R (values 0–2) and Y (values 1–5)]
Conditionals from a joint PMF
p_{R|Y}(r|y) = P(R=r, Y=y) / P(Y=y)
[joint PMF table of R (values 0–2) and Y (values 1–5), shown before and after conditioning]
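The table lookup amounts to dividing one column of the joint PMF by its column sum. A sketch with a hypothetical joint table (the slides' full table isn't reproduced here; the values are chosen so that P(R=1, Y=3) = 0.19 and P(Y=3) = 0.50, matching the worked number):

```python
# Compute a conditional PMF p_{R|Y}(r|y) from a joint PMF table.
# The table entries are hypothetical, constructed to match the slide's
# worked ratio 0.19 / 0.50.
p_joint = {
    (0, 3): 0.21, (1, 3): 0.19, (2, 3): 0.10,   # P(Y=3) = 0.50
    (0, 4): 0.25, (1, 4): 0.15, (2, 4): 0.10,   # P(Y=4) = 0.50
}

def conditional_r_given_y(y):
    p_y = sum(p for (r, yy), p in p_joint.items() if yy == y)
    return {r: p / p_y for (r, yy), p in p_joint.items() if yy == y}

cond = conditional_r_given_y(3)
print(cond[1])  # 0.19 / 0.50 = 0.38
```

Note that each conditional PMF sums to 1: conditioning renormalizes the column into a valid distribution.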
More web server hits
Your web server gets X requests from humans and Y requests from bots in a day, independently.
X ~ Poi(λ₁), Y ~ Poi(λ₂), so X + Y ~ Poi(λ₁ + λ₂)
P(X=k | X+Y=n) = P(X=k, Y=n−k) / P(X+Y=n)
= P(X=k) P(Y=n−k) / P(X+Y=n)   (independence)
= [e^{−λ₁} λ₁^k / k!] · [e^{−λ₂} λ₂^{n−k} / (n−k)!] · [n! / (e^{−(λ₁+λ₂)} (λ₁+λ₂)^n)]
= [n! / (k! (n−k)!)] · λ₁^k λ₂^{n−k} / (λ₁+λ₂)^n
= (n choose k) (λ₁/(λ₁+λ₂))^k (λ₂/(λ₁+λ₂))^{n−k}
→ (X | X+Y=n) ~ Bin(n, λ₁/(λ₁+λ₂))
Continuous conditional distributions
The value of a random variable, conditioned on the value of some other random variable, has a probability distribution.
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
Ratios of continuous probabilities
The probability of an exact value for a continuous random variable is 0.
But ratios of these probabilities are still well-defined!
P(X=a) / P(X=b) = f_X(a) / f_X(b)
Defining the undefined
P(X=a) / P(X=b) = P(X≈a) / P(X≈b)
= lim_{ε→0} P(a−ε ≤ X ≤ a+ε) / P(b−ε ≤ X ≤ b+ε)
= lim_{ε→0} [∫_{a−ε}^{a+ε} f_X(x) dx] / [∫_{b−ε}^{b+ε} f_X(x) dx]
= 2ε f_X(a) / (2ε f_X(b)) = f_X(a) / f_X(b)
(each integral is ≈ 2ε times the density at the interval's center when ε is small)
[figure: density curve with 2ε-wide strips of height f_X(a) and f_X(b)]
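This limit argument is easy to watch numerically: as ε shrinks, the ratio of interval probabilities converges to the ratio of densities. A sketch using the standard normal as the example density (CDF via `math.erf`):

```python
# Compare the ratio of small-interval probabilities to the ratio of
# densities for the standard normal, at a = 0 and b = 1.
from math import erf, exp, pi, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def norm_pdf(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

a, b = 0.0, 1.0
density_ratio = norm_pdf(a) / norm_pdf(b)
eps = 1e-5
interval_ratio = ((norm_cdf(a + eps) - norm_cdf(a - eps))
                  / (norm_cdf(b + eps) - norm_cdf(b - eps)))
print(density_ratio, interval_ratio)  # both ≈ e^{1/2} ≈ 1.6487
```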
Conditioning on a continuous RV
f_{X|Y}(x|y) = P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y) = f_{X,Y}(x, y) / f_Y(y)
Mixing discrete and continuous
P(a₁ ≤ X ≤ a₂, b₁ ≤ N ≤ b₂) = ∫_{a₁}^{a₂} Σ_{n=b₁}^{b₂} f_{X,N}(x, n) dx

p_{N|X}(n|x) = f_{X,N}(x, n) / f_X(x)
f_{X|N}(x|n) = f_{X,N}(x, n) / p_N(n)
Discrete + Continuous Bayes
f_{X|N}(x|n) = p_{N|X}(n|x) f_X(x) / p_N(n)
p_{N|X}(n|x) = f_{X|N}(x|n) p_N(n) / f_X(x)
Break time!
The probability of a probability
Beta random variable
A beta random variable models the probability of a trial's success, given previous trials. The PDF/CDF let you compute probabilities of probabilities!
X ~ Beta(a, b)
f_X(x) = C x^{a−1} (1−x)^{b−1} if 0 < x < 1; 0 otherwise
Estimating an unknown probability
You roll a loaded die N times, get A sixes (and N − A non-sixes).
What's the probability that the die is loaded such that sixes come up less than 1/4 of the time?

X: probability of getting a six
A: number of sixes in N rolls
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
with A | X ~ Bin(N, X) and X ~ Uni(0, 1) ("I know nothing")
= [1 / P(A=a)] · (N choose a) x^a (1−x)^{N−a} · 1
= C x^a (1−x)^{N−a}
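The posterior C x^a (1−x)^{N−a} is Beta(a+1, N−a+1), so the slide's question becomes a Beta CDF evaluation. A sketch with made-up data (N = 30 rolls, a = 10 sixes); for integer parameters the Beta CDF equals a binomial tail probability (I_t(α, β) = P(Bin(α+β−1, t) ≥ α)), so no numerical integration is needed:

```python
# P(X < 1/4 | data) under a Uni(0, 1) prior: the posterior is
# Beta(a+1, N−a+1), and its CDF at t is a binomial tail probability.
from math import comb

def beta_cdf(t, alpha, beta):
    # Regularized incomplete beta for integer alpha, beta
    n = alpha + beta - 1
    return sum(comb(n, k) * t**k * (1 - t)**(n - k)
               for k in range(alpha, n + 1))

N, a = 30, 10   # hypothetical example data
prob = beta_cdf(0.25, a + 1, N - a + 1)  # P(X < 1/4 | A = a)
print(prob)
```

As a sanity check, `beta_cdf(t, 1, 1)` returns `t`, i.e. Beta(1, 1) really is Uni(0, 1).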
Beta: Fact sheet
X ~ Beta(a, b)
a: number of successes + 1; b: number of failures + 1; x: probability of success
PDF: f_X(x) = C x^{a−1} (1−x)^{b−1} if 0 < x < 1; 0 otherwise
expectation: E[X] = a / (a + b)
variance: Var(X) = ab / ((a + b)² (a + b + 1))
Beta takes many forms
X ~ Beta(1, 1):
f_X(x) = C x^{1−1} (1−x)^{1−1} = C x⁰ (1−x)⁰ = C if 0 < x < 1
∫₀¹ C dx = 1 ⇒ C = 1
⇒ Beta(1, 1) = Uni(0, 1)

Conjugate distribution
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
X ~ Beta(1, 1): "prior"
P(A=a | X=x): "likelihood"
P(A=a): "normalizing constant"
X | A ~ Beta(a + 1, N − a + 1): "posterior"
Subjective priors
f_{X|A}(x|a) = P(A=a | X=x) f_X(x) / P(A=a)
X ~ Beta(1, 1) "prior" → X | A ~ Beta(a + 1, N − a + 1) "posterior"

How did we decide on Beta(1, 1) for the prior?
Beta(1, 1): "we haven't seen any rolls yet."
Beta(4, 1): "we've seen 3 sixes and 0 non-sixes."
Beta(2, 6): "we've seen 1 six and 5 non-sixes."
Beta prior = "imaginary" previous trials
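The pseudo-count view makes the posterior update trivial to write down: data just add to the prior's counts. A minimal sketch (the observed counts are made up for illustration):

```python
# "Beta prior = imaginary previous trials": starting from Beta(a0, b0)
# and observing s successes and f failures, the posterior is
# Beta(a0 + s, b0 + f).
def beta_posterior(prior, successes, failures):
    a0, b0 = prior
    return (a0 + successes, b0 + failures)

# Beta(2, 6) prior ("we've seen 1 six and 5 non-sixes"), then we observe
# 3 sixes and 9 non-sixes:
posterior = beta_posterior((2, 6), successes=3, failures=9)
print(posterior)  # (5, 15)
```

With no data the posterior equals the prior, which is why Beta(1, 1) is the natural "I know nothing" starting point.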
Advanced: Dirichlet distribution
Beta is the distribution ("conjugate prior") for the p in the Bernoulli and binomial.
Dirichlet is the distribution for the p₁, p₂, … in the multinomial.
X₁, X₂, … ~ Dir(a₁, a₂, …)
f_{X₁,X₂,…}(x₁, x₂, …) = C x₁^{a₁−1} x₂^{a₂−1} ⋯ if 0 < x₁, x₂, … < 1 and x₁ + x₂ + ⋯ = 1; 0 otherwise
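A Dirichlet sample is a whole probability vector. One standard construction (not specific to these slides) draws independent Gamma(a_i, 1) variables and normalizes them; this needs only the standard library:

```python
# Draw one sample from Dir(a1, ..., ak) by normalizing independent
# Gamma(ai, 1) draws — a standard construction.
import random

def dirichlet_sample(alphas, seed=0):
    rng = random.Random(seed)
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

sample = dirichlet_sample([2.0, 3.0, 5.0])
print(sample)  # three probabilities that sum to 1
```

The sampled vector satisfies exactly the support condition in the density: each component lies in (0, 1) and the components sum to 1.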
Frequentists vs. Bayesians
image: Eric Kilby
Frequentist
A probability is the (real or theoretical) result of a number of experiments.
All probabilities are based on objective experiences.
Bayesian
A probability is a belief.
All probabilities are based on subjective priors.
(It’s not really a debate anymore—real statisticians / data scientists / machine learning practitioners can and do think both ways!)