BIOL2300 Biostatistics Chapter 5 Discrete probability distributions
Random variable
• Intuitively, a random variable (r.v.) is the numerical value expressing outcome of a stochastic experiment
• Formally, a r.v. is a function X : Ω → ℝ. Since Ω has an associated probability function, we will be able to discuss P(X=k) = P({x ∈ Ω : X(x)=k})
Discrete vs continuous
• Discrete r.v. takes either finitely many values or countably many (range of X is finite or countably infinite)
• Continuous r.v. X has range containing an interval of the real numbers
Probability distribution
• Probability that r.v. X takes value k: p(k) = P({x ∈ Ω : X(x) = k})
• Probability distribution for r.v. X: 0 ≤ p(k) = P(X=k) ≤ 1 for all k, and the values p(k) sum to 1 over all k
Answer
• No: sum of probabilities not equal to 1
Mean, variance, stdev of r.v. (population variance & stdev)
Round off rule
• Round mean, variance and stdev to one more decimal place than accuracy of r.v.
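A minimal sketch of these quantities, using the standard definitions µ = Σ k·p(k), σ² = Σ (k−µ)²·p(k) and stdev = sqrt(σ²). The distribution `dist` and the helper name `mean_var_sd` are made up for illustration and do not come from the slides.

```python
import math

# Hypothetical distribution of a discrete r.v. X (values chosen for illustration only)
dist = {0: 0.2, 1: 0.5, 2: 0.3}

def mean_var_sd(p):
    """Return (mean, variance, stdev) of a discrete r.v. given as {value: probability}."""
    assert abs(sum(p.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    mu = sum(k * pk for k, pk in p.items())
    var = sum((k - mu) ** 2 * pk for k, pk in p.items())
    return mu, var, math.sqrt(var)

mu, var, sd = mean_var_sd(dist)
# The values of X are integers, so the round-off rule asks for one decimal place
print(round(mu, 1), round(var, 1), round(sd, 1))
```

The assertion inside the helper enforces the requirement that the probabilities sum to 1.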
Hardy-Weinberg theorem
Under the random mating hypothesis, genotype frequencies reach their equilibrium values in one generation
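# x is AA, y is AB, z is BB: genotype frequencies
# a, b, c are the old values of x, y, z
x = 0.5  # example starting frequency of AA (assumed; not given on the slide)
z = 0.2  # example starting frequency of BB (assumed; not given on the slide)
y = 1 - x - z
a = x
b = y
c = z
gen = 0  # generation count
# one generation of random mating
x = a**2 + 0.5*a*b + 0.5*b*a + 0.25*b*b
z = c**2 + 0.5*c*b + 0.5*b*c + 0.25*b*b
y = 1 - x - z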
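The update rule above can be iterated to watch the genotype frequencies settle. This is only a sketch: the starting frequencies are arbitrary assumptions, and the helper `next_generation` is a hypothetical name. It uses the identity a² + ab + b²/4 = (a + b/2)², so the new AA frequency is the square of the A-allele frequency.

```python
def next_generation(x, y, z):
    """One round of random mating; x, y, z are the frequencies of AA, AB, BB."""
    p = x + 0.5 * y  # frequency of allele A
    q = z + 0.5 * y  # frequency of allele B
    return p * p, 2 * p * q, q * q

x, y, z = 0.5, 0.3, 0.2  # arbitrary starting frequencies (assumed for illustration)
history = [(x, y, z)]
for gen in range(3):
    x, y, z = next_generation(x, y, z)
    history.append((x, y, z))

for gen, freqs in enumerate(history):
    print(gen, [round(f, 4) for f in freqs])
```

Generations 1, 2 and 3 should print the same frequencies, which is the content of the theorem.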
Answer: under random mating hypothesis
              Female A (p)   Female B (q)
Male A (p)    AA (p²)        AB (pq)
Male B (q)    BA (pq)        BB (q²)
• p,q are allele frequencies. p = allele frequency of A, q = allele frequency of B. We have p+q=1
• AA, AB and BB are genotypes. Assume that A is dominant over B, so that there is a phenotypic difference between {AA, AB} (dominant phenotype) and BB (recessive phenotype).
• Then since E[BB]=q², we can compute the B-allele frequency q as the square root of the frequency of the BB phenotype, assuming that the population is in Hardy-Weinberg equilibrium.
• Thus p = 1-q, and E[AA]=p², E[AB]=2pq.
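The calculation in the bullets above can be sketched as follows. The 4% recessive phenotype frequency is a hypothetical input, and the function name is made up for this example; the result is only meaningful under the Hardy-Weinberg assumption.

```python
import math

def hardy_weinberg_from_recessive(freq_bb):
    """Estimate allele and genotype frequencies from the observed BB (recessive) frequency."""
    q = math.sqrt(freq_bb)  # B-allele frequency, since E[BB] = q**2
    p = 1 - q               # A-allele frequency, since p + q = 1
    return {"p": p, "q": q, "AA": p * p, "AB": 2 * p * q, "BB": q * q}

# Hypothetical population in which 4% show the recessive phenotype
est = hardy_weinberg_from_recessive(0.04)
for name, value in est.items():
    print(name, round(value, 4))
```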
Binomial coefficient
• Pascal’s triangle
1
1, 1
1, 2, 1
1, 3, 3, 1
1, 4, 6, 4, 1
1, 5, 10, 10, 5, 1
• What is the sum of each row?
More on binomial coefficients
• “n choose k”
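As a small illustration, the rows of Pascal's triangle can be generated from "n choose k" with Python's standard `math.comb`. The helper name `pascal_row` is an assumption made for this sketch.

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle: the binomial coefficients C(n, 0), ..., C(n, n)."""
    return [comb(n, k) for k in range(n + 1)]

for n in range(6):
    row = pascal_row(n)
    # Print the row, its sum, and 2**n for comparison
    print(row, sum(row), 2 ** n)
```

Comparing the last two printed columns suggests the answer to the row-sum question.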
Binomial distribution
• Bernoulli trial: 2 outcomes, success or failure, with probabilities p and q=(1-p)
• A binomially distributed r.v. counts the number of successes in n independent trials
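A minimal sketch of the binomial probability b(k;n,p) = C(n,k)·p^k·q^(n−k), the probability of exactly k successes in n trials. The function name `binom_pmf` is an assumption; the values n=50, p=0.3 are the ones used for the binomial graph in these slides.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 50, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))  # should be 1 up to floating-point error
mode = max(range(n + 1), key=lambda k: pmf[k])
print(mode)      # the most likely number of successes
```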
Binomial theorem
Answer
Graph of binomial distribution
Here n=50, p=0.3 in relative frequency plot of binomial distr.
Mean, variance, stdev of Bernoulli distributed r.v.
• Let Y be Bernoulli r.v. with probability p of success
• E[Y] = 1·p + 0·(1-p) = p
• V[Y] = (1-p)²·p + (0-p)²·(1-p) = p(1-2p+p²) + p² - p³ = p - p² = p(1-p) = pq
• stdev[Y] = sqrt(pq)
Mean and variance of linear combination of independent r.v.
• Theorem 1: Expectation is always additive
E[X+Y] = E[X] + E[Y]
E[cX] = cE[X]
• Theorem 2: Variance is additive if r.v. are independent
V[X+Y] = V[X] + V[Y]
V[cX] = c²V[X]
What does it mean that two r.v. are independent?
• X,Y independent iff for all values x,y respectively taken on by X,Y we have
• P(X=x,Y=y) = P(X=x) * P(Y=y)
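To make the definition concrete, here is a sketch with two independent fair dice (an example chosen for illustration, not taken from the slides). The joint probabilities are built as products, exactly as in the definition above, and the additivity of mean and variance is then checked with exact fractions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
die = {v: Fraction(1, 6) for v in faces}  # distribution of one fair die

def mean(dist):
    return sum(v * p for v, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist.items())

# Distribution of X + Y for independent X, Y: P(X=x, Y=y) = P(X=x) * P(Y=y)
total = {}
for x, y in product(faces, faces):
    total[x + y] = total.get(x + y, 0) + die[x] * die[y]

print(mean(die), var(die))      # one die
print(mean(total), var(total))  # sum of two independent dice
```

Using `Fraction` avoids rounding, so the comparison of V[X+Y] with V[X] + V[Y] is exact.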
Mean, variance, stdev of binomial distributed r.v.
• If X is a r.v. that counts the number of successes in n trials, then by additivity of expectation E[X] = E[Y]+...+E[Y], where there are n terms in the sum and Y is a Bernoulli distributed r.v.
• Thus E[X] = np if X is a binomially distributed r.v. that counts the number of successes in n trials, where the probability of success in one trial is p
• Since the n Bernoulli trials are independent, by additivity of variance (REQUIRES independence) V[X] = nV[Y] = npq = np(1-p) if X is a binomially distributed r.v. that counts the number of successes in n trials, where the probability of success in one trial is p
Answer: D. E[2X+7] = 2E[X]+7 = 2µ+7. V[2X+7] = E[((2X+7) − (2µ+7))²] = 4E[(X−µ)²] = 4V[X]. So stdev[2X+7] = 2σ.
Binomial distribution is for sampling with replacement
• urn has 4 red balls and 6 black balls
• P(red ball) = 4/10 = 0.4
• b(r;n,p) is probability of r red balls in sample of n balls, where balls are drawn from urn WITH replacement
• What about drawing balls without replacement?
Binomial distribution: sampling with replacement
n=100, p=0.15
Proportion of successes in n trials
• Let Z be r.v. that counts the number of successes in n trials, where probability of success is p (absolute frequency).
• Z = X_1+...+X_n, where the X_i are independent and P(X_i=1) = p.
• Let Y be r.v. that returns the proportion of successes in n trials (relative frequency)
• E[Y] = E[Z/n] = E[X_1+...+X_n]/n = np/n = p
• V[Y] = V[Z/n] = V[X_1+...+X_n]/n² = npq/n² = pq/n
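The formulas for the count Z and the proportion Y = Z/n can be checked numerically from the binomial probabilities. This is a sketch using n=100, p=0.15, the values shown on the preceding slide; the helper names are assumptions.

```python
from math import comb

n, p = 100, 0.15
q = 1 - p

# Binomial probabilities P(Z = k) for k = 0..n
pmf = [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

def mean_var(values, probs):
    m = sum(v * pr for v, pr in zip(values, probs))
    v = sum((val - m) ** 2 * pr for val, pr in zip(values, probs))
    return m, v

mean_z, var_z = mean_var(range(n + 1), pmf)                     # absolute frequency
mean_y, var_y = mean_var([k / n for k in range(n + 1)], pmf)    # relative frequency

print(mean_z, n * p, var_z, n * p * q)
print(mean_y, p, var_y, p * q / n)
```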
Hypergeometric distribution
• R red balls in population of size N
• what is probability of drawing r red balls in sample of size n, when drawing WITHOUT replacement?
Mean, variance of hypergeometric dist.
• Let p = R/N. Then the mean is np. Note that the mean of the hypergeometric (sampling without replacement) is the same as the mean of the binomial (sampling with replacement).
• Let p = R/N and q = 1-p = (N-R)/N, so that q = B/N where B is the number of black balls in the population. Recall that the variance of the binomial is npq. Then the variance of the hypergeometric is npq(N-n)/(N-1).
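A sketch of the hypergeometric probability, using the standard formula C(R,r)·C(N−R,n−r)/C(N,n), together with a numerical check of the mean and variance formulas above. The values N=100, R=50, n=30 are the ones used for the hypergeometric graph in these slides; the function name is an assumption.

```python
from math import comb

def hypergeom_pmf(r, n, N, R):
    """Probability of r red balls in a sample of n drawn without replacement
    from a population of N balls of which R are red."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

N, R, n = 100, 50, 30
support = range(max(0, n - (N - R)), min(n, R) + 1)
pmf = {r: hypergeom_pmf(r, n, N, R) for r in support}

mean = sum(r * pr for r, pr in pmf.items())
var = sum((r - mean) ** 2 * pr for r, pr in pmf.items())

p, q = R / N, (N - R) / N
print(mean, n * p)                          # mean vs. np
print(var, n * p * q * (N - n) / (N - 1))   # variance vs. npq(N-n)/(N-1)
```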
Graph of hypergeometric distribution
Here N=100, R=50, n=30. Dot (x,y) represents h(n,x;N,R), the probability of drawing x red balls in size n sample, without replacement
Lotto problem
• In Lotto game, a player selects six numbers from 1,...,54 without repetition, thus forming an unordered set of 6 numbers.
• P(selecting 6 winning numbers)
Continuation of Lotto Problem
• P(selecting exactly 5 winning numbers)
Continuation of Lotto Problem
• P(selecting exactly 3 winning numbers)
Continuation of Lotto Problem
• P(selecting no winning number)
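The Lotto probabilities are hypergeometric: 6 of the 54 numbers are winning, and the player's 6 picks are a sample drawn without replacement. The sketch below works under that reading; the function name `lotto_prob` and its default arguments are assumptions made for this example.

```python
from math import comb

def lotto_prob(matches, picks=6, pool=54):
    """Probability that exactly `matches` of the player's picks are winning numbers."""
    return comb(picks, matches) * comb(pool - picks, picks - matches) / comb(pool, picks)

print(comb(54, 6))  # number of possible tickets
for m in (6, 5, 3, 0):
    print(m, lotto_prob(m))
```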
Poisson distribution
X is Poisson r.v. with parameter λ (λ is the mean) if P[X=k] = λ^k/k! · e^(-λ) for k = 0, 1, 2, ...
P[X=1] = µ^1/1! · e^(-µ) = 0.014872513
P[X=4] = (3.1)^4/4! · e^(-3.1)
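A minimal sketch for evaluating Poisson probabilities of the form λ^k/k! · e^(−λ). The function name `poisson_pmf` is an assumption. The slide does not state the mean used for P[X=1]; the value 6 below is a guess that happens to reproduce 0.014872513.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson r.v. with mean lam."""
    return lam ** k / factorial(k) * exp(-lam)

print(poisson_pmf(4, 3.1))  # second example on the slide
print(poisson_pmf(1, 6))    # mean 6 is an assumed value, not stated on the slide
```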
Graph of Poisson distribution
Here lambda=10.
Mean and variance of Poisson r.v
• If X is Poisson distributed r.v. with parameter λ, then
– E[X] = λ
– V[X] = λ
Mean of Poisson with parameter lambda is lambda
Variance of Poisson with parameter lambda is lambda
Applications of Poisson
• Suppose that p is small, where p is the probability that an elementary event happens in a given time or space interval (accident in 5 minute interval, nucleotide mutation in genome)
• Poisson distributed r.v. is good distribution for fitting the NUMBER of elementary events that occur per time or space interval
Number of TATA boxes (TATAAA) in M. jannaschii (archaea)
Number of TATA boxes (TATAAA) in E. coli K12
• http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.2
Interarrival time, or distance between successive genomic motifs
What probability distribution is this?
Remarks
• When N is large compared with sample size n, sampling with replacement (binomial) is approximately equal to sampling without replacement (hypergeometric).
• For large n, both binomial and hypergeometric distribution look approximately normal. For small p and large n, binomial distribution can be approximated by Poisson distribution.
Is the number of women having car accidents in a fixed time interval Poisson distributed?
Accidents     0    1    2    3   4   5   >5   Total
#female     447  132   42   21   3   2    0     647
E[#female]  406  189   44    7   1   0    0     647
Columns 0 and 3 deviate from values of Poisson r.v. Explanation: prudent drivers have fewer accidents, reckless drivers have more?
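The expected counts can be recomputed as follows. This sketch assumes that λ was estimated by the sample mean of the observed accident counts, which the slide does not state explicitly; with that choice the rounded values agree with the E[#female] row.

```python
from math import exp, factorial

# Observed number of women with k accidents (the ">5" column is 0 and is omitted)
observed = {0: 447, 1: 132, 2: 42, 3: 21, 4: 3, 5: 2}

total = sum(observed.values())
# Assumption: lambda is estimated by the mean number of accidents per woman
lam = sum(k * count for k, count in observed.items()) / total

# Expected count for k accidents: total * P[X = k] under Poisson(lam)
expected = {k: total * lam ** k / factorial(k) * exp(-lam) for k in observed}

print(round(lam, 4))
for k in observed:
    print(k, observed[k], round(expected[k]))
```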
Geometric distribution
Defective component problem
• Probability of defective component is 0.2. Find the probability that the first defect is found in the seventh component tested.
• P(X=7), where X is geometrically distributed
• P(X=7) = (1-0.2)⁶·(0.2)
Mean of geometric distribution is 1/p
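A short sketch of the geometric probability P(X=k) = (1−p)^(k−1)·p for the defective component problem, with a numerical check of the mean 1/p. The function name `geom_pmf` is an assumption, and the infinite sum for the mean is truncated at 2000 terms.

```python
def geom_pmf(k, p):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

p = 0.2
print(geom_pmf(7, p))  # first defect found in the seventh component tested

# Truncated sum approximating E[X]; the neglected tail is negligible for p = 0.2
mean_approx = sum(k * geom_pmf(k, p) for k in range(1, 2000))
print(mean_approx, 1 / p)
```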
Application of mean of geometric distribution
• Recall that the absolute risk reduction is defined by |P(event occurring in treatment group) - P(event occurring in control group)|
• The book states that the number of persons needed to treat, in order to prevent one disease, is computed by 1/(absolute risk reduction)
• This follows since the number of persons who must be treated (without success) before treating a person with success is geometrically distributed, and the mean is 1/p, where p is absolute risk reduction.
Variance of geometric distribution is q/p²
Multinomial coefficients
Recall that the binomial coefficient is the number of ways of choosing a size k subset from a size n set.
The multinomial coefficient n!/(n1!·n2!·...·nk!) is the number of ways of partitioning a size n set into subsets of size n1,n2,...,nk
Problem on multinomial distribution
Solution to problem on multinomial distribution
• P(A)=p1, P(B)=p2, P(C)=p3
• Multinomial probability (generalization of binomial probability) that in a sample of size n, there are n1 items of type A, n2 items of type B and n3 items of type C is: n!/(n1!·n2!·n3!) · p1^n1 · p2^n2 · p3^n3
Solution to problem on multinomial distribution
• Genetics experiment involves 6 mutually exclusive genotypes: A,B,C,D,E,F, all equally likely (so p_i=1/6 for i=1,...,6)
• Probability of 5 A’s, 4 B’s, 3 C’s, 2 D’s, 3 E’s, 3 F’s (sample size n=20) is 20!/(5!·4!·3!·2!·3!·3!) · (1/6)^20
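The genotype probability can be evaluated with a short sketch based on the multinomial coefficient n!/(n1!·...·nk!) times the product of the p_i^n_i. The function names are assumptions; exact fractions are used so that no rounding occurs before the final conversion to a float.

```python
from fractions import Fraction
from math import factorial, prod

def multinomial_coefficient(counts):
    """n! / (n1! * n2! * ... * nk!) where n is the sum of the counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # each intermediate quotient is an integer
    return coef

def multinomial_prob(counts, probs):
    """Probability of observing exactly these counts in a sample of size sum(counts)."""
    return multinomial_coefficient(counts) * prod(Fraction(p) ** c for p, c in zip(probs, counts))

counts = [5, 4, 3, 2, 3, 3]     # numbers of A, B, C, D, E, F
probs = [Fraction(1, 6)] * 6    # all six genotypes equally likely
prob = multinomial_prob(counts, probs)
print(float(prob))
```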
Chebyshev’s inequality
For any r.v. X with mean µ and stdev σ, and any k > 0: P(|X−µ| ≥ kσ) ≤ 1/k²
Prob that |z-score| is greater than or equal to 2 is at most 1/4
Prob that |z-score| is greater than or equal to 3 is at most 1/9
Proof of Chebyshev’s inequality
Equivalent formulation of Chebyshev’s inequality
Prob that |z-score| is less than k is greater than or equal to 1 − 1/k²
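As an illustration, Chebyshev's bound P(|X−µ| ≥ kσ) ≤ 1/k² can be compared with exact tail probabilities of a binomial r.v. The choice n=50, p=0.3 reuses the binomial example from these slides; the helper name `tail_prob` is an assumption.

```python
from math import comb, sqrt

n, p = 50, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
mu = n * p
sigma = sqrt(n * p * (1 - p))

def tail_prob(k):
    """Exact P(|X - mu| >= k * sigma) for the binomial r.v. X."""
    return sum(px for x, px in enumerate(pmf) if abs(x - mu) >= k * sigma)

for k in (2, 3):
    # exact tail probability next to the Chebyshev bound 1/k**2
    print(k, tail_prob(k), 1 / k ** 2)
```

The bound holds for every distribution with finite variance, so for a specific distribution such as this one the exact tail is typically much smaller than 1/k².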