Discrete Distributions

Bernoulli: f(x) = p^x (1 − p)^(1−x), x = 0, 1; 0 < p < 1.
  M(t) = 1 − p + pe^t, −∞ < t < ∞;  µ = p, σ² = p(1 − p).

Binomial b(n, p): f(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x), x = 0, 1, 2, ..., n; 0 < p < 1.
  M(t) = (1 − p + pe^t)^n, −∞ < t < ∞;  µ = np, σ² = np(1 − p).

Geometric: f(x) = (1 − p)^(x−1) p, x = 1, 2, 3, ...; 0 < p < 1.
  M(t) = pe^t / [1 − (1 − p)e^t], t < −ln(1 − p);  µ = 1/p, σ² = (1 − p)/p².

Hypergeometric: f(x) = C(N1, x) C(N2, n − x) / C(N, n), x ≤ n, x ≤ N1, n − x ≤ N2; N1 > 0, N2 > 0, N = N1 + N2.
  µ = n(N1/N), σ² = n(N1/N)(N2/N)(N − n)/(N − 1).

Negative Binomial: f(x) = C(x − 1, r − 1) p^r (1 − p)^(x−r), x = r, r + 1, r + 2, ...; 0 < p < 1, r = 1, 2, 3, ....
  M(t) = (pe^t)^r / [1 − (1 − p)e^t]^r, t < −ln(1 − p);  µ = r/p, σ² = r(1 − p)/p².

Poisson: f(x) = λ^x e^(−λ)/x!, x = 0, 1, 2, ...; λ > 0.
  M(t) = e^(λ(e^t − 1)), −∞ < t < ∞;  µ = λ, σ² = λ.

Uniform (discrete): f(x) = 1/m, x = 1, 2, ..., m; m > 0.
  µ = (m + 1)/2, σ² = (m² − 1)/12.

Here C(a, b) denotes the binomial coefficient a!/[b!(a − b)!].
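As a quick sanity check of the table (an illustration, not part of the original), the following Python sketch evaluates the binomial and geometric pmfs directly from their formulas and confirms that the resulting means and variances match the closed-form entries µ = np, σ² = np(1 − p) and µ = 1/p, σ² = (1 − p)/p²; the parameter values n = 10, p = 0.3 and p = 0.4 are arbitrary choices.

import math

def binomial_pmf(x, n, p):
    # f(x) = n!/(x!(n-x)!) * p^x * (1-p)^(n-x)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def geometric_pmf(x, p):
    # f(x) = (1-p)^(x-1) * p, x = 1, 2, 3, ...
    return (1 - p)**(x - 1) * p

def moments(pmf, support):
    mean = sum(x * pmf(x) for x in support)
    var = sum((x - mean)**2 * pmf(x) for x in support)
    return mean, var

n, p = 10, 0.3
mu, var = moments(lambda x: binomial_pmf(x, n, p), range(n + 1))
print(mu, n * p)                 # both approximately 3.0
print(var, n * p * (1 - p))      # both approximately 2.1

p = 0.4
# Truncate the infinite geometric support; the tail beyond x = 200 is negligible here.
mu, var = moments(lambda x: geometric_pmf(x, p), range(1, 201))
print(mu, 1 / p)                 # both approximately 2.5
print(var, (1 - p) / p**2)       # both approximately 3.75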
Continuous Distributions

Beta: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), 0 < x < 1; α > 0, β > 0.
  µ = α/(α + β), σ² = αβ/[(α + β + 1)(α + β)²].

Chi-square: f(x) = [1/(Γ(r/2) 2^(r/2))] x^(r/2−1) e^(−x/2), 0 < x < ∞.
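The Beta entries can be checked numerically in the same spirit; the sketch below (illustrative only, with arbitrarily chosen α = 2, β = 5) approximates µ and σ² by a midpoint Riemann sum over (0, 1) and compares them with α/(α + β) and αβ/[(α + β + 1)(α + β)²].

import math

def beta_pdf(x, a, b):
    # f(x) = Gamma(a+b)/(Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1), 0 < x < 1
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x**(a - 1) * (1 - x)**(b - 1)

def numeric_moments(a, b, n=100000):
    # Midpoint Riemann sum on (0, 1); accurate enough for a sanity check.
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    mean = sum(x * beta_pdf(x, a, b) for x in xs) * h
    var = sum((x - mean)**2 * beta_pdf(x, a, b) for x in xs) * h
    return mean, var

a, b = 2.0, 5.0
mean, var = numeric_moments(a, b)
print(mean, a / (a + b))                           # both approximately 0.2857
print(var, a * b / ((a + b + 1) * (a + b)**2))     # both approximately 0.0255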
Chapter 4: Bivariate Distributions

4.1 Bivariate Distributions of the Discrete Type
4.2 The Correlation Coefficient
4.3 Conditional Distributions
4.4 Bivariate Distributions of the Continuous Type
4.5 The Bivariate Normal Distribution
4.1 BIVARIATE DISTRIBUTIONS OF THE DISCRETE TYPE

So far, we have taken only one measurement on a single item under observation. However, it is clear in many practical cases that it is possible, and often very desirable, to take more than one measurement of a random observation. Suppose, for example, that we are observing female college students to obtain information about some of their physical characteristics, such as height, x, and weight, y, because we are trying to determine a relationship between those two characteristics. For instance, there may be some pattern between height and weight that can be described by an appropriate curve y = u(x). Certainly, not all of the points observed will be on this curve, but we want to attempt to find the "best" curve to describe the relationship and then say something about the variation of the points around the curve.

Another example might concern high school rank, say x, and the ACT (or SAT) score, say y, of incoming college students. What is the relationship between these two characteristics? More importantly, how can we use those measurements to predict a third one, such as first-year college GPA, say z, with a function z = v(x, y)? This is a very important problem for college admission offices, particularly when it comes to awarding an athletic scholarship, because the incoming student-athlete must satisfy certain conditions before receiving such an award.

Definition 4.1-1
Let X and Y be two random variables defined on a discrete space. Let S denote the corresponding two-dimensional space of X and Y, the two random variables of the discrete type. The probability that X = x and Y = y is denoted by f(x, y) = P(X = x, Y = y). The function f(x, y) is called the joint probability mass function (joint pmf) of X and Y and has the following properties:
(a) 0 ≤ f(x, y) ≤ 1.
(b) ΣΣ_{(x,y)∈S} f(x, y) = 1.
(c) P[(X, Y) ∈ A] = ΣΣ_{(x,y)∈A} f(x, y), where A is a subset of the space S.
The following example will make this definition more meaningful.
Example 4.1-1
Roll a pair of fair dice. For each of the 36 sample points with probability 1/36, let X denote the smaller and Y the larger outcome on the dice. For example, if the outcome is (3, 2), then the observed values are X = 2, Y = 3. The event {X = 2, Y = 3} could occur in one of two ways, (3, 2) or (2, 3), so its probability is

1/36 + 1/36 = 2/36.

If the outcome is (2, 2), then the observed values are X = 2, Y = 2. Since the event {X = 2, Y = 2} can occur in only one way, P(X = 2, Y = 2) = 1/36. The joint pmf of X and Y is given by the probabilities

f(x, y) = 1/36, 1 ≤ x = y ≤ 6,
f(x, y) = 2/36, 1 ≤ x < y ≤ 6,

when x and y are integers. Figure 4.1-1 depicts the probabilities of the various points of the space S.
Figure 4.1-1 Discrete joint pmf. (The figure plots the 21 points of S with probability 1/36 on the diagonal x = y and 2/36 above it; the column totals in the x margin are 11/36, 9/36, 7/36, 5/36, 3/36, 1/36 for x = 1, ..., 6, and the row totals in the y margin are 1/36, 3/36, 5/36, 7/36, 9/36, 11/36 for y = 1, ..., 6.)
Notice that certain numbers have been recorded in the bottom and left-hand margins of Figure 4.1-1. These numbers are the respective column and row totals of the probabilities. The column totals are the respective probabilities that X will assume the values in the x space SX = {1, 2, 3, 4, 5, 6}, and the row totals are the respective probabilities that Y will assume the values in the y space SY = {1, 2, 3, 4, 5, 6}. That is, the totals describe the probability mass functions of X and Y, respectively. Since each collection of these probabilities is frequently recorded in the margins and satisfies the properties of a pmf of one random variable, each is called a marginal pmf.
Definition 4.1-2
Let X and Y have the joint probability mass function f(x, y) with space S. The probability mass function of X alone, which is called the marginal probability mass function of X, is defined by

fX(x) = Σ_y f(x, y) = P(X = x), x ∈ SX,

where the summation is taken over all possible y values for each given x in the x space SX. That is, the summation is over all (x, y) in S with a given x value. Similarly, the marginal probability mass function of Y is defined by

fY(y) = Σ_x f(x, y) = P(Y = y), y ∈ SY,

where the summation is taken over all possible x values for each given y in the y space SY. The random variables X and Y are independent if and only if, for every x ∈ SX and every y ∈ SY,

P(X = x, Y = y) = P(X = x)P(Y = y)

or, equivalently,

f(x, y) = fX(x)fY(y);

otherwise, X and Y are said to be dependent.
We note in Example 4.1-1 that X and Y are dependent because there are many x and y values for which f(x, y) ≠ fX(x)fY(y). For instance,

fX(1)fY(1) = (11/36)(1/36) ≠ 1/36 = f(1, 1).
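The joint pmf of Example 4.1-1, its marginal totals, and this dependence check can all be reproduced by brute-force enumeration of the 36 outcomes; the following Python sketch is an illustration of that, not part of the original text.

from fractions import Fraction
from itertools import product
from collections import defaultdict

# Enumerate the 36 equally likely outcomes of two fair dice; X = smaller, Y = larger value.
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(min(d1, d2), max(d1, d2))] += Fraction(1, 36)

# Marginal pmfs: column totals fX(x) and row totals fY(y).
fX = {x: sum(p for (a, _), p in joint.items() if a == x) for x in range(1, 7)}
fY = {y: sum(p for (_, b), p in joint.items() if b == y) for y in range(1, 7)}

print([str(fX[x]) for x in range(1, 7)])  # ['11/36', '1/4', '7/36', '5/36', '1/12', '1/36']
print([str(fY[y]) for y in range(1, 7)])  # ['1/36', '1/12', '5/36', '7/36', '1/4', '11/36']

# X and Y are dependent: f(1, 1) differs from fX(1) * fY(1).
print(joint[(1, 1)], fX[1] * fY[1])       # 1/36 versus 11/1296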
Example 4.1-2
Let the joint pmf of X and Y be defined by

f(x, y) = (x + y)/21, x = 1, 2, 3, y = 1, 2.

Then

fX(x) = Σ_y f(x, y) = Σ_{y=1}^{2} (x + y)/21 = (x + 1)/21 + (x + 2)/21 = (2x + 3)/21, x = 1, 2, 3,

and, similarly,

fY(y) = Σ_x f(x, y) = (1 + y)/21 + (2 + y)/21 + (3 + y)/21 = (3y + 6)/21, y = 1, 2.
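A short sketch (again illustrative only, not from the text) confirms these marginals and that the joint pmf totals 1:

from fractions import Fraction

# Joint pmf f(x, y) = (x + y)/21 on x = 1, 2, 3 and y = 1, 2 (Example 4.1-2).
f = lambda x, y: Fraction(x + y, 21)

for x in (1, 2, 3):                          # marginal of X should be (2x + 3)/21
    print(x, sum(f(x, y) for y in (1, 2)), Fraction(2 * x + 3, 21))
for y in (1, 2):                             # marginal of Y should be (3y + 6)/21
    print(y, sum(f(x, y) for x in (1, 2, 3)), Fraction(3 * y + 6, 21))

print(sum(f(x, y) for x in (1, 2, 3) for y in (1, 2)))   # a pmf must total 1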
FIGURE 3.3: Expectation as a center of gravity; panel (a) shows a distribution with E(X) = 0.5, panel (b) a distribution with E(X) = 0.25.
Similar arguments can be used to derive the general formula for the expectation.
Expectation, discrete case:

µ = E(X) = Σ_x x P(x)    (3.3)
This formula returns the center of gravity for a system with masses P(x) allocated at points x. Expected value is often denoted by the Greek letter µ.

In a certain sense, expectation is the best forecast of X. The variable itself is random. It takes different values with different probabilities P(x). At the same time, it has just one expectation E(X), which is non-random.

3.3.2 Expectation of a function

Often we are interested in another variable, Y, that is a function of X. For example, downloading time depends on the connection speed, the profit of a computer store depends on the number of computers sold, and the bonus of its manager depends on this profit. The expectation of Y = g(X) is computed by a similar formula,
E{g(X)} = Σ_x g(x) P(x).    (3.4)
Remark: Indeed, if g is a one-to-one function, then Y takes each value y = g(x) with probability P(x), and the formula for E(Y) can be applied directly. If g is not one-to-one, then some values of g(x) will be repeated in (3.4). However, they are still multiplied by the corresponding probabilities. When we add in (3.4), these probabilities are also added, thus each value of g(x) is still multiplied by the probability PY(g(x)).
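Formulas (3.3) and (3.4) are plain sums over the support; the sketch below (illustrative only) evaluates both for a small, hypothetical pmf P = {0: 0.5, 1: 0.3, 2: 0.2} and for g(x) = x², chosen purely for illustration.

# A small pmf P(x) on a finite support, used to illustrate (3.3) and (3.4).
P = {0: 0.5, 1: 0.3, 2: 0.2}    # hypothetical distribution; probabilities sum to 1

mu = sum(x * p for x, p in P.items())      # E(X), formula (3.3)
print(mu)                                  # 0.7

g = lambda x: x**2                         # any function of X
Eg = sum(g(x) * p for x, p in P.items())   # E{g(X)}, formula (3.4)
print(Eg)                                  # 1.1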
3.3.3 Properties

The following linear properties of expectations follow directly from (3.3) and (3.4). For any random variables X and Y and any non-random numbers a, b, and c, we have
Properties of expectations:

E(aX + bY + c) = a E(X) + b E(Y) + c

In particular,
E(X + Y) = E(X) + E(Y)
E(aX) = a E(X)
E(c) = c

For independent X and Y,
E(XY) = E(X)E(Y)    (3.5)
Proof: The first property follows from the Addition Rule (3.2). For any X and Y,

E(aX + bY + c) = Σ_x Σ_y (ax + by + c) P_(X,Y)(x, y)
  = Σ_x ax Σ_y P_(X,Y)(x, y) + Σ_y by Σ_x P_(X,Y)(x, y) + c Σ_x Σ_y P_(X,Y)(x, y)
  = a Σ_x x PX(x) + b Σ_y y PY(y) + c.

The next three equalities are special cases. To prove the last property, we recall that P_(X,Y)(x, y) = PX(x) PY(y) for independent X and Y, and therefore,

E(XY) = Σ_x Σ_y (xy) PX(x) PY(y) = [Σ_x x PX(x)] [Σ_y y PY(y)] = E(X)E(Y).
Remark: The last property in (3.5) holds for some dependent variables too, hence it cannot be used to verify the independence of X and Y.
Remark: Clearly, the program will never have 1.65 errors, because the number of errors is always an integer. Then, should we round 1.65 to 2 errors? Absolutely not; it would be a mistake. Although both X and Y are integers, their expectations, or average values, do not have to be integers at all.
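The properties in (3.5) can be verified numerically for any concrete joint pmf; the sketch below (an illustration under hypothetical marginal pmfs PX and PY) builds the joint distribution of two independent variables and checks the linearity and product rules.

from itertools import product

# Two independent pmfs (hypothetical values) and their joint pmf P(x, y) = PX(x) * PY(y).
PX = {0: 0.4, 1: 0.6}
PY = {1: 0.5, 2: 0.3, 3: 0.2}

EX = sum(x * p for x, p in PX.items())           # 0.6
EY = sum(y * p for y, p in PY.items())           # 1.7
a, b, c = 2.0, -3.0, 5.0

# E(aX + bY + c) computed from the joint distribution ...
lhs = sum((a * x + b * y + c) * PX[x] * PY[y] for x, y in product(PX, PY))
# ... equals a E(X) + b E(Y) + c, as in (3.5).
print(lhs, a * EX + b * EY + c)                  # both 1.1 (up to rounding)

# For independent X and Y, E(XY) = E(X) E(Y).
EXY = sum(x * y * PX[x] * PY[y] for x, y in product(PX, PY))
print(EXY, EX * EY)                              # both 1.02 (up to rounding)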
3.3.4 Variance and standard deviation

Expectation shows where the average value of a random variable is located, or where the variable is expected to be, plus or minus some error. How large could this "error" be, and how much can a variable vary around its expectation? Let us introduce some measures of variability.
Example 3.10. Here is a rather artificial but illustrative scenario. Consider two users. One receives either 48 or 52 e-mail messages per day, with a 50-50% chance of each. The other receives either 0 or 100 e-mails, also with a 50-50% chance. What is a common feature of these two distributions, and how are they different?

We see that both users receive the same average number of e-mails:

E(X) = E(Y) = 50.

However, in the first case, the actual number of e-mails is always close to 50, whereas it always differs from it by 50 in the second case. The first random variable, X, is more stable; it has low variability. The second variable, Y, has high variability. ♦

This example shows that variability of a random variable is measured by its distance from the mean µ = E(X). In its turn, this distance is random too, and therefore cannot serve as a characteristic of a distribution. It remains to square it and take the expectation of the result.
DEFINITION 3.6
Variance of a random variable is defined as the expected squared deviation from the mean. For discrete random variables, the variance is

σ² = Var(X) = E(X − EX)² = Σ_x (x − µ)² P(x)
Remark: Notice that if the distance to the mean is not squared, then the result is always E(X − µ) = µ − µ = 0, bearing no information about the distribution of X.

According to this definition, variance is always non-negative. Further, it equals 0 only if x = µ for all values of x, i.e., when X is constantly equal to µ. Certainly, a constant (non-random) variable has zero variability.
Variance can also be computed as
Var(X) = E(X²) − µ².    (3.6)
A proof of this is left as Exercise 3.38a.
DEFINITION 3.7
Standard deviation is the square root of the variance,

σ = Std(X) = √Var(X)
Continuing the Greek-letter tradition, variance is often denoted by σ². Then, the standard deviation is σ.

If X is measured in some units, then its mean µ has the same measurement unit as X. Variance σ² is measured in squared units, and therefore it cannot be compared with X or µ. No matter how funny it sounds, it is rather normal to measure the variance of profit in squared dollars, the variance of class enrollment in squared students, and the variance of available disk space in squared gigabytes. When the square root is taken, the resulting standard deviation σ is again measured in the same units as X. This is the main reason for introducing yet another measure of variability, σ.
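Applying Definition 3.6 and the shortcut formula (3.6) to the two users of Example 3.10 gives Var(X) = 4 and Var(Y) = 2500, so Std(X) = 2 e-mails and Std(Y) = 50 e-mails. A small sketch (illustrative only) confirms that both routes give the same numbers.

# Example 3.10: X is 48 or 52 with probability 1/2 each; Y is 0 or 100 with probability 1/2 each.
PX = {48: 0.5, 52: 0.5}
PY = {0: 0.5, 100: 0.5}

def var_by_definition(P):
    mu = sum(x * p for x, p in P.items())
    return sum((x - mu)**2 * p for x, p in P.items())   # E(X - EX)^2

def var_by_shortcut(P):
    mu = sum(x * p for x, p in P.items())
    EX2 = sum(x**2 * p for x, p in P.items())
    return EX2 - mu**2                                   # E(X^2) - mu^2, formula (3.6)

print(var_by_definition(PX), var_by_shortcut(PX))   # both 4,    so Std(X) = 2
print(var_by_definition(PY), var_by_shortcut(PY))   # both 2500, so Std(Y) = 50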
3.3.5 Covariance and correlation

Expectation, variance, and standard deviation characterize the distribution of a single random variable. Now we introduce measures of association of two random variables.
FIGURE 3.4: Positive, negative, and zero covariance.
DEFINITION 3.8
Covariance σXY = Cov(X, Y) is defined as

Cov(X, Y) = E{(X − EX)(Y − EY)} = E(XY) − E(X)E(Y)

It summarizes the interrelation of two random variables.
Covariance is the expected product of deviations of X and Y from their respective expectations. If Cov(X, Y) > 0, then positive deviations (X − EX) are more likely to be multiplied by positive (Y − EY), and negative (X − EX) are more likely to be multiplied by negative (Y − EY). In short, large X imply large Y, and small X imply small Y. These variables are positively correlated, Figure 3.4a.

Conversely, Cov(X, Y) < 0 means that large X generally correspond to small Y and small X correspond to large Y. These variables are negatively correlated, Figure 3.4b.

If Cov(X, Y) = 0, we say that X and Y are uncorrelated, Figure 3.4c.
DEFINITION 3.9
Correlation coefficient between variables X and Y is defined as

ρ = Cov(X, Y) / [(Std X)(Std Y)]

Correlation coefficient is a rescaled, normalized covariance. Notice that covariance Cov(X, Y) has a measurement unit. It is measured in units of X multiplied by units of Y. As a result, it is not clear from its value whether X and Y are strongly or weakly correlated. Really, one has to compare Cov(X, Y) with the magnitude of X and Y. Correlation coefficient performs such a comparison, and as a result, it is dimensionless.
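As a worked illustration (not from the text), the sketch below computes Cov(X, Y) and ρ for the dependent pair X = smaller and Y = larger outcome of two fair dice from Example 4.1-1 above; both come out positive, as one would expect for a min/max pair.

from itertools import product

# Joint distribution of X = smaller and Y = larger outcome of two fair dice.
outcomes = [(min(a, b), max(a, b)) for a, b in product(range(1, 7), repeat=2)]
P = 1 / 36

EX  = sum(x * P for x, _ in outcomes)
EY  = sum(y * P for _, y in outcomes)
EXY = sum(x * y * P for x, y in outcomes)
VarX = sum((x - EX)**2 * P for x, _ in outcomes)
VarY = sum((y - EY)**2 * P for _, y in outcomes)

cov = EXY - EX * EY                     # Cov(X, Y) = E(XY) - E(X)E(Y)
rho = cov / (VarX**0.5 * VarY**0.5)     # dimensionless correlation coefficient

print(cov)   # about 0.945, positive correlation
print(rho)   # about 0.479, and indeed -1 <= rho <= 1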
FIGURE 3.5: Perfect correlation: ρ = ±1 (two panels with axes X and Y, one for ρ = 1 and one for ρ = −1; in each, all (X, Y) points lie on a straight line).
How do we interpret the value of ρ? What possible values can it take?

As a special case of the famous Cauchy-Schwarz inequality,

−1 ≤ ρ ≤ 1,

where |ρ| = 1 is possible only when all values of X and Y lie on a straight line, as in Figure 3.5. Further, values of ρ near 1 indicate strong positive correlation, values near (−1) show strong negative correlation, and values near 0 show weak correlation or no correlation.
3.3.6 Properties

The following properties of variances, covariances, and correlation coefficients hold for any random variables X, Y, Z, and W and any non-random numbers a, b, c, and d.

Properties of variances and covariances:

Var(aX + bY + c) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
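This variance formula can be checked on any concrete joint distribution; below is a sketch (illustrative only) that verifies it on the same dice pair, with arbitrarily chosen constants a = 2, b = −1, c = 3.

from itertools import product

# Check Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
# on the dependent pair X = min, Y = max of two fair dice.
pairs = [(min(a, b), max(a, b)) for a, b in product(range(1, 7), repeat=2)]
P = 1 / 36

def E(h):
    return sum(h(x, y) * P for x, y in pairs)

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VarX = E(lambda x, y: (x - EX)**2)
VarY = E(lambda x, y: (y - EY)**2)
Cov = E(lambda x, y: (x - EX) * (y - EY))

a, b, c = 2.0, -1.0, 3.0
var_direct = E(lambda x, y: (a*x + b*y + c - (a*EX + b*EY + c))**2)
var_formula = a**2 * VarX + b**2 * VarY + 2*a*b * Cov
print(var_direct, var_formula)   # equal, up to floating-point rounding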
5.6 THE CENTRAL LIMIT THEOREM

In Section 5.4, we found that the mean X̄ of a random sample of size n from a distribution with mean µ and variance σ² > 0 is a random variable with the properties that

E(X̄) = µ  and  Var(X̄) = σ²/n.

As n increases, the variance of X̄ decreases. Consequently, the distribution of X̄ clearly depends on n, and we see that we are dealing with sequences of distributions.

In Theorem 5.5-1, we considered the pdf of X̄ when sampling is from the normal distribution N(µ, σ²). We showed that the distribution of X̄ is N(µ, σ²/n), and in Figure 5.5-1, by graphing the pdfs for several values of n, we illustrated the property that as n increases, the probability becomes concentrated in a small interval centered at µ. That is, as n increases, X̄ tends to converge to µ, or (X̄ − µ) tends to converge to 0 in a probability sense. (See Section 5.8.)
In general, if we let

W = (√n/σ)(X̄ − µ) = (X̄ − µ)/(σ/√n) = (Y − nµ)/(√n σ),

where Y is the sum of a random sample of size n from some distribution with mean µ and variance σ², then, for each positive integer n,

E(W) = E[(X̄ − µ)/(σ/√n)] = [E(X̄) − µ]/(σ/√n) = (µ − µ)/(σ/√n) = 0

and

Var(W) = E(W²) = E[(X̄ − µ)²/(σ²/n)] = E[(X̄ − µ)²]/(σ²/n) = (σ²/n)/(σ²/n) = 1.
Thus, while X̄ − µ tends to "degenerate" to zero, the factor √n/σ in √n(X̄ − µ)/σ "spreads out" the probability enough to prevent this degeneration. What, then, is the distribution of W as n increases? One observation that might shed some light on the answer to this question can be made immediately. If the sample arises from a normal distribution, then, from Theorem 5.5-1, we know that X̄ is N(µ, σ²/n), and hence W is N(0, 1) for each positive n. Thus, in the limit, the distribution of W must be N(0, 1). So if the solution of the question does not depend on the underlying distribution (i.e., it is unique), the answer must be N(0, 1). As we will see, that is exactly the case, and this result is so important that it is called the central limit theorem, the proof of which is given in Section 5.9.
Theorem 5.6-1 (Central Limit Theorem)
If X̄ is the mean of a random sample X1, X2, ..., Xn of size n from a distribution with a finite mean µ and a finite positive variance σ², then the distribution of

W = (X̄ − µ)/(σ/√n) = (Σ_{i=1}^{n} Xi − nµ)/(√n σ)

is N(0, 1) in the limit as n → ∞.
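A small Monte Carlo experiment makes the theorem concrete; the sketch below (an illustration, not part of the text) standardizes the mean of n = 50 exponential observations, a distribution far from normal, and compares an estimated probability for W with the corresponding standard normal value Φ(1) ≈ 0.8413.

import math
import random

# Standardize the mean of n exponential(1) observations and compare with N(0, 1).
random.seed(1)
mu, sigma = 1.0, 1.0         # mean and standard deviation of the exponential(1) distribution
n, reps = 50, 20000

def W():
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

ws = [W() for _ in range(reps)]
estimate = sum(w <= 1.0 for w in ws) / reps
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # standard normal cdf at 1, about 0.8413
print(estimate, phi_1)                           # close, and closer still for larger n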
g(u) =
  6(324u⁵/5),                                                   0 < u < 1/6,
  6(1/20 − (3/2)u + 18u² − 108u³ + 324u⁴ − 324u⁵),               1/6 ≤ u < 2/6,
  6(−79/20 + (117/2)u − 342u² + 972u³ − 1296u⁴ + 648u⁵),         2/6 ≤ u < 3/6,
  6(731/20 − (693/2)u + 1278u² − 2268u³ + 1944u⁴ − 648u⁵),       3/6 ≤ u < 4/6,
  6(−1829/20 + (1227/2)u − 1602u² + 2052u³ − 1296u⁴ + 324u⁵),    4/6 ≤ u < 5/6,
  6(324/5 − 324u + 648u² − 648u³ + 324u⁴ − (324/5)u⁵),           5/6 ≤ u < 1.
We can also calculate

∫_{1/6}^{2/6} g(u) du = 19/240 = 0.0792

and

∫_{11/18}^{1} g(u) du = 5818/32805 = 0.17735.

Although these integrations are not difficult, they are tedious to do by hand.
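The first piece, 6(324u⁵/5), suggests that g(u) is the density of the sample mean of six independent U(0, 1) observations; assuming that identification (it is consistent with the value 19/240 above), the two integrals can be checked quickly by simulation rather than by hand.

import random

# Simulate the mean U of six independent uniform(0, 1) observations and estimate
# the two probabilities; the results should be near 19/240 and 5818/32805.
random.seed(2)
reps = 200000
means = [sum(random.random() for _ in range(6)) / 6 for _ in range(reps)]

print(sum(1/6 < u < 2/6 for u in means) / reps)   # about 0.079
print(sum(u > 11/18 for u in means) / reps)       # about 0.177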
5.8 CHEBYSHEV'S INEQUALITY AND CONVERGENCE IN PROBABILITY

In this section, we use Chebyshev's inequality to show, in another sense, that the sample mean, X̄, is a good statistic to use to estimate a population mean µ; the relative frequency of success in n independent Bernoulli trials, Y/n, is a good statistic for estimating p. We examine the effect of the sample size n on these estimates.

We begin by showing that Chebyshev's inequality gives added significance to the standard deviation in terms of bounding certain probabilities. The inequality is valid for all distributions for which the standard deviation exists. The proof is given for the discrete case, but it holds for the continuous case, with integrals replacing summations.
Theorem 5.8-1 (Chebyshev's Inequality)
If the random variable X has a mean µ and variance σ², then, for every k ≥ 1,

P(|X − µ| ≥ kσ) ≤ 1/k².

Proof: Let f(x) denote the pmf of X. Then
σ² = E[(X − µ)²] = Σ_{x∈S} (x − µ)² f(x)
   = Σ_{x∈A} (x − µ)² f(x) + Σ_{x∈A′} (x − µ)² f(x),    (5.8-1)

where

A = {x : |x − µ| ≥ kσ}.

The second term in the right-hand member of Equation 5.8-1 is the sum of nonnegative numbers and thus is greater than or equal to zero. Hence,

σ² ≥ Σ_{x∈A} (x − µ)² f(x).

However, in A, |x − µ| ≥ kσ; so

σ² ≥ Σ_{x∈A} (kσ)² f(x) = k²σ² Σ_{x∈A} f(x).

But the latter summation equals P(X ∈ A); thus,

σ² ≥ k²σ² P(X ∈ A) = k²σ² P(|X − µ| ≥ kσ).

That is,

P(|X − µ| ≥ kσ) ≤ 1/k².
Corollary 5.8-1
If ε = kσ, then

P(|X − µ| ≥ ε) ≤ σ²/ε².

In words, Chebyshev's inequality states that the probability that X differs from its mean by at least k standard deviations is less than or equal to 1/k². It follows that the probability that X differs from its mean by less than k standard deviations is at least 1 − 1/k². That is,

P(|X − µ| < kσ) ≥ 1 − 1/k².

From the corollary, it also follows that

P(|X − µ| < ε) ≥ 1 − σ²/ε².

Thus, Chebyshev's inequality can be used as a bound for certain probabilities. However, in many instances, the bound is not very close to the true probability.
Example 5.8-1
If it is known that X has a mean of 25 and a variance of 16, then, since σ = 4, a lower bound for P(17 < X < 33) is given by

P(17 < X < 33) = P(|X − 25| < 8) = P(|X − µ| < 2σ) ≥ 1 − 1/2² = 0.75.
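Chebyshev's bound of 0.75 holds for every distribution with this mean and variance; to see how conservative it can be, the sketch below (illustrative only) compares it with the exact probability one would get if X were, purely hypothetically, normal with the same mean and variance.

import math

# Chebyshev: P(17 < X < 33) = P(|X - 25| < 8) >= 1 - 16/8^2 = 0.75 for ANY distribution
# with mean 25 and variance 16.  If X were normal with the same mean and variance,
# the exact probability would be much larger.
mu, sigma, eps = 25.0, 4.0, 8.0

chebyshev_lower_bound = 1 - sigma**2 / eps**2
print(chebyshev_lower_bound)                              # 0.75

exact_if_normal = math.erf(eps / (sigma * math.sqrt(2)))  # P(|X - mu| < 2*sigma) for a normal X
print(exact_if_normal)                                    # about 0.9545, well above the bound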
Chapte rChapte r
4Bivariate Distributions
4.1 Bivariate Distributions of the Discrete Type4.2 The Correlation Coefficient4.3 Conditional Distributions
4.4 Bivariate Distributions of the ContinuousType
4.5 The Bivariate Normal Distribution
4.1 BIVARIATE DISTRIBUTIONS OF THE DISCRETE TYPESo far, we have taken only one measurement on a single item under observation.However, it is clear in many practical cases that it is possible, and often very desir-able, to take more than one measurement of a random observation. Suppose, forexample, that we are observing female college students to obtain information aboutsome of their physical characteristics, such as height, x, and weight, y, because we aretrying to determine a relationship between those two characteristics. For instance,there may be some pattern between height and weight that can be described byan appropriate curve y = u(x). Certainly, not all of the points observed will beon this curve, but we want to attempt to find the “best” curve to describe therelationship and then say something about the variation of the points around thecurve.
Another example might concern high school rank—say, x—and the ACT(or SAT) score—say, y—of incoming college students. What is the relationshipbetween these two characteristics? More importantly, how can we use those mea-surements to predict a third one, such as first-year college GPA—say, z—witha function z = v(x, y)? This is a very important problem for college admissionoffices, particularly when it comes to awarding an athletic scholarship, because theincoming student–athlete must satisfy certain conditions before receiving such anaward.
Definition 4.1-1Let X and Y be two random variables defined on a discrete space. Let S denotethe corresponding two-dimensional space of X and Y, the two random vari-ables of the discrete type. The probability that X = x and Y = y is denoted byf (x, y) = P(X = x, Y = y). The function f (x, y) is called the joint probabilitymass function (joint pmf) of X and Y and has the following properties:
125
126 Chapter 4 Bivariate Distributions
(a) 0 ≤ f (x, y) ≤ 1.
(b)∑ ∑
(x,y)∈S
f (x, y) = 1.
(c) P[(X, Y) ∈ A] =∑ ∑
(x,y)∈A
f (x, y), where A is a subset of the space S.
The following example will make this definition more meaningful.
Example4.1-1
Roll a pair of fair dice. For each of the 36 sample points with probability 1/36, letX denote the smaller and Y the larger outcome on the dice. For example, if theoutcome is (3, 2), then the observed values are X = 2, Y = 3. The event {X = 2,Y = 3} could occur in one of two ways—(3, 2) or (2, 3)—so its probability is
136
+ 136
= 236
.
If the outcome is (2, 2), then the observed values are X = 2, Y = 2. Since the event{X = 2, Y = 2} can occur in only one way, P(X = 2, Y = 2) = 1/36. The joint pmfof X and Y is given by the probabilities
f (x, y) =
136
, 1 ≤ x = y ≤ 6,
236
, 1 ≤ x < y ≤ 6,
when x and y are integers. Figure 4.1-1 depicts the probabilities of the various pointsof the space S.
1/36
2/36 1/36
1/36
2/36
1/36
2/36
2/36
2/36
1/36
2/36
2/36
2/36
5/36 3/367/36
y
2/36
2/36
2/36
2/36
1/36
1/36
x
11/36
2/36
2/36
2/36
5/36
9/36
1/36
11/364 53 621
7/36
3/36
9/36
6
5
4
3
2
1
Figure 4.1-1 Discrete joint pmf
Section 4.1 Bivariate Distributions of the Discrete Type 127
Notice that certain numbers have been recorded in the bottom and left-handmargins of Figure 4.1-1. These numbers are the respective column and row totalsof the probabilities. The column totals are the respective probabilities that X willassume the values in the x space SX = {1, 2, 3, 4, 5, 6}, and the row totals arethe respective probabilities that Y will assume the values in the y space SY ={1, 2, 3, 4, 5, 6}. That is, the totals describe the probability mass functions of X andY, respectively. Since each collection of these probabilities is frequently recordedin the margins and satisfies the properties of a pmf of one random variable, each iscalled a marginal pmf.
Definition 4.1-2Let X and Y have the joint probability mass function f (x, y) with space S. Theprobability mass function of X alone, which is called the marginal probabilitymass function of X, is defined by
fX(x) =∑
yf (x, y) = P(X = x), x ∈ SX ,
where the summation is taken over all possible y values for each given x in thex space SX . That is, the summation is over all (x, y) in S with a given x value.Similarly, the marginal probability mass function of Y is defined by
fY(y) =∑
xf (x, y) = P(Y = y), y ∈ SY ,
where the summation is taken over all possible x values for each given y in they space SY . The random variables X and Y are independent if and only if, forevery x ∈ SX and every y ∈ SY ,
P(X = x, Y = y) = P(X = x)P(Y = y)
or, equivalently,
f (x, y) = fX(x)fY(y);
otherwise, X and Y are said to be dependent.
We note in Example 4.1-1 that X and Y are dependent because there are manyx and y values for which f (x, y) "= fX(x)fY(y). For instance,
fX(1)fY(1) =(
1136
)(1
36
)"= 1
36= f (1, 1).
Example4.1-2
Let the joint pmf of X and Y be defined by
f (x, y) = x + y21
, x = 1, 2, 3, y = 1, 2.
Then
fX(x) =∑
yf (x, y) =
2∑
y=1
x + y21
= x + 121
+ x + 221
= 2x + 321
, x = 1, 2, 3,
48 Probability and Statistics for Computer Scientists
(a) E(X) = 0.5
!0 0.5 1
(b) E(X) = 0.25
!0 0.25 0.5 1
FIGURE 3.3: Expectation as a center of gravity.
Similar arguments can be used to derive the general formula for the expectation.
Expectation,discrete case
µ = E(X) =∑
x
xP (x) (3.3)
This formula returns the center of gravity for a system with masses P (x) allocated at pointsx. Expected value is often denoted by a Greek letter µ.
In a certain sense, expectation is the best forecast of X . The variable itself is random. Ittakes different values with different probabilities P (x). At the same time, it has just oneexpectation E(X) which is non-random.
3.3.2 Expectation of a function
Often we are interested in another variable, Y , that is a function of X . For example, down-loading time depends on the connection speed, profit of a computer store depends on thenumber of computers sold, and bonus of its manager depends on this profit. Expectation ofY = g(X) is computed by a similar formula,
E {g(X)} =∑
x
g(x)P (x). (3.4)
Remark: Indeed, if g is a one-to-one function, then Y takes each value y = g(x) with probability
P (x), and the formula for E(Y ) can be applied directly. If g is not one-to-one, then some values ofg(x) will be repeated in (3.4). However, they are still multiplied by the corresponding probabilities.
When we add in (3.4), these probabilities are also added, thus each value of g(x) is still multipliedby the probability PY (g(x)).
3.3.3 Properties
The following linear properties of expectations follow directly from (3.3) and (3.4). For anyrandom variables X and Y and any non-random numbers a, b, and c, we have
50 Probability and Statistics for Computer Scientists
Example 3.10. Here is a rather artificial but illustrative scenario. Consider two users.One receives either 48 or 52 e-mail messages per day, with a 50-50% chance of each. Theother receives either 0 or 100 e-mails, also with a 50-50% chance. What is a common featureof these two distributions, and how are they different?
We see that both users receive the same average number of e-mails:
E(X) = E(Y ) = 50.
However, in the first case, the actual number of e-mails is always close to 50, whereas italways differs from it by 50 in the second case. The first random variable, X , is more stable;it has low variability. The second variable, Y , has high variability. ♦
This example shows that variability of a random variable is measured by its distance fromthe mean µ = E(X). In its turn, this distance is random too, and therefore, cannot serveas a characteristic of a distribution. It remains to square it and take the expectation of theresult.
DEFINITION 3.6
Variance of a random variable is defined as the expected squared deviationfrom the mean. For discrete random variables, variance is
σ2 = Var(X) = E (X − EX)2 =∑
x
(x− µ)2P (x)
Remark: Notice that if the distance to the mean is not squared, then the result is always µ−µ = 0bearing no information about the distribution of X.
According to this definition, variance is always non-negative. Further, it equals 0 only ifx = µ for all values of x, i.e., when X is constantly equal to µ. Certainly, a constant(non-random) variable has zero variability.
Variance can also be computed as
Var(X) = E(X2)− µ2. (3.6)
A proof of this is left as Exercise 3.38a.
DEFINITION 3.7
Standard deviation is a square root of variance,
σ = Std(X) =√
Var(X)
Continuing the Greek-letter tradition, variance is often denoted by σ2. Then, standarddeviation is σ.
If X is measured in some units, then its mean µ has the same measurement unit as X .Variance σ2 is measured in squared units, and therefore, it cannot be compared with X orµ. No matter how funny it sounds, it is rather normal to measure variance of profit in squareddollars, variance of class enrollment in squared students, and variance of available disk spacein squared gigabytes. When a squared root is taken, the resulting standard deviation σ isagain measured in the same units as X . This is the main reason of introducing yet anothermeasure of variability, σ.
Discrete Random Variables and Their Distributions 49
Propertiesof
expectations
E(aX + bY + c) = aE(X) + bE(Y ) + c
In particular,E(X + Y ) = E(X) + E(Y )E(aX) = aE(X)E(c) = c
For independent X and Y ,E(XY ) = E(X)E(Y )
(3.5)
Proof: The first property follows from the Addition Rule (3.2). For any X and Y ,
E(aX + bY + c) =∑
x
∑
y
(ax+ by + c)P(X,Y )(x, y)
=∑
x
ax∑
y
P(X,Y )(x, y) +∑
y
by∑
x
P(X,Y )(x, y) + c∑
x
∑
y
P(X,Y )(x, y)
= a∑
x
xPX(x) + b∑
y
yPY (y) + c.
The next three equalities are special cases. To prove the last property, we recall that P(X,Y )(x, y) =
PX(x)PY (y) for independent X and Y , and therefore,
E(XY ) =∑
x
∑
y
(xy)PX(x)PY (y) =∑
x
xPX(x)∑
y
yPY (y) = E(X)E(Y ). !
Remark: The last property in (3.5) holds for some dependent variables too, hence it cannot be
Remark: Clearly, the program will never have 1.65 errors, because the number of errors is alwaysinteger. Then, should we round 1.65 to 2 errors? Absolutely not, it would be a mistake. Although
both X and Y are integers, their expectations, or average values, do not have to be integers at all.
3.3.4 Variance and standard deviation
Expectation shows where the average value of a random variable is located, or where thevariable is expected to be, plus or minus some error. How large could this “error” be, andhow much can a variable vary around its expectation? Let us introduce some measures ofvariability.
50 Probability and Statistics for Computer Scientists
Example 3.10. Here is a rather artificial but illustrative scenario. Consider two users.One receives either 48 or 52 e-mail messages per day, with a 50-50% chance of each. Theother receives either 0 or 100 e-mails, also with a 50-50% chance. What is a common featureof these two distributions, and how are they different?
We see that both users receive the same average number of e-mails:
E(X) = E(Y ) = 50.
However, in the first case, the actual number of e-mails is always close to 50, whereas italways differs from it by 50 in the second case. The first random variable, X , is more stable;it has low variability. The second variable, Y , has high variability. ♦
This example shows that variability of a random variable is measured by its distance fromthe mean µ = E(X). In its turn, this distance is random too, and therefore, cannot serveas a characteristic of a distribution. It remains to square it and take the expectation of theresult.
DEFINITION 3.6
Variance of a random variable is defined as the expected squared deviationfrom the mean. For discrete random variables, variance is
σ2 = Var(X) = E (X − EX)2 =∑
x
(x− µ)2P (x)
Remark: Notice that if the distance to the mean is not squared, then the result is always µ−µ = 0bearing no information about the distribution of X.
According to this definition, variance is always non-negative. Further, it equals 0 only ifx = µ for all values of x, i.e., when X is constantly equal to µ. Certainly, a constant(non-random) variable has zero variability.
Variance can also be computed as
Var(X) = E(X2)− µ2. (3.6)
A proof of this is left as Exercise 3.38a.
DEFINITION 3.7
Standard deviation is a square root of variance,
σ = Std(X) =√
Var(X)
Continuing the Greek-letter tradition, variance is often denoted by σ2. Then, standarddeviation is σ.
If X is measured in some units, then its mean µ has the same measurement unit as X .Variance σ2 is measured in squared units, and therefore, it cannot be compared with X orµ. No matter how funny it sounds, it is rather normal to measure variance of profit in squareddollars, variance of class enrollment in squared students, and variance of available disk spacein squared gigabytes. When a squared root is taken, the resulting standard deviation σ isagain measured in the same units as X . This is the main reason of introducing yet anothermeasure of variability, σ.
50 Probability and Statistics for Computer Scientists
Example 3.10. Here is a rather artificial but illustrative scenario. Consider two users.One receives either 48 or 52 e-mail messages per day, with a 50-50% chance of each. Theother receives either 0 or 100 e-mails, also with a 50-50% chance. What is a common featureof these two distributions, and how are they different?
We see that both users receive the same average number of e-mails:
E(X) = E(Y ) = 50.
However, in the first case, the actual number of e-mails is always close to 50, whereas italways differs from it by 50 in the second case. The first random variable, X , is more stable;it has low variability. The second variable, Y , has high variability. ♦
This example shows that variability of a random variable is measured by its distance fromthe mean µ = E(X). In its turn, this distance is random too, and therefore, cannot serveas a characteristic of a distribution. It remains to square it and take the expectation of theresult.
DEFINITION 3.6
Variance of a random variable is defined as the expected squared deviationfrom the mean. For discrete random variables, variance is
σ2 = Var(X) = E (X − EX)2 =∑
x
(x− µ)2P (x)
Remark: Notice that if the distance to the mean is not squared, then the result is always µ−µ = 0bearing no information about the distribution of X.
According to this definition, variance is always non-negative. Further, it equals 0 only ifx = µ for all values of x, i.e., when X is constantly equal to µ. Certainly, a constant(non-random) variable has zero variability.
Variance can also be computed as
Var(X) = E(X2)− µ2. (3.6)
A proof of this is left as Exercise 3.38a.
DEFINITION 3.7
Standard deviation is a square root of variance,
σ = Std(X) =√
Var(X)
Continuing the Greek-letter tradition, variance is often denoted by σ2. Then, standarddeviation is σ.
If X is measured in some units, then its mean µ has the same measurement unit as X .Variance σ2 is measured in squared units, and therefore, it cannot be compared with X orµ. No matter how funny it sounds, it is rather normal to measure variance of profit in squareddollars, variance of class enrollment in squared students, and variance of available disk spacein squared gigabytes. When a squared root is taken, the resulting standard deviation σ isagain measured in the same units as X . This is the main reason of introducing yet anothermeasure of variability, σ.
Discrete Random Variables and Their Distributions 51
3.3.5 Covariance and correlation
Expectation, variance, and standard deviation characterize the distribution of a single ran-dom variable. Now we introduce measures of association of two random variables.
FIGURE 3.4: Positive, negative, and zero covariance.
DEFINITION 3.8
Covariance σXY = Cov(X,Y ) is defined as
Cov(X,Y ) = E {(X − EX)(Y − EY )}= E(XY )− E(X)E(Y )
It summarizes interrelation of two random variables.
Covariance is the expected product of deviations of X and Y from their respective expecta-tions. If Cov(X,Y ) > 0, then positive deviations (X− EX) are more likely to be multipliedby positive (Y − EY ), and negative (X− EX) are more likely to be multiplied by negative(Y − EY ). In short, large X imply large Y , and small X imply small Y . These variablesare positively correlated, Figure 3.4a.
Conversely, Cov(X,Y ) < 0 means that large X generally correspond to small Y and smallX correspond to large Y . These variables are negatively correlated, Figure 3.4b.
If Cov(X,Y ) = 0, we say that X and Y are uncorrelated, Figure 3.4c.
DEFINITION 3.9
Correlation coefficient between variables X and Y is defined as
ρ =Cov(X,Y )
( StdX)( StdY )
Correlation coefficient is a rescaled, normalized covariance. Notice that covarianceCov(X,Y ) has a measurement unit. It is measured in units of X multiplied by units ofY . As a result, it is not clear from its value whether X and Y are strongly or weakly corre-lated. Really, one has to compare Cov(X,Y ) with the magnitude of X and Y . Correlationcoefficient performs such a comparison, and as a result, it is dimensionless.
Discrete Random Variables and Their Distributions 51
3.3.5 Covariance and correlation
Expectation, variance, and standard deviation characterize the distribution of a single ran-dom variable. Now we introduce measures of association of two random variables.
FIGURE 3.4: Positive, negative, and zero covariance.
DEFINITION 3.8
Covariance σXY = Cov(X,Y ) is defined as
Cov(X,Y ) = E {(X − EX)(Y − EY )}= E(XY )− E(X)E(Y )
It summarizes interrelation of two random variables.
Covariance is the expected product of deviations of X and Y from their respective expecta-tions. If Cov(X,Y ) > 0, then positive deviations (X− EX) are more likely to be multipliedby positive (Y − EY ), and negative (X− EX) are more likely to be multiplied by negative(Y − EY ). In short, large X imply large Y , and small X imply small Y . These variablesare positively correlated, Figure 3.4a.
Conversely, Cov(X,Y ) < 0 means that large X generally correspond to small Y and smallX correspond to large Y . These variables are negatively correlated, Figure 3.4b.
If Cov(X,Y ) = 0, we say that X and Y are uncorrelated, Figure 3.4c.
DEFINITION 3.9
Correlation coefficient between variables X and Y is defined as
ρ =Cov(X,Y )
( StdX)( StdY )
Correlation coefficient is a rescaled, normalized covariance. Notice that covarianceCov(X,Y ) has a measurement unit. It is measured in units of X multiplied by units ofY . As a result, it is not clear from its value whether X and Y are strongly or weakly corre-lated. Really, one has to compare Cov(X,Y ) with the magnitude of X and Y . Correlationcoefficient performs such a comparison, and as a result, it is dimensionless.
Discrete Random Variables and Their Distributions 51
3.3.5 Covariance and correlation
Expectation, variance, and standard deviation characterize the distribution of a single ran-dom variable. Now we introduce measures of association of two random variables.
FIGURE 3.4: Positive, negative, and zero covariance.
DEFINITION 3.8
Covariance σXY = Cov(X,Y ) is defined as
Cov(X,Y ) = E {(X − EX)(Y − EY )}= E(XY )− E(X)E(Y )
It summarizes interrelation of two random variables.
Covariance is the expected product of deviations of X and Y from their respective expecta-tions. If Cov(X,Y ) > 0, then positive deviations (X− EX) are more likely to be multipliedby positive (Y − EY ), and negative (X− EX) are more likely to be multiplied by negative(Y − EY ). In short, large X imply large Y , and small X imply small Y . These variablesare positively correlated, Figure 3.4a.
Conversely, Cov(X,Y ) < 0 means that large X generally correspond to small Y and smallX correspond to large Y . These variables are negatively correlated, Figure 3.4b.
If Cov(X,Y ) = 0, we say that X and Y are uncorrelated, Figure 3.4c.
DEFINITION 3.9
Correlation coefficient between variables X and Y is defined as
ρ =Cov(X,Y )
( StdX)( StdY )
Correlation coefficient is a rescaled, normalized covariance. Notice that covarianceCov(X,Y ) has a measurement unit. It is measured in units of X multiplied by units ofY . As a result, it is not clear from its value whether X and Y are strongly or weakly corre-lated. Really, one has to compare Cov(X,Y ) with the magnitude of X and Y . Correlationcoefficient performs such a comparison, and as a result, it is dimensionless.
52 Probability and Statistics for Computer Scientists
!
"
X
Y
!
"
X
Y
ρ = 1 ρ = −1
FIGURE 3.5: Perfect correlation: ρ = ±1.
How do we interpret the value of ρ? What possible values can it take?
As a special case of famous Cauchy-Schwarz inequality,
−1 ≤ ρ ≤ 1,
where |ρ| = 1 is possible only when all values of X and Y lie on a straight line, as inFigure 3.5. Further, values of ρ near 1 indicate strong positive correlation, values near (−1)show strong negative correlation, and values near 0 show weak correlation or no correlation.
3.3.6 Properties
The following properties of variances, covariances, and correlation coefficients hold for anyrandom variables X , Y , Z, and W and any non-random numbers a, b, c and d.
Properties of variances and covariances
Var(aX + bY + c) = a2 Var(X) + b2 Var(Y ) + 2abCov(X,Y )
52 Probability and Statistics for Computer Scientists
!
"
X
Y
!
"
X
Y
ρ = 1 ρ = −1
FIGURE 3.5: Perfect correlation: ρ = ±1.
How do we interpret the value of ρ? What possible values can it take?
As a special case of famous Cauchy-Schwarz inequality,
−1 ≤ ρ ≤ 1,
where |ρ| = 1 is possible only when all values of X and Y lie on a straight line, as inFigure 3.5. Further, values of ρ near 1 indicate strong positive correlation, values near (−1)show strong negative correlation, and values near 0 show weak correlation or no correlation.
3.3.6 Properties
The following properties of variances, covariances, and correlation coefficients hold for anyrandom variables X , Y , Z, and W and any non-random numbers a, b, c and d.
Properties of variances and covariances
Var(aX + bY + c) = a2 Var(X) + b2 Var(Y ) + 2abCov(X,Y )
200 Chapter 5 Distributions of Functions of Random Variables
5.6 THE CENTRAL LIMIT THEOREMIn Section 5.4, we found that the mean X of a random sample of size n from a dis-tribution with mean µ and variance σ 2 > 0 is a random variable with the propertiesthat
E(X) = µ and Var(X) = σ 2
n.
As n increases, the variance of X decreases. Consequently, the distribution of Xclearly depends on n, and we see that we are dealing with sequences of distributions.
In Theorem 5.5-1, we considered the pdf of X when sampling is from the normaldistribution N(µ, σ 2). We showed that the distribution of X is N(µ, σ 2/n), and inFigure 5.5-1, by graphing the pdfs for several values of n, we illustrated the propertythat as n increases, the probability becomes concentrated in a small interval centeredat µ. That is, as n increases, X tends to converge to µ, or ( X − µ) tends to convergeto 0 in a probability sense. (See Section 5.8.)
In general, if we let
W =√
nσ
( X − µ) = X − µ
σ/√
n= Y − nµ√
n σ,
where Y is the sum of a random sample of size n from some distribution with meanµ and variance σ 2, then, for each positive integer n,
E(W) = E
[X − µ
σ/√
n
]
= E(X) − µ
σ/√
n= µ − µ
σ/√
n= 0
and
Var(W) = E(W2) = E
[(X − µ)2
σ 2/n
]
=E
[(X − µ)2
]
σ 2/n= σ 2/n
σ 2/n= 1.
Thus, while X−µ tends to “degenerate” to zero, the factor√
n/σ in√
n(X−µ)/σ“spreads out” the probability enough to prevent this degeneration. What, then, is thedistribution of W as n increases? One observation that might shed some light on theanswer to this question can be made immediately. If the sample arises from a normaldistribution, then, from Theorem 5.5-1, we know that X is N(µ, σ 2/n), and hence Wis N(0, 1) for each positive n. Thus, in the limit, the distribution of W must be N(0, 1).So if the solution of the question does not depend on the underlying distribution (i.e.,it is unique), the answer must be N(0, 1). As we will see, that is exactly the case, andthis result is so important that it is called the central limit theorem, the proof ofwhich is given in Section 5.9.
Theorem5.6-1
(Central Limit Theorem) If X is the mean of a random sample X1, X2, . . . , Xn ofsize n from a distribution with a finite mean µ and a finite positive variance σ 2,then the distribution of
W = X − µ
σ/√
n=
∑ni=1 Xi − nµ√
n σ
is N(0, 1) in the limit as n → ∞.
Section 5.8 Chebyshev’s Inequality and Convergence in Probability 213
g(u) =
6
(324u5
5
)
, 0 < u < 1/6,
6(
120
− 324u5 + 324u4 − 108u3 + 18u2 − 3u2
), 1/6 ≤ u < 2/6,
6(
−7920
+ 117u2
+ 648u5 − 1296u4 + 972u3 − 342u2)
, 2/6 ≤ u < 3/6,
6(
73120
− 693u2
− 648u5 + 1944u4 − 2268u3)
, 3/6 ≤ u < 4/6,
6(−1829
20+ 1227u
2− 1602u2 + 2052u3 + 324u5 − 1296u4
), 4/6 ≤ u < 5/6,
6
(324
5− 324u + 648u2 − 648u3 + 324u4 − 324u5
5
)
, 5/6 ≤ u < 1.
We can also calculate
∫ 2/6
1/6g(u) du = 19
240= 0.0792
and
∫ 1
11/18g(u) du = 5, 818
32, 805= 0.17735.
Although these integrations are not difficult, they are tedious to do by hand. !
5.8 CHEBYSHEV’S INEQUALITY AND CONVERGENCE IN PROBABILITYIn this section, we use Chebyshev’s inequality to show, in another sense, that thesample mean, X, is a good statistic to use to estimate a population mean µ; therelative frequency of success in n independent Bernoulli trials, Y/n, is a good statisticfor estimating p. We examine the effect of the sample size n on these estimates.
We begin by showing that Chebyshev's inequality gives added significance to the standard deviation in terms of bounding certain probabilities. The inequality is valid for all distributions for which the standard deviation exists. The proof is given for the discrete case, but it holds for the continuous case, with integrals replacing summations.
Theorem 5.8-1 (Chebyshev's Inequality) If the random variable X has a mean μ and variance σ², then, for every k ≥ 1,

P(|X − μ| ≥ kσ) ≤ 1/k².
Proof Let f(x) denote the pmf of X. Then

σ² = E[(X − μ)²] = Σ_{x∈S} (x − μ)² f(x)
   = Σ_{x∈A} (x − μ)² f(x) + Σ_{x∈A′} (x − μ)² f(x),    (5.8-1)

where

A = {x : |x − μ| ≥ kσ}.

The second term in the right-hand member of Equation 5.8-1 is the sum of nonnegative numbers and thus is greater than or equal to zero. Hence,

σ² ≥ Σ_{x∈A} (x − μ)² f(x).

However, in A, |x − μ| ≥ kσ; so

σ² ≥ Σ_{x∈A} (kσ)² f(x) = k²σ² Σ_{x∈A} f(x).

But the latter summation equals P(X ∈ A); thus,

σ² ≥ k²σ² P(X ∈ A) = k²σ² P(|X − μ| ≥ kσ).

That is,

P(|X − μ| ≥ kσ) ≤ 1/k².
Corollary 5.8-1 If ε = kσ, then

P(|X − μ| ≥ ε) ≤ σ²/ε².
In words, Chebyshev's inequality states that the probability that X differs from its mean by at least k standard deviations is less than or equal to 1/k². It follows that the probability that X differs from its mean by less than k standard deviations is at least 1 − 1/k². That is,

P(|X − μ| < kσ) ≥ 1 − 1/k².

From the corollary, it also follows that

P(|X − μ| < ε) ≥ 1 − σ²/ε².

Thus, Chebyshev's inequality can be used as a bound for certain probabilities. However, in many instances, the bound is not very close to the true probability.
Example 5.8-1 If it is known that X has a mean of 25 and a variance of 16, then, since σ = 4, a lower bound for P(17 < X < 33) is given by

P(17 < X < 33) = P(|X − 25| < 8) = P(|X − μ| < 2σ) ≥ 1 − 1/2² = 0.75.
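The following short sketch contrasts this distribution-free bound with the probability one would obtain if, purely for illustration, X were assumed to be normal with the same mean and variance; the normality assumption is mine, not the example's.

```python
from scipy.stats import norm

mu, sigma = 25.0, 4.0
k = (33 - mu) / sigma                 # 17 and 33 are each 2 standard deviations from the mean

print(1 - 1 / k**2)                   # Chebyshev lower bound: 0.75, valid for any such X

# If X were N(25, 16) (an assumption made only for comparison):
print(norm.cdf(33, mu, sigma) - norm.cdf(17, mu, sigma))   # about 0.9545, much larger than 0.75
```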
6.2 EXPLORATORY DATA ANALYSIS
Table 6.2-5 Order statistics of 50 exam scores
34 38 42 42 45 47 51 52 54 57
58 58 59 60 61 63 65 65 66 67
68 69 69 70 71 71 72 73 73 74
75 75 76 76 77 79 81 81 82 83
83 84 84 85 87 90 91 93 93 97
From either these order statistics or the corresponding ordered stem-and-leaf display, it is rather easy to find the sample percentiles. If 0 < p < 1, then the (100p)th sample percentile has approximately np sample observations less than it and also n(1 − p) sample observations greater than it. One way of achieving this is to take the (100p)th sample percentile as the (n + 1)pth order statistic, provided that (n + 1)p is an integer. If (n + 1)p is not an integer but is equal to r plus some proper fraction, say a/b, use a weighted average of the rth and the (r + 1)st order statistics. That is, define the (100p)th sample percentile as

yr + (a/b)(yr+1 − yr).

Note that this formula is simply a linear interpolation between yr and yr+1. [If p < 1/(n + 1) or p > n/(n + 1), that sample percentile is not defined.]
As an illustration, consider the 50 ordered test scores. With p = 1/2, we find the 50th percentile by averaging the 25th and 26th order statistics, since (n + 1)p = (51)(1/2) = 25.5. Thus, the 50th percentile is
π0.50 = (1/2)y25 + (1/2)y26 = (71 + 71)/2 = 71.
With p = 1/4, we have (n + 1)p = (51)(1/4) = 12.75, and the 25th sample percentile is then

π0.25 = y12 + (0.75)(y13 − y12) = 58 + (0.75)(59 − 58) = 58.75.

Similarly, since (51)(3/4) = 38.25, the 75th sample percentile is

π0.75 = y38 + (0.25)(y39 − y38) = 81 + (0.25)(82 − 81) = 81.25.

Note that approximately 50%, 25%, and 75% of the sample observations are less than 71, 58.75, and 81.25, respectively.
Special names are given to certain percentiles. The 50th percentile is the median of the sample. The 25th, 50th, and 75th percentiles are, respectively, the first, second, and third quartiles of the sample. For notation, we let q1 = π0.25, q2 = m = π0.50, and q3 = π0.75. The 10th, 20th, . . . , and 90th percentiles are the deciles of the sample, so note that the 50th percentile is also the median, the second quartile, and the fifth decile. With the set of 50 test scores, since (51)(2/10) = 10.2 and (51)(9/10) = 45.9, the second and ninth deciles are, respectively,

57 + (0.2)(58 − 57) = 57.2  and  87 + (0.9)(90 − 87) = 89.7.
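A small helper function makes this (n + 1)p rule easy to apply to the exam scores of Table 6.2-5; this is a sketch of my own, with Python/NumPy assumed, and it reproduces the percentiles computed above.

```python
import numpy as np

# Ordered exam scores from Table 6.2-5.
y = np.array([34, 38, 42, 42, 45, 47, 51, 52, 54, 57,
              58, 58, 59, 60, 61, 63, 65, 65, 66, 67,
              68, 69, 69, 70, 71, 71, 72, 73, 73, 74,
              75, 75, 76, 76, 77, 79, 81, 81, 82, 83,
              83, 84, 84, 85, 87, 90, 91, 93, 93, 97])

def sample_percentile(y, p):
    """(100p)th sample percentile using the (n + 1)p rule with linear interpolation."""
    n = len(y)
    pos = (n + 1) * p                  # position among the order statistics
    if pos < 1 or pos > n:
        raise ValueError("sample percentile not defined for this p")
    r = int(np.floor(pos))             # whole part r
    frac = pos - r                     # proper fraction a/b
    if frac == 0:
        return float(y[r - 1])         # exactly the rth order statistic
    return float(y[r - 1] + frac * (y[r] - y[r - 1]))

for p in (0.25, 0.50, 0.75, 0.20, 0.90):
    print(p, sample_percentile(y, p))  # 58.75, 71.0, 81.25, 57.2, 89.7
```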
FIGURE 11.3: Least squares estimation of the regression line.
Function G is usually sought in a suitable form: linear, quadratic, logarithmic, etc. The simplest form is linear.
11.1.3 Linear regression
The linear regression model assumes that the conditional expectation

G(x) = E{Y | X = x} = β0 + β1x

is a linear function of x. As any linear function, it has an intercept β0 and a slope β1.
The intercept

β0 = G(0)

equals the value of the regression function for x = 0. Sometimes it has no physical meaning. For example, nobody will try to predict the value of a computer with 0 random access memory (RAM), and nobody will consider the Federal Reserve rate in year 0. In other cases, the intercept is quite important. For example, according to Ohm's Law (V = RI), the voltage across an ideal conductor is proportional to the current. A non-zero intercept (V = V0 + RI) would show that the circuit is not ideal and that there is an external loss of voltage.
The slope

β1 = G(x + 1) − G(x)

is the predicted change in the response variable when the predictor changes by 1. This is a very important parameter that shows how fast we can change the expected response by varying the predictor. For example, customer satisfaction will increase by β1(Δx) when the quality of produced computers increases by Δx.

A zero slope means the absence of a linear relationship between X and Y. In this case, Y is expected to stay constant when X changes.
Regression estimates

b0 = β̂0 = ȳ − b1x̄,
b1 = β̂1 = Sxy/Sxx,

where

Sxx = Σ_{i=1}^{n} (xi − x̄)²,   Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).    (11.6)
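As a sketch of how these formulas are used in practice (the data below are made up purely to exercise Equation 11.6; NumPy is assumed):

```python
import numpy as np

def least_squares(x, y):
    """Estimates b1 = Sxy/Sxx and b0 = ybar - b1*xbar from Equation (11.6)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = sxy / sxx
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical (x, y) pairs used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = least_squares(x, y)

print(b0, b1)                  # intercept and slope estimates
print(np.polyfit(x, y, 1))     # cross-check: returns [slope, intercept]
```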
Example 11.3 (World population). In Example 11.1, xi is the year, and yi is the world population during that year. To estimate the regression line in Figure 11.1, we compute Sxx and Sxy and then the estimates b1 and b0. We conclude that the world population grows at the average rate of 74.1 million every year.

We can use the obtained equation to predict the future growth of the world population. Regression predictions for years 2015 and 2020 are
G(2015) = b0 + 2015 b1 = 7152 million people
G(2020) = b0 + 2020 b1 = 7523 million people
♦
11.1.4 Regression and correlation
Recall from Section 3.3.5 the covariance

Cov(X, Y) = E[(X − EX)(Y − EY)]

and the correlation coefficient

ρ = Cov(X, Y) / ((Std X)(Std Y)).
7.1 CONFIDENCE INTERVALS FOR MEANS
Thus, since the probability of the first of these is 1 − α, the probability of the last must also be 1 − α, because the latter is true if and only if the former is true. That is, we have

P[X̄ − zα/2(σ/√n) ≤ μ ≤ X̄ + zα/2(σ/√n)] = 1 − α.
So the probability that the random interval

[X̄ − zα/2(σ/√n), X̄ + zα/2(σ/√n)]

includes the unknown mean μ is 1 − α. Once the sample is observed and the sample mean computed to equal x̄, the interval [x̄ − zα/2(σ/√n), x̄ + zα/2(σ/√n)] becomes known. Since the probability that the random interval covers μ before the sample is drawn is equal to 1 − α, we now call the computed interval, x̄ ± zα/2(σ/√n) (for brevity), a 100(1 − α)% confidence interval for the unknown mean μ. For example, x̄ ± 1.96(σ/√n) is a 95% confidence interval for μ. The number 100(1 − α)%, or equivalently, 1 − α, is called the confidence coefficient.
We see that the confidence interval for μ is centered at the point estimate x̄ and is completed by subtracting and adding the quantity zα/2(σ/√n). Note that as n increases, zα/2(σ/√n) decreases, resulting in a shorter confidence interval with the same confidence coefficient 1 − α. A shorter confidence interval gives a more precise estimate of μ, regardless of the confidence we have in the estimate of μ. Statisticians who are not restricted by time, money, effort, or the availability of observations can obviously make the confidence interval as short as they like by increasing the sample size n. For a fixed sample size n, the length of the confidence interval can also be shortened by decreasing the confidence coefficient 1 − α. But if this is done, we achieve a shorter confidence interval at the expense of losing some confidence.
Example 7.1-1 Let X equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume that the distribution of X is N(μ, 1296). If a random sample of n = 27 bulbs is tested until they burn out, yielding a sample mean of x̄ = 1478 hours, then a 95% confidence interval for μ is

[x̄ − z0.025(σ/√n), x̄ + z0.025(σ/√n)] = [1478 − 1.96(36/√27), 1478 + 1.96(36/√27)]
                                      = [1478 − 13.58, 1478 + 13.58]
                                      = [1464.42, 1491.58].
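The same interval is a one-liner to recompute; this sketch simply reproduces the endpoints of Example 7.1-1 (the use of scipy.stats.norm is my choice of tool, not the text's).

```python
from math import sqrt
from scipy.stats import norm

xbar, sigma, n, alpha = 1478.0, 36.0, 27, 0.05

z = norm.ppf(1 - alpha / 2)            # z_{0.025} = 1.96
half = z * sigma / sqrt(n)             # about 13.58

print((xbar - half, xbar + half))      # about (1464.42, 1491.58)
```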
The next example will help to give a better intuitive feeling for the interpretation of a confidence interval.
Example 7.1-2 Let x̄ be the observed sample mean of five observations of a random sample from the normal distribution N(μ, 16). A 90% confidence interval for the unknown mean μ is

[x̄ − 1.645√(16/5), x̄ + 1.645√(16/5)].
1 − α = P[−tα/2(n−1) ≤ (X̄ − μ)/(S/√n) ≤ tα/2(n−1)]
      = P[−tα/2(n−1)(S/√n) ≤ X̄ − μ ≤ tα/2(n−1)(S/√n)]
      = P[−X̄ − tα/2(n−1)(S/√n) ≤ −μ ≤ −X̄ + tα/2(n−1)(S/√n)]
      = P[X̄ − tα/2(n−1)(S/√n) ≤ μ ≤ X̄ + tα/2(n−1)(S/√n)].

Thus, the observations of a random sample provide x̄ and s², and

[x̄ − tα/2(n−1)(s/√n), x̄ + tα/2(n−1)(s/√n)]

is a 100(1 − α)% confidence interval for μ.
Example 7.1-5 Let X equal the amount of butterfat in pounds produced by a typical cow during a 305-day milk production period between her first and second calves. Assume that the distribution of X is N(μ, σ²). To estimate μ, a farmer measured the butterfat production for n = 20 cows and obtained the following data:
481 537 513 583 453 510 570 500 457 555
618 327 350 643 499 421 505 637 599 392
For these data, x̄ = 507.50 and s = 89.75. Thus, a point estimate of μ is x̄ = 507.50. Since t0.05(19) = 1.729, a 90% confidence interval for μ is

507.50 ± 1.729(89.75/√20), or 507.50 ± 34.70, or equivalently, [472.80, 542.20].
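A short sketch reproducing this interval from the raw data (SciPy is assumed; the quantile call replaces the table lookup t0.05(19) = 1.729):

```python
import numpy as np
from scipy.stats import t

butterfat = np.array([481, 537, 513, 583, 453, 510, 570, 500, 457, 555,
                      618, 327, 350, 643, 499, 421, 505, 637, 599, 392], float)

n = butterfat.size
xbar = butterfat.mean()                      # 507.50
s = butterfat.std(ddof=1)                    # about 89.75
half = t.ppf(0.95, n - 1) * s / np.sqrt(n)   # 1.729 * s / sqrt(20), about 34.70

print(xbar, s)
print((xbar - half, xbar + half))            # about (472.80, 542.20)
```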
Let T have a t distribution with n − 1 degrees of freedom. Then tα/2(n−1) > zα/2. Consequently, we would expect the interval x̄ ± zα/2(σ/√n) to be shorter than the interval x̄ ± tα/2(n−1)(s/√n). After all, we have more information, namely, the value of σ, in constructing the first interval. However, the length of the second interval is very much dependent on the value of s. If the observed s is smaller than σ, a shorter confidence interval could result from the second procedure. But on the average, x̄ ± zα/2(σ/√n) is the shorter of the two confidence intervals (Exercise 7.1-14).
Example 7.1-6 In Example 7.1-2, 50 confidence intervals were simulated for the mean of a normal distribution, assuming that the variance was known. For those same data, since t0.05(4) = 2.132, x̄ ± 2.132(s/√5) was used to calculate a 90% confidence interval for μ. For those particular 50 intervals, 46 contained the mean μ = 50. These 50 intervals are depicted in Figure 7.1-1(b). Note the different lengths of the intervals. Some are longer and some are shorter than the corresponding z intervals. The average length of the 50 t intervals is 7.137, which is quite close to the expected length of such an interval: 7.169. (See Exercise 7.1-14.) The length of the intervals that use z and σ = 4 is 5.885.
has a t distribution with n + m − 2 degrees of freedom. That is,

T = {[X̄ − Ȳ − (μX − μY)] / √(σ²/n + σ²/m)} / √{[(n − 1)S²X/σ² + (m − 1)S²Y/σ²] / (n + m − 2)}

  = [X̄ − Ȳ − (μX − μY)] / √{[((n − 1)S²X + (m − 1)S²Y)/(n + m − 2)][1/n + 1/m]}

has a t distribution with r = n + m − 2 degrees of freedom. Thus, with t0 = tα/2(n+m−2), we have
P(−t0 ≤ T ≤ t0) = 1 − α.
Solving the inequalities for μX − μY yields

P(X̄ − Ȳ − t0 SP√(1/n + 1/m) ≤ μX − μY ≤ X̄ − Ȳ + t0 SP√(1/n + 1/m)) = 1 − α,

where the pooled estimator of the common standard deviation is

SP = √{[(n − 1)S²X + (m − 1)S²Y] / (n + m − 2)}.
If x̄, ȳ, and sp are the observed values of X̄, Ȳ, and SP, then

[x̄ − ȳ − t0 sp√(1/n + 1/m), x̄ − ȳ + t0 sp√(1/n + 1/m)]

is a 100(1 − α)% confidence interval for μX − μY.
Example 7.2-2 Suppose that scores on a standardized test in mathematics taken by students from large and small high schools are N(μX, σ²) and N(μY, σ²), respectively, where σ² is unknown. If a random sample of n = 9 students from large high schools yielded x̄ = 81.31, s²x = 60.76, and a random sample of m = 15 students from small high schools yielded ȳ = 78.61, s²y = 48.24, then the endpoints for a 95% confidence interval for μX − μY are given by

81.31 − 78.61 ± 2.074 √{[8(60.76) + 14(48.24)]/22} √(1/9 + 1/15),

because t0.025(22) = 2.074. The 95% confidence interval is [−3.65, 9.05].
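The endpoints can be checked with a few lines of SciPy (a sketch using only the summary statistics quoted in the example):

```python
from math import sqrt
from scipy.stats import t

n, xbar, sx2 = 9, 81.31, 60.76      # large high schools
m, ybar, sy2 = 15, 78.61, 48.24     # small high schools
alpha = 0.05

sp = sqrt(((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2))   # pooled standard deviation
t0 = t.ppf(1 - alpha / 2, n + m - 2)                       # t_{0.025}(22) = 2.074
half = t0 * sp * sqrt(1 / n + 1 / m)

print((xbar - ybar - half, xbar - ybar + half))            # about (-3.65, 9.05)
```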
REMARKS The assumption of equal variances, namely, σ²X = σ²Y, can be modified somewhat so that we are still able to find a confidence interval for μX − μY. That is, if we know the ratio σ²X/σ²Y of the variances, we can still make this type of statistical inference.
7.3 CONFIDENCE INTERVALS FOR PROPORTIONS
P[−zα/2 ≤ (Y/n − p)/√(p(1 − p)/n) ≤ zα/2] ≈ 1 − α.    (7.3-1)

If we proceed as we did when we found a confidence interval for μ in Section 7.1, we would obtain

P[Y/n − zα/2√(p(1 − p)/n) ≤ p ≤ Y/n + zα/2√(p(1 − p)/n)] ≈ 1 − α.
Unfortunately, the unknown parameter p appears in the endpoints of this inequality. There are two ways out of this dilemma. First, we could make an additional approximation, namely, replacing p with Y/n in p(1 − p)/n in the endpoints. That is, if n is large enough, it is still true that

P[Y/n − zα/2√((Y/n)(1 − Y/n)/n) ≤ p ≤ Y/n + zα/2√((Y/n)(1 − Y/n)/n)] ≈ 1 − α.
Thus, for large n, if the observed Y equals y, then the interval

[y/n − zα/2√((y/n)(1 − y/n)/n), y/n + zα/2√((y/n)(1 − y/n)/n)]

serves as an approximate 100(1 − α)% confidence interval for p. Frequently, this interval is written as

y/n ± zα/2√((y/n)(1 − y/n)/n)    (7.3-2)

for brevity. This formulation clearly notes, as does x̄ ± zα/2(σ/√n) in Section 7.1, the reliability of the estimate y/n, namely, that we are 100(1 − α)% confident that p is within zα/2√((y/n)(1 − y/n)/n) of p̂ = y/n.
A second way to solve for p in the inequality in Equation 7.3-1 is to note that

|Y/n − p|/√(p(1 − p)/n) ≤ zα/2

is equivalent to

H(p) = (Y/n − p)² − z²α/2 p(1 − p)/n ≤ 0.    (7.3-3)

But H(p) is a quadratic expression in p. Thus, we can find those values of p for which H(p) ≤ 0 by finding the two zeros of H(p). Letting p̂ = Y/n and z0 = zα/2 in Equation 7.3-3, we have

H(p) = (1 + z0²/n)p² − (2p̂ + z0²/n)p + p̂².
By the quadratic formula, the zeros of H(p) are, after simplifications,

[p̂ + z0²/(2n) ± z0√(p̂(1 − p̂)/n + z0²/(4n²))] / (1 + z0²/n).    (7.3-4)
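Both approaches are easy to code side by side. The sketch below computes the plug-in interval of Equation 7.3-2 and the interval obtained from the zeros in Equation 7.3-4; the counts y = 40 and n = 100 are hypothetical, chosen only for illustration.

```python
from math import sqrt
from scipy.stats import norm

def proportion_intervals(y, n, alpha=0.05):
    """Approximate CIs for p: Equation 7.3-2 (plug-in) and Equation 7.3-4 (zeros of H(p))."""
    phat = y / n
    z0 = norm.ppf(1 - alpha / 2)

    # Equation 7.3-2: replace p by y/n in the standard error.
    half = z0 * sqrt(phat * (1 - phat) / n)
    plug_in = (phat - half, phat + half)

    # Equation 7.3-4: the two zeros of the quadratic H(p).
    center = phat + z0**2 / (2 * n)
    spread = z0 * sqrt(phat * (1 - phat) / n + z0**2 / (4 * n**2))
    denom = 1 + z0**2 / n
    quadratic = ((center - spread) / denom, (center + spread) / denom)

    return plug_in, quadratic

print(proportion_intervals(40, 100))   # hypothetical y = 40 successes in n = 100 trials
```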
8.1 TESTS ABOUT ONE MEAN
Table 8.1-1 Tests of hypotheses about one mean, variance known

H0           H1           Critical Region
μ = μ0      μ > μ0      z ≥ zα  or  x̄ ≥ μ0 + zα(σ/√n)
μ = μ0      μ < μ0      z ≤ −zα  or  x̄ ≤ μ0 − zα(σ/√n)
μ = μ0      μ ≠ μ0      |z| ≥ zα/2  or  |x̄ − μ0| ≥ zα/2(σ/√n)
Z = (X̄ − μ0)/√(σ²/n) = (X̄ − μ0)/(σ/√n),    (8.1-1)

and the critical regions, at a significance level α, for the three respective alternative hypotheses would be (i) z ≥ zα, (ii) z ≤ −zα, and (iii) |z| ≥ zα/2. In terms of x̄, these three critical regions become (i) x̄ ≥ μ0 + zα(σ/√n), (ii) x̄ ≤ μ0 − zα(σ/√n), and (iii) |x̄ − μ0| ≥ zα/2(σ/√n).

The three tests and critical regions are summarized in Table 8.1-1. The underlying assumption is that the distribution is N(μ, σ²) and σ² is known.

It is usually the case that the variance σ² is not known. Accordingly, we now take a more realistic position and assume that the variance is unknown. Suppose our null hypothesis is H0: μ = μ0 and the two-sided alternative hypothesis is H1: μ ≠ μ0. Recall from Section 7.1 that, for a random sample X1, X2, . . . , Xn taken from a normal distribution N(μ, σ²), a confidence interval for μ is based on
T = (X̄ − μ)/√(S²/n) = (X̄ − μ)/(S/√n).
This suggests that T might be a good statistic to use for the test of H0: μ = μ0 with μ replaced by μ0. In addition, it is the natural statistic to use if we replace σ²/n by its unbiased estimator S²/n in (X̄ − μ0)/√(σ²/n) in Equation 8.1-1. If μ = μ0, we know that T has a t distribution with n − 1 degrees of freedom. Thus, with μ = μ0,

P[|T| ≥ tα/2(n−1)] = P[|X̄ − μ0|/(S/√n) ≥ tα/2(n−1)] = α.
Accordingly, if x̄ and s are, respectively, the sample mean and sample standard deviation, then the rule that rejects H0: μ = μ0 and accepts H1: μ ≠ μ0 if and only if

|t| = |x̄ − μ0|/(s/√n) ≥ tα/2(n−1)

provides a test of this hypothesis with significance level α. Note that this rule is equivalent to rejecting H0: μ = μ0 if μ0 is not in the open 100(1 − α)% confidence interval

(x̄ − tα/2(n−1)[s/√n], x̄ + tα/2(n−1)[s/√n]).
Table 8.1-2 summarizes tests of hypotheses for a single mean, along with the three possible alternative hypotheses, when the underlying distribution is N(μ, σ²), σ² is unknown, t = (x̄ − μ0)/(s/√n), and n ≤ 30. If n > 30, we use Table 8.1-1 for approximate tests, with σ replaced by s.
Table 8.1-2 Tests of hypotheses for one mean, variance unknown

H0           H1           Critical Region
μ = μ0      μ > μ0      t ≥ tα(n−1)  or  x̄ ≥ μ0 + tα(n−1)(s/√n)
μ = μ0      μ < μ0      t ≤ −tα(n−1)  or  x̄ ≤ μ0 − tα(n−1)(s/√n)
μ = μ0      μ ≠ μ0      |t| ≥ tα/2(n−1)  or  |x̄ − μ0| ≥ tα/2(n−1)(s/√n)

Let X (in millimeters) equal the growth in 15 days of a tumor induced in a mouse. Assume that the distribution of X is N(μ, σ²). We shall test the null hypothesis H0: μ = μ0 = 4.0 mm against the two-sided alternative hypothesis H1: μ ≠ 4.0. If we use n = 9 observations and a significance level of α = 0.10, the critical region is
|t| = |x̄ − 4.0|/(s/√9) ≥ tα/2(8) = 1.860.

If we are given that n = 9, x̄ = 4.3, and s = 1.2, we see that

t = (4.3 − 4.0)/(1.2/√9) = 0.3/0.4 = 0.75.

Thus,

|t| = |0.75| < 1.860,
and we accept (do not reject) H0: μ = 4.0 at the α = 10% significance level. (See Figure 8.1-3.) The p-value is the two-sided probability of |T| ≥ 0.75, namely,

p-value = P(|T| ≥ 0.75) = 2P(T ≥ 0.75).

With our t tables with eight degrees of freedom, we cannot find this p-value exactly. It is about 0.50, because

P(|T| ≥ 0.706) = 2P(T ≥ 0.706) = 0.50.

However, Minitab gives a p-value of 0.4747. (See Figure 8.1-3.)
Figure 8.1-3 Test about mean of tumor growths (two-sided critical region with α/2 = 0.05 in each tail, and the p-value corresponding to t = 0.75)
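From the summary statistics alone, the test statistic and the exact two-sided p-value quoted above can be recomputed as follows (a sketch; scipy.stats.t replaces the t table):

```python
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 9, 4.3, 1.2, 4.0

t_obs = (xbar - mu0) / (s / sqrt(n))     # 0.75
p_value = 2 * t.sf(abs(t_obs), n - 1)    # two-sided p-value

print(t_obs, p_value)                    # 0.75 and about 0.4747
print(t.ppf(0.95, n - 1))                # critical value t_{0.05}(8) = 1.860
```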
8.2 TESTS OF THE EQUALITY OF TWO MEANS
for the X sample and

0.8   1.15   1.6   2.2   2.6

for the Y sample. The two box plots are shown in Figure 8.2-2.

Figure 8.2-2 Box plots for pea stem growths
Assuming independent random samples of sizes n and m, let x̄, ȳ, and s²p represent the observed unbiased estimates of the respective parameters μX, μY, and σ²X = σ²Y of two normal distributions with a common variance. Then α-level tests of certain hypotheses are given in Table 8.2-1 when σ²X = σ²Y. If the common-variance assumption is violated, but not too badly, the test is satisfactory, but the significance levels are only approximate. The t statistic and sp are given in Equations 8.2-1 and 8.2-2, respectively.
REMARK Again, to emphasize the relationship between confidence intervals and tests of hypotheses, we note that each of the tests in Table 8.2-1 has a corresponding confidence interval. For example, the first one-sided test is equivalent to saying that we reject H0: μX − μY = 0 if zero is not in the one-sided confidence interval with lower bound

x̄ − ȳ − tα(n+m−2) sp√(1/n + 1/m).
Table 8.2-1 Tests of hypotheses for equality of two means

H0            H1            Critical Region
μX = μY      μX > μY      t ≥ tα(n+m−2)  or  x̄ − ȳ ≥ tα(n+m−2) sp√(1/n + 1/m)
μX = μY      μX < μY      t ≤ −tα(n+m−2)  or  x̄ − ȳ ≤ −tα(n+m−2) sp√(1/n + 1/m)
μX = μY      μX ≠ μY      |t| ≥ tα/2(n+m−2)  or  |x̄ − ȳ| ≥ tα/2(n+m−2) sp√(1/n + 1/m)
rolled to yield a total of n = 8000 observations. Let Y equal the number of times that 6 resulted in the 8000 trials. The test statistic is

Z = (Y/n − 1/6)/√((1/6)(5/6)/n) = (Y/8000 − 1/6)/√((1/6)(5/6)/8000).
If we use a significance level of α = 0.05, the critical region is
z ≥ z0.05 = 1.645.
The results of the experiment yielded y = 1389, so the calculated value of the test statistic is

z = (1389/8000 − 1/6)/√((1/6)(5/6)/8000) = 1.67.
Since
z = 1.67 > 1.645,
the null hypothesis is rejected, and the experimental results indicate that these dice favor a 6 more than a fair die would. (You could perform your own experiment to check out other dice.)
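Here is a quick sketch of the same one-proportion z test in Python (SciPy is assumed); it also reports the one-sided p-value, which the example itself does not quote.

```python
from math import sqrt
from scipy.stats import norm

y, n, p0 = 1389, 8000, 1 / 6

z = (y / n - p0) / sqrt(p0 * (1 - p0) / n)   # about 1.67

print(z)
print(norm.ppf(0.95))    # critical value z_{0.05} = 1.645
print(norm.sf(z))        # one-sided p-value, a little under 0.05
```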
There are times when a two-sided alternative is appropriate; that is, here we test H0: p = p0 against H1: p ≠ p0. For example, suppose that the pass rate in the usual beginning statistics course is p0. There has been an intervention (say, some new teaching method), and it is not known whether the pass rate will increase, decrease, or stay about the same. Thus, we test the null (no-change) hypothesis H0: p = p0 against the two-sided alternative H1: p ≠ p0. A test with the approximate significance level α for doing this is to reject H0: p = p0 if

|Z| = |Y/n − p0|/√(p0(1 − p0)/n) ≥ zα/2,

since, under H0, P(|Z| ≥ zα/2) ≈ α. These tests of approximate significance level α are summarized in Table 8.3-1. The rejection region for H0 is often called the critical region of the test, and we use that terminology in the table.
The p-value associated with a test is the probability, under the null hypothesis H0, that the test statistic (a random variable) is equal to or exceeds the observed value (a constant) of the test statistic in the direction of the alternative hypothesis.
Table 8.3-1 Tests of hypotheses for one proportion

H0          H1          Critical Region
p = p0     p > p0     z ≥ zα
p = p0     p < p0     z ≤ −zα
p = p0     p ≠ p0     |z| ≥ zα/2

where z = (y/n − p0)/√(p0(1 − p0)/n).