Continuous Probability Distributions

Ka-fu WONG

23 August 2007

Abstract

In the previous chapter, we noted that many continuous random variables can be approximated by discrete random variables. At the same time, many discrete random variables may be approximated by continuous random variables. Thus, we study continuous probability distributions not just for the sake of understanding continuous random variables, but also for the sake of understanding their approximation to discrete random variables. In many important cases, it turns out that the approximation is good enough, and continuous probability distributions are easier to work with, if we know enough about them. Among continuous probability distributions, the normal distribution is the most important. The normal distribution will be used over and over again in later chapters.

The most difficult part about continuous probability distributions is understanding their connection with the discrete ones. Once this is done, many results about discrete probability distributions can be easily extended to the continuous case.

A continuous random variable can assume an infinite, uncountable number of values within a given range or interval(s). Recall that a discrete random variable can also take an infinite number of values. A continuous random variable differs in that the number of values it can take is uncountable, i.e., it is impossible to list all the values it can take. For instance, it is not possible to list all the real numbers within the interval [0, 1]. The probability distributions associated with continuous random variables are logically called continuous probability distributions.

Example 1 (Continuous random variables): A continuous random variable is a variable that can assume any value in an interval. Some variables are continuous in nature. For example:

1. The thickness of our Microeconomics textbook.

2. The time required to complete a homework assignment.

3. The temperature of a patient.

4. The distance travelled from my home to school.

Other variables are continuous because they are averages of discrete random variables.

1. Gini coefficient, a measure of inequality in an economy.

2. Unemployment rate.


3. Inflation rate.

4. A stock market index, such as the Hang Seng Index, Dow Jones Averages, Nasdaq, and S&P 500.

    5. A student’s grade point average.

    6. Average age of students in this class.

    7. Average weekly working hours of employees.

    8. Average hourly salary of students working part-time.

These variables can potentially take on any value, depending only on our ability to measure and report them accurately.

The above examples suggest that averages are better characterized as continuous rather than discrete variables, even if the underlying variables used for the mean calculations were discrete.

Conceptually, a continuous probability distribution should be very similar to a discrete probability distribution. After all, the two types of probability distributions may be viewed as approximations of each other. The main difference is that the probability that a continuous random variable takes a specific value is zero. However, the probability that a continuous random variable takes a value in an interval [a, b] can be positive. This difference has led to changes in the definitions and calculations.

    1 Features of a Continuous Probability Distribution

A probability distribution may be classified according to the number of random variables it describes.

Number of random variables    Joint distribution
1                             Univariate probability distribution
2                             Bivariate probability distribution
3                             Trivariate probability distribution
...                           ...
n                             Multivariate probability distribution

These distributions have similar characteristics. We will discuss these characteristics for the univariate and the bivariate distributions. The extension to the multivariate case will be straightforward.

Theorem 1 (Characteristics of a Univariate Continuous Distribution): Suppose the random variable X is defined on the interval between a and b, i.e., X ∈ [a, b]. That is, X can take any value between a and b.

1. The probability that X takes a value in an interval [c, d] is

P(X ∈ [c, d]) = ∫_c^d f(x) dx

where f(x) denotes the probability density function of X. Note that the expression has an interpretation parallel to the discrete case: f(x)dx may be interpreted as the probability, or the area under the density curve, defined on the neighborhood of x, say [x, x + dx].

2. The probability density function f(x) is non-negative and may be larger than 1.

3. P(X ∈ [c, d]) = ∫_c^d f(x) dx is between 0 and 1.00, i.e., in [0, 1].

4. The sum of the probabilities of the various outcomes is 1.00. That is,

P(X ∈ [−∞, ∞]) = P(X ∈ [a, b]) = ∫_a^b f(x) dx = 1

5. Let the events defined on two non-overlapping intervals, [c1, d1] and [c2, d2], be X ∈ [c1, d1] and X ∈ [c2, d2]. These two events are mutually exclusive. That is,

P(X ∈ [c1, d1] and X ∈ [c2, d2]) = 0.

P(X ∈ [c1, d1] or X ∈ [c2, d2]) = P(X ∈ [c1, d1]) + P(X ∈ [c2, d2])

The use of integration (∫) in the definition could be terrifying, especially for students who have never studied it before. We are terrified only because we do not know that there is a simple connection between the discrete probability distribution and the continuous probability distribution. Let us make the connection below and leave the introduction of integration to the mathematical appendix.


2 Making a connection between discrete and continuous distributions

To understand the difference between the two types of distributions, let's start with a series of questions (from simple to complicated) for which we can easily derive good answers.

    2.1 Imagine throwing a dart at [0, 1]

Consider a continuous random variable that is defined over the line segment [0, 1]. We can imagine throwing a dart at the segment, with each point in the segment having an equal chance of being hit by the dart.

1. What is the probability that a dart randomly thrown will end up in the segment [0, 1]? The answer is simple. Since the dart has to land somewhere on [0, 1], the probability of the dart landing on the segment [0, 1] is 1.

2. What is the probability that a dart randomly thrown will end up in the segment [0, 1/2]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [0, 1], the probability of having the dart land on half of the line [0, 1], i.e., the segment [0, 1/2], has to be 1/2.

3. What is the probability that a dart randomly thrown will end up in the segment [0, 1/4]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [0, 1], the probability of having the dart land on a quarter of the line [0, 1], i.e., the segment [0, 1/4], has to be 1/4.

4. What is the probability that a dart randomly thrown will end up in the segment [0, 1/8]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [0, 1], the probability of having the dart land on an eighth of the line [0, 1], i.e., the segment [0, 1/8], has to be 1/8.

5. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8]? The answer is slightly more difficult. The length of the segment is 1/8, the same as in the last question. Since the dart has an equal chance to land on any point on [0, 1], the probability of having the dart land on an eighth of the line [0, 1], i.e., the segment [2/8, 3/8], has to be 1/8.

6. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] AND [5/8, 6/8]? The answer is not difficult. We have two non-overlapping segments of equal length. It is impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.


7. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] OR [5/8, 6/8]? The answer is slightly more difficult. We have two non-overlapping segments of equal length. The probability should equal the sum of the probability of landing on the segment [2/8, 3/8] and the probability of landing on the segment [5/8, 6/8]. That is, it should be 1/8 + 1/8 = 2/8.

8. What is the probability that a dart randomly thrown will end up in the segment [0, k], where k < 1? The answer is slightly more difficult. As we learn from the previous discussions, there is a 1/2 probability of having the dart land on the segment [0, 1/2], 1/4 on the segment [0, 1/4], and 1/8 on the segment [0, 1/8]. It is not too difficult to induce that the probability of having the dart land on the segment [0, k] is simply k.

9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2? The answer is slightly more difficult. It is not too difficult to induce that the probability of having the dart land on the segment [k1, k2] is simply k2 − k1.

10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4], where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.

11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4], where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The probability should equal the sum of the probability of landing on the segment [k1, k2] and the probability of landing on the segment [k3, k4]. That is, it should be (k2 − k1) + (k4 − k3).

12. What is the probability that a dart randomly thrown will end up exactly at a single point, 2/3? The answer is slightly more difficult. There are infinitely many, uncountable points on the entire segment [0, 1]. The probability of the dart ending up at the point 2/3 is like the probability of the dart ending up in an interval with zero length. That is, it should be zero.

13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, 1/3 or 2/3? The answer is slightly more difficult. Since the two points are non-overlapping, the probability should be the sum of the individual ones, i.e., 0 = 0 + 0.

14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, 1/100, 2/100, ..., or 99/100? The answer is simple. Since the 99 points are non-overlapping, the probability should be the sum of the individual ones, i.e., 0.

    2.2 Imagine throwing a dart at [a, b]

Let's repeat the above questions and answers with a slight change. Consider a continuous random variable that is defined over the line segment [a, b], where a < b. We can imagine throwing a dart at the segment, with each point in the segment having an equal chance of being hit by the dart.

1. What is the probability that a dart randomly thrown will end up in the segment [a, b]? The answer is simple. Since the dart has to land somewhere on [a, b], the probability of the dart landing on the segment [a, b] is 1.

2. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/2)(b − a)]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [a, b], the probability of having the dart land on half of the line [a, b], i.e., the segment [a, a + (1/2)(b − a)], has to be 1/2.

3. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/4)(b − a)]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [a, b], the probability of having the dart land on a quarter of the line [a, b], i.e., the segment [a, a + (1/4)(b − a)], has to be 1/4.

4. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/8)(b − a)]? The answer is slightly more difficult. Since the dart has an equal chance to land on any point on [a, b], the probability of having the dart land on an eighth of the line [a, b], i.e., the segment [a, a + (1/8)(b − a)], has to be 1/8.

5. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]? The answer is slightly more difficult. The length of the segment is an eighth of the length of [a, b], the same as in the last question. Since the dart has an equal chance to land on any point on [a, b], the probability of having the dart land on an eighth of the line [a, b], i.e., the segment [a + (2/8)(b − a), a + (3/8)(b − a)], has to be 1/8.

6. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)] AND [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is not difficult. We have two non-overlapping segments of equal length. It is impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.


7. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)] OR [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is slightly more difficult. We have two non-overlapping segments of equal length. The probability should equal the sum of the probability of landing on the segment [a + (2/8)(b − a), a + (3/8)(b − a)] and the probability of landing on the segment [a + (5/8)(b − a), a + (6/8)(b − a)]. That is, it should be 1/8 + 1/8 = 2/8.

8. What is the probability that a dart randomly thrown will end up in the segment [a, k], where a ≤ k ≤ b? The answer is slightly more difficult. As we learn from the previous discussions, it is not too difficult to induce that the probability of having the dart land on the segment [a, k] is simply (k − a)/(b − a).

9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2? The answer is slightly more difficult. It is not too difficult to induce that the probability of having the dart land on the segment [k1, k2] is simply (k2 − k1)/(b − a).

10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4], where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.

11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4], where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The probability should equal the sum of the probability of landing on the segment [k1, k2] and the probability of landing on the segment [k3, k4]. That is, it should be [(k2 − k1) + (k4 − k3)]/(b − a).

12. What is the probability that a dart randomly thrown will end up exactly at a single point k1? The answer is slightly more difficult. There are infinitely many, uncountable points on the entire segment [a, b]. The probability of the dart ending up at the point k1 is like the probability of the dart ending up in an interval with zero length. That is, it should be zero.

13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, k1 or k2? The answer is slightly more difficult. Since the two points are non-overlapping, the probability should be the sum of the individual ones, i.e., 0 = 0 + 0.

14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, k1, k2, ..., or k99? The answer is simple. Since the 99 points are non-overlapping, the probability should be the sum of the individual ones, i.e., 0.


2.3 Deriving the probability density function (pdf)

Based on what we know about discrete probability distributions, it might appear that there is an inconsistency between the notions P(x ∈ [k1, k2]) = (k2 − k1)/(b − a) and P(x = k) = 0. Can we write P(x ∈ [k1, k2]) as a sum of probabilities of individual events? Yes, with the introduction of the density concept. That is, we would like to define a density c such that P(x ∈ [a, k]) = (k − a) × c. What would c be? We know that c must also satisfy P(x ∈ [a, b]) = (b − a) × c = 1, implying c = 1/(b − a). It is not too difficult to check that the density so defined will generate results that are consistent with the discussions above. In particular, P(x ∈ [a, k]) = (k − a) × c = (k − a)/(b − a), P(x ∈ [k1, k2]) = (k2 − k1)/(b − a), and P(x = k) = P(x ∈ [k, k]) = (k − k)/(b − a) = 0.

What we have just discussed is the so-called uniform distribution over the interval [a, b].

Definition 1 (Uniform distribution): If a and b are numbers on the real line, the random variable X ∼ U(a, b), i.e., has a uniform distribution, if the density function is

f(x) = 1/(b − a) for a ≤ x ≤ b
f(x) = 0 otherwise

and the cumulative distribution function (cdf) is

F(x) = Prob(X < x) = 0 for x ≤ a
F(x) = Prob(X < x) = (x − a)/(b − a) for a ≤ x ≤ b
F(x) = Prob(X < x) = 1 for b ≤ x
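To make the definition concrete, here is a minimal Python sketch (scipy is an assumed tool here, not part of the original notes; the formulas above are all that is really needed) that checks the uniform pdf and cdf against the density c = 1/(b − a) derived earlier:

```python
from scipy import stats

a, b = 2.0, 10.0                       # illustrative endpoints
X = stats.uniform(loc=a, scale=b - a)  # scipy parameterizes U(a, b) by loc = a, scale = b - a

# The density is constant at 1/(b - a) inside [a, b]
print(X.pdf(5.0), 1 / (b - a))                     # 0.125 0.125
# P(X in [k1, k2]) = F(k2) - F(k1) = (k2 - k1)/(b - a)
k1, k2 = 3.0, 7.0
print(X.cdf(k2) - X.cdf(k1), (k2 - k1) / (b - a))  # 0.5 0.5
```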

Recall that in the discrete case, the expected value of a random variable X with probability mass function P(X) is

E(X) = Σ_X X P(X)

In the continuous case, P(X = k) = 0, so how do we compute the expected value? We can use an approximation. Since P(X ∈ [k, k + dx]) > 0, it is possible to imagine the relevant segment divided into many smaller intervals of length dx ([a, a + dx], [a + dx, a + 2dx], ..., [a + ((b − a)/dx − 1)dx, a + ((b − a)/dx)dx]), and to obtain an approximation with the formula above by replacing P(X = k) with P(X ∈ [k, k + dx]), and replacing X with one of three possibilities:

1. The lower bound of [k, k + dx], i.e., k:

E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) P(X ∈ [a + (i − 1)dx, a + i·dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx

2. The upper bound of [k, k + dx], i.e., k + dx:

E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + i·dx) P(X ∈ [a + (i − 1)dx, a + i·dx]) = Σ_{i=1}^{(b−a)/dx} (a + i·dx) × c × dx

3. The mid-point of [k, k + dx], i.e., k + dx/2:

E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) P(X ∈ [a + (i − 1)dx, a + i·dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) × c × dx

The accuracy of such an approximation depends on dx. Generally, the smaller dx is, the more accurate the approximation. To obtain a more accurate approximation, we can imagine dx shrinking towards zero. In this case the number of terms being added expands to infinity:

E(X) = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} x_i × c × dx = ∫_a^b x c dx

[Figure: the uniform density y = f(x) = c = 1/(b − a) on [a, b], with the interval subdivided into segments [x_i, x_i + dx] of width dx at points x_1, x_2, x_3, ....]

In the expression above, we are really using the integral sign to stand for the limit of the sum:

lim_{dx→0} Σ_{i=1}^{(b−a)/dx} = ∫_a^b
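To see this limit numerically, the following sketch (plain Python; the choice a = 0, b = 1 is just for illustration) evaluates the left-endpoint sum above for successively smaller dx and watches it approach the exact integral ∫_0^1 x c dx = 1/2:

```python
a, b = 0.0, 1.0
c = 1 / (b - a)  # uniform density

for dx in [0.1, 0.01, 0.001]:
    n = int((b - a) / dx)
    # left-endpoint sum: sum over i of (a + (i - 1)dx) * c * dx
    approx = sum((a + i * dx) * c * dx for i in range(n))
    print(dx, approx)  # 0.45, 0.495, 0.4995: approaching (a + b)/2 = 0.5
```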

In the above discussion, we considered only the case in which the dart lands anywhere on an interval [a, b] with equal chance, i.e., the uniform distribution. The discussion can be easily extended to other situations, with the interval [a, b] subdivided into many sub-intervals and the densities held constant within each sub-interval. Letting each sub-interval be of length dx as before, we have the intervals [a, a + dx], [a + dx, a + 2dx], ..., [a + ((b − a)/dx − 1)dx, a + ((b − a)/dx)dx], and the corresponding densities in the intervals can be labelled f(x_1), ..., f(x_n), where n = (b − a)/dx. With this information, we can answer many questions similar to those discussed earlier. Again, in the limiting case with dx approaching zero, the probability will be defined as

P(x ∈ [k1, k2]) = lim_{dx→0} Σ_{[k1,k2]} f_i dx = ∫_{k1}^{k2} f(x) dx

More generally, we can define the cumulative distribution function (cdf) and use the cdf to compute the probability.

Definition 2 (Cumulative distribution function (cdf)): Let f(x) be the pdf of a continuous random variable X. The cumulative distribution function (cdf) is

F(x) = P(X < x) = ∫_{−∞}^x f(t) dt

so that

P(x ∈ [k1, k2]) = ∫_{k1}^{k2} f(x) dx = ∫_{−∞}^{k2} f(x) dx − ∫_{−∞}^{k1} f(x) dx = F(k2) − F(k1)

Given the cdf F(x), we can also derive the pdf f(x):

f(x) = (d/dx) F(x)

where d/dx means "differentiate with respect to x".
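As a small numerical check of Definition 2 (scipy assumed; the pdf f(x) = 2x on [0, 1], with cdf F(x) = x², is just an illustrative density, not one from the text):

```python
from scipy import integrate

f = lambda x: 2 * x       # an example pdf on [0, 1]; it integrates to 1
F = lambda x: x ** 2      # its cdf, F(x) = x^2

k1, k2 = 0.2, 0.9
p_direct, _ = integrate.quad(f, k1, k2)   # integrate the pdf directly
print(p_direct, F(k2) - F(k1))            # both 0.77

# f(x) = dF/dx, checked with a finite-difference derivative of the cdf
x, h = 0.5, 1e-6
print((F(x + h) - F(x - h)) / (2 * h), f(x))  # both about 1.0
```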

We can check that these definitions are consistent at least with the uniform distribution. In fact, the relation holds for any continuous probability distribution.

The above discussion suggests that the density function defines the probability distribution in the continuous case.

Example 2 (Part-time Work on Campus): A student has been offered part-time work in a laboratory. The professor says that the work will vary from week to week. The number of hours will be between 10 and 20, with a uniform probability density function.

1. How tall is the rectangle?

2. What is the probability of getting less than 15 hours in a week?

3. Given that the student gets at least 15 hours in a week, what is the probability that more than 17.5 hours will be available?

[Figure: the uniform density y = f(x) = c = 0.1 on [10, 20].]

Because the probability is uniformly distributed, the pdf can be illustrated as a rectangle, as above, and the height of the rectangle is the uniform density, i.e., 1/(20 − 10) = 0.1. If we employ the same letters we used previously, we have a = 10, b = 20, and c = 0.1. The pdf is

f(x) = 1/(20 − 10) = 0.1 for 10 ≤ x ≤ 20
f(x) = 0 otherwise

and the cdf is

F(x) = Prob(X < x) = 0 for x ≤ 10
F(x) = Prob(X < x) = (x − 10)/(20 − 10) for 10 ≤ x ≤ 20
F(x) = Prob(X < x) = 1 for 20 ≤ x

Thus Prob(X < 15) = (15 − 10)/(20 − 10) = 0.5, and

Prob(X > 17.5 | X > 15) = Prob(X > 17.5 and X > 15)/Prob(X > 15) = Prob(X > 17.5)/Prob(X > 15) = 0.25/0.5 = 0.5.

Remember that the probability of a continuous random variable being equal to an exact value is always zero. Therefore we are always relieved from worrying about the boundaries of intervals, i.e., whether to write ">" or "≥".
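The same answers can be reproduced in Python; a minimal sketch with scipy (an assumed tool, not part of the original example):

```python
from scipy import stats

X = stats.uniform(loc=10, scale=10)   # weekly hours ~ U(10, 20)

p_lt_15 = X.cdf(15)                   # Prob(X < 15)
p_gt_15 = 1 - X.cdf(15)               # Prob(X > 15)
p_gt_17_5 = 1 - X.cdf(17.5)           # Prob(X > 17.5)
print(p_lt_15, p_gt_17_5 / p_gt_15)   # 0.5 0.5 (the conditional probability)
```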

Example 3 (Customer Complaints): You are the manager of the complaint department for a large mail order company. Your data and experience indicate that the time it takes to handle a single call, denoted T, ranges from 0 to 15 minutes and has a right-triangle probability density function with a height of 2/15.

1. Show that the area under the triangle is 1.

2. Find the probability that a call will take longer than 10 minutes. That is, find P(T > 10).

3. Given that the call takes at least 5 minutes, what is the probability that it will take longer than 10 minutes? That is, find P(T > 10 | T > 5).

4. Find P(T < 10).

We know the area under the pdf curve is a probability and it must be 1, because the outcomes from 0 to 15 cover all possibilities. The pdf should look like the following if we believe the company tries to handle all calls as quickly as possible:

[Figure: the triangular density y = f(t), declining from height 2/15 at t = 0 to 0 at t = 15.]

P(T > 10) is the area under the pdf curve where 10 < t < 15. Basic geometry gives that this area equals 1/9, i.e., P(T > 10) = 1/9 ≈ 0.1111.

P(T > 10 | T > 5) = P(T > 10 and T > 5)/P(T > 5) = P(T > 10)/P(T > 5) = (1/9)/(4/9) = 0.25.

P(T < 10) = 1 − P(T > 10) = 1 − 1/9 = 8/9 ≈ 0.8889.
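The geometry can be double-checked by integration. A sketch (scipy assumed), taking the density to decline linearly from 2/15 at t = 0 to 0 at t = 15, i.e., f(t) = (2/15)(1 − t/15), the functional form suggested by the figure:

```python
from scipy import integrate

# Assumed triangular density on [0, 15]: height 2/15 at t = 0, zero at t = 15
f = lambda t: (2 / 15) * (1 - t / 15)

area, _ = integrate.quad(f, 0, 15)       # total area: 1
p_gt_10, _ = integrate.quad(f, 10, 15)   # P(T > 10): 1/9
p_gt_5, _ = integrate.quad(f, 5, 15)     # P(T > 5): 4/9
print(area, p_gt_10, p_gt_10 / p_gt_5)   # 1.0, 0.111..., 0.25
```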

    3 Normal distributions

The most popular distribution we will use after this chapter is the normal distribution. It pays to understand it very well.

Definition 3 (Normal distribution): The random variable X ∼ N(µ, σ²), i.e., has a normal distribution, if the random variable is defined on the whole real line (−∞, ∞) and has the following density function:

f(x) = (1/(σ√(2π))) e^(−(1/2)ε²),  where ε = (x − µ)/σ

and where µ and σ are the mean and standard deviation of the random variable (hence σ² is the variance), π = 3.14159..., and e = 2.71828... is the base of natural or Napierian logarithms.

A normal distribution is characterized by its mean and variance. Often, a normal random variable X with mean µ and variance σ² will be denoted as

X ∼ N(µ, σ²)

where "∼" reads "distributed as".

The normal distribution has several main characteristics:

1. It is bell-shaped and single-peaked (unimodal) at the exact center of the distribution, µ.

2. It is symmetrical about its mean. The arithmetic mean, median, and mode of the distribution are equal and located at the peak. Thus half the area under the curve is above the mean and half is below it.

3. The normal probability distribution is asymptotic. That is, the curve gets closer and closer to the X-axis but never actually touches it.

[Figure: normal densities N(0, 0.3), N(0, 0.5), N(0, 1), N(0, 2), N(−1, 0.5), and N(3, 0.5), plotted for x from −5 to 5.]

Given the density function described above, unlike the uniform distribution, it is not easy to integrate it to obtain the probability of a normal random variable lying within a segment. It is not too difficult to get Excel to do the calculation, even for different combinations of mean and variance. However, many years ago, computational power was limited. Statisticians worked very hard to come up with a table from which people can easily read off the cumulative distribution function of a normal random variable, i.e., P(X < x). Do we need a table for each combination of µ and σ²? It turns out that all normal random variables can be transformed to standard normal random variables easily. Thus, for those who know how to do the transformation, we only need one table: the standard normal table.

Definition 4 (Standard Normal distribution): A standard normal random variable is a normal random variable with zero mean (µ = 0) and unit standard deviation (σ = 1). Its probability density is defined on the whole real line (from −∞ to ∞):

f(x) = (1/√(2π)) e^(−(1/2)x²)

[Figure: the pdf and cdf of the standard normal distribution for x from −2.5 to 2.5.]

The standard normal distribution is sometimes known as the z-distribution.

Theorem 2 (Transform to Standard Normal Distribution): A linear transformation of a normal random variable remains normal. In particular, any normal random variable with mean µ and standard deviation σ can be transformed to a standard normal random variable:

Z = (X − µ)/σ

Thus, P(X ∈ [a, b]) = P(Z ∈ [(a − µ)/σ, (b − µ)/σ]). With this property, we do not need separate probability tables for different µ and σ. Instead, we only need one table: the standard normal table. Although we can easily calculate the probability of a normal random variable with any combination of mean and variance lying in an interval with the help of a computer, we need to learn how to use the tables in exams.
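A quick numerical illustration of Theorem 2 (scipy assumed; µ, σ, a, and b are arbitrary illustrative values): the probability computed directly from N(µ, σ²) matches the probability computed from the standardized bounds.

```python
from scipy.stats import norm

mu, sigma = 5.0, 2.0   # illustrative parameters for X ~ N(mu, sigma^2)
a, b = 4.0, 9.0

# P(X in [a, b]) computed directly from N(mu, sigma^2) ...
p_direct = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
# ... equals P(Z in [(a - mu)/sigma, (b - mu)/sigma]) for the standard normal Z
p_std = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
print(p_direct, p_std)  # both about 0.6687
```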

Example 4 (How to read the standard normal distribution table): Suppose we have the following part of a standard normal distribution table, accompanied by a graph:

z     ...   0.03     0.04     0.05     0.06     0.07    ...
...   ...   ...      ...      ...      ...      ...     ...
0.7   ...   0.2673   0.2704   0.2734   0.2764   0.2794  ...
0.8   ...   0.2967   0.2995   0.3023   0.3051   0.3078  ...
0.9   ...   0.3238   0.3264   0.3289   0.3315   0.3340  ...
1.0   ...   0.3485   0.3508   0.3531   0.3554   0.3577  ...
1.1   ...   0.3708   0.3729   0.3749   0.3770   0.3790  ...
...   ...   ...      ...      ...      ...      ...     ...

[Figure: the standard normal curve with the area between 0 and z shaded.]

Let's first make clear what this table means. The leftmost column and the top row combine to make the number we want, i.e., the z in the graph. The inner part of the table is the resulting probability, i.e., the shaded area in the graph. For example, suppose we want to know Prob(0 < X < 0.84). We just find the row for 0.8 and the column for 0.04, and then read the number in that cell, i.e., 0.2995; so we have Prob(0 < X < 0.84) = 0.2995.

Try the following:

1. Prob(0 < X < 1.163)

This is the same as our previous example. Because the table gives only approximate values, we have to round 1.163 to the nearest hundredth, 1.16 (i.e., we take Prob(0 < X < 1.163) ≈ Prob(0 < X < 1.16)). Then we just look the value up from the table at row 1.1 and column 0.06, which is simply 0.3770.

2. Prob(X > 0.77)

First remember that one of the very nice properties of the normal distribution is its symmetry, i.e., Prob(X > µ) = Prob(X < µ) = 0.5. This tells us that to get Prob(X > 0.77) we need just one more step. We first find that Prob(0 < X < 0.77) = 0.2794. Then Prob(X > 0.77) = 0.5 − 0.2794 = 0.2206.

3. Prob(0.77 < X < 1.16)

We should have no trouble getting Prob(0 < X < 0.77) = 0.2794 and Prob(0 < X < 1.16) = 0.3770, as we have done already. Then Prob(0.77 < X < 1.16) is simply their difference: Prob(0.77 < X < 1.16) = 0.3770 − 0.2794 = 0.0976.

4. Prob(−0.85 < X < 0)

Symmetry helps again here, telling us that Prob(−0.85 < X < 0) = Prob(0 < X < 0.85). We have no problem finding Prob(0 < X < 0.85), which is at row 0.8 and column 0.05: 0.3023.

5. Prob(X < −1.16)

We apply symmetry again to get Prob(X < −1.16) = Prob(X > 1.16) = 0.5 − Prob(0 < X < 1.16) = 0.123.

6. Prob(−0.85 < X < 0.77)

Since it is mutually exclusive for X to fall into either (−0.85, 0) or (0, 0.77), we know Prob(−0.85 < X < 0.77) = Prob(−0.85 < X < 0) + Prob(0 < X < 0.77) = 0.3023 + 0.2794 = 0.5817.

7. Sometimes we want to use the table in reverse, to find a particular z instead of the probability. Suppose Prob(−k < X < k) = 0.75, with k > 0. What is k?

Considering the symmetry of the normal distribution, 0.75 = Prob(−k < X < k) = 2 × Prob(0 < X < k), i.e., Prob(0 < X < k) = 0.375. Therefore we look for 0.375 (or the number closest to 0.375) in the table. We find at row 1.1 and column 0.05 the probability 0.3749. Therefore a reasonably precise result is k = 1.15.

Recall that all normal distributions with different µ and σ, i.e., different X ∼ N(µ, σ²), can be transformed to the standard normal through a simple linear transformation:

Z = (X − µ)/σ

Thus the standard normal distribution table helps us with much more than the exercises above. See the following further examples:

1. Given X ∼ N(1, 0.25), what is Prob(0 < X < 1.5)?

Transforming X to the standard normal Z, we have Prob(0 < X < 1.5) = Prob((0 − 1)/0.5 < Z < (1.5 − 1)/0.5) = Prob(−2 < Z < 1). We are able to calculate this using the above exercises. Note that the partial table given in this example is not enough; a full table covering −2 and 1 must be used. The answer should be close to Prob(0 < X < 1.5) = 0.8185, as different tables report different accuracy.

2. Given X ∼ N(−2, 4), and Prob(−1 < X < k) = 0.10, what is k?

Again we do the standard normal transformation first. The lower boundary −1 transforms to (−1 − (−2))/2 = 0.5; the upper boundary k transforms to (k + 2)/2. Thus

Prob(−1 < X < k) = Prob(0.5 < Z < (k + 2)/2) = Prob(0 < Z < (k + 2)/2) − Prob(0 < Z < 0.5) = 0.1

or

Prob(0 < Z < (k + 2)/2) = 0.1 + Prob(0 < Z < 0.5)
= 0.1 + 0.1915
= 0.2915

Then we look for the number closest to 0.2915 in a full standard normal table. At ordinary accuracy requirements, (k + 2)/2 = 0.81 is good enough, so k = 2 × 0.81 − 2 = −0.38.
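The table-reading exercises above can be cross-checked on a computer; a sketch with scipy (an assumed tool; in exams the printed table is still what matters). Note that Prob(0 < Z < z) = norm.cdf(z) − 0.5, and norm.ppf inverts the cdf:

```python
from scipy.stats import norm

print(norm.cdf(1.16) - 0.5)             # Prob(0 < Z < 1.16): about 0.3770
print(1 - norm.cdf(0.77))               # Prob(Z > 0.77): about 0.2206
print(norm.cdf(1.16) - norm.cdf(0.77))  # Prob(0.77 < Z < 1.16): about 0.0976
print(norm.ppf(0.5 + 0.375))            # k with Prob(-k < Z < k) = 0.75: about 1.15
# The last exercise: k with Prob(-1 < X < k) = 0.10 for X ~ N(-2, 4)
print(2 * norm.ppf(norm.cdf(0.5) + 0.10) - 2)  # about -0.38
```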

Example 5 (Normal distribution): Suppose you work in Quality Control for GE. Light bulb life has a normal distribution with µ = 2000 hours and σ = 200 hours.

1. What is the probability that a bulb will last between 2000 and 2400 hours?

P(2000 < X < 2400) = P[(2000 − µ)/σ < (X − µ)/σ < (2400 − µ)/σ]
= P[0 < (X − µ)/σ < (2400 − µ)/σ]
= P[0 < Z < 2]
= 0.4772

2. What is the probability that a bulb will last less than 1470 hours?

P(X < 1470) = P[(X − µ)/σ < (1470 − µ)/σ]
= P[Z < −2.65]
= P[Z > 2.65]
= 0.5 − P[0 < Z < 2.65]
= 0.5 − 0.4960
= 0.0040

Example 6 (Normal distribution): The daily water usage per person in New Providence, New Jersey is normally distributed with a mean of 20 gallons and a standard deviation of 5 gallons.

1. About 68 percent of those living in New Providence will use how many gallons of water?

Note that we know that P(µ − σ < X < µ + σ) = 0.6826. Thus, about 68% of the daily water usage will lie between 15 (µ − σ) and 25 (µ + σ) gallons.

2. What is the probability that a person from New Providence selected at random will use between 20 and 24 gallons per day?

P(20 < X < 24) = P[(20 − 20)/5 < (X − 20)/5 < (24 − 20)/5] = P[0 < Z < 0.8]

The area under a normal curve between a z-value of 0 and a z-value of 0.80 is 0.2881. We conclude that 28.81 percent of the residents use between 20 and 24 gallons of water per day.

3. What percent of the population uses between 18 and 26 gallons of water per day?

P(18 < X < 26) = P[(18 − 20)/5 < (X − 20)/5 < (26 − 20)/5]
= P(−0.4 < Z < 1.2)
= P(−0.4 < Z < 0) + P(0 < Z < 1.2)
= P(0 < Z < 0.4) + P(0 < Z < 1.2)
= 0.1554 + 0.3849
= 0.5403

Example 7 (Normal distribution): Professor Wong has determined that the scores in his statistics course are approximately normally distributed with a mean of 72 and a standard deviation of 5. He announces to the class that the top 15 percent of the scores will earn an A. What is the lowest score a student can earn and still receive an A?

To begin, let k be the score that separates an A from a B. If 15 percent of the students score more than k, then 35 percent must score between the mean of 72 and k.

1. Write down the relation between k and the probability: P(X > k) = 0.15 and P(X < k) = 1 − P(X > k) = 0.85.

2. Transform X into z:

P[(X − 72)/5 < (k − 72)/5] = P[Z < (k − 72)/5]

3. We look for s = (k − 72)/5 such that

P[0 < Z < s] = 0.85 − 0.5 = 0.35

From the standard normal table, we find s = 1.04:

P[0 < Z < 1.04] ≈ 0.35

4. Compute k: (k − 72)/5 = 1.04 implies k = 77.2.

Thus, those with a score of 77.2 or more earn an A.
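The cutoff can also be read from the inverse cdf; a one-line check with scipy (assumed tooling):

```python
from scipy.stats import norm

# k solves P(X < k) = 0.85 for X ~ N(72, 5^2)
k = norm.ppf(0.85, loc=72, scale=5)
print(k)  # about 77.2
```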

Example 8 (Stock returns): Let X be daily stock returns (percentage change of daily stock prices). Suppose X is distributed as normal with mean 0 and variance 4, i.e., X ∼ N(0, 4). What is the probability that the daily stock returns will lie between −2 and +2?

First make the linear transformation from X ∼ N(0, 4) to the standard normal Z:

P(−2 < X < 2) = P((−2 − 0)/√4 < Z < (2 − 0)/√4) = P(−1 < Z < 1)

and because of the symmetry of the normal distribution,

P(−2 < X < 2) = 2 P(0 < Z < 1)

Then we can easily find the probability from the standard normal table:

P(−2 < X < 2) = 2 × 0.3413 = 0.6826

The probability that the daily stock returns will lie between −2 and +2 is 0.6826.

Example 9 (Personal income): Let X be monthly personal income (in dollars). Suppose log(X) is distributed as normal with mean 9 and variance 16, i.e., log(X) ∼ N(9, 16). What is the probability that the monthly personal income of a randomly drawn person will be less than 5000? Is there a reason we assume log(X) instead of X to follow a normal distribution?

For simplicity, let us denote Y = log(X) as a new random variable; we know Y follows a normal distribution N(9, 16). Because Y = log(X) is a monotonically increasing function (i.e., the value of Y does not decrease as X increases), the interval X < 5000 corresponds to Y < log(5000) = 8.5172.

Next we can transform Y to the standard normal Z:

P(Y < 8.5172) = P(Z < (8.5172 − 9)/√16 = −0.12) = 0.5 − P(0 < Z < 0.12)

Looking up the standard normal table, the probability is 0.5 − 0.0478 = 0.4522.

    3.1 Checking for normality

The normal distribution is one of the most important distributions. But how do we know that data are distributed as normal, at least approximately? There are at least two ways to check.

First, we can check the moments. We know that a normally distributed random variable has zero skewness (due to the symmetry of the distribution) and zero excess kurtosis¹. If the sample skewness and excess kurtosis are close to zero, we will be convinced that the data are likely from a normally distributed population.

Simulation 1 (Checking the skewness and kurtosis for normality):

1. Generate 50 observations from a standard normal distribution N(0, 1), and another 50 observations from a uniform distribution U(0, 1). Compute their sample skewness and excess kurtosis²:

skewness = √n Σ_{i=1}^n (x_i − x̄)³ / [Σ_{i=1}^n (x_i − x̄)²]^{3/2}

kurtosis = n Σ_{i=1}^n (x_i − x̄)⁴ / [Σ_{i=1}^n (x_i − x̄)²]² − 3

2. Repeat the last step 1000 times. Report the skewness and excess kurtosis calculations, and the average skewness and excess kurtosis over these 1000 repetitions.

¹For instance, refer to Section 16.7, Tests for Skewness and Excess Kurtosis, p. 567 of Estimation and Inference in Econometrics by Davidson and MacKinnon [1].

²Note that a slight adjustment is needed to obtain unbiased estimators of these statistics.

Below is one group of possible results we have generated:

Distribution   Observation #   Average Skewness   Average Excess Kurtosis
U(0,1)         50              0.0114             -1.1496
U(-2,2)        50              -0.0077            -1.1375
N(0,1)         50              -0.0008            0.0031
N(-2,5)        50              0.0077             -0.0290

The table shows that the normal distribution has much closer-to-zero skewness and excess kurtosis than the uniform distribution, and this holds regardless of the means and variances.

[Refer to sim1.xls]
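A minimal Python version of Simulation 1 (numpy assumed; the seed, sample size, and number of repetitions are illustrative), using the skewness and excess kurtosis formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

def skew_kurt(x):
    """Sample skewness and excess kurtosis, per the formulas above."""
    n = len(x)
    d = x - x.mean()
    m2 = np.sum(d ** 2)
    return np.sqrt(n) * np.sum(d ** 3) / m2 ** 1.5, n * np.sum(d ** 4) / m2 ** 2 - 3

reps, n = 1000, 50
normal_stats = np.array([skew_kurt(rng.normal(0, 1, n)) for _ in range(reps)])
uniform_stats = np.array([skew_kurt(rng.uniform(0, 1, n)) for _ in range(reps)])
print("N(0,1) average (skew, excess kurt):", normal_stats.mean(axis=0))   # both near 0
print("U(0,1) average (skew, excess kurt):", uniform_stats.mean(axis=0))  # kurtosis near -1.2
```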

Alternatively, we can use a normal probability plot. Suppose we have n observations in the sample.

1. Sort them in ascending order.

2. Compute the empirical z value (i.e., (x − x̄)/σ).

3. Generate a column 0.5, 1.5, ..., [0.5 + (n − 1)]. Call this column U.

4. Generate another column p(z) = U/n.

5. Generate another column of theoretical z = NORMSINV(p(z)).

6. Plot the empirical z against the theoretical z.

If the data have a normal distribution, the plot should be a straight line. We illustrate the steps in the following example.

Example 10 (Normal probability plot):

1. Generate 1000 observations from N(0, 1).

2. Sort them in ascending order.

3. Compute the empirical z value (i.e., (x − x̄)/σ).

4. Generate a column 0.5, 1.5, ..., [0.5 + (n − 1)]. Call this column U.

5. Generate another column p(z) = U/n.

6. Generate another column of theoretical z = NORMSINV(p(z)).

7. Plot the empirical z against the theoretical z.

8. Repeat with 1000 observations drawn from U(0, 1).

The two plots are shown below.

[Figure: two normal probability plots of the z value from data against the theoretical z value, one with an underlying normal population and one with an underlying uniform population.]

As is obvious from the two plots, the points lie more or less on a straight line when the underlying population is normal, but not when the underlying population is uniformly distributed.
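A sketch of the plotting recipe above in Python (numpy, matplotlib, and scipy assumed; scipy's norm.ppf plays the role of Excel's NORMSINV):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1000
x = np.sort(rng.normal(0, 1, n))   # steps 1-2: generate and sort
emp_z = (x - x.mean()) / x.std()   # step 3: empirical z
U = 0.5 + np.arange(n)             # step 4: 0.5, 1.5, ..., n - 0.5
theo_z = norm.ppf(U / n)           # steps 5-6: theoretical z = NORMSINV(p(z))

plt.scatter(theo_z, emp_z, s=4)    # step 7: empirical against theoretical z
plt.plot([-3.5, 3.5], [-3.5, 3.5]) # 45-degree reference line
plt.xlabel("Theoretical z value")
plt.ylabel("z value from data")
plt.show()
```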


4 Bivariate distributions

Except for the use of integration, the bivariate distributions of continuous random variables are not very different from those of discrete random variables.

Theorem 3 (Characteristics of a Bivariate Continuous Distribution): Suppose X and Y are continuous random variables defined on the intervals [a, b] and [c, d] respectively. The joint probability density is denoted f(x, y).

1. The probability density function f(x, y) is non-negative and may be larger than 1.

2. The probability that X takes a value in an interval [m, n] and Y a value in [k, l] is

P(X ∈ [m, n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx

3. P(X ∈ [m, n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx is between 0 and 1.00.

4. The sum of the probabilities of the various outcomes is 1.00. That is,

P(X ∈ [−∞, ∞], Y ∈ [−∞, ∞]) = P(X ∈ [a, b], Y ∈ [c, d]) = ∫_{x=a}^{x=b} ∫_{y=c}^{y=d} f(x, y) dy dx = 1

5. Let the events defined on two non-overlapping regions, ([m1, n1], [k1, l1]) and ([m2, n2], [k2, l2]), be "X ∈ [m1, n1] and Y ∈ [k1, l1]" and "X ∈ [m2, n2] and Y ∈ [k2, l2]". These two events are mutually exclusive. That is,

P((X ∈ [m1, n1] and Y ∈ [k1, l1]) and (X ∈ [m2, n2] and Y ∈ [k2, l2])) = 0.

P((X ∈ [m1, n1] and Y ∈ [k1, l1]) or (X ∈ [m2, n2] and Y ∈ [k2, l2])) = P(X ∈ [m1, n1] and Y ∈ [k1, l1]) + P(X ∈ [m2, n2] and Y ∈ [k2, l2])

6. The marginal density function of X is

f(x) = ∫_{y=−∞}^{y=∞} f(x, y) dy = ∫_y f(x, y) dy.

Note that the marginal density function of X is used when we do not care about the values Y takes. Similarly, the marginal density function of Y is

f(y) = ∫_{x=−∞}^{x=∞} f(x, y) dx = ∫_x f(x, y) dx.

7. The conditional density function of X given Y is

f(x|y) = f(x, y)/f(y) if f(y) > 0
f(x|y) = 0 if f(y) = 0

Definition 5 (Independent bivariate uniform distribution): If a, b, c and d are numbers on the real line, the random variable (X1, X2) ∼ U(a, b, c, d), i.e., has an independent bivariate uniform distribution, if

f(x1, x2) = 1/((b − a)(d − c)) for a ≤ x1 ≤ b and c ≤ x2 ≤ d
f(x1, x2) = 0 otherwise

Definition 6 (Bivariate normal distribution): The two random variables (X, Y) ∼ N(µx, µy, σ²x, σ²y, ρ), i.e., have a bivariate normal distribution with a correlation coefficient of ρ, if the two random variables are defined jointly on the whole real line and have the following density function:

f(x, y) = (1/(2π σx σy √(1 − ρ²))) e^(−(1/2)[(ε²x + ε²y − 2ρ εx εy)/(1 − ρ²)])

εx = (x − µx)/σx,  εy = (y − µy)/σy
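To make the double integrals concrete, a small check (scipy assumed; the bounds are illustrative) that the independent bivariate uniform density of Definition 5 integrates to 1 and assigns rectangle probabilities proportional to area:

```python
from scipy import integrate

a, b, c, d = 0.0, 2.0, 0.0, 5.0            # illustrative bounds
f = lambda y, x: 1 / ((b - a) * (d - c))   # joint density; dblquad expects f(y, x)

total, _ = integrate.dblquad(f, a, b, lambda x: c, lambda x: d)
p_rect, _ = integrate.dblquad(f, 0.0, 1.0, lambda x: 1.0, lambda x: 3.0)
print(total, p_rect)  # 1.0 and (1 * 2)/(2 * 5) = 0.2
```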

5 Expectations

The expectation plays a central role in statistics and economics. The expectation (often known as the mean) reports the central location of the data. Sometimes it is also known as the long-run average value of the random variable, i.e., the average of the outcomes of many experiments.

Definition 7 (Expectation (mean)): Let X be a continuous random variable defined over [a, b], with probability density function f(x). The expectation of X is

E(X) = ∫_{x=−∞}^{x=∞} x f(x) dx = ∫_x x f(x) dx

The expectation, E(X), is often denoted by the Greek letter µ (pronounced "mu").

Thus, the expectation of a random variable is a weighted average of all the possible values of the random variable, weighted by its probability density function.

Definition 8 (Conditional Expectation): For a bivariate probability distribution, the conditional expectation or conditional mean E(X|Y = y) is computed by the formula

E(X|Y = y) = ∫_x x f(x|y) dx

The unconditional expectation or mean of X is related to the conditional mean:

E(X) = ∫_y E(X|Y = y) f(y) dy = E[E(X|Y)]

Theorem 4 (Expectation of a linear transformation of a random variable): If a and b are constants and X is a random variable, then

1. E(a) = a

2. E(bX) = bE(X)

3. E(a + bX) = a + bE(X)

Proof: We will only show the most general case, E(a + bX) = a + bE(X). Since a + bX takes the value a + bx exactly when X takes the value x, we weight a + bx by the density f(x):

E(a + bX) = ∫_x (a + bx) f(x) dx

= ∫_x a f(x) dx + ∫_x bx f(x) dx

= a ∫_x f(x) dx + b ∫_x x f(x) dx

= a + bE(X)

Definition 9 (Variance): Let X be a continuous random variable defined over [a, b], with probability density function f(x). The variance of X is

V(X) = ∫_x (x − E(X))² f(x) dx

The variance, V(X), is often denoted by σ² (pronounced "sigma squared").

Note that the variance of a random variable is the expectation of the squared deviation of the random variable from its mean. That is, if we define a transformed variable Z = (X − E(X))², then V(X) = E(Z). Thus, we should expect results for the variance of a transformed variable to be similar to those for the expectation of a transformed variable.

Example 11 (Variance of a random variable): Suppose X and Y are jointly distributed random variables with probability density function f(x, y). The variance of X is

V(X) = ∫_y ∫_x (x − E(X))² f(x, y) dx dy

Definition 10 (Conditional Variance): For a bivariate probability distribution, the conditional variance V(X|Y = y) is computed by the formula

V(X|Y = y) = ∫_x (x − E(X|Y = y))² f(x|y) dx

Theorem 5 (Variance of a linear transformation of a random variable): If a and b are constants and X is a random variable, then

1. V(a) = 0

2. V(bX) = b²V(X)

3. V(a + bX) = b²V(X)

Proof: We will only show the most general case, V(a + bX) = b²V(X).

V(a + bX) = E[(a + bX) − (a + bE(X))]²

= E[bX − bE(X)]²

= E[b(X − E(X))]²

= E[b²(X − E(X))²]

= b²E[(X − E(X))²]

= b²V(X)

Definition 11 (Covariance): The covariance between two random variables X and Y is

C(X, Y) = E[(X − E(X))(Y − E(Y))] = ∫_x ∫_y (x − E(X))(y − E(Y)) f(x, y) dy dx

Note that the covariance can be written as

C(X, Y) = E[(X − E(X))(Y − E(Y))]

= E[XY − E(X)Y − XE(Y) + E(X)E(Y)]

= E[XY] − E[E(X)Y] − E[XE(Y)] + E[E(X)E(Y)]

= E[XY] − E(X)E(Y) − E(X)E(Y) + E(X)E(Y)

= E[XY] − E(X)E(Y)

Theorem 6 (Covariance of a linear transformation of a random variable): If a and b are constants and X and Y are random variables, then

1. C(a, b) = 0

2. C(a, bX) = 0

3. C(a + bX, Y) = bC(X, Y)

Proof: We will only show the most general case, C(a + bX, Y) = bC(X, Y).

C(a + bX, Y) = E{[(a + bX) − (a + bE(X))][Y − E(Y)]}

= E{[bX − bE(X)][Y − E(Y)]}

= E{[b(X − E(X))][Y − E(Y)]}

= bE{[X − E(X)][Y − E(Y)]}

= bC(X, Y)

Theorem 7 (Variance of a sum of random variables): If a and b are constants, and X and Y are random variables, then

1. V(X + Y) = V(X) + V(Y) + 2C(X, Y)

2. V(aX + bY) = a²V(X) + b²V(Y) + 2abC(X, Y)

Proof: We will only show the most general case, V(aX + bY) = a²V(X) + b²V(Y) + 2abC(X, Y).

V(aX + bY) = E[(aX + bY) − (aE(X) + bE(Y))]²

= E[aX − aE(X) + bY − bE(Y)]²

= E[a(X − E(X)) + b(Y − E(Y))]²

= E[a²(X − E(X))² + b²(Y − E(Y))² + 2ab(X − E(X))(Y − E(Y))]

= a²E[(X − E(X))²] + b²E[(Y − E(Y))²] + 2abE[(X − E(X))(Y − E(Y))]

= a²V(X) + b²V(Y) + 2abC(X, Y)

Definition 12 (Independence): Consider two random variables X and Y with joint probability density f(x, y), marginal densities f(x) and f(y), and conditional densities f(x|y) and f(y|x).

1. They are said to be independent of each other if and only if

f(x, y) = f(x) × f(y) for all x and y.

X and Y are independent if the joint density, f(x, y), is the product of the corresponding marginal probability density functions.

2. X is said to be independent of Y if and only if

f(x|y) = f(x) for all x and y.

3. Y is said to be independent of X if and only if

f(y|x) = f(y) for all x and y.

Theorem 8 (Consequence of Independence): If X and Y are independent random variables, then

E(XY) = E(X)E(Y)

Note, however, that E(XY) = E(X)E(Y) does not always imply that the random variables X and Y are independent.
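A Monte Carlo sanity check of Theorem 7 and Theorem 8 (numpy assumed; the distributions and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.normal(0, 1, n)       # V(X) = 1
Y = rng.normal(0, 2, n)       # V(Y) = 4, independent of X
a, b = 3.0, -2.0

# Theorem 8: E(XY) = E(X)E(Y) under independence
print(np.mean(X * Y), X.mean() * Y.mean())  # both near 0
# Theorem 7: V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab C(X, Y)
lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * np.cov(X, Y)[0, 1]
print(lhs, rhs)  # both near 9 * 1 + 4 * 4 = 25
```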

6 The Normal Approximation to the Binomial

The normal distribution (a continuous distribution) yields a good approximation of the binomial distribution (a discrete distribution) for large values of n.

Recall the binomial experiment:

1. There are only two mutually exclusive outcomes (success or failure) on each trial.

2. A binomial distribution results from counting the number of successes.

3. Each trial is independent.

4. The probability of success is fixed from trial to trial, and the number of trials n is also fixed.

The normal probability distribution is generally a good approximation to the binomial probability distribution when nπ and n(1 − π) are both greater than 5, because of the Central Limit Theorem (to be discussed in the next chapter). However, because the normal distribution can take all real numbers (is continuous) but the binomial distribution can take only integer values (is discrete), we need to correct for continuity. A normal approximation to the binomial should identify the binomial event "8" with the normal interval "(7.5, 8.5)" (and similarly for other integer values).

Binomial Event    Normal Interval
0                 (−0.5, 0.5)
1                 (0.5, 1.5)
2                 (1.5, 2.5)
3                 (2.5, 3.5)
...               ...
x                 (x − 0.5, x + 0.5)
...               ...

[Figure: the binomial distribution B(p, n) = B(.3, 100) and its normal approximation N(np, np(1 − p)) = N(30, 21), plotted over the range 12 to 50.]

Example 12 (Continuity correction in normal approximation of binomial): If n = 20 and π = .25, what is the probability that X is greater than or equal to 8?

The normal approximation without the continuity correction factor yields

Z = (8 − 20 × .25)/(20 × .25 × .75)^0.5 = 1.55

so P(X ≥ 8) = P(Z ≥ 1.55) is approximately .0606 (from the standard normal table).

The continuity correction factor requires us to use 7.5 in order to include 8, since the inequality is weak and we want the region to the right:

Z = (7.5 − 20 × .25)/(20 × .25 × .75)^0.5 = 1.29

P(X ≥ 7.5) = P(Z ≥ 1.29) is .0985. The exact solution from the binomial distribution function is .1019. Thus, the normal approximation with continuity correction yields a good approximation to the binomial.
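The three numbers in this example can be reproduced with scipy (assumed tooling):

```python
from scipy.stats import binom, norm

n, p = 20, 0.25
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5

exact = 1 - binom.cdf(7, n, p)                       # P(X >= 8), exact: about .1019
no_correction = 1 - norm.cdf((8 - mu) / sigma)       # about .0606
with_correction = 1 - norm.cdf((7.5 - mu) / sigma)   # about .0985
print(exact, no_correction, with_correction)
```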

Example 13 (Continuity correction in normal approximation of binomial): A recent study by a marketing research firm showed that 15% of American households owned a video camera. For a sample of 200 homes, how many of the homes would you expect to have video cameras, and how likely is it that fewer than 40 of them do?

1. Compute the mean: µ = nπ = 200 × .15 = 30. We expect 30 of the 200 homes to have video cameras.

2. Compute the variance: σ² = nπ(1 − π) = 200 × .15 × (1 − .15) = 25.5. The standard deviation is σ = √25.5 = 5.0498.

3. "Less than 40" means "less than or equal to 39". We use the correction factor, so X is 39.5. Hence, P(X < 39.5) = P[(X − 30)/5.0498 < (39.5 − 30)/5.0498] = P[Z < 1.88] = P[Z < 0] + P[0 < Z < 1.88] = .5 + .4699 = .9699.

Thus, the likelihood that less than 40 of the 200 homes have a video camera is about 97%.

7 The exponential distribution

The exponential distribution is often used to model the length of time between the occurrences of two events, or between two occurrences of the same event (the time between arrivals).

    Example 14 (Exponential distribution):


1. Time taken for your instructor to respond to your email.

2. Time between the births of two babies.

3. Time taken to find a new job since layoff, the so-called unemployment spell.

4. Time to complete a wage bargaining between a company and a labor union.

5. Time to complete the accession to the WTO.

6. Time between two major floods in Sichuan, China.

7. Time taken for the police to solve a crime case.

8. Time taken to obtain a job promotion.

Definition 13 (Exponential distribution): The exponential random variable T (T > 0) has the probability density function

f(t) = λe^(−λt) for t > 0

where λ is the mean number of occurrences per unit time, t is the length of time until the next occurrence, and e = 2.71828....

The cumulative distribution function (the probability that an arrival time is less than some specified time t) is

F(t) = Prob(T < t) = 1 − e^(−λt)

The mean and variance of an exponential random variable are

E(T) = 1/λ,  Var(T) = 1/λ²

Note that the exponential random variable requires only one parameter, the rate λ (lambda); its mean is 1/λ. What are the other distributions that are completely characterized by one parameter?

Example 15 (Exponential distribution): Customers arrive at the service counter at the rate of 15 per hour. What is the probability that the arrival time between consecutive customers is less than three minutes?

Let T be the arrival time between consecutive customers. The mean number of arrivals per hour is 15, so λ = 15. Three minutes is .05 hours. Hence, we have

P(T < .05) = 1 − e^(−λt) = 1 − e^(−(15)(.05)) = 0.5276

So there is a 52.76% probability that the arrival time between successive customers is less than three minutes.

Example 16 (Exponential distribution): On average, it takes about seven years to accede to the GATT/WTO. What is the probability that a country takes more than 15 years to accede to the GATT/WTO?

Let T be the time taken. Since E(T) = 1/λ = 7, we have λ = 1/7. Hence, we have

P(T < 15) = 1 − e^{−λt} = 1 − e^{−(1/7)(15)} = 0.883
P(T ≥ 15) = 1 − P(T < 15) = e^{−(1/7)(15)} = 0.117.

Thus, there is an 11.7% probability that it takes more than 15 years to accede to the GATT/WTO.³
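Both probabilities follow directly from the cdf F(t) = 1 − e^{−λt}; a quick check in Python (a sketch using only the standard math module):

    import math

    # Example 15: lambda = 15 arrivals per hour, t = .05 hours (three minutes)
    print(1 - math.exp(-15 * 0.05))  # 0.5276...

    # Example 16: E(T) = 7 years implies lambda = 1/7; P(T >= 15) = e^{-(1/7)(15)}
    print(math.exp(-15 / 7))         # 0.1173...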


³It took China 15 years to accede to GATT/WTO.


Mathematical appendix (A brief introduction to integration)

Integration is a useful tool to compute the area under a curve and above zero. Suppose we have a curve defined by the function f(x), and we want to compute the area under f(x) and above 0 between a and b. Let's consider several cases.

1. Let's start from the simplest case, when f(x) is a constant, i.e., f(x) = k. In this case, the area is a rectangle. It is easy to compute the area as (b − a) × k.

2. Suppose f(x) = k1 for x ∈ [a, c1] and f(x) = k2 for x ∈ [c1, b]. In this case, the area is a sum of two rectangles. It is still easy to compute the area as (c1 − a) × k1 + (b − c1) × k2.

3. Suppose f(x) = k1 for x ∈ [a, c1], f(x) = k2 for x ∈ [c1, c2] and f(x) = k3 for x ∈ [c2, b]. In this case, the area is a sum of three rectangles. It is still easy to compute the area as (c1 − a) × k1 + (c2 − c1) × k2 + (b − c2) × k3.

The three examples illustrate how one can compute the area when it is a combination of rectangles. For a general f(x), we can approximate the area by a sum of rectangles. Let's define

g(x) = k1 for x ∈ [a, a + dx]
g(x) = k2 for x ∈ [a + dx, a + 2dx]
g(x) = k3 for x ∈ [a + 2dx, a + 3dx]
...

or, in general,

g(x) = km for x ∈ [a + (m − 1)dx, a + m·dx].

If g(x) approximates f(x) well, the approximated area is a sum of rectangles:

k1 × dx + k2 × dx + k3 × dx + ...

What are the values of k1, k2, ...? One possibility is to take km = f(a + (m − 1)dx), so that k1 = f(a). That is,

g(x) = f(a + (m − 1)dx) for x ∈ [a + (m − 1)dx, a + m·dx].

When the width of each rectangle is small, it does not matter whether we take f(a + (m − 1)dx) or f(a + m·dx) or any other point in [a + (m − 1)dx, a + m·dx].

Thus, the area may be approximated by

f(a) × dx + f(a + dx) × dx + f(a + 2dx) × dx + ...

The approximation gets better as dx is made closer to zero, and in the limit the area becomes a sum of infinitely many very small rectangles. We write the area as

∫_a^b f(x) dx.

Mathematicians have developed neat ways to compute this area exactly. All we need is to borrow their formulas.
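The rectangle approximation is straightforward to code. Here is a minimal sketch (plain Python; the function name and test cases are my own) that sums left-endpoint rectangles and checks the result against areas we can compute by hand:

    import math

    def riemann(f, a, b, n=100_000):
        # Approximate the area under f over [a, b] with n left-endpoint rectangles.
        dx = (b - a) / n
        return sum(f(a + m * dx) for m in range(n)) * dx

    # Constant case f(x) = k: the area should be k(b - a) = 6
    print(riemann(lambda x: 3.0, 0, 2))      # about 6.0

    # Linear case f(x) = kx: the area should be k(b**2/2 - a**2/2) = 6
    print(riemann(lambda x: 3.0 * x, 0, 2))  # about 6.0

    # Standard normal density over [-1.96, 1.96]: the area should be about .95
    phi = lambda x: math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
    print(riemann(phi, -1.96, 1.96))         # about 0.95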

Theorem 9 (Some integration formulas):

1. Let f(x) = k, a constant (such as the density of a uniform distribution). The area under the curve f(x) in the interval [a, b] is

∫_a^b f(x) dx = k ∫_a^b dx = k(x|_{x=b} − x|_{x=a}) = k(b − a),

where g(x)|_{x=a} = g(a).

2. Let f(x) = kx, a linear function (such as the product of the random variable and the density of a uniform distribution). The area under the curve f(x) in the interval [a, b] is

∫_a^b f(x) dx = k ∫_a^b x dx = k(x²/2|_{x=b} − x²/2|_{x=a}) = k(b²/2 − a²/2),

where g(x)|_{x=a} = g(a).

Problem sets

We have tried to include some of the most important examples in the text. To get a good understanding of the concepts, it is most useful to re-do the examples and simulations in the text. Work on the following problems only if you need extra practice or if your instructor assigns them. Of course, the more you work on these problems, the more you learn.

Challenge 1 (mean and variance of a uniform distribution): Show that the mean and variance of a uniform random variable X on the interval [c, d] are

E(X) = (c + d)/2 and V(X) = (d − c)²/12.

Challenge 2 (memorylessness of the exponential distribution): Show that the exponential distribution is memoryless, i.e., P(T > s + t | T > s) = P(T > t) for all s, t ≥ 0.

Challenge 3 (mean and variance of a univariate uniform distribution): Let X be uniformly distributed on the interval [a, b]. Find E(X) and Var(X).

Challenge 4 (mean and variance of a bivariate uniform distribution): Let (X, Y) be jointly uniformly distributed on the rectangle [a, b] × [c, d]. Find E(X) and Var(X).

Solutions to problem set

1. It is straightforward to guess that the mean of a uniform distribution is simply (c + d)/2, because the density is the same at every point in the interval (c, d), i.e., f(x) ≡ 1/(d − c) for x ∈ (c, d). However, we can still prove it mathematically:

E(X) = ∫_c^d x f(x) dx
     = ∫_c^d [x/(d − c)] dx
     = [1/(d − c)] ∫_c^d x dx
     = [1/(d − c)] × (x²/2)|_c^d
     = [1/(d − c)] × (d − c)(d + c)/2
     = (c + d)/2

and the variance:

V(X) = ∫_c^d (x − µ)² f(x) dx
     = [1/(d − c)] ∫_c^d (x² − 2µx + µ²) dx
     = [1/(d − c)] × (x³/3 − µx² + µ²x)|_c^d
     = (d² + cd + c²)/3 − µ(c + d) + µ²
     = (d² + cd + c²)/3 − (c + d)²/2 + (c + d)²/4
     = [(4d² + 4cd + 4c²) − (6c² + 12cd + 6d²) + (3c² + 6cd + 3d²)]/12
     = (d − c)²/12

2. “Memorylessness” is a property of the exponential distribution. For example, suppose that at a specific crossroad, traffic accidents occur at the rate of λ per hour. (a) What is the probability that no accident happens in the next t hours? (b) Given that no accident has happened in the last s hours, what is the probability that no accident happens in the next t hours?

(a) P(T > t) = 1 − (1 − e^{−λt}) = e^{−λt};

(b) P(T > s + t | T > s) = [1 − (1 − e^{−λ(s+t)})]/[1 − (1 − e^{−λs})] = e^{−λ(s+t)}/e^{−λs} = e^{−λt}.

We find that P(T > t) = P(T > s + t | T > s): how long we have already waited without an accident does not affect the probability of waiting at least another t hours. In other words, the distribution is memoryless. (A similar property holds for the geometric distribution.)
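Memorylessness can also be seen by simulation: conditioning on T > s and asking whether T exceeds s + t gives, up to sampling error, the same probability as P(T > t). A sketch (assuming NumPy; λ, s and t are arbitrary choices):

    import numpy as np

    lam, s, t = 0.5, 2.0, 3.0
    rng = np.random.default_rng(2)
    T = rng.exponential(scale=1 / lam, size=2_000_000)

    p_uncond = (T > t).mean()           # estimates P(T > t)
    p_cond = (T[T > s] > s + t).mean()  # estimates P(T > s + t | T > s)

    print(p_uncond, p_cond, np.exp(-lam * t))  # all close to e^{-lam*t} = 0.2231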

3. Refer to Solution 1, with c and d replaced by a and b.

4. We may reasonably guess that the joint density of the bivariate uniform distribution is constant at 1/[(b − a)(d − c)], i.e., f_{XY}(x, y) ≡ 1/[(b − a)(d − c)] for x ∈ (a, b), y ∈ (c, d). Given Y = y, X is uniform on (a, b) with conditional density f_X(x|y) = 1/(b − a), and the marginal density of Y is f_Y(y) = 1/(d − c). Then we can work out the expectation:

E(X) = E[E(X|Y)]
     = E[∫_a^b x f_X(x|y) dx | Y]
     = E[(a + b)/2 | Y]
     = ∫_c^d [(a + b)/2] × [1/(d − c)] dy
     = (a + b)/2

and the variance is

Var(X) = (b − a)²/12,

by the same calculation as in Solution 1, since the marginal distribution of X is uniform on (a, b).
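A simulation confirms that the marginal behaviour of X is just that of a univariate uniform on [a, b] (a sketch assuming NumPy; the rectangle corners are arbitrary):

    import numpy as np

    a, b, c, d = 0.0, 4.0, 1.0, 3.0
    rng = np.random.default_rng(3)
    # Joint uniformity on the rectangle means X and Y are independent uniforms
    x = rng.uniform(a, b, size=1_000_000)
    y = rng.uniform(c, d, size=1_000_000)

    print(x.mean(), (a + b) / 2)     # both about 2.0
    print(x.var(), (b - a)**2 / 12)  # both about 1.33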
