CME 308 Spring 2014 Notes

    George Papanicolaou

    March 31, 2015

    Contents

1 Sums of independent identically distributed random variables
  1.1 The weak law of large numbers
  1.2 The strong law of large numbers
  1.3 Weak convergence
  1.4 The central limit theorem (CLT)
  1.5 Characteristic functions and Fourier transforms
  1.6 On convergence of random variables
  1.7 Confidence intervals for the empirical mean
  1.8 Large deviations

2 Maximum likelihood estimation (MLE)
  2.1 Large sample properties of MLE
  2.2 Cramer-Rao lower bound and asymptotic efficiency of the MLE
  2.3 Asymptotic normality of posterior densities

3 Basic Monte Carlo methods
  3.1 Properties of basic Monte Carlo
  3.2 Importance sampling
  3.3 Acceptance-rejection
  3.4 Glivenko-Cantelli theorem and the Kolmogorov-Smirnov test
  3.5 Density kernel estimation
  3.6 Bootstrap

4 Markov Chains
  4.1 Exit times
  4.2 Transience and recurrence
  4.3 Invariant probabilities
  4.4 The ergodic theorem
  4.5 The central limit theorem for Markov chains
  4.6 Expected number of visits to a state and the invariant probabilities
  4.7 Return times and the ergodic theorem
  4.8 MLE for Markov chains
  4.9 Bayesian filtering

5 Random walks and connections with differential equations
  5.1 Transience and recurrence
  5.2 Connections with classical potential theory
  5.3 Stochastic control

1 Sums of independent identically distributed random variables

We will be dealing with sequences of independent identically distributed random variables $X_1, X_2, \ldots, X_n$ where $P\{X_j \le x\} = F(x)$ is their common distribution function. We will also use the notation $F_X(x)$ when we want to identify the random variable whose distribution is $F(x)$. Independence means that the joint distribution of $X_1, X_2, \ldots, X_n$ is equal to the product of the marginals:

$$P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\} = F(x_1, x_2, \ldots, x_n) = \prod_{j=1}^n F(x_j)$$

This implies that the expectation of the product of any bounded functions of the random variables equals the product of the expectations: $E\{g_1(X_1)g_2(X_2)\cdots g_n(X_n)\} = E\{g_1(X_1)\}E\{g_2(X_2)\}\cdots E\{g_n(X_n)\}$.

    We will be interested in the behavior of the sample or empirical mean

$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

which is the simplest and most widely studied function of the random variables. We expect that it should be closely related to the theoretical mean $\mu = E\{X_j\}$. We denote by $\sigma^2$ the variance of $X_j$:
$$\sigma^2 = \mathrm{var}(X_j) = E\{(X_j - \mu)^2\} = \int (x - \mu)^2\, dF(x)$$

    1.1 The weak law of large numbers

The simplest large sample relation is the weak law of large numbers (WLLN), which says that $\bar{X}_n \to \mu$ in probability as $n \to \infty$. This is a consequence of the Chebyshev inequality (CI)

$$P\{|\bar{X}_n - \mu| > \epsilon\} \le \frac{E\{(\bar{X}_n - \mu)^2\}}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$

as $n \to \infty$, for all $\epsilon > 0$. We have used here the fact that $\bar{X}_n$ converges to $\mu$ also in mean square, $E\{(\bar{X}_n - \mu)^2\} \to 0$.

We note that neither independence of the $X_j$ nor their finite variance is needed for the validity of the WLLN. It is enough that the $X_j$ be sufficiently uncorrelated, with finite variance, or that they be independent with finite $E\{|X_j|\} < \infty$ but possibly infinite variance.

    The Chebyshev inequality for any random variable X has the form

$$P\{|X - \mu| > \epsilon\} \le \frac{E\{|X - \mu|^p\}}{\epsilon^p}, \quad \epsilon > 0, \ 0 < p < \infty$$

1.2 The strong law of large numbers

The weak law of large numbers says that
$$\lim_{n\to\infty} P\{|\bar{X}_n - \mu| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0$$

    We would like to know when it is true that

$$P\Big\{\lim_{n\to\infty}\bar{X}_n = \mu\Big\} = 1$$

which is the strong law of large numbers (SLLN). This is more involved because we need to calculate the probability of an event that depends on infinitely many random variables. It is necessary therefore that the infinite sequence of independent identically distributed (iid) random variables $X_1, X_2, \ldots, X_n, \ldots$ be defined on the same probability space $\Omega$, that is, $X_j = X_j(\omega)$ are real-valued functions on $\Omega$. We may think of $\Omega$ as a set of elementary events $\omega$ on which a probability law $P$ is defined; more precisely, it is defined on subsets of $\Omega$. We will not need a measure-theoretic foundation for probability here, except in special cases in which we will deal with the issues that arise without a general theory.

Let $A_n = \{\omega \mid |\bar{X}_n - \mu| \le \epsilon\}$ for some $\epsilon > 0$. If we can show that, given $\epsilon > 0$, the set of all $\omega$ such that $|\bar{X}_n - \mu| \le \epsilon$ for all $n$ larger than some $N(\omega)$ has probability one, then we have proved the SLLN. But the event in question can be written as
$$A = \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty A_n$$
With $A^c$, the complement of $A$, we now have

$$P\{A^c\} = P\Big\{\bigcap_{N=1}^\infty \bigcup_{n=N}^\infty A_n^c\Big\} = \lim_{N\to\infty} P\Big\{\bigcup_{n=N}^\infty A_n^c\Big\} \le \lim_{N\to\infty} \sum_{n=N}^\infty P\{A_n^c\}$$

We have used here some general properties of probability laws that we will not discuss in detail but which are rather intuitive. One is that the probability of the intersection of a sequence of decreasing sets is equal to the limit of the probabilities of these sets, and the other is that the probability of a union of sets is less than or equal to the sum of the probabilities. It is equal when the events are disjoint, that is, when their pairwise intersections are empty. Suppose now that we can show that for any $\epsilon > 0$ fixed

$$P\{A_n^c\} = P\{|\bar{X}_n - \mu| > \epsilon\} \le \frac{\text{constant}}{n^2}$$

Then we have that $P\{A^c\} = 0$, since
$$\lim_{N\to\infty}\sum_{n=N}^\infty P\{A_n^c\} = 0,$$
and hence $P\{A\} = 1$, which proves the SLLN. If the iid random variables $\{X_j\}$ have finite fourth-order moments, $E\{|X_j|^4\} < \infty$, or equivalently $E\{(X_j - \mu)^4\} < \infty$, then a direct computation shows that $E\{(\bar{X}_n - \mu)^4\} \le \text{constant}/n^2$, and the Chebyshev inequality with $p = 4$ gives exactly the bound above.

1.3 Weak convergence

We say that random variables $X_n$ with distribution functions $F_n(x)$ converge weakly (in law, in distribution) to $X$ with distribution function $F(x)$ if $E\{g(X_n)\} \to E\{g(X)\}$ for all bounded continuous functions $g$. Writing $E\{g(X_n)\} = \int g(x)\, dF_n(x)$ shows that only distribution functions are involved. The continuity of $g$ is essential as it allows for a coarse graining of the features of the random variables involved, and leads as a consequence to results that can have universal behavior. The central limit theorem is the best and perhaps simplest example of all this, as we will see in the next section.

A basic theorem in weak convergence is the equivalence of three different forms of it. The following three statements are equivalent:

1. $\lim_{n\to\infty} E\{g(X_n)\} = E\{g(X)\}$, for all bounded and continuous $g(x)$
2. $\lim_{n\to\infty} F_n(x) = F(x)$, at all continuity points $x$ of $F(x)$
3. $\lim_{n\to\infty} E\{e^{i\xi X_n}\} = E\{e^{i\xi X}\}$, pointwise for all $\xi \in \mathbb{R}$

The last form of weak convergence involves the characteristic functions of the random variables, which are the Fourier transforms of the distribution functions. Implicit in this theorem is the statement that knowledge of the expectations $E\{g(X)\}$ for all bounded and continuous $g$ determines $F$ uniquely, and knowledge of the characteristic function $E\{e^{i\xi X}\}$ also determines $F$ uniquely. The latter is not so surprising, as it amounts to the uniqueness of the inverse Fourier transform, which also provides a formula for recovering $F$ from its Fourier transform.

As an example of how the above equivalence is shown, consider how the second statement follows from the first. The indicator function $I_{(-\infty,x]}(y)$ is discontinuous but can be bounded above and below by continuous functions, $\underline{g}(y) \le I_{(-\infty,x]}(y) \le \bar{g}(y)$, where $\bar{g}(y)$ equals one for $y \le x$, is linear between $x$ and $x + \delta$ going from 1 to zero, and is zero for $y > x + \delta$. The lower function is defined similarly, being equal to 1 for $y \le x - \delta$, linear between $x - \delta$ and $x$, and zero for $y > x$. From the first statement we have that
$$E\{\underline{g}(X)\} = \lim_{n\to\infty} E\{\underline{g}(X_n)\} \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le \lim_{n\to\infty} E\{\bar{g}(X_n)\} = E\{\bar{g}(X)\}$$
From the definition of $\underline{g}$ and $\bar{g}$ it follows that
$$F(x - \delta) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \delta)$$
If $x$ is a continuity point of $F$, then letting $\delta \to 0$ gives the second statement.

    1.4 The central limit theorem (CLT)

The CLT states that if $X_1, X_2, \ldots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$, then $Z_n = \sqrt{n}(\bar{X}_n - \mu)$ converges weakly to a Gaussian random variable with mean zero and variance $\sigma^2$.

The proof uses characteristic functions. We also assume that we have finite third moments, $E\{|X|^3\} < \infty$, which is not necessary except to simplify the proof. We have that
$$E\{e^{i\xi Z_n}\} = E\Big\{e^{\frac{i\xi}{\sqrt{n}}\sum_{j=1}^n (X_j - \mu)}\Big\} = E\Big\{\prod_{j=1}^n e^{\frac{i\xi}{\sqrt{n}}(X_j - \mu)}\Big\} = \prod_{j=1}^n E\Big\{e^{\frac{i\xi}{\sqrt{n}}(X_j - \mu)}\Big\} = \Big(E\Big\{e^{\frac{i\xi}{\sqrt{n}}(X - \mu)}\Big\}\Big)^n$$
Independence is used in passing from the expectation of the product to the product of the expectations. We now use the Taylor expansion with remainder for the exponential,

$$\Big|e^{ix} - 1 - ix - \frac{1}{2}(ix)^2\Big| \le \frac{1}{6}|x|^3,$$

and note that the derivatives of the characteristic function at zero exist and are equal to moments if these are finite. This gives
$$\Big(E\Big\{e^{\frac{i\xi}{\sqrt{n}}(X - \mu)}\Big\}\Big)^n = \Big(1 - \frac{\xi^2\sigma^2}{2n} + O\Big(\frac{1}{n^{3/2}}\Big)\Big)^n \to e^{-\frac{\xi^2\sigma^2}{2}}$$

as $n \to \infty$. But $e^{-\frac{\xi^2\sigma^2}{2}}$ is the characteristic function of a Gaussian random variable with mean zero and variance $\sigma^2$, which is the well-known identity
$$e^{-\frac{\xi^2\sigma^2}{2}} = \int_{-\infty}^\infty e^{i\xi x}\,\frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\, dx \qquad (1)$$

A proof of the CLT as above, but with only second moments for the law of the $\{X_j\}$, can be given by noting that if $\phi(\xi) = E\{e^{i\xi(X-\mu)}\}$ then $\phi(0) = 1$, $\phi'(0) = 0$, $\phi''(0) = -\sigma^2$, and we have by Taylor's theorem with remainder
$$\Big(\phi\Big(\frac{\xi}{\sqrt{n}}\Big)\Big)^n = \Big(1 + \phi''(\xi_n)\frac{\xi^2}{2n}\Big)^n \to e^{-\frac{\xi^2\sigma^2}{2}}$$
where $0 < \xi_n < \xi/\sqrt{n}$.

1.6 On convergence of random variables

We say that $\{X_n\}$ converges in probability to $X$, $X_n \xrightarrow{P} X$, if for all $\epsilon > 0$, $P\{|X_n - X| > \epsilon\} \to 0$ as $n \to \infty$.

We say that $\{X_n\}$ converges in mean square to $X$, $X_n \xrightarrow{MSQ} X$, if $E\{(X_n - X)^2\} \to 0$ as $n \to \infty$. Clearly convergence in mean square implies convergence in probability; this follows from the Chebyshev inequality. However, the converse is not true, simply because random variables can converge in probability while not even having finite second moments (variances), so mean square convergence does not make sense for them.

Convergence with probability one, as explained above, means that all random variables are defined on the same probability space and that $P\{\lim_{n\to\infty} X_n = X\} = 1$. This is a composite of three statements: first, the limit exists; second, the limit is equal to $X$; and third, this occurs with probability one. Convergence with probability one implies convergence in probability (one needs the bounded convergence theorem to show this), but the converse is false. This is intuitively clear but somewhat technical to justify, as are most statements that hold with probability one. The connection with mean square convergence is this: convergence with probability one together with boundedness of the variances of the $X_n$ uniformly in $n$ does not imply mean square convergence, but it does imply that the limit random variable has finite variance. And mean square convergence does not imply convergence with probability one.

Regarding weak convergence, it is true that convergence in probability implies it. The converse is not true, except when the weak limit of the random variables (distributions) $X_n$ is deterministic, that is, when $X$ takes only one value and its distribution has a jump of height one at that one point. In that case weak convergence implies convergence in probability. Slutsky's theorem is a generalization of this statement that we use often in estimation theory and elsewhere. Its statement is that if $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{P} c$, where $c$ is a (deterministic) constant, then $X_n + Y_n \xrightarrow{L} X + c$, $X_n Y_n \xrightarrow{L} Xc$, and if $c > 0$ then $X_n/Y_n \xrightarrow{L} X/c$.

As an example of how these statements are shown, consider how convergence in probability implies weak convergence. For weak convergence it is sufficient that the test function $g$ be bounded and uniformly continuous (bounded and continuous is enough), so that given $\epsilon > 0$ there is a $\delta > 0$ such that $|g(x) - g(y)| < \epsilon$ if $|x - y| < \delta$. We then have that
$$|E\{g(X_n)\} - E\{g(X)\}| \le E\{|g(X_n) - g(X)|\,I_{(|X_n - X| \le \delta)}\} + E\{|g(X_n) - g(X)|\,I_{(|X_n - X| > \delta)}\} \le \epsilon\,(1 - P\{|X_n - X| > \delta\}) + 2\max_x|g(x)|\,P\{|X_n - X| > \delta\}$$
This implies that
$$\limsup_{n\to\infty}|E\{g(X_n)\} - E\{g(X)\}| \le \epsilon$$
and since $\epsilon$ is arbitrary we have the statement of weak convergence.

    1.7 Confidence intervals for the empirical mean

An important application of the central limit theorem, along with Slutsky's theorem, is in getting confidence intervals in parameter estimation. Suppose that we use the empirical mean $\bar{X}_n$ to estimate the true mean $\mu$, assumed unknown. First we introduce the sample variance
$$s_n^2 = \frac{1}{n-1}\sum_{j=1}^n (X_j - \bar{X}_n)^2$$

It is easily seen that $E\{s_n^2\} = \sigma^2$, the theoretical variance, and assuming finite third moments we not only have that $\bar{X}_n \xrightarrow{P} \mu$ but also $s_n^2 \xrightarrow{P} \sigma^2$, which implies that $s_n \xrightarrow{P} \sigma$. Now the central limit theorem says that $\frac{\sqrt{n}}{\sigma}(\bar{X}_n - \mu) \xrightarrow{L} N(0, 1)$. By Slutsky's theorem we also have that $\frac{\sqrt{n}}{s_n}(\bar{X}_n - \mu) \xrightarrow{L} N(0, 1)$. Given a confidence level $\alpha$, for example $\alpha = .05$, we find $\xi_\alpha$ such that $P\{|Z| > \xi_\alpha\} = \alpha$, where $Z$ is a Gaussian random variable with mean zero and variance one. It then follows that for $n$ large enough we have that
$$\frac{\sqrt{n}}{s_n}|\bar{X}_n - \mu| > \xi_\alpha$$
with probability $\alpha$, and hence
$$\bar{X}_n - \frac{s_n\xi_\alpha}{\sqrt{n}} \le \mu \le \bar{X}_n + \frac{s_n\xi_\alpha}{\sqrt{n}} \qquad (2)$$
with probability $1 - \alpha$. This is a confidence interval for the unknown mean $\mu$ in terms of a sample of size $n$, which we assume is large enough so that the central limit theorem can be used.
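To make the interval (2) concrete, here is a minimal Python sketch (an added illustration; the sample size and the underlying exponential distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.exponential(scale=2.0, size=n)   # simulated sample with true mean mu = 2.0

xbar = x.mean()                          # empirical mean
s_n = x.std(ddof=1)                      # sample standard deviation s_n
xi = 1.96                                # P{|Z| > 1.96} ~ 0.05 for Z ~ N(0, 1)

half = xi * s_n / np.sqrt(n)
print(f"95% confidence interval for mu: [{xbar - half:.4f}, {xbar + half:.4f}]")
```

In repeated runs the interval covers the true mean $\mu = 2$ about 95% of the time, as the theory predicts.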

    1.8 Large deviations

The weak law of large numbers states that $\bar{X}_n \xrightarrow{P} \mu$ and the central limit theorem that $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{L} N(0, \sigma^2)$. Let $\gamma > \mu$ and note that we must have
$$P\{\bar{X}_n > \gamma\} \to 0.$$

The question posed in large deviations is to estimate the rate at which this probability tends to zero. Assume that the underlying random variable $X$ has a density $f(x)$ and a moment generating function. We denote the logarithm of the moment generating function by
$$L(\theta) = \log E\{e^{\theta X}\}, \quad \theta \in \mathbb{R},$$
which we assume to be finite and differentiable in $\theta$. We have that $L(0) = 0$, $L'(0) = \mu$ and $L''(0) = \sigma^2$. We also assume that $L(\theta)$, which is always convex, is in fact strictly convex. We define the conjugate convex function of $L$ by

$$H(\gamma) = \sup_\theta(\theta\gamma - L(\theta)), \quad \gamma \in \mathbb{R},$$
and note that $L$ can be recovered from $H$ by applying the Legendre transform again:
$$L(\theta) = \sup_\gamma(\gamma\theta - H(\gamma)), \quad \theta \in \mathbb{R}$$

We will show that
$$\lim_{n\to\infty}\frac{1}{n}\log P\{\bar{X}_n > \gamma\} = -H(\gamma),$$
which means that in the logarithmic sense we have
$$P\{\bar{X}_n > \gamma\} \approx e^{-nH(\gamma)}$$

For the proof we get first the upper bound
$$\limsup_{n\to\infty}\frac{1}{n}\log P\{\bar{X}_n > \gamma\} \le -H(\gamma), \qquad (3)$$
and then the lower bound
$$\liminf_{n\to\infty}\frac{1}{n}\log P\{\bar{X}_n > \gamma\} \ge -H(\gamma), \qquad (4)$$
from which the result follows. For the upper bound we note that

$$P\{\bar{X}_n > \gamma\} = P\{e^{n\theta\bar{X}_n} > e^{n\theta\gamma}\} \le e^{-n\theta\gamma}\big(E\{e^{\theta X}\}\big)^n = e^{-n(\theta\gamma - L(\theta))}, \quad \theta \ge 0,$$
which, after taking logs, dividing by $n$, taking the upper limit, and then optimizing the right side over $\theta$, gives the upper bound (3). If $\gamma > \mu$ then $H(\gamma) = \sup_{\theta > 0}(\theta\gamma - L(\theta))$, so the restriction to $\theta \ge 0$ in the Markov inequality above loses nothing.

To get the lower bound we first introduce a transformation of the law of $X$ that plays an essential role here and in some other contexts, such as in importance sampling. For any $\theta \in \mathbb{R}$ let
$$f_\theta(x) = \frac{e^{\theta x}f(x)}{\int e^{\theta x}f(x)\, dx}$$

and note that we have the identity
$$f(x_1)f(x_2)\cdots f(x_n) = e^{-\theta(x_1 + x_2 + \cdots + x_n) + nL(\theta)}\,f_\theta(x_1)f_\theta(x_2)\cdots f_\theta(x_n)$$
This implies that for any real $\theta$ we have
$$P\{\bar{X}_n > \gamma\} = E\{I_{(\bar{X}_n > \gamma)}\} = E_\theta\{e^{-n(\theta\bar{X}_n - L(\theta))}I_{(\bar{X}_n > \gamma)}\}$$
where $E_\theta$ denotes expectation relative to the law with density $f_\theta$. Let $\epsilon$ be any small positive number and note that $\{\bar{X}_n > \gamma\} = \{\bar{X}_n - (\gamma + \epsilon) > -\epsilon\} \supset \{|\bar{X}_n - (\gamma + \epsilon)| < \epsilon\}$. If we choose $\theta = \theta(\gamma + \epsilon)$ so that $L'(\theta) = \gamma + \epsilon$, which means that $E_\theta\{X\} = \gamma + \epsilon$, the weak law of large numbers gives
$$P_\theta\{|\bar{X}_n - (\gamma + \epsilon)| < \epsilon\} \to 1, \quad \text{as } n \to \infty$$
We now note that
$$E_\theta\{e^{-n(\theta\bar{X}_n - L(\theta))}I_{(\bar{X}_n > \gamma)}\} \ge e^{-n(\theta(\gamma + 2\epsilon) - L(\theta))}\,P_\theta\{|\bar{X}_n - (\gamma + \epsilon)| < \epsilon\}$$

Taking logarithms, dividing by $n$, and taking the lower limit as $n \to \infty$ gives
$$\liminf_{n\to\infty}\frac{1}{n}\log P\{\bar{X}_n > \gamma\} \ge -[(\gamma + 2\epsilon)\theta(\gamma + \epsilon) - L(\theta(\gamma + \epsilon))]$$
Since $\epsilon$ is arbitrarily small we have
$$\liminf_{n\to\infty}\frac{1}{n}\log P\{\bar{X}_n > \gamma\} \ge -[\gamma\theta(\gamma) - L(\theta(\gamma))] = -H(\gamma)$$
which proves the lower bound and hence the large deviations theorem.
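As a quick illustration of the rate function (an added example; the Gaussian case is not worked out above), take $X \sim N(\mu, \sigma^2)$, where everything is explicit:
$$L(\theta) = \mu\theta + \frac{1}{2}\sigma^2\theta^2, \qquad H(\gamma) = \sup_\theta(\theta\gamma - L(\theta)) = \frac{(\gamma - \mu)^2}{2\sigma^2},$$
with the supremum attained at $\theta(\gamma) = (\gamma - \mu)/\sigma^2$, so that $P\{\bar{X}_n > \gamma\} \approx e^{-n(\gamma - \mu)^2/(2\sigma^2)}$, consistent with the fact that here $\bar{X}_n$ is exactly $N(\mu, \sigma^2/n)$.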

    2 Maximum likelihood estimation (MLE)

Let $X_1, X_2, \ldots, X_n$ be independent samples of a random variable whose distribution $F(x|\theta)$ depends on a real parameter $\theta$ with values in a bounded interval. How can we estimate the true value $\bar{\theta}$ of this parameter from the observed sequence? In the maximum likelihood method we form the likelihood function
$$L_n(\theta) = L_n(\theta; X_1, X_2, \ldots, X_n) = \prod_{j=1}^n f(X_j|\theta), \qquad (5)$$

where $f(x|\theta) = \frac{d}{dx}F(x|\theta)$ is the density of the random variable, and obtain an estimate of $\theta$ by maximizing it:
$$\hat{\theta}_n = \arg\max_\theta L_n(\theta)$$
Clearly $\hat{\theta}_n = \hat{\theta}_n(X_1, X_2, \ldots, X_n)$, and the reason that this is an acceptable choice for an estimator is that the observed sample, by the mere fact that it has occurred, must have come from the distribution with the most likely parameter. Of course, the reason that MLE estimators are calculated and used is that they have very good large sample properties, as we will see, which capture this intuition.

Assuming that the density $f(x|\theta) > 0$ is smooth in $\theta$ and that the integrals we will be computing all exist, we introduce the scaled log likelihood function
$$l_n(\theta) = \frac{1}{n}\sum_{j=1}^n \log f(X_j|\theta) \qquad (6)$$

and note that $l_n'(\hat{\theta}_n) = 0$, which is the usual first order condition for an extremum. Here primes denote derivatives with respect to $\theta$ and we have
$$l_n'(\theta) = \frac{1}{n}\sum_{j=1}^n \frac{f'(X_j|\theta)}{f(X_j|\theta)}$$

By the WLLN we have that
$$\lim_{n\to\infty} P\{|l_n(\theta) - l(\theta)| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0$$
where
$$l(\theta) = E\{\log f(X|\theta)\},$$
the expectation being taken under the true parameter value $\bar{\theta}$. Having assumed differentiability in $\theta$ we also have
$$\lim_{n\to\infty} P\{|l_n'(\theta) - l'(\theta)| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0$$
where
$$l'(\theta) = E\Big\{\frac{f'(X|\theta)}{f(X|\theta)}\Big\}$$

Clearly we have that
$$l'(\bar{\theta}) = \int \frac{f'(x|\bar{\theta})}{f(x|\bar{\theta})}\,f(x|\bar{\theta})\, dx = \int f'(x|\bar{\theta})\, dx = \frac{d}{d\theta}\int f(x|\theta)\, dx\,\Big|_{\theta = \bar{\theta}} = 0$$

We also have that
$$l''(\bar{\theta}) = -\int \frac{(f'(x|\bar{\theta}))^2}{f(x|\bar{\theta})}\, dx < 0$$
which means that $\bar{\theta}$ is, in general, a maximum of $l(\theta)$. We define the Fisher information by
$$I(\bar{\theta}) = -l''(\bar{\theta}) = \int \frac{(f'(x|\bar{\theta}))^2}{f(x|\bar{\theta})}\, dx \qquad (7)$$
and note that it is positive.

    2.1 Large sample properties of MLE

An estimator is called asymptotically consistent if
$$\lim_{n\to\infty} P\{|\hat{\theta}_n - \bar{\theta}| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0 \qquad (8)$$

The fact that the MLE estimator is asymptotically consistent follows from the smoothness in $\theta$ of $f(x|\theta)$ and the existence of the integrals that arise in the calculations above. This is not an obvious statement and will not be proved in detail. It becomes more plausible when we invoke the uniform WLLN for $l_n(\theta)$ and $l_n'(\theta)$:
$$\lim_{n\to\infty} P\Big\{\max_{|\theta - \bar{\theta}| \le a}|l_n(\theta) - l(\theta)| > \epsilon\Big\} = 0, \quad \text{for all } \epsilon > 0 \qquad (9)$$
and similarly for $l_n'$, where $a > 0$ is a fixed constant. This says that near $\bar{\theta}$ the random curve $l_n(\theta)$ and its derivative $l_n'(\theta)$ are uniformly close to the deterministic curve $l(\theta)$ and its derivative, in probability. This then implies that the maximizer of $l_n$, which is $\hat{\theta}_n$, is close to the maximizer of $l$, which is $\bar{\theta}$, in probability.

An estimator is unbiased if $E\{\hat{\theta}_n\} = \bar{\theta}$ and asymptotically unbiased if $\lim_{n\to\infty} E\{\hat{\theta}_n\} = \bar{\theta}$. MLE estimators are in general biased, but because of consistency they are, with additional assumptions on integrability, asymptotically unbiased.

An important application of the central limit theorem is in determining the fluctuations in the MLE. Let $Z_n$ be the scaled error:
$$\hat{\theta}_n = \bar{\theta} + \frac{1}{\sqrt{n}}Z_n, \quad \text{or} \quad Z_n = \sqrt{n}(\hat{\theta}_n - \bar{\theta})$$

We will show that $Z_n$ converges weakly to a Gaussian random variable with mean zero and variance equal to the reciprocal of the Fisher information $I$. It is not immediately clear how the CLT comes into the picture since $\hat{\theta}_n$ is not, in general, a sum of random variables. The way to represent it approximately as a ratio of sums of random variables is by using the delta method or expansion. We use a Taylor expansion with remainder (and the mean value theorem) to get
$$0 = l_n'(\hat{\theta}_n) = l_n'(\bar{\theta}) + \frac{1}{\sqrt{n}}\,l_n''(\theta_n^*)Z_n$$

so that
$$Z_n = \frac{-\sqrt{n}\,l_n'(\bar{\theta})}{l_n''(\theta_n^*)}$$
where $\theta_n^*$ is between $\bar{\theta}$ and $\hat{\theta}_n$ and tends to $\bar{\theta}$ in probability, by consistency. We now note the following. First, the CLT can be applied directly to $\sqrt{n}\,l_n'(\bar{\theta})$ so as to show that it converges weakly to a Gaussian random variable with mean zero and variance equal to the Fisher information $I(\bar{\theta})$. Second, the uniform WLLN can be applied to $l_n''(\theta_n^*)$ to show that it converges in probability to $-I(\bar{\theta})$, minus the Fisher information¹. By Slutsky's theorem it follows that $Z_n$ converges in distribution to a Gaussian random variable with mean zero and variance $I^{-1}(\bar{\theta})$. Therefore we have shown that $\sqrt{n}(\hat{\theta}_n - \bar{\theta})$ converges in law to a Gaussian random variable with mean zero and variance $I^{-1}(\bar{\theta})$. From this result we can construct confidence intervals for the unknown parameter just as we did for the empirical mean in (2).

Slutsky's theorem says that if $U_n = V_nW_n + Y_n$ and (i) $V_n$ converges weakly to $V$, (ii) $W_n$ converges in probability to a constant $w$, and (iii) $Y_n$ converges in probability to zero, then $U_n$ converges weakly to $Vw$.
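The asymptotic normality result gives practical confidence intervals for $\bar{\theta}$. Here is a minimal Python sketch for the exponential density $f(x|\theta) = \theta e^{-\theta x}$ treated in the next section, where the MLE is $\hat{\theta}_n = 1/\bar{X}_n$ and the Fisher information is $I(\theta) = \theta^{-2}$ (the numbers are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, n = 2.5, 5_000
x = rng.exponential(scale=1.0 / theta_true, size=n)  # f(x|theta) = theta exp(-theta x)

theta_hat = 1.0 / x.mean()       # MLE from the first order condition l_n'(theta) = 0
var_limit = theta_hat**2         # I^{-1}(theta) = theta^2, evaluated at the MLE
half = 1.96 * np.sqrt(var_limit / n)

print(f"MLE: {theta_hat:.3f}, asymptotic 95% CI: "
      f"[{theta_hat - half:.3f}, {theta_hat + half:.3f}]")
```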

    2.2 Cramer-Rao lower bound and asymptotic efficiency of the MLE

We now want to show that the MLE is asymptotically efficient, because the variance of the limit fluctuation $Z$, which is the reciprocal of the Fisher information, is as small as it can be. This is done by using the Cramer-Rao lower bound for the variance of estimators, which we derive next.

¹ T. S. Ferguson, A Course in Large Sample Theory, CRC Press, 1996, page 122.

For any random variables $U, V$ we have by the Schwartz inequality
$$\mathrm{var}(U) \ge \frac{(\mathrm{cov}(U, V))^2}{\mathrm{var}(V)}$$

Let $X$ be a vector-valued random variable with (multi-dimensional) density $f(x|\theta)$ depending on a parameter $\theta$. Now apply the inequality with $U = w(X)$, a function of $X$, and $V = (\log f(X|\theta))'$. Since we have that $E\{V\} = 0$, we see that
$$\mathrm{cov}(U, V) = E\{w(X)(\log f(X|\theta))'\} = \frac{d}{d\theta}E\{w(X)\}$$

Therefore we have the inequality
$$\mathrm{var}(w(X)) \ge \frac{\big(\frac{d}{d\theta}E\{w(X)\}\big)^2}{E\{(\frac{d}{d\theta}\log f(X|\theta))^2\}}$$
This is the Cramer-Rao inequality.

In the case that the random vector $X = (X_1, X_2, \ldots, X_n)$ is a sample of size $n$ from the one-dimensional density $f(x|\theta)$, we see easily that we have
$$\mathrm{var}(w(X)) \ge \frac{\big(\frac{d}{d\theta}E\{w(X)\}\big)^2}{n\,E\{(\frac{d}{d\theta}\log f(X|\theta))^2\}}$$
This inequality is valid for any $\theta$ and any estimator $w(X)$.

We now apply this to the maximum likelihood estimator $\hat{\theta}_n$ of $\bar{\theta}$, that is, we let $w(X) = \hat{\theta}_n(X)$. We then have
$$n\,\mathrm{var}(\hat{\theta}_n(X)) \ge \frac{\big(\frac{d}{d\theta}E\{\hat{\theta}_n(X)\}\big|_{\theta = \bar{\theta}}\big)^2}{I(\bar{\theta})}$$

where
$$I(\bar{\theta}) = E\Big\{\Big(\frac{d}{d\theta}\log f(X|\theta)\Big)^2\Big\}\Big|_{\theta = \bar{\theta}}$$
is the Fisher information. The numerator on the right involves the bias of the MLE. With enough assumptions we can show that for $n$ large we have that

$$\frac{d}{d\theta}E\{\hat{\theta}_n(X)\}\Big|_{\theta = \bar{\theta}} \to 1$$
so that we have the asymptotic inequality for large $n$:
$$\mathrm{var}(\sqrt{n}\,\hat{\theta}_n(X)) \ge \frac{1}{I(\bar{\theta})}. \qquad (10)$$

We know that $\hat{\theta}_n \to \bar{\theta}$ in probability and $\sqrt{n}(\hat{\theta}_n - \bar{\theta})$ tends in distribution to a Gaussian random variable with mean zero and variance one over the Fisher information. With mild integrability assumptions this implies that the MLE is asymptotically unbiased. It is biased in general. With further assumptions, the derivative of the bias $(E\{\hat{\theta}_n\} - \bar{\theta})$ with respect to $\bar{\theta}$ also tends to zero as $n \to \infty$. This shows that the MLE has the smallest asymptotic variance among all possible consistent and asymptotically unbiased (in the stronger sense involving the derivative) estimators.

As an example of how the Cramer-Rao inequality can be used, consider a sample $X_1, X_2, \ldots, X_n$ from the exponential density $f(x|\theta) = \theta e^{-\theta x}$, $x > 0$, with parameter $\theta > 0$. We find easily from the log likelihood function that the MLE of $\theta$ is $\hat{\theta}_n = 1/\bar{X}_n$, where $\bar{X}_n$ is the sample mean. The MLE is biased because $E\{\hat{\theta}_n\} = \theta\frac{n}{n-1}$, but it is asymptotically unbiased and the derivative tends to one, so that (10) can be used asymptotically for large $n$. The Fisher information is here $I(\theta) = \theta^{-2}$. To see that $E\{\hat{\theta}_n\} = \theta\frac{n}{n-1}$, we note that
$$E\Big\{\frac{1}{\bar{X}_n}\Big\} = \int_0^\infty\!\!\cdots\!\int_0^\infty \frac{n}{x_1 + x_2 + \cdots + x_n}\,\theta^n e^{-\theta(x_1 + x_2 + \cdots + x_n)}\, dx_1\cdots dx_n = a_n\theta$$
The constant $a_n$ is evaluated by differentiating the integral with respect to $\theta$ and getting a differential equation for $E\{1/\bar{X}_n\}$, which has solution $a_n\theta$ with $a_n = \frac{n}{n-1}$.

    2.3 Asymptotic normality of posterior densities

Instead of considering $\theta$ as a parameter upon which the density $f(x|\theta)$ depends, we may think of it as a random variable with a prior density $g(\theta)$. Once the sample $X_1, X_2, \ldots, X_n$ from $f(x|\bar{\theta})$ has been observed, the posterior density of $\theta$ given the sample is defined by
$$\pi_n(\theta) = \frac{L_n(\theta)g(\theta)}{\int L_n(\theta')g(\theta')\, d\theta'} \qquad (11)$$

where the likelihood function $L_n(\theta)$ is defined by (5). We will show that if we look at the posterior density in a neighborhood of the MLE $\hat{\theta}_n$ that is of order $1/\sqrt{n}$, it tends to a limit² that is a Gaussian density with mean zero and variance equal to the reciprocal of the Fisher information $I(\bar{\theta})$. The actual theorem states that if we let $\theta = \hat{\theta}_n + \frac{\tau}{\sqrt{n}}$ and change variables in the posterior $\pi_n(\theta)$ so that it becomes $\tilde{\pi}_n(\tau)$, then
$$\tilde{\pi}_n(\tau) \to \frac{e^{-\frac{I(\bar{\theta})\tau^2}{2}}}{\sqrt{2\pi(1/I(\bar{\theta}))}} \qquad (12)$$

² T. S. Ferguson, A Course in Large Sample Theory, CRC Press, 1996, p. 141.

in probability and for each $\tau$. Some hypotheses about the prior $g(\theta)$ are needed so that the limit can be taken inside the integral in the denominator in (11). We know by (8) that the MLE is consistent, $\hat{\theta}_n \to \bar{\theta}$ in probability, so the change of variables is essentially centered about the true parameter value $\bar{\theta}$. However, the convergence of the posterior density is not valid if we do not center around $\hat{\theta}_n$. Note also that the limit posterior is independent of the prior density $g(\theta)$, assuming that the latter is positive for all $\theta$ as well as such that the limit can be taken inside the integral in (11).

For the proof we note that by (6), $L_n(\theta) = e^{nl_n(\theta)}$, and we have the expansion
$$l_n\Big(\hat{\theta}_n + \frac{\tau}{\sqrt{n}}\Big) = l_n(\hat{\theta}_n) + \frac{\tau}{\sqrt{n}}\,l_n'(\hat{\theta}_n) + \frac{\tau^2}{2n}\,l_n''(\theta_n^*)$$
where $\theta_n^*$ is between $\hat{\theta}_n$ and $\hat{\theta}_n + \frac{\tau}{\sqrt{n}}$ and so tends to $\bar{\theta}$ in probability as $n \to \infty$. But since $l_n'(\hat{\theta}_n) = 0$, which is essential for the centering, we have
$$l_n\Big(\hat{\theta}_n + \frac{\tau}{\sqrt{n}}\Big) = l_n(\hat{\theta}_n) + \frac{\tau^2}{2n}\,l_n''(\theta_n^*)$$

Taking into consideration the normalization (the denominator in (11)), the uniform law of large numbers (9) for second derivatives of the log likelihood function, and the consistency of the MLE, we have that $l_n''(\theta_n^*) \to -I(\bar{\theta})$ in probability. We see that this last relation does imply (12), in probability for each $\tau$. It can also be shown that this implies that
$$\int\Big|\tilde{\pi}_n(\tau) - \frac{e^{-\frac{I(\bar{\theta})\tau^2}{2}}}{\sqrt{2\pi(1/I(\bar{\theta}))}}\Big|\, d\tau \to 0$$
in probability.

    3 Basic Monte Carlo methods

The main problem in Monte Carlo simulation is to calculate, using sampling methods, expectations of complicated (multi-dimensional) functions of random variables (random vectors) that also have complicated distributions.

The most direct, but often difficult to apply, way of generating random variables with a given distribution $F(x)$ (density $f(x) = F'(x)$) is by noting that if $U$ is a uniform random variable over $[0, 1]$ then $F^{-1}(U)$ has distribution $F(x)$. This is because $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$. Of course, generating i.i.d. uniform random variables $U_1, U_2, \ldots, U_n$ numerically is not easy and requires deep number-theoretic methods, especially if $n$ is large and rigorous statistical tests for independence are to be satisfied. But if we assume that this can be done, then generation of a sample $X_1, X_2, \ldots, X_n$ with density $f$ is easy, except when the inverse distribution function $F^{-1}$ is hard to obtain, in which case the acceptance-rejection algorithm can be used.
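For instance (an added sketch), for the exponential distribution $F(x) = 1 - e^{-\lambda x}$ the inverse is explicit, $F^{-1}(u) = -\log(1 - u)/\lambda$, and the inverse transform method is one line of Python:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
u = rng.uniform(size=100_000)        # i.i.d. uniforms on [0, 1]
x = -np.log(1.0 - u) / lam           # F^{-1}(U) has distribution F

print(x.mean(), 1.0 / lam)           # empirical mean vs. theoretical mean 1/lambda
```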

Given a function $g(x)$, and assuming that we can generate an i.i.d. sample from $f(x)$, we approximate $I = E(g(X))$ by
$$I_n = \frac{1}{n}\sum_{j=1}^n g(X_j)$$
This is basic Monte Carlo.

    3.1 Properties of basic Monte Carlo

Clearly $E(I_n) = I$ and
$$\mathrm{var}(I_n) = E\Big\{\Big(\frac{1}{n}\sum_{j=1}^n g(X_j) - I\Big)^2\Big\} = E\Big\{\Big(\frac{1}{n}\sum_{j=1}^n (g(X_j) - I)\Big)^2\Big\} = E\Big\{\frac{1}{n^2}\sum_{j=1}^n (g(X_j) - I)^2\Big\} = \frac{1}{n}\mathrm{var}(g(X))$$
The fact that the variance of the Monte Carlo approximation $I_n$ decays as one over the sample size is characteristic of this method and the main limiting factor for its applicability. It is therefore important to look for ways of reducing the multiplicative factor $\sigma^2 = \mathrm{var}(g(X))$, and this is what importance sampling tries to do.

We can also use the CLT to get confidence intervals for $I$ using the approximations $I_n$. The CLT does apply since we assume that $\mathrm{var}(g(X)) = \sigma^2 < \infty$. If $\xi_\alpha$ is such that $P(|Z| > \xi_\alpha) = \alpha$ for a standard Gaussian $Z$, then for $n$ large we have approximately
$$P\Big(\frac{\sqrt{n}}{\sigma}|I_n - I| \le \xi_\alpha\Big) \approx 1 - \alpha$$
or
$$I_n - \frac{\sigma\xi_\alpha}{\sqrt{n}} \le I \le I_n + \frac{\sigma\xi_\alpha}{\sqrt{n}}$$

with probability $1 - \alpha$. Of course $\sigma$ is not known and must be replaced by the sample variance
$$s_n^2 = \frac{1}{n-1}\sum_{j=1}^n (g(X_j) - I_n)^2$$

to get a realizable confidence interval. The justification for replacing $\sigma$ by $s_n$ in the CLT, that is, in still having $\frac{\sqrt{n}}{s_n}(I_n - I) \to N(0, 1)$ in law, requires the use of Slutsky's theorem: since $s_n \to \sigma > 0$ in probability and $\sqrt{n}(I_n - I) \to N(0, \sigma^2)$ in distribution, we have $\frac{\sqrt{n}}{s_n}(I_n - I) \to N(0, 1)$ in distribution. The confidence intervals are now realizable:
$$I_n - \frac{s_n\xi_\alpha}{\sqrt{n}} \le I \le I_n + \frac{s_n\xi_\alpha}{\sqrt{n}}$$

with probability $1 - \alpha$. The question of how large the number of realizations $n$ must be for this to be reasonably accurate, based on the asymptotic theory, depends both on the integrand $g$ and on the density $f$. Since we are not doing estimation here, it is usually possible to increase the number of realizations and thus improve the accuracy of the confidence interval.

One can also use the continuous mapping theorem, which can be stated in general as follows. Suppose that the pair of random variables $(X_n, Y_n)$ converges in distribution to $(X, Y)$. Let $h(x, y)$ be a function from $\mathbb{R}^2 \to \mathbb{R}$ such that the set $\{(x, y) \mid h(x, y) \text{ is not continuous}\}$ has probability zero with respect to the limit law. Then $h(X_n, Y_n) \to h(X, Y)$ in distribution. This more general theorem can be applied here with $X_n = \sqrt{n}(I_n - I)$, $Y_n = s_n$ and $h(x, y) = x/y$.

    3.2 Importance sampling

Importance sampling is used when the function $g(x)$ to be integrated and the density $f(x)$ overlap very little, so that the product $g(x)f(x)$ is very small and therefore $I = E(g(X))$ is very small. Most of the samples drawn from $f$ will not overlap significantly with regions where $g$ is significant. This means that the relative error (standard deviation over mean) in direct Monte Carlo simulations will be very large.

The main idea in importance sampling is to introduce a reference density $\tilde{f}(x)$, let
$$M(x) = \frac{f(x)}{\tilde{f}(x)}$$

and then note that, assuming all integrals are well defined,
$$\tilde{E}(gM) = \int g(x)\frac{f(x)}{\tilde{f}(x)}\tilde{f}(x)\, dx = \int g(x)f(x)\, dx = E(g) = I$$

We can now generate an unbiased Monte Carlo approximation by using a sample from $\tilde{f}$, $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$, so that
$$\tilde{I}_n = \frac{1}{n}\sum_{j=1}^n g(\tilde{X}_j)M(\tilde{X}_j)$$

The variance of $\tilde{I}_n$ is now $1/n$ times
$$\tilde{E}((gM)^2) - (\tilde{E}(gM))^2 = \int \frac{(g(x))^2(f(x))^2}{\tilde{f}(x)}\, dx - I^2$$

The question is how to choose $\tilde{f}$ so as to reduce the variance of $\tilde{I}_n$. Assuming that $g$ is positive, we see right away that if we take
$$\tilde{f}(x) = \frac{g(x)f(x)}{E(g(X))}$$
then the variance of $\tilde{I}_n$ is zero! The reference density puts all the weight just where it should, that is, where the product $gf$ is significant. But this is hardly an improved Monte Carlo method, since in order to implement it we need to know $E(g(X))$, which is the very quantity that we are trying to approximate. A number of interesting algorithms can be developed, however, that use approximations of $E(g)$ to get $\tilde{f}$, which will then lead to improved approximations.
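As an illustration of the variance reduction (an added sketch with arbitrarily chosen numbers), consider estimating the small tail probability $I = P(X > 4)$ for $X \sim N(0, 1)$, that is $g(x) = 1_{\{x > 4\}}$. Taking the shifted reference density $\tilde{f} = N(4, 1)$ puts the samples where $g$ is significant, with likelihood ratio $M(x) = f(x)/\tilde{f}(x) = e^{-4x + 8}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Basic Monte Carlo: almost no samples land in {x > 4}.
x = rng.normal(0.0, 1.0, size=n)
print("basic MC:     ", np.mean(x > 4))

# Importance sampling from the shifted reference density N(4, 1).
y = rng.normal(4.0, 1.0, size=n)
w = np.exp(-4.0 * y + 8.0)                    # M(y) = f(y)/f_tilde(y)
print("importance MC:", np.mean((y > 4) * w)) # close to P(X > 4) ~ 3.17e-5
```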

One possibility is to choose $\tilde{f}(x)$ as follows. Let $\{x_j\}$ be a partition of the real line and define, up to normalization,
$$\tilde{f}(x) = \sum_j f(x_j)g(x_j)\,1_{\{x_{j-1} < x \le x_j\}},$$
a computable piecewise-constant approximation of the optimal reference density $gf/E(g)$.

3.3 Acceptance-rejection

Suppose that we want to sample from a density $f(x)$ that is hard to sample from directly, and that there is another density $g(x)$, easy to sample from, and a constant $c > 1$ such that
$$\frac{f(x)}{g(x)} \le c$$

for all $x$, and we may consider the smallest such constant. The acceptance-rejection algorithm consists of the following steps:

1. Generate $Z$ from $g$.
2. Generate an independent uniform $[0, 1]$ random variable $U$.
3. Is $U \le \frac{f(Z)}{c\,g(Z)}$? If yes, then return $Z$ (accept); if not, then repeat (go to step 1).

The output random variable, say $X$, will be in a set $A$ if
$$1_{\{X \in A\}} = 1_{\{Z_1 \in A\}}1_{\{U_1 \le \frac{f(Z_1)}{cg(Z_1)}\}} + \sum_{k=2}^\infty\Big(\prod_{j=1}^{k-1}1_{\{U_j > \frac{f(Z_j)}{cg(Z_j)}\}}\Big)1_{\{Z_k \in A\}}1_{\{U_k \le \frac{f(Z_k)}{cg(Z_k)}\}}$$

But
$$E\big(1_{\{Z \in A\}}1_{\{U \le \frac{f(Z)}{cg(Z)}\}}\big) = \int 1_{\{z \in A\}}E\big(1_{\{U \le \frac{f(z)}{cg(z)}\}}\big)g(z)\, dz = \frac{1}{c}\int_A f(z)\, dz$$

and using the independence of the random variables in the acceptance-rejection algorithm we see that
$$P(X \in A) = \frac{1}{c}\int_A f(z)\, dz + \frac{1}{c}\int_A f(z)\, dz\sum_{k=2}^\infty\Big(1 - \frac{1}{c}\Big)^{k-1} = \int_A f(z)\, dz$$
This shows that indeed the acceptance-rejection algorithm does produce random variables with the correct density.
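A minimal Python sketch of the algorithm (an added example with an arbitrary target): sample from the density $f(x) = \frac{3}{2}x^2$ on $[-1, 1]$ using the uniform proposal $g(x) = \frac{1}{2}$ on $[-1, 1]$, for which $f(x)/g(x) = 3x^2 \le 3 = c$:

```python
import numpy as np

rng = np.random.default_rng(4)
c = 3.0   # smallest constant with f(x)/g(x) = 3 x^2 <= c on [-1, 1]

def sample_f(rng):
    """Acceptance-rejection for f(x) = 1.5 x^2 on [-1, 1]."""
    while True:
        z = rng.uniform(-1.0, 1.0)            # step 1: Z from g
        u = rng.uniform()                     # step 2: independent uniform U
        if u <= (1.5 * z**2) / (c * 0.5):     # step 3: accept if U <= f(Z)/(c g(Z))
            return z

samples = np.array([sample_f(rng) for _ in range(20_000)])
print(samples.var(), 3.0 / 5.0)   # E{X^2} = 3/5 for this f; E(N) = c = 3 cycles per draw
```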

To understand better how the algorithm behaves, let $N$ be the number of cycles needed to produce the desired random variable. Clearly
$$P(N = k) = \Big[\int P\Big(U > \frac{f(z)}{cg(z)}\Big)g(z)\, dz\Big]^{k-1}\Big[1 - \int P\Big(U > \frac{f(z)}{cg(z)}\Big)g(z)\, dz\Big] = \Big(1 - \frac{1}{c}\Big)^{k-1}\frac{1}{c}, \quad k = 1, 2, \ldots$$
and therefore $E(N) = c$, which explains the role that the constant $c$ plays.

Now acceptance-rejection can be combined with Monte Carlo somewhat in the way importance sampling is done, as follows. Suppose that it is hard to sample from $f(x)$ and that there is another density $g(x)$ and a constant $c > 1$ such that $f(x)/g(x) \le c$ for all $x$. To compute $I = E(h(X))$ for some function $h(x)$, we generate $X_1, X_2, \ldots, X_n$ by the acceptance-rejection algorithm and then compute

$$I_n = \frac{1}{n}\sum_{j=1}^n h(X_j)$$

This is a slightly less efficient approximation than when we generate the sample directly from $f(x)$ because, if we count the steps needed in the acceptance-rejection cycles, a total of $nc$ (with $c > 1$) steps are needed on average to compute $I_n$. Counting also the uniform random variables used in the acceptance-rejection algorithm, we see that we need on average $2nc$ samples to generate $I_n$. The accuracy of $I_n$ remains the same but the computational cost has increased.

    3.4 Glivenko-Cantelli theorem and the Kolmogorov-Smirnov test

There is little practical or theoretical distinction between Monte Carlo and estimation methods, so we discuss non-parametric estimation in this part of the notes.

Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a distribution $F(x)$ and define the empirical distribution $F_n(x)$ by
$$F_n(x) = \frac{1}{n}\sum_{j=1}^n 1_{\{X_j \le x\}}$$

This is a random (since it depends on the sample) piecewise-constant distribution function that should serve as an estimate of the true $F(x)$, assumed unknown here. Since for each $x \in \mathbb{R}$ the random variables $1_{\{X_j \le x\}}$ are bounded by one, have mean $F(x)$ and are independent, the weak law of large numbers tells us that $F_n(x) \to F(x)$ in probability. In fact this is true also with probability one, and the CLT holds:
$$\sqrt{n}\,(F_n(x) - F(x)) \to N(0, F(x)(1 - F(x))) \quad \text{in distribution, for each } x \in \mathbb{R}$$
This is not so useful, though, because the variance of the limit Gaussian depends on the very distribution we want to estimate. Normalizing by $F_n$, so that
$$\frac{\sqrt{n}}{\sqrt{F_n(x)(1 - F_n(x))}}(F_n(x) - F(x)) \to N(0, 1) \quad \text{in distribution, for each } x \in \mathbb{R},$$
is not as useful either, because of the rather slow convergence, which depends on $x$.

There are two basic results in non-parametric estimation that we now state and discuss, without proofs, which can be found in advanced statistics and probability books. The first is the Glivenko-Cantelli limit theorem and the second the Kolmogorov-Smirnov limit theorem. The first states that
$$P\Big\{\lim_{n\to\infty}\sup_{x\in\mathbb{R}}|F_n(x) - F(x)| = 0\Big\} = 1$$

The second states that if $F(x)$ is continuous then $D_n = \sup_{x\in\mathbb{R}}|F_n(x) - F(x)|$ is a random variable whose law is independent of $F$, and $\sqrt{n}D_n$ converges in distribution to a universal law:
$$\lim_{n\to\infty}P\{\sqrt{n}D_n \le x\} = 1 - 2\sum_{k=1}^\infty(-1)^{k-1}e^{-2k^2x^2}$$

With the KS theorem we can formulate various statistical tests regarding the estimation of the unknown distribution $F(x)$. The limit distribution of $\sqrt{n}D_n$ is, for continuous $F(x)$, the distribution of the maximum of the absolute value of the Brownian bridge, which is a Gaussian stochastic process $B(t)$ defined on $[0, 1]$, with mean zero and covariance $E\{B(t)B(s)\} = t(1 - s)$ for $0 < t \le s < 1$, symmetric in $t$ and $s$. A Gaussian process indexed by $t \in [0, 1]$ is a collection of Gaussian random variables such that for any set of indices $0 \le t_0 < t_1 < \cdots < t_N \le 1$ the vector $(B(t_0), B(t_1), \ldots, B(t_N))$ is Gaussian with mean zero and covariance matrix given as above. This limit theorem is an example of an invariance principle, where the law of a function of the process before the limit, in this case the maximum absolute value, converges to the law of the same function of the limit process.
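As a small sketch of how $D_n$ is computed in practice (an added example; the exponential null distribution is an arbitrary choice), note that the supremum is attained at the jumps of $F_n$, and the asymptotic 5% critical value of $\sqrt{n}D_n$ from the series above is approximately 1.358:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = np.sort(rng.exponential(size=n))

F = 1.0 - np.exp(-x)                 # hypothesized continuous CDF at the sorted data
i = np.arange(1, n + 1)
# D_n = sup_x |F_n(x) - F(x)|, checked just before and after each jump of F_n:
d_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))

print(np.sqrt(n) * d_n, "vs. asymptotic 5% critical value ~1.358")
```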


3.5 Density kernel estimation

A simpler non-parametric test, when an unknown density is involved, will be discussed in some detail because it shows explicitly the dependence of the rate of convergence on the smoothness of the density. In other words, one sees explicitly the slowing of the rate of convergence in non-parametric estimation.

Suppose that we have a sample $X_1, X_2, \ldots, X_n$ from a density $f(x)$ which is not known and we want to estimate it. Since $F'(x) = f(x)$, an estimator for $f$ can be obtained by differentiating the empirical distribution $F_n$. But this is a random step function, so its derivative is a sum of delta functions. So we need some smoothing, which we do with a smoothing kernel: a positive, infinitely differentiable function $\phi(x)$ such that
$$\phi(x) \ge 0, \quad \int_{\mathbb{R}}\phi(x)\, dx = 1, \quad \int_{\mathbb{R}}x\phi(x)\, dx = 0, \quad \int_{\mathbb{R}}x^2\phi(x)\, dx = 1$$

The estimate of the density $f$ is now
$$\hat{f}_{n,h}(x) = \frac{1}{n}\sum_{j=1}^n \frac{1}{h}\,\phi\Big(\frac{x - X_j}{h}\Big)$$
Clearly

$$E\{\hat{f}_{n,h}(x)\} = \int_{\mathbb{R}}\phi_h(x - y)f(y)\, dy \approx f(x) + \frac{h^2}{2}f''(x) = f(x) + \text{bias}$$
for $h$ small, assuming differentiability of the density. Here $\phi_h(x) = \phi(x/h)/h$, and the small-$h$ expansion is just a Taylor expansion plus use of the normalization properties of the smoothing kernel $\phi$.

The variance of the estimator is similarly given by
$$\mathrm{var}(\hat{f}_{n,h}(x)) = \frac{1}{n}\mathrm{var}(\phi_h(x - X)) = \frac{1}{n}\Big[\int_{\mathbb{R}}\phi_h^2(x - y)f(y)\, dy - \Big(\int_{\mathbb{R}}\phi_h(x - y)f(y)\, dy\Big)^2\Big]$$
Assuming from here on that $\phi(x)$ is the Gaussian mean zero, variance one density, we have that

$$\mathrm{var}(\hat{f}_{n,h}(x)) \approx \frac{1}{n}\Big[\frac{1}{h}\frac{1}{2\sqrt{\pi}}f(x) - f^2(x)\Big]$$
to principal order as $h \to 0$. From this we conclude that the mean square error is

$$E\{(\hat{f}_{n,h}(x) - f(x))^2\} = \mathrm{var}(\hat{f}_{n,h}(x)) + (\text{bias})^2 \approx \frac{1}{h}\frac{1}{2n\sqrt{\pi}}f(x) + \frac{h^4}{4}(f''(x))^2$$

to principal order in $h \to 0$ for each term. Minimizing this error over $h$ gives
$$h^* = \frac{1}{n^{1/5}}\Big(\frac{1}{2\sqrt{\pi}}\frac{f(x)}{(f''(x))^2}\Big)^{1/5}$$

It follows that with the optimal (in the MSQ sense) smoothing we have
$$\hat{f}_{n,h^*}(x) = f(x) + O(n^{-2/5}) \quad \text{for each } x$$
in the MSQ sense. Instead of an error $O(n^{-1/2})$, which is the usual one for parameter estimation, we have slower error decay because of the necessary smoothing. And, of course, the density $f(x)$ to be estimated must have at least two derivatives.
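A short Python sketch of the kernel estimator with a Gaussian kernel and the $h \propto n^{-1/5}$ bandwidth scaling (an added example; the standard normal data are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4_000
data = rng.normal(size=n)            # sample from a density we pretend not to know

h = n ** (-1 / 5)                    # bandwidth with the optimal n^{-1/5} scaling
x = np.linspace(-3, 3, 7)

# f_hat(x) = (1/n) sum_j phi_h(x - X_j), phi = standard Gaussian density
u = (x[:, None] - data[None, :]) / h
f_hat = np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

f_true = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
print(np.round(f_hat, 3))
print(np.round(f_true, 3))
```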

    3.6 Bootstrap

Let $X_1, X_2, \ldots, X_n$ be a sample drawn from a distribution function $F(x)$ and let $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$ be a statistic of interest, such as an estimator of an unknown parameter. We often want to calculate the standard deviation of this statistic, $SD = \sigma(F, n, \hat{\theta}) = \sigma(F)$, or some other measure of uncertainty, but the distribution function $F$ is not known, or is not known with enough accuracy. The bootstrap³ is a very effective way to do this.

We introduce the empirical distribution function of the sample, $F_n$, which assigns mass $1/n$ at the points $x_i$, $i = 1, 2, \ldots, n$, the values of the observed sample. We want to calculate $\widehat{SD} = \sigma(F_n, n, \hat{\theta}) = \sigma(F_n)$ which, depending on the statistic and the distribution $F$, will be close to the theoretical $SD$ for $n$ large. The issue in bootstrap is primarily how to calculate $\widehat{SD}$ for fixed but large $n$, since this is often a combinatorially complex problem for large $n$.

We do this by Monte Carlo, using $F_n$ as the basic distribution from which to sample. A bootstrap sample is denoted by $X_1^*, X_2^*, \ldots, X_n^*$, which is an i.i.d. sequence drawn from $F_n$. This is the same as drawing from $(x_1, x_2, \ldots, x_n)$ with replacement $n$ times. Let $\hat{\theta}^* = \hat{\theta}(X_1^*, X_2^*, \ldots, X_n^*)$ be the bootstrap value of the statistic. The bootstrap standard deviation is simply a Monte Carlo approximation of $\widehat{SD} = \sigma(F_n)$. We obtain it by drawing repeated independent samples of size $n$ from $F_n$ and letting
$$SD_B = \Big(\frac{1}{B-1}\sum_{b=1}^B(\hat{\theta}^*_b - \hat{\theta}^*_\cdot)^2\Big)^{1/2}$$
where $B$ is the number of bootstrap samples, assumed large enough to give a good Monte Carlo approximation of $\sigma(F_n)$. Here $\hat{\theta}^*_\cdot$ is the empirical mean of the bootstrap values of the statistic over the $B$ Monte Carlo samples (re-samples). As $B$ tends to infinity we have that $SD_B \to \widehat{SD}$, but since for finite $n$ this is only an approximation of $SD$, which is what we want, we should ideally choose $B$ so that the Monte Carlo error is comparable to the finite-$n$ error.
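A minimal Python sketch of $SD_B$ (an added example) for the sample median, a statistic whose standard deviation has no simple closed form; $n$, $B$ and the data-generating distribution are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 200, 2_000
x = rng.exponential(size=n)               # the observed sample (F unknown in practice)

theta_hat = np.median(x)                  # statistic of interest
idx = rng.integers(0, n, size=(B, n))     # resampling from F_n = drawing with replacement
theta_star = np.median(x[idx], axis=1)    # B bootstrap values of the statistic

sd_b = theta_star.std(ddof=1)             # SD_B as defined above
print(f"median = {theta_hat:.3f}, bootstrap SD = {sd_b:.3f}")
```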

As an example, consider the sample mean $\bar{X}$ as the statistic of interest. Then the standard deviation is $\sigma(F) = (\frac{\sigma^2}{n})^{1/2}$, where $\sigma^2 = E_F(X - E_F(X))^2$ is the theoretical variance of $F$. The bootstrap standard deviation is $\widehat{SD} = (\frac{\hat{\sigma}^2}{n})^{1/2}$, where $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2$, which is the sample variance. Thus $E_F(\widehat{VAR}) = E_F(\frac{\hat{\sigma}^2}{n}) = \frac{n-1}{n}\frac{\sigma^2}{n} = \frac{n-1}{n}\mathrm{Var}(\bar{X})$. The point of this example is that the bootstrap is consistent with what is expected, in the same way that basic Monte Carlo is expected to work, but now we only deal with the original sample of size $n$ by resampling it.

³ See the lecture notes by B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Series in Applied Mathematics no. 38, 1982.


4 Markov Chains

A time-homogeneous Markov chain $\{X_0, X_1, \ldots, X_n, \ldots\}$ taking values in a finite set $S$ of size $N$ is characterized by its transition probabilities
$$P(x, y) = P\{X_{n+1} = y \mid X_n = x\} \ge 0, \quad x, y \in S, \quad \sum_y P(x, y) = 1,$$

which are independent of $n$ because of time homogeneity. The Markov property means that conditional probabilities depend only on the latest information:
$$P\{X_n = y \mid X_0, X_1, \ldots, X_{n-1}\} = P\{X_n = y \mid X_{n-1}\}$$

and therefore the joint probability of $\{X_n\}_{n\ge 0}$, in path space, is expressed as a product of transition probabilities:
$$P\{X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0\} = \pi_0(x_0)P(x_0, x_1)\cdots P(x_{n-1}, x_n)$$
where $\pi_0(x) = P\{X_0 = x\}$. Path probabilities fully determine the $S$-valued process $\{X_n\}_{n\ge 0}$, and for Markov chains the initial probabilities $\pi_0$ and the transition probabilities $P$ determine everything.

The $n$-step transition probability
$$P\{X_n = y \mid X_0 = x\} = P_x\{X_n = y\}$$
is obtained recursively by using the Markov property and time homogeneity. We have, introducing also some notation and using the law of iterated conditional expectation, that
$$P_n(x, y) = P_x\{X_n = y\} = E_x\{P\{X_n = y \mid X_1, X_0 = x\}\} = E_x\{P\{X_n = y \mid X_1\}\} \quad \text{(Markov property)}$$
$$= E_x\{P_{X_1}\{X_{n-1} = y\}\} \quad \text{(time homogeneity)} \quad = \sum_{z\in S}P(x, z)P_z\{X_{n-1} = y\} = \sum_{z\in S}P(x, z)P_{n-1}(z, y)$$
so that $P_n = (P_n(x, y))_{x,y\in S}$ is identified as the $n$-th power of the $N \times N$ transition matrix $P = (P(x, y))_{x,y\in S}$. If $\pi_0(x) = P\{X_0 = x\}$, $x \in S$, is the initial probability of the chain, then

$$\pi_n(x) = \sum_{z\in S}\pi_{n-1}(z)P(z, x) = \sum_{z\in S}\pi_0(z)P_n(z, x), \quad n = 1, 2, \ldots, \ x \in S.$$
Thus probability vectors $\pi_n = (\pi_n(x))_{x\in S}$ can be considered to be $N$-row vectors that get updated recursively by left multiplication with the transition matrix.

Expectations of functions of the state,
$$u_n(x) = E_x\{f(X_n)\} = \sum_{z\in S}P(x, z)u_{n-1}(z) = \sum_{z\in S}P_n(x, z)f(z), \quad n = 1, 2, 3, \ldots, \ x \in S,$$
can similarly be considered as $N$-column vectors $u_n = (u_n(x))_{x\in S}$ that are updated recursively by right multiplication with $P$, starting with the initial column vector $f = (f(x))$, since $u_0(x) = f(x)$ when $X_0 = x$.
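A small Python sketch of these two recursions (an added example with an arbitrary 3-state chain): row vectors $\pi_n = \pi_0 P^n$ updated from the left, column vectors $u_n = P^n f$ updated from the right:

```python
import numpy as np

# An arbitrary 3-state transition matrix (each row sums to one).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi0 = np.array([1.0, 0.0, 0.0])      # start in the first state
f = np.array([1.0, -1.0, 2.0])       # a function of the state

Pn = np.linalg.matrix_power(P, 25)   # n-step transition probabilities P^n
print("pi_n =", pi0 @ Pn)            # distribution of X_n
print("u_n  =", Pn @ f)              # E_x f(X_n), one entry per starting state x
```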

    4.1 Exit times

Let $C \subset S$ and let $T = T_x$ be the first time to enter the complement of $C$, $C^c$, starting from $x \in C$, which is also the first exit time from $C$:
$$T = \min\{n \ge 1 \mid X_n \notin C\}$$
This is a random variable that takes integer values and is such that for all $n \ge 0$ the events $\{T > n\}$ depend only on the Markov chain up to time $n$, $\{X_1, X_2, \ldots, X_n\}$. It is called a stopping time. Clearly the exit time of the Markov chain from a subset $C$ of the state space $S$ is a stopping time.

We want to find a linear system of equations satisfied by $v_n(x) = P_x\{T > n\} = 1 - P_x\{T \le n\}$, which gives the probability distribution of $T$, starting from $x$ and with $n = 0, 1, \ldots$. We use a first transition or renewal analysis that relies on the Markov property and time homogeneity, as follows. For $x \in C$ we have
$$v_n(x) = P_x\{T > n\} = E_x\{P\{T > n \mid X_1, X_0 = x\}\} = E_x\{P\{T > n \mid X_1\}\} = \sum_{y\in C}P(x, y)P_y\{T > n - 1\} = \sum_{y\in C}P(x, y)v_{n-1}(y), \quad n = 1, 2, 3, \ldots, \qquad v_0(x) = 1_{\{x\in C\}}$$

To see why $P\{T > n \mid X_1\} = P_{X_1}\{T > n - 1\}$, and to explain the notation, we first write $T = T(X_0, X_1, X_2, \ldots)$ to indicate that the exit time depends on the path of the Markov chain. By the definition of the exit time from $C$, $T > 0$ when $X_0 = x \in C$. With $n = 1, 2, \ldots$, after one time unit has passed, $I_{\{T(X_0, X_1, X_2, \ldots) > n\}} = I_{\{1 + T(X_1, X_2, \ldots) > n\}}$ when $X_1 \in C$. In words, after one time step the Markov chain restarts from the state $X_1 \in C$ to which it went, and the exit time increases by one unit. Time homogeneity then leads to the result above.

If we let $P_C = (P(x, y))_{x,y\in C}$, which is a sub-stochastic transition matrix since in general its row sums are less than one, then for the vector $v_n = (v_n(x))_{x\in C}$ we have the linear recursion
$$v_n = P_C v_{n-1}, \quad v_0 = \mathbf{1}$$

Note again that the column vectors $v_n$ are restricted to elements $x$ in $C$, and $\mathbf{1}$ is the column vector of all ones. Alternatively we can write the recursion as an initial-boundary value problem:
$$v_n(x) = \sum_{y\in S}P(x, y)v_{n-1}(y), \quad n = 1, 2, \ldots, \ x \in S$$
with
$$v_0(x) = 1, \ x \in C, \qquad v_n(x) = 0, \ x \notin C, \ n = 0, 1, \ldots$$

Let us define the norm of row vectors to be the $l_1$ norm
$$\|q\| = \sum_{y\in S}|q(y)|$$
and the norm of column vectors to be the maximum norm
$$\|f\| = \max_x|f(x)|$$
Then the induced matrix norm is
$$\|Q\| = \max_x\sum_y|Q(x, y)|$$
and we see that the norm of transition matrices $P$ is one, but the norm of $P_C$ is in general less than one. This is the case if transitions from states in $C$ to states in the complement occur with positive probability. Therefore, since $\|Q^2\| \le \|Q\|^2$, we conclude that, in general, $\|v_n\| \to 0$ as $n \to \infty$, assuming that $\|P_C\| < 1$. This means that the exit time from $C$, $T$, is finite with probability one, as should be expected in this case.

Consider as another example the calculation of the mean exit time from $C$, $E_x\{T\}$. For some fixed $0 < s < 1$ define the moment generating function of the exit time
$$u(x; s) = E_x\{s^T\}$$

Using the first transition or renewal argument, as above, we find that $u$ satisfies the linear system
$$u(x; s) = s\sum_{y\in C^c}P(x, y) + s\sum_{y\in C}P(x, y)u(y; s), \quad x \in C$$

Assuming a finite expectation for the exit time, we have that
$$\bar{u}(x) = \frac{d}{ds}u(x; s)\Big|_{s=1} = E_x\{T\}, \quad x \in C$$

By differentiating the equation for $u(x; s)$ with respect to $s$ and setting $s = 1$ we get, after some rearrangement,
$$\bar{u}(x) = 1 + \sum_{y\in C}P(x, y)\bar{u}(y), \quad x \in C$$
and in vector form
$$P_C\bar{u} - \bar{u} = -\mathbf{1}$$
Since the norm of $P_C$ is less than one, as it is in general when the Markov chain can reach states in $C^c$ from states in $C$, we have that
$$\bar{u} = (I - P_C)^{-1}\mathbf{1}$$

Another interesting application of these ideas is the calculation of $u(x) = E_x\{f(X_T)\}$, where $f(x)$ is a function defined for $x \in C^c$. For example, when $f(x) = 1_{\{x = z\}}$, $u(x) = u(x; z)$ is the probability of exiting $C$ by going to $z \in C^c$. We leave this as an exercise.
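A small Python sketch of the mean exit time formula $\bar{u} = (I - P_C)^{-1}\mathbf{1}$ (an added example): for the symmetric random walk on $\{0, 1, \ldots, 5\}$ with absorbing endpoints and $C = \{1, 2, 3, 4\}$, the known answer is $E_x\{T\} = x(5 - x)$:

```python
import numpy as np

N = 5                          # walk on {0, ..., 5}; 0 and 5 absorb, C = {1, 2, 3, 4}
C = range(1, N)

# Sub-stochastic matrix P_C: transitions within C for the symmetric walk.
P_C = np.zeros((N - 1, N - 1))
for i, x in enumerate(C):
    if x - 1 in C:
        P_C[i, i - 1] = 0.5
    if x + 1 in C:
        P_C[i, i + 1] = 0.5

u_bar = np.linalg.solve(np.eye(N - 1) - P_C, np.ones(N - 1))
print(u_bar)                   # [4. 6. 6. 4.], i.e. x(5 - x) for x = 1, 2, 3, 4
```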

    4.2 Transience and recurrence

Consider a finite state Markov chain with transition probabilities $P(x, y)$, $x, y \in S$, and let $F_n(x, y)$ be the probability that $y$ is reached for the first time at time $n$, starting from $x$:
$$F_n(x, y) = P\{X_1 \ne y, \ldots, X_{n-1} \ne y, X_n = y \mid X_0 = x\}, \quad n = 1, 2, \ldots$$
Note that $F_n(x, y) = P_x\{T_y = n\}$, where $T_y$ is the first time that the state $y$ is reached. We want to show that

$$P_n(x, y) = \sum_{m=1}^n F_m(x, y)P_{n-m}(y, y), \quad n = 1, 2, \ldots$$

In terms of the generating functions $\hat{P}(x, y; s) = \sum_{n=0}^\infty s^n P_n(x, y)$ and $\hat{F}(x, y; s) = \sum_{n=1}^\infty s^n F_n(x, y)$, we have that
$$\hat{P}(x, y; s) = \delta_{x,y} + \hat{F}(x, y; s)\hat{P}(y, y; s)$$

for $0 \le s \le 1$, where $\delta_{x,y}$ is zero when $x \ne y$ and one otherwise. A state $y$ is said to be persistent or recurrent if $\hat{F}(y, y; 1) = \sum_{n=1}^\infty F_n(y, y) = 1$, which means that $T_y < \infty$ with probability one when the chain starts from $y$; otherwise the state is called transient.

Note that the event $\{T_y = n\} = \{X_0 = x, X_1 \ne y, \ldots, X_{n-1} \ne y, X_n = y\}$ and so it depends on the path only up to time $n$. We thus have
$$P\{X_n = y \mid X_0 = x\} = \sum_{m=1}^n P\{X_n = y, T_y = m \mid X_0 = x\} = \sum_{m=1}^n P\{X_n = y \mid T_y = m, X_0 = x\}P\{T_y = m \mid X_0 = x\}$$
$$= \sum_{m=1}^n P\{X_n = y \mid X_m = y\}P\{T_y = m \mid X_0 = x\} = \sum_{m=1}^n F_m(x, y)P_{n-m}(y, y)$$

Then write
$$\hat{P}(x, y; s) = \sum_{n=0}^\infty s^n P_n(x, y) = \delta_{x,y} + \sum_{n=1}^\infty s^n P_n(x, y) = \delta_{x,y} + \sum_{n=1}^\infty s^n\Big(\sum_{m=1}^n F_m(x, y)P_{n-m}(y, y)\Big)$$
by the previous result. Interchanging the summations we get the result
$$\hat{P}(x, y; s) = \delta_{x,y} + \hat{F}(x, y; s)\hat{P}(y, y; s)$$

We note that everything is non-negative and so monotone increasing in $s \le 1$. From
$$\hat{P}(y, y; s) = \frac{1}{1 - \hat{F}(y, y; s)}$$
we see that as $\hat{F}(y, y; 1)$ increases to one, $\hat{P}(y, y; 1)$ increases to infinity. Similarly, from
$$\hat{F}(y, y; s) = \frac{\hat{P}(y, y; s) - 1}{\hat{P}(y, y; s)}$$
we see that as $\hat{P}(y, y; 1)$ increases to infinity, $\hat{F}(y, y; 1)$ increases to one. Thus $y$ is recurrent if and only if $\sum_n P_n(y, y) = \infty$.

4.3 Invariant probabilities

We now want to address the long time behavior of the Markov chain $X_n$ and in particular to study its ergodic properties. We assume that the state space $S$ is finite (although this is not necessary for the method used; compactness is) and that the transition probabilities are uniformly positive:
$$P(x, y) \ge \frac{\delta}{N}, \quad x, y \in S,$$
for some $\delta > 0$ and with $N = \#(S)$. With simple modifications the arguments below extend to the case where this condition holds for $P$ raised to some fixed power. We are simply requiring that the Markov chain be irreducible and aperiodic. Irreducible means that every state can be reached from every other state in a finite number of time steps (at most $N$) with positive probability. Aperiodic means that the greatest common divisor of return times (with positive probability) to states is one, for all states.

We will prove the following basic theorem. For any probability vectors $\pi_1$ and $\pi_2$ we have that
$$\|\pi_1 P - \pi_2 P\| \le \Big(1 - \frac{\delta}{2}\Big)\|\pi_1 - \pi_2\|,$$
which means that $P$ is a strict contraction when acting on differences of probability vectors.

Using this theorem we will then prove, by the contraction mapping iteration process, that $P$ has a unique invariant probability vector $\pi$,
$$\pi = \pi P \quad \Big(\text{in components: } \pi(x) = \sum_{y\in S}\pi(y)P(y, x), \ x \in S\Big),$$
and that this invariant probability is approached exponentially fast for any starting probability $\pi_0$:
$$\|\pi_0 P^n - \pi\| \le 2\rho^n, \quad 0 < \rho < 1 \ \Big(\rho = 1 - \frac{\delta}{2}\Big)$$

To prove the contraction property, let $q = \pi_1 - \pi_2$, which is a vector with positive and negative entries such that $\sum_{y\in S}q(y) = 0$. Let $q = q^+ + q^-$, where $q^+$ has all non-negative elements and $q^-$ has all non-positive ones. Clearly $\sum_{y\in S}q^+(y) = -\sum_{y\in S}q^-(y)$ and $\|q\| = \|q^+\| + \|q^-\| = 2\|q^+\|$. We now have the following:
$$\|qP\| = \sum_y|qP(y)| = \sum_y|q^+P(y) + q^-P(y)| = \sum_{y\in S^+}[q^+P(y) + q^-P(y)] - \sum_{y\in S^-}[q^+P(y) + q^-P(y)]$$

and continuing,
$$= q^+\sum_{y\in S^+}P + q^-\sum_{y\in S^+}P - q^+\sum_{y\in S^-}P - q^-\sum_{y\in S^-}P = q^+\Big(\sum_{y\in S^+}P - \sum_{y\in S^-}P\Big) - q^-\Big(\sum_{y\in S^-}P - \sum_{y\in S^+}P\Big)$$
$$\le \|q^+\|\Big(1 - \frac{N^-\delta}{N}\Big) + \|q^+\|\Big(1 - \frac{N^+\delta}{N}\Big) = \|q^+\|(2 - \delta) = 2\|q^+\|\Big(1 - \frac{\delta}{2}\Big) = \|q\|\Big(1 - \frac{\delta}{2}\Big)$$
which gives what we wanted. Here $S^{\pm}$ denote the sets of states where the entries of $q$ are positive and negative, respectively, and $N^{\pm}$ are the sizes of these sets.

    which gives what we wanted. Here S denotes the set of states where the entries in thesum are positive and negative, respectively, and N is the size of these sets.

    To show that pi = piP has a unique invariant probability vector solution, we define thesequence pin = pin1P , with pi0 any probability vector. We have that

    ||pin pin1|| = ||(pin1 pin2)P || < ||pin1 pin2||where = 1 2 < 1. Iterating backwards we have that

    ||pin pin1|| < n1||pi1 pi0||From this we conclude that the probability vectors pin are a Cauchy sequence

    supm||pin+m pin|| 0

    as n and therefore pin has a limit pi, which is a probability vector. Passing to thelimit in pin = pin1P we see that pi is an invariant probability vector and since it is theunique limit of a Cauchy sequence it must be the unique invariant probability vector. Wealso have exponential convergence

    ||pi0Pn pi|| = ||pi0Pn piPn|| < n||pi0 pi||
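A quick Python sketch of this fixed-point iteration (an added example, reusing an arbitrary chain): iterate $\pi_n = \pi_{n-1}P$ until the invariant $\pi$ is reached, exponentially fast:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi = np.array([1.0, 0.0, 0.0])   # any starting probability vector pi_0
for _ in range(100):
    pi = pi @ P                  # contraction: error shrinks by rho = 1 - delta/2

print(pi)                        # the invariant probabilities
print(np.allclose(pi, pi @ P))   # pi = pi P holds: True
```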

    4.4 The ergodic theorem

For a Markov chain in a finite state space the ergodic theorem is valid under the hypotheses and results of the previous section, and gives an important interpretation of the invariant probabilities $\pi(x)$. We have that
$$\frac{1}{n}\sum_{j=1}^n 1\{X_j = z\} \to \pi(z)$$
in mean square as $n \to \infty$, for any $z \in S$ and any initial probability vector $P\{X_0 = x\} = \pi_0(x)$. In words, the invariant probability vector is the limit in mean square of the relative time spent in state $z$ as $n \to \infty$.

    More generally, for any function f(x) on the state space we have

$$\frac{1}{n}\sum_{j=1}^n f(X_j) \to \sum_{y\in S}\pi(y)f(y) = \pi f$$
in mean square as $n \to \infty$, for any initial probability vector $P\{X_0 = x\} = \pi_0(x)$. If we take expectations on the left we have

$$E_x\Big\{\frac{1}{n}\sum_{j=1}^n f(X_j)\Big\} = \frac{1}{n}\sum_{j=1}^n E_x\{f(X_j)\} = \frac{1}{n}\sum_{j=1}^n P^j f(x)$$

If we let $\pi_0 = \delta_x$, the probability vector concentrated at $x$, we have
$$E_x\Big\{\frac{1}{n}\sum_{j=1}^n f(X_j)\Big\} = \frac{1}{n}\sum_{j=1}^n \pi_0 P^j f \to \pi f$$
by the results of the previous section. We want to show here that not only do the means converge, but the random time averages converge in mean square to the average with respect to the invariant probability.

The main step in proving the ergodic theorem is the introduction of a new function $\Phi(x)$ that converts approximately the time average into an expression that is easy to handle because it is a martingale. This function is the solution of the Poisson equation (named by analogy with PDEs)
$$(P - I)\Phi = -f + \pi f$$
More explicitly, this system of equations for $\Phi = (\Phi(x))_{x\in S}$ has the form
$$\Phi(x) = f(x) - \pi f + \sum_{y\in S}P(x, y)\Phi(y), \quad x \in S$$

Note that the right hand side of the system $(P - I)\Phi = -f + \pi f$ has mean zero, or inner product zero, with respect to $\pi$: $\pi(f - \pi f) = 0$. This is then a necessary condition for the solvability of the Poisson equation, since the $\pi$ average of the left side is always zero, for any $\Phi$:
$$\pi(P - I)\Phi = 0, \quad \text{or} \quad \sum_{x\in S}\pi(x)\Big[\sum_{y\in S}P(x, y)\Phi(y) - \Phi(x)\Big] = 0$$
Of course, the Poisson equation does not have a unique solution, since $\mathbf{1}$ is an invariant right vector: $(P - I)\mathbf{1} = 0$. Thus $\Phi + c\mathbf{1}$ is a solution for any constant $c$.

Without loss of generality we may assume that $f$ is such that $\pi f = 0$, as we may replace $f$ by $f - \pi f$. We now show that the Poisson equation has a solution. If there is a solution, we can write
$$\Phi = (I - P)^{-1}f = \sum_{n=0}^\infty P^n f$$
The sum is, however, convergent, because
$$\|P^n f\| = \|(P^n - \pi)f\| \le 2\rho^n\|f\|$$
To see this, we let $\pi_x$ be the probability row vector that is equal to one at $x$ and zero elsewhere. We then have that $P^n f(x) = \pi_x P^n f$, and hence $\|P^n f - \pi f\| = \max_x|\pi_x P^n f - \pi f| \le \max_x\|\pi_x P^n - \pi\|\,\|f\| \le 2\rho^n\|f\|$ by the results of the previous section.

Once we have fixed a solution $\Phi(x)$ we form the collapsing (telescoping) sum
$$\Phi(X_{n+1}) - \Phi(X_1) = \sum_{j=1}^n(\Phi(X_{j+1}) - \Phi(X_j)) = \sum_{j=1}^n[\Phi(X_{j+1}) - E\{\Phi(X_{j+1}) \mid X_j\} + (E\{\Phi(X_{j+1}) \mid X_j\} - \Phi(X_j))]$$
$$= \sum_{j=1}^n[\Phi(X_{j+1}) - P\Phi(X_j)] + \sum_{j=1}^n[P\Phi(X_j) - \Phi(X_j)]$$

Using the Poisson equation that $\phi$ satisfies (with $\pi f = 0$ it reads $P\phi - \phi = -f$), rearranging, and dividing by $n$, we have
$$\frac{1}{n}\sum_{j=1}^{n} f(X_j) = -\frac{1}{n}\big(\phi(X_{n+1}) - \phi(X_1)\big) + \frac{1}{n}M_{n+1},$$
where
$$M_{n+1} = \sum_{j=1}^{n}\big[\phi(X_{j+1}) - P\phi(X_j)\big].$$
This representation of the time average whose limit we want is important because, apart from the first term on the right, which goes to zero as $n\to\infty$, what remains is a martingale $M_{n+1}$ divided by $n$. The martingale property is
$$E\{M_{n+1}\mid X_0, X_1, \dots, X_n\} = M_n,$$
which is easily verified since the conditional expectation of the last term in the sum for $M_{n+1}$ is zero. We also have that $E_x\{M_{n+1}\} = 0$.

Now the proof can be completed by noting that the variance of $M_{n+1}$ has the form
$$E_x\{M_{n+1}^2\} = E_x\Big\{\sum_{j=1}^{n}\big[\phi(X_{j+1}) - P\phi(X_j)\big]^2\Big\}$$
because the cross terms have mean zero, just as they do for sums of zero mean independent (or merely uncorrelated) random variables. This is the essential point of introducing $\phi$ and using it to construct $M_n$: we can calculate variances as if we were dealing with sums of independent, mean zero random variables. We now conclude that
$$\frac{1}{n^2}E_x\{M_{n+1}^2\} \le \frac{\text{constant}}{n},$$
which implies the result we want:
$$E_x\Big\{\Big[\frac{1}{n}\sum_{j=1}^{n} f(X_j)\Big]^2\Big\} \to 0$$
in the case $\pi f = 0$, which we have assumed as noted above.
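The following simulation sketch (illustrative, same hypothetical three-state chain as above) checks the ergodic theorem: the mean square error of the time average about $\pi f$ decays roughly like $1/n$.

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])
    f = np.array([1.0, -2.0, 0.5])

    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
    pif = pi @ f

    def time_average(n):
        x, total = 0, 0.0
        for _ in range(n):
            x = rng.choice(3, p=P[x])
            total += f[x]
        return total / n

    for n in [100, 1000, 10000]:
        mse = np.mean([(time_average(n) - pif) ** 2 for _ in range(30)])
        print(n, mse)                 # roughly proportional to 1/n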

    4.5 The central limit theorem for Markov chains

Let
$$\sigma^2(x) = E_x\big\{(\phi(X_1) - P\phi(x))^2\big\}.$$
Under the same conditions for which we have proved the ergodic theorem in the previous section we also have a central limit theorem, as follows. The scaled difference satisfies
$$\sqrt{n}\Big(\frac{1}{n}\sum_{j=1}^{n} f(X_j) - \pi f\Big) \to N(0, \pi\sigma^2), \quad n\to\infty,$$
in distribution, where in particular
$$\lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}\sigma^2(X_{j+1}) = \pi\sigma^2$$
in mean square, by the ergodic theorem. In fact this CLT follows from the more general martingale central limit theorem, in view of the representation of the time average of $f$ obtained in the previous section as an asymptotically small term plus a scaled martingale.

We will not go through the detailed calculations, but note here that it is enough to show that for any $\theta\in\mathbb{R}$ we have
$$\lim_{n\to\infty}E_x\big\{e^{i\frac{\theta}{\sqrt{n}}M_{n+1}}\big\} = e^{-\frac{\theta^2}{2}\pi\sigma^2},$$
which proves the CLT through the limit of characteristic functions. The ergodic theorem is used here as already noted. We also note that whereas the solution $\phi$ of the Poisson equation played a role only in streamlining the proof of the ergodic theorem, in the CLT it plays a basic role, as it enters into the form of the limit variance. There are other, equivalent forms for the limit variance $\pi\sigma^2$, written as sums which correspond to expansions of $\phi$.
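To make the limit variance concrete, here is an illustrative computation (same hypothetical chain as in the earlier sketches): solve the Poisson equation for $\phi$, evaluate $\pi\sigma^2$, and compare it with the empirical variance of $\sqrt{n}$ times the centered time average.

    import numpy as np

    rng = np.random.default_rng(1)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])
    f = np.array([1.0, -2.0, 0.5])

    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
    f = f - pi @ f                                   # center: pi f = 0

    # solve (P - I) phi = -f, fixing the additive constant by pi phi = 0
    A = np.vstack([P - np.eye(3), pi])
    phi = np.linalg.lstsq(A, np.append(-f, 0.0), rcond=None)[0]

    # sigma^2(x) = sum_y P(x,y) (phi(y) - P phi(x))^2, then pi sigma^2
    D = phi[None, :] - (P @ phi)[:, None]
    limit_var = pi @ (P * D ** 2).sum(axis=1)

    n, samples = 2000, []
    for _ in range(200):
        x, s = 0, 0.0
        for _ in range(n):
            x = rng.choice(3, p=P[x]); s += f[x]
        samples.append(s / np.sqrt(n))
    print("pi sigma^2 =", limit_var, " empirical =", np.var(samples))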

4.6 Expected number of visits to a state and the invariant probabilities

Let $x$, $z$ and $y$ be three states in $S$, not necessarily all different, and consider the expected number of visits to $z$ before reaching $y$, starting from $x$:
$$u(x;z,y) = E_x\Big\{\sum_{n=1}^{\infty}\mathbf{1}\{X_n = z\}\mathbf{1}\{T_y \ge n\}\Big\} = E_x\Big\{\sum_{n=1}^{T_y}\mathbf{1}\{X_n = z\}\Big\},$$
where $T_y$ is the first time to reach $y$:
$$T_y = \inf\{n \ge 1 \mid X_n = y\}.$$
This is a stopping time, and $\mathbf{1}\{T_y \ge n\}$ depends on, or is a function of, only $\{X_0, X_1, \dots, X_{n-1}\}$. When we start at $y$, then with the current definition $T_y$ is the first time to return to $y$. Clearly
$$\sum_z u(x;z,y) = E_x\{T_y\},$$
which is the expected time to reach $y$. When $y = x$, then $\sum_z u(x;z,x)$ is the expected time to return to $x$, after starting from it. For any bounded function $f$ on $S$ we have that
$$u_f(x,y) = \sum_{z\in S} u(x;z,y)f(z) = E_x\Big\{\sum_{n=1}^{T_y} f(X_n)\Big\}.$$

To get a recursion relation for $u$, as a function of $z$ and not of the starting point $x$, we write
$$u(x;z,y) = E_x\Big\{\sum_{n=1}^{\infty}\mathbf{1}\{X_n = z\}\sum_w\mathbf{1}\{X_{n-1} = w\}\mathbf{1}\{T_y \ge n\}\Big\}$$
$$= \sum_w\sum_{n=1}^{\infty}E_x\big\{E\{\mathbf{1}\{X_n = z\}\mathbf{1}\{X_{n-1} = w\}\mathbf{1}\{T_y \ge n\}\mid X_0, X_1, \dots, X_{n-1}\}\big\},$$
using iterated conditional expectation. Using the properties of the stopping time and the Markov property we have
$$\sum_w\sum_{n=1}^{\infty}E_x\big\{E\{\mathbf{1}\{X_n = z\}\mathbf{1}\{X_{n-1} = w\}\mathbf{1}\{T_y \ge n\}\mid X_0, X_1, \dots, X_{n-1}\}\big\} = \sum_w P(w,z)\sum_{n=1}^{\infty}E_x\{\mathbf{1}\{X_{n-1} = w\}\mathbf{1}\{T_y \ge n\}\}. \quad (13)$$

We interchange expectation and summation to get
$$\sum_w P(w,z)E_x\Big\{\sum_{n=1}^{\infty}\mathbf{1}\{X_{n-1} = w\}\mathbf{1}\{T_y \ge n\}\Big\} = \sum_w P(w,z)E_x\Big\{\sum_{n=1}^{T_y}\mathbf{1}\{X_{n-1} = w\}\Big\}. \quad (14)$$
By time homogeneity we have
$$\sum_w P(w,z)E_x\Big\{\sum_{n=1}^{T_y}\mathbf{1}\{X_{n-1} = w\}\Big\} = \sum_w P(w,z)E_x\Big\{\sum_{n=0}^{T_y-1}\mathbf{1}\{X_n = w\}\Big\}$$
$$= \sum_w P(w,z)\Big(E_x\Big\{\sum_{n=1}^{T_y}\mathbf{1}\{X_n = w\}\Big\} + \delta_{xw} - \delta_{yw}\Big)$$
$$= \sum_w u(x;w,y)P(w,z) + P(x,z) - P(y,z), \quad (15)$$
where $\delta_{xw}$ equals one when $x = w$ and zero otherwise. In vector-matrix form we have the following system for $u$ as a row vector in $z$:
$$u(I - P) = \pi_x P - \pi_y P,$$
where $\pi_x$ is the probability vector with component equal to one at $z = x$ and zero otherwise. This system is always solvable since $(\pi_x P - \pi_y P)\mathbf{1} = 0$, that is, the right hand side is orthogonal to the right null-vector of $I - P$, which is $\mathbf{1}$. When $x = y$, the right hand side is zero and we see that, in this case, $u$ is proportional to the invariant probability vector $\pi$:
$$\pi(z) = \frac{u(x;z,x)}{\sum_z u(x;z,x)} = \frac{u(x;z,x)}{E_x\{T_x\}}.$$
In words, the expected number of visits to $z$ between successive visits to $x$, divided by the expected number of time steps between successive visits to $x$, is the invariant probability $\pi(z)$. Furthermore, when $x = z$ there is exactly one visit to $x$ between successive visits, and so
$$\pi(x) = \frac{1}{E_x\{T_x\}}.$$
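As an illustrative numerical check of $\pi(x) = 1/E_x\{T_x\}$ (same hypothetical chain as in the earlier sketches), the expected return time can be computed from the usual first-step equations for the hitting times of $x$:

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()

    for x in range(3):
        others = [z for z in range(3) if z != x]
        Q = P[np.ix_(others, others)]       # transitions among states != x
        # h(z) = E_z{T_x} solves (I - Q) h = 1 on the states z != x
        h = np.linalg.solve(np.eye(2) - Q, np.ones(2))
        ETx = 1.0 + P[x, others] @ h        # one step from x, then reach x
        print(x, pi[x], 1.0 / ETx)          # the last two columns agree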

4.7 Return times and the ergodic theorem

Consider again an ergodic Markov chain on a finite set $S$, with transition probabilities $P(x,y) > 0$. Fix a state $x$ and let $T_1, T_2, T_3, \dots$ be the successive return times to $x$, starting from it. We show first that these are independent, identically distributed random variables. We then use the ergodic theorem for the Markov chain to show that the mean return time to $x$ is equal to the reciprocal of the invariant probability $\pi(x)$.

We show independence for the first two return times. Let
$$F_n(x) = P\{T_1 = n\mid X_0 = x\} = P\{X_n = x, X_{n-1}\ne x, \dots, X_1\ne x\mid X_0 = x\}.$$
We then have
$$\begin{aligned}
P\{T_2 = m, T_1 = n\mid X_0 = x\}
&= P\{X_{n+m} = x, X_{n+m-1}\ne x, \dots, X_{n+1}\ne x, X_n = x, X_{n-1}\ne x, \dots, X_1\ne x\mid X_0 = x\}\\
&= P\{X_{n+m} = x, X_{n+m-1}\ne x, \dots, X_{n+1}\ne x\mid X_n = x, X_{n-1}\ne x, \dots, X_1\ne x, X_0 = x\}\\
&\qquad\times P\{X_n = x, X_{n-1}\ne x, \dots, X_1\ne x\mid X_0 = x\}\\
&= P\{X_{n+m} = x, X_{n+m-1}\ne x, \dots, X_{n+1}\ne x\mid X_n = x\}\,F_n(x) \quad\text{(Markov property)}\\
&= P\{X_m = x, X_{m-1}\ne x, \dots, X_1\ne x\mid X_0 = x\}\,F_n(x) \quad\text{(time homogeneity)}\\
&= F_m(x)F_n(x).
\end{aligned}$$
We can also write this as
$$\begin{aligned}
P\{T_2 = m, T_1 = n\mid X_0 = x\}
&= P\{X_{n+m} = x, X_{n+m-1}\ne x, \dots, X_{n+1}\ne x\mid X_n = x\}\,F_n(x) \quad\text{(Markov property)}\\
&= P\{T_1 + T_2 = n + m\mid T_1 = n, X_n = x\}\,P\{T_1 = n\mid X_0 = x\}\\
&= P\{T_2 = m\mid X_{T_1} = x\}\,P\{T_1 = n\mid X_0 = x\},
\end{aligned}$$
from which we see that $P\{T_2 = m\mid X_{T_1} = x\} = F_m(x)$. Both the Markov property and time homogeneity have been used. Similarly, $T_1, T_2, T_3, \dots$ are independent random variables. Their distributions are identical because of the time homogeneity of the Markov chain.

We argue that we can replace $n$ by $\sum_{k=1}^{N}T_k$ in the ergodic theorem. That is,
$$\lim_{N\to\infty}\frac{1}{\sum_{k=1}^{N}T_k}\sum_{n=1}^{\sum_{k=1}^{N}T_k} f(X_n) = \pi f$$
in mean square. Setting $f(y) = \mathbf{1}\{y = x\}$, the sum on the left counts exactly the $N$ returns to $x$, and we obtain
$$\lim_{N\to\infty}\frac{1}{\frac{1}{N}\sum_{k=1}^{N}T_k} = \pi(x).$$

But by the law of large numbers the left hand side is just $1/E\{T_1\}$. To complete the proof we need to show that we can indeed replace the index $n$ in the ergodic theorem by $M_N = \sum_{k=1}^{N}T_k$, in probability, as $N\to\infty$. But $\frac{M_N}{N}\to\mu$ in mean square (hence in probability), where $\mu = E\{T_1\} > 0$, by the standard law of large numbers. In more detail, we need to show that
$$\Big|\frac{1}{M_N}\sum_{n=1}^{M_N} f(X_n) - \frac{1}{[\mu N]}\sum_{n=1}^{[\mu N]} f(X_n)\Big| \to 0$$
in probability as $N\to\infty$. Here $[a]$ denotes the integer part of $a$. But we have the estimate
$$\Big|\frac{1}{M_N}\sum_{n=1}^{M_N} f(X_n) - \frac{1}{[\mu N]}\sum_{n=1}^{[\mu N]} f(X_n)\Big| \le 2\Big|\frac{[\mu N]}{M_N} - 1\Big|\,\|f\|,$$
and since $\frac{M_N}{[\mu N]}\to 1$ in probability, we also have that $\frac{[\mu N]}{M_N}\to 1$ in probability, which completes the proof.
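A short simulation sketch of this renewal picture (same hypothetical chain): the sample mean of the successive return times to a state approaches $1/\pi(x)$.

    import numpy as np

    rng = np.random.default_rng(2)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()

    x0, state, t, times = 0, 0, 0, []
    for _ in range(100000):
        state = rng.choice(3, p=P[state]); t += 1
        if state == x0:                    # a return to x0 completes
            times.append(t); t = 0
    print("mean return time:", np.mean(times), " 1/pi(x):", 1.0 / pi[x0])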

    4.8 MLE for Markov chains

Let $\{X_n,\ n\ge 0\}$ be a Markov chain on a finite state space $S$ with transition probability $P(x,y;\theta)$, where $\theta$ is a real valued parameter. The goal is to estimate $\theta$ using the observations $\{X_i,\ 0\le i\le n\}$. We assume that the Markov chain is ergodic, uniformly in $\theta$ in some fixed interval of interest, and that we have a positive lower bound for $P(x,y;\theta)$. In this section we will carry out the following.

1. Let $Y_n = (X_n, X_{n-1})^T$, $n\ge 1$, which is also a Markov chain. We will show that the transition probability for $Y_n$ is given by $Q(x_1,x_2;y_1,y_2) = P\{Y_n = (y_1,y_2)^T\mid Y_{n-1} = (x_1,x_2)^T\} = \mathbf{1}\{x_1 = y_2\}P(y_2,y_1)$.

2. Let $\pi$ be the invariant probability vector for $X_n$: $\sum_{x\in S}\pi(x)P(x,y) = \pi(y)$. We will show that an invariant probability vector for $Y_n$ has components $\pi(x_2)P(x_2,x_1)$. By calculating $Q^2$, the two-step transition probabilities, we conclude that this invariant vector is unique and is approached exponentially fast. We in fact reduce this problem to the standard case analyzed in earlier sections.

3. Suppose the initial state $x_0$ is given, and suppose $\theta^*$ is the true value of the underlying parameter. We use the ergodic theorem to show that as $n\to\infty$ the normalized log likelihood function
$$l_n(\theta) = \frac{1}{n}\log\big[P\{X_0 = x_0, X_1 = x_1, \dots, X_n = x_n;\theta\}\big] \to \sum_{x,y\in S}\pi(x;\theta^*)P(x,y;\theta^*)\log P(x,y;\theta) := l(\theta)$$
in probability.

4. Assume $P(x,y;\theta)$ is twice differentiable with respect to $\theta$. We show that $l'(\theta^*) = 0$ and $l''(\theta^*) < 0$.

5. We also use the delta method (as in the i.i.d. case) and the central limit theorem for Markov chains to get a CLT for the MLE $\hat\theta_n$ (where $l_n'(\hat\theta_n) = 0$).

We now go on to analyze the statements made above.

1. By the definition of conditional probability,
$$Q(x_1,x_2;y_1,y_2) = P\{Y_n = (y_1,y_2)^T\mid Y_{n-1} = (x_1,x_2)^T\} = \frac{P\{Y_n = (y_1,y_2)^T, Y_{n-1} = (x_1,x_2)^T\}}{P\{Y_{n-1} = (x_1,x_2)^T\}}$$
$$= \frac{\pi_{n-2}(x_2)P(x_2,x_1)\,\mathbf{1}\{x_1 = y_2\}P(y_2,y_1)}{\pi_{n-2}(x_2)P(x_2,x_1)} = \mathbf{1}\{x_1 = y_2\}P(y_2,y_1).$$

2. We verify that
$$\pi(y_2)P(y_2,y_1) = \sum_{x_1,x_2}\pi(x_2)P(x_2,x_1)Q(x_1,x_2;y_1,y_2) = \sum_{x_1,x_2}\pi(x_2)P(x_2,x_1)\mathbf{1}\{x_1 = y_2\}P(y_2,y_1),$$
which is a true relation because $\sum_x\pi(x)P(x,y) = \pi(y)$. We also have that
$$Q^2(x_1,x_2;y_1,y_2) = \sum_{z_1,z_2}Q(x_1,x_2;z_1,z_2)Q(z_1,z_2;y_1,y_2) = P(x_1,y_2)P(y_2,y_1) > 0,$$
the positivity being for all (finitely many) states, and so the hypothesis for the ergodic theorem holds for $\{Y_n\}$.

3. Clearly we have, by the ergodic theorem, which we have shown applies,
$$l_n(\theta) = \frac{1}{n}\sum_{j=1}^{n}\log P(X_{j-1}, X_j;\theta) + \frac{1}{n}\log\pi_0(X_0) \;\xrightarrow{n\to\infty}\; \sum_{x,y\in S}\pi(x;\theta^*)P(x,y;\theta^*)\log P(x,y;\theta) = l(\theta),$$
where we indicate dependence on the parameter(s) explicitly.

4. Differentiability of $l(\theta)$ with respect to $\theta$ follows from the uniformity of the ergodic limit with respect to $\theta$. That is, the convergence is in mean square (or in probability), uniformly with respect to the parameter, so we may differentiate. We have
$$l'(\theta) = \sum_{x,y\in S}\pi(x;\theta^*)P(x,y;\theta^*)\frac{P'(x,y;\theta)}{P(x,y;\theta)},$$
where the prime denotes the derivative with respect to $\theta$, assumed a scalar parameter in $[\theta_1,\theta_2]$ with $\theta^*\in(\theta_1,\theta_2)$. At $\theta = \theta^*$ we have
$$l'(\theta^*) = \sum_{x,y\in S}\pi(x;\theta^*)P'(x,y;\theta^*).$$
But for any $\theta$ we have $\sum_{x,y}\pi(x;\theta)P(x,y;\theta) = 1$, and so, differentiating, we get
$$0 = \sum_{x,y}\pi'(x;\theta)P(x,y;\theta) + \sum_{x,y}\pi(x;\theta)P'(x,y;\theta) = \sum_x\pi'(x;\theta) + \sum_{x,y}\pi(x;\theta)P'(x,y;\theta) = \sum_{x,y}\pi(x;\theta)P'(x,y;\theta),$$
since $\sum_x\pi'(x;\theta) = 0$. This implies that $l'(\theta^*) = 0$. In the same way we can calculate
$$l''(\theta) = \sum_{x,y\in S}\pi(x;\theta^*)P(x,y;\theta^*)\Big(\frac{P''(x,y;\theta)}{P(x,y;\theta)} - \frac{(P'(x,y;\theta))^2}{P^2(x,y;\theta)}\Big),$$
which, since $\sum_y P''(x,y;\theta^*) = 0$ for each $x$, leads to
$$l''(\theta^*) = -\sum_{x,y\in S}\pi(x;\theta^*)\frac{(P'(x,y;\theta^*))^2}{P(x,y;\theta^*)} = -I_{\theta^*} < 0,$$
with $I_{\theta^*}$ the Fisher information.

5. The maximum likelihood estimator $\hat\theta_n$ of $\theta$ is defined as the maximizer of $l_n(\theta)$. From the ergodic theorem we know that $l_n(\theta)$ is close in mean square to $l(\theta)$, which is concave near the true value $\theta^*$. Therefore (and this is not so easy to prove in detail, even in the iid case) for $\theta$ close enough to $\theta^*$ and $n$ large enough, $l_n(\theta)$ will be concave with high probability. This means that $\hat\theta_n$ must satisfy $l_n'(\hat\theta_n) = 0$, and we have that $P\{|\hat\theta_n - \theta^*| > \epsilon\}\to 0$ as $n\to\infty$ for any $\epsilon > 0$. Define the fluctuation error $Z_n$ by
$$\hat\theta_n = \theta^* + \frac{1}{\sqrt{n}}Z_n.$$
To get a CLT for the error, as in the iid case, we use the delta method:
$$0 = l_n'(\hat\theta_n) = l_n'\Big(\theta^* + \frac{1}{\sqrt{n}}Z_n\Big) = l_n'(\theta^*) + l_n''(\theta^*)\frac{1}{\sqrt{n}}Z_n + \dots$$

where the dots signify terms that go to zero faster than $1/\sqrt{n}$ in mean square or in probability (not easy to prove in detail). Ignoring the higher order terms we get, approximately,
$$Z_n = -\frac{\sqrt{n}\,l_n'(\theta^*)}{l_n''(\theta^*)}.$$
By differentiation we have that
$$l_n'(\theta) = \frac{1}{n}\sum_{j=1}^{n}\frac{P'(X_{j-1},X_j;\theta)}{P(X_{j-1},X_j;\theta)},$$
$$l_n''(\theta) = \frac{1}{n}\sum_{j=1}^{n}\Big(\frac{P''(X_{j-1},X_j;\theta)}{P(X_{j-1},X_j;\theta)} - \frac{(P'(X_{j-1},X_j;\theta))^2}{P^2(X_{j-1},X_j;\theta)}\Big),$$
where we have ignored the $\pi_0$ term that goes to zero in the limit. The ergodic theorem applies to $l_n''(\theta^*)$, which tends to $l''(\theta^*) = -I_{\theta^*}$ in mean square (or in probability). The next step is to apply the central limit theorem to $\sqrt{n}\,l_n'(\theta^*)$, if possible, where
$$\sqrt{n}\,l_n'(\theta^*) = \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{P'(X_{j-1},X_j;\theta^*)}{P(X_{j-1},X_j;\theta^*)}.$$
As discussed earlier in these notes, we would normally first transform the quantity of interest using the solution of a suitable Poisson equation, so that it is expressed as a martingale plus a term that goes to zero in the limit. The key observation here, however, is that $n\,l_n'(\theta^*)$ is already a martingale, with zero mean. In fact, we can verify the defining property of a martingale,
$$E\{n\,l_n'(\theta^*)\mid X_0, X_1, \dots, X_{n-1}\} = (n-1)\,l_{n-1}'(\theta^*),$$
which clearly follows from the fact that
$$E\Big\{\frac{P'(X_{n-1},X_n;\theta^*)}{P(X_{n-1},X_n;\theta^*)}\,\Big|\,X_{n-1}\Big\} = \sum_y P'(X_{n-1},y;\theta^*) = 0.$$

Now for a zero mean martingale we have the key property, as discussed in earlier sections,
$$\operatorname{var}\big(\sqrt{n}\,l_n'(\theta^*)\big) = E\Big\{\Big(\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{P'(X_{j-1},X_j;\theta^*)}{P(X_{j-1},X_j;\theta^*)}\Big)^2\Big\} = E\Big\{\frac{1}{n}\sum_{j=1}^{n}\Big(\frac{P'(X_{j-1},X_j;\theta^*)}{P(X_{j-1},X_j;\theta^*)}\Big)^2\Big\}.$$
We can now apply the ergodic theorem (we only need convergence of the mean here) to the last sum, and we see that $\operatorname{var}(\sqrt{n}\,l_n'(\theta^*))\to I_{\theta^*}$, with $I_{\theta^*}$ the Fisher information. The CLT applied to the martingale $\sqrt{n}\,l_n'(\theta^*)$ now tells us that it converges in distribution (in law, weakly) to an $N(0, I_{\theta^*})$ random variable. Application of Slutsky's theorem leads to the CLT for the error of the MLE,
$$\sqrt{n}\big(\hat\theta_n - \theta^*\big) \to N\Big(0, \frac{1}{I_{\theta^*}}\Big) \text{ in distribution},$$
just as in the iid case. The Cramer-Rao lower bound applies here (as for any asymptotically consistent estimator) and we conclude that the MLE is asymptotically efficient, as we should expect. Thus all of the theory of MLE carries over to Markov chains. What is a lot more difficult now is the actual calculation of the MLE $\hat\theta_n$, which can be done iteratively by the expectation-maximization (EM) algorithm or by other optimization methods, depending on the problem at hand.
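To illustrate, here is a sketch for a hypothetical two-state chain with one parameter: $P(0,1;\theta) = \theta$, $P(0,0;\theta) = 1-\theta$, and a fixed second row. For this model the maximizer of $l_n(\theta)$ has the closed form $n_{01}/(n_{00}+n_{01})$ in terms of transition counts, the Fisher information is $I_{\theta^*} = \pi(0;\theta^*)/(\theta^*(1-\theta^*))$, and the empirical variance of $\sqrt{n}(\hat\theta_n - \theta^*)$ should be close to $1/I_{\theta^*}$:

    import numpy as np

    rng = np.random.default_rng(3)
    theta_star, n = 0.3, 2000

    def Pmat(theta):
        return np.array([[1.0 - theta, theta],
                         [0.3, 0.7]])       # second row fixed (hypothetical)

    errors = []
    for _ in range(200):
        P = Pmat(theta_star)
        x, counts = 0, np.zeros((2, 2))
        for _ in range(n):
            y = rng.choice(2, p=P[x]); counts[x, y] += 1; x = y
        theta_hat = counts[0, 1] / counts[0].sum()   # maximizes l_n(theta)
        errors.append(np.sqrt(n) * (theta_hat - theta_star))

    w, V = np.linalg.eig(Pmat(theta_star).T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
    I = pi[0] / (theta_star * (1.0 - theta_star))
    print("empirical var:", np.var(errors), " 1/I:", 1.0 / I)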

    4.9 Bayesian filtering

We want to derive recursive filtering equations in the Markov chain context. Let $X = \{X_n : n\ge 0\}$ be a Markov chain on a finite state space $S$ that is not observed directly, and let $Z = (Z_n : n\ge 0)$ be an observed, noise corrupted version of $X$ defined via the path probability densities given the Markov chain:
$$P\{Z_0\in dz_0, Z_1\in dz_1, \dots, Z_n\in dz_n\mid X\} = \prod_{i=0}^{n} f(z_i; X_i)\,dz_i,$$
where $(f(\cdot\,;x) : x\in S)$ is a family of given density functions. The noisy observations are conditionally independent given the Markov chain. Let $\mu = (\mu(x) : x\in S)$ be a prior distribution of the initial state $X_0$ (i.e. $\mu(x) = P(X_0 = x)$) and denote by
$$\pi_n(x) = P(X_n = x\mid Z_0, \dots, Z_n)$$
the posterior of the state at time $n$ given the observations. We want to compute $\pi_{n+1}(x)$ recursively from $\pi_n(x)$, assuming that the transition probabilities $P(x,y)$, $x,y\in S$, of the Markov chain are known.

Let $Z^{(n)} = \{Z_0, Z_1, \dots, Z_n\}$ be the observed noisy Markov chain up to time $n$. From the properties of conditional probabilities we have that
$$\pi_n(x) = P(X_n = x\mid Z^{(n)}) = \frac{P\{X_n = x, Z^{(n)}\}}{P\{Z^{(n)}\}} = \frac{P\{Z_n\mid X_n = x, Z^{(n-1)}\}\,P\{X_n = x, Z^{(n-1)}\}}{P\{Z^{(n)}\}},$$
where $P\{Z^{(n)}\}$ is the joint (unconditional) density of the noisy Markov chain evaluated at the observation $Z^{(n)}$. But by the law of total probability and the Markov property we have
$$P\{X_n = x, Z^{(n-1)}\} = \sum_{z\in S}P\{X_n = x\mid X_{n-1} = z, Z^{(n-1)}\}\,P\{X_{n-1} = z, Z^{(n-1)}\} = \sum_{z\in S}P\{X_n = x\mid X_{n-1} = z\}\,P\{X_{n-1} = z, Z^{(n-1)}\}$$
and
$$P\{X_{n-1} = z, Z^{(n-1)}\} = P\{X_{n-1} = z\mid Z^{(n-1)}\}\,P\{Z^{(n-1)}\} = \pi_{n-1}(z)\,P\{Z^{(n-1)}\}.$$
Note also that
$$P\{Z_n\mid X_n = x, Z^{(n-1)}\} = P\{Z_n\mid X_n = x\} = f(Z_n; x)$$
by the conditional independence of the noisy observations given the Markov chain.

We can now see how the recursion goes. We first update the state with the (assumed known) Markov chain transition probabilities:
$$\pi_{n-1}\ \to\ \pi_{n\mid n-1}(x) = \sum_{z\in S}\pi_{n-1}(z)P(z,x), \qquad \pi_0(x)\ \text{given}.$$
Then we update with the observation:
$$\pi_{n\mid n-1}\ \to\ \pi_n(x) = \frac{f(Z_n; x)\,\pi_{n\mid n-1}(x)}{\sum_{x'\in S} f(Z_n; x')\,\pi_{n\mid n-1}(x')}.$$
In applications there are two main limitations of this algorithm. First, the transition probabilities of the Markov chain are not known perfectly; they usually contain unknown parameters, which must be estimated. Second, the state-update step becomes computationally intensive when the dimension of the Markov chain is large, so Monte Carlo methods (particle methods) must be used. Combining filtering and maximum likelihood parameter estimation can be done with variants of the expectation-maximization (EM) algorithm.
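A minimal sketch of this two-step recursion (the observation densities $f(\cdot\,;x)$ are taken to be hypothetical Gaussians with state-dependent means; constants up to normalization cancel in the update):

    import numpy as np

    rng = np.random.default_rng(4)
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    means, s = np.array([0.0, 2.0]), 0.7      # f(z; x) = N(means[x], s^2)

    def fdens(z, x):
        return np.exp(-(z - means[x]) ** 2 / (2 * s ** 2))

    # simulate the hidden chain and the noisy observations
    T, x, X, Z = 50, 0, [], []
    for _ in range(T):
        x = rng.choice(2, p=P[x]); X.append(x)
        Z.append(means[x] + s * rng.standard_normal())

    post = np.array([0.5, 0.5])               # prior on the initial state
    for nstep in range(T):
        pred = post @ P                        # pi_{n|n-1}(x)
        upd = pred * np.array([fdens(Z[nstep], j) for j in range(2)])
        post = upd / upd.sum()                 # pi_n(x)
    print("true last state:", X[-1], " posterior:", post)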

    5 Random walks and connections with differential equations

We begin with the simple random walk on equally spaced points in an interval. Let $\{X_n\}$ denote the symmetric random walk on $\{x_0 = 0, x_1 = \Delta x, \dots, x_N = N\Delta x = a\}$, so that
$$P\{X_n = x_k\mid X_{n-1} = x_j\} = 1/2\ \text{when}\ k = j\pm 1,\ \text{and}\ = 0\ \text{otherwise},$$
where $x_j$ is an interior point, that is, it is not $0$ or $a$. The process is absorbed at $0$ and $a$:
$$P\{X_n = x_0\mid X_{n-1} = x_0\} = 1, \qquad P\{X_n = x_N\mid X_{n-1} = x_N\} = 1.$$
The matrix of transition probabilities $P(x,y)$ has the form
$$P = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0\\ 1/2 & 0 & 1/2 & \cdots & 0 & 0\\ & \ddots & \ddots & \ddots & & \\ 0 & 0 & \cdots & 1/2 & 0 & 1/2\\ 0 & 0 & \cdots & 0 & 0 & 1 \end{pmatrix}.$$
The process satisfies the stochastic difference equation $X_n = X_{n-1} + Z_n$ for $n = 1, 2, \dots$, with $X_0 = x$ given, for example, and with $\{Z_n\}$ independent identically distributed random variables taking the values $\pm\Delta x$ with probability $1/2$. The process stops when it reaches the boundary points.

We can derive difference equations for probabilities of interest such as
$$u_j^n = P\{T > n\mid X_0 = x_j\},$$
where $T$ is the first time to reach $x_0$ or $x_N$, the exit time from the interval. By the renewal method we find that
$$u_j^n = \frac{1}{2}\big(u_{j+1}^{n-1} + u_{j-1}^{n-1}\big), \quad j = 1, 2, \dots, N-1,\ n = 1, 2, 3, \dots,$$
with boundary conditions $u_0^n = u_N^n = 0$, $n = 0, 1, 2, \dots$, and initial condition $u_j^0 = 1$, $j = 1, 2, \dots, N-1$. In vector-matrix form this finite difference equation reads
$$u^n = P u^{n-1}, \quad n = 1, 2, 3, \dots,$$
where $u^n = (u_j^n)$ and where $u^0 = f$ with $f$ equal to one at all interior points and zero at the two boundary points. The eigenvalues and eigenvectors of the interior part of this matrix, which is symmetric and tridiagonal, can be computed explicitly using, for example, the discrete Fourier transform.
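The recursion $u^n = Pu^{n-1}$ is easy to iterate numerically. The following sketch tracks the survival probability at the midpoint; since the walk has period two, we compare values two steps apart, and the two-step decay factor approaches $\cos^2(\pi/N)$, the square of the largest interior eigenvalue:

    import numpy as np

    N = 20
    P = np.zeros((N + 1, N + 1))
    P[0, 0] = P[N, N] = 1.0                  # absorbing endpoints
    for j in range(1, N):
        P[j, j - 1] = P[j, j + 1] = 0.5

    u = np.ones(N + 1); u[0] = u[N] = 0.0    # u^0: one at interior points
    mid, history = N // 2, []
    for n in range(1, 201):
        u = P @ u                            # u^n = P u^{n-1}
        history.append(u[mid])
        if n % 50 == 0:
            print(n, u[mid], history[-1] / history[-3])

    print("cos^2(pi/N) =", np.cos(np.pi / N) ** 2)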

The connection with partial differential equations comes from passing to the continuum limit, which we will do here formally, without detailed proofs. The proofs resemble the usual convergence proofs for finite difference methods, or can be made entirely probabilistic, since $X_n$ is just a sum of iid random variables and the diffusion approximation is simply a restatement of the CLT.

We write the recursion relation for $u^n$ as
$$\frac{1}{\Delta t}\big(u^n - u^{n-1}\big) = \sigma^2\,\frac{1}{(\Delta x)^2}(P - I)u^{n-1},$$
with $\sigma^2 = (\Delta x)^2/\Delta t$ remaining fixed as $\Delta t$ and $\Delta x$ tend to zero. In the continuum limit we assume that $u_j^n \approx u(n\Delta t, j\Delta x)$, with $u(t,x)$ a smooth function that will satisfy a partial differential equation. It is easily seen that the $j$-th entry satisfies
$$\Big[\frac{1}{(\Delta x)^2}(P - I)u^{n-1}\Big]_j \to \frac{1}{2}u_{xx}(n\Delta t, j\Delta x),$$
and therefore the continuum limit of the difference equation in the interior becomes
$$u_t(t,x) = \frac{\sigma^2}{2}u_{xx}(t,x), \quad t > 0,\ x\in(0,a),$$
with boundary conditions $u(t,0) = u(t,a) = 0$ and initial condition $u(0,x) = 1$. This equation can be solved by a Fourier sine series, and the result is
$$u(t,x) = P_x\{T > t\} = \sum_{k=0}^{\infty}\frac{4}{(2k+1)\pi}\,e^{-\big(\frac{(2k+1)\pi}{a}\big)^2\frac{\sigma^2 t}{2}}\sin\Big(\frac{(2k+1)\pi x}{a}\Big).$$
Here $T$ is the exit time from the interval $(0,a)$ of Brownian motion, the path limit in law of the random walk. The most interesting feature of the explicit solution is the rate of decay as $t\to\infty$:
$$\frac{1}{t}\log P_x\{T > t\} \to -\frac{\pi^2\sigma^2}{2a^2}.$$

When we consider this problem in the semi-infinite interval $(0,\infty)$ we simply have to solve the pde
$$u_t(t,x) = \frac{\sigma^2}{2}u_{xx}(t,x), \quad t > 0,\ x > 0,$$
with $u(t,0) = 0$ and $u(0,x) = 1$. It is interesting to compute here, for $\lambda > 0$,
$$E_x\{e^{-\lambda T}\} = -\int_0^{\infty}e^{-\lambda t}u_t(t,x)\,dt,$$
which is the Laplace transform of the density $-u_t$ of the exit time $T$ through the origin. The Laplace transform of $u$,
$$\hat u(\lambda, x) = \int_0^{\infty}e^{-\lambda t}u(t,x)\,dt,$$
satisfies the ode
$$\lambda\hat u - 1 = \frac{\sigma^2}{2}\hat u_{xx}, \qquad \hat u(\lambda, 0) = 0,$$
for which we get the unique bounded solution
$$\hat u = \frac{1}{\lambda}\Big(1 - e^{-\frac{\sqrt{2\lambda}}{\sigma}x}\Big).$$
And since $E_x\{e^{-\lambda T}\} = 1 - \lambda\hat u$, we see that
$$E_x\{e^{-\lambda T}\} = e^{-\frac{\sqrt{2\lambda}}{\sigma}x}.$$
This Laplace transform can be inverted and the explicit form of the density of $T$ can be obtained. But, contrary to what we have in a finite interval, where the large $t$ tail of the distribution is exponentially small, in the semi-infinite interval the mean exit time is infinite, $E_x\{T\} = \infty$. This can be shown by differentiating the Laplace transform of $T$ with respect to $\lambda$ and then letting $\lambda$ tend to zero.

    5.1 Transience and recurrence

An irreducible and aperiodic Markov chain on an infinite state space, a random walk for example, is recurrent if the expected number of visits to a state is infinite, and transient otherwise. We will show that the one-dimensional random walk on the infinite lattice is recurrent, while in three dimensions it is transient. It is also recurrent in two dimensions. We will use Fourier series for the analysis.

The one-dimensional random walk is $X_n = X_{n-1} + Z_n$, where $\{Z_n\}$ are iid random variables taking the values $\pm\Delta x$ with probability $1/2$. Therefore, for $k\in(-\frac{\pi}{\Delta x}, \frac{\pi}{\Delta x})$ we have
$$E_x\{e^{ikX_n}\} = E_x\big\{E\{e^{ikZ_n}e^{ikX_{n-1}}\mid X_{n-1}\}\big\} = e^{ikx}\big(E\{e^{ikZ_1}\}\big)^n = e^{ikx}\Big(\frac{1}{2}\big(e^{ik\Delta x} + e^{-ik\Delta x}\big)\Big)^n = e^{ikx}(\cos k\Delta x)^n.$$
The values of $X_n$ are $m\Delta x$, where $m = 0, \pm 1, \pm 2, \dots$, so that if $x = 0$ then
$$E_0\{e^{ikX_n}\} = \sum_m e^{ikm\Delta x}p_m(n),$$
where $p_m(n)$ is the probability that $X_n = m\Delta x$, starting from $0$. By the orthogonality of the complex exponentials we see that
$$p_m(n) = \frac{\Delta x}{2\pi}\int_{-\pi/\Delta x}^{\pi/\Delta x}E_0\{e^{ikX_n}\}\,e^{-ikm\Delta x}\,dk,$$
which implies that
$$P_0(X_n = 0) = p_0(n) = \frac{\Delta x}{2\pi}\int_{-\pi/\Delta x}^{\pi/\Delta x}(\cos k\Delta x)^n\,dk.$$
We will use this Fourier representation of $p_0(n)$ to show that the one-dimensional symmetric random walk is recurrent.

We will show that the expected number of returns to zero is infinite. The expected number of returns by time $n$ is
$$E_0\Big\{\sum_{j=0}^{n}\mathbf{1}\{X_j = 0\}\Big\} = \sum_{j=0}^{n}p_0(j) = \frac{\Delta x}{2\pi}\int_{-\pi/\Delta x}^{\pi/\Delta x}\frac{1 - (\cos k\Delta x)^{n+1}}{1 - \cos k\Delta x}\,dk.$$
The integrand may be singular only at $k = 0$. Expanding near $k = 0$ we see that the integrand behaves like $n+1$ there, and therefore in any fixed, small neighborhood of $k = 0$ the integral is unbounded as $n\to\infty$. We conclude, omitting details, that the expected number of returns to the origin is $E_0\{\sum_{j=0}^{\infty}\mathbf{1}\{X_j = 0\}\} = \infty$, and therefore the random walk is recurrent.

In three dimensions the calculation leading to the Fourier representation of the probability of return to the origin in $n$ steps is essentially the same as in one dimension, except for notation. Now $m = (m_1, m_2, m_3)$ runs over the points of the integer lattice in three dimensions, we denote the mesh size by $\Delta x$ in all three directions, and we denote by $e_j$, $j = 1, 2, 3$, the three unit coordinate vectors in $\mathbb{R}^3$. The Fourier variable is $k = (k_1, k_2, k_3)$, with each coordinate taking values in $(-\frac{\pi}{\Delta x}, \frac{\pi}{\Delta x})$. The random walk is $X_n = X_{n-1} + Z_n$, where the increments are iid random variables taking the values $\pm e_j\Delta x$, $j = 1, 2, 3$, with probability $1/6$ each. Repeating the above steps and using inner product notation for vectors we have that
$$E_0\{e^{ik\cdot X_n}\} = \Big[\frac{1}{3}\big(\cos k_1\Delta x + \cos k_2\Delta x + \cos k_3\Delta x\big)\Big]^n.$$
Therefore
$$E_0\Big\{\sum_{j=0}^{\infty}\mathbf{1}\{X_j = 0\}\Big\} = \Big(\frac{\Delta x}{2\pi}\Big)^3\int_{-\pi/\Delta x}^{\pi/\Delta x}\int_{-\pi/\Delta x}^{\pi/\Delta x}\int_{-\pi/\Delta x}^{\pi/\Delta x}\frac{1}{1 - \frac{1}{3}(\cos k_1\Delta x + \cos k_2\Delta x + \cos k_3\Delta x)}\,dk.$$
As in the one-dimensional case, only $k = 0$ is a singular point, and expanding near $k = 0$ we see that the denominator behaves like $|k|^2$. But the Jacobian in spherical coordinates in three dimensions is also proportional to $|k|^2$, so the singularity cancels and we have, in fact, a convergent integral. This proves that the symmetric random walk in three dimensions is transient.
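A simulation sketch contrasting the two cases: the number of visits to the origin keeps growing with the path length in one dimension but saturates in three dimensions.

    import numpy as np

    rng = np.random.default_rng(6)

    def mean_visits(dim, nsteps, npaths=200):
        total = 0
        for _ in range(npaths):
            axes = rng.integers(dim, size=nsteps)
            signs = rng.choice([-1, 1], size=nsteps)
            steps = np.zeros((nsteps, dim), dtype=int)
            steps[np.arange(nsteps), axes] = signs
            pos = steps.cumsum(axis=0)            # path on the lattice
            total += int((pos == 0).all(axis=1).sum())
        return total / npaths

    for n in [1000, 4000, 16000]:
        print(n, mean_visits(1, n), mean_visits(3, n))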

    5.2 Connections with classical potential theory

The connections with classical potential theory come from the fact that the scaled transition probability matrix for the random walk on the lattice (scaled by $\Delta x$) converges to the Laplace operator. Let us assume that we are in three dimensions and let $f(x)$ be a smooth and bounded function on $\mathbb{R}^3$. We can then define the transition operator of the random walk by
$$Pf(x) = \frac{1}{6}\sum_{j=1}^{3}\big(f(x + e_j\Delta x) + f(x - e_j\Delta x)\big) = \text{average of the nearest neighbors},$$
so that
$$\frac{1}{(\Delta x)^2}\big(Pf(x) - f(x)\big) \to \frac{1}{6}\Delta f(x) = \frac{1}{6}\big(f_{x_1x_1}(x) + f_{x_2x_2}(x) + f_{x_3x_3}(x)\big)$$
as $\Delta x\to 0$, where $\Delta$ is the Laplace operator. For the expectation
$$u_m^n = E\{f(X_n)\mid X_0 = m\Delta x\},$$
where $m = (m_1, m_2, m_3)$, we have that
$$u_m^{n+1} = (Pu^n)(m\Delta x),$$
and we can rewrite this as
$$u_m^{n+1} - u_m^n = (P - I)u^n(m\Delta x).$$

Dividing by $\Delta t$ we have
$$\frac{1}{\Delta t}\big(u_m^{n+1} - u_m^n\big) = \frac{(\Delta x)^2}{\Delta t}\cdot\frac{1}{(\Delta x)^2}(P - I)u^n(m\Delta x).$$
In the continuum limit $\Delta t\to 0$, $\Delta x\to 0$ with
$$\sigma^2 = \frac{1}{3}\frac{(\Delta x)^2}{\Delta t} = \text{constant},$$
we have that $u_m^n\to u(n\Delta t, m\Delta x)$ with
$$u_t = \frac{\sigma^2}{2}\Delta u, \quad t > 0,$$
and $u(0,x) = f(x)$. The factor $1/3$ is attached to the definition of $\sigma^2$ because it refers to the coordinate-wise mean square displacement rather than the overall mean square displacement.

We introduce the Laplace transform
$$\hat u(x,\lambda) = \int_0^{\infty}e^{-\lambda t}u(t,x)\,dt, \quad \lambda > 0,$$
and note that the diffusion equation transforms to
$$\lambda\hat u(x,\lambda) - f(x) = \frac{\sigma^2}{2}\Delta\hat u(x,\lambda).$$
In terms of the Brownian motion process $X_t$, the continuous time analog of the random walk, which is not considered in detail here, we have the probabilistic representation
$$u(t,x) = E_x\{f(X_t)\},$$
and for the Laplace transform
$$\hat u(x,\lambda) = \int_0^{\infty}e^{-\lambda t}E_x\{f(X_t)\}\,dt = E_x\Big\{\int_0^{\infty}e^{-\lambda t}f(X_t)\,dt\Big\}.$$
When $f(x) = \mathbf{1}_A(x) = \mathbf{1}\{x\in A\}$, then
$$\hat u(x,\lambda) = E_x\Big\{\int_0^{\infty}e^{-\lambda t}\mathbf{1}_A(X_t)\,dt\Big\}$$
is the discounted, with rate $\lambda$, expected time spent by the Brownian motion in the set $A$, starting from $x$. When $\lambda = 0$, $\hat u(x,0)$ is the expected time spent in $A$, which is the continuous analog of the quantity that characterizes transience and recurrence in random walks.

The continuum limit is interesting because it connects directly with potential theory, that is, the theory of solutions of the Laplace equation. In three dimensions the Green's function for the equation
$$\lambda G(x,y) - \frac{\sigma^2}{2}\Delta_x G(x,y) = \delta_y(x)$$
is explicitly given by
$$G(x,y) = \frac{2}{\sigma^2}\,\frac{e^{-\frac{\sqrt{2\lambda}}{\sigma}|x-y|}}{4\pi|x-y|}.$$
Therefore we have the integral representation
$$\hat u(x,\lambda) = E_x\Big\{\int_0^{\infty}e^{-\lambda t}f(X_t)\,dt\Big\} = \frac{2}{\sigma^2}\int_A\frac{e^{-\frac{\sqrt{2\lambda}}{\sigma}|x-y|}}{4\pi|x-y|}\,dy.$$
The expected time spent in $A$ is thus given by the Newtonian potential
$$E_x\Big\{\int_0^{\infty}f(X_t)\,dt\Big\} = \frac{2}{\sigma^2}\int_A\frac{1}{4\pi|x-y|}\,dy.$$
For the random walk on the three dimensional lattice we do not have an explicit expression such as this one, which, in particular, shows that the time spent in any bounded set is finite.

The recurrence of one-dimensional Brownian motion can be seen easily by noting that the Green's function in one dimension has the form
$$G(x,y) = \frac{e^{-\frac{\sqrt{2\lambda}}{\sigma}|x-y|}}{\sigma\sqrt{2\lambda}},$$
and therefore in one dimension we have
$$\hat u(x,\lambda) = E_x\Big\{\int_0^{\infty}e^{-\lambda t}f(X_t)\,dt\Big\} = \int_A\frac{e^{-\frac{\sqrt{2\lambda}}{\sigma}|x-y|}}{\sigma\sqrt{2\lambda}}\,dy.$$
This becomes infinite as $\lambda\to 0$, showing that the expected time spent in any set of positive measure is infinite. This is the analog of recurrence for Brownian motion in one dimension.

    5.3 Stochastic control

We will formulate a stochastic control problem for a finite difference equation of the form
$$X_{n+1} = X_n + b(U_n, X_n)\Delta t + \sigma(U_n, X_n)\sqrt{\Delta t}\,Z_{n+1}, \quad n = 0, 1, 2, \dots,$$
where $X_0 = x$ is given and $\{Z_n\}$ are iid $N(0,1)$ random variables. This is a Markov chain with values in $\mathbb{R}$ and with coefficient functions $b(u,x)$ and $\sigma(u,x)$ that depend on both $x$ and the control variable $u\in\mathbb{R}$. The transition probabilities are given by
$$P\{X_{n+1}\in A\mid X_n = x, U_n = u\} = \int_A\frac{1}{\sqrt{2\pi\sigma^2(u,x)\Delta t}}\,e^{-\frac{(y - x - b(u,x)\Delta t)^2}{2\sigma^2(u,x)\Delta t}}\,dy.$$

For any bounded function $f(x)$ we define the transition operator
$$P_u f(x) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2(u,x)\Delta t}}\,e^{-\frac{(y - x - b(u,x)\Delta t)^2}{2\sigma^2(u,x)\Delta t}}\,f(y)\,dy.$$
The reason the scaling with the time increment $\Delta t$ is chosen this way is so that the conditional increment of the Markov chain has the following mean and variance:
$$E\{X_{n+1} - X_n\mid X_n = x, U_n = u\} = b(u,x)\Delta t,$$
$$\operatorname{Var}\{X_{n+1} - X_n\mid X_n = x, U_n = u\} = \sigma^2(u,x)\Delta t,$$
which implies that for any smooth and bounded function $f$ we have
$$\lim_{\Delta t\to 0}\frac{1}{\Delta t}\big(P_u f(x) - f(x)\big) = \frac{\sigma^2(u,x)}{2}\frac{\partial^2 f(x)}{\partial x^2} + b(u,x)\frac{\partial f(x)}{\partial x} = L_u f(x).$$

Thus, the continuum limit of the Markov chain exists and is a diffusion process.

The control of the process $X_n$ is $U_n$, assumed to be a real valued random variable that depends only on $X_0, X_1, \dots, X_n$ and such that
$$E_x\Big\{\sum_{j=0}^{N-1}L(U_j, X_j)\Delta t + g(X_N)\Big\}$$
is minimized over all control sequences $U_0, U_1, \dots, U_{N-1}$, for some terminal time index $N$, a given running cost function $L(u,x)$, and a terminal cost $g(x)$. We introduce the value function
$$V^n(x) = \inf_{\mathcal{A}_n}E\Big\{\sum_{j=n}^{N-1}L(U_j, X_j)\Delta t + g(X_N)\,\Big|\,X_n = x\Big\}, \quad n = N-1, N-2, \dots, 0,$$
where $\mathcal{A}_n$ is the set of all admissible control sequences. Note that $V^N(x) = g(x)$. Clearly we want to find $V^0(x)$ and the associated optimal control sequence $U_0^*, U_1^*, \dots, U_{N-1}^*$.

The main result in discrete time stochastic control is the derivation of the backward-in-time recursion for the determination of the value function $V^0$ and the associated optimal control. Using iterated conditional expectation we write
$$\begin{aligned}
V^n(x) &= \inf_{\mathcal{A}_n}E\Big\{\sum_{j=n}^{N-1}L(U_j, X_j)\Delta t + g(X_N)\,\Big|\,X_n = x\Big\}\\
&= \inf_{\mathcal{A}_n}E\Big\{E\Big\{L(U_n, X_n)\Delta t + \sum_{j=n+1}^{N-1}L(U_j, X_j)\Delta t + g(X_N)\,\Big|\,X_{n+1}, X_n = x\Big\}\,\Big|\,X_n = x\Big\}\\
&= \inf_u E\Big\{L(u,x)\Delta t + \inf_{\mathcal{A}_{n+1}}E\Big\{\sum_{j=n+1}^{N-1}L(U_j, X_j)\Delta t + g(X_N)\,\Big|\,X_{n+1}\Big\}\,\Big|\,X_n = x, U_n = u\Big\}\\
&= \inf_u\big[L(u,x)\Delta t + E\{V^{n+1}(X_{n+1})\mid X_n = x, U_n = u\}\big]\\
&= \inf_u\big[L(u,x)\Delta t + P_u V^{n+1}(x)\big].
\end{aligned}$$
Therefore, the determination of the value function and the associated optimal control is reduced to solving the backward optimality recursion
$$V^n(x) = \inf_u\big[L(u,x)\Delta t + P_u V^{n+1}(x)\big], \quad n = N-1, N-2, \dots, 0,$$

with $V^N(x) = g(x)$.

Assuming that a unique minimizer $u_n^*(x)$ exists at each step of the backward optimality recursion, which in general requires convexity and other assumptions on $L$, $b$, $\sigma$ and $g$, we can then obtain the optimally controlled Markov chain recursively by
$$X_{n+1}^* = X_n^* + b\big(u_n^*(X_n^*), X_n^*\big)\Delta t + \sigma\big(u_n^*(X_n^*), X_n^*\big)\sqrt{\Delta t}\,Z_{n+1}, \quad n = 0, 1, 2, \dots,$$
with $X_0^* = x$. The optimal controls are given by $U_n^* = u_n^*(X_n^*)$ and they are Markovian, that is, they depend only on the current (optimal) state. The optimal value function satisfies
$$V^n(x) = L\big(u_n^*(x), x\big)\Delta t + P_{u_n^*(x)}V^{n+1}(x), \quad n = N-1, N-2, \dots, 0,$$
with $V^N(x) = g(x)$. The optimal control problem is thus reduced to solving first the backward optimality recursion and then the forward recursion that determines the optimal state and the associated optimal controls.

Let us consider a simple linear, quadratic cost control problem in which $b(u,x) = bu$, $\sigma(u,x) = \sigma$, $L(u,x) = lu^2$, $g(x) = gx^2$, where $b$, $\sigma$, $l$, $g$ are now constants, and we take $\Delta t = 1$. In this case we may assume that the value function has the form
$$V^n(x) = a_n x^2 + b_n$$
and derive recursions for the sequences of constants $\{a_n\}$ and $\{b_n\}$, with $a_N = g$ and $b_N = 0$. Because of the Gaussian transition probability density for the Markov chain, the backward optimality recursion has the form
$$a_n x^2 + b_n = \inf_u\big[lu^2 + a_{n+1}\big(\sigma^2 + (x + bu)^2\big) + b_{n+1}\big].$$
The minimizing $u$ is given by
$$u_n^*(x) = -\frac{b\,a_{n+1}}{l + b^2 a_{n+1}}\,x,$$
and then the recursions for $a_n$ and $b_n$ have the form
$$a_n = \frac{l\,a_{n+1}}{l + b^2 a_{n+1}}, \qquad b_n = \sigma^2 a_{n+1} + b_{n+1},$$
with $a_N = g$ and $b_N = 0$. Once this sequence of constants is determined, the optimally controlled Markov chain has the form
$$X_{n+1}^* = X_n^* + b\,u_n^*(X_n^*) + \sigma Z_{n+1}, \quad n = 0, 1, 2, \dots,$$
with $X_0^* = x$. Note that the optimal control is a linear function of the state in this case.

    with X0 = x. Note that the optimal control is a linear function of the state in this case.The continuum limit of the backward optimality recursion is the Hamilton-Jacobi-

    Bellman (HJB) equation obtained as follows. We re-write this recursion as

    V n+1(x) V n(x) + infu

    [L(u, x)t+ PuVn+1(x) V n(x)] , n = N 1, N 2, ..., 0

    Dividing by t and passing to the limit we get

    Vt + infu

    [L(u, x) + Luf(x)] , t < T

    with V (T, x) = g(x). Re