CME 308 Spring 2014 Notes
George Papanicolaou
March 31, 2015
Contents

1 Sums of independent identically distributed random variables
1.1 The weak law of large numbers
1.2 The strong law of large numbers
1.3 Weak convergence
1.4 The central limit theorem (CLT)
1.5 Characteristic functions and Fourier transforms
1.6 On convergence of random variables
1.7 Confidence intervals for the empirical mean
1.8 Large deviations

2 Maximum likelihood estimation (MLE)
2.1 Large sample properties of MLE
2.2 Cramer-Rao lower bound and asymptotic efficiency of the MLE
2.3 Asymptotic normality of posterior densities

3 Basic Monte Carlo methods
3.1 Properties of basic Monte Carlo
3.2 Importance sampling
3.3 Acceptance-rejection
3.4 Glivenko-Cantelli theorem and the Kolmogorov-Smirnov test
3.5 Density kernel estimation
3.6 Bootstrap

4 Markov Chains
4.1 Exit times
4.2 Transience and recurrence
4.3 Invariant probabilities
4.4 The ergodic theorem
4.5 The central limit theorem for Markov chains
4.6 Expected number of visits to a state and the invariant probabilities
4.7 Return times and the ergodic theorem
4.8 MLE for Markov chains
4.9 Bayesian filtering

5 Random walks and connections with differential equations
5.1 Transience and recurrence
5.2 Connections with classical potential theory
5.3 Stochastic control
1 Sums of independent identically distributed random variables
We will be dealing with sequences of independent identically distributed random variables $X_1, X_2, \ldots, X_n$ where $P\{X_j \le x\} = F(x)$ is their common distribution function. We will also use the notation $F_X(x)$ when we want to identify the random variable whose distribution is $F(x)$. Independence means that the joint distribution of $X_1, X_2, \ldots, X_n$ is equal to the product of the marginals
$$P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\} = F(x_1, x_2, \ldots, x_n) = \prod_{j=1}^n F(x_j)$$
This implies that the expectation of the product of any bounded functions of the random variables equals the product of the expectations: $E\{g_1(X_1)g_2(X_2)\cdots g_n(X_n)\} = E\{g_1(X_1)\}E\{g_2(X_2)\}\cdots E\{g_n(X_n)\}$.
We will be interested in the behavior of the sample or empirical mean
$$\bar X_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
which is the simplest and most widely studied function of the random variables. We expect that it should be closely related to the theoretical mean $\mu = E\{X_j\}$. We denote by $\sigma^2$ the variance of $X_j$,
$$\sigma^2 = \mathrm{var}(X_j) = E\{(X_j - \mu)^2\} = \int (x - \mu)^2\, dF(x)$$
1.1 The weak law of large numbers
The simplest large sample relation is the weak law of large numbers (WLLN), which says that $\bar X_n \to \mu$ in probability as $n \to \infty$. This is a consequence of the Chebyshev inequality (CI)
$$P\{|\bar X_n - \mu| > \epsilon\} \le \frac{E\{(\bar X_n - \mu)^2\}}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$
as $n \to \infty$, for all $\epsilon > 0$. We have used here the fact that $\bar X_n$ converges to $\mu$ also in mean square, $E\{(\bar X_n - \mu)^2\} \to 0$.
We note that neither independence of the $X_j$ nor their finite variance is needed for the validity of the WLLN. It is enough that the $X_j$ be sufficiently uncorrelated, with finite variance, or that they be independent with finite $E\{|X_j|\} < \infty$ but possibly infinite variance.

The Chebyshev inequality for any random variable $X$ has the form
$$P\{|X - \mu| > \epsilon\} \le \frac{E\{|X - \mu|^p\}}{\epsilon^p}, \quad \epsilon > 0,\ 0 < p < \infty$$

1.2 The strong law of large numbers

The weak law says that $\lim_{n\to\infty} P\{|\bar X_n - \mu| > \epsilon\} = 0$, for all $\epsilon > 0$. We would like to know when it is true that
$$P\{\lim_{n\to\infty} \bar X_n = \mu\} = 1$$
which is the strong law of large numbers (SLLN).

This is more involved because we need to calculate the probability of an event that depends on infinitely many random variables. It is necessary therefore that the infinite sequence of independent identically distributed (iid) random variables $X_1, X_2, \ldots, X_n, \ldots$ be defined on the same probability space $\Omega$, that is, $X_j = X_j(\omega)$ are real valued functions on $\Omega$. We may think of $\Omega$ as a set of elementary events on which a probability law $P$ is defined; more precisely, it is defined on subsets of $\Omega$. We will not need a measure theoretic foundation for probability here, except in special cases in which we will deal with the issues that arise without a general theory.
Let $A_n = \{\omega\ |\ |\bar X_n - \mu| \le \epsilon\}$ for some $\epsilon > 0$. If we can show that, given $\epsilon > 0$, the set of all $\omega$ such that $|\bar X_n - \mu| \le \epsilon$ for all $n$ larger than some $N(\omega)$ has probability one, then we have proved the SLLN. But the event in question can be written as
$$A = \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} A_n$$
With $A^c$, the complement of $A$, we now have
$$P\{A^c\} = P\Big\{\bigcap_{N=1}^{\infty} \bigcup_{n=N}^{\infty} A_n^c\Big\} = \lim_{N\to\infty} P\Big\{\bigcup_{n=N}^{\infty} A_n^c\Big\} \le \lim_{N\to\infty} \sum_{n=N}^{\infty} P\{A_n^c\}$$
We have used here some general properties of probability laws that we will not discuss in detail but which are rather intuitive. One is that the probability of the intersection of a sequence of decreasing sets is equal to the limit of the probability of these sets, and the other is that the probability of the union of sets is less than or equal to the sum of the probabilities. It is equal when the events are disjoint, that is, their pairwise intersections are empty. Suppose now that we can show that for any $\epsilon > 0$ fixed
$$P\{A_n^c\} = P\{|\bar X_n - \mu| > \epsilon\} \le \frac{\text{constant}}{n^2}$$
Then we have that $P\{A^c\} = 0$, since
$$\lim_{N\to\infty} \sum_{n=N}^{\infty} P\{A_n^c\} = 0,$$
and hence $P\{A\} = 1$, which proves the SLLN. If the iid random variables $\{X_j\}$ have finite fourth order moments, $E\{|X_j|^4\} < \infty$ or $E\{(X_j - \mu)^4\} < \infty$, then expanding $E\{(\bar X_n - \mu)^4\}$ and using independence shows that it is of order $n^{-2}$, so the Chebyshev inequality with $p = 4$ gives exactly such a bound.
1.3 Weak convergence

We say that random variables $X_n$, with distribution functions $F_n(x)$, converge weakly (or in law, or in distribution) to $X$, with distribution function $F(x)$, if $E\{g(X_n)\} \to E\{g(X)\}$ for all bounded and continuous functions $g$, a definition stated entirely in terms of expectations of functions of the random variables, which shows that only distribution functions are involved. The continuity of $g$ is essential as it allows for a coarse graining of the features of the random variables involved, and leads as a consequence to results that can have universal behavior. The central limit theorem is the best and perhaps simplest example of all this, as we will see in the next section.
A basic theorem in weak convergence is the equivalence of three different forms of it. The following three statements are equivalent:

1. $\lim_{n\to\infty} E\{g(X_n)\} = E\{g(X)\}$, for all bounded and continuous $g(x)$
2. $\lim_{n\to\infty} F_n(x) = F(x)$, at all continuity points $x$ of $F(x)$
3. $\lim_{n\to\infty} E\{e^{i\xi X_n}\} = E\{e^{i\xi X}\}$, pointwise for all $\xi \in \mathbb{R}$

The last form of weak convergence involves the characteristic functions of the random variables, which are the Fourier transforms of the distribution functions. Implicit in this theorem is the statement that knowledge of the expectations $E\{g(X)\}$ for all bounded and continuous $g$ determines $F$ uniquely, and knowledge of the characteristic function $E\{e^{i\xi X}\}$ also determines $F$ uniquely. The latter is not so surprising as it amounts to the uniqueness of the inverse Fourier transform, which provides also a formula for recovering $F$ from its Fourier transform.
As an example of how the above equivalence is shown, consider how the second statement follows from the first. The indicator function $I_{(-\infty,x]}(y)$ is discontinuous but can be bounded above and below by continuous functions, $\underline{g}(y) \le I_{(-\infty,x]}(y) \le \bar g(y)$, where $\bar g(y)$ equals one for $y \le x$, is linear between $x$ and $x + \delta$ going from 1 to zero, and is zero for $y > x + \delta$. The lower function is defined similarly, being equal to 1 for $y \le x - \delta$, linear between $x - \delta$ and $x$, and zero for $y > x$. From the first statement we have that
$$E\{\underline{g}(X)\} = \lim_{n\to\infty} E\{\underline{g}(X_n)\} \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le \lim_{n\to\infty} E\{\bar g(X_n)\} = E\{\bar g(X)\}$$
From the definition of $\bar g$ and $\underline{g}$ it follows that
$$F(x - \delta) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \delta)$$
If $x$ is a continuity point of $F$ then letting $\delta \to 0$ gives the second statement.
1.4 The central limit theorem (CLT)

The CLT states that if $X_1, X_2, \ldots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$ then $Z_n = \sqrt{n}(\bar X_n - \mu)$ converges weakly to a Gaussian random variable with mean zero and variance $\sigma^2$.
The proof uses characteristic functions. We also assume that we have finite third moments, $E\{|X|^3\} < \infty$, which is not necessary except to simplify the proof. We have that
$$E\{e^{i\xi Z_n}\} = E\big\{e^{\frac{i\xi}{\sqrt n}\sum_{j=1}^n (X_j - \mu)}\big\} = E\Big\{\prod_{j=1}^n e^{\frac{i\xi}{\sqrt n}(X_j - \mu)}\Big\} = \prod_{j=1}^n E\big\{e^{\frac{i\xi}{\sqrt n}(X_j - \mu)}\big\} = \big(E\{e^{\frac{i\xi}{\sqrt n}(X - \mu)}\}\big)^n$$
The independence is used in going from the second to the third expression above. We now use the Taylor expansion with remainder for the exponential
$$\big|e^{ix} - 1 - ix - \tfrac{1}{2}(ix)^2\big| \le \tfrac{1}{6}|x|^3$$
and note that derivatives of the characteristic function at zero exist and equal moments if these are finite. This gives
$$\big(E\{e^{\frac{i\xi}{\sqrt n}(X - \mu)}\}\big)^n = \Big(1 - \frac{\xi^2\sigma^2}{2n} + O(n^{-3/2})\Big)^n \to e^{-\frac{\xi^2\sigma^2}{2}}$$
as $n \to \infty$. But $e^{-\frac{\xi^2\sigma^2}{2}}$ is the characteristic function of a Gaussian random variable with mean zero and variance $\sigma^2$, which is the well-known identity
$$e^{-\frac{\xi^2\sigma^2}{2}} = \int e^{i\xi x}\, \frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\, dx \qquad (1)$$
A proof of the CLT as above but with only second moments for the law of the $\{X_j\}$ can be given by noting that if $\phi(\xi) = E\{e^{i\xi(X - \mu)}\}$ then $\phi(0) = 1$, $\phi'(0) = 0$, $\phi''(0) = -\sigma^2$, and we have by Taylor's theorem with remainder
$$\phi\Big(\frac{\xi}{\sqrt n}\Big)^n = \Big(1 + \phi''(\xi_n)\frac{\xi^2}{2n}\Big)^n \to e^{-\frac{\xi^2\sigma^2}{2}}$$
where $0 < \xi_n < \xi/\sqrt n \to 0$ as $n \to \infty$.

1.6 On convergence of random variables

We say that $\{X_n\}$ converges in probability to $X$, $X_n \xrightarrow{P} X$, if for every $\epsilon > 0$, $P\{|X_n - X| > \epsilon\} \to 0$ as $n \to \infty$.
We say that $\{X_n\}$ converges in mean square to $X$, $X_n \xrightarrow{MSQ} X$, if $E\{(X_n - X)^2\} \to 0$ as $n \to \infty$. Clearly convergence in mean square implies convergence in probability. This follows from the Chebyshev inequality. However, the converse is not true, simply because random variables can converge in probability while not even having finite second moments (variances), so mean square convergence does not make sense for them.
Convergence with probability one, as explained above, means that all random variables are defined on the same probability space and that $P\{\lim_{n\to\infty} X_n = X\} = 1$. This is a composite of three statements: first the limit exists, then the limit is equal to $X$, and third this occurs with probability one. Convergence with probability one implies convergence in probability (you need the bounded convergence theorem to show this) but the converse is false. This is intuitively clear but somewhat technical to justify, as are most statements that hold with probability one. The connection with mean square convergence is this: convergence with probability one and boundedness of the variances of the $X_n$ uniformly in $n$ does not imply mean square convergence, but it does imply that the limit random variable has finite variance. And mean square convergence does not imply convergence with probability one.

Regarding weak convergence, it is true that convergence in probability implies it. The converse is not true except when the weak limit of the random variables (distributions) $X_n$ is deterministic, that is, when $X$ takes only one value and its distribution has a jump of height one at that one point. In that case weak convergence implies convergence in probability. Slutsky's theorem is a generalization of this statement that we use often in estimation theory and elsewhere. Its statement is that if $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{P} c$, where $c$ is a (deterministic) constant, then $X_n + Y_n \xrightarrow{L} X + c$, $X_nY_n \xrightarrow{L} Xc$, and if $c > 0$ then $X_n/Y_n \xrightarrow{L} X/c$.
As an example of how these statements are shown, consider how convergence in probability implies weak convergence. For weak convergence it is sufficient that the test function $g$ be bounded and uniformly continuous (bounded and continuous is enough), so that given $\epsilon > 0$ there is a $\delta > 0$ so that $|g(x) - g(y)| < \epsilon$ if $|x - y| < \delta$. We then have that
$$|E\{g(X_n)\} - E\{g(X)\}| \le E\{|g(X_n) - g(X)|I_{(|X_n - X| \le \delta)}\} + E\{|g(X_n) - g(X)|I_{(|X_n - X| > \delta)}\} \le \epsilon(1 - P\{|X_n - X| > \delta\}) + 2\max_x |g(x)|\, P\{|X_n - X| > \delta\}$$
This implies that
$$\limsup_{n\to\infty} |E\{g(X_n)\} - E\{g(X)\}| \le \epsilon$$
and since $\epsilon$ is arbitrary we have the statement of weak convergence.
1.7 Confidence intervals for the empirical mean
An important application of the central limit theorem, along with Slutsky's theorem, is in getting confidence intervals in parameter estimation. Suppose that we use the empirical mean $\bar X_n$ to estimate the true mean $\mu$, assumed unknown. First we introduce the sample variance
$$s_n^2 = \frac{1}{n-1}\sum_{j=1}^n (X_j - \bar X_n)^2$$
It is easily seen that $E\{s_n^2\} = \sigma^2$, the theoretical variance, and assuming finite third moments we not only have that $\bar X_n \xrightarrow{P} \mu$ but also $s_n^2 \xrightarrow{P} \sigma^2$, which implies that $s_n \xrightarrow{P} \sigma$. Now the central limit theorem says that $\frac{\sqrt n}{\sigma}(\bar X_n - \mu) \xrightarrow{L} N(0, 1)$. By Slutsky's theorem we also have that $\frac{\sqrt n}{s_n}(\bar X_n - \mu) \xrightarrow{L} N(0, 1)$. Given a confidence level $\alpha$, for example $\alpha = .05$, we find $\xi_\alpha$ such that $P\{|Z| > \xi_\alpha\} = \alpha$, where $Z$ is a Gaussian random variable with mean zero and variance one. It then follows that for $n$ large enough we have that
$$\frac{\sqrt n}{s_n}|\bar X_n - \mu| > \xi_\alpha$$
with probability $\alpha$, and hence
$$\bar X_n - \frac{\xi_\alpha s_n}{\sqrt n} \le \mu \le \bar X_n + \frac{\xi_\alpha s_n}{\sqrt n} \qquad (2)$$
with probability $1 - \alpha$. This is a confidence interval for the unknown mean $\mu$ in terms of a sample of size $n$, which we assume is large enough so that the central limit theorem can be used.
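To make (2) concrete, here is a minimal numerical sketch (the standard exponential data, with true mean $\mu = 1$, and the quantile 1.96 for $\alpha = .05$ are illustrative choices, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    x = rng.exponential(scale=1.0, size=n)   # illustrative data: true mean mu = 1

    xbar = x.mean()
    s_n = x.std(ddof=1)                      # sample standard deviation s_n
    xi_alpha = 1.96                          # P{|Z| > 1.96} = .05 for Z ~ N(0,1)

    half_width = xi_alpha * s_n / np.sqrt(n)
    print(f"95% CI for the mean: [{xbar - half_width:.4f}, {xbar + half_width:.4f}]")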
1.8 Large deviations
The weak law of large numbers states that $\bar X_n \xrightarrow{P} \mu$ and the central limit theorem that $\sqrt n(\bar X_n - \mu) \xrightarrow{L} N(0, \sigma^2)$. Let $\gamma > \mu$ and note that we must have $P\{\bar X_n > \gamma\} \to 0$. The question posed in large deviations is to estimate the rate at which this probability tends to zero. Assume that the underlying random variable $X$ has a density $f(x)$ and a moment generating function. We denote the logarithm of the moment generating function by
$$L(\theta) = \log E\{e^{\theta X}\}, \quad \theta \in \mathbb{R}$$
which we assume to be finite and differentiable in $\theta$. We have that $L(0) = 0$, $L'(0) = \mu$ and $L''(0) = \sigma^2$. We also assume that $L(\theta)$, which is always convex, is in fact strictly convex. We define the conjugate convex function of $L$ by
$$H(\gamma) = \sup_\theta\, (\theta\gamma - L(\theta)), \quad \gamma \in \mathbb{R}$$
and note that $L$ can be recovered from $H$ by applying the Legendre transform again
$$L(\theta) = \sup_\gamma\, (\theta\gamma - H(\gamma)), \quad \theta \in \mathbb{R}$$
We will show that
$$\lim_{n\to\infty} \frac{1}{n}\log P\{\bar X_n > \gamma\} = -H(\gamma),$$
which means that in the logarithmic sense we have
$$P\{\bar X_n > \gamma\} \approx e^{-nH(\gamma)}$$
For the proof we get first the upper bound
$$\limsup_{n\to\infty} \frac{1}{n}\log P\{\bar X_n > \gamma\} \le -H(\gamma), \qquad (3)$$
and then the lower bound
$$\liminf_{n\to\infty} \frac{1}{n}\log P\{\bar X_n > \gamma\} \ge -H(\gamma), \qquad (4)$$
from which the result follows. For the upper bound we note that
$$P\{\bar X_n > \gamma\} = P\{e^{n\theta\bar X_n} > e^{n\theta\gamma}\} \le e^{-n\theta\gamma}\big(E\{e^{\theta X}\}\big)^n = e^{-n(\theta\gamma - L(\theta))}, \quad \theta \ge 0,$$
which, after taking logs, dividing by $n$, taking the upper limit and then optimizing the right side over $\theta$, gives the upper bound (3). If $\gamma > \mu$ then $H(\gamma) = \sup_{\theta > 0}(\theta\gamma - L(\theta))$, so the Markov inequality above is valid.
To get the lower bound we first introduce a transformation of the law of $X$ that plays an essential role here and in some other contexts, such as in importance sampling. For any $\theta \in \mathbb{R}$ let
$$f_\theta(x) = \frac{e^{\theta x}f(x)}{\int e^{\theta x}f(x)\,dx}$$
and note that we have the identity
$$f(x_1)f(x_2)\cdots f(x_n) = e^{-\theta(x_1 + x_2 + \cdots + x_n) + nL(\theta)}f_\theta(x_1)f_\theta(x_2)\cdots f_\theta(x_n)$$
This implies that for any real $\theta$ we have
$$P\{\bar X_n > \gamma\} = E\{I_{(\bar X_n > \gamma)}\} = E_\theta\{e^{-n(\theta\bar X_n - L(\theta))}I_{(\bar X_n > \gamma)}\}$$
where $E_\theta$ denotes expectation relative to the law with density $f_\theta$. Let $\delta$ be any small positive number and note that $\{\bar X_n > \gamma\} \supset \{\bar X_n - (\gamma + \delta) > -\delta\} \supset \{|\bar X_n - (\gamma + \delta)| < \delta\}$. If we choose $\theta = \theta(\gamma + \delta)$ so that $L'(\theta) = \gamma + \delta$, which means that $E_\theta\{X\} = \gamma + \delta$, the weak law of large numbers gives
$$P_\theta\{|\bar X_n - (\gamma + \delta)| < \delta\} \to 1, \quad \text{as } n \to \infty$$
We now note that
$$E_\theta\{e^{-n(\theta\bar X_n - L(\theta))}I_{(\bar X_n > \gamma)}\} \ge e^{-n(\theta(\gamma + 2\delta) - L(\theta))}P_\theta\{|\bar X_n - (\gamma + \delta)| < \delta\}$$
Taking logarithms, dividing by $n$ and taking the lower limit as $n \to \infty$ gives
$$\liminf_{n\to\infty} \frac{1}{n}\log P\{\bar X_n > \gamma\} \ge -[\theta(\gamma + \delta)(\gamma + 2\delta) - L(\theta(\gamma + \delta))]$$
Since $\delta$ is arbitrarily small we have
$$\liminf_{n\to\infty} \frac{1}{n}\log P\{\bar X_n > \gamma\} \ge -[\theta(\gamma)\gamma - L(\theta(\gamma))] = -H(\gamma)$$
which proves the lower bound and hence the large deviations theorem.
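As a quick illustrative check (not part of the notes' development): for standard Gaussian samples $L(\theta) = \theta^2/2$, so $H(\gamma) = \gamma^2/2$, and $P\{\bar X_n > \gamma\}$ can be computed exactly since $\bar X_n \sim N(0, 1/n)$:

    import numpy as np
    from math import erfc, sqrt, log

    # For X ~ N(0,1): L(theta) = theta^2/2, so H(gamma) = gamma^2/2.
    gamma = 0.5
    H = gamma**2 / 2

    for n in [10, 100, 1000, 4000]:
        # Xbar_n ~ N(0, 1/n), so P{Xbar_n > gamma} = (1/2) erfc(gamma sqrt(n/2))
        p = 0.5 * erfc(gamma * sqrt(n / 2.0))
        print(f"n={n:5d}  (1/n) log P = {log(p)/n:+.5f}   -H(gamma) = {-H:+.5f}")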
2 Maximum likelihood estimation (MLE)
Let $X_1, X_2, \ldots, X_n$ be independent samples of a random variable whose distribution $F(x|\theta)$ depends on a real parameter $\theta$ with values in a bounded interval. How can we estimate the true value $\bar\theta$ of this parameter from the observed sequence? In the maximum likelihood method we form the likelihood function
$$L_n(\theta) = L_n(\theta; X_1, X_2, \ldots, X_n) = \prod_{j=1}^n f(X_j|\theta), \qquad (5)$$
where $f(x|\theta) = \frac{d}{dx}F(x|\theta)$ is the density of the random variable, and obtain an estimate $\hat\theta_n$ of $\theta$ by maximizing it:
$$\hat\theta_n = \arg\max_\theta L_n(\theta)$$
Clearly $\hat\theta_n = \hat\theta_n(X_1, X_2, \ldots, X_n)$, and the reason that this is an acceptable choice for an estimator is that the observed sample, by the mere fact that it has occurred, must have come from the distribution with the most likely parameter. Of course, the reason that MLE estimators are calculated and used is that they have very good large sample properties, as we will see, which captures this intuition.
Assuming that the density $f(x|\theta) > 0$ is smooth in $\theta$ and that the integrals we will be computing all exist, we introduce the scaled log likelihood function
$$l_n(\theta) = \frac{1}{n}\sum_{j=1}^n \log f(X_j|\theta) \qquad (6)$$
and note that $l_n'(\hat\theta_n) = 0$, which is the usual first order condition for an extremum. Here primes denote derivatives with respect to $\theta$ and we have
$$l_n'(\theta) = \frac{1}{n}\sum_{j=1}^n \frac{f'(X_j|\theta)}{f(X_j|\theta)}$$
By the WLLN we have that
$$\lim_{n\to\infty} P\{|l_n(\theta) - l(\theta)| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0$$
where
$$l(\theta) = E\{\log f(X|\theta)\}.$$
Having assumed differentiability in $\theta$ we also have
$$\lim_{n\to\infty} P\{|l_n'(\theta) - l'(\theta)| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0$$
where
$$l'(\theta) = E\Big\{\frac{f'(X|\theta)}{f(X|\theta)}\Big\}$$
Evaluating at the true parameter value, where the expectation is taken with respect to $f(x|\theta)$ itself, we clearly have that
$$l'(\theta) = \int \frac{f'(x|\theta)}{f(x|\theta)}f(x|\theta)\,dx = \int f'(x|\theta)\,dx = \frac{d}{d\theta}\int f(x|\theta)\,dx = 0$$
We also have that
$$l''(\theta) = -\int \frac{(f'(x|\theta))^2}{f(x|\theta)}\,dx < 0$$
which means that the true parameter value is, in general, a maximum of $l(\theta)$. We define the Fisher information by
$$I(\theta) = -l''(\theta) = \int \frac{(f'(x|\theta))^2}{f(x|\theta)}\,dx \qquad (7)$$
and note that it is positive.
2.1 Large sample properties of MLE
An estimator is called asymptotically consistent if
$$\lim_{n\to\infty} P\{|\hat\theta_n - \bar\theta| > \epsilon\} = 0, \quad \text{for all } \epsilon > 0 \qquad (8)$$
The fact that the MLE estimator is asymptotically consistent follows from the smoothness in $\theta$ of $f(x|\theta)$ and the existence of the integrals that arise in the calculations above. This is not an obvious statement and will not be proved in detail. It becomes more plausible when we invoke the uniform WLLN for $l_n(\theta)$ and $l_n'(\theta)$:
$$\lim_{n\to\infty} P\Big\{\max_{|\theta - \bar\theta| \le a} |l_n(\theta) - l(\theta)| > \epsilon\Big\} = 0, \quad \text{for all } \epsilon > 0 \qquad (9)$$
and similarly for $l_n'$, where $a > 0$ is a fixed constant. This says that near $\bar\theta$ the random curve $l_n(\theta)$ and its derivative $l_n'(\theta)$ are uniformly close to the deterministic curve $l(\theta)$ and its derivative, in probability. This then implies that the maximum of $l_n$, which is $\hat\theta_n$, is close to the maximum of $l$, which is $\bar\theta$, in probability.
An estimator is unbiased if $E\{\hat\theta_n\} = \bar\theta$ and asymptotically unbiased if $\lim_{n\to\infty} E\{\hat\theta_n\} = \bar\theta$. MLE estimators are in general biased, but because of consistency they are, with additional assumptions on integrability, asymptotically unbiased.

An important application of the central limit theorem is in determining the fluctuations in the MLE. Let $Z_n$ be the scaled error
$$\hat\theta_n = \bar\theta + \frac{1}{\sqrt n}Z_n, \quad \text{or} \quad Z_n = \sqrt n(\hat\theta_n - \bar\theta)$$
We will show that $Z_n$ converges weakly to a Gaussian random variable with mean zero and variance equal to the reciprocal of the Fisher information $I$. It is not immediately clear how the CLT comes into the picture since $\hat\theta_n$ is not, in general, a sum of random variables. The way to represent it approximately as a ratio of sums of random variables is by using the delta method or expansion. We use a Taylor expansion with remainder (and the mean value theorem) to get
$$0 = l_n'(\hat\theta_n) = l_n'(\bar\theta) + \frac{1}{\sqrt n}l_n''(\theta_n^*)Z_n$$
so that
$$Z_n = -\frac{\sqrt n\, l_n'(\bar\theta)}{l_n''(\theta_n^*)}$$
where $\theta_n^*$ is between $\bar\theta$ and $\hat\theta_n$ and tends to $\bar\theta$ in probability, by consistency. We now note the following. First, the CLT can be applied directly to $\sqrt n\, l_n'(\bar\theta)$ so as to show that it converges weakly to a Gaussian random variable with mean zero and variance equal to the Fisher information $I(\bar\theta)$. Second, the uniform WLLN can be applied to $l_n''(\theta_n^*)$ to show that it converges in probability to minus the Fisher information.¹ By Slutsky's theorem it follows that $Z_n$ converges in distribution to a Gaussian random variable with mean zero and variance $I^{-1}(\bar\theta)$. Therefore we have shown that $\sqrt n(\hat\theta_n - \bar\theta)$ converges in law to a Gaussian random variable with mean zero and variance $I^{-1}(\bar\theta)$. From this result we can construct confidence intervals for the unknown parameter just as we did for the empirical mean in (2).

Slutsky's theorem, in the form used here, says that if $U_n = V_nW_n + Y_n$ and (i) $V_n$ converges weakly to $V$, (ii) $W_n$ converges in probability to a constant $w$ and (iii) $Y_n$ converges in probability to zero, then $U_n$ converges weakly to $Vw$.

¹T. S. Ferguson, A Course in Large Sample Theory, CRC Press, 1996, page 122.
2.2 Cramer-Rao lower bound and asymptotic efficiency of the MLE

We now want to show that the MLE is asymptotically efficient, because the variance of the limit fluctuation $Z$, which is the reciprocal of the Fisher information, is as small as it can be. This is done by using the Cramer-Rao lower bound for the variance of estimators, which we derive next.

For any random variables $U, V$ we have by the Schwarz inequality
$$\mathrm{var}(U) \ge \frac{(\mathrm{cov}(U, V))^2}{\mathrm{var}(V)}$$
Let $X$ be a vector-valued random variable with (multi-dimensional) density $f(x|\theta)$ depending on a parameter $\theta$. Now apply the inequality with $U = w(X)$, a function of $X$, and $V = \frac{d}{d\theta}\log f(X|\theta)$. Since we have that $E\{V\} = 0$, we see that
$$\mathrm{cov}(U, V) = E\Big\{w(X)\frac{d}{d\theta}\log f(X|\theta)\Big\} = \frac{d}{d\theta}E\{w(X)\}$$
Therefore we have the inequality
$$\mathrm{var}(w(X)) \ge \frac{\big(\frac{d}{d\theta}E\{w(X)\}\big)^2}{E\big\{\big(\frac{d}{d\theta}\log f(X|\theta)\big)^2\big\}}$$
This is the Cramer-Rao inequality.

In the case that the random vector $X = (X_1, X_2, \ldots, X_n)$ is a sample of size $n$ from the one-dimensional density $f(x|\theta)$, we see easily that we have
$$\mathrm{var}(w(X)) \ge \frac{\big(\frac{d}{d\theta}E\{w(X)\}\big)^2}{n\,E\big\{\big(\frac{d}{d\theta}\log f(X|\theta)\big)^2\big\}}$$
This inequality is valid for any $\theta$ and any estimator $w(X)$.

We now apply this to the maximum likelihood estimator $\hat\theta_n$ of $\bar\theta$, that is, we let $w(X) = \hat\theta_n(X)$. We then have
$$n\,\mathrm{var}(\hat\theta_n(X)) \ge \frac{\big(\frac{d}{d\theta}E\{\hat\theta_n(X)\}|_{\theta = \bar\theta}\big)^2}{I(\bar\theta)}$$
where
$$I(\bar\theta) = E\Big\{\Big(\frac{d}{d\theta}\log f(X|\theta)\Big)^2\Big\}\Big|_{\theta = \bar\theta}$$
is the Fisher information. The numerator on the right involves the bias of the MLE. With enough assumptions we can show that for $n$ large we have that
$$\frac{d}{d\theta}E\{\hat\theta_n(X)\}\Big|_{\theta = \bar\theta} \to 1$$
so that we have the asymptotic inequality for large $n$:
$$\mathrm{var}(\sqrt n\,\hat\theta_n(X)) \gtrsim \frac{1}{I(\bar\theta)}. \qquad (10)$$
We know that $\hat\theta_n \to \bar\theta$ in probability and that $\sqrt n(\hat\theta_n - \bar\theta)$ tends in distribution to a Gaussian random variable with mean zero and variance one over the Fisher information. With mild integrability assumptions this implies that the MLE is asymptotically unbiased. It is biased in general. With further assumptions, the derivative of the bias $(E\{\hat\theta_n\} - \bar\theta)$ with respect to $\bar\theta$ also tends to zero as $n \to \infty$. This shows that the MLE has the smallest asymptotic variance among all possible consistent and asymptotically unbiased (in the stronger sense involving the derivative) estimators.

As an example of how the Cramer-Rao inequality can be used, consider a sample $X_1, X_2, \ldots, X_n$ from the exponential density $f(x|\theta) = \theta e^{-\theta x}$, $x > 0$, with parameter $\theta > 0$. We find easily from the log likelihood function that the MLE of $\theta$ is $\hat\theta_n = 1/\bar X_n$, where $\bar X_n$ is the sample mean. The MLE is biased because $E\{\hat\theta_n\} = \frac{n\theta}{n-1}$, but it is asymptotically unbiased and the derivative of its mean tends to one, so that (10) can be used asymptotically for large $n$. The Fisher information is here $I(\theta) = \theta^{-2}$. To see that $E\{\hat\theta_n\} = \frac{n\theta}{n-1}$, we note that
$$E\Big\{\frac{1}{\bar X_n}\Big\} = \int_0^\infty\!\!\cdots\int_0^\infty \frac{n}{x_1 + x_2 + \cdots + x_n}\,\theta^n e^{-\theta(x_1 + x_2 + \cdots + x_n)}\,dx_1\cdots dx_n = a_n\theta$$
The constant $a_n$ is evaluated by differentiating the integral with respect to $\theta$ and getting a differential equation for $E\{1/\bar X_n\}$, which has solution $a_n\theta$ with $a_n = \frac{n}{n-1}$.
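A minimal simulation sketch of this example (the value of $\theta$ and the sample sizes are illustrative choices), checking both the bias factor $n/(n-1)$ and the asymptotic variance $1/I(\theta) = \theta^2$:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 2.0            # illustrative true parameter
    n, trials = 50, 100_000

    x = rng.exponential(scale=1.0 / theta, size=(trials, n))
    theta_hat = 1.0 / x.mean(axis=1)          # MLE: 1 / sample mean

    print("E{theta_hat} ~", theta_hat.mean(),
          " vs n*theta/(n-1) =", n * theta / (n - 1))
    # CLT: sqrt(n)(theta_hat - theta) should have variance ~ 1/I(theta) = theta^2
    print("var of sqrt(n)(theta_hat - theta) ~",
          (np.sqrt(n) * (theta_hat - theta)).var(), " vs theta^2 =", theta**2)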
2.3 Asymptotic normality of posterior densities
Instead of considering $\theta$ as a parameter upon which the density $f(x|\theta)$ depends, we may think of it as a random variable with a prior density $g(\theta)$. Once the sample $X_1, X_2, \ldots, X_n$ from $f(x|\bar\theta)$ has been observed, the posterior density of $\theta$ given the sample is defined by
$$\pi_n(\theta) = \frac{L_n(\theta)g(\theta)}{\int L_n(\theta')g(\theta')\,d\theta'} \qquad (11)$$
where the likelihood function $L_n(\theta)$ is defined by (5). We will show that if we look at the posterior density in a neighborhood of the MLE $\hat\theta_n$ that is of order $1/\sqrt n$, it tends to a limit² that is a Gaussian density with mean zero and variance equal to the reciprocal of the Fisher information $I(\bar\theta)$. The actual theorem states that if we let $\theta = \hat\theta_n + \frac{\tau}{\sqrt n}$ and change variables in the posterior $\pi_n(\theta)$ so that it becomes $\tilde\pi_n(\tau)$, then
$$\tilde\pi_n(\tau) \to \frac{e^{-\frac{I(\bar\theta)\tau^2}{2}}}{\sqrt{2\pi(1/I(\bar\theta))}} \qquad (12)$$

²T. S. Ferguson, A Course in Large Sample Theory, CRC Press, 1996, p. 141.
in probability and for each $\tau$. Some hypotheses about the prior $g(\theta)$ are needed so that the limit can be taken inside the integral in the denominator in (11). We know by (8) that the MLE is consistent, $\hat\theta_n \to \bar\theta$ in probability, so the change of variables is essentially centered about the true parameter value $\bar\theta$. However, the convergence of the posterior density is not valid if we do not center around $\hat\theta_n$. Note also that the limit posterior is independent of the prior density $g(\theta)$, assuming that the latter is positive for all $\theta$ as well as such that the limit can be taken inside the integral in (11).

For the proof we note that by (6), $L_n(\theta) = e^{nl_n(\theta)}$, and we have the expansion
$$l_n\Big(\hat\theta_n + \frac{\tau}{\sqrt n}\Big) = l_n(\hat\theta_n) + \frac{\tau}{\sqrt n}l_n'(\hat\theta_n) + \frac{\tau^2}{2n}l_n''(\theta_n^*)$$
where $\theta_n^*$ is between $\hat\theta_n$ and $\hat\theta_n + \frac{\tau}{\sqrt n}$ and so tends to $\bar\theta$ in probability as $n \to \infty$. Since $l_n'(\hat\theta_n) = 0$, which is essential for the centering, we have
$$l_n\Big(\hat\theta_n + \frac{\tau}{\sqrt n}\Big) = l_n(\hat\theta_n) + \frac{\tau^2}{2n}l_n''(\theta_n^*)$$
Taking into consideration the normalization (the denominator in (11)), the uniform law of large numbers (9) for second derivatives of the log likelihood function, and consistency of the MLE, we have that $l_n''(\theta_n^*) \to -I(\bar\theta)$ in probability. We see that this last relation does imply (12), in probability for each $\tau$. It can also be shown that this implies that
$$\int \Big|\tilde\pi_n(\tau) - \frac{e^{-\frac{I(\bar\theta)\tau^2}{2}}}{\sqrt{2\pi(1/I(\bar\theta))}}\Big|\,d\tau \to 0$$
in probability.
3 Basic Monte Carlo methods
The main problem in Monte Carlo simulation is to calculate, using sampling methods, expectations of complicated (multi-dimensional) functions of random variables (random vectors) that may also have complicated distributions.

The most direct, but often difficult to apply, way of generating random variables with a given distribution $F(x)$ (density $f(x) = F'(x)$) is by noting that if $U$ is a uniform random variable over $[0, 1]$ then $F^{-1}(U)$ has distribution $F(x)$. This is because $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$. Of course, generating i.i.d. uniform random variables $U_1, U_2, \ldots, U_n$ numerically is not easy and requires deep number theoretic methods, especially if $n$ is large and rigorous statistical tests for independence are to be satisfied. But if we assume that this can be done, then generation of a sample $X_1, X_2, \ldots, X_n$ with density $f$ is easy except when the inverse distribution function $F^{-1}$ is hard to obtain, and then the acceptance-rejection algorithm can be used.
Given a function $g(x)$, and assuming that we can generate an i.i.d. sample from $f(x)$, we approximate $I = E(g(X))$ by
$$I_n = \frac{1}{n}\sum_{j=1}^n g(X_j)$$
This is basic Monte Carlo.
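Before moving on, here is a minimal sketch of the inverse transform method described above, in the exponential case where $F^{-1}$ is explicit (the rate parameter is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_exponential(lam: float, n: int) -> np.ndarray:
        """Inverse transform: F(x) = 1 - exp(-lam x), so F^{-1}(u) = -log(1-u)/lam."""
        u = rng.uniform(size=n)
        return -np.log1p(-u) / lam

    x = sample_exponential(lam=2.0, n=100_000)
    print("sample mean:", x.mean(), " (theory: 0.5)")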
3.1 Properties of basic Monte Carlo
Clearly $E(I_n) = I$ and
$$\mathrm{var}(I_n) = E\Big\{\Big(\frac{1}{n}\sum_{j=1}^n g(X_j) - I\Big)^2\Big\} = E\Big\{\Big(\frac{1}{n}\sum_{j=1}^n (g(X_j) - I)\Big)^2\Big\} = \frac{1}{n^2}\sum_{j=1}^n E\{(g(X_j) - I)^2\} = \frac{1}{n}\mathrm{var}(g(X))$$
where the cross terms vanish by independence. The fact that the variance of the Monte Carlo approximation $I_n$ decays as one over the sample size is characteristic of this method and the main limiting factor for its applicability. It is therefore important to look for ways of reducing the multiplicative factor $\sigma^2 = \mathrm{var}(g(X))$, and this is what importance sampling tries to do.
We can also use the CLT to get confidence intervals for $I$ using the approximations $I_n$. The CLT does apply since we assume that $\mathrm{var}(g(X)) < \infty$. If $\xi_\alpha$ is such that $P(|Z| > \xi_\alpha) = \alpha$ for a standard Gaussian $Z$, then for $n$ large we have approximately
$$P\Big(\frac{\sqrt n}{\sigma}|I_n - I| \le \xi_\alpha\Big) \approx 1 - \alpha$$
or
$$I_n - \frac{\xi_\alpha\sigma}{\sqrt n} \le I \le I_n + \frac{\xi_\alpha\sigma}{\sqrt n}$$
with probability $1 - \alpha$. Of course $\sigma$ is not known and must be replaced by the sample variance
$$s_n^2 = \frac{1}{n-1}\sum_{j=1}^n (g(X_j) - I_n)^2$$
to get a realizable confidence interval. The justification for replacing $\sigma$ by $s_n$ in the CLT, that is, in still having $\frac{\sqrt n}{s_n}(I_n - I) \to N(0, 1)$ in law, requires the use of Slutsky's theorem, since $s_n \to \sigma > 0$ in probability and $\sqrt n(I_n - I) \to N(0, \sigma^2)$ in distribution, and therefore $\frac{\sqrt n}{s_n}(I_n - I) \to N(0, 1)$ in distribution. The confidence intervals are now realizable:
$$I_n - \frac{\xi_\alpha s_n}{\sqrt n} \le I \le I_n + \frac{\xi_\alpha s_n}{\sqrt n}$$
with probability $1 - \alpha$. The question of how large the number of realizations $n$ must be for this to be reasonably accurate, based on the asymptotic theory, depends both on the integrand $g$ and on the density $f$. Since we are not doing estimation here, it is usually possible to increase the number of realizations and thus improve the accuracy of the confidence interval.
One can also use the continuous mapping theorem, which can be stated in general as follows. Suppose that the pair of random variables $(X_n, Y_n)$ converges in distribution to $(X, Y)$. Let $h(x, y)$ be a function from $\mathbb{R}^2 \to \mathbb{R}$ such that the set $\{(x, y)\ |\ h(x, y)\text{ is not continuous}\}$ has probability zero with respect to the limit law. Then $h(X_n, Y_n) \to h(X, Y)$ in distribution. This more general theorem can be applied here with $X_n = \sqrt n(I_n - I)$, $Y_n = s_n$ and $h(x, y) = x/y$.
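A minimal basic Monte Carlo sketch with its realizable confidence interval; the integrand and density are illustrative choices (exactly, $E(\cos X) = e^{-1/2}$ for $X \sim N(0,1)$):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    x = rng.standard_normal(n)
    g = np.cos(x)                       # g(X); exact mean is exp(-1/2)

    I_n = g.mean()
    s_n = g.std(ddof=1)
    half = 1.96 * s_n / np.sqrt(n)      # 95% confidence half-width
    print(f"I_n = {I_n:.5f} +/- {half:.5f}   (exact {np.exp(-0.5):.5f})")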
3.2 Importance sampling
Importance sampling is used when the function $g(x)$ to be integrated and the density $f(x)$ overlap very little, so that the product $g(x)f(x)$ is very small and therefore $I = E(g(X))$ is very small. Most of the samples drawn from $f$ will not overlap significantly with regions where $g$ is significant. This means that the relative error (standard deviation over mean) in direct Monte Carlo simulations will be very large.

The main idea in importance sampling is to introduce a reference density $\tilde f(x)$, let
$$M(x) = \frac{f(x)}{\tilde f(x)}$$
and then note that, assuming all integrals are well defined,
$$\tilde E(gM) = \int g(x)\frac{f(x)}{\tilde f(x)}\tilde f(x)\,dx = \int g(x)f(x)\,dx = E(g) = I$$
We can now generate an unbiased Monte Carlo approximation by using a sample from $\tilde f$, $\tilde X_1, \tilde X_2, \ldots, \tilde X_n$, so that
$$\tilde I_n = \frac{1}{n}\sum_{j=1}^n g(\tilde X_j)M(\tilde X_j)$$
The variance of $\tilde I_n$ is now determined by
$$\tilde E((gM)^2) - (\tilde E(gM))^2 = \int \frac{(g(x))^2(f(x))^2}{\tilde f(x)}\,dx - I^2$$
The question is how to choose $\tilde f$ so as to reduce the variance of $\tilde I_n$. Assuming that $g$ is positive we see right away that if we take
$$\tilde f(x) = \frac{g(x)f(x)}{E(g(X))}$$
then the variance of $\tilde I_n$ is zero! The reference density puts all the weight just where it should, that is, where the product $gf$ is significant. But this is hardly an improved Monte Carlo, since in order to implement it we need to know $E(g(X))$, which is the very quantity that we are trying to approximate. A number of interesting algorithms can be developed, however, that use approximations of $E(g)$ to get $\tilde f$, which will then lead to improved approximations.
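A minimal importance sampling sketch; the target (the small probability $I = E\{1_{X>4}\}$ for $X \sim N(0,1)$) and the shifted reference density $\tilde f = N(4,1)$ are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100_000

    # Plain Monte Carlo for I = P(X > 4), X ~ N(0,1): almost every sample is wasted.
    x = rng.standard_normal(n)
    print("plain MC:", (x > 4).mean())

    # Importance sampling with reference density f~ = N(4,1).
    # Likelihood ratio M(x) = f(x)/f~(x) = exp(-4x + 8).
    y = rng.normal(loc=4.0, size=n)
    w = np.exp(-4.0 * y + 8.0) * (y > 4)
    print("importance sampling:", w.mean(),
          "+/-", 1.96 * w.std(ddof=1) / np.sqrt(n))
    # exact value: 0.5*erfc(4/sqrt(2)) ~ 3.17e-5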
One possibility is to choose $\tilde f(x)$ as follows. Let $\{x_j\}$ be a partition of the real line, pick points $x_j^* \in (x_{j-1}, x_j]$, and define
$$\tilde f(x) = \frac{\sum_j f(x_j^*)g(x_j^*)\,1_{\{x_{j-1} < x \le x_j\}}}{\sum_j f(x_j^*)g(x_j^*)(x_j - x_{j-1})}$$
a normalized, piecewise-constant approximation of the optimal reference density $gf/E(g)$.

3.3 Acceptance-rejection

Suppose that we want to generate samples with density $f(x)$ and that there is another density $g(x)$, from which it is easy to sample, and a constant $c > 1$ such that
$$\frac{f(x)}{g(x)} \le c$$
for all $x$, and we may consider the smallest such constant. The acceptance-rejection algorithm consists of the following steps:

1. Generate $Z$ from $g$
2. Generate an independent uniform $[0, 1]$ random variable $U$
3. Is $U \le \frac{f(Z)}{cg(Z)}$? If yes then return $Z$ (accept) and if not then repeat (go to step 1).
The output random variable, say $X$, will be in a set $A$ if
$$1_{\{X \in A\}} = 1_{\{Z_1 \in A\}}1_{\{U_1 \le \frac{f(Z_1)}{cg(Z_1)}\}} + \sum_{k=2}^{\infty}\Big(\prod_{j=1}^{k-1} 1_{\{U_j > \frac{f(Z_j)}{cg(Z_j)}\}}\Big)1_{\{Z_k \in A\}}1_{\{U_k \le \frac{f(Z_k)}{cg(Z_k)}\}}$$
But
$$E\Big(1_{\{Z \in A\}}1_{\{U \le \frac{f(Z)}{cg(Z)}\}}\Big) = \int 1_{\{z \in A\}}E\Big(1_{\{U \le \frac{f(z)}{cg(z)}\}}\Big)g(z)\,dz = \frac{1}{c}\int_A f(z)\,dz$$
and using the independence of the random variables in the acceptance-rejection algorithm we see that
$$P(X \in A) = \frac{1}{c}\int_A f(z)\,dz + \frac{1}{c}\int_A f(z)\,dz\sum_{k=2}^{\infty}\Big(1 - \frac{1}{c}\Big)^{k-1} = \int_A f(z)\,dz$$
This shows that indeed the acceptance-rejection algorithm does produce random variables with the correct density.

To understand better how the algorithm behaves, let $N$ be the number of cycles needed to produce the desired random variable. Clearly
$$P(N = k) = \Big[\int P\Big(U > \frac{f(z)}{cg(z)}\Big)g(z)\,dz\Big]^{k-1}\Big[1 - \int P\Big(U > \frac{f(z)}{cg(z)}\Big)g(z)\,dz\Big] = \Big(1 - \frac{1}{c}\Big)^{k-1}\frac{1}{c}, \quad k = 1, 2, \ldots$$
and therefore $E(N) = c$, which explains the role that the constant $c$ plays.

Now acceptance-rejection can be combined with Monte Carlo, somewhat in the way importance sampling is done, as follows. Suppose that it is hard to sample from $f(x)$ and that there is another density $g(x)$ and a constant $c > 1$ such that $f(x)/g(x) \le c$ for all $x$. To compute $I = E(h(X))$ for some function $h(x)$ we generate $X_1, X_2, \ldots, X_n$ by the acceptance-rejection algorithm and then compute
$$I_n = \frac{1}{n}\sum_{j=1}^n h(X_j)$$
This is a slightly less efficient approximation than when we generate the sample directly from $f(x)$, because if we count the steps needed in the acceptance-rejection cycle then a total of $nc$ (with $c > 1$) steps are needed on average to compute $I_n$. Counting also the uniform random variables in the acceptance-rejection algorithm, we see that we need on average $2nc$ samples to generate $I_n$. The accuracy of $I_n$ remains the same but the computational cost has increased.
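A minimal acceptance-rejection sketch; the target (the Beta(2,2) density $f(x) = 6x(1-x)$ on $[0,1]$, dominated by the uniform density $g = 1$ with $c = 1.5$) is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(5)

    def f(x):
        return 6.0 * x * (1.0 - x)       # Beta(2,2) density on [0,1]

    c = 1.5                               # max of f/g with g = Uniform[0,1] density

    def sample_f(n: int) -> np.ndarray:
        out = []
        while len(out) < n:
            z = rng.uniform()             # step 1: Z from g
            u = rng.uniform()             # step 2: independent U
            if u <= f(z) / c:             # step 3: accept with probability f(Z)/(c g(Z))
                out.append(z)
        return np.array(out)

    x = sample_f(100_000)
    print("mean:", x.mean(), "(theory 0.5)   var:", x.var(), "(theory 0.05)")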
3.4 Glivenko-Cantelli theorem and the Kolmogorov-Smirnov test

There is little practical or theoretical distinction between Monte Carlo and estimation methods, so we discuss non-parametric estimation in this part of the notes.

Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a distribution $F(x)$ and define the empirical distribution $F_n(x)$ by
$$F_n(x) = \frac{1}{n}\sum_{j=1}^n 1_{\{X_j \le x\}}$$
This is a random (since it depends on the sample), piecewise constant distribution function that should serve as an estimate for the true $F(x)$, assumed unknown here. Since for each $x \in \mathbb{R}$ the random variables $1_{\{X_j \le x\}}$ are bounded by one, have mean $F(x)$ and are independent, the weak law of large numbers tells us that $F_n(x) \to F(x)$ in probability. In fact this is true also with probability one, and the CLT holds:
$$\sqrt n\,(F_n(x) - F(x)) \to N(0, F(x)(1 - F(x))) \quad \text{in distribution, for each } x \in \mathbb{R}$$
This is not so useful though, because the variance of the limit Gaussian depends on the very distribution we want to estimate. Normalizing by $F_n$ so that
$$\frac{\sqrt n}{\sqrt{F_n(x)(1 - F_n(x))}}\,(F_n(x) - F(x)) \to N(0, 1) \quad \text{in distribution, for each } x \in \mathbb{R}$$
is not as useful either, because of the rather slow convergence that depends on $x$.

There are two basic results in non-parametric estimation that we now state and discuss, without proofs, which can be found in advanced statistics and probability books. The first is the Glivenko-Cantelli limit theorem and the second the Kolmogorov-Smirnov limit theorem. The first states that
$$P\Big\{\lim_{n\to\infty}\sup_{x\in\mathbb{R}} |F_n(x) - F(x)| = 0\Big\} = 1$$
The second states that if $F(x)$ is continuous then $D_n = \sup_{x\in\mathbb{R}} |F_n(x) - F(x)|$ is a random variable whose law is independent of $F$, and $\sqrt n\, D_n$ converges in distribution to a universal law
$$\lim_{n\to\infty} P\{\sqrt n\, D_n \le x\} = 1 - 2\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^2x^2}$$
With the KS theorem we can formulate various statistical tests regarding the estimation of the unknown distribution $F(x)$. The limit distribution of $\sqrt n\, D_n$ is, for continuous $F(x)$, the distribution of the maximum of the absolute value of the Brownian bridge, which is a Gaussian stochastic process $B(t)$ defined on $[0, 1]$, with mean zero and covariance $E\{B(t)B(s)\} = t(1 - s)$ for $0 \le t \le s \le 1$, symmetric in $t$ and $s$. A Gaussian process indexed by $t \in [0, 1]$ is a collection of Gaussian random variables such that for any set of indices $0 \le t_0 < t_1 < \cdots < t_N \le 1$ the vector $(B(t_0), B(t_1), \ldots, B(t_N))$ is Gaussian with mean zero and covariance matrix given as above. This limit theorem is an example of an invariance principle, where the law of a function of the process before the limit, in this case the maximum absolute value, converges to the law of the same function of the limit process.
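A minimal sketch of the KS statistic for testing against a hypothesized continuous $F$; the standard normal target (with its closed-form CDF via erf) is an illustrative choice:

    import numpy as np
    from math import erf

    def norm_cdf(x):
        return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

    rng = np.random.default_rng(6)
    n = 1000
    x = np.sort(rng.standard_normal(n))

    # D_n = sup_x |F_n(x) - F(x)|; for sorted data it is attained at the jumps.
    F = np.array([norm_cdf(xi) for xi in x])
    i = np.arange(1, n + 1)
    D_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print("sqrt(n) D_n =", np.sqrt(n) * D_n)  # compare with KS quantile ~1.36 at level .05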
3.5 Density kernel estimation
A simpler non-parametric estimation problem, involving an unknown density, will be discussed in some detail because it shows explicitly the dependence of the rate of convergence on the smoothness of the density. In other words, one sees explicitly the slowing of the rate of convergence in a non-parametric estimation.

Suppose that we have a sample $X_1, X_2, \ldots, X_n$ from a density $f(x)$ which is not known and we want to estimate it. Since $F'(x) = f(x)$, an estimator for $f$ can be obtained by differentiating the empirical distribution $F_n$. But this is a random step function, so its derivative is a sum of delta functions. So we need some smoothing, which we do with a smoothing kernel, a positive, infinitely differentiable function $\phi(x)$ such that
$$\phi(x) \ge 0, \quad \int_{\mathbb{R}}\phi(x)\,dx = 1, \quad \int_{\mathbb{R}}x\phi(x)\,dx = 0, \quad \int_{\mathbb{R}}x^2\phi(x)\,dx = 1$$
The estimate of the density $f$ is now
$$f_{n,h}(x) = \frac{1}{n}\sum_{j=1}^n \frac{1}{h}\phi\Big(\frac{x - X_j}{h}\Big)$$
Clearly
$$E\{f_{n,h}(x)\} = \int_{\mathbb{R}}\phi_h(x - y)f(y)\,dy \approx f(x) + \frac{h^2}{2}f''(x) = f(x) + \text{bias}$$
for $h$ small, assuming differentiability of the density. Here $\phi_h(x) = \phi(x/h)/h$ and the small $h$ expansion is just a Taylor expansion plus use of the normalization properties of the smoothing kernel $\phi$.
The variance of the estimator is similarly given by
$$\mathrm{var}(f_{n,h}(x)) = \frac{1}{n}\mathrm{var}(\phi_h(x - X)) = \frac{1}{n}\Big[\int_{\mathbb{R}}\phi_h^2(x - y)f(y)\,dy - \Big(\int_{\mathbb{R}}\phi_h(x - y)f(y)\,dy\Big)^2\Big]$$
Assuming from here on that $\phi(x)$ is the Gaussian mean zero, variance one density, we have that
$$\mathrm{var}(f_{n,h}(x)) \approx \frac{1}{n}\Big[\frac{1}{h}\frac{1}{2\sqrt\pi}f(x) - f^2(x)\Big]$$
to principal order as $h \to 0$. From this we conclude that the mean square error is
$$E\{(f_{n,h}(x) - f(x))^2\} = \mathrm{var}(f_{n,h}(x)) + (\text{bias})^2 \approx \frac{1}{h}\frac{1}{2n\sqrt\pi}f(x) + \frac{h^4}{4}(f''(x))^2$$
to principal order in $h \to 0$ for each term. Minimizing this error over $h$ gives
$$h^* = \frac{1}{n^{1/5}}\Big(\frac{1}{2\sqrt\pi}\frac{f(x)}{(f''(x))^2}\Big)^{1/5}$$
It follows that with the optimal (in the MSQ sense) smoothing we have
$$f_{n,h^*}(x) = f(x) + O(n^{-2/5}) \quad \text{for each } x$$
in the MSQ sense. Instead of an error $O(n^{-1/2})$, which is the usual one for parameter estimation, we have slower error decay because of the necessary smoothing. And, of course, the density $f(x)$ to be estimated must have at least two derivatives.
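A minimal sketch of Gaussian kernel density estimation with the $n^{-1/5}$ bandwidth scaling; the constant in the bandwidth is set to 1 for illustration, since $f$ and $f''$ are unknown in practice:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 2000
    data = rng.standard_normal(n)

    h = n ** (-1 / 5)       # bandwidth ~ n^{-1/5}; constant chosen as 1 for illustration

    def f_nh(x: np.ndarray) -> np.ndarray:
        """Gaussian KDE evaluated at the points x."""
        u = (x[:, None] - data[None, :]) / h
        return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

    xs = np.array([0.0, 1.0, 2.0])
    exact = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)
    print("estimate:", f_nh(xs), " exact:", exact)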
3.6 Bootstrap
Let $X_1, X_2, \ldots, X_n$ be a sample drawn from a distribution function $F(x)$ and let $\hat\theta = \hat\theta(X_1, X_2, \ldots, X_n)$ be a statistic of interest, such as an estimator of an unknown parameter. We often want to calculate the standard deviation of this statistic, $SD = \sigma(F, n, \hat\theta) = \sigma(F)$, or some other measure of uncertainty, but the distribution function $F$ is not known, or it is not known with enough accuracy. The bootstrap³ is a very effective way to do this.

We introduce the empirical distribution function of the sample, $F_n$, which assigns mass $1/n$ at the points $x_i$, $i = 1, 2, \ldots, n$, the values of the observed sample. We want to calculate $SD^* = \sigma(F_n, n, \hat\theta) = \sigma(F_n)$ which, depending on the statistic and the distribution $F$, will be close to the theoretical SD for $n$ large. The issue in the bootstrap is primarily how to calculate $SD^*$ for fixed but large $n$, since this is often a combinatorially complex problem for large $n$.

We do this by Monte Carlo, using $F_n$ as the basic distribution from which to sample. A bootstrap sample is denoted by $X_1^*, X_2^*, \ldots, X_n^*$, which is an i.i.d. sequence drawn from $F_n$. This is the same as drawing from $(x_1, x_2, \ldots, x_n)$ with replacement $n$ times. Let $\hat\theta^* = \hat\theta(X_1^*, X_2^*, \ldots, X_n^*)$ be the bootstrap value of the statistic. The bootstrap standard deviation is simply a Monte Carlo approximation of $SD^* = \sigma(F_n)$. We do this by drawing repeated independent samples of size $n$ from $F_n$ and letting
$$SD_B^* = \Big(\frac{1}{B - 1}\sum_{b=1}^B (\hat\theta_b^* - \hat\theta_\cdot^*)^2\Big)^{1/2}$$
where $B$ is the number of bootstrap samples, assumed large enough to give a good Monte Carlo approximation of $\sigma(F_n)$. Here $\hat\theta_\cdot^*$ is the empirical mean of the bootstrap values of the statistic over the $B$ Monte Carlo samples (re-samples). As $B$ tends to infinity we have that $SD_B^* \to SD^*$, but since for finite $n$ this is only an approximation of SD, which is what we want, we should ideally choose $B$ so that the Monte Carlo error is comparable to the finite $n$ error.

³See the lecture notes by B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Series in Applied Mathematics no. 38, 1982.
As an example, consider the sample mean $\bar X$ as the statistic of interest. Then the standard deviation is $\sigma(F) = (\frac{\sigma^2}{n})^{1/2}$, where $\sigma^2 = E_F(X - E_F(X))^2$ is the theoretical variance of $F$. The bootstrap standard deviation is $SD^* = (\frac{\hat\sigma^2}{n})^{1/2}$, where $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$, which is the sample variance. Thus $E_F(\widehat{VAR}^*) = E_F(\frac{\hat\sigma^2}{n}) = \frac{n-1}{n}\frac{\sigma^2}{n} = \frac{n-1}{n}\mathrm{Var}(\bar X)$. The point of this example is that the bootstrap is consistent with what is expected, in the same way that basic Monte Carlo is expected to work, but now we only deal with the original sample of size $n$ by resampling it.
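A minimal bootstrap sketch for the standard deviation of the sample mean; $B$ and the data-generating distribution are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(8)
    n, B = 100, 5000
    x = rng.exponential(size=n)              # the observed sample (illustrative)

    # B bootstrap resamples: draw n points from the data with replacement
    idx = rng.integers(0, n, size=(B, n))
    theta_star = x[idx].mean(axis=1)         # bootstrap values of the statistic

    sd_boot = theta_star.std(ddof=1)
    print("bootstrap SD:", sd_boot,
          " plug-in sigma_hat/sqrt(n):", x.std() / np.sqrt(n))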
4 Markov Chains
A time-homogeneous Markov chain $\{X_0, X_1, \ldots, X_n, \ldots\}$ taking values in a finite set $S$ of size $N$ is characterized by its transition probabilities
$$P(x, y) = P\{X_{n+1} = y | X_n = x\} \ge 0, \quad x, y \in S, \quad \sum_y P(x, y) = 1,$$
which are independent of $n$ because of time homogeneity. The Markov property means that conditional probabilities depend only on the latest information
$$P\{X_n = y | X_0, X_1, \ldots, X_{n-1}\} = P\{X_n = y | X_{n-1}\}$$
and therefore the joint probability of $\{X_n\}_{n\ge 0}$, in path space, is expressed as a product of transition probabilities
$$P\{X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0\} = \pi_0(x_0)P(x_0, x_1)\cdots P(x_{n-1}, x_n)$$
where $\pi_0(x) = P\{X_0 = x\}$. Path probabilities fully determine the $S$-valued process $\{X_n\}_{n\ge 0}$, and for Markov chains the initial probabilities $\pi_0$ and the transition probabilities $P$ determine everything.
The n-step transition probability
$$P\{X_n = y | X_0 = x\} = P_x\{X_n = y\}$$
is obtained recursively by using the Markov property and time homogeneity. We have, introducing also some notation and using the law of iterated conditional expectation, that
$$P_n(x, y) = P_x\{X_n = y\} = E_x\{P\{X_n = y | X_1, X_0 = x\}\} = E_x\{P\{X_n = y | X_1\}\} \quad \text{(Markov property)}$$
$$= E_x\{P_{X_1}\{X_{n-1} = y\}\} \quad \text{(time homogeneity)} \qquad = \sum_{z\in S} P(x, z)P_z\{X_{n-1} = y\} = \sum_{z\in S} P(x, z)P_{n-1}(z, y)$$
so that $P_n = (P_n(x, y))_{x,y\in S}$ is identified as the $n$-th power of the $N \times N$ transition matrix $P = (P(x, y))_{x,y\in S}$. If $\pi_0(x) = P\{X_0 = x\}$, $x \in S$, is the initial probability of the chain then
$$\pi_n(x) = \sum_{z\in S}\pi_{n-1}(z)P(z, x) = \sum_{z\in S}\pi_0(z)P_n(z, x), \quad n = 1, 2, \ldots,\ x \in S.$$
Thus probability vectors $\pi_n = (\pi_n(x))_{x\in S}$ can be considered to be $N$-row vectors that get updated recursively by multiplying the transition matrix on the left: $\pi_n = \pi_{n-1}P$.

Expectations of functions of the state,
$$u_n(x) = E_x\{f(X_n)\} = \sum_{z\in S}P(x, z)u_{n-1}(z) = \sum_{z\in S}P_n(x, z)f(z), \quad n = 1, 2, 3, \ldots,\ x \in S,$$
can similarly be considered as $N$-column vectors $u_n = (u_n(x))_{x\in S}$ that are updated recursively by multiplying $P$ on the right, $u_n = Pu_{n-1}$, starting with the initial column vector $f = (f(x))$ when $X_0 = x$.
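A minimal sketch of these two updates for a small illustrative chain (the matrix and the function $f$ are arbitrary choices):

    import numpy as np

    # illustrative 3-state transition matrix (rows sum to 1)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    pi0 = np.array([1.0, 0.0, 0.0])     # start in state 0
    f = np.array([1.0, -1.0, 2.0])      # a function on the state space

    pi_n, u_n = pi0.copy(), f.copy()
    for _ in range(50):
        pi_n = pi_n @ P                 # row vector update: pi_n = pi_{n-1} P
        u_n = P @ u_n                   # column vector update: u_n = P u_{n-1}

    print("pi_50 =", pi_n)              # close to the invariant probability
    print("u_50  =", u_n)               # close to (pi f) 1 for an ergodic chain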
4.1 Exit times
Let $C \subset S$ and let $T = T_x$ be the first time to enter the complement of $C$, $C^c$, starting from $x \in C$, which is also the first exit time from $C$:
$$T = \min\{n \ge 1\ |\ X_n \notin C\}$$
This is a random variable that takes integer values and is such that for all $n \ge 0$ the events $\{T > n\}$ depend only on the Markov chain up to time $n$, $\{X_1, X_2, \ldots, X_n\}$. It is called a stopping time. Clearly the exit time of the Markov chain from a subset $C$ of the state space $S$ is a stopping time.

We want to find a linear system of equations satisfied by $v_n(x) = P_x\{T > n\} = 1 - P_x\{T \le n\}$, which is the probability distribution of $T$, starting from $x$ and with $n = 0, 1, \ldots$. We use a first transition or renewal analysis that relies on the Markov property and time homogeneity as follows. For $x \in C$ we have
$$v_n(x) = P_x\{T > n\} = E_x\{P\{T > n | X_1, X_0 = x\}\} = E_x\{P\{T > n | X_1\}\} = \sum_{y\in C}P(x, y)P_y\{T > n - 1\} = \sum_{y\in C}P(x, y)v_{n-1}(y), \quad n = 1, 2, 3, \ldots, \quad v_0(x) = 1_{\{x\in C\}}$$
To see why $P\{T > n | X_1\} = P_{X_1}\{T > n - 1\}$, and to explain the notation, we first write $T = T(X_0, X_1, X_2, \ldots)$ to indicate that the exit time depends on the path of the Markov chain. By the definition of the exit time from $C$, $T > 0$ when $X_0 = x \in C$. With $n = 1, 2, \ldots$, after one time unit has passed, $I_{\{T(X_0, X_1, X_2, \ldots) > n\}} = I_{\{1 + T(X_1, X_2, \ldots) > n\}}$ when $X_1 \in C$. In words, after one time step the Markov chain restarts from the state $X_1 \in C$ to which it went, and the exit time increases by one unit. Time homogeneity leads then to the result above.

If we let $P_C = (P(x, y),\ x, y \in C)$, which is a sub-stochastic transition matrix since in general its row sums are less than one, then for the vector $v_n = (v_n(x))_{x\in C}$ we have the linear recursion
$$v_n = P_C v_{n-1}, \quad v_0 = \mathbf{1}$$
Note again that the column vectors $v_n$ are restricted to elements $x$ in $C$ and $\mathbf{1}$ is the column vector of all ones. Alternatively we can write the recursion as an initial-boundary value problem:
$$v_n(x) = \sum_{y\in S}P(x, y)v_{n-1}(y), \quad n = 1, 2, \ldots,\ x \in C$$
with
$$v_0(x) = 1,\ x \in C, \qquad v_n(x) = 0,\ x \notin C,\ n = 0, 1, \ldots$$
Let us define the norm of row vectors to be the $l_1$ norm
$$\|q\| = \sum_{y\in S}|q(y)|$$
and the norm of column vectors to be the maximum norm
$$\|f\| = \max_x |f(x)|$$
Then the induced matrix norm is
$$\|Q\| = \max_x \sum_y |Q(x, y)|$$
and we see that the norm of transition matrices $P$ is one, but the norm of $P_C$ is in general less than one. This is the case if transitions from states in $C$ to states in the complement occur with positive probability. Therefore, since $\|Q^2\| \le \|Q\|^2$, we conclude that, in general, $\|v_n\| \to 0$ as $n \to \infty$, assuming that $\|P_C\| < 1$. This means that the exit time from $C$, $T$, is finite with probability one, as should be expected in this case.
Consider as another example the calculation of the mean exit time from $C$, $E_x\{T\}$. For some fixed $0 < s < 1$ define the moment generating function of the exit time
$$u(x; s) = E_x\{s^T\}$$
Using the first transition or renewal argument, as above, we find that $u$ satisfies the linear system
$$u(x; s) = s\sum_{y\in C^c}P(x, y) + s\sum_{y\in C}P(x, y)u(y; s), \quad x \in C$$
Assuming a finite expectation for the exit time we have that
$$\bar u(x) = \frac{d}{ds}u(x; s)\Big|_{s=1} = E_x\{T\}, \quad x \in C$$
By differentiating the equation for $u(x; s)$ with respect to $s$ and setting $s = 1$ we get, after some rearrangement,
$$\bar u(x) = 1 + \sum_{y\in C}P(x, y)\bar u(y), \quad x \in C$$
and in vector form
$$P_C\bar u - \bar u = -\mathbf{1}$$
Since the norm of $P_C$ is less than one, as it is in general when the Markov chain can reach states in $C^c$ from states in $C$, we have that
$$\bar u = (I - P_C)^{-1}\mathbf{1}$$
Another interesting application of these ideas is the calculation of $u(x) = E_x\{f(X_T)\}$, where $f(x)$ is a function defined for $x \in C^c$. For example, when $f(x) = 1\{x = z\}$, $u(x) = u(x; z)$ is the probability of exiting $C$ by going to $z \in C^c$. We leave this as an exercise.
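A minimal numerical sketch of the mean exit time formula $\bar u = (I - P_C)^{-1}\mathbf{1}$; the chain and the set $C$ are illustrative choices:

    import numpy as np

    # illustrative 4-state chain; C = {0, 1}, exit when the chain enters {2, 3}
    P = np.array([[0.4, 0.3, 0.2, 0.1],
                  [0.2, 0.5, 0.1, 0.2],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    C = [0, 1]

    P_C = P[np.ix_(C, C)]                    # sub-stochastic block on C
    u_bar = np.linalg.solve(np.eye(len(C)) - P_C, np.ones(len(C)))
    print("E_x{T} for x in C:", u_bar)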
4.2 Transience and recurrence
Consider a finite state Markov chain with transition probabilities $P(x, y)$, $x, y \in S$, and let $F_n(x, y)$ be the probability that $y$ is reached for the first time at time $n$, starting from $x$:
$$F_n(x, y) = P\{X_n = y, X_{n-1} \ne y, \ldots, X_1 \ne y | X_0 = x\}, \quad n = 1, 2, \ldots$$
Note that $F_n(x, y) = P_x\{T_y = n\}$ where $T_y$ is the first time that the state $y$ is reached. We want to show that
$$P_n(x, y) = \sum_{m=1}^n F_m(x, y)P_{n-m}(y, y), \quad n = 1, 2, \ldots$$
In terms of generating matrix functions, $P(x, y; s) = \sum_{n=0}^\infty s^nP_n(x, y)$ and $F(x, y; s) = \sum_{n=1}^\infty s^nF_n(x, y)$, we have that
$$P(x, y; s) = \delta_{x,y} + F(x, y; s)P(y, y; s)$$
for $0 \le s \le 1$, and where $\delta_{x,y}$ is zero when $x \ne y$ and one otherwise. A state $y$ is said to be persistent or recurrent if $F(y, y; 1) = \sum_{n=1}^\infty F_n(y, y) = 1$, which means that $T_y < \infty$ with probability one, starting from $y$; otherwise it is transient.
Note that the event $\{T_y = n\} = \{X_0 = x, X_1 \ne y, \ldots, X_{n-1} \ne y, X_n = y\}$ and so it depends on the path only up to time $n$. We thus have
$$P\{X_n = y | X_0 = x\} = \sum_{m=1}^n P\{X_n = y, T_y = m | X_0 = x\} = \sum_{m=1}^n P\{X_n = y | T_y = m, X_0 = x\}P\{T_y = m | X_0 = x\}$$
$$= \sum_{m=1}^n P\{X_n = y | X_m = y\}P\{T_y = m | X_0 = x\} = \sum_{m=1}^n F_m(x, y)P_{n-m}(y, y)$$
Then, write
$$P(x, y; s) = \sum_{n=0}^\infty s^nP_n(x, y) = \delta_{x,y} + \sum_{n=1}^\infty s^nP_n(x, y) = \delta_{x,y} + \sum_{n=1}^\infty s^n\Big(\sum_{m=1}^n F_m(x, y)P_{n-m}(y, y)\Big)$$
by the previous result. Interchanging the summations we get the result
$$P(x, y; s) = \delta_{x,y} + F(x, y; s)P(y, y; s)$$
We note that everything is non-negative and so monotone increasing in $s \uparrow 1$. From
$$P(y, y; s) = \frac{1}{1 - F(y, y; s)}$$
we see that as $F(y, y; 1)$ increases to one, $P(y, y; 1)$ increases to infinity. Similarly, from
$$F(y, y; s) = \frac{P(y, y; s) - 1}{P(y, y; s)}$$
we see that as $P(y, y; 1)$ increases to infinity, $F(y, y; 1)$ increases to one. Thus $y$ is recurrent if and only if $\sum_n P_n(y, y) = \infty$.
4.3 Invariant probabilities
We now want to address the long time behavior of the Markov chain $X_n$ and in particular to study its ergodic properties. We assume that the state space $S$ is finite (although this is not necessary for the method used; compactness is) and the transition probabilities are uniformly positive:
$$P(x, y) \ge \frac{\delta}{N}, \quad x, y \in S$$
for some $\delta > 0$ and with $N = \#(S)$. With simple modifications the arguments below extend to the case where this condition holds for $P$ raised to some fixed power. We are simply requiring that the Markov chain be irreducible and aperiodic. Irreducible means that every state can be reached from every other state in a finite number of time steps (at most $N$) with positive probability. Aperiodic means that the greatest common divisor of return times (with positive probability) to states is one, for all states.

We will prove the following basic theorem. For any probability vectors $\pi_1$ and $\pi_2$ we have that
$$\|\pi_1P - \pi_2P\| \le \Big(1 - \frac{\delta}{2}\Big)\|\pi_1 - \pi_2\|$$
which means that $P$ is a strict contraction when acting on differences of probability vectors.

Using this theorem we will then prove, by the contraction mapping iteration process, that $P$ has a unique invariant probability vector $\pi$,
$$\pi = \pi P \quad \Big(\text{in components: } \pi(x) = \sum_{y\in S}\pi(y)P(y, x),\ x \in S\Big)$$
and that this invariant probability is approached exponentially fast for any starting probability $\pi_0$:
$$\|\pi_0P^n - \pi\| \le 2\lambda^n, \quad 0 < \lambda < 1\ \Big(\lambda = 1 - \frac{\delta}{2}\Big)$$
To prove the contraction property, let $q = \pi_1 - \pi_2$, which is a vector with positive and negative entries such that $\sum_{y\in S}q(y) = 0$. Let $q = q^+ + q^-$ where $q^+$ has all non-negative elements and $q^-$ has all negative ones. Clearly $\sum_{y\in S}q^+(y) = -\sum_{y\in S}q^-(y)$ and $\|q\| = \|q^+\| + \|q^-\| = 2\|q^+\|$. We now have the following:
$$\|qP\| = \sum_y |qP(y)| = \sum_y |q^+P(y) + q^-P(y)| = \sum_{y\in S^+}[q^+P(y) + q^-P(y)] - \sum_{y\in S^-}[q^+P(y) + q^-P(y)]$$
and continuing,
$$= q^+\Big(\sum_{y\in S^+}P - \sum_{y\in S^-}P\Big) - q^-\Big(\sum_{y\in S^-}P - \sum_{y\in S^+}P\Big) \le \|q^+\|\Big(1 - \frac{\delta N^-}{N}\Big) + \|q^+\|\Big(1 - \frac{\delta N^+}{N}\Big) = \|q^+\|(2 - \delta) = 2\|q^+\|\Big(1 - \frac{\delta}{2}\Big) = \|q\|\Big(1 - \frac{\delta}{2}\Big)$$
which gives what we wanted. Here $S^\pm$ denotes the set of states where the entries in the sum are positive and negative, respectively, and $N^\pm$ is the size of these sets.
To show that $\pi = \pi P$ has a unique invariant probability vector solution, we define the sequence $\pi_n = \pi_{n-1}P$, with $\pi_0$ any probability vector. We have that
$$\|\pi_n - \pi_{n-1}\| = \|(\pi_{n-1} - \pi_{n-2})P\| < \lambda\|\pi_{n-1} - \pi_{n-2}\|$$
where $\lambda = 1 - \frac{\delta}{2} < 1$. Iterating backwards we have that
$$\|\pi_n - \pi_{n-1}\| < \lambda^{n-1}\|\pi_1 - \pi_0\|$$
From this we conclude that the probability vectors $\pi_n$ are a Cauchy sequence,
$$\sup_m \|\pi_{n+m} - \pi_n\| \to 0$$
as $n \to \infty$, and therefore $\pi_n$ has a limit $\pi$, which is a probability vector. Passing to the limit in $\pi_n = \pi_{n-1}P$ we see that $\pi$ is an invariant probability vector, and since it is the unique limit of a Cauchy sequence it must be the unique invariant probability vector. We also have exponential convergence:
$$\|\pi_0P^n - \pi\| = \|\pi_0P^n - \pi P^n\| < \lambda^n\|\pi_0 - \pi\|$$
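A minimal sketch computing the invariant probability directly, by solving $\pi(P - I) = 0$ together with the normalization $\sum\pi = 1$ (same illustrative matrix idea as before):

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    N = P.shape[0]

    # Solve pi (P - I) = 0 together with sum(pi) = 1 as a least-squares system.
    A = np.vstack([(P - np.eye(N)).T, np.ones(N)])
    b = np.concatenate([np.zeros(N), [1.0]])
    pi = np.linalg.lstsq(A, b, rcond=None)[0]

    print("pi =", pi, " check pi P - pi =", pi @ P - pi)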
4.4 The ergodic theorem
For a Markov chain in a finite state space the ergodic theorem is valid under the hypotheses and results of the previous section, and gives an important interpretation of the invariant probabilities $\pi(x)$. We have that
$$\frac{1}{n}\sum_{j=1}^n 1\{X_j = z\} \to \pi(z)$$
in mean square as $n \to \infty$, for any $z \in S$ and any initial probability vector $P\{X_0 = x\} = \pi_0(x)$. In words, the invariant probability vector is the limit in mean square of the relative time spent in state $z$ as $n \to \infty$.

More generally, for any function $f(x)$ on the state space we have
$$\frac{1}{n}\sum_{j=1}^n f(X_j) \to \sum_{y\in S}\pi(y)f(y) = \pi f$$
in mean square as $n \to \infty$, for any initial probability vector $P\{X_0 = x\} = \pi_0(x)$. If we take expectations on the left we have
$$E_x\Big\{\frac{1}{n}\sum_{j=1}^n f(X_j)\Big\} = \frac{1}{n}\sum_{j=1}^n E_x\{f(X_j)\} = \frac{1}{n}\sum_{j=1}^n P^jf(x)$$
If we let $\pi_0 = \pi_x$, the probability vector concentrated at $x$, we have
$$E_x\Big\{\frac{1}{n}\sum_{j=1}^n f(X_j)\Big\} = \frac{1}{n}\sum_{j=1}^n \pi_0P^jf \to \pi f$$
by the results of the previous section. We want to show here that not only the means converge but that the random time averages converge in mean square to the average with respect to the invariant probability.
The main step in proving the ergodic theorem is the introduction of a new function $\Phi(x)$ that converts approximately the time average to an expression that is easy to handle because it is a martingale. This function is the solution of the Poisson equation (by analogy with PDEs)
$$(P - I)\Phi = -f + \pi f$$
More explicitly, this system of equations for $\Phi = (\Phi(x))_{x\in S}$ has the form
$$\Phi(x) = f(x) - \pi f + \sum_{y\in S}P(x, y)\Phi(y), \quad x \in S$$
Note that the right hand side of the system $(P - I)\Phi = -f + \pi f$ has mean zero, or inner product zero, with respect to $\pi$: $\pi(-f + \pi f) = 0$. This then is a necessary condition for the solvability of the Poisson equation, since the $\pi$ average of the left side is always zero, for any $\Phi$:
$$\pi(P - I)\Phi = 0, \quad \text{or} \quad \sum_{x\in S}\pi(x)\Big(\sum_{y\in S}P(x, y)\Phi(y) - \Phi(x)\Big) = 0$$
Of course, the Poisson equation does not have a unique solution since $\mathbf{1}$ is an invariant right vector: $(P - I)\mathbf{1} = 0$. Thus $\Phi + c\mathbf{1}$ is a solution for any constant $c$.

Without loss of generality we may assume that $f$ is such that $\pi f = 0$, as we may replace $f$ by $f - \pi f$. We now show that the Poisson equation has a solution. If there is a solution, we can write
$$\Phi = (I - P)^{-1}f = \sum_{n=0}^\infty P^nf$$
The sum is, however, convergent because
$$\|P^nf\| = \|(P^n - \pi)f\| < 2\lambda^n\|f\|$$
To see this, we let $\pi_x$ be the probability row vector that is equal to one at $x$ and zero elsewhere. We then have that $P^nf(x) = \pi_xP^nf$ and hence, with $\pi f = 0$, $\|P^nf\| = \max_x|\pi_xP^nf - \pi f| \le \max_x\|\pi_xP^n - \pi\|\,\|f\| \le 2\lambda^n\|f\|$ by the results of the previous section.

Once we have a fixed solution $\Phi(x)$ we form the collapsing (telescoping) sum
$$\Phi(X_{n+1}) - \Phi(X_1) = \sum_{j=1}^n (\Phi(X_{j+1}) - \Phi(X_j)) = \sum_{j=1}^n [\Phi(X_{j+1}) - E\{\Phi(X_{j+1})|X_j\}] + \sum_{j=1}^n [E\{\Phi(X_{j+1})|X_j\} - \Phi(X_j)] = \sum_{j=1}^n [\Phi(X_{j+1}) - P\Phi(X_j)] + \sum_{j=1}^n [P\Phi(X_j) - \Phi(X_j)]$$
Using the Poisson equation that $\Phi$ satisfies, rearranging and dividing by $n$, we have
$$\frac{1}{n}\sum_{j=1}^n f(X_j) = -\frac{1}{n}(\Phi(X_{n+1}) - \Phi(X_1)) + \frac{1}{n}M_{n+1}$$
where
$$M_{n+1} = \sum_{j=1}^n [\Phi(X_{j+1}) - P\Phi(X_j)]$$
This representation of the time average whose limit we want is important because, up to the first term on the right that goes to zero as $n \to \infty$, the second term is a martingale $M_{n+1}$ divided by $n$. The martingale property is
$$E\{M_{n+1}|X_0, X_1, \ldots, X_n\} = M_n$$
which is easily verified since the conditional expectation of the last term in the sum for $M_{n+1}$ is zero. We also have that $E_x\{M_{n+1}\} = 0$.

Now the proof can be completed by noting that the variance of $M_{n+1}$ has the form
$$E_x\{M_{n+1}^2\} = \sum_{j=1}^n E_x\{[\Phi(X_{j+1}) - P\Phi(X_j)]^2\}$$
because cross terms have mean zero, as they are for sums of zero mean independent (or only uncorrelated) random variables. This is the essential point for introducing $\Phi$ and using it to get $M_n$: we can calculate variances as if we were dealing with sums of independent, mean zero random variables. We now conclude that
$$\frac{1}{n^2}E_x\{M_{n+1}^2\} \le \frac{\text{constant}}{n}$$
which implies the result we want:
$$E_x\Big\{\Big[\frac{1}{n}\sum_{j=1}^n f(X_j)\Big]^2\Big\} \to 0$$
in the case $\pi f = 0$, which we have assumed as noted above.
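A minimal simulation sketch of the ergodic theorem, comparing the time average against $\pi f$ for the illustrative chain used above:

    import numpy as np

    rng = np.random.default_rng(9)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
    f = np.array([1.0, -1.0, 2.0])

    # invariant probability via a high matrix power
    pi = np.linalg.matrix_power(P, 200)[0]

    n, x = 50_000, 0
    total = 0.0
    for _ in range(n):
        x = rng.choice(3, p=P[x])       # one step of the chain
        total += f[x]

    print("time average:", total / n, " pi f:", pi @ f)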
4.5 The central limit theorem for Markov chains
Let
$$\sigma^2(x) = E_x\{(\Phi(X_1) - P\Phi(x))^2\}$$
Under the same conditions for which we have proved the ergodic theorem in the previous section we also have a central limit theorem, as follows. The scaled difference satisfies
$$\sqrt n\Big(\frac{1}{n}\sum_{j=1}^n f(X_j) - \pi f\Big) \to N(0, \pi\sigma^2), \quad n \to \infty$$
in distribution, where in particular
$$\lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^n \sigma^2(X_{j+1}) = \pi\sigma^2$$
in mean square, by the ergodic theorem. In fact this CLT simply follows from the more general martingale central limit theorem, in view of the representation of the time average of $f$ that we obtained in the previous section in terms of an asymptotically small term and a scaled martingale.

We will not go through the detailed calculations but note here that it is enough to show that for any $\xi \in \mathbb{R}$ we have that
$$\lim_{n\to\infty}E_x\{e^{\frac{i\xi}{\sqrt n}M_{n+1}}\} = e^{-\frac{\xi^2}{2}\pi\sigma^2}$$
which proves the CLT through the limit of characteristic functions. The ergodic theorem is used here as already noted. We also note that whereas the solution $\Phi$ of the Poisson equation played a role only in streamlining the proof of the ergodic theorem, in the CLT it plays a basic role as it enters into the form of the limit variance. There are other, equivalent forms for the limit variance $\pi\sigma^2$, in the form of sums which correspond to expansions of $\Phi$.
4.6 Expected number of visits to a state and the invariant probabilities

Let $x$, $z$ and $y$ be three states in $S$, not necessarily all different, and consider the expected number of visits to $z$ before reaching $y$, starting from $x$:
$$u(x; z, y) = E_x\Big\{\sum_{n=1}^\infty 1\{X_n = z\}1\{T_y \ge n\}\Big\} = E_x\Big\{\sum_{n=1}^{T_y} 1\{X_n = z\}\Big\}$$
where $T_y$ is the first time to reach $y$,
$$T_y = \inf\{n \ge 1\ |\ X_n = y\}$$
This is a stopping time and $I_{\{T_y \ge n\}}$ depends on, or is a function of, only $\{X_0, X_1, \ldots, X_{n-1}\}$. When we start at $y$ then with the current definition $T_y$ is the first time to return to $y$. Clearly
$$\sum_z u(x; z, y) = E_x\{T_y\}$$
which is the expected time to reach $y$. When $y = x$ then $\sum_z u(x; z, x)$ is the expected time to return to $x$, after starting from it. For any bounded function $f$ on $S$ we have that
$$u_f(x, y) = \sum_{z\in S}u(x; z, y)f(z) = E_x\Big\{\sum_{n=1}^{T_y} f(X_n)\Big\}$$
the starting point x, wewrite
u(x; z, y) = Ex{n=1
1{Xn = z}w
1{Xn1 = w}{1{Ty n}}
=w
n=1
Ex{E{1{Xn = z}1{Xn1 = w}{1{Ty n}|X0, X1, ..., Xn1}}
using iterated conditional expectation. Using the properties of
the stopping time and theMarkov property we have
w
n=1
Ex{E{1{Xn = z}1{Xn1 = w}{1{Ty n}|X0, X1, ..., Xn1}} (13)
=w
P (w, z)
n=1
Ex{1{Xn1 = w}1{Ty n}}
36
-
We interchange expectation and summation to get
w
P (w, z)Ex{n=1
1{Xn1 = w}1{Ty n}}
=w
P (w, z)Ex{Tyn=1
1{Xn1 = w}}
(14)
By time homogeneity we have
w
P (w, z)Ex{Tyn=1
1{Xn1 = w}}
=w
P (w, z)Ex{Ty1n=0
1{Xn = w}}
=w
P (w, z)
Ex{ Tyn=1
1{Xn = w}}+ xw yw
=w
u(x;w, y)P (w, z) + P (x, z) P (y, z) (15)
In vector-matrix form we have the following system for $u$ as a row vector in $z$:
$$u(I - P) = \pi_xP - \pi_yP$$
where $\pi_x$ is the probability vector with component equal to one at $z = x$ and zero otherwise. This system is always solvable since $(\pi_xP - \pi_yP)\mathbf{1} = 0$, that is, the right hand side is orthogonal to the right null-vector of $I - P$, which is $\mathbf{1}$. When $x = y$, the right hand side is zero and we see that, in this case, $u$ is proportional to the invariant probability vector $\pi$:
$$\pi(z) = \frac{u(x; z, x)}{\sum_z u(x; z, x)} = \frac{u(x; z, x)}{E_x\{T_x\}}$$
In words, the expected number of visits to $z$ between successive visits to $x$, divided by the expected number of time steps between successive visits to $x$, is the invariant probability $\pi(z)$. Furthermore, when $x = z$ there is only one visit to $x$ after leaving it, between successive visits, and so
$$\pi(x) = \frac{1}{E_x\{T_x\}}$$
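A small simulation sketch of the return-time identity $\pi(x) = 1/E_x\{T_x\}$, with the same illustrative chain as before:

    import numpy as np

    rng = np.random.default_rng(11)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])

    def mean_return_time(x0: int, trials: int = 20_000) -> float:
        total = 0
        for _ in range(trials):
            x, t = x0, 0
            while True:
                x = rng.choice(3, p=P[x])
                t += 1
                if x == x0:
                    break
            total += t
        return total / trials

    pi = np.linalg.matrix_power(P, 200)[0]
    print("1/E_0{T_0} =", 1.0 / mean_return_time(0), "  pi(0) =", pi[0])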
4.7 Return times and the ergodic theorem
Consider again an ergodic Markov chain on a finite set S, with
transition probabilitiesP (x, y) > 0. Fix a state x and let T1,
T2, T3, . . . be the successive return times to x,starting from it.
We show first that they are independent, identically distributed
randomvariables. We use the ergodic theorem for the Markov chain to
show that the mean of thereturn time to x is equal to the
reciprocal of the invariant probability pi(x).
We show independence for the first two return times. Let
Fn(x) = P{T1 = n|X0 = x} = P{Xn = x,Xn1 6= x, . . .X1 6= x|X0 =
x}
We then have
P{T2 = m,T1 = n|X0 = x}= P{Xn+m = x,Xn+m1 6= x, . . . ,Xn+1 6=
x,Xn = x,Xn1 6= x, . . .X1 6= x|X0 = x}= P{Xn+m = x,Xn+m1 6= x, . .
. ,Xn+1 6= x|Xn = x,Xn1 6= x, . . .X1 6= x,X0 = x}
P{Xn = x,Xn1 6= x, . . .X1 6= x|X0 = x}= P{Xn+m = x,Xn+m1 6= x,
. . . ,Xn+1 6= x|Xn = x}Fn(x) Markov property= P{Xm = x,Xm1 6= x, .
. . ,X1 6= x|X0 = x}Fn(x) Time homogeneity= Fm(x)Fn(x)
We can also write this as

$$\begin{aligned}
P\{T_2 = m, T_1 = n \mid X_0 = x\}
&= P\{X_{n+m} = x, X_{n+m-1} \ne x, \ldots, X_{n+1} \ne x \mid X_n = x\}\, F_n(x) \quad \text{(Markov property)}\\
&= P\{T_1 + T_2 = n + m \mid T_1 = n, X_n = x\}\, P\{T_1 = n \mid X_0 = x\}\\
&= P\{T_2 = m \mid X_{T_1} = x\}\, P\{T_1 = n \mid X_0 = x\},
\end{aligned}$$

from which we see that $P\{T_2 = m \mid X_{T_1} = x\} = F_m(x)$. Both the Markov property and the time homogeneity have been used. Similarly, $T_1, T_2, T_3, \ldots$ are independent random variables. Their distribution is identical because of the time homogeneity of the Markov chain.
We argue that we can replace $N$ by $\sum_{k=1}^N T_k$ in the ergodic theorem. That is,

$$\lim_{N \to \infty} \frac{1}{\sum_{k=1}^N T_k} \sum_{n=1}^{\sum_{k=1}^N T_k} f(X_n) = \pi f$$

in mean square. Setting $f(y) = 1\{y = x\}$ we obtain

$$\lim_{N \to \infty} \frac{1}{\frac{1}{N}\sum_{k=1}^N T_k} = \pi(x).$$
But by the law of large numbers the left hand side is just $1/E\{T_1\}$. To complete the proof we need to show that we can replace the index $N$ in the ergodic theorem by $M_N = \sum_{k=1}^N T_k$, in probability as $N \to \infty$. But $M_N/N \to \mu$ in mean square (hence in probability), where $\mu = E\{T_1\} > 0$, by the standard law of large numbers. In more detail, we need to show that

$$\Big|\frac{1}{M_N} \sum_{n=1}^{M_N} f(X_n) - \frac{1}{[\mu N]} \sum_{n=1}^{[\mu N]} f(X_n)\Big| \to 0$$

in probability as $N \to \infty$. Here $[a]$ denotes the integer part of $a$. But we have the estimate

$$\Big|\frac{1}{M_N} \sum_{n=1}^{M_N} f(X_n) - \frac{1}{[\mu N]} \sum_{n=1}^{[\mu N]} f(X_n)\Big| \le 2\, \Big|\frac{[\mu N]}{M_N} - 1\Big|\, \|f\|_\infty,$$

and since $M_N/[\mu N] \to 1$ in probability, we also have that $[\mu N]/M_N \to 1$ in probability, which completes the proof.
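A quick simulation makes this concrete (a minimal sketch; the 3-state chain below is an arbitrary illustrative choice): we estimate $\pi(x)$ as the reciprocal of the empirical mean return time and compare with the eigenvector solution of $\pi P = \pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.2, 0.5, 0.3],    # illustrative ergodic chain, P(x, y) > 0
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
x, n_steps = 0, 100_000

state, last, times = x, 0, []
for n in range(1, n_steps + 1):
    state = rng.choice(3, p=P[state])
    if state == x:
        times.append(n - last)    # a completed return time T_k
        last = n

print("1 / mean return time:", 1.0 / np.mean(times))
w, V = np.linalg.eig(P.T)         # compare with pi(x) from pi P = pi
pi = np.real(V[:, np.argmax(np.real(w))]); pi /= pi.sum()
print("pi(x):               ", pi[x])
```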
4.8 MLE for Markov chains
Let $\{X_n,\ n \ge 0\}$ be a Markov chain in a finite state space $S$ with transition probability $P(x, y; \theta)$, where $\theta$ is a real valued parameter. The goal is to estimate $\theta$ using the observations $\{X_i,\ 0 \le i \le n\}$. We assume that the Markov chain is ergodic, uniformly in $\theta$ in some fixed interval of interest, and that we have a positive lower bound for $P(x, y; \theta)$. In this section we will carry out the following.
1. Let $Y_n = (X_n, X_{n-1})^T$, $n \ge 1$, which is also a Markov chain. We will show that the transition probability for $Y_n$ is given by $Q(x_1, x_2; y_1, y_2) = P\{Y_n = (y_1, y_2)^T \mid Y_{n-1} = (x_1, x_2)^T\} = 1\{x_1 = y_2\}\, P(y_2, y_1)$.

2. Let $\pi$ be the invariant probability vector for $X_n$: $\sum_{x \in S} \pi(x) P(x, y) = \pi(y)$. We will show that an invariant probability vector for $Y_n$ has components $\pi(x_2) P(x_2, x_1)$. By calculating $Q^2$, the two-step transition probabilities, we conclude that this invariant vector is unique and is approached exponentially fast. We in fact reduce this problem to the standard case analyzed in earlier sections.

3. Suppose the initial state $x_0$ is given, and suppose $\theta_0$ is the true value of the underlying parameter. We use the ergodic theorem to show that as $n \to \infty$ the normalized log likelihood function

$$l_n(\theta) = \frac{1}{n} \log\big[P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n; \theta)\big] \to \sum_{x, y \in S} \pi(x; \theta_0) P(x, y; \theta_0) \log P(x, y; \theta) := l(\theta)$$

in probability.

4. Assume $P(x, y; \theta)$ is twice differentiable with respect to $\theta$. We show that $l'(\theta_0) = 0$ and $l''(\theta_0) < 0$.

5. We also use the delta method (as in the i.i.d. case) and the central limit theorem for Markov chains to get a CLT for the MLE $\hat{\theta}_n$ (where $l_n'(\hat{\theta}_n) = 0$).
We now go on to analyze the statements made above.
1. By the definition of conditional probability,

$$Q(x_1, x_2; y_1, y_2) = P\{Y_n = (y_1, y_2)^T \mid Y_{n-1} = (x_1, x_2)^T\} = \frac{P\{Y_n = (y_1, y_2)^T, Y_{n-1} = (x_1, x_2)^T\}}{P\{Y_{n-1} = (x_1, x_2)^T\}}
= \frac{\pi_{n-2}(x_2) P(x_2, x_1)\, 1\{x_1 = y_2\}\, P(y_2, y_1)}{\pi_{n-2}(x_2) P(x_2, x_1)} = 1\{x_1 = y_2\}\, P(y_2, y_1),$$

where $\pi_{n-2}$ denotes the distribution of $X_{n-2}$.
2. We verify that

$$\pi(y_2) P(y_2, y_1) = \sum_{x_1, x_2} \pi(x_2) P(x_2, x_1)\, Q(x_1, x_2; y_1, y_2) = \sum_{x_1, x_2} \pi(x_2) P(x_2, x_1)\, 1\{x_1 = y_2\}\, P(y_2, y_1),$$

which is a true relation because $\sum_x \pi(x) P(x, y) = \pi(y)$. We also have that

$$Q^2(x_1, x_2; y_1, y_2) = \sum_{z_1, z_2} Q(x_1, x_2; z_1, z_2)\, Q(z_1, z_2; y_1, y_2) = P(x_1, y_2)\, P(y_2, y_1) > 0,$$

the positivity holding for all (finitely many) states, and so the hypothesis of the ergodic theorem holds for $\{Y_n\}$.
3. By the ergodic theorem, which we have shown applies here, we have

$$l_n(\theta) = \frac{1}{n} \sum_{j=1}^n \log P(X_{j-1}, X_j; \theta) + \frac{1}{n} \log \pi_0(X_0) \;\xrightarrow[n \to \infty]{}\; \sum_{x, y \in S} \pi(x; \theta_0) P(x, y; \theta_0) \log P(x, y; \theta) = l(\theta),$$

where we indicate the dependence on the parameter explicitly.
4. Differentiability of $l(\theta)$ with respect to $\theta$ follows from the uniformity of the ergodic limit with respect to $\theta$; that is, the convergence is in mean square (or in probability), uniformly with respect to the parameter. We can then differentiate. We have

$$l'(\theta) = \sum_{x, y \in S} \pi(x; \theta_0) P(x, y; \theta_0)\, \frac{P'(x, y; \theta)}{P(x, y; \theta)},$$

where the prime denotes the derivative with respect to $\theta$, assumed a scalar parameter in $[\theta_1, \theta_2]$ with $\theta_0 \in (\theta_1, \theta_2)$. At $\theta = \theta_0$ we have

$$l'(\theta_0) = \sum_{x, y \in S} \pi(x; \theta_0)\, P'(x, y; \theta_0).$$

But for any $\theta$ we have $\sum_{x, y} \pi(x; \theta) P(x, y; \theta) = 1$, and so differentiating we get

$$0 = \sum_{x, y} \pi'(x; \theta) P(x, y; \theta) + \sum_{x, y} \pi(x; \theta) P'(x, y; \theta) = \sum_x \pi'(x; \theta) + \sum_{x, y} \pi(x; \theta) P'(x, y; \theta) = \sum_{x, y} \pi(x; \theta) P'(x, y; \theta),$$

since $\sum_x \pi'(x; \theta) = 0$ (differentiate $\sum_x \pi(x; \theta) = 1$). This implies that $l'(\theta_0) = 0$. In the same way we can calculate

$$l''(\theta) = \sum_{x, y \in S} \pi(x; \theta_0) P(x, y; \theta_0) \Big(\frac{P''(x, y; \theta)}{P(x, y; \theta)} - \frac{(P'(x, y; \theta))^2}{P^2(x, y; \theta)}\Big),$$

which, since the $P''$ term vanishes at $\theta = \theta_0$ (because $\sum_y P''(x, y; \theta) = 0$ for every $x$), leads to

$$l''(\theta_0) = -\sum_{x, y \in S} \pi(x; \theta_0)\, \frac{(P'(x, y; \theta_0))^2}{P(x, y; \theta_0)} = -I(\theta_0) < 0,$$

with $I(\theta_0)$ the Fisher information.
5. The maximum likelihood estimator $\hat{\theta}_n$ of $\theta_0$ is defined as the maximizer of $l_n(\theta)$. From the ergodic theorem we know that $l_n(\theta)$ is close in mean square to $l(\theta)$, which is concave near the true value $\theta_0$. Therefore (and this is not so easy to prove in detail, even in the iid case), for $\theta$ close enough to $\theta_0$ and $n$ large enough, $l_n(\theta)$ will be concave with high probability. This means that $\hat{\theta}_n$ must satisfy $l_n'(\hat{\theta}_n) = 0$, and we have that $P\{|\hat{\theta}_n - \theta_0| > \epsilon\} \to 0$ as $n \to \infty$ for any $\epsilon > 0$. Define the fluctuation error $Z_n$ by

$$\hat{\theta}_n = \theta_0 + \frac{1}{\sqrt{n}} Z_n.$$

To get a CLT for the error, as in the iid case, we use the delta method:

$$0 = l_n'(\hat{\theta}_n) = l_n'\Big(\theta_0 + \frac{1}{\sqrt{n}} Z_n\Big) = l_n'(\theta_0) + l_n''(\theta_0)\, \frac{1}{\sqrt{n}} Z_n + \ldots$$
where the dots signify terms that go to zero faster than $1/\sqrt{n}$ in mean square or in probability (not easy to prove in detail). Ignoring the higher order terms we get (approximately)

$$Z_n = -\sqrt{n}\, \frac{l_n'(\theta_0)}{l_n''(\theta_0)}.$$

By differentiation we have that

$$l_n'(\theta) = \frac{1}{n} \sum_{j=1}^n \frac{P'(X_{j-1}, X_j; \theta)}{P(X_{j-1}, X_j; \theta)}, \qquad
l_n''(\theta) = \frac{1}{n} \sum_{j=1}^n \Big(\frac{P''(X_{j-1}, X_j; \theta)}{P(X_{j-1}, X_j; \theta)} - \frac{(P'(X_{j-1}, X_j; \theta))^2}{P^2(X_{j-1}, X_j; \theta)}\Big),$$

where we have ignored the $\pi_0$ term, which goes to zero in the limit. The ergodic theorem applies to $l_n''(\theta_0)$, which tends to $l''(\theta_0) = -I(\theta_0)$ in mean square (or in probability). The next step is to apply the central limit theorem to $\sqrt{n}\, l_n'(\theta_0)$, if possible, where

$$\sqrt{n}\, l_n'(\theta_0) = \frac{1}{\sqrt{n}} \sum_{j=1}^n \frac{P'(X_{j-1}, X_j; \theta_0)}{P(X_{j-1}, X_j; \theta_0)}.$$

As discussed earlier in these notes, we would normally first transform the quantity of interest using the solution of a suitable Poisson equation, so that it is expressed as a martingale plus a term that goes to zero in the limit. The key observation here, however, is that $n\, l_n'(\theta_0)$ is a martingale already, with zero mean. In fact, we can verify the defining property of a martingale,

$$E\{n\, l_n'(\theta_0) \mid X_0, X_1, \ldots, X_{n-1}\} = (n-1)\, l_{n-1}'(\theta_0),$$

which clearly follows from the fact that

$$E\Big\{\frac{P'(X_{n-1}, X_n; \theta_0)}{P(X_{n-1}, X_n; \theta_0)} \,\Big|\, X_{n-1}\Big\} = \sum_y P'(X_{n-1}, y; \theta_0) = 0.$$

Now for a zero mean martingale we have the key property, as discussed in earlier sections,

$$\mathrm{var}\big(\sqrt{n}\, l_n'(\theta_0)\big) = E\Big\{\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^n \frac{P'(X_{j-1}, X_j; \theta_0)}{P(X_{j-1}, X_j; \theta_0)}\Big)^2\Big\} = E\Big\{\frac{1}{n} \sum_{j=1}^n \Big(\frac{P'(X_{j-1}, X_j; \theta_0)}{P(X_{j-1}, X_j; \theta_0)}\Big)^2\Big\}.$$

We can now apply the ergodic theorem (we only need convergence of the mean here) to the last sum, and we see that $\mathrm{var}(\sqrt{n}\, l_n'(\theta_0)) \to I(\theta_0)$, with $I(\theta_0)$ the Fisher information.
The CLT applied to the martingale $\sqrt{n}\, l_n'(\theta_0)$ now tells us that it converges in distribution (in law, weakly) to an $N(0, I(\theta_0))$ random variable. An application of Slutsky's theorem leads to the CLT for the error of the MLE,

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \to N\Big(0, \frac{1}{I(\theta_0)}\Big) \quad \text{in distribution},$$

just as in the iid case. The Cramér-Rao lower bound applies here (as for any asymptotically consistent estimator), and we conclude that the MLE is asymptotically efficient, as we should expect. Thus all of the theory of MLE carries over to Markov chains. What is a lot more difficult now is the actual calculation of the MLE $\hat{\theta}_n$, which can be done iteratively by the expectation-maximization (EM) algorithm or by other optimization methods, depending on the problem at hand.
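As a concrete illustration (a hypothetical two-state parametrization, not one from the notes), take $S = \{0, 1\}$ with $P(0, 1; \theta) = P(1, 0; \theta) = \theta$. Then $l_n(\theta)$ is maximized by $\hat{\theta}_n = (\#\text{flips})/n$, the Fisher information works out to $I(\theta) = 1/(\theta(1-\theta))$, and the CLT above predicts $\mathrm{var}(\sqrt{n}(\hat{\theta}_n - \theta_0)) \approx \theta_0(1-\theta_0)$. A minimal simulation check:

```python
import numpy as np

def simulate(theta, n, rng):
    """Two-state chain: flip the state with probability theta at each step."""
    X = np.empty(n + 1, dtype=int)
    X[0] = 0
    flips = rng.random(n) < theta
    for j in range(1, n + 1):
        X[j] = 1 - X[j - 1] if flips[j - 1] else X[j - 1]
    return X

rng = np.random.default_rng(1)
theta0, n, trials = 0.3, 2000, 500
errors = []
for _ in range(trials):
    X = simulate(theta0, n, rng)
    theta_hat = np.mean(X[1:] != X[:-1])   # MLE: observed flip frequency
    errors.append(np.sqrt(n) * (theta_hat - theta0))

print("empirical var:", np.var(errors))    # should be close to 1/I(theta0)
print("1/I(theta0):  ", theta0 * (1 - theta0))
```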
4.9 Bayesian filtering
We want to determine recursive filtering equations in the Markov chain context. Let $X = \{X_n : n \ge 0\}$ be a Markov chain in a finite state space $S$ that is not observed directly, and let $Z = (Z_n : n \ge 0)$ be an observed, noise-corrupted version of $X$ defined via the path probability densities given the Markov chain:

$$P\{Z_0 \in dz_0, Z_1 \in dz_1, \ldots, Z_n \in dz_n \mid X\} = \prod_{i=0}^n f(z_i; X_i)\, dz_i,$$

where $(f(\cdot\,; x) : x \in S)$ is a family of given density functions. The noisy observations are conditionally independent given the Markov chain. Let $\pi_0 = (\pi_0(x) : x \in S)$ be a prior distribution for the initial state $X_0$ (i.e. $\pi_0(x) = P(X_0 = x)$), and denote by

$$\pi_n(x) = P(X_n = x \mid Z_0, \ldots, Z_n)$$

the posterior of the state at time $n$ given the observations. We want to compute $\pi_{n+1}(x)$ recursively from $\pi_n(x)$, assuming that the transition probabilities $P(x, y)$, $x, y \in S$, of the Markov chain are known.
Let $Z^{(n)} = \{Z_0, Z_1, \ldots, Z_n\}$ be the observed noisy Markov chain up to time $n$. From the properties of conditional probabilities we have that

$$\pi_n(x) = P(X_n = x \mid Z^{(n)}) = \frac{P\{X_n = x, Z^{(n)}\}}{P\{Z^{(n)}\}} = \frac{P\{Z_n \mid X_n = x, Z^{(n-1)}\}\, P\{X_n = x, Z^{(n-1)}\}}{P\{Z^{(n)}\}},$$

where $P\{Z^{(n)}\}$ is the joint (unconditional) density of the noisy Markov chain evaluated at the observation $Z^{(n)}$. But by the law of total probability and the Markov property we have

$$P\{X_n = x, Z^{(n-1)}\} = \sum_{z \in S} P\{X_n = x \mid X_{n-1} = z, Z^{(n-1)}\}\, P\{X_{n-1} = z, Z^{(n-1)}\} = \sum_{z \in S} P\{X_n = x \mid X_{n-1} = z\}\, P\{X_{n-1} = z, Z^{(n-1)}\}$$

and

$$P\{X_{n-1} = z, Z^{(n-1)}\} = P\{X_{n-1} = z \mid Z^{(n-1)}\}\, P\{Z^{(n-1)}\} = \pi_{n-1}(z)\, P\{Z^{(n-1)}\}.$$

Note also that

$$P\{Z_n \mid X_n = x, Z^{(n-1)}\} = P\{Z_n \mid X_n = x\} = f(Z_n; x),$$

by the conditional independence of the noisy observations given the Markov chain.
We can now see how the recursion goes. We first update the state with the (assumed known) Markov chain transition probabilities:

$$\pi_{n-1} \to \pi_{n|n-1}(x) = \sum_{z \in S} \pi_{n-1}(z)\, P(z, x), \qquad \pi_0(x) \text{ given}.$$

Then we update with the observation:

$$\pi_{n|n-1} \to \pi_n(x) = \frac{f(Z_n; x)\, \pi_{n|n-1}(x)}{\sum_{x' \in S} f(Z_n; x')\, \pi_{n|n-1}(x')}.$$
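Here is a minimal sketch of this recursion (the two-state chain and the Gaussian observation densities are illustrative assumptions; the filter is initialized by conditioning the prior on $Z_0$):

```python
import numpy as np

def bayes_filter(Z, P, pi0, f):
    """pi_n(x) = P(X_n = x | Z_0, ..., Z_n); f(z) = vector of f(z; x), x in S."""
    pi = pi0 * f(Z[0]); pi /= pi.sum()   # condition on the first observation
    history = [pi]
    for z in Z[1:]:
        pred = pi @ P                    # state update: pi_{n|n-1}
        pi = f(z) * pred                 # observation update
        pi /= pi.sum()
        history.append(pi)
    return np.array(history)

# Illustrative example: two hidden states observed in Gaussian noise.
mu, s = np.array([-1.0, 1.0]), 0.8
f = lambda z: np.exp(-(z - mu)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
rng = np.random.default_rng(2)
X = [0]
for _ in range(200):
    X.append(rng.choice(2, p=P[X[-1]]))
X = np.array(X)
Z = mu[X] + s * rng.standard_normal(X.size)

post = bayes_filter(Z, P, np.array([0.5, 0.5]), f)
print("fraction tracked:", np.mean(post.argmax(axis=1) == X))
```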
In applications there are two main limitations to this algorithm. First, the transition probabilities of the Markov chain are not known perfectly; they usually contain unknown parameters, which must be estimated. Second, the state-update step becomes computationally intensive when the dimension of the Markov chain is large, so Monte Carlo methods (particle methods) must be used. Combining filtering and maximum likelihood parameter estimation can be done with variants of the expectation-maximization (EM) algorithm.
5 Random walks and connections with differential equations
We begin with the simple random walk on equally spaced points in an interval. Let $\{X_n\}$ denote the symmetric random walk on $\{x_0 = 0, x_1 = \Delta x, \ldots, x_N = N\Delta x = a\}$, so that

$$P\{X_n = x_k \mid X_{n-1} = x_j\} = 1/2 \text{ when } k = j \pm 1, \text{ and } 0 \text{ otherwise},$$

where $x_j$ is an interior point, that is, not $0$ or $a$. The process can be absorbed at $0$ or $a$, in which case

$$P\{X_n = x_0 \mid X_{n-1} = x_0\} = 1, \qquad P\{X_n = x_N \mid X_{n-1} = x_N\} = 1.$$

The matrix of transition probabilities $P(x, y)$ has the form

$$P = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
1/2 & 0 & 1/2 & \cdots & 0 & 0\\
 & \ddots & \ddots & \ddots & & \\
0 & 0 & \cdots & 1/2 & 0 & 1/2\\
0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix}$$

The process satisfies the stochastic difference equation $X_n = X_{n-1} + Z_n$ for $n = 1, 2, \ldots$, with $X_0 = x$ given, for example, and with $\{Z_n\}$ independent identically distributed random variables taking values $\pm\Delta x$ with probability $1/2$. The process stops when it reaches the boundary points.
We can derive difference equations for probabilities of interest, such as

$$u_j^n = P\{T > n \mid X_0 = x_j\},$$

where $T$ is the first time to reach $x_0$ or $x_N$, the exit time from the interval. By the renewal method we find that

$$u_j^n = \frac{1}{2}\big(u_{j+1}^{n-1} + u_{j-1}^{n-1}\big), \qquad j = 1, 2, \ldots, N-1,\ \ n = 1, 2, 3, \ldots,$$

with boundary conditions $u_0^n = u_N^n = 0$, $n = 0, 1, 2, \ldots$, and initial condition $u_j^0 = 1$, $j = 1, 2, \ldots, N-1$. In vector-matrix form this finite difference equation reads

$$u^n = P u^{n-1}, \qquad n = 1, 2, 3, \ldots,$$

where $u^n = (u_j^n)$ and where $u^0 = f$, with $f$ equal to one at all interior points and zero at the two boundary points. The eigenvalues and eigenvectors of this symmetric tridiagonal matrix can be computed explicitly using, for example, the discrete Fourier transform.
The connection with partial differential equations comes from passing to the continuum limit, which we will do here formally, without detailed proofs; such proofs resemble the usual convergence proofs for finite difference methods, or can be entirely probabilistic, since $X_n$ is just a sum of iid random variables and the diffusion approximation is simply a restatement of the CLT.

We write the recursion relation for $u^n$ as

$$\frac{1}{\Delta t}\big(u^n - u^{n-1}\big) = \sigma^2\, \frac{1}{(\Delta x)^2}(P - I)\, u^{n-1},$$

with $\sigma^2 = (\Delta x)^2/\Delta t$ remaining fixed as $\Delta t$ and $\Delta x$ tend to zero. In the continuum limit we assume that $u_j^n \approx u(n\Delta t, j\Delta x)$, with $u(t, x)$ a smooth function that will satisfy a partial differential equation. It is easily seen that the $j$-th entry satisfies

$$\Big[\frac{1}{(\Delta x)^2}(P - I)\, u^{n-1}\Big]_j \to \frac{1}{2}\, u_{xx}(n\Delta t, j\Delta x),$$

and therefore the continuum limit of the difference equation in the interior becomes

$$u_t(t, x) = \frac{\sigma^2}{2}\, u_{xx}(t, x), \qquad t > 0,\ x \in (0, a),$$

with boundary conditions $u(t, 0) = u(t, a) = 0$ and initial condition $u(0, x) = 1$. This equation can be solved by a Fourier sine series, and the result is

$$u(t, x) = P_x\{T > t\} = \sum_{k=0}^{\infty} \frac{4}{(2k+1)\pi}\; e^{-\big(\frac{(2k+1)\pi}{a}\big)^2 \frac{\sigma^2 t}{2}}\, \sin\Big(\frac{(2k+1)\pi x}{a}\Big).$$
45
-
Here T is the exit time of Brownian motion, the path limit in
law of the random walk,from the interval (0, a). The most
interesting feature of the explicit solution is the rate ofdecay at
t
1
tlogPx{T > t} pi
22
2a2
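The discrete recursion $u^n = P u^{n-1}$ and the series solution are easy to compare numerically (a minimal sketch; the grid size and parameter values are arbitrary choices). For moderately large $t$ only the $k = 0$ mode of the series survives:

```python
import numpy as np

a, N, sigma = 1.0, 200, 1.0
dx = a / N
dt = dx**2 / sigma**2            # keeps sigma^2 = (dx)^2 / dt fixed
t_final = 0.5

u = np.ones(N - 1)               # u_j^0 = 1 at the interior points
for _ in range(int(t_final / dt)):
    u_new = np.empty_like(u)
    u_new[1:-1] = 0.5 * (u[2:] + u[:-2])   # u_j^n = (u_{j+1}^{n-1} + u_{j-1}^{n-1}) / 2
    u_new[0] = 0.5 * u[1]                  # absorbing boundary u_0^n = 0
    u_new[-1] = 0.5 * u[-2]                # absorbing boundary u_N^n = 0
    u = u_new

x = np.arange(1, N) * dx
# leading (k = 0) term of the Fourier sine series for P_x{T > t}
lead = (4 / np.pi) * np.exp(-(np.pi / a)**2 * sigma**2 * t_final / 2) \
       * np.sin(np.pi * x / a)
print("max discrepancy:", np.max(np.abs(u - lead)))   # small at this t
```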
When we consider this problem in the semi-infinite interval $(0, \infty)$ we simply have to solve the pde

$$u_t(t, x) = \frac{\sigma^2}{2}\, u_{xx}(t, x), \qquad t > 0,\ x > 0,$$

with $u(t, 0) = 0$ and $u(0, x) = 1$. It is interesting to compute here, for $\lambda > 0$,

$$E_x\{e^{-\lambda T}\} = -\int_0^{\infty} e^{-\lambda t}\, u_t(t, x)\, dt,$$

which is the Laplace transform of the density of the exit time $T$ through the origin. The Laplace transform of $u$,

$$\hat{u}(\lambda, x) = \int_0^{\infty} e^{-\lambda t}\, u(t, x)\, dt,$$

satisfies the ode

$$\lambda \hat{u} - 1 = \frac{\sigma^2}{2}\, \hat{u}_{xx}, \qquad \hat{u}(\lambda, 0) = 0,$$

for which we get the unique bounded solution

$$\hat{u} = \frac{1}{\lambda}\Big(1 - e^{-\frac{\sqrt{2\lambda}}{\sigma} x}\Big).$$

And since $E_x\{e^{-\lambda T}\} = 1 - \lambda \hat{u}$, we see that

$$E_x\{e^{-\lambda T}\} = e^{-\frac{\sqrt{2\lambda}}{\sigma} x}.$$

This Laplace transform can be inverted and the explicit form of the density of $T$ can be obtained. But, contrary to what we have in a finite interval, where the large-$t$ behavior of the distribution is exponentially small, in the semi-infinite interval the mean exit time is infinite, $E_x\{T\} = \infty$. This can be shown by differentiating the Laplace transform of $T$ with respect to $\lambda$ and then letting $\lambda$ tend to zero.
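Concretely, differentiating the formula above,

$$E_x\{T e^{-\lambda T}\} = -\frac{d}{d\lambda}\, e^{-\frac{\sqrt{2\lambda}}{\sigma} x} = \frac{x}{\sigma\sqrt{2\lambda}}\, e^{-\frac{\sqrt{2\lambda}}{\sigma} x} \to \infty \quad \text{as } \lambda \downarrow 0,$$

so $E_x\{T\} = \infty$ by monotone convergence.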
5.1 Transience and recurrence
An irreducible and aperiodic Markov chain on an infinite state space, a random walk for example, is recurrent if the expected number of visits to a state is infinite, and transient otherwise. We will show that the one-dimensional random walk on the infinite lattice is recurrent, while in three dimensions it is transient. It is also recurrent in two dimensions. We will use Fourier series for the analysis.
The one-dimensional random walk is $X_n = X_{n-1} + Z_n$, where $\{Z_n\}$ are iid random variables with values $\pm\Delta x$ with probability $1/2$. Therefore, for $k \in (-\frac{\pi}{\Delta x}, \frac{\pi}{\Delta x})$ we have

$$E_x\{e^{ikX_n}\} = E_x\{E\{e^{ikZ_n} e^{ikX_{n-1}} \mid X_{n-1}\}\} = e^{ikx}\big(E\{e^{ikZ}\}\big)^n = e^{ikx}\Big(\frac{1}{2}\big(e^{ik\Delta x} + e^{-ik\Delta x}\big)\Big)^n = e^{ikx}(\cos k\Delta x)^n.$$

The values of $X_n$ are $m\Delta x$, where $m = 0, \pm 1, \pm 2, \ldots$, so that if $x = 0$ then

$$E_0\{e^{ikX_n}\} = \sum_m e^{ikm\Delta x}\, p_m(n),$$

where $p_m(n)$ is the probability that $X_n = m\Delta x$, starting from $0$. By the orthogonality of the complex exponentials we see that

$$p_m(n) = \frac{\Delta x}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} E_0\{e^{ikX_n}\}\, e^{-ikm\Delta x}\, dk,$$

which implies that

$$P_0(X_n = 0) = p_0(n) = \frac{\Delta x}{2\pi} \int_{-\pi/\Delta x}^{\pi/\Delta x} (\cos k\Delta x)^n\, dk.$$

We will use this Fourier representation of $p_0(n)$ to show that the one-dimensional symmetric random walk is recurrent.
We will use this Fourier representation of p0(n) to show that
the one-dimensional symmetricrandom walk is recurrent.
We will show that the expected number of returns to zero is
infinite. The expectednumber of returns by time n is
E0{nj=0
1{Xj = 0}} =nj=0
p0(j) =x
2pi
pi/xpi/x
1 (cos kx)n+11 cos kx dk
The integrand may be singular only at k = 0. Expanding near k =
0 we see that it behaveslike n+1 and therefore in any fixed, small
neighborhood of k = 0 we see that the integral isunbounded as n. We
conclude, omitting details, that the expected number of returnsto
the origin, E0{
j=0 1{Xj = 0}} =, and therefore the random walk is
recurrent.
In three dimensions the calculation leading to the Fourier representation of the probability of return to the origin in $n$ steps is essentially the same as in one dimension, except for notation. Now $m = (m_1, m_2, m_3)$ are points on the integer lattice in three dimensions, we denote the mesh size by $\Delta x$ in all three directions, and we denote by $e_j$, $j = 1, 2, 3$, the three unit coordinate vectors in $\mathbb{R}^3$. The Fourier variable is $k = (k_1, k_2, k_3)$, with each coordinate taking values in $(-\frac{\pi}{\Delta x}, \frac{\pi}{\Delta x})$. The random walk is $X_n = X_{n-1} + Z_n$, where the increments are iid random variables taking values $\pm e_j \Delta x$, $j = 1, 2, 3$, with probability $1/6$ each. Repeating the above steps and using inner product notation for vectors we have that

$$E_0\{e^{ik \cdot X_n}\} = \Big[\frac{1}{3}\big(\cos k_1\Delta x + \cos k_2\Delta x + \cos k_3\Delta x\big)\Big]^n.$$

Therefore

$$E_0\Big\{\sum_{j=0}^{\infty} 1\{X_j = 0\}\Big\} = \Big(\frac{\Delta x}{2\pi}\Big)^3 \int_{-\pi/\Delta x}^{\pi/\Delta x}\!\int_{-\pi/\Delta x}^{\pi/\Delta x}\!\int_{-\pi/\Delta x}^{\pi/\Delta x} \frac{dk}{1 - \frac{1}{3}\big(\cos k_1\Delta x + \cos k_2\Delta x + \cos k_3\Delta x\big)}.$$

As in the one-dimensional case, only $k = 0$ is a singular point, and expanding near $k = 0$ we see that the denominator behaves like $|k|^2$. But the Jacobian in polar coordinates in three dimensions is also proportional to $|k|^2$, so the singularity cancels and we have, in fact, a convergent integral. This proves that the symmetric random walk in three dimensions is transient.
5.2 Connections with classical potential theory
The connections with classical potential theory come from the fact that the transition probability matrix for the random walk on the scaled lattice (scaled by $\Delta x$) converges to the Laplace operator. Let us assume that we are in three dimensions and let $f(x)$ be a smooth and bounded function on $\mathbb{R}^3$. We can then define the transition operator of the random walk by

$$Pf(x) = \frac{1}{6} \sum_{j=1}^3 \big[f(x + e_j\Delta x) + f(x - e_j\Delta x)\big] = \text{average of nearest neighbors},$$

so that

$$\frac{1}{(\Delta x)^2}\big(Pf(x) - f(x)\big) \to \frac{1}{6}\Delta f(x) = \frac{1}{6}\big(f_{x_1 x_1}(x) + f_{x_2 x_2}(x) + f_{x_3 x_3}(x)\big)$$
as $\Delta x \to 0$, where $\Delta$ is the Laplace operator. For the expectation

$$u_m^n = E\{f(X_n) \mid X_0 = m\Delta x\},$$

where $m = (m_1, m_2, m_3)$, we have that

$$u_m^{n+1} = (P u^n)(m\Delta x),$$

and we can rewrite this as

$$u_m^{n+1} - u_m^n = (P - I)\, u^n(m\Delta x).$$

Dividing by $\Delta t$ we have

$$\frac{1}{\Delta t}\big(u_m^{n+1} - u_m^n\big) = \frac{(\Delta x)^2}{\Delta t} \cdot \frac{1}{(\Delta x)^2}(P - I)\, u^n(m\Delta x).$$

In the continuum limit $\Delta t \to 0$, $\Delta x \to 0$ with

$$\sigma^2 = \frac{1}{3}\, \frac{(\Delta x)^2}{\Delta t} = \text{constant},$$

we have that $u_m^n \to u(n\Delta t, m\Delta x)$ with

$$u_t = \frac{\sigma^2}{2}\, \Delta u, \qquad t > 0,$$

and $u(0, x) = f(x)$. The factor $1/3$ is attached to the definition of $\sigma^2$ because it refers to the coordinate-wise mean square displacement rather than the overall mean square displacement.
We introduce the Laplace transform

$$\hat{u}(x, \lambda) = \int_0^{\infty} e^{-\lambda t}\, u(t, x)\, dt, \qquad \lambda > 0,$$

and note that the diffusion equation transforms to

$$\lambda \hat{u}(x, \lambda) - f(x) = \frac{\sigma^2}{2}\, \Delta \hat{u}(x, \lambda).$$

In terms of the Brownian motion process $X_t$, the continuous time analog of the random walk, which is not considered in detail here, we have the probabilistic representation

$$u(t, x) = E_x\{f(X_t)\},$$

and for the Laplace transform

$$\hat{u}(x, \lambda) = \int_0^{\infty} e^{-\lambda t}\, E_x\{f(X_t)\}\, dt = E_x\Big\{\int_0^{\infty} e^{-\lambda t} f(X_t)\, dt\Big\}.$$

When $f(x) = 1_A(x) = 1\{x \in A\}$, then

$$\hat{u}(x, \lambda) = E_x\Big\{\int_0^{\infty} e^{-\lambda t}\, 1_A(X_t)\, dt\Big\}$$

is the discounted (with rate $\lambda$) expected time spent by Brownian motion in the set $A$, starting from $x$. When $\lambda = 0$, $\hat{u}(x, 0)$ is the expected time spent in $A$, which is the continuous analog of the quantity that characterizes transience and recurrence in random walks.
The continuum limit is interesting because it connects directly with potential theory, that is, the theory of solutions of the Laplace equation. In three dimensions the Green's function for the equation

$$\Delta_x G(x, y) - \lambda G(x, y) = -\delta_y(x)$$

is explicitly given by

$$G(x, y) = \frac{e^{-\sqrt{\lambda}\,|x - y|}}{4\pi|x - y|}.$$

Therefore we have the integral representation (rescaling to account for the coefficient $\sigma^2/2$ in our equation)

$$\hat{u}(x, \lambda) = E_x\Big\{\int_0^{\infty} e^{-\lambda t} f(X_t)\, dt\Big\} = \frac{2}{\sigma^2} \int_A \frac{e^{-\sqrt{2\lambda}\,|x - y|/\sigma}}{4\pi|x - y|}\, dy.$$

The expected time spent in $A$ (the case $\lambda = 0$) is thus given by the Newtonian potential of $A$:

$$E_x\Big\{\int_0^{\infty} 1_A(X_t)\, dt\Big\} = \frac{2}{\sigma^2} \int_A \frac{1}{4\pi|x - y|}\, dy.$$

For the random walk on the three dimensional lattice we do not have an explicit expression such as this one, which, in particular, shows that the time spent in any bounded set is finite.
The recurrence of one-dimensional Brownian motion can be seen easily by noting that the Green's function in one dimension, for $G_{xx}(x, y) - \lambda G(x, y) = -\delta_y(x)$, has the form

$$G(x, y) = \frac{e^{-\sqrt{\lambda}\,|x - y|}}{2\sqrt{\lambda}},$$

and therefore in one dimension we have

$$\hat{u}(x, \lambda) = E_x\Big\{\int_0^{\infty} e^{-\lambda t} f(X_t)\, dt\Big\} = \frac{1}{\sigma\sqrt{2\lambda}} \int_A e^{-\sqrt{2\lambda}\,|x - y|/\sigma}\, dy.$$

This becomes infinite as $\lambda \to 0$, showing that the expected time spent in any set of positive volume is infinite. This is the analog of recurrence for Brownian motion in one dimension.
5.3 Stochastic control
We will formulate a stochastic control problem for a finite difference equation of the form

$$X_{n+1} = X_n + b(U_n, X_n)\Delta t + \sigma(U_n, X_n)\sqrt{\Delta t}\, Z_{n+1}, \qquad n = 0, 1, 2, \ldots,$$

where $X_0 = x$ is given and $\{Z_n\}$ are iid $N(0, 1)$ random variables. This is a Markov chain with values in $\mathbb{R}$ and with coefficient functions $b(u, x)$ and $\sigma(u, x)$ that depend on both $x$ and the control variable $u \in \mathbb{R}$. The transition probabilities are given by

$$P\{X_{n+1} \in A \mid X_n = x, U_n = u\} = \int_A \frac{1}{\sqrt{2\pi\sigma^2(u, x)\Delta t}}\; e^{-\frac{(y - x - b(u, x)\Delta t)^2}{2\sigma^2(u, x)\Delta t}}\, dy.$$
For any bounded function $f(x)$ we define the transition operator

$$P_u f(x) = \int \frac{1}{\sqrt{2\pi\sigma^2(u, x)\Delta t}}\; e^{-\frac{(y - x - b(u, x)\Delta t)^2}{2\sigma^2(u, x)\Delta t}}\, f(y)\, dy.$$

The reason the scaling with the time increment $\Delta t$ is chosen this way is so that the conditional increment of the Markov chain has the following mean and variance:

$$E\{X_{n+1} - X_n \mid X_n = x, U_n = u\} = b(u, x)\Delta t, \qquad \mathrm{Var}\{X_{n+1} - X_n \mid X_n = x, U_n = u\} = \sigma^2(u, x)\Delta t,$$

which implies that for any smooth and bounded function $f$ we have

$$\lim_{\Delta t \to 0} \frac{1}{\Delta t}\big(P_u f(x) - f(x)\big) = \frac{\sigma^2(u, x)}{2}\, \frac{\partial^2 f(x)}{\partial x^2} + b(u, x)\, \frac{\partial f(x)}{\partial x} = L_u f(x).$$
Thus, the continuum limit of the Markov chain exists and is a diffusion process. The control of the process $X_n$ is $U_n$, assumed to be a real valued random variable that depends only on $X_0, X_1, \ldots, X_n$, and such that

$$E_x\Big\{\sum_{j=0}^{N-1} L(U_j, X_j)\Delta t + g(X_N)\Big\}$$

is minimized over all control sequences $U_0, U_1, \ldots, U_{N-1}$, for some terminal time index $N$ and for a given running cost function $L(u, x)$ and terminal cost $g(x)$. We introduce the value function

$$V^n(x) = \inf_{\mathcal{A}_n} E\Big\{\sum_{j=n}^{N-1} L(U_j, X_j)\Delta t + g(X_N) \,\Big|\, X_n = x\Big\}, \qquad n = N-1, N-2, \ldots, 0,$$

where $\mathcal{A}_n$ is the set of all admissible control sequences. Note that $V^N(x) = g(x)$. Clearly we want to find $V^0(x)$ and the associated optimal control sequence $U_0^*, U_1^*, \ldots, U_{N-1}^*$.

The main result in discrete time stochastic control is the derivation of the backward-in-time recursion for the determination of the value function $V^0$ and the associated optimal
control. Using iterated conditional expectation we write
$$\begin{aligned}
V^n(x) &= \inf_{\mathcal{A}_n} E\Big\{\sum_{j=n}^{N-1} L(U_j, X_j)\Delta t + g(X_N) \,\Big|\, X_n = x\Big\}\\
&= \inf_{\mathcal{A}_n} E\Big\{E\Big\{L(U_n, X_n)\Delta t + \sum_{j=n+1}^{N-1} L(U_j, X_j)\Delta t + g(X_N) \,\Big|\, X_{n+1}, X_n = x\Big\} \,\Big|\, X_n = x\Big\}\\
&= \inf_u E\Big\{L(u, x)\Delta t + \inf_{\mathcal{A}_{n+1}} E\Big\{\sum_{j=n+1}^{N-1} L(U_j, X_j)\Delta t + g(X_N) \,\Big|\, X_{n+1}\Big\} \,\Big|\, X_n = x, U_n = u\Big\}\\
&= \inf_u \big[L(u, x)\Delta t + E\{V^{n+1}(X_{n+1}) \mid X_n = x, U_n = u\}\big]\\
&= \inf_u \big[L(u, x)\Delta t + P_u V^{n+1}(x)\big].
\end{aligned}$$
Therefore, the determination of the value function and the associated optimal control is reduced to solving the backward optimality recursion

$$V^n(x) = \inf_u \big[L(u, x)\Delta t + P_u V^{n+1}(x)\big], \qquad n = N-1, N-2, \ldots, 0,$$

with $V^N(x) = g(x)$.

Assuming that a unique minimizer $u_n^*(x)$ exists at each step in the backward optimality recursion, which in general requires convexity and other assumptions on $L$, $b$, $\sigma$ and $g$, we can then obtain the optimally controlled Markov chain recursively by

$$X_{n+1}^* = X_n^* + b\big(u_n^*(X_n^*), X_n^*\big)\Delta t + \sigma\big(u_n^*(X_n^*), X_n^*\big)\sqrt{\Delta t}\, Z_{n+1}, \qquad n = 0, 1, 2, \ldots,$$

with $X_0^* = x$. The optimal controls are given by $U_n^* = u_n^*(X_n^*)$ and they are Markovian, that is, they depend only on the current (optimal) state. The optimal value function satisfies

$$V^n(x) = L(u_n^*(x), x)\Delta t + P_{u_n^*(x)} V^{n+1}(x), \qquad n = N-1, N-2, \ldots, 0,$$

with $V^N(x) = g(x)$. The optimal control problem is thus reduced to solving first the backward optimality recursion and then the forward recursion that determines the optimal state and the associated optimal controls.
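A minimal numerical sketch of these two steps is given below (the one-dimensional grids, the Gauss-Hermite quadrature used for $P_u$, and the particular choices of $b$, $\sigma$, $L$, $g$ are all illustrative assumptions, and the boundary handling by interpolation is crude):

```python
import numpy as np

dt, Nt = 0.1, 20
xs = np.linspace(-3.0, 3.0, 121)            # state grid
us = np.linspace(-2.0, 2.0, 41)             # control grid
b   = lambda u, x: u                        # drift
sig = lambda u, x: 0.5                      # noise level
L   = lambda u, x: u**2 + x**2              # running cost
g   = lambda x: x**2                        # terminal cost

# Gauss-Hermite nodes/weights for E{h(Z)}, Z ~ N(0, 1)
z_nodes, z_wts = np.polynomial.hermite_e.hermegauss(9)
z_wts = z_wts / np.sqrt(2.0 * np.pi)

V = g(xs)                                   # V^N = g
policy = np.zeros((Nt, xs.size))
for n in reversed(range(Nt)):
    Q = np.empty((us.size, xs.size))        # candidate values for each control
    for i, u in enumerate(us):
        # P_u V^{n+1}(x) = E{V^{n+1}(x + b dt + sig sqrt(dt) Z)}
        PuV = sum(w * np.interp(xs + b(u, xs)*dt + sig(u, xs)*np.sqrt(dt)*z,
                                xs, V)
                  for z, w in zip(z_nodes, z_wts))
        Q[i] = L(u, xs) * dt + PuV
    policy[n] = us[Q.argmin(axis=0)]        # u_n*(x), the minimizing control
    V = Q.min(axis=0)                       # V^n(x)

print("V^0(0) =", V[xs.size // 2])
```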
Let us consider a simple linear, quadratic cost control problem in which $b(u, x) = bu$, $\sigma(u, x) = \sigma$, $L(u, x) = lu^2$, $g(x) = gx^2$, where $b, \sigma, l, g$ are now constants, and we take $\Delta t = 1$. In this case we may assume that the value function has the form

$$V^n(x) = a_n x^2 + b_n$$

and derive recursions for the sequences of constants $\{a_n\}$ and $\{b_n\}$, with $a_N = g$ and $b_N = 0$. Because of the Gaussian transition probability density for the Markov chain, the backward optimality recursion has the form

$$a_n x^2 + b_n = \inf_u \big[lu^2 + a_{n+1}\big(\sigma^2 + (x + bu)^2\big) + b_{n+1}\big].$$

The minimizing $u$ is given by

$$u_n^*(x) = -\frac{b\, a_{n+1}}{l + b^2 a_{n+1}}\, x,$$

and then the recursions for $a_n$ and $b_n$ have the form

$$a_n = \frac{l\, a_{n+1}}{l + b^2 a_{n+1}}, \qquad b_n = \sigma^2 a_{n+1} + b_{n+1},$$
with $a_N = g$ and $b_N = 0$. Once this sequence of constants is determined, the optimally controlled Markov chain has the form

$$X_{n+1}^* = X_n^* + b\, u_n^*(X_n^*) + \sigma Z_{n+1}, \qquad n = 0, 1, 2, \ldots,$$

with $X_0^* = x$. Note that the optimal control is a linear function of the state in this case.
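Here is a minimal sketch of this example (the constants are arbitrary illustrative choices; the sequence $b_n$ is renamed `c` in the code to avoid clashing with the dynamics constant $b$):

```python
import numpy as np

b, sigma, l, g, N = 1.0, 0.5, 1.0, 2.0, 25   # illustrative constants, dt = 1

a = np.zeros(N + 1); c = np.zeros(N + 1)     # V^n(x) = a_n x^2 + c_n
a[N], c[N] = g, 0.0
for n in reversed(range(N)):                 # backward optimality recursion
    a[n] = l * a[n + 1] / (l + b**2 * a[n + 1])
    c[n] = sigma**2 * a[n + 1] + c[n + 1]

gain = lambda n: -b * a[n + 1] / (l + b**2 * a[n + 1])   # u_n*(x) = gain(n) x

rng = np.random.default_rng(4)               # simulate the controlled chain
x0 = 3.0
costs = []
for _ in range(5000):
    x, cost = x0, 0.0
    for n in range(N):
        u = gain(n) * x
        cost += l * u**2
        x = x + b * u + sigma * rng.standard_normal()
    costs.append(cost + g * x**2)

print("mean realized cost:", np.mean(costs))
print("predicted V^0(x0): ", a[0] * x0**2 + c[0])
```

The empirical mean cost of the simulated optimal policy should match the predicted value $V^0(x_0) = a_0 x_0^2 + c_0$ up to Monte Carlo error.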
of the state in this case.The continuum limit of the backward
optimality recursion is the Hamilton-Jacobi-
Bellman (HJB) equation obtained as follows. We re-write this
recursion as
V n+1(x) V n(x) + infu
[L(u, x)t+ PuVn+1(x) V n(x)] , n = N 1, N 2, ..., 0
Dividing by t and passing to the limit we get
Vt + infu
[L(u, x) + Luf(x)] , t < T
with V (T, x) = g(x). Re