Convergence Proofs of Least Squares Policy Iteration Algorithm for High-Dimensional Infinite Horizon
Markov Decision Process Problems
Jun Ma and Warren B. Powell
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544
December 17, 2008
Abstract
Most of the current theory for dynamic programming algorithms focuses on finite state,
finite action Markov decision problems, with a paucity of theory for the convergence of
approximation algorithms with continuous states. In this paper we propose a policy iteration
algorithm for infinite-horizon Markov decision problems where the state and action spaces are
continuous and the expectation cannot be computed exactly. We show that an appropriately
designed least squares (LS) or recursive least squares (RLS) method is provably convergent
under certain problem structure assumptions on value functions. In addition, we show
that the LS/RLS approximate policy iteration algorithm converges in the mean, meaning
that the mean error between the approximate policy value function and the optimal value
function shrinks to zero as successive approximations become more accurate. Furthermore,
the convergence results are extended to the more general case of unknown basis functions
with orthogonal polynomials.
1 Introduction
The core concept of dynamic programming for solving Markov decision processes (MDP) is Bellman's equation, which is often written in the standard form (Puterman (1994))
$$V_t(S_t) = \max_{x_t}\Big\{C(S_t, x_t) + \gamma \sum_{s' \in \mathcal{S}} P(s'|S_t, x_t)\, V_{t+1}(s')\Big\}. \qquad (1)$$
It is more convenient for our purposes, and mathematically equivalent, to write Bellman's equation (1) in the expectation form
$$V_t(s) = \max_{x_t}\left\{C(s, x_t) + \gamma\, \mathbb{E}\left[V_{t+1}(S_{t+1})\,|\,S_t = s\right]\right\}, \qquad (2)$$
where $S_{t+1} = S^M(S_t, x_t, W_{t+1})$ and the expectation is taken over the random information variable $W_{t+1}$. If the state variable $S_t$ and decision variable $x_t$ are discrete scalars and the
transition matrix P is known, the value function Vt(St) can be computed by enumerating all
the states backward through time, which is a method often referred to as backward dynamic
programming. Moreover, there is a mature and elegant convergence theory supporting algo-
rithms that handle problems with finite state and action spaces and computable expectation
(Puterman (1994)). However, a discrete representation of the problem often suffers from the
well-known "curses of dimensionality": 1) If the state St is a vector, the state space grows
exponentially with the number of dimensions. 2) A multi-dimensional information variable
Wt makes the computation of the expectation intractable. 3) When the decision variable
xt is a high-dimensional vector, we have to rely on mathematical programming to solve
the more complicated optimization problem within Bellman's equation. In addition, there
are a large number of real world applications with continuous state and action spaces, to
which direct application of algorithms developed for discrete problems is not appropriate.
Hence, continuous value function approximations are suggested to handle high-dimensional
and continuous applications.
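To make the discrete baseline concrete, the following minimal Python sketch (illustrative only; the reward and transition arrays are randomly generated placeholders, not an example from the paper) implements backward dynamic programming for equation (1). The (nS, nX, nS) transition array is exactly the object whose size explodes under the curses of dimensionality.

import numpy as np

# Hypothetical finite problem: nS states, nX actions, horizon T.
nS, nX, T, gamma = 50, 5, 20, 0.95
rng = np.random.default_rng(0)
C = rng.normal(size=(nS, nX))                  # C(s, x): reward table
P = rng.dirichlet(np.ones(nS), size=(nS, nX))  # P[s, x, s'] = P(s'|s, x)

V = np.zeros((T + 1, nS))                      # V[T] is the terminal value
for t in reversed(range(T)):                   # step backward through time
    Q = C + gamma * P @ V[t + 1]               # Q[s, x] as in equation (1)
    V[t] = Q.max(axis=1)                       # the max over x_t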
This paper describes a provably convergent and implementable approximate policy-
iteration algorithm that uses linear function approximation to handle infinite-horizon Markov
decision process problems where state, action and information variables are all continuous
vectors (possibly of high dimensionality) and the expectation cannot be computed exactly.
Policy iteration is not a novel idea. However, when the state and action spaces are not finite,
policy iteration runs into such difficulties as (1) the feasibility of obtaining accurate policy
value functions in a computationally implementable way and (2) the existence of a sequence
of policies generated by the algorithm (Bertsekas & Shreve (1978)). Least squares or recursive least squares (LS/RLS) updating methods are proposed to address the first difficulty. To overcome the second difficulty, we impose the significant assumption that the true value functions for all policies in the policy space are continuous and linearly parameterized with finitely many known basis functions. This assumption yields a sound convergence theory for the algorithm applied to problems with continuous state and action spaces.
The main contributions of this paper include: a) a thorough and up-to-date literature
review on continuous function approximations, b) extension of the least squares temporal
difference learning algorithm (LSTD(0)) by Bradtke & Barto (1996) to the continuous case, c)
almost sure convergence of the exact policy iteration algorithm using the LS/RLS updating,
d) convergence in the mean of the corresponding approximate policy iteration algorithm,
and e) extension of the convergence results to unknown basis functions.
The rest of the paper is organized as follows. In section 2, we review the literature on
continuous function approximations applied to Markov decision process problems and their
asymptotic properties. In section 3, we summarize important mathematical foundations
to establish convergence and illustrate the details of a least squares/recursive least squares
approximate policy iteration (LS/RLSAPI) algorithm. Section 4 presents the convergence
results for a fixed policy. In section 5, by applying the results in section 4, we show almost sure
convergence of the exact policy iteration algorithm, which requires exact policy evaluation,
to the optimal policy. Section 6 proves mean convergence of approximate policy iteration,
in which case policies are evaluated approximately. In section 7, we extend the convergence
result in section 6 to unknown basis functions using orthogonal polynomials. The last section
concludes and discusses future research directions.
2 Literature Review
In this section, we review the literature on continuous function approximations and related
convergence theories. We break this section into three parts. The first part focuses on con-
tinuous approximation algorithms for discrete MDP problems. The second part deals with
function approximation algorithms applied to problems with a continuous state space. The
last part describes some divergence examples of value iteration type algorithms using para-
metric function approximators. The table in figure 1 is a brief summary of the algorithms
categorized by their characteristics such as whether the state and action spaces are discrete or
continuous (D/C), whether the contribution/reward function is quadratic or general (Q/G),
whether the expectation can be computed exactly (Y/N), whether the problem is determinis-
tic or stochastic (Gaussian noise) (D/S(G)), the type of algorithms including value iteration
(VI), fixed policy (FP), exact or approximate policy iteration (EPI/API), and whether there
is a convergence guarantee for the algorithm (Y/N) or performance bound (B). Details of
the algorithms and their convergence properties are discussed in the following subsections.
2.1 Continuous approximations of discrete problems
Since the inception of the field of dynamic programming, researchers have devoted consid-
erable effort to explore the use of compact representations in value function approximation
in order to overcome the curses of dimensionality in large-scale stochastic dynamic programming (e.g., Schweitzer & Seidmann (1985)). However, most of these approaches are heuristic,
and there is no formal proof guaranteeing convergence of the algorithms. More recently,
provably convergent algorithms have been proposed for different approximation techniques
such as feature-based function approximation, temporal difference learning, fitted temporal
difference learning with contraction or expansion mappings, and residual gradient. These
are reviewed below.
                                 State  Action  Reward  Exp.  Noise  Policy  Conv.
Bradtke et al. (1994)              C      C       Q      N     D     API      Y
Gordon (1995)                     NA     NA       G      Y     S     VI       Y
Baird (1995)                       D      D       G      N     D     VI       Y
Tsitsiklis & Van Roy (1996)        D      D       G      Y     S     VI       Y
Bradtke & Barto (1996)             D      D       G      N     S     FP       Y
Landelius & Knutsson (1996)        C      C       Q      N     D     VI/API   Y
Tsitsiklis & Van Roy (1997)        D      D       G      Y     S     FP       Y
Meyn (1997)                        C      D       G      Y     S     EPI      Y
Papavassiliou & Russell (1999)     D      D       G      N     S     FP       Y
Gordon (2001)                      D      D       G      N     S     VI       Y
Ormoneit & Sen (2002)              C      D       G      N     S     VI       Y
Lagoudakis & Parr (2003)           D      D       G      N     S     API      N
Perkins & Precup (2003)            D      D       G      N     S     API      Y
Engel et al. (2005)                C      C       G      N     S     API      N
Munos & Szepesvari (2005)          C      D       G      N     S     VI       B
Melo et al. (2007)                 C      D       G      N     S     FP       Y
Szita (2007)                       C      C       Q      N     S(G)  VI       Y
Antos et al. (2007a, 2008)         C      D       G      N     S     API      B
Antos et al. (2007b)               C      C       G      N     S     API      B
Deisenroth et al. (2008)           D      D       G      Y     S     VI       N

Figure 1: Table of some continuous function approximation algorithms and related convergence results
2.1.1 Feature-based function approximation
The first step to set up a rigorous framework combining dynamic programming and compact
representations is taken in Tsitsiklis & Van Roy (1996). Two types of feature-based value
iteration algorithms are proposed. One is a variant of the value iteration algorithm that
uses a look-up table in feature space rather than in state space (similar to aggregation of
state variables). The other value iteration algorithm employs feature extraction and linear
approximations with a fixed set of basis functions including radial basis functions, neural
networks, wavelet networks, and polynomials. Under rather strict technical assumptions
on the feature mapping, Tsitsiklis & Van Roy (1996) proves the convergence of the value
iteration algorithm (not necessarily to the optimal value function unless it is spanned by
the basis functions) and provides a bound on the quality of the resulting approximations
compared with the optimal value function.
2.1.2 Temporal difference learning
Tsitsiklis & Van Roy (1997) proves convergence of the online temporal difference learning algorithm TD(λ) using a linear-in-the-parameters model and continuous basis functions. The con-
vergence proof assumes a fixed policy, in which case the problem is a Markov chain. All the
results are established on the assumption of a discrete state Markov chain. It is claimed that
the proofs can be easily carried over to the continuous case, but no details for this extension
are provided. Papavassiliou & Russell (1999) describes the Bridge algorithm for temporal
difference learning applied to a fixed policy. It shows that the algorithm converges to an
approximate global optimum for any "agnostically learnable" hypothesis class other than the class of linear combinations of fixed basis functions, and it provides approximation error bounds.
Bradtke & Barto (1996) and Boyan (1999) prove almost sure convergence of the Least-Squares TD (LSTD) algorithm when it is used with linear function approximation for a constant policy. Motivated by the LSTD algorithm, Lagoudakis & Parr (2003) proposes the least squares policy iteration (LSPI) algorithm, which combines value-function approximation with linear architectures and approximate policy iteration. Mahadevan & Maggioni (2007)
extends the LSPI algorithm within the representation policy iteration (RPI) framework. By
representing a finite sample of state transitions induced by the MDP as an undirected graph,
RPI constructs an orthonormal set of basis functions with the graph Laplacian operator.
With some technical assumptions on the policy improvement operator, Perkins & Precup
(2003) proves convergence of the approximate policy iteration algorithm to a unique solution
from any initial policy when SARSA updates (see Sutton & Barto (1998)) are used with linear
state-action value function approximation for policy evaluation.
Gordon (1995) provides a convergence proof for the fitted temporal difference learning (value iteration) algorithm with function approximations that are contraction or expansion mappings, such as k-nearest-neighbor, linear interpolation, some types of splines, and local weighted averaging. Interestingly, linear regression and neural networks do not fall into this class, because small local changes can lead to large global shifts of the approximation, so they can in fact diverge. Gordon (2001) proves a weaker convergence result for two linear approximation algorithms: SARSA(0) (a Q-learning type algorithm presented in Rummery & Niranjan (1994)) and V(0) (a value-iteration type algorithm introduced by Tesauro (1994)) converge to a bounded region almost surely.
2.1.3 Residual gradient algorithm
To overcome the instability of Q-learning or value iteration when implemented directly with a
general function approximation, residual gradient algorithms, which perform gradient descent
on the mean squared Bellman residual rather than the value function or Q-function, are
proposed in Baird (1995). Convergence of the algorithms to a local minimum of the Bellman
residual is guaranteed only for the deterministic case.
2.2 Approximations of continuous problems
Comparatively few papers treat function approximation algorithms directly applied to continuous problems. Convergence results are mostly found for the problem class of linear quadratic regulation (LQR), batch reinforcement learning, and the non-parametric approach with kernel regression. Another non-parametric method, using a Gaussian process model, has been successfully applied to complex nonlinear control problems but lacks convergence guarantees. Other convergence results are established for exact policy iteration and constant-policy Q-learning, which assume a finite or countable action space and ergodicity of the underlying process.
2.2.1 Linear quadratic regulation
In the 1990s, convergence proofs of DP/ADP type algorithms were established for a special
problem class with continuous states called Linear Quadratic Regulation in control theory.
One nice feature of LQR type problems is that one can find analytical solutions if the
exact reward/contribution and transition functions are known. When the exact form of the
functions is unknown and only sample observations are made, dynamic programming-type
algorithms are proposed to find the optimal control. Bradtke et al. (1994) proposes two algorithms based on Q-learning and applies them to deterministic infinite-horizon LQR problems.
Approximate policy iteration is provably convergent to the optimal control policy (Bradtke
(1993)), but an example shows that value iteration only converges locally. In Landelius
& Knutsson (1996), convergence proofs are presented for several adaptive-critic algorithms
applied to deterministic LQR problems including Heuristic Dynamic Programming (HDP,
another name for Approximate Dynamic Programming), Dual Heuristic Programming (DHP,
which works with derivatives of value function), Action Dependent HDP (ADHDP, another
name for Q-learning) and Action Dependent DHP (ADDHP). The key difference between
the RL algorithm proposed in Bradtke et al. (1994) and HDP is the parameter updating
formula. Bradtke et al. (1994) uses recursive least squares, while Landelius & Knutsson
(1996) uses a stochastic gradient algorithm that may introduce a scaling problem. It seems natural to extend the convergence proofs from the deterministic case to the stochastic case, but we are only aware of a convergence proof for the stochastic gradient algorithm, by Szita (2007), under the assumption of Gaussian noise.
2.2.2 Batch reinforcement learning
More recently, a series of papers (Munos & Szepesvari (2005), Antos et al. (2007a,b, 2008)) derives finite-sample probably approximately correct (PAC) bounds for batch reinforcement learning problems. These performance bounds ensure that the algorithms produce a near-optimal policy with high probability. They depend on the mixing rate of the sample
trajectory, the smoothness properties of the underlying MDP, the approximation power and
capacity of the function approximation method used, the iteration counter of the policy
improvement step, and the sample size for policy evaluation. More specifically, Munos & Szepesvari (2005) considers a sampling-based fitted value iteration algorithm for MDP problems with a large or possibly infinite state space but a finite action space, given a known generative model of the environment. In a model-free setting, Antos et al. (2007b, 2008) propose off-
policy fitted policy iteration algorithms that are based on a single trajectory of some fixed
behavior policy to handle problems in continuous state space and finite action space. An-
tos et al. (2007a) extends previous consistency results to problems with continuous action
spaces.
2.2.3 Kernel-based reinforcement learning
A kernel-based approach to reinforcement learning that adopts a non-parametric perspective
is presented in Ormoneit & Sen (2002) to overcome the stability problems of temporal-
difference learning in problems with a continuous state space but finite action space. The
approximate value iteration algorithm is provably convergent to a unique solution of an
approximate Bellman’s equation regardless of its initial values for a finite training data set
of historical transitions. As the size of the transition data set goes to infinity, the approximate
value function and approximate policy converge in probability to the optimal value function
and optimal policy respectively.
2.2.4 Gaussian process models
Another non-parametric approach to optimal control problems is the Gaussian process model,
in which value functions are modeled with Gaussian processes. Both the Gaussian process
temporal difference (GPTD) algorithm (policy iteration) in Engel et al. (2005) and the
Gaussian process dynamic programming (GPDP) algorithm (value iteration) in Deisenroth
et al. (2008) have successful applications in complex nonlinear control problems without
convergence guarantees.
2.2.5 Other convergence results
Meyn (1997) proves convergence of policy iteration algorithms for average cost optimal con-
trol problems with unbounded cost and general state space. The algorithm assumes countable
action space and requires exact computation of the expectation. With further assumptions
of a c-regular (a strong stability condition where c represents the cost function) initial policy
and irreducibility of the state space, the algorithm generates a sequence of c-regular policies
that converge to the optimal average cost policy. One extension of the temporal difference algorithm to continuous domains can be found in Melo et al. (2007), which proves the convergence of a Q-learning algorithm with linear function approximation under a fixed learning policy for MDP problems with a continuous state space but a finite action space.
2.3 Examples of divergence
Value iteration-type algorithms with linear function approximation frequently fail to con-
verge. Bertsekas (1995) describes a counter-example for the TD(0) algorithm with stochastic gradient updating for the parameters. It is shown that the algorithm converges but can generate a poor approximation of the optimal value function in terms of Euclidean distance. Tsitsiklis & Van Roy (1996) presents a counter-example similar to that of Bertsekas (1995) but with a least squares updating rule for the parameter estimates, in which case divergence
happens even when the optimal value function can be perfectly represented by the linear
approximator. Boyan & Moore (1995) illustrates that divergent behavior occurs for value
iteration algorithms with a variety of function approximation techniques such as polynomial
regression, back-propagation and local weighted regression when the algorithm is applied to
simple nonlinear problems.
3 Preliminaries and the algorithm
We consider a class of infinite horizon Markov decision processes with continuous state
and action spaces. The following subsections discuss several important preliminary con-
cepts including Markov decision processes, contraction operators, continuous correspondence,
continuous-state Markov chains and the post-decision state variable. These basics are needed for the convergence proofs in sections 4, 5 and 6. The last subsection illustrates the details
of an approximate policy iteration algorithm with recursive least squares updating.
3.1 Markov decision processes
We start with a brief review of Markov decision processes. A Markov decision process is a
sequential optimization problem where the goal is to find a policy that maximizes (for our
application) the expected infinite-horizon discounted rewards. Let St be the state of the
system at time t, xt be a vector-valued continuous decision (control) vector, Xπ(St) be a
decision function corresponding to a policy π ∈ Π where Π is the stationary deterministic
policy space, C(St, xt) be a contribution/reward function, and γ be a discount factor between
0 and 1. We shall, by a convenient abuse of notation, use π to denote the policy, interchangeably with the corresponding decision function $X^\pi$. The system evolves according
to the following state transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1}), \qquad (3)$$
where Wt+1 represents the exogenous information that arrives during the time interval from
t to t + 1. The problem is to find the policy that solves
$$\sup_{\pi \in \Pi}\, \mathbb{E}\left\{\sum_{t=0}^{\infty} \gamma^t\, C(S_t, X^\pi(S_t))\right\}. \qquad (4)$$
Since solving the objective function (4) directly is computationally intractable, Bellman’s
equation is introduced so that the optimal control can be computed recursively
$$V(s) = \max_{x \in \mathcal{X}}\left\{C(s, x) + \gamma\, \mathbb{E}\left[V(S^M(s, x, W))\,|\,s\right]\right\}, \qquad (5)$$
where V (s) is the value function representing the value of being in state s by following the
optimal policy onward. It is worth noting that the contribution function in (5) can also be
stochastic. Then, Bellman’s equation becomes
$$V(s) = \max_{x \in \mathcal{X}}\left\{\mathbb{E}\left[C(s, x, W) + \gamma\, V(S^M(s, x, W))\,|\,s\right]\right\}. \qquad (6)$$
To solve problems with continuous states, we list the following assumptions and defi-
nitions for future reference in later sections. Assume that the state space S is a Borel-
measurable, convex and compact subset of Rm, the action space X is a compact subset of
Rn, the outcome space W is a compact subset of Rl and Q : S × X ×W → R is a contin-
uous probability transition function. Let Cb(S) denote the space of all bounded continuous
functions from S to R. It is well-known that Cb(S) is a complete metric space.
3.2 Contraction operators
In this section, we describe the contraction operators associated with Markov decision
processes. Their contraction property is crucial in the convergence proof of our algorithm.
Definition 3.1 (Max operator) Let M be the max operator such that for all s ∈ S and
V ∈ Cb(S),
$$MV(s) = \sup_{x \in \mathcal{X}}\left\{C(s, x) + \gamma \int_{\mathcal{W}} Q(s, x, dw)\, V(S^M(s, x, w))\right\},$$
where $Q$ has the Feller property (that is, $M$ maps $C_b(\mathcal{S})$ into itself) and $S^M : \mathcal{S} \times \mathcal{X} \times \mathcal{W} \to \mathcal{S}$ is the continuous state transition function.
Definition 3.2 (Operator for a fixed policy π) Let Mπ be the operator for a fixed policy
π such that for all s ∈ S
$$M_\pi V(s) = C(s, X^\pi(s)) + \gamma \int_{\mathcal{W}} Q(s, X^\pi(s), dw)\, V(S^M(s, X^\pi(s), w)),$$
where Q and SM have the same property as in definition 3.1.
There are a few elementary properties of the operators M and Mπ (Bertsekas & Shreve
(1978)) that will play an important role in the subsequent sections.
Proposition 3.1 (Monotonicity of M and Mπ) For any V1, V2 ∈ Cb(S), if V1(s) ≤ V2(s)
for all s ∈ S, then for all k ∈ N and s ∈ S
$$M^k V_1(s) \le M^k V_2(s), \qquad M_\pi^k V_1(s) \le M_\pi^k V_2(s).$$
Proposition 3.2 (Fixed point of M and Mπ) For any $V \in C_b(\mathcal{S})$, $\lim_{k\to\infty} M^k V = V^*$, where $V^*$ is the optimal value function, and $V^*$ is the only solution to the equation $V = MV$. Similarly, for any $V \in C_b(\mathcal{S})$, $\lim_{k\to\infty} M_\pi^k V = V^\pi$, where $V^\pi$ is the value function obtained by following policy π, and $V^\pi$ is the only solution to the equation $V = M_\pi V$.
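As a quick numerical illustration of proposition 3.2 (a hypothetical finite-chain example, not from the paper, where $M_\pi$ reduces to the affine map $V \mapsto c + \gamma P V$), repeatedly applying the operator converges to its unique fixed point:

import numpy as np

nS, gamma = 10, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=nS)   # transition matrix under a fixed policy
c = rng.normal(size=nS)                   # one-period contribution under the policy

def M_pi(V):                              # finite-chain analogue of definition 3.2
    return c + gamma * P @ V

V = np.zeros(nS)
for _ in range(300):                      # M_pi^k V -> V^pi (proposition 3.2)
    V = M_pi(V)

V_pi = np.linalg.solve(np.eye(nS) - gamma * P, c)  # V^pi solves V = M_pi V
assert np.max(np.abs(V - V_pi)) < 1e-8    # the iterates reach the fixed point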
3.3 Continuous correspondence
The correspondence Γ formally defined below describes the feasible set of actions that ensures
the max operator M takes the function space Cb(S) into itself. A correspondence is said to
be compact-valued if the set Γ(s) is compact for every s ∈ S.
Definition 3.3 (Correspondence) A correspondence Γ : S → X is a relation that assigns
a feasible action set Γ(s) ⊂ X for each s ∈ S.
Definition 3.4 (Lower hemi-continuity of correspondence) A correspondence Γ : S → X is lower hemi-continuous (l.h.c.) at s if Γ(s) is nonempty and if, for every x ∈ Γ(s) and
every sequence sn → s, there exists N ≥ 1 and a sequence {xn}∞n=N such that xn → x and
xn ∈ Γ(sn) for all n ≥ N .
Definition 3.5 (Upper hemi-continuity of correspondence) A compact-valued corre-
spondence Γ : S → X is upper hemi-continuous (u.h.c.) at s if Γ(s) is nonempty and if, for
every sequence sn → s and every sequence {xn} such that xn ∈ Γ(sn) for all n, there exists
a convergent subsequence of {xn} with limit point x ∈ Γ(s).
Definition 3.6 (Continuity of correspondence) A correspondence Γ : S → X is con-
tinuous at s ∈ S if it is both u.h.c. and l.h.c. at s.
3.4 Markov chains with continuous state space
To work with Markov chains with a general state space, we present the following definitions of
irreducibility, invariant measure, recurrence and positivity that all have familiar counterparts
in discrete chains. These properties are related to the stability of a Markov chain, which
is of great importance in proving the convergence of value function estimates. In addition,
the continuity property of the transition kernel is helpful in describing the behavior of chains on state spaces with desirable topological structure (Meyn & Tweedie (1993)). Hence, we
introduce the concepts of Feller chains, petite sets and T -chains, which will be used later to
classify positive Harris chains.
Definition 3.7 (ψ-Irreducibility for general space chains) We call a Markov chain Φ on state space S ϕ-irreducible if there exists a measure ϕ on B(S) such that whenever ϕ(A) > 0 for A ∈ B(S), we have
$$P_s\{\Phi \text{ ever enters } A\} > 0, \qquad \forall s \in \mathcal{S},$$
where $P_s$ denotes the conditional probability given that the chain starts in state s. Let ψ be the maximal irreducibility measure among such measures.
Definition 3.8 (Harris recurrence) The set A ∈ B(S) is called Harris recurrent if
$$P_s\{\Phi \in A \text{ infinitely often}\} = 1, \qquad \forall s \in \mathcal{S}.$$
A chain Φ is called Harris (recurrent) if it is ψ-irreducible and every set in
B+(S) = {A ∈ B(S) : ψ(A) > 0}
is Harris recurrent.
Definition 3.9 (Invariant measure) Let P (·, ·) be the transition kernel of a chain Φ on
the state space S. A σ-finite measure µ on B(S) with the property
$$\mu(A) = \int_{\mathcal{S}} \mu(ds)\, P(s, A), \qquad \forall A \in B(\mathcal{S}),$$
will be called invariant.
Definition 3.10 (Positive chains) Suppose a chain Φ is ψ-irreducible and admits an in-
variant probability measure µ. Then Φ is called a positive chain.
Definition 3.11 (Weak Feller chains) If a chain Φ has a transition kernel P such that
P (·, O) is a lower semi-continuous function for any open set O ∈ B(S), then Φ is called a
(weak) Feller chain.
It is worth noting that the weak Feller property is often defined by assuming that the tran-
sition kernel P maps the set of all continuous functions C(S) into itself.
Definition 3.12 (Petite set) A set C ∈ B(S) is called petite if, for some measure ν on B(S) and some δ > 0,
$$K(s, A) \ge \delta\, \nu(A), \qquad s \in C,\; A \in B(\mathcal{S}),$$
where K is the resolvent kernel defined by
$$K(s, A) = \sum_{n=0}^{\infty} \left(\tfrac{1}{2}\right)^{n+1} P^n(s, A).$$
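For a chain on a finite state space the resolvent kernel can be computed in closed form, since the geometric series gives $K = \frac{1}{2}(I - \frac{1}{2}P)^{-1}$. A small sketch (with a randomly generated transition matrix, as a hypothetical illustration) verifying this against the truncated series:

import numpy as np

nS = 6
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=nS)         # finite analogue of the kernel P(s, A)

K = 0.5 * np.linalg.inv(np.eye(nS) - 0.5 * P)   # closed form of the resolvent

K_series = sum(0.5 ** (n + 1) * np.linalg.matrix_power(P, n) for n in range(60))
assert np.max(np.abs(K - K_series)) < 1e-12     # agrees up to truncation error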
Definition 3.13 (T-chains) If every compact set in B(S) is petite, then Φ is called a T-chain. (For another, more detailed definition, see Meyn & Tweedie (1993), chapter 6.)
3.5 Post-decision state variable
Computing the expectation within the max operator M is often intractable when the un-
derlying distribution of the evolution of the stochastic system is unknown or the decision
x is a vector. However, we can circumvent the difficulty by introducing the notion of the
post-decision state variable (Van Roy et al. (1997), Judd (1998), Powell (2007)). To illustrate
the idea, suppose we can break the original transition function (3) into the two steps
$$S^x_t = S^{M,x}(S_t, x_t), \qquad (7)$$
$$S_{t+1} = S^{M,W}(S^x_t, W_{t+1}). \qquad (8)$$
Then, let $V^x(S^x_t)$ be the value of being in state $S^x_t$ immediately after we make a decision. There is a simple relationship between the pre-decision value function $V(S_t)$ and the post-decision value function $V^x(S^x_t)$, summarized as
$$V(S_t) = \max_{x_t \in \mathcal{X}}\left\{C(S_t, x_t) + \gamma\, V^x(S^x_t)\right\}, \qquad (9)$$
$$V^x(S^x_t) = \mathbb{E}\{V(S_{t+1})\,|\,S^x_t\}. \qquad (10)$$
By substituting (9) into (10), we obtain Bellman's equation for the post-decision value function
$$V^x(S^x_t) = \mathbb{E}\Big\{\max_{x_{t+1} \in \mathcal{X}}\left\{C(S_{t+1}, x_{t+1}) + \gamma\, V^x(S^x_{t+1})\right\}\,\Big|\,S^x_t\Big\}. \qquad (11)$$
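A concrete instance of the split (7)-(8) is a scalar inventory problem (a hypothetical illustration, not an example from the paper): the post-decision state is the stock level immediately after ordering, before the random demand arrives, which is why the maximization in (9) involves no expectation.

import numpy as np

def SM_x(s, x):               # step (7): deterministic effect of the decision
    return s + x              # post-decision state S^x_t = stock after ordering x

def SM_W(sx, w):              # step (8): effect of the exogenous information
    return max(sx - w, 0.0)   # next pre-decision state after demand w

rng = np.random.default_rng(3)
s = 2.0
sx = SM_x(s, 3.0)                     # S^x_t = 5.0, known at decision time
s_next = SM_W(sx, rng.poisson(4.0))   # S_{t+1}, known only after W_{t+1} arrives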
In our algorithm, we work with the post-decision value functions resulting from following a constant policy.
Step 0: Initialization:
    Step 0a Set the initial values of the value function parameters $\theta_0$.
    Step 0b Set the initial policy
            $\pi_0(s) = \arg\max_{x \in \Gamma(s)} \{C(s, x) + \gamma\, \phi(s^x)'\theta_0\}.$
    Step 0c Set the iteration counter n = 1.
    Step 0d Set the initial state $S^0_0$.
Step 1: Do for n = 1, ..., N:
Step 2:   Do for m = 1, ..., M:
Step 3:     Initialize $v_m = 0$.
Step 4:     Choose a one-step sample realization ω.
Step 5:     Do the following:
    Step 5a Set $x^n_m = \pi_{n-1}(S^n_m)$.
    Step 5b Compute $S^{n,x}_m = S^{M,x}(S^n_m, x^n_m)$, $S^n_{m+1} = S^{M,W}(S^{n,x}_m, W_{m+1}(\omega))$ and $S^{n,x}_{m+1} = S^{M,x}(S^n_{m+1}, x^n_{m+1})$.
    Step 5c Compute the corresponding basis function values in the general form $\phi(s^{n,x}_m) - \gamma\,\phi(s^{n,x}_{m+1})$.
Step 6:     Do the following:
    Step 6a Compute $v_m = C(S^{n,x}_m, S^{n,x}_{m+1})$.
    Step 6b Update the parameters $\theta^{n,m}$ with the LS/RLS method.
Step 7:   Update the parameters and the policy:
            $\theta_n = \theta^{n,M}$,
            $\pi_n(s) = \arg\max_{x \in \Gamma(s)} \{C(s, x) + \gamma\, \phi(s^x)'\theta_n\}.$
Step 8: Return the policy $\pi_N$ and the parameters $\theta_N$.

Figure 2: Infinite-horizon approximate policy iteration algorithm with recursive least squares method
By our assumptions of linear pre-decision policy value functions and a continuous state transition function, relation (10) implies that the post-decision value functions are continuous and of the linear form $V^x(s^x|\theta) = \phi(s^x)^T\theta$, where $\phi(s^x)$ is the vector of basis functions and θ is the vector of parameters. It is further assumed that a spanning set of basis functions is known, so it is enough to estimate only the linear parameters.
3.6 Algorithm details
The recursive least squares approximate policy iteration algorithm (RLSAPI) is summa-
rized in Figure 2. It is worth making a remark on the arg max function at the end of the
algorithm. This step is usually a multi-variate global optimization problem that updates
the policy (exactly or approximately) from the post-decision value function of the previous
inner iteration. The updated policy function returns a decision for any fixed input of the state variable. We assume that there is a tie-breaking rule that determines a unique solution to the arg max function for all f ∈ Cb(S), e.g., a nonlinear proximal point algorithm (Luque (1987)). As a result, the returned policies are well-defined single-valued functions.
It is worth noting that determining the unique solution to the arg max function may not
be an easy job in practice. However, the computational difficulty is significantly reduced if
we have special problem structures such as strict concavity and differentiability of the value
functions.
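To make the loop structure of figure 2 concrete, here is a heavily simplified Python sketch. It assumes a finite grid of candidate actions for the arg max (taking the first maximizer as one possible tie-breaking rule), replaces the recursive (RLS) update of step 6b with a single batch least squares solve per inner loop, assumes M is large enough that the resulting matrix is invertible, and takes user-supplied functions C, SM_x, SM_W, phi and sample_W matching the notation above; all of these names are placeholders rather than anything specified in the paper.

import numpy as np

def rlsapi(C, SM_x, SM_W, phi, sample_W, actions, s0, gamma, N, M):
    F = len(phi(SM_x(s0, actions[0])))
    theta = np.zeros(F)                                    # step 0a

    def policy(s, th):                                     # steps 0b and 7
        vals = [C(s, x) + gamma * phi(SM_x(s, x)) @ th for x in actions]
        return actions[int(np.argmax(vals))]               # first maximizer breaks ties

    for n in range(N):                                     # step 1: outer loop
        A, b = np.zeros((F, F)), np.zeros(F)
        s = s0
        for m in range(M):                                 # step 2: inner loop
            x = policy(s, theta)                           # step 5a
            sx = SM_x(s, x)                                # step 5b
            s1 = SM_W(sx, sample_W())                      # step 4: sample omega
            x1 = policy(s1, theta)
            sx1 = SM_x(s1, x1)
            rho = phi(sx)                                  # instrumental variable (section 4)
            A += np.outer(rho, phi(sx) - gamma * phi(sx1)) # step 5c
            b += rho * C(s1, x1)                           # step 6a: v_m = C^pi(S^x_m, S^x_{m+1})
            s = s1
        theta = np.linalg.solve(A, b)                      # steps 6b and 7 (batch LS)
    return theta, lambda s: policy(s, theta)               # step 8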
4 Convergence results for a fixed policy
The following theorems and lemmas are extensions of the proof for the convergence of the
LSTD algorithm by Bradtke & Barto (1996) applied to a Markov chain with continuous
state space. We would apply the parameter convergence results to prove the convergence of
the inner loop in our policy iteration algorithm. Bradtke & Barto (1996) argues that LSTD
algorithm is superior to the TD algorithm in terms of convergence properties for the following
three reasons: (1) Choosing step-size parameters is not necessary in LSTD (It is well-known
that a poor choice of step-sizes can significantly slow down convergence). (2) LSTD uses
samples more efficiently, so it produces faster convergence rate. (3) TD is not robust to the
initial value of the parameter estimates and choice of basis functions, but LSTD is.
For a fixed policy π, the transition steps (7) and (8) become
$$S^\pi_t = S^{M,\pi}(S_t, \pi(S_t)),$$
$$S_{t+1} = S^{M,W}(S^\pi_t, W_{t+1}).$$
As a result, the Markov decision problem can be reduced to a Markov chain for post-decision
states if the exogenous information $W_{t+1}$ depends only on the post-decision state $S^\pi_t$ in the previous time period. Bellman's equation (11) for the post-decision state becomes
$$V^\pi(s) = \int_{\mathcal{S}^\pi} P(s, ds')\left(C^\pi(s, s') + \gamma\, V^\pi(s')\right), \qquad (12)$$
where $P(\cdot,\cdot)$ is the transition probability function of the chain, $C^\pi(\cdot,\cdot)$ is the stochastic contribution/reward function with $C^\pi(S^\pi_t, S^\pi_{t+1}) = C(S_{t+1}, \pi(S_{t+1}))$, and $\mathcal{S}^\pi$ is the post-decision state space obtained by following policy π. It is worth noting that $\mathcal{S}^\pi$ is compact since $\mathcal{S}$ and $\mathcal{X}$ are compact and the transition function is continuous. In addition, $s$ in (12) is the post-decision state variable; we drop the superscript $x$ for simplicity. Suppose the true value function for following the fixed policy π is $V^\pi(s) = \phi(s)^T\theta^*$, where $\phi(s) = [\cdots, \phi_f(s), \cdots]$ is the vector of basis functions of dimension $F = |\mathcal{F}|$ (the number of basis functions) and $f \in \mathcal{F}$ ($\mathcal{F}$ denotes the set of features). Bellman's equation (12) gives us
$$\phi(s)^T\theta^* = \int_{\mathcal{S}^\pi} P(s, ds')\left[C^\pi(s, s') + \gamma\, \phi(s')^T\theta^*\right].$$
We can rewrite the recursion as
$$C^\pi(s, s') = \left(\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T \theta^* + C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s').$$
Remark: It is worth noting that $\phi(s)$ and $\phi(s')$ are vectors. The integral of $\phi(s')$ is taken componentwise, so it returns a vector. Similarly, the integral of a matrix is taken componentwise.
Here, we have a general linear model with both observation errors and input noise, which is known as the errors-in-variables model (Young (1984)). $C^\pi(s, s')$ is the observation, and $C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$ is the observation error. $\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')$ can be viewed as the input variable. Since the transition probability function may be unknown or not computable, at iteration m, instead of having the exact input variable, we can only observe an unbiased sample estimate $\phi_m - \gamma\,\phi_{m+1}$, where $\phi_m$ is shorthand for $\phi(s_m)$. Therefore, the regular linear regression formula for estimating $\theta^*$ does not apply, since the errors in the input variables introduce bias. To circumvent this difficulty, we need
to introduce an instrumental variable ρ, which is correlated with the true input variable
but uncorrelated with the input error term and the observation error term. It is further
required that the correlation matrix between the instrumental variable and input variable is
nonsingular and finite. Then, the m-th estimate of θ∗ becomes
$$\theta_m = \left[\frac{1}{m}\sum_{i=1}^{m} \rho_i\,(\phi_i - \gamma\,\phi_{i+1})'\right]^{-1} \left[\frac{1}{m}\sum_{i=1}^{m} \rho_i\, C_i\right],$$
where $C_i = C^\pi(s_i, s_{i+1})$ is the i-th observation of the contribution.
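As a sketch, this estimator is a few lines of Python (the names phis and cs are hypothetical stand-ins for the recorded basis values $\phi(s_i)$ and contribution observations $C_i$ along one sample trajectory):

import numpy as np

def lstd_iv(phis, cs, gamma):
    """IV estimate of theta*: phis[i] = phi(s_i), cs[i] = C^pi(s_i, s_{i+1})."""
    m, F = len(cs), phis.shape[1]
    A, b = np.zeros((F, F)), np.zeros(F)
    for i in range(m):
        rho = phis[i]                 # instrumental variable rho_i = phi(s_i)
        A += np.outer(rho, phis[i] - gamma * phis[i + 1]) / m
        b += rho * cs[i] / m
    return np.linalg.solve(A, b)      # theta_m from the display above

Note that ordinary least squares on the same data would instead use the noisy regressor $\phi_i - \gamma\,\phi_{i+1}$ as its own instrument, which is exactly the biased procedure the instrumental variable is introduced to avoid.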
Convergence of the parameter estimates requires stability of the chain, which can be interpreted as saying that the chain eventually settles down to a stable regime independent of its starting point (Meyn & Tweedie (1993)). Positive Harris chains as defined in section 3 meet this stability requirement precisely, and the invariant measure µ describes the stationary distribution of the chain. Young (1984) proves convergence of the parameter estimates calculated from samples drawn from a stationary distribution in the following lemma.
Lemma 4.1 (Young, 1984) Let y be the input variable, ξ be the input noise, ε be the observation noise, and ρ be the instrumental variable in the errors-in-variables model. If the correlation matrix Cor(ρ, y) is non-singular and finite, and Cor(ρ, ξ) = 0 and Cor(ρ, ε) = 0, then $\theta_m \to \theta^*$ in probability.
Proof:
See Young (1984).
The task is to find a candidate for ρ in our algorithm. Following Bradtke & Barto
(1996), we also choose φ(s) as the instrumental variable for the continuous case. Under the
assumption that the Markov chain follows a stationary process with an invariant measure µ,
the following lemma shows that φ(s) satisfies the requirement that it is uncorrelated with
both input and observation error terms.
Lemma 4.2 Assume the Markov chain Φ has an invariant probability measure µ and let $\rho = \phi(s)$, $\varepsilon = C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$ and $\xi = \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\,\phi(s')$. We have (1) $E(\varepsilon) = 0$, (2) $\mathrm{Cor}(\rho, \varepsilon) = 0$, and (3) $\mathrm{Cor}(\rho, \xi) = 0$.
Proof:
Let $C_s = \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$. (1) is immediate by definition. By (1),
$$\begin{aligned}
\mathrm{Cor}(\rho, \varepsilon) &= E(\phi(s)\varepsilon) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\left(C^\pi(s, s') - C_s\right) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\left(C^\pi(s, s') - C_s\right) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(C_s - C_s\right) \\
&= 0.
\end{aligned}$$
Hence, (2) holds. Similarly,
$$\begin{aligned}
\mathrm{Cor}(\rho, \xi) &= E(\phi(s)\xi^T) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds'')\,\phi(s'') - \gamma\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds'')\,\phi(s'') - \gamma\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\, 0^T \\
&= 0.
\end{aligned}$$
Therefore, (3) holds.
For problems with a finite state space, Bradtke & Barto (1996) assumes that the number of linearly independent basis functions is the same as the size of the state space, so that the correlation matrix between the input and instrumental variables is invertible. However, this assumption defeats the purpose of using continuous function approximation to overcome the curse of dimensionality, since the complexity of computing the parameter estimates is then the same as that of estimating a look-up table value function directly. By imposing stronger assumptions on the basis functions and the discount factor γ, the following lemma shows that the invertibility of the correlation matrix is guaranteed in the continuous case.
Lemma 4.3 (Non-singularity of the correlation matrix) Suppose we have orthonormal basis functions with respect to the invariant measure µ of the Markov chain and $\gamma < \frac{1}{F}$, where F is the number of basis functions. Then the correlation matrix
$$\int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T$$
is invertible.
Proof:
For shorthand notation, we write $\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s) = \mu\phi_i$ and $\int_{\mathcal{S}^\pi} P(s, ds')\,\phi_i(s') = P\phi_i(s)$. It is worth noting that $\mu\phi_i$ is a constant and $P\phi_i(s)$ is a function of s. Then, we can write the …
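In outline, the lemma can be verified with a diagonal dominance argument; the following sketch is a reconstruction under the lemma's stated assumptions and need not match the authors' own derivation. By orthonormality, $\int_{\mathcal{S}^\pi}\mu(ds)\,\phi_i(s)\phi_j(s) = \delta_{ij}$, and by the Cauchy-Schwarz and Jensen inequalities together with the invariance of µ,
$$\left|\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s)\, P\phi_j(s)\right| \le \left(\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s)^2\right)^{1/2} \left(\int_{\mathcal{S}^\pi} \mu(ds)\,\big(P\phi_j(s)\big)^2\right)^{1/2} \le 1.$$
Hence the correlation matrix has the form $I - \gamma B$ with $|B_{ij}| \le 1$: each diagonal entry has magnitude at least $1 - \gamma$, while the off-diagonal entries of each row sum to at most $(F-1)\gamma$ in absolute value. The matrix is therefore strictly diagonally dominant, hence invertible, whenever $1 - \gamma > (F-1)\gamma$, i.e. $\gamma < 1/F$.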
$V^{\pi_n} = V^{\pi_{n+1}}$. So $\{V^{\pi_n}\}$ is a non-decreasing sequence.
Now we show that $M^n V^{\pi_0} \le V^{\pi_n}$ for all $n \in \mathbb{N}$ by induction.

Base step: n = 1. Since $V^{\pi_0} \le V^{\pi_1}$, by monotonicity of $M_{\pi_1}$,
$$M V^{\pi_0} = M_{\pi_1} V^{\pi_0} \le M_{\pi_1} V^{\pi_1} = V^{\pi_1}.$$

Induction step: Suppose $M^n V^{\pi_0} \le V^{\pi_n}$. Again by monotonicity of M and $M_{\pi_n}$,
$$M^{n+1} V^{\pi_0} \le M V^{\pi_n} = M_{\pi_{n+1}} V^{\pi_n} \le M_{\pi_{n+1}} V^{\pi_{n+1}} = V^{\pi_{n+1}}.$$

By the contraction mapping theorem (Royden (1968)), $M^n V^{\pi_0} \nearrow V^*$ uniformly as $n \to \infty$. We also note that $V^* = \sup_\pi V^\pi$. Therefore, $V^{\pi_n} \nearrow V^*$ uniformly.
Lemma 5.1 Let S, X, Γ be as defined in section 3 and assume Γ is nonempty, compact- and convex-valued, and continuous at each s ∈ S. Let $f_n : \mathcal{S} \times \mathcal{X} \to \mathbb{R}$, $n \in \mathbb{N}$, be a sequence of continuous functions. Assume f has the same properties and $f_n \to f$ uniformly. Define $\pi_n$ and π by
$$\pi_n(s) = \arg\max_{x \in \Gamma(s)} f_n(s, x)$$
and
$$\pi(s) = \arg\max_{x \in \Gamma(s)} f(s, x).$$
Then $\pi_n \to \pi$ pointwise. If S is compact, $\pi_n \to \pi$ uniformly.
Proof:
See lemma 3.7 and theorem 3.8 on pages 64-65 of Stokey et al. (1989).
Lemma 5.2 Let S, X, W and Q be defined as in section 3. If $g : \mathcal{S} \times \mathcal{X} \times \mathcal{W} \to \mathbb{R}$ is bounded and continuous, then $Tg(s, x) = \int_{\mathcal{W}} Q(s, x, dw)\, g(s, x, w)$ is also bounded and continuous.
Proof:
See proof of lemma 9.5 on page 261 of Stokey et al. (1989).
Theorem 5.1 (Convergence of the exact policy iteration algorithm) Assume that for all π ∈ Π the infinite-horizon problem can be reduced to a positive Harris chain and that the corresponding policy value function is continuous and linear in the parameters. Further assume that the contribution function is bounded and continuous. Let S, X, W and Q be defined as in section 3. For M, N defined in figure 2, as M → ∞ and then N → ∞, the exact policy iteration algorithm converges; that is, $\pi_n \to \pi^*$ uniformly.

Proof:
By assumption, in the inner loop the system evolves according to a stationary Markov process given a constant policy. By theorem 4.1 or corollary 4.1, as M → ∞, $\theta^{n,M} \to \theta^*_n$ almost surely. In other words, as M → ∞, by following the fixed policy $\pi_n$ we can obtain the exact post-decision value function in the linearly parameterized form $\phi(s^x)'\theta^*_n$. Then define …