Convergence Proofs of Least Squares Policy Iteration Algorithm for High-Dimensional Infinite Horizon
Markov Decision Process Problems
Jun Ma and Warren B. Powell
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544
December 17, 2008
Abstract
Most of the current theory for dynamic programming algorithms focuses on finite state,
finite action Markov decision problems, with a paucity of theory for the convergence of
approximation algorithms with continuous states. In this paper we propose a policy iteration
algorithm for infinite-horizon Markov decision problems where the state and action spaces are
continuous and the expectation cannot be computed exactly. We show that an appropriately
designed least squares (LS) or recursive least squares (RLS) method is provably convergent
under certain problem structure assumptions on value functions. In addition, we show
that the LS/RLS approximate policy iteration algorithm converges in the mean, meaning
that the mean error between the approximate policy value function and the optimal value
function shrinks to zero as successive approximations become more accurate. Furthermore,
the convergence results are extended to the more general case of unknown basis functions
with orthogonal polynomials.
1 Introduction
The core concept of dynamic programming for solving Markov decision processes (MDP) is Bellman's equation, which is often written in the standard form (Puterman (1994))
$$V_t(S_t) = \max_{x_t}\Big\{C(S_t, x_t) + \gamma \sum_{s' \in \mathcal{S}} P(s'|S_t, x_t)\, V_{t+1}(s')\Big\}. \qquad (1)$$
It is more convenient for our purposes, and mathematically equivalent, to write Bellman's equation (1) in the expectation form
$$V_t(s) = \max_{x_t}\left\{C(s, x_t) + \gamma\, \mathbb{E}\left[V_{t+1}(S_{t+1})\,|\,S_t = s\right]\right\}, \qquad (2)$$
where $S_{t+1} = S^M(S_t, x_t, W_{t+1})$ and the expectation is taken over the random information variable $W_{t+1}$. If the state variable $S_t$ and decision variable $x_t$ are discrete scalars and the
transition matrix P is known, the value function Vt(St) can be computed by enumerating all
the states backward through time, which is a method often referred to as backward dynamic
programming. Moreover, there is a mature and elegant convergence theory supporting algo-
rithms that handle problems with finite state and action spaces and computable expectation
(Puterman (1994)). However, a discrete representation of the problem often suffers from the
well-known "curses of dimensionality": 1) If the state St is a vector, the state space grows
exponentially with the number of dimensions. 2) A multi-dimensional information variable
Wt makes the computation of the expectation intractable. 3) When the decision variable
xt is a high-dimensional vector, we have to rely on mathematical programming to solve
the more complicated optimization problem within Bellman's equation. In addition, there
are a large number of real world applications with continuous state and action spaces, to
which direct application of algorithms developed for discrete problems is not appropriate.
Hence, continuous value function approximations are suggested to handle high-dimensional
and continuous applications.
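To make the discrete baseline concrete, the following minimal Python sketch (illustrative only; the reward and transition arrays are randomly generated placeholders, not an example from the paper) implements backward dynamic programming for equation (1). The (nS, nX, nS) transition array is exactly the object whose size explodes under the curses of dimensionality.

import numpy as np

# Hypothetical finite problem: nS states, nX actions, horizon T.
nS, nX, T, gamma = 50, 5, 20, 0.95
rng = np.random.default_rng(0)
C = rng.normal(size=(nS, nX))                  # C(s, x): reward table
P = rng.dirichlet(np.ones(nS), size=(nS, nX))  # P[s, x, s'] = P(s'|s, x)

V = np.zeros((T + 1, nS))                      # V[T] is the terminal value
for t in reversed(range(T)):                   # step backward through time
    Q = C + gamma * P @ V[t + 1]               # Q[s, x] as in equation (1)
    V[t] = Q.max(axis=1)                       # the max over x_t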
This paper describes a provably convergent and implementable approximate policy-
iteration algorithm that uses linear function approximation to handle infinite-horizon Markov
decision process problems where state, action and information variables are all continuous
vectors (possibly of high dimensionality) and the expectation cannot be computed exactly.
Policy iteration is not a novel idea. However, when the state and action spaces are not finite,
policy iteration runs into such difficulties as (1) the feasibility of obtaining accurate policy
value functions in a computationally implementable way and (2) the existence of a sequence
of policies generated by the algorithm (Bertsekas & Shreve (1978)). Least squares or recursive least squares (LS/RLS) updating methods are proposed to address the first difficulty. To overcome the second difficulty, we impose the significant assumption that the true value functions for all policies in the policy space are continuous and linearly parameterized with finitely many known basis functions. This assumption yields a sound convergence theory for the algorithm applied to problems with continuous state and action spaces.
The main contributions of this paper include: a) a thorough and up-to-date literature
review on continuous function approximations, b) extension of the least squares temporal
difference learning algorithm (LSTD(0)) by Bradtke & Barto (1996) to the continuous case, c)
almost sure convergence of the exact policy iteration algorithm using the LS/RLS updating,
d) convergence in the mean of the corresponding approximate policy iteration algorithm,
and e) extension of the convergence results to unknown basis functions.
The rest of the paper is organized as follows. In section 2, we review the literature on
continuous function approximations applied to Markov decision process problems and their
asymptotic properties. In section 3, we summarize important mathematical foundations
to establish convergence and illustrate the details of a least squares/recursive least squares
approximate policy iteration (LS/RLSAPI) algorithm. Section 4 presents the convergence
results for a fixed policy. In section 5, by applying the results in section 4, we show almost sure
convergence of the exact policy iteration algorithm, which requires exact policy evaluation,
to the optimal policy. Section 6 proves mean convergence of approximate policy iteration,
in which case policies are evaluated approximately. In section 7, we extend the convergence
result in section 6 to unknown basis functions using orthogonal polynomials. The last section
concludes and discusses future research directions.
2 Literature Review
In this section, we review the literature on continuous function approximations and related
convergence theories. We break this section into three parts. The first part focuses on con-
tinuous approximation algorithms for discrete MDP problems. The second part deals with
function approximation algorithms applied to problems with a continuous state space. The
last part describes some divergence examples of value iteration type algorithms using para-
metric function approximators. The table in figure 1 is a brief summary of the algorithms
categorized by their characteristics such as whether the state and action spaces are discrete or
continuous (D/C), whether the contribution/reward function is quadratic or general (Q/G),
whether the expectation can be computed exactly (Y/N), whether the problem is determinis-
tic or stochastic (Gaussian noise) (D/S(G)), the type of algorithms including value iteration
(VI), fixed policy (FP), exact or approximate policy iteration (EPI/API), and whether there
is a convergence guarantee for the algorithm (Y/N) or performance bound (B). Details of
the algorithms and their convergence properties are discussed in the following subsections.
2.1 Continuous approximations of discrete problems
Since the inception of the field of dynamic programming, researchers have devoted consid-
erable effort to explore the use of compact representations in value function approximation
in order to overcome the curses of dimensionality in large-scale stochastic dynamic programming (e.g., Schweitzer & Seidmann (1985)). However, most of these approaches are heuristic,
and there is no formal proof guaranteeing convergence of the algorithms. More recently,
provably convergent algorithms have been proposed for different approximation techniques
such as feature-based function approximation, temporal difference learning, fitted temporal
difference learning with contraction or expansion mappings, and residual gradient. These
are reviewed below.
                                 State  Action  Reward  Exp.  Noise  Policy  Conv.
Bradtke et al. (1994)              C      C       Q      N     D     API      Y
Gordon (1995)                     NA     NA       G      Y     S     VI       Y
Baird (1995)                       D      D       G      N     D     VI       Y
Tsitsiklis & Van Roy (1996)        D      D       G      Y     S     VI       Y
Bradtke & Barto (1996)             D      D       G      N     S     FP       Y
Landelius & Knutsson (1996)        C      C       Q      N     D     VI/API   Y
Tsitsiklis & Van Roy (1997)        D      D       G      Y     S     FP       Y
Meyn (1997)                        C      D       G      Y     S     EPI      Y
Papavassiliou & Russell (1999)     D      D       G      N     S     FP       Y
Gordon (2001)                      D      D       G      N     S     VI       Y
Ormoneit & Sen (2002)              C      D       G      N     S     VI       Y
Lagoudakis & Parr (2003)           D      D       G      N     S     API      N
Perkins & Precup (2003)            D      D       G      N     S     API      Y
Engel et al. (2005)                C      C       G      N     S     API      N
Munos & Szepesvari (2005)          C      D       G      N     S     VI       B
Melo et al. (2007)                 C      D       G      N     S     FP       Y
Szita (2007)                       C      C       Q      N     S(G)  VI       Y
Antos et al. (2007a, 2008)         C      D       G      N     S     API      B
Antos et al. (2007b)               C      C       G      N     S     API      B
Deisenroth et al. (2008)           D      D       G      Y     S     VI       N

Figure 1: Table of some continuous function approximation algorithms and related convergence results
2.1.1 Feature-based function approximation
The first step to set up a rigorous framework combining dynamic programming and compact
representations is taken in Tsitsiklis & Van Roy (1996). Two types of feature-based value
iteration algorithms are proposed. One is a variant of the value iteration algorithm that
uses a look-up table in feature space rather than in state space (similar to aggregation of
state variables). The other value iteration algorithm employs feature extraction and linear
approximations with a fixed set of basis functions including radial basis functions, neural
networks, wavelet networks, and polynomials. Under rather strict technical assumptions
on the feature mapping, Tsitsiklis & Van Roy (1996) proves the convergence of the value
iteration algorithm (not necessarily to the optimal value function unless it is spanned by
the basis functions) and provides a bound on the quality of the resulting approximations
compared with the optimal value function.
2.1.2 Temporal difference learning
Tsitsiklis & Van Roy (1997) proves convergence of the online temporal difference learning algorithm TD(λ) using a linear-in-the-parameters model and continuous basis functions. The con-
vergence proof assumes a fixed policy, in which case the problem is a Markov chain. All the
results are established on the assumption of a discrete state Markov chain. It is claimed that
the proofs can be easily carried over to the continuous case, but no details for this extension
are provided. Papavassiliou & Russell (1999) describes the Bridge algorithm for temporal
difference learning applied to a fixed policy. It shows that the algorithm converges to an
approximate global optimum for any "agnostically learnable" hypothesis class other than the class of linear combinations of fixed basis functions, and it provides approximation error bounds.
Bradtke & Barto (1996) and Boyan (1999) prove almost sure convergence of the Least-Squares TD (LSTD) algorithm when it is used with linear function approximation for a constant policy. Motivated by the LSTD algorithm, Lagoudakis & Parr (2003) proposes the least squares policy iteration (LSPI) algorithm, which combines value-function approximation with linear architectures and approximate policy iteration. Mahadevan & Maggioni (2007)
extends the LSPI algorithm within the representation policy iteration (RPI) framework. By
representing a finite sample of state transitions induced by the MDP as an undirected graph,
RPI constructs an orthonormal set of basis functions with the graph Laplacian operator.
With some technical assumptions on the policy improvement operator, Perkins & Precup
(2003) proves convergence of the approximate policy iteration algorithm to a unique solution
from any initial policy when SARSA updates (see Sutton & Barto (1998)) are used with linear
state-action value function approximation for policy evaluation.
Gordon (1995) provides a convergence proof for the fitted temporal difference learning (value iteration) algorithm with function approximations that are contraction or expansion mappings, such as k-nearest-neighbor, linear interpolation, some types of splines, and local weighted averaging. Interestingly, linear regression and neural networks do not fall into this class, because small local changes can lead to large global shifts of the approximation, so they can in fact diverge. Gordon (2001) proves a weaker convergence result for two linear approximation algorithms: SARSA(0) (a Q-learning type algorithm presented in Rummery & Niranjan (1994)) and V(0) (a value-iteration type algorithm introduced by Tesauro (1994)) converge to a bounded region almost surely.
2.1.3 Residual gradient algorithm
To overcome the instability of Q-learning or value iteration when implemented directly with a
general function approximation, residual gradient algorithms, which perform gradient descent
on the mean squared Bellman residual rather than the value function or Q-function, are
proposed in Baird (1995). Convergence of the algorithms to a local minimum of the Bellman
residual is guaranteed only for the deterministic case.
2.2 Approximations of continuous problems
Comparatively few papers treat function approximation algorithms directly applied to continuous problems. Convergence results are mostly found for the problem class of linear quadratic regulation (LQR), batch reinforcement learning, and the non-parametric approach with kernel regression. Another non-parametric method, using a Gaussian process model, has been successfully applied to complex nonlinear control problems but lacks convergence guarantees. Other convergence results are established for exact policy iteration and constant-policy Q-learning, which assume a finite or countable action space and ergodicity of the underlying process.
2.2.1 Linear quadratic regulation
In the 1990s, convergence proofs of DP/ADP type algorithms were established for a special
problem class with continuous states called Linear Quadratic Regulation in control theory.
One nice feature of LQR type problems is that one can find analytical solutions if the
exact reward/contribution and transition functions are known. When the exact form of the
functions is unknown and only sample observations are made, dynamic programming-type
algorithms are proposed to find the optimal control. Bradtke et al. (1994) proposes two algorithms based on Q-learning and applies them to deterministic infinite-horizon LQR problems.
Approximate policy iteration is provably convergent to the optimal control policy (Bradtke
(1993)), but an example shows that value iteration only converges locally. In Landelius
& Knutsson (1996), convergence proofs are presented for several adaptive-critic algorithms
applied to deterministic LQR problems including Heuristic Dynamic Programming (HDP,
another name for Approximate Dynamic Programming), Dual Heuristic Programming (DHP,
which works with derivatives of value function), Action Dependent HDP (ADHDP, another
name for Q-learning) and Action Dependent DHP (ADDHP). The key difference between
the RL algorithm proposed in Bradtke et al. (1994) and HDP is the parameter updating
formula. Bradtke et al. (1994) uses recursive least squares, while Landelius & Knutsson
(1996) uses a stochastic gradient algorithm that may introduce a scaling problem. It seems natural to extend the convergence proofs from the deterministic case to the stochastic case, but we are only aware of a convergence proof for the stochastic gradient algorithm, by Szita (2007), under the assumption of Gaussian noise.
2.2.2 Batch reinforcement learning
More recently, a series of papers (Munos & Szepesvari (2005), Antos et al. (2007a,b, 2008)) derives finite-sample probably approximately correct (PAC) bounds for batch reinforcement learning problems. These performance bounds ensure that the algorithms produce a near-optimal policy with high probability. They depend on the mixing rate of the sample
trajectory, the smoothness properties of the underlying MDP, the approximation power and
capacity of the function approximation method used, the iteration counter of the policy
improvement step, and the sample size for policy evaluation. More specifically, Munos & Szepesvari (2005) considers a sampling-based fitted value iteration algorithm for MDP problems with a large or possibly infinite state space but a finite action space, given a known generative model of the environment. In a model-free setting, Antos et al. (2007b, 2008) propose off-
policy fitted policy iteration algorithms that are based on a single trajectory of some fixed
behavior policy to handle problems in continuous state space and finite action space. An-
tos et al. (2007a) extends previous consistency results to problems with continuous action
spaces.
2.2.3 Kernel-based reinforcement learning
A kernel-based approach to reinforcement learning that adopts a non-parametric perspective
is presented in Ormoneit & Sen (2002) to overcome the stability problems of temporal-
difference learning in problems with a continuous state space but finite action space. The
approximate value iteration algorithm is provably convergent to a unique solution of an
approximate Bellman’s equation regardless of its initial values for a finite training data set
of historical transitions. As the size of the transition data set goes to infinity, the approximate
value function and approximate policy converge in probability to the optimal value function
and optimal policy respectively.
2.2.4 Gaussian process models
Another non-parametric approach to optimal control problems is the Gaussian process model,
in which value functions are modeled with Gaussian processes. Both the Gaussian process
temporal difference (GPTD) algorithm (policy iteration) in Engel et al. (2005) and the
Gaussian process dynamic programming (GPDP) algorithm (value iteration) in Deisenroth
et al. (2008) have successful applications in complex nonlinear control problems without
convergence guarantees.
2.2.5 Other convergence results
Meyn (1997) proves convergence of policy iteration algorithms for average cost optimal con-
trol problems with unbounded cost and general state space. The algorithm assumes countable
action space and requires exact computation of the expectation. With further assumptions
of a c-regular (a strong stability condition where c represents the cost function) initial policy
and irreducibility of the state space, the algorithm generates a sequence of c-regular policies
that converge to the optimal average cost policy. One extension of the temporal difference algorithm to continuous domains can be found in Melo et al. (2007), which proves the convergence of a Q-learning algorithm with linear function approximation under a fixed learning policy for MDP problems with a continuous state space but a finite action space.
2.3 Examples of divergence
Value iteration-type algorithms with linear function approximation frequently fail to con-
verge. Bertsekas (1995) describes a counter-example for the TD(0) algorithm with stochastic gradient updating for the parameters. It is shown that the algorithm converges but can generate a poor approximation of the optimal value function in terms of Euclidean distance. Tsitsiklis & Van Roy (1996) presents a counter-example similar to that of Bertsekas (1995) but with a least squares updating rule for the parameter estimates, in which case divergence
happens even when the optimal value function can be perfectly represented by the linear
approximator. Boyan & Moore (1995) illustrates that divergent behavior occurs for value
iteration algorithms with a variety of function approximation techniques such as polynomial
regression, back-propagation and local weighted regression when the algorithm is applied to
simple nonlinear problems.
3 Preliminaries and the algorithm
We consider a class of infinite horizon Markov decision processes with continuous state
and action spaces. The following subsections discuss several important preliminary con-
cepts including Markov decision processes, contraction operators, continuous correspondence,
continuous-state Markov chains and the post-decision state variable. These basics are needed for the convergence proofs in sections 4, 5 and 6. The last subsection illustrates the details
of an approximate policy iteration algorithm with recursive least squares updating.
3.1 Markov decision processes
We start with a brief review of Markov decision processes. A Markov decision process is a
sequential optimization problem where the goal is to find a policy that maximizes (for our
application) the expected infinite-horizon discounted rewards. Let St be the state of the
system at time t, xt be a vector-valued continuous decision (control) vector, Xπ(St) be a
decision function corresponding to a policy π ∈ Π where Π is the stationary deterministic
policy space, C(St, xt) be a contribution/reward function, and γ be a discount factor between
0 and 1. We shall, by a convenient abuse of notation, use π to denote the policy, interchangeably with the corresponding decision function $X^\pi$. The system evolves according
to the following state transition function
$$S_{t+1} = S^M(S_t, x_t, W_{t+1}), \qquad (3)$$
where Wt+1 represents the exogenous information that arrives during the time interval from
t to t + 1. The problem is to find the policy that solves
$$\sup_{\pi \in \Pi}\, \mathbb{E}\left\{\sum_{t=0}^{\infty} \gamma^t\, C(S_t, X^\pi(S_t))\right\}. \qquad (4)$$
Since solving the objective function (4) directly is computationally intractable, Bellman’s
equation is introduced so that the optimal control can be computed recursively
$$V(s) = \max_{x \in \mathcal{X}}\left\{C(s, x) + \gamma\, \mathbb{E}\left[V(S^M(s, x, W))\,|\,s\right]\right\}, \qquad (5)$$
where V (s) is the value function representing the value of being in state s by following the
optimal policy onward. It is worth noting that the contribution function in (5) can also be
stochastic. Then, Bellman’s equation becomes
$$V(s) = \max_{x \in \mathcal{X}}\left\{\mathbb{E}\left[C(s, x, W) + \gamma\, V(S^M(s, x, W))\,|\,s\right]\right\}. \qquad (6)$$
To solve problems with continuous states, we list the following assumptions and defi-
nitions for future reference in later sections. Assume that the state space S is a Borel-
measurable, convex and compact subset of Rm, the action space X is a compact subset of
Rn, the outcome space W is a compact subset of Rl and Q : S × X ×W → R is a contin-
uous probability transition function. Let Cb(S) denote the space of all bounded continuous
functions from S to R. It is well-known that Cb(S) is a complete metric space.
3.2 Contraction operators
In this section, we describe the contraction operators associated with Markov decision
processes. Their contraction property is crucial in the convergence proof of our algorithm.
Definition 3.1 (Max operator) Let M be the max operator such that for all s ∈ S and
V ∈ Cb(S),
$$MV(s) = \sup_{x \in \mathcal{X}}\left\{C(s, x) + \gamma \int_{\mathcal{W}} Q(s, x, dw)\, V(S^M(s, x, w))\right\},$$
where $Q$ has the Feller property (that is, $M$ maps $C_b(\mathcal{S})$ into itself) and $S^M : \mathcal{S} \times \mathcal{X} \times \mathcal{W} \to \mathcal{S}$ is the continuous state transition function.
Definition 3.2 (Operator for a fixed policy π) Let Mπ be the operator for a fixed policy
π such that for all s ∈ S
$$M_\pi V(s) = C(s, X^\pi(s)) + \gamma \int_{\mathcal{W}} Q(s, X^\pi(s), dw)\, V(S^M(s, X^\pi(s), w)),$$
where Q and SM have the same property as in definition 3.1.
There are a few elementary properties of the operators M and Mπ (Bertsekas & Shreve
(1978)) that will play an important role in the subsequent sections.
Proposition 3.1 (Monotonicity of M and Mπ) For any V1, V2 ∈ Cb(S), if V1(s) ≤ V2(s)
for all s ∈ S, then for all k ∈ N and s ∈ S
$$M^k V_1(s) \le M^k V_2(s), \qquad M_\pi^k V_1(s) \le M_\pi^k V_2(s).$$
Proposition 3.2 (Fixed point of M and Mπ) For any $V \in C_b(\mathcal{S})$, $\lim_{k\to\infty} M^k V = V^*$, where $V^*$ is the optimal value function, and $V^*$ is the only solution to the equation $V = MV$. Similarly, for any $V \in C_b(\mathcal{S})$, $\lim_{k\to\infty} M_\pi^k V = V^\pi$, where $V^\pi$ is the value function obtained by following policy π, and $V^\pi$ is the only solution to the equation $V = M_\pi V$.
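As a quick numerical illustration of proposition 3.2 (a hypothetical finite-chain example, not from the paper, where $M_\pi$ reduces to the affine map $V \mapsto c + \gamma P V$), repeatedly applying the operator converges to its unique fixed point:

import numpy as np

nS, gamma = 10, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=nS)   # transition matrix under a fixed policy
c = rng.normal(size=nS)                   # one-period contribution under the policy

def M_pi(V):                              # finite-chain analogue of definition 3.2
    return c + gamma * P @ V

V = np.zeros(nS)
for _ in range(300):                      # M_pi^k V -> V^pi (proposition 3.2)
    V = M_pi(V)

V_pi = np.linalg.solve(np.eye(nS) - gamma * P, c)  # V^pi solves V = M_pi V
assert np.max(np.abs(V - V_pi)) < 1e-8    # the iterates reach the fixed point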
3.3 Continuous correspondence
The correspondence Γ formally defined below describes the feasible set of actions that ensures
the max operator M takes the function space Cb(S) into itself. A correspondence is said to
be compact-valued if the set Γ(s) is compact for every s ∈ S.
Definition 3.3 (Correspondence) A correspondence Γ : S → X is a relation that assigns
a feasible action set Γ(s) ⊂ X for each s ∈ S.
Definition 3.4 (Lower hemi-continuity of correspondence) A correspondence Γ : S → X is lower hemi-continuous (l.h.c.) at s if Γ(s) is nonempty and if, for every x ∈ Γ(s) and
every sequence sn → s, there exists N ≥ 1 and a sequence {xn}∞n=N such that xn → x and
xn ∈ Γ(sn) for all n ≥ N .
Definition 3.5 (Upper hemi-continuity of correspondence) A compact-valued corre-
spondence Γ : S → X is upper hemi-continuous (u.h.c.) at s if Γ(s) is nonempty and if, for
every sequence sn → s and every sequence {xn} such that xn ∈ Γ(sn) for all n, there exists
a convergent subsequence of {xn} with limit point x ∈ Γ(s).
Definition 3.6 (Continuity of correspondence) A correspondence Γ : S → X is con-
tinuous at s ∈ S if it is both u.h.c. and l.h.c. at s.
3.4 Markov chains with continuous state space
To work with Markov chains with a general state space, we present the following definitions of
irreducibility, invariant measure, recurrence and positivity that all have familiar counterparts
in discrete chains. These properties are related to the stability of a Markov chain, which
is of great importance in proving the convergence of value function estimates. In addition,
the continuity property of the transition kernel is helpful in describing the behavior of chains on state spaces with desirable topological structure (Meyn & Tweedie (1993)). Hence, we
introduce the concepts of Feller chains, petite sets and T -chains, which will be used later to
classify positive Harris chains.
Definition 3.7 (ψ-Irreducibility for general space chains) We call a Markov chain Φ on state space S ϕ-irreducible if there exists a measure ϕ on B(S) such that whenever ϕ(A) > 0 for A ∈ B(S), we have
$$P_s\{\Phi \text{ ever enters } A\} > 0, \qquad \forall s \in \mathcal{S},$$
where $P_s$ denotes the conditional probability given that the chain starts in state s. Let ψ be the maximal irreducibility measure among such measures.
Definition 3.8 (Harris recurrence) The set A ∈ B(S) is called Harris recurrent if
$$P_s\{\Phi \in A \text{ infinitely often}\} = 1, \qquad \forall s \in \mathcal{S}.$$
A chain Φ is called Harris (recurrent) if it is ψ-irreducible and every set in
B+(S) = {A ∈ B(S) : ψ(A) > 0}
is Harris recurrent.
Definition 3.9 (Invariant measure) Let P (·, ·) be the transition kernel of a chain Φ on
the state space S. A σ-finite measure µ on B(S) with the property
$$\mu(A) = \int_{\mathcal{S}} \mu(ds)\, P(s, A), \qquad \forall A \in B(\mathcal{S}),$$
will be called invariant.
Definition 3.10 (Positive chains) Suppose a chain Φ is ψ-irreducible and admits an in-
variant probability measure µ. Then Φ is called a positive chain.
Definition 3.11 (Weak Feller chains) If a chain Φ has a transition kernel P such that
P (·, O) is a lower semi-continuous function for any open set O ∈ B(S), then Φ is called a
(weak) Feller chain.
It is worth noting that the weak Feller property is often defined by assuming that the tran-
sition kernel P maps the set of all continuous functions C(S) into itself.
Definition 3.12 (Petite set) A set C ∈ B(S) is called petite if, for some measure ν on B(S) and some δ > 0,
$$K(s, A) \ge \delta\, \nu(A), \qquad s \in C,\; A \in B(\mathcal{S}),$$
where K is the resolvent kernel defined by
$$K(s, A) = \sum_{n=0}^{\infty} \left(\tfrac{1}{2}\right)^{n+1} P^n(s, A).$$
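For a chain on a finite state space the resolvent kernel can be computed in closed form, since the geometric series gives $K = \frac{1}{2}(I - \frac{1}{2}P)^{-1}$. A small sketch (with a randomly generated transition matrix, as a hypothetical illustration) verifying this against the truncated series:

import numpy as np

nS = 6
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=nS)         # finite analogue of the kernel P(s, A)

K = 0.5 * np.linalg.inv(np.eye(nS) - 0.5 * P)   # closed form of the resolvent

K_series = sum(0.5 ** (n + 1) * np.linalg.matrix_power(P, n) for n in range(60))
assert np.max(np.abs(K - K_series)) < 1e-12     # agrees up to truncation error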
Definition 3.13 (T-chains) If every compact set in B(S) is petite, then Φ is called a T-chain. (For another, more detailed definition, see Meyn & Tweedie (1993), chapter 6.)
3.5 Post-decision state variable
Computing the expectation within the max operator M is often intractable when the un-
derlying distribution of the evolution of the stochastic system is unknown or the decision
x is a vector. However, we can circumvent the difficulty by introducing the notion of the
post-decision state variable (Van Roy et al. (1997), Judd (1998), Powell (2007)). To illustrate
the idea, suppose we can break the original transition function (3) into the two steps
$$S^x_t = S^{M,x}(S_t, x_t), \qquad (7)$$
$$S_{t+1} = S^{M,W}(S^x_t, W_{t+1}). \qquad (8)$$
Then, let $V^x(S^x_t)$ be the value of being in state $S^x_t$ immediately after we make a decision. There is a simple relationship between the pre-decision value function $V(S_t)$ and the post-decision value function $V^x(S^x_t)$, summarized as
$$V(S_t) = \max_{x_t \in \mathcal{X}}\left\{C(S_t, x_t) + \gamma\, V^x(S^x_t)\right\}, \qquad (9)$$
$$V^x(S^x_t) = \mathbb{E}\{V(S_{t+1})\,|\,S^x_t\}. \qquad (10)$$
By substituting (9) into (10), we obtain Bellman's equation for the post-decision value function
$$V^x(S^x_t) = \mathbb{E}\Big\{\max_{x_{t+1} \in \mathcal{X}}\left\{C(S_{t+1}, x_{t+1}) + \gamma\, V^x(S^x_{t+1})\right\}\,\Big|\,S^x_t\Big\}. \qquad (11)$$
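A concrete instance of the split (7)-(8) is a scalar inventory problem (a hypothetical illustration, not an example from the paper): the post-decision state is the stock level immediately after ordering, before the random demand arrives, which is why the maximization in (9) involves no expectation.

import numpy as np

def SM_x(s, x):               # step (7): deterministic effect of the decision
    return s + x              # post-decision state S^x_t = stock after ordering x

def SM_W(sx, w):              # step (8): effect of the exogenous information
    return max(sx - w, 0.0)   # next pre-decision state after demand w

rng = np.random.default_rng(3)
s = 2.0
sx = SM_x(s, 3.0)                     # S^x_t = 5.0, known at decision time
s_next = SM_W(sx, rng.poisson(4.0))   # S_{t+1}, known only after W_{t+1} arrives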
In our algorithm, we work with the post-decision value functions resulting from following a constant policy.
Step 0: Initialization:
    Step 0a Set the initial values of the value function parameters $\theta_0$.
    Step 0b Set the initial policy
            $\pi_0(s) = \arg\max_{x \in \Gamma(s)} \{C(s, x) + \gamma\, \phi(s^x)'\theta_0\}.$
    Step 0c Set the iteration counter n = 1.
    Step 0d Set the initial state $S^0_0$.
Step 1: Do for n = 1, ..., N:
Step 2:   Do for m = 1, ..., M:
Step 3:     Initialize $v_m = 0$.
Step 4:     Choose a one-step sample realization ω.
Step 5:     Do the following:
    Step 5a Set $x^n_m = \pi_{n-1}(S^n_m)$.
    Step 5b Compute $S^{n,x}_m = S^{M,x}(S^n_m, x^n_m)$, $S^n_{m+1} = S^{M,W}(S^{n,x}_m, W_{m+1}(\omega))$ and $S^{n,x}_{m+1} = S^{M,x}(S^n_{m+1}, x^n_{m+1})$.
    Step 5c Compute the corresponding basis function values in the general form $\phi(s^{n,x}_m) - \gamma\,\phi(s^{n,x}_{m+1})$.
Step 6:     Do the following:
    Step 6a Compute $v_m = C(S^{n,x}_m, S^{n,x}_{m+1})$.
    Step 6b Update the parameters $\theta^{n,m}$ with the LS/RLS method.
Step 7:   Update the parameters and the policy:
            $\theta_n = \theta^{n,M}$,
            $\pi_n(s) = \arg\max_{x \in \Gamma(s)} \{C(s, x) + \gamma\, \phi(s^x)'\theta_n\}.$
Step 8: Return the policy $\pi_N$ and the parameters $\theta_N$.

Figure 2: Infinite-horizon approximate policy iteration algorithm with recursive least squares method
By our assumptions of linear pre-decision policy value functions and a continuous state transition function, relation (10) implies that the post-decision value functions are continuous and of the linear form $V^x(s^x|\theta) = \phi(s^x)^T\theta$, where $\phi(s^x)$ is the vector of basis functions and θ is the vector of parameters. It is further assumed that a spanning set of basis functions is known, so it is enough to estimate only the linear parameters.
3.6 Algorithm details
The recursive least squares approximate policy iteration algorithm (RLSAPI) is summa-
rized in Figure 2. It is worth making a remark on the arg max function at the end of the
algorithm. This step is usually a multi-variate global optimization problem that updates
the policy (exactly or approximately) from the post-decision value function of the previous
inner iteration. The updated policy function returns a decision for any fixed input of the state variable. We assume that there is a tie-breaking rule that determines a unique solution to the arg max function for all f ∈ Cb(S), e.g., a nonlinear proximal point algorithm (Luque (1987)). As a result, the returned policies are well-defined single-valued functions.
It is worth noting that determining the unique solution to the arg max function may not
be an easy job in practice. However, the computational difficulty is significantly reduced if
we have special problem structures such as strict concavity and differentiability of the value
functions.
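To make the loop structure of figure 2 concrete, here is a heavily simplified Python sketch. It assumes a finite grid of candidate actions for the arg max (taking the first maximizer as one possible tie-breaking rule), replaces the recursive (RLS) update of step 6b with a single batch least squares solve per inner loop, assumes M is large enough that the resulting matrix is invertible, and takes user-supplied functions C, SM_x, SM_W, phi and sample_W matching the notation above; all of these names are placeholders rather than anything specified in the paper.

import numpy as np

def rlsapi(C, SM_x, SM_W, phi, sample_W, actions, s0, gamma, N, M):
    F = len(phi(SM_x(s0, actions[0])))
    theta = np.zeros(F)                                    # step 0a

    def policy(s, th):                                     # steps 0b and 7
        vals = [C(s, x) + gamma * phi(SM_x(s, x)) @ th for x in actions]
        return actions[int(np.argmax(vals))]               # first maximizer breaks ties

    for n in range(N):                                     # step 1: outer loop
        A, b = np.zeros((F, F)), np.zeros(F)
        s = s0
        for m in range(M):                                 # step 2: inner loop
            x = policy(s, theta)                           # step 5a
            sx = SM_x(s, x)                                # step 5b
            s1 = SM_W(sx, sample_W())                      # step 4: sample omega
            x1 = policy(s1, theta)
            sx1 = SM_x(s1, x1)
            rho = phi(sx)                                  # instrumental variable (section 4)
            A += np.outer(rho, phi(sx) - gamma * phi(sx1)) # step 5c
            b += rho * C(s1, x1)                           # step 6a: v_m = C^pi(S^x_m, S^x_{m+1})
            s = s1
        theta = np.linalg.solve(A, b)                      # steps 6b and 7 (batch LS)
    return theta, lambda s: policy(s, theta)               # step 8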
4 Convergence results for a fixed policy
The following theorems and lemmas are extensions of the proof for the convergence of the
LSTD algorithm by Bradtke & Barto (1996) applied to a Markov chain with continuous
state space. We would apply the parameter convergence results to prove the convergence of
the inner loop in our policy iteration algorithm. Bradtke & Barto (1996) argues that LSTD
algorithm is superior to the TD algorithm in terms of convergence properties for the following
three reasons: (1) Choosing step-size parameters is not necessary in LSTD (It is well-known
that a poor choice of step-sizes can significantly slow down convergence). (2) LSTD uses
samples more efficiently, so it produces faster convergence rate. (3) TD is not robust to the
initial value of the parameter estimates and choice of basis functions, but LSTD is.
For a fixed policy π, the transition steps (7) and (8) become
$$S^\pi_t = S^{M,\pi}(S_t, \pi(S_t)),$$
$$S_{t+1} = S^{M,W}(S^\pi_t, W_{t+1}).$$
As a result, the Markov decision problem can be reduced to a Markov chain for post-decision
states if the exogenous information $W_{t+1}$ depends only on the post-decision state $S^\pi_t$ in the previous time period. Bellman's equation (11) for the post-decision state becomes
$$V^\pi(s) = \int_{\mathcal{S}^\pi} P(s, ds')\left(C^\pi(s, s') + \gamma\, V^\pi(s')\right), \qquad (12)$$
where $P(\cdot,\cdot)$ is the transition probability function of the chain, $C^\pi(\cdot,\cdot)$ is the stochastic contribution/reward function with $C^\pi(S^\pi_t, S^\pi_{t+1}) = C(S_{t+1}, \pi(S_{t+1}))$, and $\mathcal{S}^\pi$ is the post-decision state space obtained by following policy π. It is worth noting that $\mathcal{S}^\pi$ is compact since $\mathcal{S}$ and $\mathcal{X}$ are compact and the transition function is continuous. In addition, $s$ in (12) is the post-decision state variable; we drop the superscript $x$ for simplicity. Suppose the true value function for following the fixed policy π is $V^\pi(s) = \phi(s)^T\theta^*$, where $\phi(s) = [\cdots, \phi_f(s), \cdots]$ is the vector of basis functions of dimension $F = |\mathcal{F}|$ (the number of basis functions) and $f \in \mathcal{F}$ ($\mathcal{F}$ denotes the set of features). Bellman's equation (12) gives us
$$\phi(s)^T\theta^* = \int_{\mathcal{S}^\pi} P(s, ds')\left[C^\pi(s, s') + \gamma\, \phi(s')^T\theta^*\right].$$
We can rewrite the recursion as
$$C^\pi(s, s') = \left(\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T \theta^* + C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s').$$
Remark: It is worth noting that $\phi(s)$ and $\phi(s')$ are vectors. The integral of $\phi(s')$ is taken componentwise, so it returns a vector. Similarly, the integral of a matrix is taken componentwise.
Here, we have a general linear model with both observation errors and input noise, which is known as the errors-in-variables model (Young (1984)). $C^\pi(s, s')$ is the observation, and $C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$ is the observation error. $\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')$ can be viewed as the input variable. Since the transition probability function may be unknown or not computable, at iteration m, instead of having the exact input variable, we can only observe an unbiased sample estimate $\phi_m - \gamma\,\phi_{m+1}$, where $\phi_m$ is shorthand for $\phi(s_m)$. Therefore, the regular linear regression formula for estimating $\theta^*$ does not apply, since the errors in the input variables introduce bias. To circumvent this difficulty, we need
to introduce an instrumental variable ρ, which is correlated with the true input variable
but uncorrelated with the input error term and the observation error term. It is further
required that the correlation matrix between the instrumental variable and input variable is
nonsingular and finite. Then, the m-th estimate of θ∗ becomes
$$\theta_m = \left[\frac{1}{m}\sum_{i=1}^{m} \rho_i\,(\phi_i - \gamma\,\phi_{i+1})'\right]^{-1} \left[\frac{1}{m}\sum_{i=1}^{m} \rho_i\, C_i\right],$$
where $C_i = C^\pi(s_i, s_{i+1})$ is the i-th observation of the contribution.
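As a sketch, this estimator is a few lines of Python (the names phis and cs are hypothetical stand-ins for the recorded basis values $\phi(s_i)$ and contribution observations $C_i$ along one sample trajectory):

import numpy as np

def lstd_iv(phis, cs, gamma):
    """IV estimate of theta*: phis[i] = phi(s_i), cs[i] = C^pi(s_i, s_{i+1})."""
    m, F = len(cs), phis.shape[1]
    A, b = np.zeros((F, F)), np.zeros(F)
    for i in range(m):
        rho = phis[i]                 # instrumental variable rho_i = phi(s_i)
        A += np.outer(rho, phis[i] - gamma * phis[i + 1]) / m
        b += rho * cs[i] / m
    return np.linalg.solve(A, b)      # theta_m from the display above

Note that ordinary least squares on the same data would instead use the noisy regressor $\phi_i - \gamma\,\phi_{i+1}$ as its own instrument, which is exactly the biased procedure the instrumental variable is introduced to avoid.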
Convergence of the parameter estimates requires stability of the chain, which can be interpreted as saying that the chain eventually settles down to a stable regime independent of its starting point (Meyn & Tweedie (1993)). Positive Harris chains as defined in section 3 meet this stability requirement precisely, and the invariant measure µ describes the stationary distribution of the chain. Young (1984) proves convergence of the parameter estimates calculated from samples drawn from a stationary distribution in the following lemma.
Lemma 4.1 (Young, 1984) Let y be the input variable, ξ be the input noise, ε be the observation noise, and ρ be the instrumental variable in the errors-in-variables model. If the correlation matrix Cor(ρ, y) is non-singular and finite, and Cor(ρ, ξ) = 0 and Cor(ρ, ε) = 0, then $\theta_m \to \theta^*$ in probability.
Proof:
See Young (1984).
The task is to find a candidate for ρ in our algorithm. Following Bradtke & Barto
(1996), we also choose φ(s) as the instrumental variable for the continuous case. Under the
assumption that the Markov chain follows a stationary process with an invariant measure µ,
the following lemma shows that φ(s) satisfies the requirement that it is uncorrelated with
both input and observation error terms.
Lemma 4.2 Assume the Markov chain Φ has an invariant probability measure µ and let $\rho = \phi(s)$, $\varepsilon = C^\pi(s, s') - \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$ and $\xi = \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma\,\phi(s')$. We have (1) $E(\varepsilon) = 0$, (2) $\mathrm{Cor}(\rho, \varepsilon) = 0$, and (3) $\mathrm{Cor}(\rho, \xi) = 0$.
Proof:
Let $C_s = \int_{\mathcal{S}^\pi} P(s, ds')\, C^\pi(s, s')$. (1) is immediate by definition. By (1),
$$\begin{aligned}
\mathrm{Cor}(\rho, \varepsilon) &= E(\phi(s)\varepsilon) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\left(C^\pi(s, s') - C_s\right) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\left(C^\pi(s, s') - C_s\right) \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(C_s - C_s\right) \\
&= 0.
\end{aligned}$$
Hence, (2) holds. Similarly,
$$\begin{aligned}
\mathrm{Cor}(\rho, \xi) &= E(\phi(s)\xi^T) \\
&= \int_{\mathcal{S}^\pi} \mu(ds) \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s)\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds'')\,\phi(s'') - \gamma\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s) \int_{\mathcal{S}^\pi} P(s, ds')\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds'')\,\phi(s'') - \gamma\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(\gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s') - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T \\
&= \int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\, 0^T \\
&= 0.
\end{aligned}$$
Therefore, (3) holds.
For problems with a finite state space, Bradtke & Barto (1996) assumes that the number of linearly independent basis functions is the same as the size of the state space, so that the correlation matrix between the input and instrumental variables is invertible. However, this assumption defeats the purpose of using continuous function approximation to overcome the curse of dimensionality, since the complexity of computing the parameter estimates is then the same as that of estimating a look-up table value function directly. By imposing stronger assumptions on the basis functions and the discount factor γ, the following lemma shows that the invertibility of the correlation matrix is guaranteed in the continuous case.
Lemma 4.3 (Non-singularity of the correlation matrix) Suppose we have orthonormal basis functions with respect to the invariant measure µ of the Markov chain and $\gamma < \frac{1}{F}$, where F is the number of basis functions. Then the correlation matrix
$$\int_{\mathcal{S}^\pi} \mu(ds)\,\phi(s)\left(\phi(s) - \gamma \int_{\mathcal{S}^\pi} P(s, ds')\,\phi(s')\right)^T$$
is invertible.
Proof:
For shorthand notation, we write $\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s) = \mu\phi_i$ and $\int_{\mathcal{S}^\pi} P(s, ds')\,\phi_i(s') = P\phi_i(s)$. It is worth noting that $\mu\phi_i$ is a constant and $P\phi_i(s)$ is a function of s. Then, we can write the …
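In outline, the lemma can be verified with a diagonal dominance argument; the following sketch is a reconstruction under the lemma's stated assumptions and need not match the authors' own derivation. By orthonormality, $\int_{\mathcal{S}^\pi}\mu(ds)\,\phi_i(s)\phi_j(s) = \delta_{ij}$, and by the Cauchy-Schwarz and Jensen inequalities together with the invariance of µ,
$$\left|\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s)\, P\phi_j(s)\right| \le \left(\int_{\mathcal{S}^\pi} \mu(ds)\,\phi_i(s)^2\right)^{1/2} \left(\int_{\mathcal{S}^\pi} \mu(ds)\,\big(P\phi_j(s)\big)^2\right)^{1/2} \le 1.$$
Hence the correlation matrix has the form $I - \gamma B$ with $|B_{ij}| \le 1$: each diagonal entry has magnitude at least $1 - \gamma$, while the off-diagonal entries of each row sum to at most $(F-1)\gamma$ in absolute value. The matrix is therefore strictly diagonally dominant, hence invertible, whenever $1 - \gamma > (F-1)\gamma$, i.e. $\gamma < 1/F$.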
$V^{\pi_n} = V^{\pi_{n+1}}$. So $\{V^{\pi_n}\}$ is a non-decreasing sequence.
Now we show that $M^n V^{\pi_0} \le V^{\pi_n}$ for all $n \in \mathbb{N}$ by induction.

Base step: n = 1. Since $V^{\pi_0} \le V^{\pi_1}$, by monotonicity of $M_{\pi_1}$,
$$M V^{\pi_0} = M_{\pi_1} V^{\pi_0} \le M_{\pi_1} V^{\pi_1} = V^{\pi_1}.$$

Induction step: Suppose $M^n V^{\pi_0} \le V^{\pi_n}$. Again by monotonicity of M and $M_{\pi_n}$,
$$M^{n+1} V^{\pi_0} \le M V^{\pi_n} = M_{\pi_{n+1}} V^{\pi_n} \le M_{\pi_{n+1}} V^{\pi_{n+1}} = V^{\pi_{n+1}}.$$

By the contraction mapping theorem (Royden (1968)), $M^n V^{\pi_0} \nearrow V^*$ uniformly as $n \to \infty$. We also note that $V^* = \sup_\pi V^\pi$. Therefore, $V^{\pi_n} \nearrow V^*$ uniformly.
Lemma 5.1 Let S, X, Γ be as defined in section 3 and assume Γ is nonempty, compact- and convex-valued, and continuous at each s ∈ S. Let $f_n : \mathcal{S} \times \mathcal{X} \to \mathbb{R}$, $n \in \mathbb{N}$, be a sequence of continuous functions. Assume f has the same properties and $f_n \to f$ uniformly. Define $\pi_n$ and π by
$$\pi_n(s) = \arg\max_{x \in \Gamma(s)} f_n(s, x)$$
and
$$\pi(s) = \arg\max_{x \in \Gamma(s)} f(s, x).$$
Then $\pi_n \to \pi$ pointwise. If S is compact, $\pi_n \to \pi$ uniformly.
Proof:
See lemma 3.7 and theorem 3.8 on pages 64-65 of Stokey et al. (1989).
Lemma 5.2 Let S, X, W and Q be defined as in section 3. If $g : \mathcal{S} \times \mathcal{X} \times \mathcal{W} \to \mathbb{R}$ is bounded and continuous, then $Tg(s, x) = \int_{\mathcal{W}} Q(s, x, dw)\, g(s, x, w)$ is also bounded and continuous.
Proof:
See proof of lemma 9.5 on page 261 of Stokey et al. (1989).
Theorem 5.1 (Convergence of the exact policy iteration algorithm) Assume that for all π ∈ Π the infinite-horizon problem can be reduced to a positive Harris chain and that the corresponding policy value function is continuous and linear in the parameters. Further assume that the contribution function is bounded and continuous. Let S, X, W and Q be defined as in section 3. For M, N defined in figure 2, as M → ∞ and then N → ∞, the exact policy iteration algorithm converges; that is, $\pi_n \to \pi^*$ uniformly.

Proof:
By assumption, in the inner loop the system evolves according to a stationary Markov process given a constant policy. By theorem 4.1 or corollary 4.1, as M → ∞, $\theta^{n,M} \to \theta^*_n$ almost surely. In other words, as M → ∞, by following the fixed policy $\pi_n$ we can obtain the exact post-decision value function in the linearly parameterized form $\phi(s^x)'\theta^*_n$. Then define …