APPROXIMATE DYNAMIC PROGRAMMING

A SERIES OF LECTURES GIVEN AT CEA - CADARACHE, FRANCE

SUMMER 2012

DIMITRI P. BERTSEKAS

These lecture slides are based on the book: Dynamic Programming and Optimal Control: Approximate Dynamic Programming, Athena Scientific, 2012; see

http://www.athenasc.com/dpbook.html

For a fuller set of slides, see

http://web.mit.edu/dimitrib/www/publ.html
APPROXIMATE DYNAMIC PROGRAMMING - BRIEF OUTLINE I

- Our subject:
  - Large-scale DP based on approximations and in part on simulation.
  - This has been a research area of great interest for the last 20 years, known under various names (e.g., reinforcement learning, neuro-dynamic programming).
  - Emerged through an enormously fruitful cross-fertilization of ideas from artificial intelligence and optimization/control theory.
  - Deals with control of dynamic systems under uncertainty, but applies more broadly (e.g., discrete deterministic optimization).
  - A vast range of applications in control theory, operations research, artificial intelligence, and beyond.
  - The subject is broad, with a rich variety of theory/math, algorithms, and applications. Our focus will be mostly on algorithms, less on theory and modeling.

APPROXIMATE DYNAMIC PROGRAMMING - BRIEF OUTLINE II

- Our aim:
  - A state-of-the-art account of some of the major topics at a graduate level.
  - Show how the use of approximation and simulation can address the dual curses of DP: dimensionality and modeling.
- Our 7-lecture plan:
  - Two lectures on exact DP, with emphasis on infinite horizon problems and issues of large-scale computational methods.
  - One lecture on general issues of approximation and simulation for large-scale problems.
  - One lecture on approximate policy iteration based on temporal differences (TD)/projected equations/Galerkin approximation.
  - One lecture on aggregation methods.
  - One lecture on stochastic approximation, Q-learning, and other methods.
  - One lecture on Monte Carlo methods for solving general problems involving linear equations and inequalities.

APPROXIMATE DYNAMIC PROGRAMMING - LECTURE 1

LECTURE OUTLINE

- Introduction to DP and approximate DP
- Finite horizon problems
- The DP algorithm for finite horizon problems
- Infinite horizon problems
- Basic theory of discounted infinite horizon problems

BASIC STRUCTURE OF STOCHASTIC DP

- Discrete-time system
  $x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \dots, N-1$
  - $k$: discrete time
  - $x_k$: state; summarizes past information that is relevant for future optimization
  - $u_k$: control; decision to be selected at time $k$ from a given set
  - $w_k$: random parameter (also called disturbance or noise, depending on the context)
  - $N$: horizon, or number of times control is applied
- Cost function that is additive over time
  $E\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right\}$
- Alternative system description: $P(x_{k+1} \mid x_k, u_k)$
  $x_{k+1} = w_k$ with $P(w_k \mid x_k, u_k) = P(x_{k+1} \mid x_k, u_k)$

INVENTORY CONTROL EXAMPLE

[Figure: inventory system. Stock at period k ($x_k$) plus stock ordered at period k ($u_k$), minus demand at period k ($w_k$), gives stock at period k+1: $x_{k+1} = x_k + u_k - w_k$. Cost of period k: $c u_k + r(x_k + u_k - w_k)$.]

- Discrete-time system
  $x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k$
- Cost function that is additive over time (a small simulation sketch follows this slide)
  $E\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right\} = E\left\{ \sum_{k=0}^{N-1} \big( c u_k + r(x_k + u_k - w_k) \big) \right\}$
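To make the system equation and additive cost concrete, here is a minimal Python sketch (not from the slides) that simulates the inventory dynamics $x_{k+1} = x_k + u_k - w_k$ under a hypothetical ordering rule and accumulates the per-period cost $c u_k + r(x_k + u_k - w_k)$. The horizon, cost coefficients, order rule, and demand distribution below are illustrative assumptions.

```python
import random

def simulate_inventory(x0, order, c, r, N, demand):
    """Simulate x_{k+1} = x_k + u_k - w_k and accumulate c*u_k + r(x_k + u_k - w_k)."""
    x, total_cost = x0, 0.0
    for k in range(N):
        u = order(x, k)                       # control u_k chosen by the policy
        w = demand()                          # random demand w_k
        total_cost += c * u + r(x + u - w)    # stage cost of period k
        x = x + u - w                         # next state x_{k+1}
    return total_cost

# Hypothetical data: order up to 5 units, holding/shortage cost |stock|, uniform demand.
cost = simulate_inventory(
    x0=2,
    order=lambda x, k: max(0, 5 - x),
    c=1.0,
    r=lambda s: abs(s),
    N=10,
    demand=lambda: random.randint(0, 4),
)
print(cost)
```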

ADDITIONAL ASSUMPTIONS

- Optimization over policies: these are rules/functions
  $u_k = \mu_k(x_k), \quad k = 0, \dots, N-1$
  that map states to controls (closed-loop optimization, use of feedback).
- The set of values that the control $u_k$ can take depends at most on $x_k$ and not on prior $x$ or $u$.
- The probability distribution of $w_k$ does not depend on past values $w_{k-1}, \dots, w_0$, but may depend on $x_k$ and $u_k$.
  - Otherwise past values of $w$ or $x$ would be useful for future optimization.

GENERIC FINITE-HORIZON PROBLEM

- System $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, \dots, N-1$
- Control constraints $u_k \in U_k(x_k)$
- Probability distribution $P_k(\cdot \mid x_k, u_k)$ of $w_k$
- Policies $\pi = \{\mu_0, \dots, \mu_{N-1}\}$, where $\mu_k$ maps states $x_k$ into controls $u_k = \mu_k(x_k)$ and is such that $\mu_k(x_k) \in U_k(x_k)$ for all $x_k$
- Expected cost of $\pi$ starting at $x_0$ is
  $J_\pi(x_0) = E\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \right\}$
- Optimal cost function
  $J^*(x_0) = \min_\pi J_\pi(x_0)$
- Optimal policy $\pi^*$ satisfies
  $J_{\pi^*}(x_0) = J^*(x_0)$
  When produced by DP, $\pi^*$ is independent of $x_0$.

PRINCIPLE OF OPTIMALITY

- Let $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ be an optimal policy.
- Consider the tail subproblem whereby we are at $x_k$ at time $k$ and wish to minimize the cost-to-go from time $k$ to time $N$,
  $E\left\{ g_N(x_N) + \sum_{\ell=k}^{N-1} g_\ell\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\},$
  and the tail policy $\{\mu_k^*, \mu_{k+1}^*, \dots, \mu_{N-1}^*\}$.
- Principle of optimality: the tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past).
- DP solves ALL the tail subproblems.
- At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length.

DP ALGORITHM

- $J_k(x_k)$: optimal cost of the tail problem starting at $x_k$.
- Start with
  $J_N(x_N) = g_N(x_N),$
  and go backwards using
  $J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\left\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\}, \quad k = 0, 1, \dots, N-1$
  i.e., to solve the tail subproblem at time $k$, minimize
  [$k$th-stage cost] + [optimal cost of the next tail problem starting from the next state at time $k+1$]
  (a small code sketch of this recursion follows this slide).
- Then $J_0(x_0)$, generated at the last step, is equal to the optimal cost $J^*(x_0)$. Also, the policy
  $\pi^* = \{\mu_0^*, \dots, \mu_{N-1}^*\},$
  where $\mu_k^*(x_k)$ minimizes the right side above for each $x_k$ and $k$, is optimal.
- Proof by induction.
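A minimal sketch (not part of the slides) of the backward recursion above for a finite problem. The states, controls, transition kernel, and stage costs are supplied as plain Python callables; all names and the data format (a `P(k, x, u)` function returning a list of `(next_state, probability)` pairs) are assumptions for illustration.

```python
def finite_horizon_dp(states, controls, P, g, gN, N):
    """Backward DP: J_N = g_N, then J_k(x) = min_u E{ g_k(x,u,w) + J_{k+1}(next state) }.
    P(k, x, u) returns [(next_state, probability), ...]; g(k, x, u, y) is the stage cost."""
    J = {x: gN(x) for x in states}              # terminal cost J_N
    policy = [dict() for _ in range(N)]
    for k in reversed(range(N)):
        Jk = {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # expected stage cost plus optimal cost-to-go of the tail problem at k+1
                val = sum(p * (g(k, x, u, y) + J[y]) for y, p in P(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            Jk[x] = best_val
            policy[k][x] = best_u
        J = Jk
    return J, policy     # J is J_0; policy[k][x] plays the role of mu_k*(x)
```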

PRACTICAL DIFFICULTIES OF DP

- The curse of dimensionality:
  - Exponential growth of the computational and storage requirements as the number of state variables and control variables increases.
  - Quick explosion of the number of states in combinatorial problems.
  - Intractability of imperfect state information problems.
- The curse of modeling:
  - Sometimes a simulator of the system is easier to construct than a model.
- There may be real-time solution constraints:
  - A family of problems may be addressed. The data of the problem to be solved is given with little advance notice.
  - The problem data may change as the system is controlled - need for on-line replanning.
- All of the above are motivations for approximation and simulation.

COST-TO-GO FUNCTION APPROXIMATION

- Use a policy computed from the DP equation where the optimal cost-to-go function $J_{k+1}$ is replaced by an approximation $\tilde J_{k+1}$.
- Apply $\bar\mu_k(x_k)$, which attains the minimum in
  $\min_{u_k \in U_k(x_k)} E\left\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\}$
- Some approaches:
  (a) Problem approximation: use as $\tilde J_k$ the cost derived from a related but simpler problem.
  (b) Parametric cost-to-go approximation: use as $\tilde J_k$ a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (we will mostly focus on this).
      - This is a major portion of reinforcement learning/neuro-dynamic programming.
  (c) Rollout approach: use as $\tilde J_k$ the cost of some suboptimal policy, which is calculated either analytically or by simulation.

ROLLOUT ALGORITHMS

- At each $k$ and state $x_k$, use the control $\bar\mu_k(x_k)$ that minimizes in
  $\min_{u_k \in U_k(x_k)} E\left\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right\},$
  where $\tilde J_{k+1}$ is the cost-to-go of some heuristic policy (called the base policy). A small code sketch follows this slide.
- Cost improvement property: the rollout algorithm achieves no worse (and usually much better) cost than the base policy starting from the same state.
- Main difficulty: calculating $\tilde J_{k+1}(x)$ may be computationally intensive if the cost-to-go of the base policy cannot be calculated analytically.
  - May involve Monte Carlo simulation if the problem is stochastic.
  - Things improve in the deterministic case.
- Connection with Model Predictive Control (MPC).
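A minimal sketch of a stochastic rollout step, assuming the cost-to-go of the base policy is estimated by Monte Carlo. The helpers `step(x, u)` (samples the next state and disturbance), `stage_cost`, `controls`, and `base_policy` are hypothetical problem-specific callables, not something defined in the slides.

```python
def rollout_control(x, k, controls, step, stage_cost, base_policy, N, n_sims=20):
    """One-step lookahead with the base policy's cost-to-go estimated by simulation."""
    def base_cost_to_go(y, t0):
        # average cost of following the base policy from (y, t0) to the horizon N
        total = 0.0
        for _ in range(n_sims):
            xs, c = y, 0.0
            for t in range(t0, N):
                u = base_policy(xs, t)
                nxt, w = step(xs, u)
                c += stage_cost(t, xs, u, w)
                xs = nxt
            total += c
        return total / n_sims

    best_u, best_val = None, float("inf")
    for u in controls(x):
        # sample the first transition, then let the base policy finish the trajectory
        val = 0.0
        for _ in range(n_sims):
            y, w = step(x, u)
            val += stage_cost(k, x, u, w) + base_cost_to_go(y, k + 1)
        val /= n_sims
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```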

INFINITE HORIZON PROBLEMS

- Same as the basic problem, but:
  - The number of stages is infinite.
  - The system is stationary.
- Total cost problems: minimize
  $J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\, k=0,1,\dots}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\}$
- Discounted problems ($\alpha < 1$)

DISCOUNTED PROBLEMS/BOUNDED COST

- Stationary system
  $x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \dots$
- Cost of a policy $\pi = \{\mu_0, \mu_1, \dots\}$
  $J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\, k=0,1,\dots}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\}$
  with $\alpha < 1$, and $g$ is bounded [for some $M$, we have $|g(x,u,w)| \le M$ for all $(x,u,w)$].
- Boundedness of $g$ guarantees that all costs are well-defined and bounded: $|J_\pi(x)| \le \frac{M}{1-\alpha}$
- All spaces are arbitrary - only boundedness of $g$ is important (there are mathematical fine points, e.g. measurability, but they don't matter in practice).
- Important special case: all underlying spaces finite; a (finite spaces) Markovian Decision Problem or MDP.
- All algorithms essentially work with an MDP that approximates the original problem.

SHORTHAND NOTATION FOR DP MAPPINGS

- For any function $J$ of $x$,
  $(TJ)(x) = \min_{u \in U(x)} E_w\left\{ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \right\}, \quad \forall x$
- $TJ$ is the optimal cost function for the one-stage problem with stage cost $g$ and terminal cost function $\alpha J$.
- $T$ operates on bounded functions of $x$ to produce other bounded functions of $x$.
- For any stationary policy $\mu$,
  $(T_\mu J)(x) = E_w\left\{ g\big(x, \mu(x), w\big) + \alpha J\big(f(x, \mu(x), w)\big) \right\}, \quad \forall x$
  (a small finite-MDP realization of $T$ and $T_\mu$ follows this slide).
- The critical structure of the problem is captured in $T$ and $T_\mu$.
- The entire theory of discounted problems can be developed in shorthand using $T$ and $T_\mu$.
- This is true for many other DP problems.
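A minimal sketch of the mappings $T$ and $T_\mu$ for a finite MDP. As an assumption for illustration, the general system $f(x,u,w)$ is specialized to transition matrices `P[u]` and stage-cost matrices `g[u]` (entry `(i, j)` is the cost of moving from state i to j under control u); these names are not from the slides.

```python
import numpy as np

def T(J, P, g, alpha):
    """(TJ)(i) = min_u sum_j P[u][i,j] * (g[u][i,j] + alpha * J[j])."""
    values = [np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(len(P))]
    return np.min(np.stack(values), axis=0)

def T_mu(J, mu, P, g, alpha):
    """(T_mu J)(i) uses the fixed control mu[i] at each state i."""
    n = len(J)
    out = np.empty(n)
    for i in range(n):
        u = mu[i]
        out[i] = np.dot(P[u][i], g[u][i] + alpha * J)
    return out
```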

FINITE-HORIZON COST EXPRESSIONS

- Consider an $N$-stage policy $\pi_0^N = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}$ with a terminal cost $J$:
  $J_{\pi_0^N}(x_0) = E\left\{ \alpha^N J(x_N) + \sum_{\ell=0}^{N-1} \alpha^\ell g\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\}
   = E\left\{ g\big(x_0, \mu_0(x_0), w_0\big) + \alpha J_{\pi_1^N}(x_1) \right\}
   = (T_{\mu_0} J_{\pi_1^N})(x_0)$
  where $\pi_1^N = \{\mu_1, \mu_2, \dots, \mu_{N-1}\}$.
- By induction we have
  $J_{\pi_0^N}(x) = (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J)(x), \quad \forall x$
- For a stationary policy $\mu$, the $N$-stage cost function (with terminal cost $J$) is
  $J_{\pi_0^N} = T_\mu^N J,$
  where $T_\mu^N$ is the $N$-fold composition of $T_\mu$.
- Similarly the optimal $N$-stage cost function (with terminal cost $J$) is $T^N J$.
- $T^N J = T(T^{N-1} J)$ is just the DP algorithm.

"SHORTHAND" THEORY - A SUMMARY

- Infinite horizon cost function expressions [with $J_0(x) \equiv 0$]
  $J_\pi(x) = \lim_{N\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_N} J_0)(x), \quad J_\mu(x) = \lim_{N\to\infty} (T_\mu^N J_0)(x)$
- Bellman's equation: $J^* = T J^*$, $J_\mu = T_\mu J_\mu$
- Optimality condition:
  $\mu$: optimal if and only if $T_\mu J^* = T J^*$
- Value iteration: for any (bounded) $J$,
  $J^*(x) = \lim_{k\to\infty} (T^k J)(x), \quad \forall x$
- Policy iteration: given $\mu^k$,
  - Policy evaluation: find $J_{\mu^k}$ by solving
    $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
  - Policy improvement: find $\mu^{k+1}$ such that
    $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$

TWO KEY PROPERTIES

- Monotonicity property: for any $J$ and $J'$ such that $J(x) \le J'(x)$ for all $x$, and any $\mu$,
  $(TJ)(x) \le (TJ')(x), \quad \forall x,$
  $(T_\mu J)(x) \le (T_\mu J')(x), \quad \forall x.$
- Constant shift property: for any $J$, any scalar $r$, and any $\mu$,
  $\big(T(J + re)\big)(x) = (TJ)(x) + \alpha r, \quad \forall x,$
  $\big(T_\mu(J + re)\big)(x) = (T_\mu J)(x) + \alpha r, \quad \forall x,$
  where $e$ is the unit function [$e(x) \equiv 1$].
- Monotonicity is present in all DP models (undiscounted, etc.)
- Constant shift is special to discounted models.
- Discounted problems have another property of major importance: $T$ and $T_\mu$ are contraction mappings (we will show this later).

CONVERGENCE OF VALUE ITERATION

- If $J_0 \equiv 0$,
  $J^*(x) = \lim_{k\to\infty} (T^k J_0)(x), \quad$ for all $x$
- Proof: For any initial state $x_0$ and policy $\pi = \{\mu_0, \mu_1, \dots\}$,
  $J_\pi(x_0) = E\left\{ \sum_{\ell=0}^{\infty} \alpha^\ell g\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\}
   = E\left\{ \sum_{\ell=0}^{k-1} \alpha^\ell g\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\}
   + E\left\{ \sum_{\ell=k}^{\infty} \alpha^\ell g\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\}$
- The tail portion satisfies
  $\left| E\left\{ \sum_{\ell=k}^{\infty} \alpha^\ell g\big(x_\ell, \mu_\ell(x_\ell), w_\ell\big) \right\} \right| \le \frac{\alpha^k M}{1-\alpha},$
  where $M \ge |g(x,u,w)|$. Take the min over $\pi$ of both sides. Q.E.D.

BELLMAN'S EQUATION

- The optimal cost function $J^*$ satisfies Bellman's equation, i.e. $J^* = T J^*$.
- Proof: For all $x$ and $k$,
  $J^*(x) - \frac{\alpha^k M}{1-\alpha} \le (T^k J_0)(x) \le J^*(x) + \frac{\alpha^k M}{1-\alpha},$
  where $J_0(x) \equiv 0$ and $M \ge |g(x,u,w)|$.
- Applying $T$ to this relation, and using Monotonicity and Constant Shift,
  $(TJ^*)(x) - \frac{\alpha^{k+1} M}{1-\alpha} \le (T^{k+1} J_0)(x) \le (TJ^*)(x) + \frac{\alpha^{k+1} M}{1-\alpha}$
- Taking the limit as $k \to \infty$ and using the fact
  $\lim_{k\to\infty} (T^{k+1} J_0)(x) = J^*(x),$
  we obtain $J^* = T J^*$. Q.E.D.

THE CONTRACTION PROPERTY

- Contraction property: for any bounded functions $J$ and $J'$, and any $\mu$,
  $\max_x \big| (TJ)(x) - (TJ')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|,$
  $\max_x \big| (T_\mu J)(x) - (T_\mu J')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|.$
- Proof: Denote $c = \max_x \big| J(x) - J'(x) \big|$. Then
  $J(x) - c \le J'(x) \le J(x) + c, \quad \forall x$
  Apply $T$ to both sides, and use the Monotonicity and Constant Shift properties:
  $(TJ)(x) - \alpha c \le (TJ')(x) \le (TJ)(x) + \alpha c, \quad \forall x$
  Hence
  $\big| (TJ)(x) - (TJ')(x) \big| \le \alpha c, \quad \forall x.$  Q.E.D.

NEC. AND SUFFICIENT OPT. CONDITION

- A stationary policy $\mu$ is optimal if and only if $\mu(x)$ attains the minimum in Bellman's equation for each $x$; i.e.,
  $T J^* = T_\mu J^*.$
- Proof: If $T J^* = T_\mu J^*$, then using Bellman's equation ($J^* = T J^*$), we have
  $J^* = T_\mu J^*,$
  so by uniqueness of the fixed point of $T_\mu$, we obtain $J^* = J_\mu$; i.e., $\mu$ is optimal.
- Conversely, if the stationary policy $\mu$ is optimal, we have $J^* = J_\mu$, so
  $J^* = T_\mu J^*.$
  Combining this with Bellman's equation ($J^* = T J^*$), we obtain $T J^* = T_\mu J^*$. Q.E.D.

APPROXIMATE DYNAMIC PROGRAMMING - LECTURE 2

LECTURE OUTLINE

- Review of discounted problem theory
- Review of shorthand notation
- Algorithms for discounted DP
  - Value iteration
  - Policy iteration
  - Optimistic policy iteration
  - Q-factors and Q-learning
- A more abstract view of DP
- Extensions of discounted DP
- Value and policy iteration
- Asynchronous algorithms

DISCOUNTED PROBLEMS/BOUNDED COST

- Stationary system with arbitrary state space
  $x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \dots$
- Cost of a policy $\pi = \{\mu_0, \mu_1, \dots\}$
  $J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\, k=0,1,\dots}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\}$
  with $\alpha < 1$ and $g$ bounded

"SHORTHAND" THEORY - A SUMMARY

- Cost function expressions [with $J_0(x) \equiv 0$]
  $J_\pi(x) = \lim_{k\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J_0)(x), \quad J_\mu(x) = \lim_{k\to\infty} (T_\mu^k J_0)(x)$
- Bellman's equation: $J^* = T J^*$, $J_\mu = T_\mu J_\mu$, or
  $J^*(x) = \min_{u \in U(x)} E_w\left\{ g(x,u,w) + \alpha J^*\big(f(x,u,w)\big) \right\}, \quad \forall x$
  $J_\mu(x) = E_w\left\{ g\big(x, \mu(x), w\big) + \alpha J_\mu\big(f(x, \mu(x), w)\big) \right\}, \quad \forall x$
- Optimality condition:
  $\mu$: optimal if and only if $T_\mu J^* = T J^*$, i.e.,
  $\mu(x) \in \arg\min_{u \in U(x)} E_w\left\{ g(x,u,w) + \alpha J^*\big(f(x,u,w)\big) \right\}, \quad \forall x$
- Value iteration: for any (bounded) $J$,
  $J^*(x) = \lim_{k\to\infty} (T^k J)(x), \quad \forall x$

MAJOR PROPERTIES

- Monotonicity property: for any functions $J$ and $J'$ on the state space $X$ such that $J(x) \le J'(x)$ for all $x \in X$, and any $\mu$,
  $(TJ)(x) \le (TJ')(x), \quad (T_\mu J)(x) \le (T_\mu J')(x), \quad \forall x \in X$
- Contraction property: for any bounded functions $J$ and $J'$, and any $\mu$,
  $\max_x \big| (TJ)(x) - (TJ')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|,$
  $\max_x \big| (T_\mu J)(x) - (T_\mu J')(x) \big| \le \alpha \max_x \big| J(x) - J'(x) \big|.$
- Compact contraction notation:
  $\| TJ - TJ' \| \le \alpha \| J - J' \|, \quad \| T_\mu J - T_\mu J' \| \le \alpha \| J - J' \|,$
  where for any bounded function $J$, we denote by $\|J\|$ the sup-norm
  $\|J\| = \max_{x \in X} |J(x)|.$

THE TWO MAIN ALGORITHMS: VI AND PI

- Value iteration: for any (bounded) $J$,
  $J^*(x) = \lim_{k\to\infty} (T^k J)(x), \quad \forall x$
- Policy iteration: given $\mu^k$,
  - Policy evaluation: find $J_{\mu^k}$ by solving
    $J_{\mu^k}(x) = E_w\left\{ g\big(x, \mu^k(x), w\big) + \alpha J_{\mu^k}\big(f(x, \mu^k(x), w)\big) \right\}, \quad \forall x,$
    or $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
  - Policy improvement: let $\mu^{k+1}$ be such that
    $\mu^{k+1}(x) \in \arg\min_{u \in U(x)} E_w\left\{ g(x,u,w) + \alpha J_{\mu^k}\big(f(x,u,w)\big) \right\}, \quad \forall x,$
    or $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
  (a small finite-MDP sketch of both algorithms follows this slide)
- For finite state space, policy evaluation is equivalent to solving a linear system of equations.
  - Dimension of the system is equal to the number of states.
- For large problems, exact PI is out of the question (even though it terminates finitely).
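A minimal sketch of VI and exact PI for a finite MDP, under the same assumed data format as the earlier $T$/$T_\mu$ sketch (transition matrices `P[u]`, stage-cost matrices `g[u]`); it is an illustration, not the slides' general formulation.

```python
import numpy as np

def value_iteration(P, g, alpha, iters=1000, tol=1e-10):
    """Repeatedly apply T: J <- min_u sum_j P[u][i,j]*(g[u][i,j] + alpha*J[j])."""
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(iters):
        Q = np.stack([np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(len(P))])
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J

def policy_iteration(P, g, alpha):
    n = P[0].shape[0]
    mu = np.zeros(n, dtype=int)                          # arbitrary initial policy
    while True:
        # Policy evaluation: solve J = T_mu J, an n x n linear system.
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = np.array([np.dot(P[mu[i]][i], g[mu[i]][i]) for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: mu'(i) attains the minimum in the DP equation with J.
        Q = np.stack([np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(len(P))])
        mu_new = Q.argmin(axis=0)
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new
```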

INTERPRETATION OF VI AND PI

[Figure: one-dimensional illustration. Left panel: value iterations $J_0, TJ_0, T^2 J_0, \dots$ move along the 45-degree line toward the fixed point $J^* = TJ^*$. Right panel: policy iteration alternates policy evaluation (solving $J_{\mu^1} = T_{\mu^1} J_{\mu^1}$) and policy improvement.]

JUSTIFICATION OF POLICY ITERATION

- We can show that $J_{\mu^{k+1}} \le J_{\mu^k}$ for all $k$.
- Proof: For given $k$, we have
  $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k} \le T_{\mu^k} J_{\mu^k} = J_{\mu^k}$
  Using the monotonicity property of DP,
  $J_{\mu^k} \ge T_{\mu^{k+1}} J_{\mu^k} \ge T_{\mu^{k+1}}^2 J_{\mu^k} \ge \cdots \ge \lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k}$
  Since
  $\lim_{N\to\infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}},$
  we have $J_{\mu^k} \ge J_{\mu^{k+1}}$.
- If $J_{\mu^k} = J_{\mu^{k+1}}$, then $J_{\mu^k}$ solves Bellman's equation and is therefore equal to $J^*$.
- So at iteration $k$ either the algorithm generates a strictly improved policy or it finds an optimal policy.
- For a finite spaces MDP, there are finitely many stationary policies, so the algorithm terminates with an optimal policy.

APPROXIMATE PI

- Suppose that the policy evaluation is approximate,
  $\| J_k - J_{\mu^k} \| \le \delta, \quad k = 0, 1, \dots,$
  and policy improvement is approximate,
  $\| T_{\mu^{k+1}} J_k - T J_k \| \le \epsilon, \quad k = 0, 1, \dots,$
  where $\delta$ and $\epsilon$ are some positive scalars.
- Error Bound I: the sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
  $\limsup_{k\to\infty} \| J_{\mu^k} - J^* \| \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
- Typical practical behavior: the method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.
- Error Bound II: if in addition the sequence $\{\mu^k\}$ terminates at $\mu$,
  $\| J_\mu - J^* \| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}$

OPTIMISTIC POLICY ITERATION

- Optimistic PI (more efficient): this is PI, where policy evaluation is done approximately, with a finite number of VI.
- So we approximate the policy evaluation
  $J_\mu \approx T_\mu^m J$
  for some number $m \in [1, \infty)$.
- Shorthand definition: for some integers $m_k$,
  $T_{\mu^k} J_k = T J_k, \quad J_{k+1} = T_{\mu^k}^{m_k} J_k, \quad k = 0, 1, \dots$
- If $m_k \equiv 1$ it becomes VI.
- If $m_k = \infty$ it becomes PI.
- Can be shown to converge (in an infinite number of iterations).

Q-LEARNING I

- We can write Bellman's equation as
  $J^*(x) = \min_{u \in U(x)} Q^*(x, u), \quad \forall x,$
  where $Q^*$ is the unique solution of
  $Q^*(x, u) = E\left\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q^*(\bar x, v) \right\}$
  with $\bar x = f(x,u,w)$.
- $Q^*(x, u)$ is called the optimal Q-factor of $(x, u)$.
- We can equivalently write the VI method as
  $J_{k+1}(x) = \min_{u \in U(x)} Q_{k+1}(x, u), \quad \forall x,$
  where $Q_{k+1}$ is generated by
  $Q_{k+1}(x, u) = E\left\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q_k(\bar x, v) \right\}$
  with $\bar x = f(x,u,w)$ (a small sketch of this Q-factor iteration follows this slide).
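A minimal sketch of VI in Q-factor form for a finite MDP, reusing the assumed `P[u]`/`g[u]` data format from the earlier sketches (an illustration only).

```python
import numpy as np

def q_factor_value_iteration(P, g, alpha, iters=1000, tol=1e-10):
    """Q_{k+1}(i,u) = sum_j P[u][i,j] * (g[u][i,j] + alpha * min_v Q_k(j,v))."""
    num_u, n = len(P), P[0].shape[0]
    Q = np.zeros((n, num_u))
    for _ in range(iters):
        J = Q.min(axis=1)                       # J_k(j) = min_v Q_k(j, v)
        Q_new = np.stack(
            [np.sum(P[u] * (g[u] + alpha * J), axis=1) for u in range(num_u)], axis=1
        )
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```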

Q-LEARNING II

- Q-factors are no different than costs.
- They satisfy a Bellman equation $Q = FQ$, where
  $(FQ)(x, u) = E\left\{ g(x,u,w) + \alpha \min_{v \in U(\bar x)} Q(\bar x, v) \right\}$
  with $\bar x = f(x,u,w)$.
- VI and PI for Q-factors are mathematically equivalent to VI and PI for costs.
- They require an equal amount of computation ... they just need more storage.
- Having optimal Q-factors is convenient when implementing an optimal policy on-line by
  $\mu^*(x) = \arg\min_{u \in U(x)} Q^*(x, u)$
- Once $Q^*(x, u)$ are known, the model [$g$ and $E\{\cdot\}$] is not needed. Model-free operation.
- Later we will see how stochastic/sampling methods can be used to calculate (approximations of) $Q^*(x, u)$ using a simulator of the system (no model needed).

A MORE GENERAL/ABSTRACT VIEW

- Let $Y$ be a real vector space with a norm $\|\cdot\|$.
- A function $F : Y \mapsto Y$ is said to be a contraction mapping if for some $\rho \in (0, 1)$ we have
  $\| Fy - Fz \| \le \rho \| y - z \|, \quad$ for all $y, z \in Y$.
  $\rho$ is called the modulus of contraction of $F$.
- Important example: let $X$ be a set (e.g., the state space in DP), and $v : X \mapsto \Re$ a positive-valued function. Let $B(X)$ be the set of all functions $J : X \mapsto \Re$ such that $J(x)/v(x)$ is bounded over $x$. We define a norm on $B(X)$, called the weighted sup-norm, by
  $\| J \| = \max_{x \in X} \frac{|J(x)|}{v(x)}.$
- Important special case: the discounted problem mappings $T$ and $T_\mu$ [for $v(x) \equiv 1$, $\rho = \alpha$].

A DP-LIKE CONTRACTION MAPPING

- Let $X = \{1, 2, \dots\}$, and let $F : B(X) \mapsto B(X)$ be a linear mapping of the form
  $(FJ)(i) = b_i + \sum_{j \in X} a_{ij} J(j), \quad i = 1, 2, \dots$
  where $b_i$ and $a_{ij}$ are some scalars. Then $F$ is a contraction with modulus $\rho$ if and only if
  $\frac{\sum_{j \in X} |a_{ij}|\, v(j)}{v(i)} \le \rho, \quad i = 1, 2, \dots$
- Let $F : B(X) \mapsto B(X)$ be a mapping of the form
  $(FJ)(i) = \min_{\mu \in M} (F_\mu J)(i), \quad i = 1, 2, \dots$
  where $M$ is a parameter set, and for each $\mu \in M$, $F_\mu$ is a contraction mapping from $B(X)$ to $B(X)$ with modulus $\rho$. Then $F$ is a contraction mapping with modulus $\rho$.
- Allows the extension of main DP results from bounded cost to unbounded cost.

CONTRACTION MAPPING FIXED-POINT THEOREM

- Contraction Mapping Fixed-Point Theorem: if $F : B(X) \mapsto B(X)$ is a contraction with modulus $\rho \in (0, 1)$, then there exists a unique $J^* \in B(X)$ such that
  $J^* = F J^*.$
  Furthermore, if $J$ is any function in $B(X)$, then $\{F^k J\}$ converges to $J^*$ and we have
  $\| F^k J - J^* \| \le \rho^k \| J - J^* \|, \quad k = 1, 2, \dots$
- This is a special case of a general result for contraction mappings $F : Y \mapsto Y$ over normed vector spaces $Y$ that are complete: every sequence $\{y_k\}$ that is Cauchy (satisfies $\| y_m - y_n \| \to 0$ as $m, n \to \infty$) converges.
- The space $B(X)$ is complete (see the text for a proof).

GENERAL FORMS OF DISCOUNTED DP

- We consider an abstract form of DP based on monotonicity and contraction.
- Abstract mapping: denote by $R(X)$ the set of real-valued functions $J : X \mapsto \Re$, and let $H : X \times U \times R(X) \mapsto \Re$ be a given mapping. We consider the mapping
  $(TJ)(x) = \min_{u \in U(x)} H(x, u, J), \quad \forall x \in X.$
- We assume that $(TJ)(x) > -\infty$ for all $x \in X$, so $T$ maps $R(X)$ into $R(X)$.
- Abstract policies: let $M$ be the set of policies, i.e., functions $\mu$ such that $\mu(x) \in U(x)$ for all $x \in X$. For each $\mu \in M$, we consider the mapping $T_\mu : R(X) \mapsto R(X)$ defined by
  $(T_\mu J)(x) = H\big(x, \mu(x), J\big), \quad \forall x \in X.$
- Find a function $J^* \in R(X)$ such that
  $J^*(x) = \min_{u \in U(x)} H(x, u, J^*), \quad \forall x \in X$

EXAMPLES

- Discounted problems (and stochastic shortest paths - SSP, for $\alpha = 1$):
  $H(x, u, J) = E\left\{ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \right\}$
- Discounted semi-Markov problems:
  $H(x, u, J) = G(x, u) + \sum_{y=1}^{n} m_{xy}(u) J(y),$
  where $m_{xy}$ are "discounted" transition probabilities, defined by the transition distributions.
- Shortest path problems:
  $H(x, u, J) = \begin{cases} a_{xu} + J(u) & \text{if } u \ne d, \\ a_{xd} & \text{if } u = d, \end{cases}$
  where $d$ is the destination. There is also a stochastic version of this problem.
- Minimax problems:
  $H(x, u, J) = \max_{w \in W(x,u)} \big[ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \big]$

ASSUMPTIONS

- Monotonicity assumption: if $J, J' \in R(X)$ and $J \le J'$, then
  $H(x, u, J) \le H(x, u, J'), \quad \forall x \in X, \ u \in U(x)$
- Contraction assumption:
  - For every $J \in B(X)$, the functions $T_\mu J$ and $TJ$ belong to $B(X)$.
  - For some $\alpha \in (0, 1)$, and all $\mu$ and $J, J' \in B(X)$, we have
    $\| T_\mu J - T_\mu J' \| \le \alpha \| J - J' \|$
- We can show all the standard analytical and computational results of discounted DP based on these two assumptions.
- With just the monotonicity assumption (as in the SSP or other undiscounted problems) we can still show various forms of the basic results under appropriate assumptions.

RESULTS USING CONTRACTION

- Proposition 1: The mappings $T_\mu$ and $T$ are weighted sup-norm contraction mappings with modulus $\alpha$ over $B(X)$, and have unique fixed points in $B(X)$, denoted $J_\mu$ and $J^*$, respectively (cf. Bellman's equation).
  Proof: From the contraction property of $H$.
- Proposition 2: For any $J \in B(X)$ and $\mu \in M$,
  $\lim_{k\to\infty} T_\mu^k J = J_\mu, \quad \lim_{k\to\infty} T^k J = J^*$
  (cf. convergence of value iteration).
  Proof: From the contraction property of $T_\mu$ and $T$.
- Proposition 3: We have $T_\mu J^* = T J^*$ if and only if $J_\mu = J^*$ (cf. optimality condition).
  Proof: If $T_\mu J^* = T J^*$, then $T_\mu J^* = J^*$, implying $J^* = J_\mu$. Conversely, if $J_\mu = J^*$, then $T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = T J^*$.

RESULTS USING MON. AND CONTRACTION

- Optimality of fixed point:
  $J^*(x) = \min_{\mu \in M} J_\mu(x), \quad \forall x \in X$
- Furthermore, for every $\epsilon > 0$, there exists $\mu_\epsilon \in M$ such that
  $J^*(x) \le J_{\mu_\epsilon}(x) \le J^*(x) + \epsilon, \quad \forall x \in X$
- Nonstationary policies: consider the set $\Pi$ of all sequences $\pi = \{\mu_0, \mu_1, \dots\}$ with $\mu_k \in M$ for all $k$, and define
  $J_\pi(x) = \liminf_{k\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J)(x), \quad \forall x \in X,$
  with $J$ being any function (the choice of $J$ does not matter).
- We have
  $J^*(x) = \min_{\pi \in \Pi} J_\pi(x), \quad \forall x \in X$

THE TWO MAIN ALGORITHMS: VI AND PI

- Value iteration: for any (bounded) $J$,
  $J^*(x) = \lim_{k\to\infty} (T^k J)(x), \quad \forall x$
- Policy iteration: given $\mu^k$,
  - Policy evaluation: find $J_{\mu^k}$ by solving
    $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
  - Policy improvement: find $\mu^{k+1}$ such that
    $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
- Optimistic PI: this is PI, where policy evaluation is carried out by a finite number of VI.
  - Shorthand definition: for some integers $m_k$,
    $T_{\mu^k} J_k = T J_k, \quad J_{k+1} = T_{\mu^k}^{m_k} J_k, \quad k = 0, 1, \dots$
  - If $m_k \equiv 1$ it becomes VI.
  - If $m_k = \infty$ it becomes PI.
  - For intermediate values of $m_k$, it is generally more efficient than either VI or PI.

ASYNCHRONOUS ALGORITHMS

- Motivation for asynchronous algorithms:
  - Faster convergence
  - Parallel and distributed computation
  - Simulation-based implementations
- General framework: partition $X$ into disjoint nonempty subsets $X_1, \dots, X_m$, and use a separate processor $\ell$ updating $J(x)$ for $x \in X_\ell$.
- Let $J$ be partitioned as
  $J = (J_1, \dots, J_m),$
  where $J_\ell$ is the restriction of $J$ on the set $X_\ell$.
- Synchronous algorithm:
  $J_\ell^{t+1}(x) = T(J_1^t, \dots, J_m^t)(x), \quad x \in X_\ell, \ \ell = 1, \dots, m$
- Asynchronous algorithm: for some subsets of times $R_\ell$,
  $J_\ell^{t+1}(x) = \begin{cases} T\big(J_1^{\tau_{\ell 1}(t)}, \dots, J_m^{\tau_{\ell m}(t)}\big)(x) & \text{if } t \in R_\ell, \\ J_\ell^t(x) & \text{if } t \notin R_\ell, \end{cases}$
  where $t - \tau_{\ell j}(t)$ are communication "delays".

ONE-STATE-AT-A-TIME ITERATIONS

- Important special case: assume $n$ states, a separate processor for each state, and no delays.
- Generate a sequence of states $\{x^0, x^1, \dots\}$, generated in some way, possibly by simulation (each state is generated infinitely often).
- Asynchronous VI:
  $J_\ell^{t+1} = \begin{cases} T(J_1^t, \dots, J_n^t)(\ell) & \text{if } \ell = x^t, \\ J_\ell^t & \text{if } \ell \ne x^t, \end{cases}$
  where $T(J_1^t, \dots, J_n^t)(\ell)$ denotes the $\ell$-th component of the vector
  $T(J_1^t, \dots, J_n^t) = T J^t,$
  and for simplicity we write $J_\ell^t$ instead of $J_\ell^t(\ell)$ (a small sketch follows this slide).
- The special case where
  $\{x^0, x^1, \dots\} = \{1, \dots, n, 1, \dots, n, 1, \dots\}$
  is the Gauss-Seidel method.
- We can show that $J^t \to J^*$ under the contraction assumption.
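A minimal sketch of the one-state-at-a-time iteration for a finite MDP, again under the assumed `P[u]`/`g[u]` data format used in the earlier sketches; passing the repeating sequence 0, 1, ..., n-1, 0, 1, ... reproduces the Gauss-Seidel variant.

```python
import numpy as np

def asynchronous_vi(P, g, alpha, state_sequence):
    """Update one component at a time: J[x_t] <- (TJ)(x_t), other components unchanged."""
    n = P[0].shape[0]
    J = np.zeros(n)
    for x in state_sequence:
        J[x] = min(np.dot(P[u][x], g[u][x] + alpha * J) for u in range(len(P)))
    return J
```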

ASYNCHRONOUS CONV. THEOREM I

- Assume that for all $\ell, j = 1, \dots, m$, $R_\ell$ is infinite and $\lim_{t\to\infty} \tau_{\ell j}(t) = \infty$.
- Proposition: Let $T$ have a unique fixed point $J^*$, and assume that there is a sequence of nonempty subsets $S(k) \subset R(X)$ with $S(k+1) \subset S(k)$ for all $k$, and with the following properties:
  (1) Synchronous Convergence Condition: every sequence $\{J^k\}$ with $J^k \in S(k)$ for each $k$ converges pointwise to $J^*$. Moreover, we have
      $TJ \in S(k+1), \quad \forall J \in S(k), \ k = 0, 1, \dots$
  (2) Box Condition: for all $k$, $S(k)$ is a Cartesian product of the form
      $S(k) = S_1(k) \times \cdots \times S_m(k),$
      where $S_\ell(k)$ is a set of real-valued functions on $X_\ell$, $\ell = 1, \dots, m$.
  Then for every $J \in S(0)$, the sequence $\{J^t\}$ generated by the asynchronous algorithm converges pointwise to $J^*$.

ASYNCHRONOUS CONV. THEOREM II

- Interpretation of assumptions:
  [Figure: nested sets $S(0) \supset \cdots \supset S(k) \supset S(k+1) \supset \cdots$ shrinking toward $J^*$, with $J = (J_1, J_2)$ and box components $S_1(0) \times S_2(0)$. A synchronous iteration from any $J$ in $S(k)$ moves into $S(k+1)$ (component-by-component).]
- Convergence mechanism:
  [Figure: $J_1$ iterations and $J_2$ iterations move independently into the corresponding component portions of the nested sets.]
- Key: independent component-wise improvement. An asynchronous component iteration from any $J$ in $S(k)$ moves into the corresponding component portion of $S(k+1)$.

APPROXIMATE DYNAMIC PROGRAMMING - LECTURE 3

LECTURE OUTLINE

- Review of theory and algorithms for discounted DP
- MDP and stochastic shortest path problems (briefly)
- Introduction to approximation in policy and value space
- Approximation architectures
- Simulation-based approximate policy iteration
- Approximate policy iteration and Q-factors
- Direct and indirect approximation
- Simulation issues

DISCOUNTED PROBLEMS/BOUNDED COST

- Stationary system with arbitrary state space
  $x_{k+1} = f(x_k, u_k, w_k), \quad k = 0, 1, \dots$
- Cost of a policy $\pi = \{\mu_0, \mu_1, \dots\}$
  $J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\, k=0,1,\dots}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\}$
  with $\alpha < 1$ and $g$ bounded

MDP - TRANSITION PROBABILITY NOTATION

- Assume the system is an $n$-state (controlled) Markov chain.
- Change to Markov chain notation:
  - States $i = 1, \dots, n$ (instead of $x$)
  - Transition probabilities $p_{i_k i_{k+1}}(u_k)$ [instead of $x_{k+1} = f(x_k, u_k, w_k)$]
  - Stage cost $g(i_k, u_k, i_{k+1})$ [instead of $g(x_k, u_k, w_k)$]
- Cost of a policy $\pi = \{\mu_0, \mu_1, \dots\}$
  $J_\pi(i) = \lim_{N\to\infty} E_{i_k,\, k=1,2,\dots}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(i_k, \mu_k(i_k), i_{k+1}\big) \,\Big|\, i_0 = i \right\}$
- Shorthand notation for DP mappings
  $(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J(j) \big), \quad i = 1, \dots, n$
  $(T_\mu J)(i) = \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\big( g(i, \mu(i), j) + \alpha J(j) \big), \quad i = 1, \dots, n$

"SHORTHAND" THEORY - A SUMMARY

- Cost function expressions [with $J_0(i) \equiv 0$]
  $J_\pi(i) = \lim_{k\to\infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J_0)(i), \quad J_\mu(i) = \lim_{k\to\infty} (T_\mu^k J_0)(i)$
- Bellman's equation: $J^* = T J^*$, $J_\mu = T_\mu J_\mu$, or
  $J^*(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J^*(j) \big), \quad \forall i$
  $J_\mu(i) = \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\big( g(i, \mu(i), j) + \alpha J_\mu(j) \big), \quad \forall i$
- Optimality condition:
  $\mu$: optimal if and only if $T_\mu J^* = T J^*$, i.e.,
  $\mu(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J^*(j) \big), \quad \forall i$

THE TWO MAIN ALGORITHMS: VI AND PI

- Value iteration: for any $J \in \Re^n$,
  $J^*(i) = \lim_{k\to\infty} (T^k J)(i), \quad i = 1, \dots, n$
- Policy iteration: given $\mu^k$,
  - Policy evaluation: find $J_{\mu^k}$ by solving
    $J_{\mu^k}(i) = \sum_{j=1}^{n} p_{ij}\big(\mu^k(i)\big)\big( g(i, \mu^k(i), j) + \alpha J_{\mu^k}(j) \big), \quad i = 1, \dots, n,$
    or $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
  - Policy improvement: let $\mu^{k+1}$ be such that
    $\mu^{k+1}(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J_{\mu^k}(j) \big), \quad \forall i,$
    or $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
- Policy evaluation is equivalent to solving an $n \times n$ linear system of equations.
- For large $n$, exact PI is out of the question (even though it terminates finitely).

STOCHASTIC SHORTEST PATH (SSP) PROBLEMS

- Involves states $i = 1, \dots, n$ plus a special cost-free and absorbing termination state $t$.
- Objective: minimize the total (undiscounted) cost. Aim: reach $t$ at minimum expected cost.
- An example: tetris.

[Figure: a tetris board; image used under fair use, http://ocw.mit.edu/fairuse]
SSP THEORY

- SSP problems provide a "soft boundary" between the easy finite-state discounted problems and the hard undiscounted problems.
  - They share features of both.
  - Some of the nice theory is recovered because of the termination state.
- Definition: a proper policy is a stationary policy that leads to $t$ with probability 1.
- If all stationary policies are proper, $T$ and $T_\mu$ are contractions with respect to a common weighted sup-norm.
- The entire analytical and algorithmic theory for discounted problems goes through if all stationary policies are proper (we will assume this).
- There is a strong theory even if there are improper policies (but they should be assumed to be nonoptimal - see the textbook).

GENERAL ORIENTATION TO ADP

- We will mainly adopt an $n$-state discounted model (the easiest case - but think of HUGE $n$).
- Extensions to SSP and average cost are possible (but more quirky). We will set them aside for later.
- There are many approaches:
  - Manual/trial-and-error approach
  - Problem approximation
  - Simulation-based approaches (we will focus on these): "neuro-dynamic programming" or "reinforcement learning".
- Simulation is essential for large state spaces because of its (potential) computational complexity advantage in computing sums/expectations involving a very large number of terms.
- Simulation also comes in handy when an analytical model of the system is unavailable, but a simulation/computer model is possible.
- Simulation-based methods are of three types:
  - Rollout (we will not discuss further)
  - Approximation in value space
  - Approximation in policy space

APPROXIMATION IN VALUE SPACE

- Approximate $J^*$ or $J_\mu$ from a parametric class $\tilde J(i, r)$, where $i$ is the current state and $r = (r_1, \dots, r_m)$ is a vector of tunable scalars ("weights").
- By adjusting $r$ we can change the "shape" of $\tilde J$ so that it is "reasonably close" to the true optimal $J^*$.
- Two key issues:
  - The choice of parametric class $\tilde J(i, r)$ (the approximation architecture).
  - The method for tuning the weights ("training" the architecture).
- Successful application strongly depends on how these issues are handled, and on insight about the problem.
- A simulator may be used, particularly when there is no mathematical model of the system (but there is a computer model).
- We will focus on simulation, but this is not the only possibility [e.g., $\tilde J(i, r)$ may be a lower bound approximation based on relaxation, or other problem approximation].

APPROXIMATION ARCHITECTURES

- Divided in linear and nonlinear [i.e., linear or nonlinear dependence of $\tilde J(i, r)$ on $r$].
- Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
- Computer chess example: uses a feature-based position evaluator that assigns a score to each move/position.
  [Figure: position evaluator = feature extraction (features: material balance, mobility, safety, etc.) followed by weighting of features, producing a score.]
- Many context-dependent special features.
- Most often the weighting of features is linear, but multistep lookahead is involved.
- In chess, most often the training is done by trial and error.

LINEAR APPROXIMATION ARCHITECTURES

- Ideally, the features encode much of the nonlinearity inherent in the cost-to-go approximated.
- Then the approximation may be quite accurate without a complicated architecture.
- With well-chosen features, we can use a linear architecture: $\tilde J(i, r) = \phi(i)'r$, $i = 1, \dots, n$, or more compactly
  $\tilde J(r) = \Phi r,$
  where $\Phi$ is the matrix whose rows are $\phi(i)'$, $i = 1, \dots, n$ (a small sketch follows this slide).
  [Figure: state $i$ -> feature extraction mapping -> feature vector $\phi(i)$ -> linear cost approximator $\phi(i)'r$.]
- This is approximation on the subspace
  $S = \{ \Phi r \mid r \in \Re^s \}$
  spanned by the columns of $\Phi$ (basis functions).
- Many examples of feature types: polynomial approximation, radial basis functions, kernels of all sorts, interpolation, and special problem-specific features (as in chess and tetris).
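A minimal sketch of evaluating a linear architecture $\tilde J(r) = \Phi r$; the polynomial feature map and the chain size used in the example are hypothetical choices, not data from the slides.

```python
import numpy as np

def linear_cost_approximation(features, r, n):
    """J_tilde(i, r) = phi(i)'r for i = 1..n; compactly J_tilde(r) = Phi r,
    where Phi has rows phi(i)'."""
    Phi = np.array([features(i) for i in range(1, n + 1)])   # n x s matrix of basis functions
    return Phi @ r

# Hypothetical example: polynomial features phi(i) = (1, i, i^2) on a 100-state chain.
features = lambda i: np.array([1.0, i, i ** 2])
J_tilde = linear_cost_approximation(features, r=np.array([0.5, -0.1, 0.01]), n=100)
```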

APPROXIMATION IN POLICY SPACE

- A brief discussion; we will return to it at the end.
- We parameterize the set of policies by a vector $r = (r_1, \dots, r_s)$ and we optimize the cost over $r$.
- Discounted problem example:
  - Each value of $r$ defines a stationary policy, with cost starting at state $i$ denoted by $\tilde J(i; r)$.
  - Use a random search, gradient, or other method to minimize over $r$
    $\bar J(r) = \sum_{i=1}^{n} p_i \tilde J(i; r),$
    where $(p_1, \dots, p_n)$ is some probability distribution over the states.
- In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.
  - A cost approximation architecture parameterized by $r$ defines a policy dependent on $r$ via the minimization in Bellman's equation.

APPROX. IN VALUE SPACE - APPROACHES

- Approximate PI (policy evaluation/policy improvement):
  - Uses simulation algorithms to approximate the cost $J_\mu$ of the current policy $\mu$.
  - Projected equation and aggregation approaches.
- Approximation of the optimal cost function $J^*$:
  - Q-Learning: use a simulation algorithm to approximate the optimal costs $J^*(i)$ or the Q-factors
    $Q^*(i, u) = g(i, u) + \alpha \sum_{j=1}^{n} p_{ij}(u) J^*(j)$
  - Bellman error approach: find $r$ to
    $\min_r E_i\left\{ \big( \tilde J(i, r) - (T \tilde J)(i, r) \big)^2 \right\},$
    where $E_i\{\cdot\}$ is taken with respect to some distribution.
  - Approximate LP (we will not discuss here).

APPROXIMATE POLICY ITERATION

- General structure:
  [Figure: block diagram. A system simulator produces samples for the current state $i$; a decision generator computes decisions $\tilde\mu(i)$ using a cost-to-go approximator that supplies values $\tilde J(j, r)$; a cost approximation algorithm processes the samples to produce $\tilde J(j, r)$.]
- $\tilde J(j, r)$ is the cost approximation for the preceding policy, used by the decision generator to compute the current policy $\tilde\mu$ [whose cost is approximated by $\tilde J(j, r)$ using simulation].
- There are several cost approximation/policy evaluation algorithms.
- There are several important issues relating to the design of each block (to be discussed in the future).

POLICY EVALUATION APPROACHES I

- Direct policy evaluation.
- Approximate the cost of the current policy by using least squares and simulation-generated cost samples.
- Amounts to projection of $J_\mu$ onto the approximation subspace.
  [Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$ onto the subspace $S = \{\Phi r \mid r \in \Re^s\}$.]
- Solution of the least squares problem by batch and incremental methods.
- Regular and optimistic policy iteration.
- Nonlinear approximation architectures may also be used.

POLICY EVALUATION APPROACHES II

- Indirect policy evaluation.
  [Figure: Projected Value Iteration (PVI) and Least Squares Policy Evaluation (LSPE). In both, the value iterate $T(\Phi r_k) = g + \alpha P \Phi r_k$ is projected onto the subspace $S$ spanned by the basis functions to produce $\Phi r_{k+1}$; LSPE adds simulation error.]
- An example of indirect approach: Galerkin approximation.
  - Solve the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm.
  - TD($\lambda$): stochastic iterative algorithm for solving $\Phi r = \Pi T_\mu(\Phi r)$.
  - LSPE($\lambda$): a simulation-based form of projected value iteration,
    $\Phi r_{k+1} = \Pi T_\mu(\Phi r_k) + \text{simulation noise}$
  - LSTD($\lambda$): solves a simulation-based approximation with a standard solver (e.g., Matlab).

POLICY EVALUATION APPROACHES III

- Aggregation approximation: solve
  $\Phi r = \Phi D T_\mu(\Phi r),$
  where the rows of $D$ and $\Phi$ are probability distributions (e.g., $D$ and $\Phi$ "aggregate" rows and columns of the linear system $J = T_\mu J$).
  [Figure: original system states $i, j$ and aggregate states $x, y$, linked by disaggregation probabilities $d_{xi}$ and aggregation probabilities $\phi_{jy}$, with
  $\hat p_{xy}(u) = \sum_{i=1}^{n} d_{xi} \sum_{j=1}^{n} p_{ij}(u)\,\phi_{jy}, \qquad \hat g(x, u) = \sum_{i=1}^{n} d_{xi} \sum_{j=1}^{n} p_{ij}(u)\, g(i,u,j).$]
- Several different choices of $D$ and $\Phi$.

POLICY EVALUATION APPROACHES IV

[Figure: same aggregation diagram as the previous slide, with aggregate transition probabilities $\hat p_{xy}(u)$ and aggregate costs $\hat g(x, u)$.]

- Aggregation is a systematic approach for problem approximation. Main elements:
  - Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods).
  - Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem.
- Because an exact PI algorithm is used to solve the approximate/aggregate problem, the method behaves more regularly than the projected equation approach.

THEORETICAL BASIS OF APPROXIMATE PI

- If policies are approximately evaluated using an approximation architecture such that
  $\max_i | \tilde J(i, r_k) - J_{\mu^k}(i) | \le \delta, \quad k = 0, 1, \dots$
- If policy improvement is also approximate,
  $\max_i | (T_{\mu^{k+1}} \tilde J)(i, r_k) - (T \tilde J)(i, r_k) | \le \epsilon, \quad k = 0, 1, \dots$
- Error bound: the sequence $\{\mu^k\}$ generated by approximate policy iteration satisfies
  $\limsup_{k\to\infty} \max_i \big( J_{\mu^k}(i) - J^*(i) \big) \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$
- Typical practical behavior: the method makes steady progress up to a point and then the iterates $J_{\mu^k}$ oscillate within a neighborhood of $J^*$.

THE USE OF SIMULATION - AN EXAMPLE

- Projection by Monte Carlo simulation: compute the projection $\Pi J$ of a vector $J \in \Re^n$ on the subspace $S = \{\Phi r \mid r \in \Re^s\}$, with respect to a weighted Euclidean norm $\|\cdot\|_\xi$.
- Equivalently, find $\Phi r^*$, where
  $r^* = \arg\min_{r \in \Re^s} \| \Phi r - J \|_\xi^2 = \arg\min_{r \in \Re^s} \sum_{i=1}^{n} \xi_i \big( \phi(i)'r - J(i) \big)^2$
- Setting to 0 the gradient at $r^*$,
  $r^* = \left( \sum_{i=1}^{n} \xi_i\, \phi(i)\phi(i)' \right)^{-1} \sum_{i=1}^{n} \xi_i\, \phi(i) J(i)$
- Approximate by simulation the two "expected values":
  $\hat r_k = \left( \sum_{t=1}^{k} \phi(i_t)\phi(i_t)' \right)^{-1} \sum_{t=1}^{k} \phi(i_t) J(i_t)$
- Equivalent least squares alternative (a small sketch follows this slide):
  $\hat r_k = \arg\min_{r \in \Re^s} \sum_{t=1}^{k} \big( \phi(i_t)'r - J(i_t) \big)^2$
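A minimal sketch of the simulation-based estimate $\hat r_k$ above. The inputs (a list of sampled `(state, J_value)` pairs and a feature map returning $\phi(i)$ as an array) are assumed for illustration.

```python
import numpy as np

def simulation_based_projection(samples, features):
    """r_hat = (sum_t phi(i_t) phi(i_t)')^{-1} sum_t phi(i_t) J(i_t),
    i.e., the least squares fit of phi(i)'r to the sampled values J(i)."""
    s = len(features(samples[0][0]))
    A = np.zeros((s, s))
    b = np.zeros(s)
    for i_t, J_t in samples:
        phi = features(i_t)
        A += np.outer(phi, phi)
        b += phi * J_t
    return np.linalg.solve(A, b)
```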

THE ISSUE OF EXPLORATION

- To evaluate a policy $\mu$, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under $\mu$.
- As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.
- This seriously impacts the improved policy.
- This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).
- One possibility for adequate exploration: frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.
- Another possibility: occasionally generate transitions that use a randomly selected control rather than the one dictated by the policy $\mu$.
- Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).

APPROXIMATING Q-FACTORS

- The approach described so far for policy evaluation requires calculating expected values [and knowledge of $p_{ij}(u)$] for all controls $u \in U(i)$.
- Model-free alternative: approximate Q-factors
  $\tilde Q(i,u,r) \approx \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J_\mu(j) \big)$
  and use for policy improvement the minimization
  $\bar\mu(i) = \arg\min_{u \in U(i)} \tilde Q(i,u,r)$
- $r$ is an adjustable parameter vector and $\tilde Q(i,u,r)$ is a parametric architecture, such as
  $\tilde Q(i,u,r) = \sum_{m=1}^{s} r_m \phi_m(i, u)$
- We can use any approach for cost approximation, e.g., projected equations, aggregation.
- Use the Markov chain with states $(i, u)$: $p_{ij}(\mu(i))$ is the transition probability to $(j, \mu(i))$, 0 to other $(j, u)$.
- Major concern: acutely diminished exploration.

6.231 DYNAMIC PROGRAMMING - LECTURE 4

LECTURE OUTLINE

- Review of approximation in value space
- Approximate VI and PI
- Projected Bellman equations
- Matrix form of the projected equation
- Simulation-based implementation
- LSTD and LSPE methods
- Optimistic versions
- Multistep projected Bellman equations
- Bias-variance tradeoff

DISCOUNTED MDP

- System: controlled Markov chain with states $i = 1, \dots, n$ and finite set of controls $u \in U(i)$.
- Transition probabilities: $p_{ij}(u)$
  [Figure: two-state chain with transition probabilities $p_{ii}(u)$, $p_{ij}(u)$, $p_{ji}(u)$, $p_{jj}(u)$.]
- Cost of a policy $\pi = \{\mu_0, \mu_1, \dots\}$ starting at state $i$:
  $J_\pi(i) = \lim_{N\to\infty} E\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(i_k, \mu_k(i_k), i_{k+1}\big) \,\Big|\, i_0 = i \right\},$
  with $\alpha \in [0, 1)$.
- Shorthand notation for DP mappings
  $(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J(j) \big), \quad i = 1, \dots, n$
  $(T_\mu J)(i) = \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\big( g(i, \mu(i), j) + \alpha J(j) \big), \quad i = 1, \dots, n$

"SHORTHAND" THEORY - A SUMMARY

- Bellman's equation: $J^* = T J^*$, $J_\mu = T_\mu J_\mu$, or
  $J^*(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J^*(j) \big), \quad \forall i$
  $J_\mu(i) = \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big)\big( g(i, \mu(i), j) + \alpha J_\mu(j) \big), \quad \forall i$
- Optimality condition:
  $\mu$: optimal if and only if $T_\mu J^* = T J^*$, i.e.,
  $\mu(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J^*(j) \big), \quad \forall i$

THE TWO MAIN ALGORITHMS: VI AND PI

- Value iteration: for any $J \in \Re^n$,
  $J^*(i) = \lim_{k\to\infty} (T^k J)(i), \quad i = 1, \dots, n$
- Policy iteration: given $\mu^k$,
  - Policy evaluation: find $J_{\mu^k}$ by solving
    $J_{\mu^k}(i) = \sum_{j=1}^{n} p_{ij}\big(\mu^k(i)\big)\big( g(i, \mu^k(i), j) + \alpha J_{\mu^k}(j) \big), \quad i = 1, \dots, n,$
    or $J_{\mu^k} = T_{\mu^k} J_{\mu^k}$
  - Policy improvement: let $\mu^{k+1}$ be such that
    $\mu^{k+1}(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha J_{\mu^k}(j) \big), \quad \forall i,$
    or $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$
- Policy evaluation is equivalent to solving an $n \times n$ linear system of equations.
- For large $n$, exact PI is out of the question (even though it terminates finitely).

APPROXIMATION IN VALUE SPACE

- Approximate $J^*$ or $J_\mu$ from a parametric class $\tilde J(i, r)$, where $i$ is the current state and $r = (r_1, \dots, r_m)$ is a vector of tunable scalars ("weights").
- By adjusting $r$ we can change the "shape" of $\tilde J$ so that it is close to the true optimal $J^*$.
- Any $r \in \Re^s$ defines a (suboptimal) one-step lookahead policy
  $\tilde\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha \tilde J(j, r) \big), \quad \forall i$
- We will focus mostly on linear architectures
  $\tilde J(r) = \Phi r,$
  where $\Phi$ is an $n \times s$ matrix whose columns are viewed as basis functions.
- Think $n$: HUGE, $s$: (relatively) SMALL.
- For $\tilde J(r) = \Phi r$, approximation in value space means approximation of $J^*$ or $J_\mu$ within the subspace
  $S = \{ \Phi r \mid r \in \Re^s \}$

APPROXIMATE VI

- Approximates sequentially $J_k(i) = (T^k J_0)(i)$, $k = 1, 2, \dots$, with $\tilde J_k(i, r_k)$.
- The starting function $J_0$ is given (e.g., $J_0 \equiv 0$).
- After a large enough number $N$ of steps, $\tilde J_N(i, r_N)$ is used as approximation $\tilde J(i, r)$ to $J^*(i)$.
- Fitted Value Iteration: a sequential "fit" to produce $\tilde J_{k+1}$ from $\tilde J_k$, i.e., $\tilde J_{k+1} \approx T \tilde J_k$ or (for a single policy $\mu$) $\tilde J_{k+1} \approx T_\mu \tilde J_k$ (a small sketch follows this slide).
  - For a small subset $S_k$ of states $i$, compute
    $(T \tilde J_k)(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha \tilde J_k(j, r) \big)$
  - "Fit" the function $\tilde J_{k+1}(i, r_{k+1})$ to the "small" set of values $(T \tilde J_k)(i)$, $i \in S_k$.
  - Simulation can be used for "model-free" implementation.
- Error bound: if the fit is uniformly accurate within $\delta > 0$ (i.e., $\max_i |\tilde J_{k+1}(i) - T \tilde J_k(i)| \le \delta$),
  $\limsup_{k\to\infty} \max_{i=1,\dots,n} \big( \tilde J_k(i, r_k) - J^*(i) \big) \le \frac{2\alpha\delta}{(1-\alpha)^2}$
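A minimal sketch of one fitted-VI step with a linear architecture. The helper `bellman_backup(i, J)` is an assumed callable that returns $(T\tilde J_k)(i)$ for the current approximation (e.g., computed from the model or from simulation); the names are illustrative, not from the slides.

```python
import numpy as np

def fitted_value_iteration_step(r, sample_states, features, bellman_backup):
    """Evaluate (T J_tilde_k)(i) on a small sample of states, then least-squares fit
    phi(i)'r_{k+1} to those values."""
    J = lambda j: features(j) @ r                     # current approximation phi(j)'r
    targets = np.array([bellman_backup(i, J) for i in sample_states])
    Phi_S = np.array([features(i) for i in sample_states])
    r_next, *_ = np.linalg.lstsq(Phi_S, targets, rcond=None)
    return r_next
```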

AN EXAMPLE OF FAILURE

- Consider a two-state discounted MDP with states 1 and 2, and a single policy.
  - Deterministic transitions: 1 -> 2 and 2 -> 2.
  - Transition costs are 0, so $J^*(1) = J^*(2) = 0$.
- Consider an approximate VI scheme that approximates cost functions within $S = \{ (r, 2r) \mid r \in \Re \}$ with a weighted least squares fit; here $\Phi = (1, 2)'$.
- Given $J_k = (r_k, 2r_k)$, we find $J_{k+1} = (r_{k+1}, 2r_{k+1})$, where for weights $\xi_1, \xi_2 > 0$, $r_{k+1}$ is obtained as
  $r_{k+1} = \arg\min_r \Big[ \xi_1 \big( r - (TJ_k)(1) \big)^2 + \xi_2 \big( 2r - (TJ_k)(2) \big)^2 \Big]$
- With a straightforward calculation,
  $r_{k+1} = \alpha \beta r_k, \quad$ where $\beta = 2(\xi_1 + 2\xi_2)/(\xi_1 + 4\xi_2) > 1$
- So if $\alpha > 1/\beta$, the sequence $\{r_k\}$ diverges and so does $\{J_k\}$ (a small numeric check follows this slide).
- Difficulty: $T$ is a contraction, but $\Pi T$ (= least squares fit composed with $T$) is not.
- Norm mismatch problem.
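A small numeric check of the divergence just described, using illustrative values $\alpha = 0.9$ and $\xi_1 = \xi_2 = 1$ (an assumption for the demo).

```python
# With xi1 = xi2 = 1: beta = 2(xi1 + 2*xi2)/(xi1 + 4*xi2) = 1.2, so alpha*beta = 1.08 > 1.
alpha, xi1, xi2 = 0.9, 1.0, 1.0
beta = 2 * (xi1 + 2 * xi2) / (xi1 + 4 * xi2)

r = 1.0
for k in range(10):
    # TJ_k = (2*alpha*r_k, 2*alpha*r_k) since 1 -> 2, 2 -> 2 with zero cost;
    # the weighted least squares fit then gives r_{k+1} = alpha * beta * r_k.
    r = alpha * beta * r
    print(k, r)    # grows without bound because alpha * beta > 1
```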

APPROXIMATE PI

[Figure: approximate policy iteration loop - guess an initial policy, evaluate the approximate cost $\tilde J_\mu(r) = \Phi r$ using simulation, generate an improved policy, repeat.]

- Evaluation of a typical policy $\mu$: linear cost function approximation $\tilde J_\mu(r) = \Phi r$, where $\Phi$ is a full rank $n \times s$ matrix with columns the basis functions, and $i$th row denoted $\phi(i)'$.
- Policy improvement to generate $\bar\mu$:
  $\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big( g(i,u,j) + \alpha \phi(j)'r \big)$
- Error bound: if
  $\max_i | \tilde J_{\mu^k}(i, r_k) - J_{\mu^k}(i) | \le \delta, \quad k = 0, 1, \dots,$
  the sequence $\{\mu^k\}$ satisfies
  $\limsup_{k\to\infty} \max_i \big( J_{\mu^k}(i) - J^*(i) \big) \le \frac{2\alpha\delta}{(1-\alpha)^2}$

POLICY EVALUATION

- Let's consider approximate evaluation of the cost of the current policy by using simulation.
  - Direct policy evaluation - cost samples generated by simulation, and optimization by least squares.
  - Indirect policy evaluation - solving the projected equation $\Phi r = \Pi T_\mu(\Phi r)$, where $\Pi$ is projection with respect to a suitable weighted Euclidean norm.
  [Figure: direct method - projection $\Pi J_\mu$ of the cost vector $J_\mu$ onto the subspace $S$ spanned by the basis functions; indirect method - solving the projected form $\Phi r = \Pi T_\mu(\Phi r)$ of Bellman's equation.]
- Recall that projection can be implemented by simulation and least squares.

WEIGHTED EUCLIDEAN PROJECTIONS

- Consider a weighted Euclidean norm
  $\| J \|_\xi = \sqrt{ \sum_{i=1}^{n} \xi_i \big( J(i) \big)^2 },$
  where $\xi$ is a vector of positive weights $\xi_1, \dots, \xi_n$.
- Let $\Pi$ denote the projection operation onto
  $S = \{ \Phi r \mid r \in \Re^s \}$
  with respect to this norm, i.e., for any $J \in \Re^n$,
  $\Pi J = \Phi r^*,$
  where
  $r^* = \arg\min_{r \in \Re^s} \| J - \Phi r \|_\xi^2$

PI WITH INDIRECT POLICY EVALUATION

[Figure: approximate policy iteration loop - guess an initial policy, evaluate the approximate cost $\tilde J_\mu(r) = \Phi r$ using simulation, generate an improved policy, repeat.]

- Given the current policy $\mu$:
  - We solve the projected Bellman equation
    $\Phi r = \Pi T_\mu(\Phi r)$
  - We approximate the solution $J_\mu$ of Bellman's equation
    $J = T_\mu J$
    with the projected equation solution $\tilde J_\mu(r)$.

KEY QUESTIONS AND RESULTS

- Does the projected equation have a solution?
- Under what conditions is the mapping $\Pi T_\mu$ a contraction, so $\Pi T_\mu$ has a unique fixed point?
- Assuming $\Pi T_\mu$ has a unique fixed point $\Phi r^*$, how close is $\Phi r^*$ to $J_\mu$?
- Assumption: the Markov chain corresponding to $\mu$ has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive:
  $\xi_j = \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^{N} P(i_k = j \mid i_0 = i) > 0$
- Proposition (Norm Matching Property):
  (a) $\Pi T_\mu$ is a contraction of modulus $\alpha$ with respect to the weighted Euclidean norm $\|\cdot\|_\xi$, where $\xi = (\xi_1, \dots, \xi_n)$ is the steady-state probability vector.
  (b) The unique fixed point $\Phi r^*$ of $\Pi T_\mu$ satisfies
      $\| J_\mu - \Phi r^* \|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}}\, \| J_\mu - \Pi J_\mu \|_\xi$

PRELIMINARIES: PROJECTION PROPERTIES

- Important property of the projection $\Pi$ on $S$ with weighted Euclidean norm $\|\cdot\|_\xi$: for all $J \in \Re^n$ and $\bar J \in S$, the Pythagorean Theorem holds:
  $\| J - \bar J \|_\xi^2 = \| J - \Pi J \|_\xi^2 + \| \Pi J - \bar J \|_\xi^2$
- Proof: Geometrically, $(J - \Pi J)$ and $(\Pi J - \bar J)$ are orthogonal in the scaled geometry of the norm $\|\cdot\|_\xi$, where two vectors $x, y \in \Re^n$ are orthogonal if $\sum_{i=1}^{n} \xi_i x_i y_i = 0$. Expand the quadratic in the RHS below:
  $\| J - \bar J \|_\xi^2 = \| (J - \Pi J) + (\Pi J - \bar J) \|_\xi^2$
- The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,
  $\| \Pi J - \Pi \bar J \|_\xi \le \| J - \bar J \|_\xi, \quad$ for all $J, \bar J \in \Re^n$.
- To see this, note that
  $\big\| \Pi(J - \bar J) \big\|_\xi^2 \le \big\| \Pi(J - \bar J) \big\|_\xi^2 + \big\| (I - \Pi)(J - \bar J) \big\|_\xi^2 = \| J - \bar J \|_\xi^2$

PROOF OF CONTRACTION PROPERTY

- Lemma: if $P$ is the transition matrix of $\mu$,
  $\| Pz \|_\xi \le \| z \|_\xi, \quad z \in \Re^n$
- Proof: Let $p_{ij}$ be the components of $P$. For all $z \in \Re^n$, we have
  $\| Pz \|_\xi^2 = \sum_{i=1}^{n} \xi_i \left( \sum_{j=1}^{n} p_{ij} z_j \right)^2 \le \sum_{i=1}^{n} \xi_i \sum_{j=1}^{n} p_{ij} z_j^2 = \sum_{j=1}^{n} \sum_{i=1}^{n} \xi_i p_{ij} z_j^2 = \sum_{j=1}^{n} \xi_j z_j^2 = \| z \|_\xi^2,$
  where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property $\sum_{i=1}^{n} \xi_i p_{ij} = \xi_j$ of the steady-state probabilities.
- Using the lemma, the nonexpansiveness of $\Pi$, and the definition $T_\mu J = g + \alpha P J$, we have
  $\| \Pi T_\mu J - \Pi T_\mu \bar J \|_\xi \le \| T_\mu J - T_\mu \bar J \|_\xi = \alpha \| P(J - \bar J) \|_\xi \le \alpha \| J - \bar J \|_\xi$
  for all $J, \bar J \in \Re^n$. Hence $\Pi T_\mu$ is a contraction of modulus $\alpha$.

PROOF OF ERROR BOUND

- Let $\Phi r^*$ be the fixed point of $\Pi T$. We have
  $\| J_\mu - \Phi r^* \|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}}\, \| J_\mu - \Pi J_\mu \|_\xi.$
- Proof: We have
  $\| J_\mu - \Phi r^* \|_\xi^2 = \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi J_\mu - \Phi r^* \|_\xi^2
   = \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi T J_\mu - \Pi T(\Phi r^*) \|_\xi^2
   \le \| J_\mu - \Pi J_\mu \|_\xi^2 + \alpha^2 \| J_\mu - \Phi r^* \|_\xi^2,$
  where
  - the first equality uses the Pythagorean Theorem,
  - the second equality holds because $J_\mu$ is the fixed point of $T$ and $\Phi r^*$ is the fixed point of $\Pi T$,
  - the inequality uses the contraction property of $\Pi T$.
  Q.E.D.

MATRIX FORM OF PROJECTED EQUATION

- The projected equation's solution is the vector $J = \Phi r^*$, where $r^*$ solves the problem
  $\min_{r \in \Re^s} \| \Phi r - (g + \alpha P \Phi r^*) \|_\xi^2.$
- Setting to 0 the gradient with respect to $r$ of this quadratic, we obtain
  $\Phi' \Xi \big( \Phi r^* - (g + \alpha P \Phi r^*) \big) = 0,$
  where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \dots, \xi_n$ along the diagonal.
- This is just the orthogonality condition: the error $\Phi r^* - (g + \alpha P \Phi r^*)$ is "orthogonal" to the subspace spanned by the columns of $\Phi$.
- Equivalently,
  $C r^* = d,$
  where
  $C = \Phi' \Xi (I - \alpha P) \Phi, \quad d = \Phi' \Xi g.$

PROJECTED EQUATION: SOLUTION METHODS

- Matrix inversion: $r^* = C^{-1} d$
- Projected Value Iteration (PVI) method:
  $\Phi r_{k+1} = \Pi T(\Phi r_k) = \Pi (g + \alpha P \Phi r_k)$
  Converges to $r^*$ because $\Pi T$ is a contraction.
  [Figure: the value iterate $T(\Phi r_k) = g + \alpha P \Phi r_k$ is projected onto the subspace $S$ spanned by the basis functions, yielding $\Phi r_{k+1}$.]
- PVI can be written as
  $r_{k+1} = \arg\min_{r \in \Re^s} \| \Phi r - (g + \alpha P \Phi r_k) \|_\xi^2.$
  By setting to 0 the gradient with respect to $r$,
  $\Phi' \Xi \big( \Phi r_{k+1} - (g + \alpha P \Phi r_k) \big) = 0,$
  which yields
  $r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d)$

SIMULATION-BASED IMPLEMENTATIONS

- Key idea: calculate simulation-based approximations based on $k$ samples,
  $C_k \approx C, \quad d_k \approx d$
- Matrix inversion $r^* = C^{-1} d$ is approximated by
  $\hat r_k = C_k^{-1} d_k$
  This is the LSTD (Least Squares Temporal Differences) method.
- The PVI method $r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d)$ is approximated by
  $r_{k+1} = r_k - G_k (C_k r_k - d_k),$
  where $G_k \approx (\Phi' \Xi \Phi)^{-1}$.
  This is the LSPE (Least Squares Policy Evaluation) method.
- Key fact: $C_k$, $d_k$, and $G_k$ can be computed with low-dimensional linear algebra (of order $s$, the number of basis functions).

  • 8/12/2019 MIT6 231F11 Notes Short

    88/125

    SIMULATION MECHANICS

We generate an infinitely long trajectory (i_0, i_1, ...) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.

After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)' of Φ and the cost component g(i_t, i_{t+1}).

We form

    C_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t)(φ(i_t) - αφ(i_{t+1}))' ≈ Φ'Ξ(I - αP)Φ

    d_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) g(i_t, i_{t+1}) ≈ Φ'Ξg

Also, in the case of LSPE,

    G_k = (1/(k+1)) Σ_{t=0}^{k} φ(i_t)φ(i_t)' ≈ Φ'ΞΦ

Convergence is based on the law of large numbers.

C_k, d_k, and G_k can be formed incrementally. They can also be written using the formalism of temporal differences (this is just a matter of style).
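A minimal Python sketch of the simulation mechanics above (illustrative, not from the text): it generates a single long trajectory, accumulates C_k, d_k, G_k, and then applies LSTD and a few LSPE iterations. The model, costs, and basis are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n, s, alpha = 5, 2, 0.9
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)     # transition matrix of the evaluated policy
    g_cost = rng.random((n, n))           # transition costs g(i, j)
    Phi = rng.random((n, s))              # basis matrix

    def simulate_estimates(num_samples=100_000):
        """Form C_k, d_k, G_k from one long trajectory (law of large numbers)."""
        C = np.zeros((s, s)); d = np.zeros(s); G = np.zeros((s, s))
        i = 0
        for _ in range(num_samples):
            j = rng.choice(n, p=P[i])                       # sample transition (i_t, i_{t+1})
            C += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
            d += Phi[i] * g_cost[i, j]
            G += np.outer(Phi[i], Phi[i])
            i = j
        return C / num_samples, d / num_samples, G / num_samples

    Ck, dk, Gk = simulate_estimates()
    r_lstd = np.linalg.solve(Ck, dk)                        # LSTD: solve C_k r = d_k
    r = np.zeros(s)
    for _ in range(50):                                     # LSPE: r <- r - G_k^{-1}(C_k r - d_k)
        r = r - np.linalg.solve(Gk, Ck @ r - dk)
    print("LSTD:", r_lstd, "  LSPE:", r)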

    OPTIMISTIC VERSIONS

Instead of calculating nearly exact approximations C_k ≈ C and d_k ≈ d, we do a less accurate approximation, based on few simulation samples.

Evaluate (coarsely) the current policy μ, then do a policy improvement.

This often leads to faster computation (as optimistic methods often do).

Very complex behavior (see the subsequent discussion on oscillations).

The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling).

LSPE tends to cope better because of its iterative nature.

A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise:

    r_{k+1} = r_k - γ G_k(C_k r_k - d_k)

    MULTISTEP METHODS

Introduce a multistep version of Bellman's equation, J = T^(λ)J, where for λ ∈ [0, 1),

    T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}

Geometrically weighted sum of powers of T.

Note that T^ℓ is a contraction with modulus α^ℓ, with respect to the weighted Euclidean norm || · ||_ξ, where ξ is the steady-state probability vector of the Markov chain.

Hence T^(λ) is a contraction with modulus

    α_λ = (1 - λ) Σ_{ℓ=0}^{∞} α^{ℓ+1} λ^ℓ = α(1 - λ) / (1 - αλ)

Note that α_λ → 0 as λ → 1.

T^ℓ and T^(λ) have the same fixed point J_μ and

    ||J_μ - Φr*_λ||_ξ ≤ (1/√(1 - α_λ²)) ||J_μ - ΠJ_μ||_ξ

where Φr*_λ is the fixed point of ΠT^(λ).

The fixed point Φr*_λ depends on λ.

    BIAS-VARIANCE TRADEOFF

[Figure: within the subspace S = {Φr | r ∈ ℝ^s}, the bias ||ΠJ_μ - Φr*_λ||_ξ shrinks as λ goes from 0 to 1, while the simulation error in estimating the solution Φr = ΠT^(λ)(Φr) of the projected equation grows.]

Error bound: ||J_μ - Φr*_λ||_ξ ≤ (1/√(1 - α_λ²)) ||J_μ - ΠJ_μ||_ξ

As λ → 1, we have α_λ → 0, so the error bound (and the quality of approximation) improves as λ → 1. In fact

    lim_{λ→1} Φr*_λ = ΠJ_μ

But the simulation noise in approximating

    T^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T^{ℓ+1}

increases.

The choice of λ is usually based on trial and error.

    MULTISTEP PROJECTED EQ. METHODS

The projected Bellman equation is

    Φr = ΠT^(λ)(Φr)

In matrix form: C^(λ) r = d^(λ), where

    C^(λ) = Φ'Ξ(I - αP^(λ))Φ,   d^(λ) = Φ'Ξ g^(λ),

with

    P^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^{ℓ+1},   g^(λ) = Σ_{ℓ=0}^{∞} α^ℓ λ^ℓ P^ℓ g

The LSTD(λ) method is

    r̂_k = (C_k^(λ))^{-1} d_k^(λ),

where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).

The LSPE(λ) method is

    r_{k+1} = r_k - G_k (C_k^(λ) r_k - d_k^(λ))

where G_k is a simulation-based approximation to (Φ'ΞΦ)^{-1}.

TD(λ): An important simpler/slower iteration [similar to LSPE(λ) with G_k = I - see the text].

    MORE ON MULTISTEP METHODS

The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (single simulation trajectory i_0, i_1, ..., but more complex formulas):

    C_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} (φ(i_m) - αφ(i_{m+1}))'

    d_k^(λ) = (1/(k+1)) Σ_{t=0}^{k} φ(i_t) Σ_{m=t}^{k} α^{m-t} λ^{m-t} g(i_m, i_{m+1})

In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).

Many different versions (see the text).

Note the λ-tradeoffs:

  - As λ ↑ 1, C_k^(λ) and d_k^(λ) contain more simulation noise, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)
  - The error bound ||J_μ - Φr_λ||_ξ becomes smaller
  - As λ → 1, ΠT^(λ) becomes a contraction for an arbitrary projection norm
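The double sums above can be accumulated with an eligibility trace z_t = αλ z_{t-1} + φ(i_t), which is algebraically the same computation. Here is a minimal Python sketch of LSTD(λ) in that form, on an illustrative random model (all names and data are assumptions, not from the text).

    import numpy as np

    rng = np.random.default_rng(1)
    n, s, alpha, lam = 5, 2, 0.9, 0.7
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)
    g_cost = rng.random((n, n))
    Phi = rng.random((n, s))

    def lstd_lambda(num_samples=100_000):
        """LSTD(lambda): the trace z_t carries the (alpha*lambda)^(m-t) weights."""
        C = np.zeros((s, s)); d = np.zeros(s); z = np.zeros(s)
        i = 0
        for _ in range(num_samples):
            j = rng.choice(n, p=P[i])
            z = alpha * lam * z + Phi[i]                 # eligibility trace
            C += np.outer(z, Phi[i] - alpha * Phi[j])
            d += z * g_cost[i, j]
            i = j
        return C / num_samples, d / num_samples

    Ck, dk = lstd_lambda()
    print("LSTD(lambda) solution:", np.linalg.solve(Ck, dk))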

    6.231 DYNAMIC PROGRAMMING

    LECTURE 5

    LECTURE OUTLINE

    Review of approximate PI

Review of approximate policy evaluation based on projected Bellman equations

Exploration enhancement in policy evaluation

Oscillations in approximate PI

Aggregation: an alternative to the projected equation/Galerkin approach

    Examples of aggregation

    Simulation-based aggregation


    DISCOUNTED MDP

System: Controlled Markov chain with states i = 1, ..., n and finite set of controls u ∈ U(i)

Transition probabilities: p_ij(u)

[Figure: two-state transition diagram labeled with the probabilities p_ij(u).]

Cost of a policy π = {μ_0, μ_1, ...} starting at state i:

    J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }

with α ∈ [0, 1)

Shorthand notation for DP mappings:

    (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ(j) ),   i = 1, ..., n

    (T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + αJ(j) ),   i = 1, ..., n

    APPROXIMATE PI

[Figure: approximate policy iteration loop - guess an initial policy, evaluate the approximate cost J̃_μ(r) = Φr using simulation, generate an improved policy, and repeat.]

Evaluation of typical policy μ: Linear cost function approximation

    J̃_μ(r) = Φr

where Φ is a full-rank n × s matrix with columns the basis functions, and ith row denoted φ(i)'.

Policy improvement to generate μ̄:

    μ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αφ(j)'r )
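A minimal Python sketch of the policy improvement step above, assuming the model is available as arrays P[u, i, j] and G[u, i, j]; the array layout and the random data are illustrative assumptions, not part of the lectures.

    import numpy as np

    def improved_policy(P, G, Phi, r, alpha):
        """mu_bar(i) = argmin_u sum_j p_ij(u) (g(i,u,j) + alpha * phi(j)' r)."""
        J_tilde = Phi @ r                                          # approximate cost-to-go
        Q = np.einsum('uij,uij->ui', P, G) + alpha * P @ J_tilde   # shape (controls, states)
        return Q.argmin(axis=0)                                    # best control index per state

    rng = np.random.default_rng(2)
    m, n, s, alpha = 3, 6, 2, 0.9
    P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
    G = rng.random((m, n, n))
    Phi = rng.random((n, s)); r = rng.random(s)
    print("improved policy:", improved_policy(P, G, Phi, r, alpha))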

    EVALUATION BY PROJECTED EQUATIONS

We discussed approximate policy evaluation by solving the projected equation

    Φr = ΠT_μ(Φr)

Π: projection with a weighted Euclidean norm.

Implementation by simulation (single long trajectory using the current policy - important to make ΠT_μ a contraction). LSTD, LSPE methods.

Multistep option: Solve Φr = ΠT_μ^(λ)(Φr) with

    T_μ^(λ) = (1 - λ) Σ_{ℓ=0}^{∞} λ^ℓ T_μ^{ℓ+1}

As λ → 1, ΠT_μ^(λ) becomes a contraction for any projection norm.

Bias-variance tradeoff:

[Figure: within the subspace S = {Φr | r ∈ ℝ^s}, the bias of the solution Φr = ΠT^(λ)(Φr) of the projected equation decreases as λ goes from 0 to 1, while the simulation error increases.]

    POLICY ITERATION ISSUES: EXPLORATION

1st major issue: exploration. To evaluate a policy μ, we need to generate cost samples using that policy.

This biases the simulation by underrepresenting states that are unlikely to occur under μ.

As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.

This seriously impacts the improved policy.

This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is relatively small (e.g., a deterministic system).

A common remedy is the off-policy approach: Replace the transition matrix P of the current policy with a mixture

    P̄ = (I - B)P + BQ

where B is diagonal with diagonal components in [0, 1] and Q is another transition matrix.

The LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated. Related methods and ideas: importance sampling, geometric and free-form sampling (see the text).
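A small Python sketch of the mixture P̄ = (I - B)P + BQ used in the off-policy remedy above; the uniform exploration matrix Q and the constant mixing weights are illustrative assumptions.

    import numpy as np

    def exploration_mixture(P, Q, beta):
        """beta: per-state mixing weights in [0, 1] (the diagonal of B)."""
        B = np.diag(beta)
        return (np.eye(len(beta)) - B) @ P + B @ Q

    rng = np.random.default_rng(3)
    n = 4
    P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
    Q = np.full((n, n), 1.0 / n)                       # e.g., uniform exploration
    P_bar = exploration_mixture(P, Q, beta=np.full(n, 0.1))
    assert np.allclose(P_bar.sum(axis=1), 1.0)         # still a transition matrix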

    POLICY ITERATION ISSUES: OSCILLATIONS

2nd major issue: oscillation of policies.

Analysis using the greedy partition: R_μ is the set of parameter vectors r for which μ is greedy with respect to J̃(·, r) = Φr:

    R_μ = { r | T_μ(Φr) = T(Φr) }

There is a finite number of possible vectors r_μ, one generated from another in a deterministic way.

[Figure: the parameter vectors r_{μ^k}, r_{μ^{k+1}}, r_{μ^{k+2}}, r_{μ^{k+3}} and the greedy partition sets R_{μ^k}, R_{μ^{k+1}}, R_{μ^{k+2}}, R_{μ^{k+3}}.]

The algorithm ends up repeating some cycle of policies μ^k, μ^{k+1}, ..., μ^{k+m} with

    r_{μ^k} ∈ R_{μ^{k+1}},  r_{μ^{k+1}} ∈ R_{μ^{k+2}},  ...,  r_{μ^{k+m}} ∈ R_{μ^k}

Many different cycles are possible.

    MORE ON OSCILLATIONS/CHATTERING

In the case of optimistic policy iteration a different picture holds.

[Figure: intermediate parameter vectors r_{μ^1}, r_{μ^2}, r_{μ^3} moving among the greedy partition sets R_{μ^1}, R_{μ^2}, R_{μ^3}.]

Oscillations are less violent, but the "limit" point is meaningless!

Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J' does not imply ΠJ ≤ ΠJ'.

If approximate PI uses policy evaluation

    Φr = (WT_μ)(Φr)

with W a monotone operator, the generated policies converge (to a possibly nonoptimal limit).

The operator W used in the aggregation approach has this monotonicity property.

PROBLEM APPROXIMATION - AGGREGATION

Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem.

The simplification is often ad hoc/problem-dependent.

Aggregation is a systematic approach for problem approximation. Main elements:

  - Introduce a few "aggregate" states, viewed as the states of an "aggregate" system
  - Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states
  - Solve (exactly or approximately) the "aggregate" problem by any kind of VI or PI method (including simulation-based methods)
  - Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.

    AGGREGATION/DISAGGREGATION PROBS

[Figure: original system states i and j, and aggregate states x and y, linked by the disaggregation probabilities d_xi (matrix D), the transition probabilities p_ij(u), and the aggregation probabilities φ_jy (matrix Φ).]

The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices.

For each original system state j and aggregate state y, the aggregation probability φ_jy:

  - Roughly, the "degree of membership" of j in the aggregate state y.
  - In hard aggregation, φ_jy = 1 if state j belongs to aggregate state/subset y.

For each aggregate state x and original system state i, the disaggregation probability d_xi:

  - Roughly, the "degree to which i is representative of x."
  - In hard aggregation, equal d_xi for all states i in the subset x.

    AGGREGATE SYSTEM DESCRIPTION

The transition probability from aggregate state x to aggregate state y under control u is

    p̂_xy(u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) φ_jy,   or   P̂(u) = D P(u) Φ

where the rows of D and Φ are the disaggregation and aggregation probabilities.

The expected transition cost is

    ĝ(x, u) = Σ_{i=1}^{n} d_xi Σ_{j=1}^{n} p_ij(u) g(i, u, j),   or   ĝ = D P g

The optimal cost function of the aggregate problem, denoted R̂, is

    R̂(x) = min_{u ∈ U} [ ĝ(x, u) + α Σ_{y} p̂_xy(u) R̂(y) ],   for all x

This is Bellman's equation for the aggregate problem.

The optimal cost function J* of the original problem is approximated by J̃ given by

    J̃(j) = Σ_{y} φ_jy R̂(y),   for all j
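A minimal Python sketch (with hypothetical data) that builds the aggregate model P̂(u) = D P(u) Φ and ĝ = D [Σ_j p_ij(u) g(i, u, j)] for a hard-aggregation example, solves the aggregate Bellman equation by VI, and interpolates J̃ = Φ R̂. The grouping and all data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    n, n_agg, m, alpha = 6, 2, 2, 0.9
    P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
    G = rng.random((m, n, n))

    # Hard aggregation: states {0,1,2} -> aggregate 0, states {3,4,5} -> aggregate 1
    Phi = np.zeros((n, n_agg)); Phi[:3, 0] = 1.0; Phi[3:, 1] = 1.0   # aggregation probs
    D = np.zeros((n_agg, n)); D[0, :3] = 1/3; D[1, 3:] = 1/3         # disaggregation probs

    P_hat = np.stack([D @ P[u] @ Phi for u in range(m)])             # (m, n_agg, n_agg)
    g_hat = np.stack([D @ np.einsum('ij,ij->i', P[u], G[u]) for u in range(m)])

    R = np.zeros(n_agg)
    for _ in range(500):                               # VI on the aggregate problem
        R = np.min(g_hat + alpha * P_hat @ R, axis=0)
    J_tilde = Phi @ R                                  # approximation of J*
    print("R_hat =", R, "  J_tilde =", J_tilde)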

    EXAMPLE I: HARD AGGREGATION

Group the original system states into subsets, and view each subset as an aggregate state.

Aggregation probabilities: φ_jy = 1 if j belongs to aggregate state y.

Disaggregation probabilities: There are many possibilities, e.g., all states i within aggregate state x have equal probability d_xi.

If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. This suggests grouping states with roughly equal cost into aggregates.

A variant: Soft aggregation (provides "soft boundaries" between aggregate states).

[Figure: a 3 x 3 grid of states 1-9 grouped into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, with]

    Φ = [ 1 0 0 0
          1 0 0 0
          0 1 0 0
          1 0 0 0
          1 0 0 0
          0 1 0 0
          0 0 1 0
          0 0 1 0
          0 0 0 1 ]

EXAMPLE II: FEATURE-BASED AGGREGATION

Important question: How do we group states together?

If we know good features, it makes sense to group together states that have similar features.

[Figure: states are mapped by feature extraction to features, which are then mapped to aggregate states.]

A general approach for passing from a feature-based state representation to an aggregation-based architecture.

Essentially discretize the features and generate a corresponding piecewise constant approximation to the optimal cost function.

An aggregation-based architecture is more powerful (nonlinear in the features) ...

... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture.

    EXAMPLE III: REP. STATES/COARSE GRID

Choose a collection of "representative" original system states, and associate each one of them with an aggregate state.

[Figure: original state space with representative/aggregate states y1, y2, y3 and an original state j with nearby representatives j1, j2, j3.]

Disaggregation probabilities are d_xi = 1 if i is equal to representative state x.

Aggregation probabilities associate original system states with convex combinations of representative states:

    j  ≈  Σ_{y ∈ A} φ_jy y

Well-suited for Euclidean space discretization.

Extends nicely to continuous state space, including the belief space of POMDPs.

    EXAMPLE IV: REPRESENTATIVE FEATURES

Here the aggregate states are nonempty subsets of original system states (but need not form a partition of the state space).

Example: Choose a collection of distinct "representative" feature vectors, and associate each of them with an aggregate state consisting of original system states with similar features.

Restrictions:

  - The aggregate states/subsets are disjoint.
  - The disaggregation probabilities satisfy d_xi > 0 if and only if i ∈ x.
  - The aggregation probabilities satisfy φ_jy = 1 for all j ∈ y.

If every original system state i belongs to some aggregate state, we obtain hard aggregation.

If every aggregate state consists of a single original system state, we obtain aggregation with representative states.

With the above restrictions DΦ = I, so (ΦD)(ΦD) = ΦD, and ΦD is an oblique projection (an orthogonal projection in the case of hard aggregation).

    APPROXIMATE PI BY AGGREGATION

[Figure: original system states i, j and aggregate states x, y, linked by the disaggregation probabilities d_xi, the transition probabilities p_ij(u) with costs g(i, u, j), and the aggregation probabilities φ_jy; the aggregate quantities are p̂_xy(u) = Σ_i d_xi Σ_j p_ij(u) φ_jy and ĝ(x, u) = Σ_i d_xi Σ_j p_ij(u) g(i, u, j).]

Consider approximate policy iteration for the original problem, with policy evaluation done by aggregation.

Evaluation of policy μ: J̃ = ΦR, where R = DT_μ(ΦR) (R is the vector of costs of aggregate states for μ). Can be done by simulation.

Looks like the projected equation ΦR = ΠT_μ(ΦR) (but with ΦD in place of Π).

Advantages: It has no problem with exploration or with oscillations.

Disadvantage: The rows of D and Φ must be probability distributions.

    DISTRIBUTED AGGREGATION I

We consider decomposition/distributed solution of large-scale discounted DP problems by aggregation.

Partition the original system states into subsets S_1, ..., S_m.

Each subset S_ℓ, ℓ = 1, ..., m:

  - Maintains detailed/exact local costs J(i) for every original system state i ∈ S_ℓ, using aggregate costs of the other subsets
  - Maintains an aggregate cost R(ℓ) = Σ_{i ∈ S_ℓ} d_ℓi J(i)
  - Sends R(ℓ) to the other aggregate states

J(i) and R(ℓ) are updated by VI according to

    J_{k+1}(i) = min_{u ∈ U(i)} H_ℓ(i, u, J_k, R_k),   i ∈ S_ℓ

with R_k being the vector of R(ℓ) at time k, and

    H_ℓ(i, u, J, R) = Σ_{j=1}^{n} p_ij(u) g(i, u, j) + α Σ_{j ∈ S_ℓ} p_ij(u) J(j) + α Σ_{ℓ' ≠ ℓ} Σ_{j ∈ S_ℓ'} p_ij(u) R(ℓ')

    DISTRIBUTED AGGREGATION II

Can show that this iteration involves a sup-norm contraction mapping of modulus α, so it converges to the unique solution of the system of equations in (J, R):

    J(i) = min_{u ∈ U(i)} H_ℓ(i, u, J, R),   R(ℓ) = Σ_{i ∈ S_ℓ} d_ℓi J(i),   i ∈ S_ℓ,  ℓ = 1, ..., m.

This follows from the fact that {d_ℓi | i = 1, ..., n} is a probability distribution.

View these equations as a set of Bellman equations for an "aggregate" DP problem. The difference is that the mapping H involves J(j) rather than R(x(j)) for j ∈ S_ℓ.

In an asynchronous version of the method, the aggregate costs R(ℓ) may be outdated to account for communication "delays" between aggregate states.

Convergence can be shown using the general theory of asynchronous distributed computation (see the text).

    6.231 DYNAMIC PROGRAMMING

    LECTURE 6

    LECTURE OUTLINE

Review of Q-factors and Bellman equations for Q-factors

    VI and PI for Q-factors

    Q-learning - Combination of VI and sampling

    Q-learning and cost function approximation

    Approximation in policy space


    DISCOUNTED MDP

System: Controlled Markov chain with states i = 1, ..., n and finite set of controls u ∈ U(i)

Transition probabilities: p_ij(u)

[Figure: two-state transition diagram labeled with the probabilities p_ij(u).]

Cost of a policy π = {μ_0, μ_1, ...} starting at state i:

    J_π(i) = lim_{N→∞} E { Σ_{k=0}^{N-1} α^k g(i_k, μ_k(i_k), i_{k+1}) | i_0 = i }

with α ∈ [0, 1)

Shorthand notation for DP mappings:

    (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ(j) ),   i = 1, ..., n

    (T_μ J)(i) = Σ_{j=1}^{n} p_ij(μ(i)) ( g(i, μ(i), j) + αJ(j) ),   i = 1, ..., n

    THE TWO MAIN ALGORITHMS: VI AND PI

Value iteration: For any J ∈ ℝ^n,

    J*(i) = lim_{k→∞} (T^k J)(i),   i = 1, ..., n

Policy iteration: Given μ^k,

  - Policy evaluation: Find J_{μ^k} by solving

        J_{μ^k}(i) = Σ_{j=1}^{n} p_ij(μ^k(i)) ( g(i, μ^k(i), j) + αJ_{μ^k}(j) ),   i = 1, ..., n

    or J_{μ^k} = T_{μ^k} J_{μ^k}

  - Policy improvement: Let μ^{k+1} be such that

        μ^{k+1}(i) ∈ arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ_{μ^k}(j) ),   i = 1, ..., n

    or T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}

We discussed approximate versions of VI and PI using projection and aggregation.

We focused so far on cost functions and their approximation. We now consider Q-factors.

    BELLMAN EQUATIONS FOR Q-FACTORS

The optimal Q-factors are defined by

    Q*(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ*(j) ),   for all (i, u)

Since J* = TJ*, we have J*(i) = min_{u ∈ U(i)} Q*(i, u), so the optimal Q-factors solve the equation

    Q*(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = FQ*, where

    (FQ)(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )

This is Bellman's equation for a system whose states are the pairs (i, u).

Similar mapping F_μ and Bellman equation for a policy μ: Q_μ = F_μ Q_μ.

[Figure: transition diagram over state-control pairs (i, u); from (i, u) the system moves to j with probability p_ij(u) and cost g(i, u, j), and then to the pair (j, μ(j)) under a fixed policy μ.]
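A minimal Python sketch of the mapping F and value iteration Q ← FQ, again assuming the model is stored as arrays P[u, i, j] and G[u, i, j]; purely illustrative.

    import numpy as np

    def F(Q, P, G, alpha):
        """(F Q)(i,u) = sum_j p_ij(u) (g(i,u,j) + alpha * min_u' Q(j,u'))."""
        J = Q.min(axis=1)                                   # J(j) = min over controls
        expected_cost = np.einsum('uij,uij->iu', P, G)      # sum_j p_ij(u) g(i,u,j)
        return expected_cost + alpha * np.einsum('uij,j->iu', P, J)

    rng = np.random.default_rng(5)
    m, n, alpha = 3, 5, 0.9
    P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
    G = rng.random((m, n, n))
    Q = np.zeros((n, m))                                    # Q(i, u)
    for _ in range(300):                                    # VI for Q-factors
        Q = F(Q, P, G, alpha)
    print("J* from Q*:", Q.min(axis=1))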

SUMMARY OF BELLMAN EQS FOR Q-FACTORS

Optimal Q-factors: For all (i, u),

    Q*(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q*(j, u') )

Equivalently Q* = FQ*, where

    (FQ)(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α min_{u' ∈ U(j)} Q(j, u') )

Q-factors of a policy μ: For all (i, u),

    Q_μ(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αQ_μ(j, μ(j)) )

Equivalently Q_μ = F_μ Q_μ, where

    (F_μ Q)(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αQ(j, μ(j)) )

[Figure: transition diagram over state-control pairs (i, u) under a fixed policy, as in the previous slide.]

WHAT IS GOOD AND BAD ABOUT Q-FACTORS

All the exact theory and algorithms for costs applies to Q-factors:

  - Bellman's equations, contractions, optimality conditions, convergence of VI and PI

All the approximate theory and algorithms for costs applies to Q-factors:

  - Projected equations, sampling and exploration issues, oscillations, aggregation

A MODEL-FREE (on-line) controller implementation:

  - Once we calculate Q*(i, u) for all (i, u),

        μ*(i) = arg min_{u ∈ U(i)} Q*(i, u),   for all i

  - Similarly, once we calculate a parametric approximation Q̃(i, u, r) for all (i, u),

        μ̃(i) = arg min_{u ∈ U(i)} Q̃(i, u, r),   for all i

The main bad thing: Greater dimension and more storage! [Can be used for large-scale problems only through aggregation, or other cost function approximation.]

    Q-LEARNING

In addition to the approximate PI methods adapted for Q-factors, there is an important additional algorithm:

  - Q-learning, which can be viewed as a sampled form of VI

Q-learning algorithm (in its classical form):

  - Sampling: Select a sequence of pairs (i_k, u_k) (use any probabilistic mechanism for this, but all pairs (i, u) are chosen infinitely often).
  - Iteration: For each k, select j_k according to p_{i_k j}(u_k). Update just Q(i_k, u_k):

        Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )

    Leave unchanged all other Q-factors: Q_{k+1}(i, u) = Q_k(i, u) for all (i, u) ≠ (i_k, u_k).
  - Stepsize conditions: γ_k must converge to 0 at a proper rate (e.g., like 1/k).
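A minimal Python sketch of classical tabular Q-learning with a simulator; the uniform sampling of pairs (i_k, u_k) and the 1/(number of visits) stepsize are illustrative choices consistent with the conditions above.

    import numpy as np

    rng = np.random.default_rng(6)
    m, n, alpha = 3, 5, 0.9
    P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # simulator model
    G = rng.random((m, n, n))

    Q = np.zeros((n, m))
    visits = np.zeros((n, m))
    for k in range(100_000):
        i, u = rng.integers(n), rng.integers(m)   # all pairs sampled infinitely often
        j = rng.choice(n, p=P[u, i])              # simulate next state j_k and its cost
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]                # stepsize -> 0 like 1/k per pair
        target = G[u, i, j] + alpha * Q[j].min()
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target   # update only Q(i_k, u_k)
    print("Q-learning estimate of J*:", Q.min(axis=1))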

NOTES AND QUESTIONS ABOUT Q-LEARNING

    Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k ( g(i_k, u_k, j_k) + α min_{u' ∈ U(j_k)} Q_k(j_k, u') )

Model-free implementation. We just need a simulator that, given (i, u), produces the next state j and the cost g(i, u, j).

Operates on only one state-control pair at a time. Convenient for simulation, no restrictions on the sampling method.

Aims to find the (exactly) optimal Q-factors.

Why does it converge to Q*?

Why can't I use a similar algorithm for optimal costs?

Important mathematical (fine) point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the cost version of Bellman's equation:

    J*(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αJ*(j) )

    CONVERGENCE ASPECTS OF Q-LEARNING

Q-learning can be shown to converge to the true/exact Q-factors (under mild assumptions).

The proof is sophisticated, based on theories of stochastic approximation and asynchronous algorithms.

It uses the fact that the Q-learning map F:

    (FQ)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }

is a sup-norm contraction.

Generic stochastic approximation algorithm:

  - Consider a generic fixed point problem involving an expectation:

        x = E_w { f(x, w) }

  - Assume E_w { f(x, w) } is a contraction with respect to some norm, so the iteration

        x_{k+1} = E_w { f(x_k, w) }

    converges to the unique fixed point.
  - Approximate E_w { f(x, w) } by sampling.

    STOCH. APPROX. CONVERGENCE IDEAS

For each k, obtain samples {w_1, ..., w_k} and use the approximation

    x_{k+1} = (1/k) Σ_{t=1}^{k} f(x_k, w_t) ≈ E { f(x_k, w) }

This iteration approximates the convergent fixed point iteration x_{k+1} = E_w { f(x_k, w) }.

A major flaw: it requires, for each k, the computation of f(x_k, w_t) for all values w_t, t = 1, ..., k.

This motivates the more convenient iteration

    x_{k+1} = (1/k) Σ_{t=1}^{k} f(x_t, w_t),   k = 1, 2, ...,

that is similar, but requires much less computation; it needs only one value of f per sample w_t.

By denoting γ_k = 1/k, it can also be written as

    x_{k+1} = (1 - γ_k) x_k + γ_k f(x_k, w_k),   k = 1, 2, ...

Compare with Q-learning, where the fixed point problem is Q = FQ:

    (FQ)(i, u) = E_j { g(i, u, j) + α min_{u'} Q(j, u') }
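A toy Python sketch of the iteration x_{k+1} = (1 - γ_k) x_k + γ_k f(x_k, w_k) on a scalar contraction f(x, w) = 0.5x + w with E{w} = 1, whose mean fixed point is x* = 2; the example is illustrative only.

    import numpy as np

    rng = np.random.default_rng(7)
    x = 0.0
    for k in range(1, 200_001):
        w = rng.normal(loc=1.0, scale=1.0)   # noise sample with mean 1
        gamma = 1.0 / k                      # stepsize 1/k
        x = (1 - gamma) * x + gamma * (0.5 * x + w)
    print("x_k =", x, "(true fixed point 2.0)")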

Q-FACTOR APPROXIMATIONS

We introduce basis function approximation:

    Q̃(i, u, r) = φ(i, u)'r

We can use approximate policy iteration and LSPE/LSTD for policy evaluation.

Optimistic policy iteration methods are frequently used on a heuristic basis.

Example: Generate a trajectory {(i_k, u_k) | k = 0, 1, ...}.

At iteration k, given r_k and state/control (i_k, u_k):

  (1) Simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).
  (2) Generate the control u_{k+1} from

        u_{k+1} = arg min_{u ∈ U(i_{k+1})} Q̃(i_{k+1}, u, r_k)

  (3) Update the parameter vector via

        r_{k+1} = r_k - (LSPE or TD-like correction)

Complex behavior, unclear validity (oscillations, etc). There is a solid basis for an important special case: optimal stopping (see the text).

    APPROXIMATION IN POLICY SPACE

We parameterize policies by a vector r = (r_1, ..., r_s) (an approximation architecture for policies).

Each policy μ̃(r) = { μ̃(i; r) | i = 1, ..., n } defines a cost vector J_{μ̃(r)} (a function of r).

We optimize some measure of J_{μ̃(r)} over r.

For example, use a random search, gradient, or other method to minimize over r

    Σ_{i=1}^{n} p_i J_{μ̃(r)}(i),

where (p_1, ..., p_n) is some probability distribution over the states.

An important special case: Introduce a cost approximation architecture V(i, r) that defines indirectly the parameterization of the policies

    μ̃(i; r) = arg min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + αV(j, r) ),   for all i

This brings in features to approximation in policy space.
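A minimal Python sketch of approximation in policy space by plain random search over r, with policies induced by V(j, r) = φ(j)'r and scored by their exact cost on a small random model; everything here is an illustrative toy, not a recommended method.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, s, alpha = 3, 6, 2, 0.9
    P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
    G = rng.random((m, n, n))
    Phi = rng.random((n, s))
    p_weights = np.full(n, 1.0 / n)                    # distribution (p_1, ..., p_n)

    def policy_from(r):
        """Greedy policy induced by the cost architecture V(j, r) = phi(j)' r."""
        Q = np.einsum('uij,uij->ui', P, G) + alpha * P @ (Phi @ r)
        return Q.argmin(axis=0)

    def policy_cost(mu):
        """Exact cost of a stationary policy: J_mu = (I - alpha P_mu)^{-1} g_mu."""
        P_mu = P[mu, np.arange(n)]
        g_mu = np.einsum('ij,ij->i', P_mu, G[mu, np.arange(n)])
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        return p_weights @ J_mu

    best_r, best_score = None, np.inf
    for _ in range(1000):                              # plain random search over r
        r = rng.normal(size=s)
        score = policy_cost(policy_from(r))
        if score < best_score:
            best_r, best_score = r, score
    print("best weighted cost found:", best_score)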

APPROXIMATION IN POLICY SPACE METHODS

Random search methods are straightforward and have scored some impressive successes with challenging problems (e.g., tetris).

Gradient-type methods (known as policy gradient methods) also have been worked on extensively.

They move along the gradient with respect to r of

    Σ_{i=1}^{n} p_i J_{μ̃(r)}(i)

There are explicit gradient formulas which have been approximated by simulation.

Policy gradient methods generally suffer from slow convergence, local minima, and excessive simulation noise.

    FINAL WORDS AND COMPARISONS

There is no clear winner among ADP methods.

There is interesting theory in all types of methods (which, however, does not provide ironclad performance guarantees).

There are major flaws in all methods:

  - Oscillations and exploration issues in approximate PI with projected equations
  - Restrictions on the approximation architecture in approximate PI with aggregation
  - Flakiness of optimization in policy space approximation

Yet these methods have impressive successes to show with enormously complex problems, for which there is no alternative methodology.

There are also other competing ADP methods (rollout is simple, often successful, and generally reliable; approximate LP is worth considering).

Theoretical understanding is important and nontrivial.

Practice is an art and a challenge to our creativity.