Partially Observed Markov Decision Processes - From Filtering to Stochastic Control
Prof. Vikram Krishnamurthy
Dept. Electrical & Computer Engineering, Univ. of British Columbia
email: [email protected]

These lecture notes contain all the transparencies that will be used during the lectures.
Figure 11.1: Partially Observed Markov Decision Process (POMDP) schematic setup. Note that the Markovian system together with the noisy sensor constitute a Hidden Markov Model (HMM). The HMM filter computes the posterior (belief state) $\pi_k$ of the state of the Markov chain. The controller then chooses the action $u_k$ based on $\pi_k$.
analyze the structure of the dynamic programming equation to figure out the structure of the optimal policy.
Define the posterior distribution of the Markov chain as
$$\pi_k(i) = P(x_k = i \mid I_k), \quad i \in X, \quad \text{where } I_k = (\pi_0, u_0, y_1, \ldots, u_{k-1}, y_k).$$
We will call $\pi_k$ the belief state or information state at time $k$. Recall from Chapter 3 that this is computed via the HMM filter of Algorithm 3, namely
$$\pi_{k+1} = T(\pi_k, y_{k+1}, u_k) = \frac{B_{y_{k+1}}(u_k)\, P'(u_k)\, \pi_k}{\sigma(\pi_k, y_{k+1}, u_k)}, \tag{11.2}$$
where $\sigma(\pi_k, y_{k+1}, u_k) = \mathbf{1}' B_{y_{k+1}}(u_k)\, P'(u_k)\, \pi_k$.
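As a concrete illustration, here is a minimal sketch of the HMM filter update (11.2) in Python. The notation follows the text ($P'(u)$ is the transpose of the transition matrix, $B_{y}(u)$ is the diagonal matrix of observation likelihoods), but the array shapes and the helper name `hmm_filter_update` are illustrative assumptions, not part of the text.

```python
import numpy as np

def hmm_filter_update(pi, y, u, P, B):
    """One step of the HMM filter, equation (11.2).

    pi : belief state at time k, shape (X,)
    y  : observation index y_{k+1}
    u  : action index u_k
    P  : transition matrices, shape (U, X, X); P[u][i, j] = P(x_{k+1}=j | x_k=i, u_k=u)
    B  : observation likelihoods, shape (U, X, Y); B[u][i, y] = P(y | x=i, u)
    """
    # Unnormalized update: B_{y_{k+1}}(u_k) P'(u_k) pi_k
    unnormalized = B[u][:, y] * (P[u].T @ pi)
    sigma = unnormalized.sum()          # normalization constant sigma(pi, y, u)
    return unnormalized / sigma, sigma
```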
$$
\begin{aligned}
J_\mu(\pi_0) &= E_\mu\Big\{ \sum_{k=0}^{N-1} c(x_k, u_k, k) + c(x_N, N) \,\Big|\, \pi_0 \Big\} \\
&= E_\mu\Big\{ \sum_{k=0}^{N-1} E\{c(x_k, u_k, k) \mid I_k\} + E\{c(x_N, N) \mid I_N\} \,\Big|\, \pi_0 \Big\} \\
&= E_\mu\Big\{ \sum_{k=0}^{N-1} \sum_{i=1}^{X} c(i, u_k, k)\, \pi_k(i) + \sum_{i=1}^{X} c(i, N)\, \pi_N(i) \,\Big|\, \pi_0 \Big\} \\
&= E_\mu\Big\{ \sum_{k=0}^{N-1} c_{u_k}'(k)\, \pi_k + c'(N)\, \pi_N \,\Big|\, \pi_0 \Big\}
\end{aligned} \tag{11.3}
$$
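The last equality in (11.3) shows that the expected cost is linear in the belief state: each stage cost collapses to the inner product $c_{u_k}'(k)\,\pi_k$. Below is a minimal sketch of this reduction along a single action/observation path, assuming for simplicity a time-invariant stage cost; the array names and shapes are illustrative assumptions.

```python
import numpy as np

def expected_cost(pi0, actions, observations, P, B, c, cN):
    """Accumulate c_{u_k}' pi_k along one action/observation path, plus terminal cost.

    c  : stage costs, c[u][i] = c(i, u)   (time-invariant for simplicity)
    cN : terminal costs, cN[i] = c(i, N)
    """
    pi, total = np.asarray(pi0, float), 0.0
    for u, y in zip(actions, observations):
        total += c[u] @ pi                    # stage cost c_u' pi_k
        unnorm = B[u][:, y] * (P[u].T @ pi)   # HMM filter update (11.2)
        pi = unnorm / unnorm.sum()
    return total + cN @ pi                    # terminal cost c'(N) pi_N
```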
Ex 4: Optimal Observer Trajectory
Suppose the observer can move. The model is
$$x_k = A_k x_{k-1} + w_k, \qquad y_k = f(x_k, u_k) + v_k.$$
What is the optimal observer trajectory $\{u_k\}$? That is, compute the trajectory $\{u_k\}$ that minimizes
$$J = \sum_{k=1}^{N} E\big\{ [x_k - \hat{x}_k(u_k)]^2 \big\},$$
where $\hat{x}_k(u_k)$ denotes the estimate of $x_k$ based on the observations gathered along the trajectory. This is a partially observed stochastic control problem.
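A minimal sketch of how one might evaluate the cost $J$ for a candidate trajectory: assume (purely for illustration, not from the text) a scalar state and a measurement that is linear in the state, $y_k = h(u_k) x_k + v_k$. Then $E\{[x_k - \hat{x}_k]^2\}$ is the Kalman filter error variance, which evolves deterministically, so $J$ can be computed without simulating data. The gain $h(u) = 1/(1+u^2)$ (signal quality decaying with observer distance $u$) and all parameter values are assumptions.

```python
def trajectory_cost(us, a=0.9, q=1.0, r=0.1, sigma0=1.0):
    """J = sum of posterior error variances Sigma_k along the trajectory us.

    Scalar model: x_k = a x_{k-1} + w_k, y_k = h(u_k) x_k + v_k,
    with w_k ~ N(0, q), v_k ~ N(0, r). For a linear measurement the MSE
    E{[x_k - xhat_k]^2} equals the Kalman error variance, data-independent.
    """
    def h(u):
        return 1.0 / (1.0 + u**2)   # illustrative: gain decays with distance u

    sigma, J = sigma0, 0.0
    for u in us:
        pred = a**2 * sigma + q                      # predicted variance
        gain = pred * h(u) / (h(u)**2 * pred + r)    # Kalman gain
        sigma = (1 - gain * h(u)) * pred             # posterior variance
        J += sigma
    return J
```

One could then search over candidate trajectories $\{u_k\}$ (by enumeration, gradient methods, etc.) to minimize $J$; the point of the example is that the control affects only the quality of the observations, not the state dynamics.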
Figure 9.2: Monotone increasing threshold policy $\mu^*(x)$. Here, $x^*$ is the threshold state at which the policy switches from 1 to 2.
for some fixed value $x^* \in X$. Such a policy will be called a threshold policy and $x^*$ will be called the threshold state. If one can prove that the optimal policy is a threshold policy, then one only needs to compute the threshold state $x^*$ at which the optimal policy switches from 1 to 2. In other words, computation of $\mu^*(x)$ reduces to computing the single point $x^*$, which in many cases can be done more efficiently than computing an arbitrary policy $\mu$. Fig. 9.2 illustrates such a threshold policy.
Recall from Bellman's equation that the optimal policy is $\mu^*(x) = \operatorname{argmin}_u Q(x, u)$. What are sufficient conditions to ensure that the optimal policy $\mu^*(x)$ is increasing in $x$? The answer to this question lies in the area of monotone comparative statics - which studies how the argmin or argmax of a function behaves as one of the variables changes. In particular, we want conditions such that $\mu^*(x) = \operatorname{argmin}_u Q(x, u)$ is increasing in $x$. Below we will show that $Q(x, u)$ being submodular in $(x, u)$ is a sufficient condition for $\mu^*(x)$ to increase in $x$. Since $Q(x, u)$ is the conditional expectation of the cost to go given the current state, in order to show that $Q(x, u)$ is submodular, we need to characterize how expectations vary as the state varies. For this we will use stochastic dominance.
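As a quick numerical illustration of this comparative-statics claim, the following sketch checks whether a $Q(x, u)$ table satisfies the submodularity inequality $Q(x+1, u+1) + Q(x, u) \le Q(x+1, u) + Q(x, u+1)$ and extracts $\mu^*(x) = \operatorname{argmin}_u Q(x, u)$. The $Q$ table here is a made-up example, not one from the text.

```python
import numpy as np

def is_submodular(Q):
    """Check Q(x+1,u+1) + Q(x,u) <= Q(x+1,u) + Q(x,u+1) for all x, u."""
    return np.all(Q[1:, 1:] + Q[:-1, :-1] <= Q[1:, :-1] + Q[:-1, 1:] + 1e-12)

def optimal_policy(Q):
    """mu*(x) = argmin_u Q(x, u): row-wise argmin of the Q table."""
    return Q.argmin(axis=1)

# Made-up example: Q(x, u) = (x - u)^2 is submodular in (x, u)
x, u = np.arange(5)[:, None], np.arange(3)[None, :]
Q = (x - u) ** 2
assert is_submodular(Q)
print(optimal_policy(Q))   # [0 1 2 2 2] -- increasing in x, as Topkis' result predicts
```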
In the next two subsections we introduce two important tools that will be used to give conditions under which an MDP has monotone optimal policies. The tools are, respectively, submodularity/supermodularity and stochastic dominance.
9.5.1 Submodularity and Supermodularity
Below we introduce the key concepts of submodularity and supermodularity that were championed by Topkis in a series of seminal papers [97] culminating