ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 15: Partially Observable Markov Decision Processes (POMDPs)
Dr. Itamar Arel
College of Engineering
Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2015
November 5, 2015
ECE 517 – Reinforcement Learning in AI
Outline

Why use POMDPs?
Formal definition
Belief state
Value function
To introduce POMDPs, let us consider an example where an agent learns to drive a car in New York City.
The agent can look forward, backward, left or right.
It cannot change speed, but it can steer into the lane it is looking at.
The different types of observations are:
the direction in which the agent's gaze is directed
the closest object in the agent's gaze
whether the object is looming or receding
the color of the object
whether a horn is sounding
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind.
POMDP Example

The agent is in control of the middle car.
The car behind is fast and will not slow down.
The car ahead is slower.
To avoid a crash, the agent must steer right.
However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash.
The agent basically needs to learn how the observations might aid its performance.
POMDP Example (cont.)

This is not easy when the agent has no explicit goals beyond "performing well".
There are no explicit training patterns such as "if there is a car ahead and left, steer right."
However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs).
The agent is penalized for colliding with other cars or the road shoulder.
The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward.
POMDP Example (cont.)

Two significant problems make it difficult to learn under these conditions:

Temporal credit assignment –
If our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? This is generally the same as in MDPs.

Partial observability –
If the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right. However, when it looks to the right it has no sensory information regarding what goes on elsewhere.

To solve the latter, the agent needs memory – it creates knowledge of the state of the world around it.
Forms of Partial Observability

Partial observability coarsely pertains to either:
Lack of important state information in observations – must be compensated for using memory
Extraneous information in observations – the agent needs to learn to ignore it

In our example:
The color of the car in its gaze is extraneous (unless red cars really drive faster)
The agent needs to build a memory-based model of the world in order to accurately predict what will happen
This creates "belief state" information (we'll see later)

If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
It can choose optimal actions without memory
The Markov property holds – i.e., the future state of the world is simply a function of the current state and action
Modeling the world as a POMDP

Our setting is that of an agent taking actions in a world according to its policy.
The agent still receives feedback about its performance through a scalar reward received at each time step.

Formally stated, a POMDP consists of:
|S| states S = {1, 2, …, |S|} of the world
|U| actions (or controls) U = {1, 2, …, |U|} available to the policy
|Y| observations Y = {1, 2, …, |Y|}
a (possibly stochastic) reward r(i) for each state i in S
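As a sketch, the tuple above can be held in a small container. The class name and array layout are my own conventions, and the transition model T(s, a, s') and observation model O(s', a, o) anticipate the quantities used later in the lecture; none of this is fixed by the slide:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """The components listed above, plus the transition and
    observation models used in the rest of the lecture."""
    T: np.ndarray    # T[s, a, s'] = Pr(s' | s, a), shape |S| x |U| x |S|
    O: np.ndarray    # O[s', a, o] = Pr(o | s', a), shape |S| x |U| x |Y|
    r: np.ndarray    # r[i] = (mean) reward for state i, shape |S|

    def validate(self):
        # every conditional distribution must sum to one
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=2), 1.0)

# A trivial world: 2 states, 1 action, 2 observations
m = POMDP(T=np.eye(2).reshape(2, 1, 2),
          O=np.array([[[0.85, 0.15]],
                      [[0.15, 0.85]]]),
          r=np.array([0.0, 1.0]))
m.validate()
```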
Modeling the world as a POMDP (cont.)
MDPs vs. POMDPs

In an MDP there is one observation for each state:
The concepts of observation and state are interchangeable
A memoryless policy that does not make use of internal state suffices

In POMDPs, different states may have similar probability distributions over observations:
Different states may look the same to the agent
For this reason, POMDPs are said to have hidden state

Two hallways may look the same to a robot's sensors:
Optimal action for the first: take left
Optimal action for the second: take right
A memoryless policy cannot distinguish between the two
MDPs vs. POMDPs (cont.)

Noise can create ambiguity in state inference:
An agent's sensors are always limited in the amount of information they can pick up

One way of overcoming this is to add sensors:
Specific sensors that help it to "disambiguate" hallways
Only when possible, affordable or desirable

In general, we are now considering agents that need to be proactive (also called "anticipatory"):
Not only react to environmental stimuli
Self-create context using memory

POMDP problems are harder to solve, but represent realistic scenarios.
POMDP solution techniques – model-based methods

If an exact model of the environment is available, POMDPs can (in theory) be solved:
i.e., an optimal policy can be found
As with model-based MDPs, it is not so much a learning problem:
No real "learning", or trial and error, taking place
No exploration/exploitation dilemma
Rather a probabilistic planning problem: find the optimal policy

In POMDPs the above is broken into two elements:
Belief state computation, and
Value function computation based on belief states
The belief state

Instead of maintaining the complete action/observation history, we maintain a belief state b.
The belief state is a probability distribution over the states, given the observation history.
Dim(b) = |S| − 1
The belief space is the entire probability space.
We'll use a two-state POMDP as a running example:
Probability of being in state one = p; probability of being in state two = 1 − p
Therefore, the entire space of belief states can be represented as a line segment.
The belief space

Here is a representation of the belief space when we have two states (s0, s1).
The belief space (cont.)

The belief space is continuous, but we only visit a countable number of belief points.
Assumptions:
Finite action set
Finite observation set
The next belief state is b' = f(b, a, o), where:
b: current belief state, a: action, o: observation
The Tiger Problem

Standing in front of two closed doors.
The world is in one of two states: the tiger is behind the left door or the right door.
Three actions: open left door, open right door, listen.
Listening is not free, and not accurate (it may give wrong information).
Reward: open the wrong door and get eaten by the tiger (large −r); open the correct door and get a prize (small +r).

Two states: SL and SR (the tiger is really behind the left or right door).
Three actions: LEFT, RIGHT, LISTEN.
Transition probabilities: listening does not change the tiger's position.
Each episode is a "reset".
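A minimal sketch of the tiger problem's listening dynamics. The 85% listening accuracy is the classic value from the POMDP literature (the slide only says listening is inaccurate), and all names are illustrative:

```python
import numpy as np

S = ["tiger-left", "tiger-right"]          # SL, SR
A = ["open-left", "open-right", "listen"]  # LEFT, RIGHT, LISTEN

# Listening leaves the tiger where it is: Pr(s' | s, LISTEN) = I
T_listen = np.eye(2)
# Hypothetical 85% listening accuracy (the classic choice)
O_listen = np.array([[0.85, 0.15],   # tiger-left: mostly "hear-left"
                     [0.15, 0.85]])  # tiger-right: mostly "hear-right"

b = np.array([0.5, 0.5])             # start with no idea where the tiger is
o = 0                                # we hear the tiger on the left
unnorm = O_listen[:, o] * (b @ T_listen)
b = unnorm / unnorm.sum()
print(b)                             # belief shifts toward tiger-left
```

After a single accurate-sounding roar, the belief moves from [0.5, 0.5] to [0.85, 0.15]; listening again and hearing the same side would push it further toward certainty.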
POMDP Policy Tree (Fake Policy)

[Figure: an example policy tree. From a starting belief state (tiger-left probability 0.3), the root action is Listen; each observation ("tiger roar left" or "tiger roar right") leads to a new belief state, from which the next action is chosen (Listen, or Open left door), and so on.]
POMDP Policy Tree (cont.)

[Figure: a generic policy tree. The root action A1 branches on observations o1, o2, o3, … to the next actions A2, A3, A4, …, and so on down to the horizon.]
How many POMDP policies are possible?

[Figure: the same generic policy tree, with action nodes A1, …, A8 and observation branches o1, …, o6.]

How many policy trees are there, given |A| actions, |O| observations and horizon T?
How many nodes in a tree: 1 + |O| + |O|² + … + |O|^(T−1) = (|O|^T − 1) / (|O| − 1)
Each node chooses one of |A| actions, so there are |A|^((|O|^T − 1)/(|O| − 1)) distinct policy trees.
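The doubly exponential growth in the number of policy trees can be made concrete in a few lines; the counting assumes a complete tree with one action node per observation history:

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """Count the distinct policy trees of a given horizon."""
    # one root node, then one node for every observation history
    nodes = sum(num_obs ** t for t in range(horizon))  # (|O|^T - 1) / (|O| - 1)
    return num_actions ** nodes

# The tiger problem has |A| = 3 actions and |O| = 2 observations
for T in (1, 2, 3, 4):
    print(T, num_policy_trees(3, 2, T))
# horizon 1: 3 trees; horizon 2: 27; horizon 3: 2187; horizon 4: 14348907
```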
b'(s') = Pr(s' | o, a, b) = Pr(s' ∧ o ∧ a ∧ b) / Pr(o ∧ a ∧ b)

= [Pr(o | s', a, b) Pr(s' | a, b) Pr(a ∧ b)] / [Pr(o | a, b) Pr(a ∧ b)]

= Pr(o | s', a) Pr(s' | a, b) / Pr(o | a, b)

We will not repeat Pr(o | a, b) in the next slide, but assume it is there!
It is treated as a normalizing factor, so that b' sums to 1.

Pr(o | s', a) Pr(s' | a, b) = O(s', a, o) Pr(s' | a, b)
= O(s', a, o) Σ_s Pr(s' | a, b, s) Pr(s | a, b)
= O(s', a, o) Σ_s Pr(s' | a, b, s) b(s)    ; since Pr(s | a, b) = Pr(s | b) = b(s)
= O(s', a, o) Σ_s T(s, a, s') b(s)

(Please work out some of the details at home!)
Belief State

Overall formula:

b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b)

The belief state is updated proportionally to:
the probability of seeing the current observation given state s', and
the probability of arriving at state s' given the action and our previous belief state b.
The above are all given by the model.
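The overall formula translates directly into a few lines of code. This sketch assumes the model is stored as arrays T[s, a, s'] and O[s', a, o], which is my own layout choice:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)."""
    pred = b @ T[:, a, :]            # Pr(s' | a, b): predicted next-state dist.
    unnorm = O[:, a, o] * pred       # weight by the observation likelihood
    return unnorm / unnorm.sum()     # the normalizer is Pr(o | a, b)

# Two-state sanity check with a hypothetical model: the state never
# changes, and the single observation is 85% accurate.
T = np.eye(2).reshape(2, 1, 2)       # T[s, 0, s'] = identity
O = np.array([[[0.85, 0.15]],
              [[0.15, 0.85]]])       # O[s', 0, o]
b = belief_update(np.array([0.5, 0.5]), 0, 0, T, O)
print(b)                             # → [0.85 0.15]
```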
Belief State (cont.)

Let's look at an example:
Consider a robot that is initially completely uncertain about its location.
Seeing a door may, as specified by the model, occur in three different locations.
Suppose that the robot takes an action and observes a T-junction.
It may be that, given the action, only one of the three states could have led to an observation of a T-junction.
The agent now knows with certainty which state it is in.
Not in all cases does the uncertainty disappear like that.
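The hallway example can be sketched with made-up numbers; the three locations, the deterministic forward move, and the T-junction likelihoods below are all hypothetical:

```python
import numpy as np

# Three candidate locations; the robot starts completely uncertain
b = np.array([1/3, 1/3, 1/3])

# Moving forward shifts each location to the next one (deterministic here)
T_forward = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [1, 0, 0]], dtype=float)

# Pr(T-junction | s'): only one resulting location shows a T-junction
p_t_junction = np.array([0.0, 0.0, 1.0])

pred = b @ T_forward                 # belief after the move, before sensing
unnorm = p_t_junction * pred
b = unnorm / unnorm.sum()
print(b)                             # all mass collapses onto one location
```

Because the observation is impossible in two of the three predicted locations, normalization zeroes them out and the belief becomes certain in one step, just as the slide describes.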
Finding an optimal policy

The policy component of a POMDP agent must map the current belief state into an action.
It turns out that the maintained belief state is a sufficient statistic (i.e., Markovian):
we cannot do better even if we remembered the entire history of observations and actions.

We have now transformed the POMDP into an MDP.
Good news: we have ways of solving those (GPI algorithms).
Bad news: the belief-state space is continuous!
Value function

The belief state is the input to the second component of the method: the value function computation.
The belief state is a point in a continuous space of N − 1 dimensions!
The value function must be defined over this infinite space.
Direct application of dynamic programming techniques is infeasible.
Value function (cont.)

Let's assume only two states: s1 and s2.
The belief state [0.25 0.75] indicates b(s1) = 0.25, b(s2) = 0.75.
With two states, b(s1) is sufficient to describe the belief state: b(s2) = 1 − b(s1).

[Figure: V(b) plotted over the belief segment from b = [1, 0] (state S1) through [0.5, 0.5] to b = [0, 1] (state S2).]
Piecewise Linear and Convex (PWLC)

It turns out that the value function is, or can be accurately approximated by, a piecewise-linear and convex function.
Intuition on convexity: being certain of a state yields high value, whereas uncertainty lowers the value.

[Figure: a convex, piecewise-linear V(b) over the belief segment from b = [1, 0] (S1) to b = [0, 1] (S2), highest near the endpoints and lowest near [0.5, 0.5].]
Why does PWLC help?

We can directly work with regions (intervals) of belief space!
The vectors are policies, and indicate the right action to take in each region of the space.

[Figure: three linear pieces Vp1, Vp2 and Vp3 over the belief segment from S1 = [1, 0] to S2 = [0, 1]; their upper surface is V(b), and each piece dominates in its own interval: region 1, region 2, region 3.]
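Evaluating a PWLC value function is just a max over dot products. The alpha-vectors and action labels below are invented for illustration, not taken from the slides:

```python
import numpy as np

# Each alpha-vector is one linear piece of V(b), tagged with an action
alphas = np.array([[10.0, 0.0],      # best when nearly sure of s1
                   [ 6.0, 6.5],      # information-gathering middle region
                   [ 0.0, 10.0]])    # best when nearly sure of s2
actions = ["act-for-s1", "listen", "act-for-s2"]

def value_and_action(b):
    """V(b) = max_k alpha_k . b; the argmax also names the region's action."""
    vals = alphas @ b
    k = int(np.argmax(vals))
    return vals[k], actions[k]

for p in (0.9, 0.5, 0.1):            # belief b = [p, 1 - p]
    v, a = value_and_action(np.array([p, 1.0 - p]))
    print(p, round(v, 2), a)
```

Sweeping p across [0, 1] traces out the upper surface of the three lines, and the action switches exactly at the region boundaries, which is why the vectors themselves encode the policy.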
Summary

POMDPs → better modeling of realistic scenarios.
They rely on belief states that are derived from observations and actions.
A POMDP can be transformed into an MDP over belief states, with PWLC functions used for value function approximation.