ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 15: Partially Observable Markov Decision Processes (POMDPs)
Dr. Itamar Arel
College of Engineering
Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2015
November 5, 2015
ECE 517 – Reinforcement Learning in AI
Outline

Why use POMDPs?
Formal definition
Belief state
Value function
To introduce POMDPs, let us consider an example where an agent learns to drive a car in New York City.
The agent can look forward, backward, left or right.
It cannot change speed, but it can steer into the lane it is looking at.
The different types of observations are:
the direction in which the agent's gaze is directed
the closest object in the agent's gaze
whether the object is looming or receding
the color of the object
whether a horn is sounding
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind.
POMDP Example

The agent is in control of the middle car.
The car behind is fast and will not slow down.
The car ahead is slower.
To avoid a crash, the agent must steer right.
However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash.
The agent basically needs to learn how the observations might aid its performance.
POMDP Example (cont.)

This is not easy when the agent has no explicit goals beyond "performing well".
There are no explicit training patterns such as "if there is a car ahead and left, steer right."
However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs).
The agent is penalized for colliding with other cars or the road shoulder.
The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward.
POMDP Example (cont.)

Two significant problems make it difficult to learn under these conditions:

Temporal credit assignment –
If our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? This is generally the same as in MDPs.

Partial observability –
If the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right. However, when it looks to the right it has no sensory information regarding what goes on elsewhere.

To solve the latter, the agent needs memory – it creates knowledge of the state of the world around it.
Forms of Partial Observability

Partial observability coarsely pertains to either:
Lack of important state information in observations – must be compensated for using memory
Extraneous information in observations – the agent needs to learn to ignore it

In our example:
The color of the car in its gaze is extraneous (unless red cars really drive faster)
The agent needs to build a memory-based model of the world in order to accurately predict what will happen
This creates "belief state" information (we'll see later)

If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
It can choose optimal actions without memory
The Markov property holds – i.e., the future state of the world is simply a function of the current state and action
Modeling the world as a POMDP

Our setting is that of an agent taking actions in a world according to its policy.
The agent still receives feedback about its performance through a scalar reward received at each time step.

Formally stated, a POMDP consists of:
|S| states S = {1, 2, …, |S|} of the world
|U| actions (or controls) U = {1, 2, …, |U|} available to the policy
|Y| observations Y = {1, 2, …, |Y|}
a (possibly stochastic) reward r(i) for each state i in S
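As a sketch, the tuple above can be held in a small container. The class name and array layout are my own conventions, and the transition model T(s, a, s') and observation model O(s', a, o) anticipate the quantities used later in the lecture; none of this is fixed by the slide:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """The components listed above, plus the transition and
    observation models used in the rest of the lecture."""
    T: np.ndarray    # T[s, a, s'] = Pr(s' | s, a), shape |S| x |U| x |S|
    O: np.ndarray    # O[s', a, o] = Pr(o | s', a), shape |S| x |U| x |Y|
    r: np.ndarray    # r[i] = (mean) reward for state i, shape |S|

    def validate(self):
        # every conditional distribution must sum to one
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=2), 1.0)

# A trivial world: 2 states, 1 action, 2 observations
m = POMDP(T=np.eye(2).reshape(2, 1, 2),
          O=np.array([[[0.85, 0.15]],
                      [[0.15, 0.85]]]),
          r=np.array([0.0, 1.0]))
m.validate()
```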
Modeling the world as a POMDP (cont.)
MDPs vs. POMDPs

In an MDP there is one observation for each state:
The concepts of observation and state are interchangeable
A memoryless policy that does not make use of internal state suffices

In POMDPs, different states may have similar probability distributions over observations:
Different states may look the same to the agent
For this reason, POMDPs are said to have hidden state

Two hallways may look the same to a robot's sensors:
Optimal action for the first: take left
Optimal action for the second: take right
A memoryless policy cannot distinguish between the two
MDPs vs. POMDPs (cont.)

Noise can create ambiguity in state inference:
An agent's sensors are always limited in the amount of information they can pick up

One way of overcoming this is to add sensors:
Specific sensors that help it to "disambiguate" hallways
Only when possible, affordable or desirable

In general, we are now considering agents that need to be proactive (also called "anticipatory"):
Not only react to environmental stimuli
Self-create context using memory

POMDP problems are harder to solve, but represent realistic scenarios.
POMDP solution techniques – model-based methods

If an exact model of the environment is available, POMDPs can (in theory) be solved:
i.e., an optimal policy can be found
As with model-based MDPs, it is not so much a learning problem:
No real "learning", or trial and error, taking place
No exploration/exploitation dilemma
Rather a probabilistic planning problem: find the optimal policy

In POMDPs the above is broken into two elements:
Belief state computation, and
Value function computation based on belief states
The belief state

Instead of maintaining the complete action/observation history, we maintain a belief state b.
The belief state is a probability distribution over the states, given the observation history.
Dim(b) = |S| − 1
The belief space is the entire probability space.
We'll use a two-state POMDP as a running example:
Probability of being in state one = p; probability of being in state two = 1 − p
Therefore, the entire space of belief states can be represented as a line segment.
The belief space

Here is a representation of the belief space when we have two states (s0, s1).
The belief space (cont.)

The belief space is continuous, but we only visit a countable number of belief points.
Assumptions:
Finite action set
Finite observation set
The next belief state is b' = f(b, a, o), where:
b: current belief state, a: action, o: observation
The Tiger Problem

Standing in front of two closed doors.
The world is in one of two states: the tiger is behind the left door or the right door.
Three actions: open left door, open right door, listen.
Listening is not free, and not accurate (it may give wrong information).
Reward: open the wrong door and get eaten by the tiger (large −r); open the correct door and get a prize (small +r).

Two states: SL and SR (the tiger is really behind the left or right door).
Three actions: LEFT, RIGHT, LISTEN.
Transition probabilities: listening does not change the tiger's position.
Each episode is a "reset".
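A minimal sketch of the tiger problem's listening dynamics. The 85% listening accuracy is the classic value from the POMDP literature (the slide only says listening is inaccurate), and all names are illustrative:

```python
import numpy as np

S = ["tiger-left", "tiger-right"]          # SL, SR
A = ["open-left", "open-right", "listen"]  # LEFT, RIGHT, LISTEN

# Listening leaves the tiger where it is: Pr(s' | s, LISTEN) = I
T_listen = np.eye(2)
# Hypothetical 85% listening accuracy (the classic choice)
O_listen = np.array([[0.85, 0.15],   # tiger-left: mostly "hear-left"
                     [0.15, 0.85]])  # tiger-right: mostly "hear-right"

b = np.array([0.5, 0.5])             # start with no idea where the tiger is
o = 0                                # we hear the tiger on the left
unnorm = O_listen[:, o] * (b @ T_listen)
b = unnorm / unnorm.sum()
print(b)                             # belief shifts toward tiger-left
```

After a single accurate-sounding roar, the belief moves from [0.5, 0.5] to [0.85, 0.15]; listening again and hearing the same side would push it further toward certainty.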
POMDP Policy Tree (Fake Policy)

[Figure: an example policy tree. From a starting belief state (tiger-left probability 0.3), the root action is Listen; each observation ("tiger roar left" or "tiger roar right") leads to a new belief state, from which the next action is chosen (Listen, or Open left door), and so on.]
POMDP Policy Tree (cont.)

[Figure: a generic policy tree. The root action A1 branches on observations o1, o2, o3, … to the next actions A2, A3, A4, …, and so on down to the horizon.]
How many POMDP policies are possible?

[Figure: the same generic policy tree, with action nodes A1, …, A8 and observation branches o1, …, o6.]

How many policy trees are there, given |A| actions, |O| observations and horizon T?
How many nodes in a tree: 1 + |O| + |O|² + … + |O|^(T−1) = (|O|^T − 1) / (|O| − 1)
Each node chooses one of |A| actions, so there are |A|^((|O|^T − 1)/(|O| − 1)) distinct policy trees.
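The doubly exponential growth in the number of policy trees can be made concrete in a few lines; the counting assumes a complete tree with one action node per observation history:

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """Count the distinct policy trees of a given horizon."""
    # one root node, then one node for every observation history
    nodes = sum(num_obs ** t for t in range(horizon))  # (|O|^T - 1) / (|O| - 1)
    return num_actions ** nodes

# The tiger problem has |A| = 3 actions and |O| = 2 observations
for T in (1, 2, 3, 4):
    print(T, num_policy_trees(3, 2, T))
# horizon 1: 3 trees; horizon 2: 27; horizon 3: 2187; horizon 4: 14348907
```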
b'(s') = Pr(s' | o, a, b) = Pr(s' ∧ o ∧ a ∧ b) / Pr(o ∧ a ∧ b)

= [Pr(o | s', a, b) Pr(s' | a, b) Pr(a ∧ b)] / [Pr(o | a, b) Pr(a ∧ b)]

= Pr(o | s', a) Pr(s' | a, b) / Pr(o | a, b)

We will not repeat Pr(o | a, b) in the next slide, but assume it is there!
It is treated as a normalizing factor, so that b' sums to 1.

Pr(o | s', a) Pr(s' | a, b) = O(s', a, o) Pr(s' | a, b)
= O(s', a, o) Σ_s Pr(s' | a, b, s) Pr(s | a, b)
= O(s', a, o) Σ_s Pr(s' | a, b, s) b(s)    ; since Pr(s | a, b) = Pr(s | b) = b(s)
= O(s', a, o) Σ_s T(s, a, s') b(s)

(Please work out some of the details at home!)
Belief State

Overall formula:

b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b)

The belief state is updated proportionally to:
the probability of seeing the current observation given state s', and
the probability of arriving at state s' given the action and our previous belief state b.
The above are all given by the model.
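The overall formula translates directly into a few lines of code. This sketch assumes the model is stored as arrays T[s, a, s'] and O[s', a, o], which is my own layout choice:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)."""
    pred = b @ T[:, a, :]            # Pr(s' | a, b): predicted next-state dist.
    unnorm = O[:, a, o] * pred       # weight by the observation likelihood
    return unnorm / unnorm.sum()     # the normalizer is Pr(o | a, b)

# Two-state sanity check with a hypothetical model: the state never
# changes, and the single observation is 85% accurate.
T = np.eye(2).reshape(2, 1, 2)       # T[s, 0, s'] = identity
O = np.array([[[0.85, 0.15]],
              [[0.15, 0.85]]])       # O[s', 0, o]
b = belief_update(np.array([0.5, 0.5]), 0, 0, T, O)
print(b)                             # → [0.85 0.15]
```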
Belief State (cont.)

Let's look at an example:
Consider a robot that is initially completely uncertain about its location.
Seeing a door may, as specified by the model, occur in three different locations.
Suppose that the robot takes an action and observes a T-junction.
It may be that, given the action, only one of the three states could have led to an observation of a T-junction.
The agent now knows with certainty which state it is in.
Not in all cases does the uncertainty disappear like that.
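The hallway example can be sketched with made-up numbers; the three locations, the deterministic forward move, and the T-junction likelihoods below are all hypothetical:

```python
import numpy as np

# Three candidate locations; the robot starts completely uncertain
b = np.array([1/3, 1/3, 1/3])

# Moving forward shifts each location to the next one (deterministic here)
T_forward = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [1, 0, 0]], dtype=float)

# Pr(T-junction | s'): only one resulting location shows a T-junction
p_t_junction = np.array([0.0, 0.0, 1.0])

pred = b @ T_forward                 # belief after the move, before sensing
unnorm = p_t_junction * pred
b = unnorm / unnorm.sum()
print(b)                             # all mass collapses onto one location
```

Because the observation is impossible in two of the three predicted locations, normalization zeroes them out and the belief becomes certain in one step, just as the slide describes.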
Finding an optimal policy

The policy component of a POMDP agent must map the current belief state into an action.
It turns out that the maintained belief state is a sufficient statistic (i.e., Markovian):
we cannot do better even if we remembered the entire history of observations and actions.

We have now transformed the POMDP into an MDP.
Good news: we have ways of solving those (GPI algorithms).
Bad news: the belief-state space is continuous!
Value function

The belief state is the input to the second component of the method: the value function computation.
The belief state is a point in a continuous space of N − 1 dimensions!
The value function must be defined over this infinite space.
Direct application of dynamic programming techniques is infeasible.
Value function (cont.)

Let's assume only two states: s1 and s2.
The belief state [0.25 0.75] indicates b(s1) = 0.25, b(s2) = 0.75.
With two states, b(s1) is sufficient to describe the belief state: b(s2) = 1 − b(s1).

[Figure: V(b) plotted over the belief segment from b = [1, 0] (state S1) through [0.5, 0.5] to b = [0, 1] (state S2).]
Piecewise Linear and Convex (PWLC)

It turns out that the value function is, or can be accurately approximated by, a piecewise-linear and convex function.
Intuition on convexity: being certain of a state yields high value, whereas uncertainty lowers the value.

[Figure: a convex, piecewise-linear V(b) over the belief segment from b = [1, 0] (S1) to b = [0, 1] (S2), highest near the endpoints and lowest near [0.5, 0.5].]
Why does PWLC help?

We can directly work with regions (intervals) of belief space!
The vectors are policies, and indicate the right action to take in each region of the space.

[Figure: three linear pieces Vp1, Vp2 and Vp3 over the belief segment from S1 = [1, 0] to S2 = [0, 1]; their upper surface is V(b), and each piece dominates in its own interval: region 1, region 2, region 3.]
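Evaluating a PWLC value function is just a max over dot products. The alpha-vectors and action labels below are invented for illustration, not taken from the slides:

```python
import numpy as np

# Each alpha-vector is one linear piece of V(b), tagged with an action
alphas = np.array([[10.0, 0.0],      # best when nearly sure of s1
                   [ 6.0, 6.5],      # information-gathering middle region
                   [ 0.0, 10.0]])    # best when nearly sure of s2
actions = ["act-for-s1", "listen", "act-for-s2"]

def value_and_action(b):
    """V(b) = max_k alpha_k . b; the argmax also names the region's action."""
    vals = alphas @ b
    k = int(np.argmax(vals))
    return vals[k], actions[k]

for p in (0.9, 0.5, 0.1):            # belief b = [p, 1 - p]
    v, a = value_and_action(np.array([p, 1.0 - p]))
    print(p, round(v, 2), a)
```

Sweeping p across [0, 1] traces out the upper surface of the three lines, and the action switches exactly at the region boundaries, which is why the vectors themselves encode the policy.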
Summary

POMDPs → better modeling of realistic scenarios.
They rely on belief states that are derived from observations and actions.
A POMDP can be transformed into an MDP over belief states, with PWLC functions used for value function approximation.