Page 1: POMDPs

POMDPs

Slides based on Hansen et al.'s tutorial + R&N 3rd Ed., Sec. 17.4

Page 2: POMDPs

Planning using Partially Observable Markov Decision Processes: A Tutorial

Presenters:
Eric Hansen, Mississippi State University
Daniel Bernstein, University of Massachusetts/Amherst
Zhengzhu Feng, University of Massachusetts/Amherst

Rong Zhou, Mississippi State University

Page 3: POMDPs

Introduction and foundations

Definition of POMDP
Goals, rewards and optimality criteria
Examples and applications
Computational complexity
Belief states and Bayesian conditioning

Page 4: POMDPs

Planning under partial observability

[Figure: agent-environment loop. The agent sends actions to the environment and receives only imperfect observations back, while trying to achieve a goal.]

Page 5: POMDPs

Two Approaches to Planning under Partial Observability

Nondeterministic planning: uncertainty is represented by a set of possible states; no possibility is considered more likely than any other.

Probabilistic (decision-theoretic) planning: uncertainty is represented by a probability distribution over possible states.

In this tutorial we consider the second, more general approach.

Page 6: POMDPs

Markov models

                        Prediction              Planning
Fully observable        Markov chain            MDP (Markov decision process)
Partially observable    Hidden Markov model     POMDP (partially observable Markov decision process)

Page 7: POMDPs

Definition of POMDP

[Figure: POMDP as a temporal process. Hidden states: s0, s1, s2, ...; observations: z0, z1, z2, ...; actions: a0, a1, a2, ...; rewards: r0, r1, r2, ...]

Page 8: POMDPs

Goals, rewards and optimality criteria

Rewards are additive and time-separable, and the objective is to maximize expected total reward.

Traditional planning goals can be encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state and assigning all other states negative reward.

A POMDP allows partial satisfaction of goals and tradeoffs among competing goals.

The planning horizon can be finite, infinite or indefinite.

Page 9: POMDPs

Machine Maintenance


Canonical application of POMDPs in Operations Research

Page 10: POMDPs

Robot Navigation

Canonical application of POMDPs in AI; a toy example from Russell & Norvig's AI textbook.

Actions: N, S, E, W, Stop. Each move succeeds with probability 0.8 and slips with probability 0.1 to each perpendicular direction.

Observations: sense the surrounding walls.

[Figure: grid world with a Start cell and terminal states with rewards +1 and -1.]

Page 11: POMDPs

Many other applications:
Helicopter control [Bagnell & Schneider 2001]
Dialogue management [Roy, Pineau & Thrun 2000]
Preference elicitation [Boutilier 2002]
Optimal search and sensor scheduling [Krishnamurthy & Singh 2000]
Medical diagnosis and treatment [Hauskrecht & Fraser 2000]
Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]

Page 12: POMDPs

Computational complexity

Finite-horizon: PSPACE-hard [Papadimitriou & Tsitsiklis 1987]; NP-complete if unobservable.

Infinite-horizon: undecidable [Madani, Hanks & Condon 1999]; NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]; NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999].

Page 13: POMDPs

POMDP <S, A, T, R, Ω, O> tuple

S, A, T, R as in an MDP
Ω – a finite set of observations
O : S × A → Π(Ω), the observation function

Belief state (information state) b: a probability distribution over S; b(s1) is the probability assigned to state s1.
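As a concrete illustration, here is a minimal sketch (not from the slides) of how the ⟨S, A, T, R, Ω, O⟩ tuple can be written down as plain Python data; the two-state model and all the numbers are illustrative placeholders.

```python
# A minimal POMDP specification <S, A, T, R, Omega, O> as plain Python data.
# The concrete numbers are illustrative, not taken from the slides.
S = ["s1", "s2"]                     # hidden states
A = ["a1", "a2"]                     # actions
Omega = ["z1", "z2"]                 # observations

# T[a][s][s2] = P(s2 | s, a): state transition function
T = {
    "a1": {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.1, "s2": 0.9}},
    "a2": {"s1": {"s1": 0.1, "s2": 0.9}, "s2": {"s1": 0.9, "s2": 0.1}},
}

# O[a][s2][z] = P(z | s2, a): probability of observing z after action a lands in s2
O = {
    "a1": {"s1": {"z1": 0.6, "z2": 0.4}, "s2": {"z1": 0.4, "z2": 0.6}},
    "a2": {"s1": {"z1": 0.6, "z2": 0.4}, "s2": {"z1": 0.4, "z2": 0.6}},
}

# R[a][s] = immediate reward for taking action a in state s
R = {"a1": {"s1": 0.0, "s2": 1.0}, "a2": {"s1": 0.0, "s2": 1.0}}

# A belief state is just a distribution over S, e.g. the uniform belief:
b0 = {"s1": 0.5, "s2": 0.5}
```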

Page 14: POMDPs

POMDP

Goal is to maximize expected long-term reward from the initial state distribution.

The state is not directly observed.

[Figure: agent-world loop; the agent sends action a to the world and receives observation o in return.]

Page 15: POMDPs

Two sources of POMDP complexity

Curse of dimensionality: the size of the state space (shared with other planning problems).

Curse of memory: the size of the value function (number of vectors) or, equivalently, the size of the controller (memory); unique to POMDPs.

Complexity of each iteration of DP: |S|^2 · |A| · |V_{n-1}|^{|Z|}, where the |S|^2 factor reflects dimensionality and the |A| · |V_{n-1}|^{|Z|} factor reflects memory (the number of vectors in the previous value function).

Page 16: POMDPs

Two representations of policy

A policy maps history to action. Since history grows exponentially with the horizon, it needs to be summarized, especially in the infinite-horizon case.

Two ways to summarize history:
belief state
finite-state automaton – partitions histories into a finite number of "states"

Page 17: POMDPs

Belief simplex

[Figure: the belief simplex. For 2 states, beliefs lie on the line segment between (1, 0) and (0, 1); for 3 states, they lie in the triangle with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1).]

Page 18: POMDPs

Belief state has Markov property

The process of maintaining the belief state is Markovian

For any belief state, the successor belief state depends only on the action and observation

[Figure: a two-state example. The belief P(s0) lies in the interval [0, 1]; each action/observation pair (a1 or a2, z1 or z2) maps the current belief to a unique successor belief.]

Page 19: POMDPs

Belief-state MDP

State space: the belief simplex
Actions: same as before
State transition function: P(b'|b,a) = Σ_{e∈E} P(b'|b,a,e) P(e|b,a)
Reward function: r(b,a) = Σ_{s∈S} b(s) r(s,a)
Bellman optimality equation: V(b) = max_{a∈A} [ r(b,a) + Σ_{b'} P(b'|b,a) V(b') ]

(Strictly, the sum over successor beliefs b' should be an integration over the continuous belief space.)
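To make the belief-MDP quantities concrete, here is a small sketch (reusing the illustrative dictionary-based T, R, O structures from the earlier snippet) that computes the belief-state reward r(b,a) = Σ_s b(s) r(s,a) and the observation probability P(e|b,a).

```python
def belief_reward(b, a, R):
    """r(b, a) = sum_s b(s) * r(s, a)."""
    return sum(b[s] * R[a][s] for s in b)

def observation_prob(b, a, e, T, O):
    """P(e | b, a) = sum_s b(s) * sum_s2 P(s2|s,a) * P(e|s2,a)."""
    return sum(
        b[s] * sum(T[a][s][s2] * O[a][s2][e] for s2 in T[a][s])
        for s in b
    )
```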

Page 20: POMDPs

Belief-state controller

[Figure: block diagram. A state-estimation module P(b'|b,a,e) updates the current belief state (held in a register) from the last action a and observation e; a policy module maps the belief state b to the next action a.]

Update the belief state after each action and observation ("state estimation").
The policy maps belief state to action.
The policy is found by solving the belief-state MDP.
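A minimal sketch of the state-estimation step (Bayesian conditioning), again assuming the illustrative T and O dictionaries introduced earlier; b_new(s') is proportional to P(e|s',a) Σ_s P(s'|s,a) b(s).

```python
def update_belief(b, a, e, T, O):
    """Bayesian belief update after taking action a and observing e."""
    unnormalized = {}
    for s2 in O[a]:
        # P(e | s2, a) * sum_s P(s2 | s, a) * b(s)
        unnormalized[s2] = O[a][s2][e] * sum(T[a][s][s2] * b[s] for s in b)
    norm = sum(unnormalized.values())          # this equals P(e | b, a)
    if norm == 0.0:
        raise ValueError("Observation e has zero probability under belief b")
    return {s2: p / norm for s2, p in unnormalized.items()}
```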

Page 21: POMDPs

POMDP as MDP in Belief Space

Page 22: POMDPs

Dynamic Programming for POMDPs

We'll start with some important concepts: the policy tree, the belief state, and the linear value function.

[Figure: a policy tree rooted at action a1, branching on observations o1/o2 to further actions (a2, a3, then a3, a2, a1, a1); a belief state over three states, e.g. b = (s1: 0.25, s2: 0.40, s3: 0.35); and a linear value function over the belief simplex.]

Page 23: POMDPs

Dynamic Programming for POMDPs

[Figure: value functions of the one-step policy trees a1 and a2, drawn as lines over the belief space between s1 and s2.]

Page 24: POMDPs

Dynamic Programming for POMDPs

[Figure: all two-step policy trees over the belief space between s1 and s2 (every combination of a root action a1 or a2 with an action for each observation branch o1, o2), together with their value functions.]

Page 25: POMDPs

Dynamic Programming for POMDPs

[Figure: the same value functions after pruning; only the policy trees that are optimal for some belief state are kept.]

Page 26: POMDPs

Dynamic Programming for POMDPs

[Figure: the resulting value function over the belief space between s1 and s2.]

Page 27: POMDPs

POMDP Value Iteration: Basic Idea [Finite-Horizon Case]

Page 28: POMDPs

First Problem Solved

Key insight: the value function is piecewise linear and convex (PWLC).

Convexity makes intuitive sense:
In the middle of belief space – high entropy; the agent can't select actions appropriately, so it gets less long-term reward.
Near the corners of the simplex – low entropy; the agent can take actions more likely to be appropriate for the current world state and gain more reward.

Each line (hyperplane) is represented by a vector of coefficients, e.g. V(b) = c1 · b(s1) + c2 · (1 − b(s1)).

To find the value at a belief state b, find the vector with the largest dot product with b.
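A tiny sketch of that last step, assuming each alpha-vector is a list of coefficients aligned with a fixed state ordering (the names here are illustrative):

```python
def value_at(b, vectors):
    """PWLC value function: V(b) = max over alpha-vectors of dot(alpha, b).

    b        -- belief as a list of probabilities, one per state
    vectors  -- list of alpha-vectors (lists of coefficients, same state order)
    Returns (best value, index of the maximizing vector).
    """
    best_value, best_index = float("-inf"), None
    for i, alpha in enumerate(vectors):
        v = sum(a * p for a, p in zip(alpha, b))
        if v > best_value:
            best_value, best_index = v, i
    return best_value, best_index
```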

Page 29: POMDPs

POMDP Value Iteration: Phase 1: One-action plans

Example:
Two states: 0 and 1
Rewards: R(0) = 0, R(1) = 1
Action Stay: stay with probability 0.9, switch with probability 0.1
Action Go: switch with probability 0.9, stay with probability 0.1
The sensor reports the correct state with probability 0.6
Discount factor = 1

Page 30: POMDPs

POMDP Value Iteration: Phase 2: Two-action (conditional) plans

[Figure: value functions of two-step conditional plans (e.g. Stay, then Stay for either observation) over the belief interval [0, 1].]

Page 31: POMDPs
Page 32: POMDPs

Point-based Value Iteration: Approximating with Exemplar Belief States

Page 33: POMDPs

Solving infinite-horizon POMDPs

Value iteration: iterating the dynamic programming operator computes a value function that is arbitrarily close to optimal.

The optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory.

But in many cases, as Sondik (1978) and Kaelbling et al. (1998) noticed, value iteration converges to a finite set of vectors. In these cases, an optimal policy is equivalent to a finite-state controller.

Page 34: POMDPs

Policy evaluation

[Figure: a two-node finite-state controller (nodes q1, q2 with observation transitions o1, o2) crossed with the two system states s1, s2, giving value variables for the pairs (s1,q1), (s2,q1), (s1,q2), (s2,q2).]

As in the fully observable case, policy evaluation involves solving a system of linear equations. There is one unknown (and one equation) for each pair of system state and controller node.
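A sketch of that linear system, under the assumption that the controller is given as a mapping from node to (action, observation-to-successor-node) and that the model uses the nested dictionaries from the earlier snippets:

```python
import numpy as np

def evaluate_controller(S, Z, nodes, T, O, R, gamma):
    """Evaluate a finite-state controller on a POMDP.

    nodes -- dict q -> (action a, dict z -> successor node)
    T[a][s][s2], O[a][s2][z], R[a][s] -- model as nested dicts (as sketched earlier)
    Returns a dict (s, q) -> value, one unknown per state/node pair.
    """
    pairs = [(s, q) for s in S for q in nodes]
    index = {p: i for i, p in enumerate(pairs)}
    n = len(pairs)
    A = np.eye(n)
    rhs = np.zeros(n)
    for (s, q), i in index.items():
        a, succ = nodes[q]
        rhs[i] = R[a][s]
        for s2 in S:
            for z in Z:
                j = index[(s2, succ[z])]
                # V(s,q) = R(s,a) + gamma * sum_{s2,z} T*O * V(s2, succ(q,z))
                A[i, j] -= gamma * T[a][s][s2] * O[a][s2][z]
    v = np.linalg.solve(A, rhs)
    return {p: v[index[p]] for p in pairs}
```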

Page 35: POMDPs

Policy improvement

[Figure: policy improvement on a finite-state controller for a two-state problem. The current controller (nodes labeled with actions a0/a1 and observation transitions z0/z1) is shown with its value function V(b) over the belief interval [0, 1]; the dynamic programming update adds new nodes whose vectors improve V(b), and nodes whose vectors are dominated are pruned, yielding a smaller improved controller.]

Page 36: POMDPs

Per-iteration complexity of POMDP value iteration

Number of α-vectors needed at the t-th iteration: |A| · |V_{t-1}|^{|Z|} (one for each choice of action and of a previous-iteration vector per observation).

Time for computing each α-vector: O(|S|^2 · |Z|).

Page 37: POMDPs

Approximating the POMDP value function with bounds

It is possible to get approximate value functions for a POMDP, per iteration, in two ways:

Over-constrain it to be a NOMDP (non-observable MDP): you get the blind value function, which ignores the observation, i.e. a "conformant" policy. For the infinite horizon it is the same action always (only |A| such policies). This under-estimates the value (over-estimates the cost).

Relax it to be a FOMDP (fully observable MDP): you assume that the state is fully observable, i.e. a "state-based" policy. This over-estimates the value (under-estimates the cost).

Page 38: POMDPs

Upper bounds for leaf nodes can come from FOMDP VI and lower bounds from NOMDP VI

Observations are written as o or z

Page 39: POMDPs

Comparing POMDPs with non-deterministic conditional planning

[Table comparing the POMDP case with the non-deterministic case.]

Page 40: POMDPs

RTDP-Bel does not do lookahead, and it stores the current estimate of the value function (see the update step).

Page 41: POMDPs

---SLIDES BEYOND THIS NOT COVERED--

Page 42: POMDPs

Two Problems

How to represent the value function over a continuous belief space?
How to update value function V_t from V_{t-1}?

POMDP → MDP:
S ⇒ B, the set of belief states
A ⇒ same
T ⇒ τ(b, a, b')
R ⇒ ρ(b, a)

Page 43: POMDPs

Running Example

A POMDP with:
Two states (s1 and s2)
Two actions (a1 and a2)
Three observations (z1, z2, z3)

The belief space is 1-dimensional for a 2-state POMDP: a single number, the probability that the state is s1.

Page 44: POMDPs

Second Problem

Can't iterate over all belief states (there are infinitely many) for value iteration, but...
Given the vectors representing V_{t-1}, we can generate the vectors representing V_t.

Page 45: POMDPs

Horizon 1

No future: the value function consists only of the immediate reward.

e.g. R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5, and b = <0.25, 0.75>:

Value of doing a1 = 1 × b(s1) + 0 × b(s2) = 1 × 0.25 + 0 × 0.75 = 0.25
Value of doing a2 = 0 × b(s1) + 1.5 × b(s2) = 0 × 0.25 + 1.5 × 0.75 = 1.125
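A two-line check of those horizon-1 values (the alpha-vectors here are just the immediate-reward vectors for each action):

```python
b = [0.25, 0.75]                 # belief over (s1, s2)
alpha = {"a1": [1.0, 0.0],       # immediate rewards R(., a1)
         "a2": [0.0, 1.5]}       # immediate rewards R(., a2)
values = {a: sum(ai * bi for ai, bi in zip(v, b)) for a, v in alpha.items()}
print(values)                    # {'a1': 0.25, 'a2': 1.125} -> a2 is best at this belief
```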

Page 46: POMDPs

Second Problem

Break the problem down into 3 steps:
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Page 47: POMDPs

Horizon 2 – Given action & obs

If in belief state b, what is the best value of doing action a1 and seeing z1?
Best value = best value of immediate action + best value of next action.
Best value of immediate action = horizon 1 value function.

Page 48: POMDPs

Horizon 2 – Given action & obs

Assume the best immediate action is a1 and the observation is z1. What's the best action for the belief b' that results from the initial b when we perform a1 and observe z1?
Doing this for all belief states is not feasible (there are infinitely many).

Page 49: POMDPs

Horizon 2 – Given action & obs

Construct a function over the entire (initial) belief space from the horizon 1 value function, with the belief transformation built in.

Page 50: POMDPs

Horizon 2 – Given action & obs

S(a1, z1) corresponds to the paper's S() with the following built in:
- the horizon 1 value function
- the belief transformation
- the "weight" of seeing z after performing a
- the discount factor
- the immediate reward

S() is PWLC.
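A sketch of how such transformed vectors can be built from the previous value function. The slides do not give the formula explicitly, so this uses the common incremental-pruning formulation, alpha_az(s) = R(s,a)/|Z| + gamma * Σ_{s'} T(s'|s,a) O(z|s',a) alpha(s'), as one standard way to realize the S(a, z) construction; the data structures are the illustrative ones from the earlier snippets.

```python
def transformed_vectors(a, z, prev_vectors, S, Z, T, O, R, gamma):
    """Build the S(a, z) vector set from the previous-horizon alpha-vectors.

    Each previous vector (a dict state -> value) yields one new vector:
    alpha_az(s) = R[a][s]/|Z| + gamma * sum_s2 T[a][s][s2] * O[a][s2][z] * alpha(s2)
    (immediate reward split evenly across observations, as in incremental pruning).
    """
    out = []
    for alpha in prev_vectors:
        vec = {}
        for s in S:
            future = sum(T[a][s][s2] * O[a][s2][z] * alpha[s2] for s2 in S)
            vec[s] = R[a][s] / len(Z) + gamma * future
        out.append(vec)
    return out
```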

Page 51: POMDPs

Second Problem

Break the problem down into 3 steps:
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Page 52: POMDPs

Horizon 2 – Given action

What is the horizon 2 value of a belief state given that the immediate action is a1? At horizon 2, do action a1; at horizon 1, do action...?

Page 53: POMDPs

Horizon 2 – Given action

What's the best strategy at b? How do we compute the line (vector) representing the best strategy at b? (easy)
How many strategies are there in the figure? What's the maximum number of strategies (after taking immediate action a1)?

Page 54: POMDPs

Horizon 2 – Given action

How can we represent the 4 regions (strategies) as a value function?

Note: each region is a strategy

Page 55: POMDPs

Horizon 2 – Given action

Sum up the vectors representing each region (a sum of vectors is a vector: adding lines gives lines). This corresponds to the paper's transformation.
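This summation over observations is the cross-sum of the S(a, z) sets: one vector is picked from each observation's set and the picks are added componentwise. A brief sketch, reusing the dict-based vectors from the snippets above:

```python
from itertools import product

def cross_sum(vector_sets):
    """Cross-sum of several sets of alpha-vectors (dicts state -> value).

    For sets V1, ..., Vk (one per observation), returns every vector
    v1 + ... + vk with vi drawn from Vi.
    """
    result = []
    for combo in product(*vector_sets):
        summed = {s: sum(vec[s] for vec in combo) for s in combo[0]}
        result.append(summed)
    return result
```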

Page 56: POMDPs

Horizon 2 – Given action

What does each region represent? Why is this step hard (as alluded to in the paper)?

Page 57: POMDPs

Second Problem

Break the problem down into 3 steps:
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Page 58: POMDPs

Horizon 2

[Figure: the horizon-2 value function is the union (upper surface) of the a1 and a2 value functions.]

Page 59: POMDPs

Horizon 2

This tells you how to act!

Page 60: POMDPs

Purge

Page 61: POMDPs

Second Problem

Break the problem down into 3 steps:
- Compute the value of a belief state given action and observation
- Compute the value of a belief state given action
- Compute the value of a belief state

Then use the horizon 2 value function to update horizon 3's, and so on.

Page 62: POMDPs

The Hard Step

It is easy to visually inspect the figure to obtain the different regions, but in a higher-dimensional space, with many actions and observations, this is a hard problem.

Page 63: POMDPs

Naïve way - Enumerate

How does Incremental Pruning do it?

Page 64: POMDPs

Incremental Pruning

How does IP improve on the naïve method? Will IP ever do worse than the naïve method?

[Figure: the naïve method forms all combinations and then purges/filters; incremental pruning interleaves the combination and purge steps.]

Page 65: POMDPs

Incremental Pruning

What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate().
RR has more linear programs but fewer constraints in the worst case. Empirically, the reduction in constraints saves more time than the additional linear programs require.

Page 66: POMDPs

Incremental Pruning

What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate().
Why are the terms after the ∪ (union) needed?

Page 67: POMDPs

Identifying a Witness

Witness Theorem:
- Let U_a be a set of vectors representing the value function.
- Let u be in U_a (e.g. u = α_{z1,a2} + α_{z2,a1} + α_{z3,a1}).
- If there is a vector v which differs from u in one observation (e.g. v = α_{z1,a1} + α_{z2,a1} + α_{z3,a1}) and there is a belief b such that b·v > b·u,
- then U_a is not equal to the true value function.

Page 68: POMDPs

Witness Algorithm

Randomly choose a belief state b.
Compute the vector representing the best value at b (easy) and add it to the agenda.
While the agenda is not empty:
• Get vector V_top from the top of the agenda.
• b' = Dominate(V_top, U_a).
• If b' is not null (there is a witness), compute the vector u for the best value at b' and add it to U_a; compute all vectors v that differ from u at one observation and add them to the agenda.
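Dominate(w, U) is typically implemented as a linear program that searches for a witness belief at which w improves on every vector already in U. A sketch using one common LP encoding (an assumption, not taken from the slides), written with scipy:

```python
import numpy as np
from scipy.optimize import linprog

def dominate(w, U, tol=1e-9):
    """Return a witness belief where vector w beats every vector in U, else None.

    Variables: belief b (length n) and slack d.  Maximize d subject to
    b.(w - u) >= d for all u in U, sum(b) = 1, b >= 0.
    """
    w = np.asarray(w, dtype=float)
    n = len(w)
    if not U:
        return np.full(n, 1.0 / n)            # any belief is a witness
    # linprog minimizes, so minimize -d; the variable vector is [b_1..b_n, d]
    c = np.zeros(n + 1); c[-1] = -1.0
    A_ub = np.array([np.append(np.asarray(u) - w, 1.0) for u in U])
    b_ub = np.zeros(len(U))
    A_eq = np.array([np.append(np.ones(n), 0.0)])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > tol:        # optimal d > 0: witness found
        return res.x[:n]
    return None
```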

Page 69: POMDPs

Linear Support

If the value function is incorrect, the biggest difference is at the edges (by convexity).

Page 70: POMDPs

Linear Support

Page 71: POMDPs

Number of policy trees

At horizon T there are |A|^((|Z|^T − 1)/(|Z| − 1)) policy trees.

Example for |A| = 4 and |Z| = 2:
Horizon   # of policy trees
0         1
1         4
2         64
3         16,384
4         1,073,741,824
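A one-liner to reproduce that table (the exponent counts the nodes of a depth-T policy tree):

```python
A, Z = 4, 2
for T in range(5):
    nodes = (Z**T - 1) // (Z - 1)        # 1 + Z + ... + Z^(T-1) tree nodes
    print(T, A**nodes)                   # 1, 4, 64, 16384, 1073741824
```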

Page 72: POMDPs

Policy graph

[Figure: a policy graph for a 2-state problem. The value function V(b) is shown over the belief interval, and each of its vectors becomes a node of a finite-state controller labeled with an action (a0, a1, a2); edges between nodes are labeled with the observations z0, z1 that trigger the transition.]

Page 73: POMDPs

Policy iteration for POMDPs

Sondik's (1978) algorithm represents the policy as a mapping from belief states to actions; it only works under special assumptions, is very difficult to implement, and is never used.

Hansen's (1998) algorithm represents the policy as a finite-state controller; it is fully general, easy to implement, and faster than value iteration.

Page 74: POMDPs

Properties of policy iteration

Theoretical:
Monotonically improves the finite-state controller.
Converges to an ε-optimal finite-state controller after a finite number of iterations.

Empirical:
Runs from 10 to over 100 times faster than value iteration.

Page 75: POMDPs

Scaling up

State abstraction and factored representation
Belief compression
Forward search and sampling approaches
Hierarchical task decomposition

Page 76: POMDPs

State abstraction and factored representation of POMDP

DP algorithms are typically state-based, while most AI representations are "feature-based".
|S| is typically exponential in the number of features (or variables) – the "curse of dimensionality".
State-based representations for problems with more than a few variables are impractical.
Factored representations exploit regularities in the transition and observation probabilities and in the reward.

Page 77: POMDPs

Example: Part-painting problem [Draper, Hanks, Weld 1994]

Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO)

Actions: Inspect, Paint, Ship, Reject, Notify

Cost function:
Cost of 1 for each action
Cost of 1 for shipping an unflawed part that is not painted
Cost of 10 for shipping a flawed part or rejecting an unflawed part

Initial belief state: Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|¬FL) = 0.0, Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0

Page 78: POMDPs

Factored representation of MDP [Boutilier et al. 1995; Hoey, St-Aubin, Hu & Boutilier 1999]

A dynamic Bayesian network captures variable independence; an algebraic decision diagram captures value independence.

[Figure: a dynamic Bayesian network relating the part-painting state variables at time t to their primed counterparts at time t+1, with the conditional probability tables also shown as decision diagrams.]

Pr(FL' = true):
FL    FL'
T     1.0
F     0.0

Pr(PA' = true):
PA    SH    RE    NO    PA'
T     T/F   T/F   T/F   1.0
F     F     F     F     0.95
F     T     T/F   T/F   0.0
F     T/F   T     T/F   0.0
F     T/F   T/F   T     0.0

Page 79: POMDPs

Decision diagrams

[Figure: a binary decision diagram (BDD) over variables X, Y, Z with terminal nodes TRUE and FALSE, and an algebraic decision diagram (ADD) over the same variables with numeric terminal nodes such as 5.8, 3.6, 18.6 and 9.5.]

Page 80: POMDPs

Operations on decision diagrams

Addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value.

The complexity of these operators depends on the size of the decision diagrams, not on the number of states!

[Figure: adding two ADDs over variables X, Y, Z produces another ADD; e.g. terminals 1.0, 2.0, 3.0 added to terminals 10.0, 20.0, 30.0 yield terminals such as 11.0, 12.0, 22.0, 23.0 and 33.0.]

Page 81: POMDPs

Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000]

Factored representation of the value function: replace |S|-vectors with ADDs that only make relevant state distinctions.

Two steps of the DP algorithm:
Generate new ADDs for the value function
Prune dominated ADDs

State abstraction is based on aggregating states with the same value.

Page 82: POMDPs

Generation step: Symbolic implementation

[Figure: the DP backup performed with ADD operations, combining a value-function ADD for the next stage with the transition probabilities, the observation probabilities (obs1, obs2, obs3), and the action reward to produce a value-function ADD for the current stage.]

Page 83: POMDPs

Pruning step: Symbolic implementation

Pruning is the most computationally expensive part of the algorithm: a linear program must be solved for each (potential) ADD in the value function. Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency.

Page 84: POMDPs

Improved performance

Test problem   Degree of abstraction   Speedup factor (Generate)   Speedup factor (Prune)
1              0.01                    42                          26
2              0.03                    17                          11
3              0.10                    0.4                         3
4              0.12                    0.8                         0.6
5              0.44                    -3.4                        0.4
6              0.65                    -0.7                        0.1
7              1.00                    -6.5                        -0.1

Degree of abstraction = (number of abstract states) / (number of primitive states)

Page 85: POMDPs

Optimal plan (controller) for the part-painting problem

[Figure: a finite-state controller with nodes labeled Inspect, Reject, Notify, Paint, and Ship, transitions labeled OK / ~OK, and the relevant state variables (FL, BL, PA, PR, NO) noted at each node.]

Page 86: POMDPs

Approximate state aggregation

Simplify each ADD in the value function by merging leaves that differ in value by less than ε (e.g. ε = 0.4).

Page 87: POMDPs

Approximate pruning

Prune vectors from the value function that add less than ε to the value of any belief state.

[Figure: four vectors a1-a4 over the belief interval from (0,1) to (1,0); a vector that improves the upper surface by less than ε anywhere is pruned.]

Page 88: POMDPs

Error bound

These two methods of approximation share the same error bound.

"Weak convergence," i.e., convergence to within 2ε/(1−β) of optimal (where β is the discount factor).

After "weak convergence," decreasing ε allows further improvement.

Starting with a relatively high ε and gradually decreasing it accelerates convergence.

Page 89: POMDPs

Approximate dynamic programming

Strategy: ignore differences of value less than some threshold ε.

The complementary methods of approximate state aggregation and approximate pruning address the two sources of complexity: the size of the state space and the size of the value function (memory).

Page 90: POMDPs

Belief compression

Reduce the dimensionality of the belief space by approximating the belief state.

Examples of approximate belief states:
tuple of the most-likely state plus the entropy of the belief state [Roy & Thrun 1999]
belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]

Standard POMDP algorithms can be applied in the lower-dimensional belief space, e.g., grid-based approximation.
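As a flavor of the simplest compression above, a two-number summary of a belief (most-likely state plus entropy); a sketch, not the authors' implementation:

```python
import math

def compress_belief(b):
    """Compress a belief dict state -> probability into (most-likely state, entropy)."""
    most_likely = max(b, key=b.get)
    entropy = -sum(p * math.log(p) for p in b.values() if p > 0.0)
    return most_likely, entropy
```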

Page 91: POMDPs

Forward search

[Figure: an AND/OR search tree rooted at the current belief state, alternating action choices (a0, a1) and observation branches (z0, z1) to a fixed depth.]

Page 92: POMDPs

Sparse sampling

Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al. 2000; Ng & Jordan 2000].

Remarkably, the complexity is independent of the size of the state space!

The resulting on-line planner selects an ε-optimal action for the current belief state.
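A compact sketch of depth-limited forward search over belief states (using an exact expectation over observations rather than sampling); it reuses update_belief, belief_reward and observation_prob from the earlier snippets, all of which are illustrative rather than the slides' code:

```python
def forward_search(b, depth, model, gamma):
    """Depth-limited lookahead in belief space; returns (value, best action).

    model is a dict with keys "S", "A", "Z", "T", "O", "R" holding the
    structures sketched earlier.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in model["A"]:
        value = belief_reward(b, a, model["R"])
        for z in model["Z"]:
            p_z = observation_prob(b, a, z, model["T"], model["O"])
            if p_z > 0.0:
                b_next = update_belief(b, a, z, model["T"], model["O"])
                future, _ = forward_search(b_next, depth - 1, model, gamma)
                value += gamma * p_z * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```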

Page 93: POMDPs

State-space decomposition

For some POMDPs, each action/observation pair identifies a specific region of the state space

Page 94: POMDPs

Motivating Example Continued

A "deterministic observation" reveals that the world is in one of a small number of possible states.

The same holds for "hybrid POMDPs", which are POMDPs with some fully observable and some partially observable state variables.

Page 95: POMDPs

Region-based dynamic programming

[Figure: a tetrahedron (the belief simplex) and its surfaces.]

Page 96: POMDPs

Hierarchical task decomposition

We have considered abstraction in the state space; now we consider abstraction in the action space.

For fully observable MDPs:
Options [Sutton, Precup & Singh 1999]
HAMs [Parr & Russell 1997]
Region-based decomposition [Hauskrecht et al. 1998]
MAXQ [Dietterich 2000]

A hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed:
Hierarchical optimality (Parr and Russell)
Recursive optimality (Dietterich)

Page 97: POMDPs

Hierarchical approach to POMDPs

Theocharous & Mahadevan (2002): based on a hierarchical hidden Markov model; approximation; ~1000-state robot hallway-navigation problem.

Pineau et al. (2003): based on Dietterich's MAXQ decomposition; approximation; ~1000-state robot navigation and dialogue.

Hansen & Zhou (2003): also based on Dietterich's MAXQ decomposition; convergence guarantees and epsilon-optimality.

Page 98: POMDPs

Macro action as finite-state controller

Allows exact modeling of the macro's effects: macro state-transition probabilities and macro rewards.

[Figure: a navigation macro expressed as a finite-state controller with nodes for West, North, East, South, and Stop, and transitions labeled with observations such as wall, clear, and goal.]

Page 99: POMDPs

Taxi example [Dietterich 2000]

Page 100: POMDPs

Task hierarchy [Dietterich 2000]

Taxi → Get, Put
Get → Navigate, Pickup
Put → Navigate, Putdown
Navigate → North, South, East, West

Page 101: POMDPs

Hierarchical finite-state controller

[Figure: a hierarchy of controllers for the taxi problem. The Get and Put sub-controllers invoke Navigate, Pickup, and Putdown sub-controllers, whose nodes issue the primitive actions North, South, East, West, and Stop.]

Page 102: POMDPs

MAXQ-hierarchical policy iteration

Create an initial sub-controller for each sub-POMDP in the hierarchy.
Repeat until the error bound is less than ε:
Identify the subtask that contributes most to the overall error.
Use policy iteration to improve the corresponding controller.
For each node of the controller, create an abstract action (for the parent task) and compute its model.
Propagate the error up through the hierarchy.

Page 103: POMDPs

Modular structure of controller

Page 104: POMDPs

Complexity reduction

Per-iteration complexity of policy iteration: |A| · |Q|^{|Z|}, where A is the set of actions, Z is the set of observations, and Q is the set of controller nodes.

Per-iteration complexity of hierarchical policy iteration: |A| · Σ_i |Q_i|^{|Z|}, where |Q| = Σ_i |Q_i|.

With hierarchical decomposition, the complexity is the sum of the complexities of the subproblems, instead of the product.
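A quick illustrative comparison (the numbers are made up): with |A| = 4, |Z| = 2 and a 20-node controller split into four 5-node sub-controllers,

```python
A, Z = 4, 2
flat = A * 20**Z                                  # 1600 candidate nodes per iteration
hierarchical = A * sum(5**Z for _ in range(4))    # 400 candidate nodes per iteration
print(flat, hierarchical)
```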

Page 105: POMDPs

Scalability

MAXQ-hierarchical policy iteration can solve any POMDP, if it can decompose it into sub-POMDPs that can be solved by policy iteration.

Although each sub-controller is limited in size, the hierarchical controller is not limited in size.

Although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size.

Page 106: POMDPs

Multi-Agent Planning with POMDPs

Partially observable stochastic games
Generalized dynamic programming

Page 107: POMDPs

Multi-Agent Planning with POMDPs

Many planning problems involve multiple agents acting in a partially observable environment.

The POMDP framework can be extended to address this.

[Figure: two agents interacting with the world; agent 1 takes action a1 and receives observation z1 and reward r1, agent 2 takes action a2 and receives observation z2 and reward r2.]

Page 108: POMDPs

Partially observable stochastic game (POSG)

A POSG is a tuple ⟨S, A1, A2, Z1, Z2, P, r1, r2⟩, where
S is a finite state set, with initial state s0
A1, A2 are finite action sets
Z1, Z2 are finite observation sets
P(s'|s, a1, a2) is the state transition function
P(z1, z2 | s, a1, a2) is the observation function
r1(s, a1, a2) and r2(s, a1, a2) are reward functions

Special cases: all agents share the same reward function (cooperative); zero-sum games.

Page 109: POMDPs

Plans and policies

A local policy is a mapping πi : Zi* → Ai.
A joint policy is a pair ⟨π1, π2⟩.
Each agent wants to maximize its own long-term expected reward.
Although execution is distributed, planning can be centralized.

Page 110: POMDPs

Beliefs in POSGs

With a single agent, a belief is a distribution over states

How does this generalize to multiple agents?

Could have beliefs over beliefs over beliefs, but there is no algorithm for working with these

Page 111: POMDPs

Example

States: grid cell pairs
Actions: move north, south, east, west
Transitions: noisy
Goal: pick up balls
Observations: red lines (shown in the figure)

Page 112: POMDPs

Another Example

States: who has a message to send?
Actions: send or don't send
Reward: +1 for a successful broadcast, 0 if there is a collision or the channel is not used
Observations: was there a collision? (noisy)

[Figure: two nodes, each with a message queue, sharing a broadcast channel.]

Page 113: POMDPs

Strategy Elimination in POSGs

One could simply convert the game to normal form, but the number of strategies is doubly exponential in the horizon length.

[Figure: the normal-form payoff matrix, with one row per strategy of agent 1 and one column per strategy of agent 2; each entry holds the pair of rewards for the two agents.]

Page 114: POMDPs

Generalized dynamic programming

Initialize the 1-step policy trees to be the individual actions.
Repeat:
Evaluate all pairs of t-step trees from the current sets.
Iteratively prune dominated policy trees.
Form exhaustive sets of (t+1)-step trees from the remaining t-step trees.

Page 115: POMDPs

What Generalized DP Does

The algorithm performs iterated elimination of dominated strategies in the normal form game without first writing it down

For cooperative POSGs, the final sets contain the optimal joint policy

Page 116: POMDPs

Some Implementation Issues

As before, pruning can be done using linear programming.
The algorithm keeps value functions and policy trees in memory (unlike the POMDP case).
There is currently no way to prune in an incremental fashion.

Page 117: POMDPs

A Better Way to Do Elimination

We use dynamic programming to eliminate dominated strategies without first converting to normal form.

Pruning a subtree eliminates the set of trees containing it.

[Figure: pruning a one-step subtree eliminates every larger policy tree that contains that subtree.]

Page 118: POMDPs

Dynamic Programming

Build the policy tree sets for both agents simultaneously, and prune using a generalized belief space: for each agent, a distribution over the system states (s1, s2) and the other agent's policy trees.

[Figure: each agent's generalized belief space, with example policy trees (p1, p2 for agent 1; q1, q2 for agent 2) plotted as value vectors over it.]

Page 119: POMDPs

Dynamic Programming

[Figure: the initial one-step policy trees (actions a1 and a2) for each agent.]

Page 120: POMDPs

Dynamic Programming

[Figure: the exhaustive sets of two-step policy trees for both agents (every root action a1/a2 with every assignment of actions to the observation branches o1, o2), before pruning.]

Page 121: POMDPs

Dynamic Programming

[Figure: the two-step policy tree sets after a first round of pruning removes some dominated trees.]

Page 122: POMDPs

Dynamic Programming

[Figure: the policy tree sets after further pruning.]

Page 123: POMDPs

Dynamic Programming

[Figure: the policy tree sets continue to shrink as pruning is repeated.]

Page 124: POMDPs

Dynamic Programming

[Figure: the policy tree sets after additional pruning.]

Page 125: POMDPs

Dynamic Programming

Page 126: POMDPs

Complexity of POSGs

The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002].

Implications:
The problem is provably intractable (because P ≠ NEXP).
It probably requires doubly exponential time to solve in the worst case.