Approximate Solutions for Partially Observable Stochastic Games with Common Payoffs

Rosemary Emery-Montemerlo

joint work with

Geoff Gordon, Jeff Schneider and Sebastian Thrun

July 21, 2004 AAMAS 2004


Robot Teams

With limited communication, existing paradigms for decentralized robot control are not sufficient

Game theoretic methods are necessary for multi-robot coordination under these conditions

Decentralized Decision Making

A robot cannot choose actions based only on joint observations consistent with its own sensor readings

It must consider all joint observations that are consistent with its possible sensor readings

Relationship Between Decision Theoretic Models

[Figure: diagram relating MDP (state space), POMDP (belief space), and "?" (distribution over belief space)]

Models of Multi-Agent Systems

Partially observable stochastic games: generalization of stochastic games to partially observable worlds

Related models:

DEC-POMDP [Bernstein et al., 2000]

MTDP [Pynadath and Tambe, 2002]

I-POMDP [Gmytrasiewicz and Doshi, 2004]

POIPSG [Peshkin et al., 2000]

Partially Observable Stochastic Games

POSG = $\{I, S, A, Z, T, R, O\}$

$I$ is the set of agents, $I = \{1, \ldots, n\}$

$S$ is the set of states

$A$ is the set of actions, $A = A_1 \times \cdots \times A_n$

$Z$ is the set of observations, $Z = Z_1 \times \cdots \times Z_n$

$T$ is the transition function, $T: S \times A \to S$

$R$ is the reward function, $R: S \times A \to \mathbb{R}$

$O$ are the observation emission probabilities, $O: S \times Z \times A \to [0,1]$
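As a rough illustration of the tuple above, a POSG can be held in a small container. This is a minimal sketch, not the authors' implementation: the class name `POSG`, its field names, and the toy two-robot values are all assumptions made for illustration, and the transition is written as a deterministic function only to mirror the slide's signature for T.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple

JointAction = Tuple[str, ...]
JointObs = Tuple[str, ...]


@dataclass(frozen=True)
class POSG:
    """Hypothetical container for the tuple {I, S, A, Z, T, R, O}."""
    agents: Sequence[int]                                    # I
    states: Sequence[int]                                    # S
    actions: Sequence[Sequence[str]]                         # A_i for each agent
    observations: Sequence[Sequence[str]]                    # Z_i for each agent
    transition: Callable[[int, JointAction], int]            # T
    reward: Callable[[int, JointAction], float]              # R
    emission: Callable[[int, JointObs, JointAction], float]  # O

    def joint_actions(self) -> List[JointAction]:
        # A = A_1 x ... x A_n
        return list(product(*self.actions))

    def joint_observations(self) -> List[JointObs]:
        # Z = Z_1 x ... x Z_n
        return list(product(*self.observations))


# Toy values, invented for illustration only.
toy = POSG(
    agents=[1, 2],
    states=[0, 1],
    actions=[["stay", "move"], ["stay", "move"]],
    observations=[["nothing", "intruder"], ["nothing", "intruder"]],
    transition=lambda s, a: 1 - s if "move" in a else s,
    reward=lambda s, a: 1.0 if s == 1 else 0.0,
    emission=lambda s, z, a: 0.25,  # uniform over the 4 joint observations
)

print(len(toy.joint_actions()))       # 4
print(len(toy.joint_observations()))  # 4
```

The joint action and observation sets grow as products of the per-agent sets, which is one reason the full game becomes large quickly as agents are added.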

Solving POSGs

POSGs are computationally infeasible to solve

Solving POSGs

Full POSG

One-Step Lookahead Game at time t (Bayesian Game)

We can approximate a POSG as a series of smaller Bayesian games

Bayesian Games

Private information relevant to game

Uncertainty in utility

Type

Encapsulates private information

Will limit ourselves to games with a finite number of types

In robot example

Type 1: Robot doesn’t see anything

Type 2: Robot sees intruder at location x

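Bayesian Games

BG = $\{I, \Theta, A, p(\Theta), u\}$

$\Theta$ is the joint type space, $\Theta = \Theta_1 \times \cdots \times \Theta_n$

$\theta$ is a specific joint type, $\theta = \{\theta_1, \ldots, \theta_n\}$

$p(\Theta)$ is the common prior on the distribution over $\Theta$

$u$ is the utility function, $u = \{u_1, \ldots, u_n\}$, with $u_i(a_i, a_{-i}, (\theta_i, \theta_{-i}))$

$\sigma_i$ is a strategy for player $i$

Defines what player $i$ does for each of its possible types

Actions are individual actions, not joint actions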
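To make the pieces of the definition concrete, the expected utility of a set of strategies under a common prior can be computed directly. This is a minimal sketch under stated assumptions: the function name `expected_utility`, the two types `"none"` and `"seen"`, the actions `"wait"` and `"chase"`, the uniform prior, and the toy payoff are all invented for illustration and do not come from the slides.

```python
def expected_utility(prior, utility, strategies):
    """Expected common payoff of one strategy per agent.

    prior:      dict mapping a joint type (tuple) to its probability
    utility:    function (joint_action, joint_type) -> float
    strategies: list with one dict per agent, mapping type -> individual action
    """
    total = 0.0
    for joint_type, prob in prior.items():
        # Each agent picks its own action from its own type only.
        joint_action = tuple(s[t] for s, t in zip(strategies, joint_type))
        total += prob * utility(joint_action, joint_type)
    return total


TYPES = ("none", "seen")  # hypothetical: robot sees nothing / sees the intruder
prior = {(t1, t2): 0.25 for t1 in TYPES for t2 in TYPES}


def utility(joint_action, joint_type):
    # Toy common payoff: fraction of robots acting appropriately for their type.
    good = {"none": "wait", "seen": "chase"}
    hits = sum(a == good[t] for a, t in zip(joint_action, joint_type))
    return hits / len(joint_action)


informed = {"none": "wait", "seen": "chase"}
always_wait = {"none": "wait", "seen": "wait"}

print(expected_utility(prior, utility, [informed, informed]))        # 1.0
print(expected_utility(prior, utility, [always_wait, always_wait]))  # 0.5
```

Note that each strategy maps an agent's own type to an individual action; the joint action only appears once the strategies are combined over a joint type.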

Bayesian-Nash Equilibrium

Set of best response strategies

Each agent tries to maximize its expected utility conditioned on its probability distribution over the other agents’ types, $p(\Theta)$

Each agent has a policy $\sigma_i$ that, given $\sigma_{-i}$, maximizes $u_i(\sigma_i, \sigma_{-i}, \theta_{-i})$
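One way to reach a set of mutual best responses in a common-payoff game is alternating maximization: hold all strategies but one fixed, improve that one, and repeat until nothing changes. The slides do not say how the equilibrium is computed, so this technique is an assumption used here for illustration; the function names, types, actions, and toy payoff below are likewise hypothetical. Because the payoff is shared and a strategy is only replaced when it is strictly better, each update cannot lower the expected utility, and the loop stops at strategies that are best responses to one another.

```python
def best_response(i, prior, utility, strategies, actions):
    """Agent i's best response, type by type, to the others' fixed strategies."""
    improved = {}
    for ti in sorted({jt[i] for jt in prior}):
        def value(a):
            # Expected payoff of playing `a` when agent i has type ti.
            v = 0.0
            for jt, p in prior.items():
                if jt[i] != ti:
                    continue
                ja = tuple(a if k == i else strategies[k][jt[k]]
                           for k in range(len(jt)))
                v += p * utility(ja, jt)
            return v

        current = strategies[i][ti]
        best = max(actions[i], key=value)
        # Switch only on strict improvement so the loop must terminate.
        improved[ti] = best if value(best) > value(current) else current
    return improved


def alternating_maximization(prior, utility, strategies, actions, max_sweeps=100):
    strategies = [dict(s) for s in strategies]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(strategies)):
            br = best_response(i, prior, utility, strategies, actions)
            if br != strategies[i]:
                strategies[i] = br
                changed = True
        if not changed:
            break
    return strategies


# Toy two-robot game, invented for illustration only.
TYPES = ("none", "seen")
prior = {(t1, t2): 0.25 for t1 in TYPES for t2 in TYPES}
actions = [["wait", "chase"], ["wait", "chase"]]


def utility(joint_action, joint_type):
    good = {"none": "wait", "seen": "chase"}
    return sum(a == good[t] for a, t in zip(joint_action, joint_type)) / 2


start = [{"none": "wait", "seen": "wait"}, {"none": "wait", "seen": "wait"}]
print(alternating_maximization(prior, utility, start, actions))
```

In general this procedure finds a local optimum that depends on the starting strategies, so restarts from several initial strategies are a common addition.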

POSG to Bayesian Game Approximation

$\{I, S, A, Z, T, R, O\}$ to $\{I, \Theta, A, p(\Theta), u\}^t$

$I = I$

$A = A$

Type space $\Theta_i^t$ = all possible histories of agent $i$’s actions and observations up to time $t$

$p(\Theta)^t$ calculated from $S_0, A, T, Z, O, \Theta^{t-1}$

Prune low probability types

Each joint type maps to a joint belief

$u$ given by heuristic (QMDP) and $u_i = u_j$
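The QMDP heuristic named above can be written out as follows. This is the standard QMDP form, stated under two assumptions: $b_\theta$ denotes the joint belief that joint type $\theta$ maps to (a symbol introduced here for illustration, not taken from the slides), and $Q_{MDP}(s, a)$ is the optimal value of joint action $a$ in state $s$ of the underlying fully observable MDP.

```latex
u(a, \theta) = \sum_{s \in S} b_\theta(s) \, Q_{MDP}(s, a)
```

Because payoffs are common, the same $u$ serves every agent, which matches the condition $u_i = u_j$.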

Algorithm

[Flowchart: Agent i and Agent j each run the same steps; shown here for agent i]

Initialize: $t = 0$, $h_i = \{\}$, $p(\Theta^0)$, $\sigma^0 = solveGame(\Theta^0, p(\Theta^0))$

Make Observation: $h_i = obs_i^t \cup a_i^{t-1} \cup h_i$

Determine Type: $\theta_i^t = bestMatch(h_i, \Theta_i^t)$

Execute Action: $a_i^t = \sigma_i^t(\theta_i^t)$

Propagate Forward: $\Theta^{t+1}$, $p(\Theta^{t+1})$

Find Policy for $t+1$: $\sigma^{t+1} = solveGame(\Theta^{t+1}, p(\Theta^{t+1}))$, then $t = t + 1$
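The per-agent loop in the flowchart can be sketched as a single function that each agent runs on its own. This is only a skeleton under stated assumptions: `solveGame` and `bestMatch` are named on the slide, but their signatures here are guesses, and `run_agent`, `propagate`, `observe`, and `execute` are hypothetical names introduced for illustration. In particular, passing the current policy into `propagate` is an assumption, and the stand-in functions at the bottom do no real work.

```python
def run_agent(i, horizon, types, prior, solve_game, best_match,
              propagate, observe, execute):
    """Run agent i's copy of the loop and return the actions it took."""
    history = []                       # Initialize: h_i = {}
    policy = solve_game(types, prior)  # Initialize: policy for t = 0
    last_action = None
    taken = []
    for t in range(horizon):
        # Make Observation: add the new observation and previous action to h_i.
        history.append((observe(i, t), last_action))
        # Determine Type: find the retained type that best matches h_i.
        theta = best_match(history, types[i])
        # Execute Action: look up the individual action for that type.
        action = policy[i][theta]
        execute(i, action)
        taken.append(action)
        last_action = action
        # Propagate Forward: next joint type space and its distribution.
        types, prior = propagate(types, prior, policy)
        # Find Policy for t + 1.
        policy = solve_game(types, prior)
    return taken


# Trivial stand-ins so the sketch runs; real versions would do the actual work.
types0 = [["none", "seen"], ["none", "seen"]]
prior0 = {(a, b): 0.25 for a in types0[0] for b in types0[1]}


def solve(types, prior):
    return [{th: ("chase" if th == "seen" else "wait") for th in ts}
            for ts in types]


def match(history, own_types):
    return history[-1][0]  # pretend the latest observation is the type


def unchanged(types, prior, policy):
    return types, prior


def sense(i, t):
    return "seen" if t == 1 else "none"


log = []
actions = run_agent(0, 3, types0, prior0, solve, match, unchanged, sense,
                    lambda i, a: log.append((i, a)))
print(actions)  # ['wait', 'chase', 'wait']
```

Only the observation and the type lookup depend on the agent's private information; the game solving and propagation steps take no private input in this sketch.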

Robotic Team Tag

Version of Team Tag

Environment is portion of Gates Hall

Full teammate observability

Opponent can be captured by a single robot in any state

QMDP used as heuristic

Two Pioneer-class robots

Robot Policies

Lady And The Tiger [Nair et al. 2003]

[Figure: Computation Time (y-axis, ticks from 100000 to 1000000) versus Horizon (x-axis, 3 to 10), comparing Full POSG and Bayesian Game Approximation]

Contributions

Algorithm for finding approximate solutions to POSGs with common payoffs

Tractability achieved by modeling POSG as a sequence of Bayesian games

Performs comparably to the full POSG for a small finite-horizon problem

Improved performance over ‘blind’ application of utility heuristic in more complex problems

Successful real-time game-theoretic controller for indoor robots

Questions?

remery@cs.cmu.edu www.cs.cmu.edu/~remery

Back-Up Slides

[Figure: Policy Performance versus Horizon (x-axis, 3 to 10), comparing Full POSG, Bayesian Game Approximation, and Selfish Policy]

Lady And The Tiger [Nair et al. 2003]

Robotic Team Tag

$I = \{1, 2\}$

$S = S_1 \times S_2 \times S_{opponent}$

$S_i = \{s_0, \ldots, s_{28}\}$, $S_{opponent} = \{s_0, \ldots, s_{28}, s_{tagged}\}$

$|S| = 25230$

$A_i = \{N, S, E, W, Tag\}$

$Z_i = [\{s_i, -1\}, s_{-i}, a_{-i}]$

$T$: adjacent cells

$O$: see opponent if on same cell

$R$: minimize capture time

Modified from [Pineau et al. 2003]
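The state count follows directly from the product structure: 29 cells for each robot and 30 values for the opponent (29 cells plus the tagged state). A short check, with variable names chosen only for illustration:

```python
from itertools import product

robot_cells = range(29)                          # s_0 ... s_28 for each robot
opponent_states = list(range(29)) + ["tagged"]   # s_0 ... s_28 plus tagged
individual_actions = ["N", "S", "E", "W", "Tag"]

states = list(product(robot_cells, robot_cells, opponent_states))
joint_actions = list(product(individual_actions, repeat=2))

print(len(states))         # 25230 = 29 * 29 * 30
print(len(joint_actions))  # 25 = 5 * 5
```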

Environment

Robotic Team Tag Results

[Figure: Performance with and without full observability of teammate's position, comparing Full Observability, Most Likely State, and Bayesian Approximation]

