Nash Q-Learning for General-Sum Stochastic Games
Hu & Wellman. Presented by Ilan Lobel, CS286r, March 6th, 2006.
Transcript
Page 1:

Nash Q-Learning for General-Sum Stochastic Games

Hu & Wellman

March 6th, 2006

CS286r

Presented by

Ilan Lobel

Page 2:

Outline

Stochastic Games and Markov Perfect Equilibria
Bellman’s Operator as a Contraction Mapping
Stochastic Approximation of a Contraction Mapping
Application to Zero-Sum Markov Games
Minimax-Q Learning
Theory of Nash-Q Learning
Empirical Testing of Nash-Q Learning

Page 3:

How do we model games that evolve over time?

Stochastic Games! Current Game = State

Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
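As a concrete illustration (an added sketch, not from the slides), these ingredients can be bundled into a small data structure for a finite two-player stochastic game; all names and array shapes below are assumptions.

```python
# Hypothetical representation of a finite two-player stochastic game.
from dataclasses import dataclass
import numpy as np

@dataclass
class StochasticGame:
    n_agents: int            # N
    n_states: int            # |S|
    n_actions: int           # actions per agent (equal for simplicity)
    rewards: np.ndarray      # R[agent, state, a1, a2]
    transitions: np.ndarray  # P[state, a1, a2, next_state]
    discount: float          # delta in (0, 1)

    def validate(self) -> None:
        # Each (state, a1, a2) slice of P must be a probability distribution.
        assert np.allclose(self.transitions.sum(axis=-1), 1.0)
        assert 0.0 < self.discount < 1.0
```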

Page 4:

Example of a Stochastic Game

[Figure: two stage games (states) connected by stochastic transitions]

State 1 (2×2 game): row actions A, B vs. column actions C, D; payoffs (1,2), (3,4), (5,6), (7,8)

State 2 (2×3 game): row actions A, B vs. column actions C, D, E; payoffs include (-1,2), (-3,4), (-5,6), (-7,8), (0,0), (-10,10)

Transitions: move to the other state with 30% probability when (B,D) is played, and with 50% probability when (A,C) or (A,D) is played

δ = 0.9

Page 5:

Markov Game is a Generalization of…

Repeated Games → (add states) → Markov Games

Page 6:

Markov Game is a Generalization of…

Repeated Games → (add states) → Markov Games

MDP → (add agents) → Markov Games

Page 7:

Markov Perfect Equilibrium (MPE)

Strategy maps states into randomized actions:
– πi: S → Δ(A)

No agent has an incentive to unilaterally change her policy.
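Written out as a formula (a paraphrase added here, not from the slide), with $v_i(s; \pi)$ denoting player $i$'s expected discounted payoff from state $s$ under the joint policy $\pi$:

$$\pi = (\pi_1,\dots,\pi_N) \text{ is an MPE if } \; v_i(s; \pi_i, \pi_{-i}) \;\ge\; v_i(s; \pi'_i, \pi_{-i}) \quad \text{for every player } i, \text{ every state } s, \text{ and every alternative policy } \pi'_i.$$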

Page 8:

Cons & Pros of MPEs

Cons:
– Can’t implement everything described by the Folk Theorems (i.e., no trigger strategies)

Pros:
– MPEs always exist in finite Markov Games (Fink, 64)
– Easier to “search for”

Page 9:

Learning in Stochastic Games

Learning is especially important in Markov Games because MPEs are hard to compute.

Do we know:
– Our own payoffs?
– Others’ rewards?
– Transition probabilities?
– Others’ strategies?

Page 10:

Learning in Stochastic Games

Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning

Page 11:

Zero-Sum Stochastic Games

Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– It has a Bellman’s-type equation.

Page 12:

Bellman’s Equation in DP

$$J^*(s) = \max_a \Big\{ r(s,a) + \delta \sum_{s'} P(s,a,s')\, J^*(s') \Big\}$$

Bellman Operator T:

$$(TJ)(s) = \max_a \Big\{ r(s,a) + \delta \sum_{s'} P(s,a,s')\, J(s') \Big\}$$

Bellman’s Equation Rewritten:

$$TJ^* = J^*$$
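To make the operator concrete, here is a short sketch (added for illustration, not from the slides) of one application of T for a finite MDP; the array shapes are assumptions.

```python
import numpy as np

def bellman_operator(J, r, P, delta):
    """(TJ)(s) = max_a { r(s,a) + delta * sum_s' P(s,a,s') * J(s') }.

    J: (S,) value function, r: (S, A) rewards, P: (S, A, S) transition probabilities.
    """
    continuation = P @ J                      # expected next value, shape (S, A)
    return np.max(r + delta * continuation, axis=1)
```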

Page 13:

Contraction Mapping

The Bellman Operator has the contraction property:

$$\max_s |(TJ)(s) - (TJ')(s)| \;\le\; \delta\, \max_s |J(s) - J'(s)|$$

Bellman’s Equation is a direct consequence of the contraction: by the Banach fixed-point theorem, a contraction has a unique fixed point J*, i.e. TJ* = J*.

Page 14:

The Shapley Operator for Zero-Sum Stochastic Games

$$(TJ)(s) = \max_{a^1} \min_{a^2} \Big\{ r(s,a^1,a^2) + \delta \sum_{s'} P(s,a^1,a^2,s')\, J(s') \Big\}$$

The Shapley Operator is a contraction mapping. (Shapley, 53)

Hence, it also has a fixed point, which is an MPE:

$$TJ^* = J^*$$
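A sketch of the Shapley operator in Python (added for illustration, assuming NumPy and SciPy are available). The inner matrix-game value is computed with the standard LP formulation; the deck notes this LP on the Nash-operator slide further below. Names and shapes are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes), via a linear program."""
    m, n = M.shape
    # Variables: mixed strategy x (m entries) and the game value v; maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])              # v <= x^T M[:, j] for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def shapley_operator(J, r, P, delta):
    """(TJ)(s) = value of the matrix game r(s,.,.) + delta * sum_s' P(s,.,.,s') * J(s').

    r: (S, A1, A2) rewards for player 1, P: (S, A1, A2, S) transitions.
    """
    TJ = np.empty(len(J))
    for s in range(len(J)):
        stage_game = r[s] + delta * P[s] @ J               # shape (A1, A2)
        TJ[s] = matrix_game_value(stage_game)
    return TJ
```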

Page 15:

Value Iteration for Zero-Sum Stochastic Games

Start with any $J_0$ and iterate

$$J_{k+1} = T J_k$$

– Direct consequence of contraction.
– Converges to the fixed point of the operator.
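A minimal value-iteration loop (an added sketch): repeatedly apply a contraction operator until the sup-norm change is small. For the zero-sum case one could pass, e.g., `lambda J: shapley_operator(J, r, P, delta)` using the function from the previous sketch.

```python
import numpy as np

def value_iteration(T, J0, tol=1e-8, max_iters=10_000):
    """Iterate J_{k+1} = T(J_k) until max_s |J_{k+1}(s) - J_k(s)| < tol."""
    J = np.asarray(J0, dtype=float)
    for _ in range(max_iters):
        J_next = T(J)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J
```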

Page 16:

Q-Learning

Another consequence of a contraction mapping:
– Q-Learning converges!

Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.

Page 17:

Q-Learning Convergence

Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.

It converges if all state-action pairs are visited infinitely often.

(Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
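For comparison, a single tabular Q-learning update with the 1/t learning rate mentioned above; this is a generic sketch added for illustration, not code from the paper.

```python
import numpy as np

def q_learning_update(Q, counts, s, a, r, s_next, delta):
    """One tabular Q-learning step. Q: (S, A) action values, counts: (S, A) visit counts."""
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]                 # learning rate 1/t for this (s, a) pair
    target = r + delta * np.max(Q[s_next])     # noisy sample of the Bellman backup
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```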

Page 18:

Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games

Initialize your $Q_0(s,a^1,a^2)$ for all states and actions. Update rule:

$$Q_{k+1}(s_k,a^1_k,a^2_k) = (1-\alpha_t)\, Q_k(s_k,a^1_k,a^2_k) + \alpha_t \Big[ r(s_k,a^1_k,a^2_k) + \delta \max_{u^1} \min_{u^2} Q_k(s_{k+1},u^1,u^2) \Big]$$

Player 1 then chooses action $u^1$ in the next stage $s_{k+1}$.
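A hedged sketch of the minimax-Q update from player 1's perspective; `game_value` stands for any routine returning the value of a zero-sum matrix game (for instance the LP-based `matrix_game_value` sketched earlier). Names and shapes are assumptions.

```python
def minimax_q_update(Q, counts, s, a1, a2, r, s_next, delta, game_value):
    """One minimax-Q step. Q: (S, A1, A2) stage-game payoffs for player 1."""
    counts[s, a1, a2] += 1
    alpha = 1.0 / counts[s, a1, a2]
    # The backup uses the minimax value of the next stage game, not a plain max.
    target = r + delta * game_value(Q[s_next])
    Q[s, a1, a2] = (1 - alpha) * Q[s, a1, a2] + alpha * target
```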

Page 19:

Minimax-Q Learning

It’s a Stochastic Iterative Approximation of the Shapley Operator.

It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)

Page 20:

Can we extend it to General-Sum Stochastic Games?

Yes & No. Nash-Q Learning is such an extension. However, it has much worse computational and theoretical properties.

Page 21:

Nash-Q Learning Algorithm

Initialize $Q^j_0(s,a^1,a^2)$ for all states and actions, and for every agent $j$.
– You must simulate everyone’s Q-factors.

Update rule:

$$Q^j_{k+1}(s_k,a^1_k,a^2_k) = (1-\alpha_t)\, Q^j_k(s_k,a^1_k,a^2_k) + \alpha_t \Big[ r_j(s_k,a^1_k,a^2_k) + \delta\, \mathrm{Nash}\big\{ Q_k(s_{k+1},u^1,u^2) \big\} \Big]$$

Choose the randomized action generated by the Nash operator.
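A hedged sketch of the Nash-Q update for two players; `stage_nash` is a placeholder for an equilibrium solver that returns one equilibrium payoff pair of the stage game at the next state, which is exactly the expensive and ambiguous step discussed on the following slides. Names and shapes are assumptions.

```python
def nash_q_update(Q, counts, s, a1, a2, rewards, s_next, delta, stage_nash):
    """One Nash-Q step. Q = (Q1, Q2), each of shape (S, A1, A2); rewards = (r1, r2)."""
    counts[s, a1, a2] += 1
    alpha = 1.0 / counts[s, a1, a2]
    # Equilibrium payoffs of the stage game defined by the current Q-factors at s_next.
    nash_payoffs = stage_nash(Q[0][s_next], Q[1][s_next])   # (v1, v2)
    for j in (0, 1):
        target = rewards[j] + delta * nash_payoffs[j]
        Q[j][s, a1, a2] = (1 - alpha) * Q[j][s, a1, a2] + alpha * target
```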

Page 22:

The Nash Operator and the Principle of Optimality

Nash Operator finds the Nash of a stage game: find the Nash of the stage game with the Q-factors as your payoffs.

$$\underbrace{r(s_k,a^1_k,a^2_k)}_{\text{current reward}} \;+\; \delta\, \underbrace{\mathrm{Nash}\big\{ Q_k(s_{k+1},u^1,u^2) \big\}}_{\text{payoffs for the rest of the Markov Game}}$$

Page 23:

The Nash Operator

Unknown complexity even for 2 players.

In comparison, the minimax operator can be solved in polynomial time (there is a linear programming formulation).

For convergence, all players must break ties in favor of the same Nash Equilibrium.

Why not go model-based if computation is so expensive?

Page 24:

Convergence Results

If every stage game encountered during learning has a global optimum, Nash-Q converges.

If every stage game encountered during learning has a saddle point, Nash-Q converges.

Both of these are VERY strong assumptions.
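For reference, roughly following the definitions used by Hu & Wellman (paraphrased here, so treat the exact statements as approximate), with $v_i(\sigma)$ denoting player $i$'s payoff in the stage game under joint strategy $\sigma$:

$$\text{Global optimal point: } v_i(\sigma) \ge v_i(\sigma') \;\text{ for every player } i \text{ and every joint strategy } \sigma'.$$

$$\text{Saddle point: a Nash equilibrium } \sigma \text{ with } v_i(\sigma_i, \sigma'_{-i}) \ge v_i(\sigma_i, \sigma_{-i}) \;\text{ for every } i \text{ and every } \sigma'_{-i}.$$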

Page 25:

Convergence Result Analysis

The global optimum assumption implies full cooperation between agents.

The saddle point assumption implies no cooperation between agents.

Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

Page 26:

Empirical Testing: The Grid-world

[Figure: Grid World 1, showing some of its Nash equilibria]

Page 27:

Empirical Testing: Nash Equilibria

[Figure: Grid World 2, showing all of its Nash equilibria, labeled (97%), (3%), and (3%)]

Page 28:

Empirical Performance

In very small and simple games, Nash-Q learning often converged even though the theory did not predict it.

In particular, if all Nash Equilibria have the same value, Nash-Q did better than expected.

Page 29:

Conclusions

Nash-Q is a nice step forward:
– It can be used for any Markov Game.
– It uses the Principle of Optimality in a smart way.

But there is still a long way to go:
– Convergence results are weak.
– There are no computational complexity results.