QUIZ!!

T/F: Optimal policies can be defined from an optimal value function. TRUE
T/F: “Pick the MEU action first, then follow the optimal policy” is optimal. TRUE
T/F: $\pi^*(s) = \max_{s'} V^*(s')$. FALSE
T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE
T/F: Value iteration: the policy cannot converge before the value function. FALSE

Explain the difference between Policy Iteration and Value Iteration. Why can Policy Iteration be faster than Value Iteration?


CS 511a: Artificial Intelligence, Spring 2013

Lecture 11: MDPs / Reinforcement Learning

Feb 25, 2013

Robert Pless

Course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore


Announcements

Project 2 due Thursday night.

HW 1 due Friday 5pm*

* Accepted with no penalty or late-day charge until Monday 10am.


Policy Iteration


Why do we compute $V^*$ or $Q^*$ if all we care about is the best policy $\pi^*$?

Utilities for Fixed Policies

Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.

Define the utility of a state s under a fixed policy $\pi$: $V^\pi(s)$ = expected total discounted rewards (return) starting in s and following $\pi$.

Recursive relation (one-step look-ahead / Bellman equation):

$$V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$$

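[Figure: one-step look-ahead tree from state s with value $V^\pi(s)$, through action a = $\pi(s)$ and q-state (s, a), to successor s’ with transition probability T(s,a,s’) and reward R(s,a,s’).]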

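Policy Evaluation

How do we calculate the V’s for a fixed policy?

Idea one: modify the Bellman updates, replacing the max over actions with the action the policy chooses:

$$V_0^\pi(s) = 0, \qquad V_{i+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_i^\pi(s') \right]$$

Idea two: at the solution the Bellman equation for a fixed policy holds with equality (a stationary point of the update). Without the max it is just a linear system in the unknowns $V^\pi(s)$; solve with Matlab (or whatever).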
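A minimal sketch of idea one in Python, assuming a made-up two-state MDP (the states `A` and `B`, the actions `stay` and `go`, the rewards, and the discount 0.5 are all hypothetical and not from the lecture). It repeats the fixed-policy update until the values stop changing.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "go": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 3.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.5  # assumed discount factor


def evaluate_policy(policy, T, gamma, tol=1e-10):
    """Repeat the fixed-policy Bellman update until the values stop changing."""
    V = {s: 0.0 for s in T}  # V_0(s) = 0
    while True:
        # One synchronous sweep: no max, just the action the policy picks.
        V_new = {
            s: sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][policy[s]])
            for s in T
        }
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new


policy = {"A": "go", "B": "stay"}
V = evaluate_policy(policy, T, GAMMA)
print(V)
```

For this toy policy, idea two can be checked by hand: the linear system is V(B) = 3 + 0.5 V(B) and V(A) = 0.5 V(B), so V(B) = 6 and V(A) = 3, which is what the iteration approaches.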


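Policy Iteration

Policy evaluation: with fixed current policy $\pi_k$, find values with simplified Bellman updates; iterate until values converge:

$$V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s') \right]$$

Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:

$$\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$$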
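The two alternating steps can be sketched as follows. This is an illustrative sketch only: the two-state MDP, the discount 0.5, and the fixed number of evaluation sweeps are assumptions, not part of the lecture.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "go": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 3.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.5  # assumed discount factor


def q_value(s, a, V):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])


def evaluate(policy, sweeps=200):
    """Policy evaluation: simplified Bellman updates with the policy frozen."""
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in T}
    return V


def policy_iteration():
    policy = {s: sorted(T[s])[0] for s in T}  # arbitrary starting policy
    while True:
        V = evaluate(policy)
        # Policy improvement: best action by one-step look-ahead on fixed V.
        improved = {
            s: max(sorted(T[s]), key=lambda a: q_value(s, a, V)) for s in T
        }
        if improved == policy:  # policy stable, so stop
            return policy, V
        policy = improved


best_policy, best_V = policy_iteration()
print(best_policy, best_V)
```

On this toy problem the policy changes only twice before it stabilizes, while each evaluation runs many sweeps; that is the pattern in which policy iteration tends to pay off.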


Comparison

In value iteration:
    Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
    Policy might not change between updates (wastes computation)

In policy iteration:
    Several passes to update utilities with frozen policy
    Occasional passes to update policies
    Value update can be solved as a linear system
    Can be faster, if policy changes infrequently

Hybrid approaches (asynchronous policy iteration):
    Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often


Asynchronous Value Iteration

In value iteration, we update every state in each iteration

Actually, any sequence of Bellman updates will converge if every state is visited infinitely often

In fact, we can update the policy as seldom or often as we like, and we will still converge

Idea: update states whose value we expect to change: if $|V_{i+1}(s) - V_i(s)|$ is large, then update predecessors of s
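One way to read the idea above is as a work queue: back up one state at a time, and re-queue its predecessors only when its value moved noticeably. The sketch below is a simplified first-in-first-out variant under assumed inputs (a made-up two-state MDP, discount 0.5, and a hypothetical `threshold` parameter); it is not a full prioritized scheme.

```python
from collections import deque

# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "go": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 3.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.5  # assumed discount factor


def backup(s, V):
    """Single-state Bellman update: best one-step look-ahead value."""
    return max(
        sum(p * (r + GAMMA * V[s2]) for s2, p, r in outcomes)
        for outcomes in T[s].values()
    )


# predecessors[s2] = states that can reach s2 in one step.
predecessors = {s: set() for s in T}
for s in T:
    for outcomes in T[s].values():
        for s2, p, _ in outcomes:
            if p > 0:
                predecessors[s2].add(s)


def async_value_iteration(threshold=1e-9):
    V = {s: 0.0 for s in T}
    queue = deque(T)   # every state gets at least one update
    queued = set(T)
    while queue:
        s = queue.popleft()
        queued.discard(s)
        new_value = backup(s, V)
        change = abs(new_value - V[s])
        V[s] = new_value
        if change > threshold:
            # The value moved, so states leading into s may now be stale.
            for pred in predecessors[s]:
                if pred not in queued:
                    queue.append(pred)
                    queued.add(pred)
    return V


V = async_value_iteration()
print(V)
```

The loop stops once no update changes a value by more than the threshold, so states whose neighbors have settled are never touched again.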

Reinforcement Learning


Reinforcement learning: still have an MDP:
    A set of states $s \in S$
    A set of actions (per state) A
    A model T(s,a,s’)
    A reward function R(s,a,s’)

Still looking for a policy $\pi(s)$

New twist: don’t know T or R
    I.e. don’t know which states are good or what the actions do
    Must actually try actions and states out to learn

[DEMO]

Example: Animal Learning

RL studied experimentally for more than 60 years in psychology
    Rewards: food, pain, hunger, drugs, etc.
    Mechanisms and sophistication debated

Example: foraging
    Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
    Bees have a direct neural connection from nectar intake measurement to motor planning area


Passive Learning

Simplified task
    You don’t know the transitions T(s,a,s’)
    You don’t know the rewards R(s,a,s’)
    You are given a policy $\pi(s)$
    Goal: learn the state values (what policy evaluation did)

In this case:
    Learner is “along for the ride”
    No choice about what actions to take
    Just execute the policy and learn from experience
    We’ll get to the active case soon
    This is NOT offline planning! You actually take actions in the world and see what happens…


Passive Model-Based Learning

Idea:
    Learn the model empirically through experience
    Solve for values as if the learned model were correct

Simple empirical model learning
    Count outcomes for each s,a
    Normalize to give estimate of T(s,a,s’)
    Discover R(s,a,s’) when we experience (s,a,s’)

Solving the MDP with the learned model
    Iterative policy evaluation, for example

[Figure: backup diagram for a fixed policy: s → (s, $\pi(s)$) → (s, $\pi(s)$, s’) → s’.]

Example: Model-Based Learning

[Figure: gridworld with axes x and y; exit rewards +100 at (4,3) and -100 at (4,2); $\gamma = 1$.]

Episodes:

Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)

Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)

Estimated transitions:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2
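The count-and-normalize step can be sketched directly on the two episodes above. The data layout (lists of `(state, action, reward)` steps) and the names `counts`, `R_hat`, and `T_hat` are choices made for this sketch, not something the slides prescribe.

```python
from collections import Counter, defaultdict

# The two episodes from the example, as (state, action, reward) steps.
episode1 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
            ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
            ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)]
episode2 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
            ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
            ((4, 2), "exit", -100)]

counts = defaultdict(Counter)  # counts[(s, a)][s2] = times s2 followed (s, a)
R_hat = {}                     # reward seen when experiencing (s, a, s2)

for episode in (episode1, episode2):
    # Pair each step with the step after it to recover the successor state.
    # The final "exit" step has no recorded successor, so it is not counted.
    for (s, a, r), (s2, _, _) in zip(episode, episode[1:]):
        counts[(s, a)][s2] += 1
        R_hat[(s, a, s2)] = r


def T_hat(s, a, s2):
    """Normalized outcome counts: the empirical estimate of T(s, a, s2)."""
    outcomes = counts[(s, a)]
    total = sum(outcomes.values())
    if total == 0:
        raise KeyError(f"no experience for {(s, a)}")
    return outcomes[s2] / total


print(T_hat((3, 3), "right", (4, 3)))  # one of three observed outcomes
print(T_hat((2, 3), "right", (3, 3)))  # two of two observed outcomes
```

The learned `T_hat` and `R_hat` could then be passed to iterative policy evaluation as if they were the true model.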


Passive Model-Free Learning

Big idea: why bother learning T?

1. Direct Estimation: average the observed returns from each state directly to estimate its expected discounted reward V(s). No need to compute T or R.

[Figure: backup diagram for a fixed policy: s → (s, $\pi(s)$) → s’.]

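Model-Free Learning

Want to compute an expectation weighted by P(x):

$$E[f(x)] = \sum_x P(x) f(x)$$

Model-based: estimate P(x) from k samples, compute expectation:

$$\hat{P}(x) = \frac{\text{count}(x)}{k}, \qquad E[f(x)] \approx \sum_x \hat{P}(x) f(x)$$

Model-free: estimate expectation directly from samples:

$$E[f(x)] \approx \frac{1}{k} \sum_{i=1}^{k} f(x_i)$$

Why does this work? Because samples appear with the right frequencies!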
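A small numerical check of that last point, using made-up samples and a made-up function `f` (both hypothetical): the model-based route and the model-free route give the same number on the same samples, because each value of x is counted exactly as often as it was drawn.

```python
from collections import Counter

# Hypothetical draws of x and hypothetical values f(x), for illustration only.
samples = ["a", "b", "a", "c", "a", "b"]
f = {"a": 10.0, "b": 0.0, "c": -5.0}
k = len(samples)

# Model-based: first estimate P(x) by counting, then take the weighted sum.
P_hat = {x: n / k for x, n in Counter(samples).items()}
model_based = sum(P_hat[x] * f[x] for x in P_hat)

# Model-free: average f over the samples directly, never forming P_hat.
model_free = sum(f[x] for x in samples) / k

print(model_based, model_free)
```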


Example: Model-Free Estimation

[Figure: gridworld with axes x and y; exit rewards +100 at (4,3) and -100 at (4,2); $\gamma = 1$, R = -1.]

Episodes:

Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)

Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)

V(2,3) ~ (96 + -103) / 2 = -3.5

V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
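The two averages in this example can be reproduced with a short sketch that walks each episode backwards, accumulating the return from every visit to a state. The data layout and variable names are choices made for this sketch; averaging over every visit (rather than only the first visit per episode) is what matches the three samples used for V(3,3).

```python
from collections import defaultdict

# The two episodes from the example, as (state, action, reward) steps.
episode1 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
            ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
            ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)]
episode2 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
            ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
            ((4, 2), "exit", -100)]
GAMMA = 1.0  # no discounting, as in the example

returns = defaultdict(list)  # returns[s] = observed returns from visits to s
for episode in (episode1, episode2):
    G = 0.0
    # Walk backwards so G is the discounted sum of rewards from this step on.
    for s, _, r in reversed(episode):
        G = r + GAMMA * G
        returns[s].append(G)

# Direct estimation: V(s) is the average of the returns observed from s.
V = {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(V[(2, 3)], V[(3, 3)])
```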


Sample-Based Policy Evaluation?

Update V without building T or R.

[Figure: backup diagram from s under action $\pi(s)$, with sampled successors s1’, s2’, s3’ standing in for the full set of outcomes s’.]

Passive Model-Free Learning

Big idea: why bother learning T?

1. Direct Estimation: average the observed returns from each state directly to estimate its expected discounted reward V(s). No need to compute T or R.

2. Temporal-Difference Learning: update the value function towards whatever successor occurs; maintain a running average.


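Temporal-Difference Learning

Big idea: learn from every experience!
    Update V(s) each time we experience (s,a,s’,r)
    Likely s’ will contribute updates more often

Temporal difference learning
    Policy still fixed!
    Move values toward value of whatever successor occurs: running average!

[Figure: backup diagram for a fixed policy: s → (s, $\pi(s)$) → s’.]

Sample of V(s): $\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$

Update to V(s): $V^\pi(s) \leftarrow (1 - \alpha)\, V^\pi(s) + \alpha \cdot \text{sample}$

Same update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( \text{sample} - V^\pi(s) \right)$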
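A minimal sketch of the update in code, assuming a hypothetical three-step chain of experience and a learning rate of 0.5 (neither is from the lecture). Values of unseen states, including the terminal marker `END`, default to 0.

```python
from collections import defaultdict


def td_update(V, s, r, s_next, alpha, gamma):
    """Move V(s) toward the one-step sample r + gamma * V(s_next)."""
    sample = r + gamma * V[s_next]
    V[s] += alpha * (sample - V[s])


# Hypothetical experience under a fixed policy: (state, reward, next state).
episode = [("A", -1.0, "B"), ("B", -1.0, "C"), ("C", 100.0, "END")]

V = defaultdict(float)  # every value starts at 0, including the terminal END
for _ in range(3):      # replay the same episode three times
    for s, r, s_next in episode:
        td_update(V, s, r, s_next, alpha=0.5, gamma=1.0)

print(dict(V))
```

Each replay moves V(C) half of the remaining distance toward 100, and that information reaches V(B) and V(A) only on later updates, one step at a time.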

Exponential Moving Average

Exponential moving average: $\bar{x}_n = (1 - \alpha)\, \bar{x}_{n-1} + \alpha\, x_n$
    Makes recent samples more important
    Forgets about the past (distant past values were wrong anyway)
    Easy to compute from the running average

Decreasing learning rate can give converging averages
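To illustrate the remark about decreasing learning rates, the sketch below runs the same running-average rule with two schedules on made-up samples: a rate of 1/n, which reproduces the ordinary mean, and a constant rate of 0.5, which leans toward recent samples. The sample values and the helper name `running_average` are assumptions for this sketch.

```python
def running_average(samples, alpha_fn):
    """Apply avg <- (1 - alpha) * avg + alpha * x, with alpha = alpha_fn(n)."""
    avg = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = alpha_fn(n)
        avg = (1 - alpha) * avg + alpha * x
    return avg


samples = [4.0, 8.0, 6.0, 2.0]  # hypothetical observations

# Decreasing rate 1/n: every sample ends up with equal weight (plain mean).
plain_mean = running_average(samples, lambda n: 1.0 / n)

# Constant rate: old samples decay geometrically, so the last sample dominates.
recent_heavy = running_average(samples, lambda n: 0.5)

print(plain_mean, recent_heavy)
```

With the constant rate, the final sample 2.0 carries half of the total weight, which is why the result sits below the plain mean.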


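Problems with TD Value Learning

TD value learning is a model-free way to do policy evaluation

However, if we want to turn values into a (new) policy, we’re sunk, because the one-step look-ahead needs T and R:

$$\pi(s) = \arg\max_a Q(s,a), \qquad Q(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$

Idea: learn Q-values directly
    Makes action selection model-free too!

[Figure: backup diagram: s → (s, a) → (s, a, s’) → s’.]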

Active Learning

Full reinforcement learning
    You don’t know the transitions T(s,a,s’)
    You don’t know the rewards R(s,a,s’)
    You can choose any actions you like
    Goal: learn the optimal policy (what value iteration did!)

In this case:
    Learner makes choices!
    Fundamental tradeoff: exploration vs. exploitation
    This is NOT offline planning! You actually take actions in the world and find out what happens…


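Detour: Q-Value Iteration

Value iteration: find successive approximations of the optimal values
    Start with $V_0^*(s) = 0$, which we know is right (why?)
    Given $V_i^*$, calculate the values for all states for depth i+1:

$$V_{i+1}^*(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_i^*(s') \right]$$

But Q-values are more useful!
    Start with $Q_0^*(s,a) = 0$, which we know is right (why?)
    Given $Q_i^*$, calculate the q-values for all q-states for depth i+1:

$$Q_{i+1}^*(s,a) \leftarrow \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_i^*(s',a') \right]$$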
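A sketch of Q-value iteration with a known model, assuming a made-up two-state MDP and a discount of 0.5 (both hypothetical). Once the q-values are available, the greedy policy is read off with a plain argmax and needs no further look-ahead through T or R.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "go": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 3.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.5  # assumed discount factor


def q_value_iteration(depth):
    Q = {(s, a): 0.0 for s in T for a in T[s]}  # Q_0(s, a) = 0
    for _ in range(depth):
        # Depth i+1 from depth i: expected reward plus the discounted
        # best q-value available in the successor state.
        Q = {
            (s, a): sum(
                p * (r + GAMMA * max(Q[(s2, a2)] for a2 in T[s2]))
                for s2, p, r in T[s][a]
            )
            for s in T
            for a in T[s]
        }
    return Q


Q = q_value_iteration(100)
# Action selection from q-values is a plain argmax over actions.
policy = {s: max(T[s], key=lambda a: Q[(s, a)]) for s in T}
print(Q, policy)
```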


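Q-Learning

Q-Learning: sample-based Q-value iteration

Learn Q*(s,a) values
    Receive a sample (s,a,s’,r)
    Consider your old estimate: $Q(s,a)$
    Consider your new sample estimate: $\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$
    Incorporate the new estimate into a running average: $Q(s,a) \leftarrow (1 - \alpha)\, Q(s,a) + \alpha \cdot \text{sample}$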

[DEMO – Grid Q’s]
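The Q-learning update can be sketched as a loop in which the learner only sees samples (s, a, s’, r). Several things here are assumptions made for illustration: the two-state environment hidden behind `step`, the discount 0.5, the learning rate 0.5, and the choice to pick actions uniformly at random (pure exploration) rather than balance exploration against exploitation.

```python
import random

# Hypothetical two-state environment, invented for illustration only.
# The learner never reads T directly; it only calls step() to get samples.
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "go": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 3.0)], "go": [("A", 1.0, 0.0)]},
}
GAMMA = 0.5  # assumed discount factor


def step(s, a, rng):
    """Environment: sample a successor state and reward for (s, a)."""
    outcomes = T[s][a]
    weights = [p for _, p, _ in outcomes]
    s2, _, r = rng.choices(outcomes, weights=weights)[0]
    return s2, r


def q_learning(steps=5000, alpha=0.5, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in T for a in T[s]}
    s = "A"
    for _ in range(steps):
        a = rng.choice(sorted(T[s]))  # pure exploration: random action
        s2, r = step(s, a, rng)       # receive a sample (s, a, s2, r)
        sample = r + GAMMA * max(Q[(s2, a2)] for a2 in T[s2])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample  # running average
        s = s2
    return Q


Q = q_learning()
greedy = {s: max(T[s], key=lambda a: Q[(s, a)]) for s in T}
print(Q, greedy)
```

Because this toy environment is deterministic, the running averages settle on the same q-values that Q-value iteration would compute from the model.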


Example (Tom Erez, Hopper): http://www.youtube.com/watch?feature=player_embedded&v=kUfmnoobTHQ#!
