QUIZ!!
T/F: Optimal policies can be defined from an optimal value function. TRUE
T/F: “Pick the MEU action first, then follow the optimal policy” is optimal. TRUE
T/F: π*(s) = max s’ V*(s’). FALSE
T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE
T/F: Value Iteration: The policy cannot converge before the value function. FALSE
Explain the difference between Policy Iteration and Value Iteration. Why can Policy Iteration be faster than Value Iteration?
CS 511a: Artificial Intelligence, Spring 2013
Lecture 11: MDPs / Reinforcement Learning
Feb 25, 2013
Robert Pless
Course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore
Announcements
Project 2 due Thursday night. HW 1 due Friday 5pm*
* accepted no penalty or late-day charge until Monday 10am.
Policy Iteration
Why do we compute V* or Q* if all we care about is the best policy π*?
Utilities for Fixed Policies
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.
Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted reward starting in s and following π.
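Policy Evaluation
How do we calculate the V’s for a fixed policy π?
Idea one: modify the Bellman updates by dropping the max over actions and following π instead:
  $V^\pi_0(s) = 0$
  $V^\pi_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \, [R(s, \pi(s), s') + \gamma V^\pi_i(s')]$
Idea two: at the fixed point the update holds with equality, and without the max it is just a linear system in the values V^π(s); solve it with Matlab (or whatever).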
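Idea one can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the two-state MDP, the dictionary layout of `T` and `R`, the discount γ = 0.9, and the function name `evaluate_policy` are all assumptions made for the example.

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, tol=1e-10):
    """Iterative policy evaluation: Bellman updates with the action fixed by the policy."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            a = policy[s]  # no max over actions: the policy picks the action
            new_V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T[(s, a)].items())
        # stop once no value moved by more than tol in a full sweep
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical toy MDP: A moves to B for reward 0; B loops on itself for reward 1.
states = ["A", "B"]
policy = {"A": "go", "B": "stay"}
T = {("A", "go"): {"B": 1.0}, ("B", "stay"): {"B": 1.0}}
R = {("A", "go", "B"): 0.0, ("B", "stay", "B"): 1.0}

V = evaluate_policy(states, policy, T, R)
print(V)
```

For this toy problem the linear system of idea two can be solved by hand: V(B) = 1 + 0.9 V(B) gives V(B) = 10, and V(A) = 0.9 V(B) = 9, which is what the iteration approaches.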
Policy Iteration
Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates; iterate until the values converge.
Policy improvement: with the utilities fixed, find the best action for each state by one-step look-ahead.
Repeat the two steps until the policy stops changing.
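The evaluation and improvement steps can be combined into a short loop. The sketch below is one possible arrangement under assumed names: the toy MDP, the `actions` dictionary, γ = 0.9, and the helper `q` are hypothetical and chosen only to keep the example small.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, tol=1e-10):
    def q(s, a, V):
        # one-step look-ahead value of taking action a in state s
        return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in T[(s, a)].items())

    policy = {s: actions[s][0] for s in states}  # arbitrary starting policy
    while True:
        # Policy evaluation: simplified Bellman updates with the policy frozen
        V = {s: 0.0 for s in states}
        while True:
            new_V = {s: q(s, policy[s], V) for s in states}
            delta = max(abs(new_V[s] - V[s]) for s in states)
            V = new_V
            if delta < tol:
                break
        # Policy improvement: greedy one-step look-ahead on the fixed utilities
        new_policy = {s: max(actions[s], key=lambda a: q(s, a, V)) for s in states}
        if new_policy == policy:
            return policy, V
        policy = new_policy

# Hypothetical toy MDP: in A you may stay (reward 0) or go to B (reward 0);
# B loops on itself for reward 1.
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0}, ("B", "stay"): {"B": 1.0}}
R = {("A", "stay", "A"): 0.0, ("A", "go", "B"): 0.0, ("B", "stay", "B"): 1.0}

policy, V = policy_iteration(states, actions, T, R)
print(policy, V)
```

Starting from "stay" in A, a single improvement step switches A to "go", and the second improvement step confirms that nothing changes, so the loop stops after two rounds.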
Comparison
In value iteration:
  Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
  Policy might not change between updates (wastes computation)
In policy iteration:
  Several passes to update utilities with frozen policy
  Occasional passes to update policies
  Value update can be solved as a linear system
  Can be faster, if the policy changes infrequently
Hybrid approaches (asynchronous policy iteration):
  Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
Asynchronous Value Iteration
In value iteration, we update every state in each iteration.
Actually, any sequence of Bellman updates will converge if every state is visited infinitely often.
In fact, we can update the policy as seldom or as often as we like, and we will still converge.
Idea: update states whose value we expect to change: if $|V_{i+1}(s) - V_i(s)|$ is large, then update the predecessors of s.
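One way to act on this idea is to keep a priority queue of states ordered by how much a Bellman update would change them, and to re-queue the predecessors of any state whose value moved. The sketch below is an assumption about how that could look (it resembles what is usually called prioritized sweeping), not the lecture's algorithm; the toy MDP, the threshold `theta`, and all function names are hypothetical.

```python
import heapq

def backup(s, V, actions, T, R, gamma):
    # full Bellman update for one state: max over actions of the expected value
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in T[(s, a)].items())
               for a in actions[s])

def prioritized_value_iteration(states, actions, T, R, gamma=0.9, theta=1e-9):
    V = {s: 0.0 for s in states}
    # predecessors of s2: states with some action that can lead to s2
    preds = {s: set() for s in states}
    for (s, a), outcomes in T.items():
        for s2, p in outcomes.items():
            if p > 0:
                preds[s2].add(s)
    # queue states by Bellman error (negated, since heapq is a min-heap)
    heap = []
    for s in states:
        err = abs(backup(s, V, actions, T, R, gamma) - V[s])
        if err > theta:
            heapq.heappush(heap, (-err, s))
    while heap:
        _, s = heapq.heappop(heap)
        new_v = backup(s, V, actions, T, R, gamma)
        change = abs(new_v - V[s])
        V[s] = new_v
        if change > theta:
            # the value of s moved, so its predecessors may now be out of date
            for p in preds[s]:
                err = abs(backup(p, V, actions, T, R, gamma) - V[p])
                if err > theta:
                    heapq.heappush(heap, (-err, p))
    return V

# Hypothetical toy MDP (same shape as a two-state chain with an absorbing reward loop)
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0}, ("B", "stay"): {"B": 1.0}}
R = {("A", "stay", "A"): 0.0, ("A", "go", "B"): 0.0, ("B", "stay", "B"): 1.0}

V = prioritized_value_iteration(states, actions, T, R)
print(V)
```

States are updated one at a time and in no fixed order, yet the values still approach V(A) = 9 and V(B) = 10 for this example.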
Reinforcement Learning
Reinforcement learning: still have an MDP:
  A set of states s ∈ S
  A set of actions (per state) A
  A model T(s,a,s’)
  A reward function R(s,a,s’)
Still looking for a policy π(s)
New twist: don’t know T or R
  I.e. don’t know which states are good or what the actions do
  Must actually try actions and states out to learn
Demo
Example: Animal Learning
RL studied experimentally for more than 60 years in psychology
  Rewards: food, pain, hunger, drugs, etc.
  Mechanisms and sophistication debated
Example: foraging
  Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
  Bees have a direct neural connection from nectar intake measurement to motor planning area
Passive Learning
Simplified task
  You don’t know the transitions T(s,a,s’)
  You don’t know the rewards R(s,a,s’)
  You are given a policy π(s)
  Goal: learn the state values (this is what policy evaluation did)
In this case:
  Learner is “along for the ride”
  No choice about what actions to take
  Just execute the policy and learn from experience
  We’ll get to the active case soon
  This is NOT offline planning! You actually take actions in the world and see what happens…
Passive Model-Based Learning
Idea:
  Learn the model empirically through experience
  Solve for values as if the learned model were correct
Simple empirical model learning
  Count outcomes for each s,a
  Normalize to give estimate of T(s,a,s’)
  Discover R(s,a,s’) when we experience (s,a,s’)
Solving the MDP with the learned model
  Iterative policy evaluation, for example
Example: Model-Based Learning
Gridworld with coordinates (x, y); exit rewards +100 and -100; γ = 1.
Episode 1:
  (1,1) up -1
  (1,2) up -1
  (1,2) up -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up -1
  (3,3) right -1
  (4,3) exit +100
  (done)
Episode 2:
  (1,1) up -1
  (1,2) up -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up -1
  (4,2) exit -100
  (done)
Estimated transitions:
  T(<3,3>, right, <4,3>) = 1 / 3
  T(<2,3>, right, <3,3>) = 2 / 2
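The count-and-normalize step can be checked against the two episodes of this example. The sketch below is illustrative only: the tuple encoding of each step as (state, action, reward) and the function name `estimate_model` are assumptions, and the successor of a step is taken to be the state of the next step in the same episode.

```python
from collections import Counter, defaultdict

# The two episodes from the example, as (state, action, reward) steps
episodes = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
     ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
     ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]

def estimate_model(episodes):
    counts = defaultdict(Counter)  # (s, a) -> Counter of observed successors
    R = {}                         # (s, a, s2) -> observed reward
    for ep in episodes:
        # pair each step with the next one; the final exit step has no successor
        for (s, a, r), (s2, _, _) in zip(ep, ep[1:]):
            counts[(s, a)][s2] += 1
            R[(s, a, s2)] = r
    # normalize the counts into transition probability estimates
    T = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}
    return T, R

T, R = estimate_model(episodes)
print(T[((3, 3), "right")])
print(T[((2, 3), "right")])
```

Action "right" was taken three times in (3,3) and led to (4,3) once, giving the estimate 1/3; it was taken twice in (2,3) and led to (3,3) both times, giving 2/2.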
Passive Model-Free Learning
Big idea: why bother learning T?
1. Direct Estimation: average the observed returns (total discounted reward) from each state to estimate V(s) directly. No need to compute T or R.
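Model-Free Learning
Want to compute an expectation weighted by P(x):
  $E[f(x)] = \sum_x P(x) \, f(x)$
Model-based: estimate P(x) from samples, then compute the expectation:
  $x_i \sim P(x), \quad \hat{P}(x) = \mathrm{count}(x) / N, \quad E[f(x)] \approx \sum_x \hat{P}(x) \, f(x)$
Model-free: estimate the expectation directly from samples:
  $x_i \sim P(x), \quad E[f(x)] \approx \frac{1}{N} \sum_i f(x_i)$
Why does this work? Because samples appear with the right frequencies!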
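A small numerical check makes the "right frequencies" point concrete: on the same samples, the two estimates coincide. The sample list, the function `f`, and both function names below are made up for illustration.

```python
from collections import Counter

def model_based_expectation(samples, f):
    # first estimate P(x) by normalized counts, then take the weighted sum
    n = len(samples)
    p_hat = {x: c / n for x, c in Counter(samples).items()}
    return sum(p * f(x) for x, p in p_hat.items())

def model_free_expectation(samples, f):
    # average f over the samples directly, never forming P(x)
    return sum(f(x) for x in samples) / len(samples)

samples = [1, 1, 2, 3, 3, 3]  # hypothetical draws from some unknown P(x)
f = lambda x: x * x

mb = model_based_expectation(samples, f)
mf = model_free_expectation(samples, f)
print(mb, mf)
```

Each value x appears in the sample list count(x) times, so averaging f over the list weights f(x) by count(x)/N, which is exactly the estimated probability used by the model-based route.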
Example: Model-Free Estimation
Gridworld with coordinates (x, y); exit rewards +100 and -100; γ = 1, R = -1 per step.
Episode 1:
  (1,1) up -1
  (1,2) up -1
  (1,2) up -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up -1
  (3,3) right -1
  (4,3) exit +100
  (done)
Episode 2:
  (1,1) up -1
  (1,2) up -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up -1
  (4,2) exit -100
  (done)
Estimated values:
  V(2,3) ~ (96 + -103) / 2 = -3.5
  V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
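The two averages in this example can be reproduced by summing rewards backwards through each episode and averaging the return seen at every visit to a state. This is a sketch under assumptions: the (state, action, reward) encoding and the name `direct_estimate` are hypothetical, and every visit to a state is counted, which is what the three samples for (3,3) imply.

```python
from collections import defaultdict

# The two episodes from the example, as (state, action, reward) steps
episodes = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
     ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
     ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]

def direct_estimate(episodes, gamma=1.0):
    returns = defaultdict(list)  # state -> list of observed returns
    for ep in episodes:
        G = 0.0
        # walk backwards so G is the discounted reward from this step onward
        for s, _, r in reversed(ep):
            G = r + gamma * G
            returns[s].append(G)
    # average the returns recorded at every visit to each state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = direct_estimate(episodes)
print(V[(2, 3)], V[(3, 3)])
```

State (2,3) is visited once per episode with returns 96 and -103; state (3,3) is visited twice in the first episode (returns 97 and 99) and once in the second (return -102).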
Sample-Based Policy Evaluation?
Update V without building T or R.
Passive Model-Free Learning
Big idea: why bother learning T?
1. Direct Estimation: average the observed returns (total discounted reward) from each state to estimate V(s) directly. No need to compute T or R.
2. Temporal-Difference Learning: update the value function towards whatever successor occurs, maintaining a running average.
Temporal-Difference Learning
Big idea: learn from every experience!
  Update V(s) each time we experience (s,a,s’,r)
  Likely s’ will contribute updates more often
Temporal difference learning
  Policy still fixed!
  Move values toward value of whatever successor occurs: running average!
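Sample of V(s): $\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$
Update to V(s): $V^\pi(s) \leftarrow (1 - \alpha) \, V^\pi(s) + \alpha \cdot \text{sample}$
Same update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \, (\text{sample} - V^\pi(s))$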
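The update is a one-liner once a sample transition is in hand. The sketch below assumes a dictionary value table with unseen states defaulting to 0, a learning rate α = 0.5, γ = 1, and three made-up transitions; the function name `td_update` is hypothetical.

```python
def td_update(V, s, r, s2, alpha=0.5, gamma=1.0):
    """Move V[s] a fraction alpha of the way toward the sample r + gamma * V[s2]."""
    sample = r + gamma * V.get(s2, 0.0)  # unseen states are treated as value 0
    old = V.get(s, 0.0)
    V[s] = old + alpha * (sample - old)
    return V[s]

# Hypothetical experience: A -> B with reward -1, B -> end with reward 10, then A -> B again
V = {}
td_update(V, "A", -1, "B")     # sample = -1,     V[A] = -0.5
td_update(V, "B", 10, "end")   # sample = 10,     V[B] = 5.0
td_update(V, "A", -1, "B")     # sample = -1 + 5, V[A] = 1.75
print(V)
```

The second visit to A already benefits from what was learned about B, without waiting for the end of an episode and without any estimate of T or R.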
Exponential Moving Average
Exponential moving average: $\bar{x}_n = (1 - \alpha) \, \bar{x}_{n-1} + \alpha \, x_n$
  Makes recent samples more important
  Forgets about the past (distant past values were wrong anyway)
  Easy to compute from the running average
Decreasing learning rate can give converging averages
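The effect of the learning rate can be seen on two numbers. In this sketch the function name `running_average` and the choice of sample values are assumptions; the schedule α = 1/n is one common decreasing rate, and with it the running average equals the ordinary sample mean.

```python
def running_average(samples, alpha_fn):
    avg = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = alpha_fn(n)                   # learning rate for the n-th sample
        avg = (1 - alpha) * avg + alpha * x   # exponential moving average step
    return avg

samples = [96, -103]  # illustrative sample values
mean_like = running_average(samples, lambda n: 1 / n)  # decreasing rate
recent_heavy = running_average(samples, lambda n: 0.5) # constant rate
print(mean_like, recent_heavy)
```

With α = 1/n the result is -3.5, the plain mean of the two samples. With constant α = 0.5 the result is -27.5: the initial value 0 and the first sample are discounted, so the most recent sample dominates.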
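Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation
However, if we want to turn values into a (new) policy, we’re sunk, because the one-step look-ahead needs T and R:
  $\pi(s) = \arg\max_a Q^*(s, a)$
  $Q^*(s, a) = \sum_{s'} T(s, a, s') \, [R(s, a, s') + \gamma V^*(s')]$
Idea: learn Q-values directly
  Makes action selection model-free too!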
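With a table of Q-values, picking an action is just a lookup and a max; neither T nor R appears. The Q-table contents and the name `greedy_action` below are hypothetical.

```python
def greedy_action(Q, s, actions):
    # choose the action with the largest Q-value in state s; no model required
    return max(actions, key=lambda a: Q[(s, a)])

# Hypothetical learned Q-values for a single state
Q = {("s", "left"): 1.0, ("s", "right"): 2.5}
best = greedy_action(Q, "s", ["left", "right"])
print(best)
```

Doing the same from state values alone would require summing over successor states with T and R, which a model-free learner does not have.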
Active Learning
Full reinforcement learning
  You don’t know the transitions T(s,a,s’)
  You don’t know the rewards R(s,a,s’)
  You can choose any actions you like
  Goal: learn the optimal policy (this is what value iteration did!)
In this case:
  Learner makes choices!
  Fundamental tradeoff: exploration vs. exploitation
  This is NOT offline planning! You actually take actions in the world and find out what happens…
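Detour: Q-Value Iteration
Value iteration: find successive approximations to the optimal values
  Start with $V^*_0(s) = 0$, which we know is right (why?)
  Given $V^*_i$, calculate the values for all states for depth i+1:
    $V^*_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \, [R(s, a, s') + \gamma V^*_i(s')]$
But Q-values are more useful!
  Start with $Q^*_0(s, a) = 0$, which we know is right (why?)
  Given $Q^*_i$, calculate the q-values for all q-states for depth i+1:
    $Q^*_{i+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \, [R(s, a, s') + \gamma \max_{a'} Q^*_i(s', a')]$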
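Q-value iteration can be sketched directly from the depth-(i+1) recurrence. As before, this is an illustration under assumptions rather than the course's implementation: the toy MDP, γ = 0.9, the fixed iteration count, and the name `q_value_iteration` are all hypothetical, and T and R are assumed known here (this is still planning, not learning).

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=500):
    Q = {(s, a): 0.0 for s in states for a in actions[s]}  # depth-0 q-values
    for _ in range(iterations):
        # build depth i+1 from depth i; the right-hand side reads only the old Q
        Q = {(s, a): sum(p * (R[(s, a, s2)]
                              + gamma * max(Q[(s2, a2)] for a2 in actions[s2]))
                         for s2, p in T[(s, a)].items())
             for (s, a) in Q}
    return Q

# Hypothetical toy MDP: in A you may stay (reward 0) or go to B (reward 0);
# B loops on itself for reward 1.
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0}, ("B", "stay"): {"B": 1.0}}
R = {("A", "stay", "A"): 0.0, ("A", "go", "B"): 0.0, ("B", "stay", "B"): 1.0}

Q = q_value_iteration(states, actions, T, R)
print(Q)
```

The max has moved inside the sum: it is taken over the next state's actions rather than the current state's. For this example the q-values approach Q(B, stay) = 10, Q(A, go) = 9, and Q(A, stay) = 8.1.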