Page 1
COMP 4180: Intelligent Mobile Robotics
Reinforcement Learning
Jacky Baltes
Department of Computer Science
University of Manitoba
Email: [email protected]
http://www4.cs.umanitoba.ca/~jacky/...Teaching/Courses/COMP_4180-IntelligentMobileRobotics/current/index.php
Page 2
Outline
● Reinforcement Learning Problem
– Dynamic Programming
– Control learning
– Control policies that choose optimal actions
– Q Learning
– Convergence
● Monte-Carlo Methods
● Temporal Difference Learning
Page 4
Example: TD-Gammon
Page 5
Reinforcement Learning Problem
Page 6
Markov Decision Processes
Page 7
Agent's Learning Task
Page 8
State Value Function
Page 9
Bellman Equation (Deterministic Case)
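The body of this slide did not survive extraction. For a deterministic MDP with reward function r(s, a), transition function delta(s, a), and discount factor γ (the notation these slides use elsewhere, e.g. "delta^: S x A -> S"), the Bellman equation for a fixed policy π is conventionally written as:

```latex
V^{\pi}(s) \;=\; r\bigl(s,\pi(s)\bigr) \;+\; \gamma\, V^{\pi}\bigl(\delta(s,\pi(s))\bigr)
```

That is, the value of a state is the immediate reward of the policy's action plus the discounted value of the deterministic successor state.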
Page 12
Iterative Policy Evaluation
Page 13
Iterative Policy Evaluation
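Only the title survives here, so as a concrete illustration: iterative policy evaluation repeatedly sweeps the state space, replacing each V(s) with the one-step lookahead under the fixed policy. Below is a minimal sketch on an invented 4-state deterministic chain; the MDP, policy, and constants are assumptions for illustration, not from the slides.

```python
# Iterative policy evaluation on a tiny deterministic chain MDP
# (states 0..3, state 3 terminal; environment invented for illustration).
GAMMA = 0.9
N_STATES = 4

def delta(s):                     # fixed policy: always move right
    return min(s + 1, 3)

def reward(s):                    # reward 1 for the step that reaches the goal
    return 1.0 if delta(s) == 3 and s != 3 else 0.0

V = [0.0] * N_STATES
for _ in range(100):              # sweep until (approximately) converged
    for s in range(N_STATES - 1): # terminal state keeps V = 0
        V[s] = reward(s) + GAMMA * V[delta(s)]
```

After a few sweeps the values stop changing, giving V = [0.81, 0.9, 1.0, 0.0] for this toy chain.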
Page 15
Q (Action-Value) Function
Page 16
Q (Action-Value) Function
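As a companion to the state value function above, the action-value function in the deterministic case is conventionally defined as follows (assuming the same r, delta, and γ notation as the rest of the deck):

```latex
Q^{\pi}(s,a) \;=\; r(s,a) \;+\; \gamma\, V^{\pi}\bigl(\delta(s,a)\bigr),
\qquad
V^{\pi}(s) \;=\; Q^{\pi}\bigl(s,\pi(s)\bigr)
```

Q evaluates taking an arbitrary first action a and then following π; evaluating it at a = π(s) recovers V.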
Page 18
Bellman Equation (Deterministic Case)
Page 19
Optimal Value Functions
Page 20
Policy Improvement
Page 23
Generalized Policy Iteration
Page 24
Value Iteration / Q-Learning
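In the deterministic setting, the Q-learning backup needs no learning rate: each update Q(s,a) ← r + γ max over a' of Q(s',a') is exact. A toy sketch on an invented 1-D chain follows; the environment, seed, and episode count are all illustrative assumptions.

```python
# Deterministic-case Q-learning on a 1-D chain (invented toy environment).
import random

GAMMA = 0.9
N, ACTIONS = 4, (-1, +1)          # states 0..3, terminal at 3; left/right
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def step(s, a):                   # deterministic transition and reward
    s2 = max(0, min(s + a, N - 1))
    return s2, (1.0 if s2 == N - 1 and s != N - 1 else 0.0)

random.seed(0)
for _ in range(500):              # episodes of pure random exploration
    s = 0
    while s != N - 1:
        a = random.choice(ACTIONS)
        s2, r = step(s, a)
        # deterministic-case update: no learning rate needed
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2
```

With enough random episodes, the "move right" values converge to 0.81, 0.9, and 1.0 for states 0, 1, and 2, matching γ-discounting of the goal reward.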
Page 25
Non-deterministic Case
Page 26
Bellman Equations (Non-deterministic Case)
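In the non-deterministic case, the successor state is drawn from a transition distribution P(s' | s, a), so the Bellman equations take an expectation over successors. For a deterministic policy π, the standard forms are:

```latex
V^{\pi}(s) \;=\; \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\,
  \bigl[\, r\bigl(s,\pi(s),s'\bigr) + \gamma\, V^{\pi}(s') \,\bigr]
```

```latex
Q^{\pi}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,
  \bigl[\, r(s,a,s') + \gamma\, Q^{\pi}\bigl(s',\pi(s')\bigr) \,\bigr]
```

Setting P(s' | s, a) to 1 for s' = delta(s, a) and 0 otherwise recovers the deterministic equations from earlier slides.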
Page 27
Value Iteration / Q-Learning
Page 30
Reinforcement Learning
Page 31
Monte-Carlo Methods: Policy Evaluation
Page 32
Monte-Carlo Method: Policy Evaluation
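Since only the title survives: first-visit Monte-Carlo policy evaluation estimates V(s) as the average of the returns observed after first visits to s, using complete episodes and no bootstrapping. The sketch below uses hand-made (state, reward) episodes invented for illustration.

```python
# First-visit Monte-Carlo policy evaluation on invented toy episodes.
from collections import defaultdict

GAMMA = 1.0
returns_sum = defaultdict(float)
returns_cnt = defaultdict(int)
V = {}

def mc_update(episode):
    """episode: list of (state, reward received on leaving that state)."""
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):   # walk backwards
        s, r = episode[t]
        G = GAMMA * G + r                       # return from time t
        if s not in (e[0] for e in episode[:t]):  # first visit to s?
            returns_sum[s] += G
            returns_cnt[s] += 1
            V[s] = returns_sum[s] / returns_cnt[s]

mc_update([("A", 0.0), ("B", 1.0)])
mc_update([("A", 0.0), ("B", 0.0)])
```

After these two episodes (returns 1 and 0 from both states), the averages give V["A"] = V["B"] = 0.5.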
Page 33
Temporal Difference (TD) Learning
Page 34
TD(0): Policy Evaluation
Page 35
TD(0): Policy Evaluation
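The TD(0) rule updates V(s) toward the bootstrapped target r + γ V(s') after every single transition, rather than waiting for the episode to end. A one-transition sketch (the states, values, α, and γ below are invented for illustration):

```python
# TD(0) policy-evaluation update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
ALPHA, GAMMA = 0.1, 0.9
V = {"A": 0.0, "B": 0.5}          # invented current estimates

def td0_update(s, r, s2):
    V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])

td0_update("A", 1.0, "B")         # observed transition: A --r=1--> B
```

The bracketed term is the TD error; here it is 1 + 0.9 × 0.5 − 0 = 1.45, so V["A"] moves to 0.145.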
Page 37
SARSA Policy Iteration
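SARSA is the on-policy TD control method: its target uses the action a' that the behaviour policy actually takes in s' (hence the name, from the quintuple s, a, r, s', a'). A single-update sketch with invented states and values:

```python
# One SARSA update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]
ALPHA, GAMMA = 0.5, 0.9
Q = {("s1", "right"): 0.0, ("s2", "right"): 1.0}   # invented estimates

def sarsa_update(s, a, r, s2, a2):
    # a2 is the action actually chosen next -- on-policy bootstrapping
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])

sarsa_update("s1", "right", 0.0, "s2", "right")
```

Here the update moves Q(s1, right) to 0.5 × (0 + 0.9 × 1.0) = 0.45.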
Page 39
SARSA Example: V(s)
Page 40
SARSA Example: Q(s,a)
Page 41
Rotational Inverted Pendulum
Rotational Inverted Pendulum Stabilization Demo, Tor Aamodt
http://www.eecg.utoronto.ca/~aamodt/BAScThesis/RLsim.htm
Page 42
Q-Learning (Off-Policy TD)
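Q-learning differs from SARSA only in the bootstrapped target: it uses the greedy max over next actions, regardless of which action the behaviour policy takes next, which is what makes it off-policy. A single-update sketch with invented states and values:

```python
# One Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_{a'} Q(s',a') - Q(s,a)]
ALPHA, GAMMA = 0.5, 0.9
Q = {("s1", "right"): 0.0,
     ("s2", "left"): 0.2, ("s2", "right"): 1.0}   # invented estimates

def q_update(s, a, r, s2, next_actions):
    best = max(Q[(s2, a2)] for a2 in next_actions)  # greedy, off-policy target
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

q_update("s1", "right", 0.0, "s2", ("left", "right"))
```

Even if an exploring policy would take "left" in s2, the target still bootstraps from the greedy value 1.0, giving Q(s1, right) = 0.45.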
Page 43
Q-Learning (Off-Policy Iteration)
Page 44
TD vs Monte Carlo
Page 45
Temporal Difference Learning
Page 46
Monte Carlo Method
Page 49
Eligibility Traces
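Eligibility traces let one TD error update every recently visited state at once, interpolating between TD(0) and Monte Carlo. In the backward view of TD(λ), each visited state's trace is bumped and all traces decay by γλ per step. A sketch on an invented two-step episode (all states and numbers are mine, not from the slides):

```python
# TD(lambda) with accumulating eligibility traces (backward view),
# run on one invented two-step episode A -> B -> C.
ALPHA, GAMMA, LAM = 0.1, 0.9, 0.8
V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {s: 0.0 for s in V}           # eligibility traces

def td_lambda_step(s, r, s2):
    delta = r + GAMMA * V[s2] - V[s]   # ordinary TD error
    e[s] += 1.0                        # accumulate trace for visited state
    for x in V:                        # every state updated via its trace
        V[x] += ALPHA * delta * e[x]
        e[x] *= GAMMA * LAM            # traces decay between steps

td_lambda_step("A", 0.0, "B")
td_lambda_step("B", 1.0, "C")
```

The reward observed on the second step credits B directly (V["B"] = 0.1) and A through its decayed trace of γλ = 0.72 (V["A"] = 0.072), which TD(0) alone would not do.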
Page 51
Function Approximation
Page 52
Function Approximation
Page 53
Stochastic Gradient Descent
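Combining the previous two slides: with a parameterized approximator V(s) ≈ w · φ(s), each TD(0) sample yields a stochastic (semi-)gradient step on the weights instead of a table update. A linear sketch with invented features and numbers:

```python
# Semi-gradient TD(0) with a linear value-function approximator
# V(s) = w . phi(s); features and transition below are invented.
def v(w, phi):
    return sum(wi * fi for wi, fi in zip(w, phi))

def sgd_td0(w, phi_s, r, phi_s2, alpha=0.1, gamma=0.9):
    target = r + gamma * v(w, phi_s2)       # bootstrapped target
    delta = target - v(w, phi_s)            # TD error
    # for a linear approximator, the gradient of V(s) w.r.t. w is phi(s)
    return [wi + alpha * delta * fi for wi, fi in zip(w, phi_s)]

# one update on a transition with features phi(s)=[1,0], phi(s')=[0,1], r=1
w = sgd_td0([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0])
```

This is "semi-gradient" because the target r + γ V(s') also depends on w but is treated as a constant when differentiating, the standard choice for TD methods with function approximation.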
Page 55
Subtleties and Ongoing Research
● Replace Q^ table with neural net or other generalizer
● Handle cases where the state is only partially observable
● Design optimal exploration strategies
● Extend to continuous action, state
● Learn and use delta^: S x A -> S
● Relationship to dynamic programming
Page 56
References
● Reinforcement Learning: An Introduction. Richard S. Sutton, Andrew G. Barto. MIT Press 1998. http://www-anw.cs.umass.edu/~rich/book/the-book.html
● Neuro-Dynamic Programming, Dimitri Bertsekas, John Tsitsiklis, Athena Scientific, 1996.
● Reinforcement Learning: A Tutorial. M. Harmon, S. Harmon.
● Reinforcement Learning: A Survey, L. Kaelbling et al., Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285.
● How to Make Software Agents Do the Right Thing: An Introduction to Reinforcement Learning, S. Singh, P. Norvig, D. Cohn.
● Reinforcement Learning Software:
– http://www-anw.cs.umass.edu/~rich/software.html
– http://www.cse.msu.edu/rlr/domains.html
● Reinforcement Learning for Humanoid Robots
● Frank Hoffman. http://www.nada.kth.se/kurser/kth/2D1431/02/index.html