COMP 4180: Intelligent Mobile Robotics Reinforcement Learning

Jacky Baltes

Department of Computer Science

University of Manitoba

http://www4.cs.umanitoba.ca/~jacky/...Teaching/Courses/COMP_4180-IntelligentMobileRobotics/current/index.php

Outline

● Reinforcement Learning Problem
– Dynamic Programming
– Control learning
– Control policies that choose optimal actions
– Q Learning
– Convergence

● Monte-Carlo Methods

● Temporal Difference Learning

Control Learning

Example: TD-Gammon

Reinforcement Learning Problem

Markov Decision Processes

Agent's Learning Task

State Value Function

Bellman Equation (Deterministic Case)
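
As a reference form, assuming a deterministic transition function δ : S × A → S (the notation this deck uses later), a reward function r(s, a), and a discount factor γ with 0 ≤ γ < 1 (r and γ are assumed symbols here), the Bellman equation for a fixed policy π can be written as:

```latex
V^{\pi}(s) \;=\; r\bigl(s, \pi(s)\bigr) \;+\; \gamma \, V^{\pi}\bigl(\delta(s, \pi(s))\bigr)
```

The value of a state is the immediate reward plus the discounted value of the single successor state that the policy leads to.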

Example

Example

Iterative Policy Evaluation

Iterative Policy Evaluation
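
A minimal sketch of iterative policy evaluation in the deterministic case. The 5-state chain, its reward, and all names (`delta`, `reward`, `evaluate_policy`) are hypothetical and chosen only for illustration; the update is the standard sweep V(s) ← r(s, π(s)) + γ V(δ(s, π(s))).

```python
# Hypothetical 5-state chain: states 0..4, state 4 is terminal.
GAMMA = 0.9
N_STATES = 5
TERMINAL = 4

def delta(s, a):
    """Deterministic transition: a = -1 moves left, a = +1 moves right."""
    return min(max(s + a, 0), TERMINAL)

def reward(s, a):
    """Reward 1.0 for entering the terminal state, 0.0 otherwise."""
    return 1.0 if s != TERMINAL and delta(s, a) == TERMINAL else 0.0

def evaluate_policy(policy, theta=1e-10):
    """Sweep all states until the largest change falls below theta."""
    V = [0.0] * N_STATES
    while True:
        max_change = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # terminal value stays 0
            a = policy[s]
            new_v = reward(s, a) + GAMMA * V[delta(s, a)]
            max_change = max(max_change, abs(new_v - V[s]))
            V[s] = new_v
        if max_change < theta:
            return V

always_right = [+1] * N_STATES
V = evaluate_policy(always_right)
```

Under the always-right policy the values fall off geometrically with distance from the goal (1, γ, γ², γ³ from state 3 back to state 0).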

What to learn?

Q (Action-Value) Function

Q (Action-Value) Function

Bellman Equation: Deterministic Case

Optimal Value Functions

Policy Improvement

Example

Example

Generalized Policy Iteration

Value Iteration: Q-Learning
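
A minimal sketch of value iteration on the Q function for a deterministic world, using the backup Q(s, a) ← r(s, a) + γ max over a' of Q(δ(s, a), a'). The chain environment and every name below are assumptions made for illustration, not taken from the slides.

```python
GAMMA = 0.9
TERMINAL = 4          # hypothetical chain with states 0..4
ACTIONS = (-1, +1)    # left, right

def delta(s, a):
    """Deterministic successor state."""
    return min(max(s + a, 0), TERMINAL)

def reward(s, a):
    """Reward 1.0 for the move that enters the terminal state."""
    return 1.0 if delta(s, a) == TERMINAL else 0.0

def q_value_iteration(sweeps=50):
    # Q is defined only for non-terminal states.
    Q = {(s, a): 0.0 for s in range(TERMINAL) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in Q:
            s_next = delta(s, a)
            if s_next == TERMINAL:
                best_next = 0.0
            else:
                best_next = max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] = reward(s, a) + GAMMA * best_next
    return Q

Q = q_value_iteration()
# Greedy policy read off the converged table.
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)}
```

Once Q has converged, the greedy policy is obtained without any model lookup: pick the action with the largest Q value in each state.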

Non-deterministic Case

Bellman Equations: Non-deterministic Case

Value Iteration: Q-Learning

Example

Example

Reinforcement Learning

Monte-Carlo Methods: Policy Evaluation

Monte Carlo Method: Policy Evaluation
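
A minimal sketch of first-visit Monte Carlo policy evaluation: the value of a state is estimated as the average return observed after its first visit in each episode. The symmetric random walk used here (reward 1 for leaving on the right, 0 on the left) and all function names are hypothetical illustration choices.

```python
import random

def run_episode(rng, n=5):
    """Random walk on states 0..n-1, starting in the middle.
    Returns a list of (state, reward) pairs, one per step."""
    s = n // 2
    trajectory = []
    while True:
        s_next = s + rng.choice((-1, 1))
        if s_next < 0:            # left exit, reward 0
            trajectory.append((s, 0.0))
            return trajectory
        if s_next >= n:           # right exit, reward 1
            trajectory.append((s, 1.0))
            return trajectory
        trajectory.append((s, 0.0))
        s = s_next

def first_visit_mc(num_episodes=5000, gamma=1.0, n=5, seed=0):
    rng = random.Random(seed)
    returns_sum = [0.0] * n
    returns_count = [0] * n
    for _ in range(num_episodes):
        G = 0.0
        first_return = {}
        # Walk backward so G is the return from each step onward;
        # the last overwrite for a state belongs to its first visit.
        for s, r in reversed(run_episode(rng, n)):
            G = r + gamma * G
            first_return[s] = G
        for s, g in first_return.items():
            returns_sum[s] += g
            returns_count[s] += 1
    return [returns_sum[s] / returns_count[s] if returns_count[s] else 0.0
            for s in range(n)]

V = first_visit_mc()
```

For this walk the true values are (s + 1) / 6, so the estimate for the middle state should settle near 0.5 as the number of episodes grows.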

Temporal Difference (TD) Learning

TD(0): Policy Evaluation

TD(0): Policy Evaluation
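
A minimal sketch of TD(0) policy evaluation, which updates after every step using V(s) ← V(s) + α [r + γ V(s') − V(s)]. The random-walk task, the initial value 0.5, and the parameter values are assumptions for illustration.

```python
import random

def td0_random_walk(num_episodes=2000, alpha=0.05, gamma=1.0, n=5, seed=0):
    """TD(0) on a random walk over states 0..n-1.
    Leaving on the right gives reward 1, on the left reward 0."""
    rng = random.Random(seed)
    V = [0.5] * n
    for _ in range(num_episodes):
        s = n // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n:
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # Bootstrapped update toward the one-step target.
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V

V = td0_random_walk()
```

Unlike the Monte Carlo estimate, no episode has to finish before learning starts: each transition moves V(s) toward the current estimate of its successor.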

ε-Greedy Policy
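
A minimal sketch of ε-greedy action selection: with probability ε pick a uniformly random action, otherwise pick the action with the highest estimated value. The function name and the list-of-values representation are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for one state.

    q_values: list of action-value estimates for that state.
    epsilon:  exploration probability in [0, 1].
    rng:      object providing random() and randrange(), e.g. random.Random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Ties in the greedy branch go to the lowest index here; breaking ties at random is a common alternative when all estimates start out equal.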

SARSA Policy Iteration
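
A minimal sketch of SARSA (on-policy TD control) with an ε-greedy behaviour policy. The update uses the action actually chosen in the next state: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]. The chain environment, parameter values, and names are hypothetical.

```python
import random

N_STATES, TERMINAL = 5, 4
ACTIONS = (-1, +1)  # index 0 = left, index 1 = right

def step(s, a_idx):
    """Deterministic chain; reward 1.0 on entering the terminal state."""
    s_next = min(max(s + ACTIONS[a_idx], 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def epsilon_greedy(q_row, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Random tie-breaking avoids getting stuck while all values are equal.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def sarsa(num_episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # terminal row stays 0
    for _ in range(num_episodes):
        s = 0
        a = epsilon_greedy(Q[s], epsilon, rng)
        while s != TERMINAL:
            s_next, r = step(s, a)
            a_next = epsilon_greedy(Q[s_next], epsilon, rng)
            # On-policy target: value of the action that will be taken next.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

Because the target includes exploratory actions, the learned values describe the ε-greedy policy itself rather than the purely greedy one.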

SARSA Example

SARSA Example V(s)

SARSA Example Q(s,a)

Rotational Inverted Pendulum

Rotational Inverted Pendulum Stabilization Demo, Tor Aamodt

http://www.eecg.utoronto.ca/~aamodt/BAScThesis/RLsim.htm

Q-Learning (Off-Policy TD)

Q-Learning (Off Policy Iteration)
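
A minimal sketch of tabular Q-learning (off-policy TD control). Behaviour is ε-greedy, but the target uses the best next action regardless of what is actually taken: Q(s, a) ← Q(s, a) + α [r + γ max over a' of Q(s', a') − Q(s, a)]. The chain environment, parameters, and names are assumptions for illustration.

```python
import random

N_STATES, TERMINAL = 5, 4
ACTIONS = (-1, +1)  # index 0 = left, index 1 = right

def step(s, a_idx):
    """Deterministic chain; reward 1.0 on entering the terminal state."""
    s_next = min(max(s + ACTIONS[a_idx], 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def epsilon_greedy(q_row, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def q_learning(num_episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # terminal row stays 0
    for _ in range(num_episodes):
        s = 0
        while s != TERMINAL:
            a = epsilon_greedy(Q[s], epsilon, rng)
            s_next, r = step(s, a)
            # Off-policy target: greedy value of the next state.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
```

The only difference from the SARSA update is the `max` in the target, which makes the learned values those of the greedy policy even while the agent explores.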

TD vs Monte Carlo

Temporal Difference Learning

Monte Carlo Method

N-Step return
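
A minimal sketch of the n-step return, which sums n discounted rewards and then bootstraps from the value estimate of the state reached: G = r(t+1) + γ r(t+2) + ... + γ^(n−1) r(t+n) + γ^n V(s(t+n)). The indexing convention (`rewards[k]` is the reward received on leaving step k, `values` has one extra entry for the terminal state) is an assumption of this sketch.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t.

    rewards: rewards[k] is the reward for the transition out of step k.
    values:  values[k] is V(s_k); values[len(rewards)] is the terminal
             value, normally 0.0.
    If the episode ends within n steps, the return is truncated there.
    """
    T = len(rewards)
    end = min(t + n, T)
    G = 0.0
    for k in range(t, end):
        G += gamma ** (k - t) * rewards[k]
    G += gamma ** (end - t) * values[end]  # bootstrap from the estimate
    return G

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.9)
full = n_step_return(rewards, values, t=0, n=10, gamma=0.9)
```

With n = 1 this is the TD(0) target; once n reaches past the end of the episode it becomes the Monte Carlo return.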

TD(λ) Learning

Eligibility Traces

On-line TD(λ)
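
A minimal sketch of on-line TD(λ) policy evaluation with accumulating eligibility traces (the backward view). After each step, every state is updated in proportion to its trace, and all traces decay by γλ. The random-walk task and parameter values are assumptions for illustration.

```python
import random

def td_lambda_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0,
                          lam=0.8, n=5, seed=0):
    """TD(lambda) on a random walk over states 0..n-1.
    Leaving on the right gives reward 1, on the left reward 0."""
    rng = random.Random(seed)
    V = [0.5] * n
    for _ in range(num_episodes):
        e = [0.0] * n          # eligibility traces, reset each episode
        s = n // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n:
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            td_error = r + gamma * v_next - V[s]
            e[s] += 1.0        # accumulating trace for the visited state
            for i in range(n):
                V[i] += alpha * td_error * e[i]
                e[i] *= gamma * lam
            if done:
                break
            s = s_next
    return V

V = td_lambda_random_walk()
```

Setting `lam=0.0` reduces this to TD(0); values of λ close to 1 spread each TD error further back along the trajectory, approaching Monte Carlo behaviour.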

Function Approximation

Function Approximation

Stochastic Gradient Descent
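
A minimal sketch of stochastic gradient descent for a linear value function V(s) = w · x(s), using the update w ← w + α (target − w · x) x. In practice the target would be a sampled return or a TD target; to keep the sketch checkable, it uses the known values (s + 1) / 6 of a 5-state random walk as targets. The features and all names are hypothetical.

```python
import random

def features(s, n=5):
    """Two features: a bias term and the normalised state index."""
    return [1.0, s / (n - 1)]

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, target, alpha):
    """One gradient step on the squared error 0.5 * (target - w.x)^2."""
    error = target - predict(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(20000):
    s = rng.randrange(5)        # sample a state
    target = (s + 1) / 6        # assumed-known training target
    w = sgd_step(w, features(s), target, alpha=0.1)
```

Two weights now stand in for a five-entry table; with richer features the same update generalises across states that were never visited.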

Convergence

Subtleties and Ongoing Research

● Replace Q̂ table with neural net or other generalizer

● Handle cases where the state is only partially observable

● Design optimal exploration strategies

● Extend to continuous action, state

● Learn and use δ̂ : S × A → S

● Relationship to dynamic programming

References

● Reinforcement Learning: An Introduction. Richard S. Sutton, Andrew G. Barto. MIT Press 1998. http://www-anw.cs.umass.edu/~rich/book/the-book.html

● Neuro-Dynamic Programming, Dimitri Bertsekas, John Tsitsiklis, Athena Scientific, 1996.

● Reinforcement Learning: A Tutorial. M. Harmon, S. Harmon.

● Reinforcement Learning: A Survey. L. Kaelbling et al., Journal of Artificial Intelligence Research, Vol 4, pp. 237-285.

● How to Make Software Agents Do the Right Thing: An Introduction to Reinforcement Learning. S. Singh, P. Norvig, D. Cohn.

● Reinforcement Learning Software:
– http://www-anw.cs.umass.edu/~rich/software.html
– http://www.cse.msu.edu/rlr/domains.html

● Reinforcement Learning for Humanoid Robots

● Frank Hoffman. http://www.nada.kth.se/kurser/kth/2D1431/02/index.html
