Page 1:

Bayesian Reinforcement Learning with Gaussian Processes

Huanren Zhang

Electrical and Computer Engineering

Purdue University

Page 2:

Outline
- Introduction to Reinforcement Learning (RL)
- Markov Decision Processes (MDPs)
- Traditional RL Solution Methods
- Gaussian Processes (GPs)
- Gaussian Process Temporal Difference (GPTD)
- Experiment
- Conclusion

Page 3:

Reinforcement Learning (RL)

- An agent interacts with the environment and learns how to map situations to actions in order to maximize reward
- Involves sequences of decisions
- Almost all Artificial Intelligence (AI) problems can be formulated as RL problems

Page 4:

Reinforcement Learning (RL)

- Evaluative feedback (reward or reinforcement): indicates how good the action taken was, but not whether it was correct or incorrect
- Balance between exploration and exploitation
  - Exploitation: make the most of the information already gathered
  - Exploration: visit unknown states that may yield a higher return in the long run
- Online learning

Page 5:

Markov Decision Processes (MDPs)
- RL problems can be formulated as Markov Decision Processes (MDPs)
- An MDP is a tuple (S, A, R, P):
  - State space: S
  - Action space: A
  - Reward function: R(s, a)
  - State-transition function: P(s'|s, a), the probability of making a transition from state s to state s' by taking action a
- By a model, we mean the reward function and the state-transition function
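As a concrete illustration, here is a minimal Python sketch of a maze-world MDP; the grid layout, deterministic moves, and reward values are assumptions for illustration, not the exact setup used in the experiments.

```python
import numpy as np

# A minimal maze-world MDP sketch (hypothetical layout and rewards).
WALL, GOAL = '#', 'G'
MAZE = ["#####",
        "#..G#",
        "#.#.#",
        "#...#",
        "#####"]
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):
    """Deterministic transition P(s'|s,a) and reward R(s,a)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if MAZE[nr][nc] == WALL:          # bumping into a wall leaves the agent in place
        nr, nc = r, c
    reward = 1.0 if MAZE[nr][nc] == GOAL else 0.0
    return (nr, nc), reward
```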

Page 6:

Maze world problem

Page 7:

Traditional RL Solution Methods
- Dynamic Programming (DP)
- Monte Carlo (MC) Methods
- Temporal Difference (TD) Methods

All of these methods are based on estimating the value function under a given policy.
The value of a state is the total amount of reward an agent can expect to accumulate starting from that state.
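Written out, this is the standard state-value function (with γ the discounting rate introduced on the TD slide):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s\right]$$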

Page 8:

Maze world problem

The values of states near the goal should be greater than the values of states far from the goal.

Page 9:

Temporal Difference (TD) Methods
- Learn directly from experience
- Bootstrap: update estimates based on other learned estimates
- Do not need a model
- Updating rule:

$$V(s_t) \leftarrow V(s_t) + \alpha_t \delta_t, \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

where δ_t is the temporal difference, α_t is a time-dependent learning rate, and γ is the discounting rate.
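A minimal sketch of this update in Python, assuming a tabular value function stored in a dict (episode generation omitted):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: V(s) <- V(s) + alpha * delta."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # temporal difference
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```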

Page 10:

TD Method (With Optimistic Policy Iteration)

Page 11:

Policy Learned by TD Method (After 100 Trials)

Page 12:

Gaussian Processes (GPs)

- A Bayesian approach: provides a full posterior over values, not just point estimates
- Forces us to make our assumptions explicit
- Non-parametric: priors are placed, and inference is performed, directly in function space (kernels)
- Domain knowledge is intuitively encoded in priors

Page 13:

Gaussian Processes (GPs)
- "An indexed set of jointly Gaussian random variables"
- The index set X may be just about any set
- For a Gaussian process F,

$$\mathbb{E}[F(x)] = m(x), \qquad \mathrm{Cov}[F(x), F(x')] = k(x, x')$$

- The kernel function k(x, x') is symmetric and positive definite (a Mercer kernel)

Page 14:

Conditioning Theorem
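The theorem referred to here is the standard conditioning identity for jointly Gaussian random vectors:

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right) \;\Longrightarrow\; X \mid Y = y \sim \mathcal{N}\!\left( \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \right)$$

GP regression and GPTD below are both direct applications of this identity.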

Page 15:

GP regression
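These three slides illustrate standard GP regression: given noisy observations $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the conditioning theorem gives the posterior at a test point $x_*$:

$$\bar{f}(x_*) = \mathbf{k}(x_*)^{\top} (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \operatorname{Var}[f(x_*)] = k(x_*, x_*) - \mathbf{k}(x_*)^{\top} (K + \sigma^2 I)^{-1} \mathbf{k}(x_*)$$

where $K$ is the kernel matrix over the training inputs and $\mathbf{k}(x_*)$ holds the kernel values between $x_*$ and the training inputs.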

Page 16:

GP regression

Page 17:

GP regression
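A minimal numpy sketch of these formulas, using the Gaussian kernel from the experiment section; the function names and toy data are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma_k=1.0):
    """Pairwise Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma_k^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma_k**2))

def gp_posterior(X, y, X_star, noise=0.1, sigma_k=1.0):
    """Posterior mean and variance of f at the test inputs X_star."""
    K = gaussian_kernel(X, X, sigma_k) + noise**2 * np.eye(len(X))
    k_star = gaussian_kernel(X, X_star, sigma_k)              # shape (n, m)
    mean = k_star.T @ np.linalg.solve(K, y)
    var = gaussian_kernel(X_star, X_star, sigma_k).diagonal() \
        - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

# Illustrative usage on toy 1-D data (hypothetical):
X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(len(X))
mean, var = gp_posterior(X, y, np.array([[2.5]]))
```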

Page 18:

GPTD Methods

Generative model for the rewards along the trajectory s_1, s_2, …, s_t (Engel et al. 2003):

$$R(s_i) = V(s_i) - \gamma V(s_{i+1}) + N(s_i), \qquad i = 1, \dots, t-1$$

In compact form,

$$\mathbf{r}_{t-1} = H_t \mathbf{v}_t + \mathbf{n}_t, \qquad H_t = \begin{bmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ 0 & 1 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -\gamma \end{bmatrix}$$

Page 19:

GPTD Methods

Applying the conditioning theorem gives the posterior mean and variance of the value at any state s:

$$\hat{V}_t(s) = \mathbf{k}_t(s)^{\top} \boldsymbol{\alpha}_t, \qquad p_t(s) = k(s, s) - \mathbf{k}_t(s)^{\top} C_t \, \mathbf{k}_t(s)$$

where

$$\boldsymbol{\alpha}_t = H_t^{\top} (H_t K_t H_t^{\top} + \Sigma_t)^{-1} \mathbf{r}_{t-1}, \qquad C_t = H_t^{\top} (H_t K_t H_t^{\top} + \Sigma_t)^{-1} H_t$$

with K_t the kernel matrix over the visited states, k_t(s) the vector of kernel values between s and the visited states, and Σ_t the noise covariance.
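A sketch of this batch posterior in numpy; the band matrix H follows the model above, white observation noise Σ_t = σ²I is this sketch's simplifying assumption, and all names are illustrative:

```python
import numpy as np

def gptd_posterior(K, k_star, k_ss, rewards, gamma=0.95, sigma=0.1):
    """Batch GPTD posterior for the value at a query state.

    K       : (t, t)   kernel matrix over the t visited states
    k_star  : (t,)     kernel values between the query state and visited states
    k_ss    : scalar   kernel value k(s, s) at the query state
    rewards : (t-1,)   observed rewards along the trajectory
    """
    t = K.shape[0]
    # H is the (t-1) x t band matrix with 1 on the diagonal and -gamma beside it.
    H = np.eye(t - 1, t) - gamma * np.eye(t - 1, t, k=1)
    Q = H @ K @ H.T + sigma**2 * np.eye(t - 1)   # assumed noise covariance sigma^2 I
    alpha = H.T @ np.linalg.solve(Q, rewards)
    C = H.T @ np.linalg.solve(Q, H)
    mean = k_star @ alpha
    var = k_ss - k_star @ C @ k_star
    return mean, var
```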

Page 20:

Can we use the uncertainty information to help balance exploitation and exploration?

New value function (improved GPTD): add an uncertainty bonus to the posterior mean,

$$\tilde{V}_t(s) = \hat{V}_t(s) + c \, \sigma_t(s), \qquad \sigma_t(s) = \sqrt{p_t(s)}$$

- The parameter c balances the importance of exploitation against exploration
- Information theory: higher uncertainty means more information
- Visiting states with higher uncertainty yields a higher information gain, which is another kind of value for a state
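In code, the improved value is a small wrapper over the GPTD posterior; taking the square root of the posterior variance as the bonus is this sketch's assumption:

```python
import numpy as np

def improved_value(mean, var, c=1.0):
    """Exploration-adjusted value: posterior mean plus c times the posterior std."""
    return mean + c * np.sqrt(max(var, 0.0))
```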

Page 21:

Experiment

A Gaussian kernel is used:

$$k(s, s') = \exp\!\left(-\frac{\|s - s'\|^2}{2\sigma_k^2}\right)$$

- ||s − s'|| is the Euclidean distance between the states s and s'; adjacent states will have similar values in the maze problem
- Optimistic Policy Iteration (OPI) is used to determine the policy: take the action that leads to the highest expected return based on the current value estimate
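A sketch of the OPI action-selection step under these assumptions, reusing step and ACTIONS from the MDP sketch and improved_value from above; value_posterior is a hypothetical wrapper around gptd_posterior:

```python
GAMMA = 0.95

def greedy_action(state, value_posterior, c=1.0):
    """OPI step: act greedily w.r.t. the current (uncertainty-adjusted) value estimate.

    value_posterior(s) -> (mean, var) is assumed to wrap gptd_posterior above.
    """
    def score(action):
        s_next, reward = step(state, action)   # maze dynamics from the MDP sketch
        mean, var = value_posterior(s_next)
        return reward + GAMMA * improved_value(mean, var, c)
    return max(ACTIONS, key=score)
```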

Page 22:

GPTD Method

Page 23:

Policy Learned by GPTD

Page 24:

GPTD for Multi-goal Maze

Page 25:

Policy Learned by GPTD

Page 26:

Improved GPTD

Page 27:

Policy Learned by Improved GPTD

Page 28:

Conclusion

- A Gaussian process provides, along with each estimate, a measure of how certain that estimate is
- GPTD gives much better results than traditional RL methods
- The main contribution of this project is a proposal for using uncertainty to balance exploration and exploitation in RL; the experiments show its effectiveness

Page 29:

References

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York.

Engel, Y.; Mannor, S.; and Meir, R. 2003. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In International Conference on Machine Learning.

Engel, Y.; Mannor, S.; and Meir, R. 2005. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning.

Engel, Y. 2005. Algorithms and Representations for Reinforcement Learning. Ph.D. Dissertation, The Hebrew University of Jerusalem, Israel.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY: Wiley-Interscience.

Russell, S. J., and Norvig, P. 2002. Artificial Intelligence: A Modern Approach (2nd Edition). Prentice Hall.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.

Page 30:

Questions?
