Reinforcement Learning with Misspecified Bayesian Nonparametric Model Classes
Joshua Joseph, Alborz Geramifard, Jonathan P. How, and Nicholas Roy

Key idea: rather than minimizing prediction error, maximize what we actually care about: the return of the policy.

Poor Performance in Standard Model-Based Reinforcement Learning due to Misspecification
• Standard pipeline: Training Data + Model Class + Prior → Maximum a Posteriori (MAP) model → Policy Determination, with dynamics modeled as s_{t+1} ~ f(s_t, a_t; θ).
• RBMS pipeline: Training Data + Model Class → Policy Determination → Policy Evaluation → Maximum Return Policy, searching over θ ∈ Θ.

Results on a Toy Problem
• MAP model: resulting policy's return -62
• RBMS model: resulting policy's return -20
• The true system has discontinuous dynamics, which the smooth MAP model cannot represent; the optimal policy and the MAP policy therefore differ.

References
[1] J. Joseph, A. Geramifard, J. W. Roberts, J. P. How, and N. Roy, "Reinforcement learning with misspecified model classes," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2013), under review.
[2] R. Fonteneau, S. A. Murphy, L. Wehenkel, and D. Ernst, "Model-free Monte Carlo-like policy evaluation," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 217-224, 2010.
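The contrast between the two pipelines can be sketched in code. This is an illustrative sketch, not the authors' implementation: `predict`, `solve_policy`, and `estimate_return` are assumed placeholder hooks, and the candidate models are a finite list of parameters θ.

```python
import numpy as np

def select_model_map(thetas, predict, data):
    """Standard approach: pick the model that best fits the observed
    transitions (minimum one-step prediction error), ignoring reward.
    `data` is a batch of (s, a, r, s_next) tuples."""
    def mse(theta):
        return np.mean([(predict(theta, s, a) - s_next) ** 2
                        for s, a, r, s_next in data])
    return min(thetas, key=mse)

def select_model_rbms(thetas, solve_policy, estimate_return, data):
    """RBMS: pick the model theta in Theta whose derived policy achieves
    the highest estimated return on the batch data."""
    def score(theta):
        policy = solve_policy(theta)          # policy determination
        return estimate_return(policy, data)  # policy evaluation, e.g. via [2]
    return max(thetas, key=score)
```

The two selectors can disagree: a model with low prediction error can still induce a poor policy, which is exactly the failure mode the poster's toy problem exhibits.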
Parametric Reward Based Model Search
The standard approach to learning a model in RL fits the model without considering the reward. With a misspecified model class this can fail badly: the policy that results from using a Gaussian process (which assumes smoothness) to model the discontinuous true dynamics performs poorly.

Reward Based Model Search (RBMS) [1], given a batch of arbitrary (off-policy) data:
a) Policy evaluation is performed using model-free Monte Carlo-like evaluation [2]
b) Policy improvement is performed by gradient ascent over the model parameters θ

(Figure: the true system, a grid world with start and goal states, wind, ice, and pits, alongside the misspecified model class's view of it.)

Domain description:
• Actions = {up, right}
• Gaussian process dynamics model, s_{t+1} ~ f(s_t, a_t; θ)
• -1 reward for each time step
• -100 for falling in a pit
• Taking actions on the ice results in "slipping" south
• 100 episodes of training data from a random policy

Bayesian Nonparametric RBMS
• Unlike the parametric approach, the predictions of a Bayesian nonparametric model are a function of the training data D.
• To adapt parametric RBMS to Bayesian nonparametric model classes:
  - Policy evaluation using [2] still works
  - Policy improvement is unclear
• Open question: how should we perform policy improvement?
  - Ascending the gradient in hyperparameter space is straightforward
  - What about the data? We could remove data from D, add "fake" data generated from f, or move the data itself; but for Bayesian nonparametric models, what does a gradient with respect to the data mean?
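Policy evaluation step (a) can be sketched in the spirit of [2]: estimate a policy's return from a batch of arbitrary off-policy one-step transitions by stitching together an artificial trajectory from nearest-neighbour samples, each sample used at most once. This is a hedged sketch under simplifying assumptions (discrete actions, a large penalty for action mismatch), not the paper's exact algorithm.

```python
import numpy as np

def mfmc_return(policy, transitions, s0, horizon, n_traj=1, gamma=1.0):
    """Model-free Monte Carlo-like policy evaluation (after [2]):
    rebuild artificial trajectories from a batch of (s, a, r, s_next)
    transitions by repeatedly jumping to the nearest unused sample."""
    unused = list(transitions)  # samples consumed without replacement
    returns = []
    for _ in range(n_traj):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            if not unused:
                break
            a = policy(s)
            # nearest stored transition to the (state, action) we want;
            # a mismatched action incurs a large distance penalty
            i = min(range(len(unused)),
                    key=lambda j: np.linalg.norm(np.atleast_1d(unused[j][0] - s))
                                  + (0.0 if unused[j][1] == a else 1e6))
            _, _, r, s_next = unused.pop(i)
            ret += discount * r
            discount *= gamma
            s = s_next
        returns.append(ret)
    return float(np.mean(returns))
```

Because the estimator only queries the batch, it works unchanged whether the candidate model is parametric or Bayesian nonparametric, which is why step (a) carries over to BNP RBMS while step (b) does not.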
A model class is misspecified when the true dynamics cannot be represented by any model in the class, which is the case in most real-world problems.

(Figure: the Gaussian process dynamics model's mean function for each action.)

Conclusion
• RBMS is able to learn models from misspecified model classes that perform well in cases where learning based on minimizing prediction error does not
• RBMS allows us to use smaller model classes, resulting in significantly lower sample complexity than using larger, more expressive model classes
• The extension of RBMS to Bayesian nonparametric models is promising but requires more work to understand how to perform policy improvement; here, the gradient for RBMS policy improvement was approximated by randomly adding and removing data from the Gaussian process
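The final point, approximating policy improvement by randomly adding and removing data from the GP, can be sketched as a simple hill climb. This is a minimal sketch, assuming hypothetical hooks `fit_gp`, `solve_policy`, and `estimate_return` that are not part of the original work.

```python
import random

def improve_by_data_search(data, fit_gp, solve_policy, estimate_return,
                           n_steps=50, seed=0):
    """Approximate BNP-RBMS policy improvement: randomly toggle training
    points in or out of the GP's data set and keep each change only if
    the estimated return of the resulting policy does not decrease."""
    rng = random.Random(seed)
    include = [True] * len(data)

    def score(mask):
        subset = [d for d, keep in zip(data, mask) if keep]
        policy = solve_policy(fit_gp(subset))  # GP predictions depend on the data
        return estimate_return(policy)         # e.g. model-free MC evaluation [2]

    best = score(include)
    for _ in range(n_steps):
        i = rng.randrange(len(data))
        include[i] = not include[i]            # randomly add or remove a point
        trial = score(include)
        if trial >= best:
            best = trial                       # keep the change
        else:
            include[i] = not include[i]        # revert it
    return [d for d, keep in zip(data, include) if keep], best
```

This random local search sidesteps the open question of what a gradient with respect to the data means, at the cost of many policy evaluations per accepted change.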