Reinforcement Learning with Misspecified Bayesian Nonparametric Model Classes
Joshua Joseph, Alborz Geramifard, Jonathan P. How, and Nicholas Roy

Key idea: rather than minimizing prediction error, maximize what we actually care about: the return of the policy.

Poor Performance in Standard Model-Based Reinforcement Learning due to Misspecification
• Standard pipeline: Training Data + Model Class + Prior → Maximum a Posteriori (MAP) model → Policy Determination, with dynamics modeled as s_{t+1} ~ f(s_t, a_t; θ).
• RBMS pipeline: Training Data + Model Class → Policy Determination → Policy Evaluation → Maximum Return Policy, searching over θ ∈ Θ.

Results on a Toy Problem
• MAP model: resulting policy's return -62
• RBMS model: resulting policy's return -20
• The true system has discontinuous dynamics, which the smooth MAP model cannot represent; the optimal policy and the MAP policy therefore differ.

References
[1] J. Joseph, A. Geramifard, J. W. Roberts, J. P. How, and N. Roy, "Reinforcement learning with misspecified model classes," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2013), under review.
[2] R. Fonteneau, S. A. Murphy, L. Wehenkel, and D. Ernst, "Model-free Monte Carlo-like policy evaluation," Journal of Machine Learning Research - Proceedings Track, vol. 9, pp. 217-224, 2010.
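The contrast between the two pipelines can be sketched in code. This is an illustrative sketch, not the authors' implementation: `predict`, `solve_policy`, and `estimate_return` are assumed placeholder hooks, and the candidate models are a finite list of parameters θ.

```python
import numpy as np

def select_model_map(thetas, predict, data):
    """Standard approach: pick the model that best fits the observed
    transitions (minimum one-step prediction error), ignoring reward.
    `data` is a batch of (s, a, r, s_next) tuples."""
    def mse(theta):
        return np.mean([(predict(theta, s, a) - s_next) ** 2
                        for s, a, r, s_next in data])
    return min(thetas, key=mse)

def select_model_rbms(thetas, solve_policy, estimate_return, data):
    """RBMS: pick the model theta in Theta whose derived policy achieves
    the highest estimated return on the batch data."""
    def score(theta):
        policy = solve_policy(theta)          # policy determination
        return estimate_return(policy, data)  # policy evaluation, e.g. via [2]
    return max(thetas, key=score)
```

The two selectors can disagree: a model with low prediction error can still induce a poor policy, which is exactly the failure mode the poster's toy problem exhibits.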
Parametric Reward Based Model Search
The standard approach to learning a model in RL fits the model without considering the reward. With a misspecified model class this can fail badly: the policy that results from using a Gaussian process (which assumes smoothness) to model the discontinuous true dynamics performs poorly.

Reward Based Model Search (RBMS) [1], given a batch of arbitrary (off-policy) data:
a) Policy evaluation is performed using model-free Monte Carlo-like evaluation [2]
b) Policy improvement is performed by gradient ascent over the model parameters θ

(Figure: the true system, a grid world with start and goal states, wind, ice, and pits, alongside the misspecified model class's view of it.)

Domain description:
• Actions = {up, right}
• Gaussian process dynamics model, s_{t+1} ~ f(s_t, a_t; θ)
• -1 reward for each time step
• -100 for falling in a pit
• Taking actions on the ice results in "slipping" south
• 100 episodes of training data from a random policy

Bayesian Nonparametric RBMS
• Unlike the parametric approach, the predictions of a Bayesian nonparametric model are a function of the training data D.
• To adapt parametric RBMS to Bayesian nonparametric model classes:
  - Policy evaluation using [2] still works
  - Policy improvement is unclear
• Open question: how should we perform policy improvement?
  - Ascending the gradient in hyperparameter space is straightforward
  - What about the data? We could remove data from D, add "fake" data generated from f, or move the data itself; but for Bayesian nonparametric models, what does a gradient with respect to the data mean?
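Policy evaluation step (a) can be sketched in the spirit of [2]: estimate a policy's return from a batch of arbitrary off-policy one-step transitions by stitching together an artificial trajectory from nearest-neighbour samples, each sample used at most once. This is a hedged sketch under simplifying assumptions (discrete actions, a large penalty for action mismatch), not the paper's exact algorithm.

```python
import numpy as np

def mfmc_return(policy, transitions, s0, horizon, n_traj=1, gamma=1.0):
    """Model-free Monte Carlo-like policy evaluation (after [2]):
    rebuild artificial trajectories from a batch of (s, a, r, s_next)
    transitions by repeatedly jumping to the nearest unused sample."""
    unused = list(transitions)  # samples consumed without replacement
    returns = []
    for _ in range(n_traj):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            if not unused:
                break
            a = policy(s)
            # nearest stored transition to the (state, action) we want;
            # a mismatched action incurs a large distance penalty
            i = min(range(len(unused)),
                    key=lambda j: np.linalg.norm(np.atleast_1d(unused[j][0] - s))
                                  + (0.0 if unused[j][1] == a else 1e6))
            _, _, r, s_next = unused.pop(i)
            ret += discount * r
            discount *= gamma
            s = s_next
        returns.append(ret)
    return float(np.mean(returns))
```

Because the estimator only queries the batch, it works unchanged whether the candidate model is parametric or Bayesian nonparametric, which is why step (a) carries over to BNP RBMS while step (b) does not.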
A model class is misspecified when the true dynamics cannot be represented by any model in the class, which is the case in most real-world problems.

(Figure: the Gaussian process dynamics model's mean function for each action.)

Conclusion
• RBMS is able to learn models from misspecified model classes that perform well in cases where learning based on minimizing prediction error does not
• RBMS allows us to use smaller model classes, resulting in significantly lower sample complexity than using larger, more expressive model classes
• The extension of RBMS to Bayesian nonparametric models is promising but requires more work to understand how to perform policy improvement; here, the gradient for RBMS policy improvement was approximated by randomly adding and removing data from the Gaussian process
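The final point, approximating policy improvement by randomly adding and removing data from the GP, can be sketched as a simple hill climb. This is a minimal sketch, assuming hypothetical hooks `fit_gp`, `solve_policy`, and `estimate_return` that are not part of the original work.

```python
import random

def improve_by_data_search(data, fit_gp, solve_policy, estimate_return,
                           n_steps=50, seed=0):
    """Approximate BNP-RBMS policy improvement: randomly toggle training
    points in or out of the GP's data set and keep each change only if
    the estimated return of the resulting policy does not decrease."""
    rng = random.Random(seed)
    include = [True] * len(data)

    def score(mask):
        subset = [d for d, keep in zip(data, mask) if keep]
        policy = solve_policy(fit_gp(subset))  # GP predictions depend on the data
        return estimate_return(policy)         # e.g. model-free MC evaluation [2]

    best = score(include)
    for _ in range(n_steps):
        i = rng.randrange(len(data))
        include[i] = not include[i]            # randomly add or remove a point
        trial = score(include)
        if trial >= best:
            best = trial                       # keep the change
        else:
            include[i] = not include[i]        # revert it
    return [d for d, keep in zip(data, include) if keep], best
```

This random local search sidesteps the open question of what a gradient with respect to the data means, at the cost of many policy evaluations per accepted change.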