An Application of Reinforcement Learning and Apprenticeship Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, … (Stanford University)

Overview

Challenges in reinforcement learning for complex physical systems such as helicopters:
• Data collection: aggressive exploration is dangerous.
• Reward specification: for complex tasks, it is difficult to specify the proper reward function.

We present apprenticeship learning algorithms which use an expert demonstration and which:
• Do not require explicit exploration.
• Do not require an explicit reward function specification.

Experimental results:
• Demonstrate the effectiveness of the algorithms on a highly challenging control problem.
• Significantly extend the state of the art in autonomous helicopter flight: in particular, the first completion of autonomous stationary forward flips, stationary sideways rolls, nose-in funnels, and tail-in funnels.

Apprenticeship Learning for the Dynamics Model

• Key question: how should we fly the helicopter for data collection? How can we ensure that the entire flight envelope is covered by the data-collection process?
• State of the art: the E^3 algorithm of Kearns and Singh (2002), and its variants and extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). These algorithms repeatedly ask "do we have a good model of the dynamics?" and branch accordingly: NO → "explore"; YES → "exploit".
• Can we avoid explicit exploration?

Our algorithm replaces explicit exploration with an expert demonstration (a toy code sketch of this loop appears at the end of the section):

1. Collect expert human pilot flight data (a_1, s_1, a_2, s_2, a_3, s_3, …) and learn the dynamics model P_sa from it.
2. Run reinforcement learning with the learned dynamics model P_sa and the reward function R to find a control policy:

    max_π E[ R(s_0) + … + R(s_T) | π ]

3. Fly autonomously, add the resulting flight data (a_1, s_1, a_2, s_2, a_3, s_3, …) to the data set, re-learn P_sa, and repeat from step 2.

Take-away message: in the apprenticeship learning setting, i.e., when we have an expert demonstration, we do not need explicit exploration to perform as well as the expert.

Theorem. Given a polynomial number of teacher demonstrations, the apprenticeship learning algorithm returns a policy that performs as well as the teacher within a polynomial number of iterations. [See Abbeel & Ng, 2005 for details.]

Apprenticeship Learning for the Reward Function

In the pipeline above (dynamics model P_sa + reward function R → reinforcement learning, max_π E[ R(s_0) + … + R(s_T) | π ] → control policy), the reward function can be very difficult to specify. E.g., for our helicopter control problem we have:

    R(s) = c_1·(position error)^2 + c_2·(orientation error)^2 + c_3·(velocity error)^2 + c_4·(angular rate error)^2 + … + c_25·(inputs)^2.

Can we avoid the need to specify the reward function?

Our approach [Abbeel & Ng, 2004] is based on inverse reinforcement learning [Ng & Russell, 2000]. It returns a policy that performs as well as the expert, as measured according to the expert's unknown reward function, within a polynomial number of iterations.

Inverse RL algorithm (a code sketch follows at the end of the section):
For t = 1, 2, …
• Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies.
• RL step: compute an optimal policy π_t for the estimated reward function R(s).
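Code sketch for the dynamics-model loop. The following is a minimal, self-contained Python sketch of the no-exploration loop above, under strong simplifying assumptions that are ours, not the poster's: linear dynamics s' = A s + B a, a linear policy class, and a crude pseudo-inverse gain standing in for the real RL step. It illustrates the structure (demonstrate → fit P_sa → solve on the fitted model → fly → refit), not the actual helicopter system.

    import numpy as np

    rng = np.random.default_rng(0)
    n_s, n_a, T = 4, 2, 50

    # Unknown "true" dynamics standing in for the helicopter: s' = A s + B a + noise.
    A_true = np.eye(n_s) + 0.01 * rng.standard_normal((n_s, n_s))
    B_true = 0.1 * rng.standard_normal((n_s, n_a))

    def rollout(K, noise=0.01):
        """Simulate one flight of length T under the linear policy a = K s."""
        s = rng.standard_normal(n_s)
        traj = []
        for _ in range(T):
            a = K @ s
            s_next = A_true @ s + B_true @ a + noise * rng.standard_normal(n_s)
            traj.append((s, a, s_next))
            s = s_next
        return traj

    def fit_dynamics(data):
        """Least-squares estimate of the model P_sa, here the matrices [A B]."""
        X = np.array([np.concatenate([s, a]) for s, a, _ in data])
        Y = np.array([s_next for _, _, s_next in data])
        AB, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return AB.T[:, :n_s], AB.T[:, n_s:]   # A_hat, B_hat

    def rl_step(A, B):
        """Toy stand-in for the RL step: a crude gain that damps the fitted
        dynamics toward 0.9*I.  A real system would run an RL solver here."""
        return -np.linalg.pinv(B) @ (A - 0.9 * np.eye(n_s))

    # Step 1: expert human pilot flight provides the initial data set.
    expert_K = rl_step(A_true, B_true)        # the expert "knows" the dynamics
    data = rollout(expert_K)

    # Steps 2-3, iterated: learn P_sa, exploit the learned model, fly, add the
    # new data.  Note the complete absence of an explicit exploration phase.
    for _ in range(3):
        A_hat, B_hat = fit_dynamics(data)
        K = rl_step(A_hat, B_hat)
        data += rollout(K)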
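Code sketch for the reward function. The 25-term quadratic reward above translates directly into code. This is a hedged sketch only: the term names, the dictionary layout, and the convention of positive weights c_i on a negated quadratic cost are our assumptions; the poster does not give the tuned weights or the full list of terms.

    import numpy as np

    def reward(s, s_target, u, c):
        """Quadratic penalty of the form on the poster:
        c_1*(position error)^2 + ... + c_25*(inputs)^2,
        written as a negative cost so that larger reward is better.
        s / s_target: dicts of named state components; c: dict of weights."""
        err2 = lambda key: float(np.sum((s[key] - s_target[key]) ** 2))
        return -(c["position"] * err2("position")
                 + c["orientation"] * err2("orientation")
                 + c["velocity"] * err2("velocity")
                 + c["angular_rate"] * err2("angular_rate")
                 # ... further error terms, up to c_25 ...
                 + c["inputs"] * float(np.sum(np.asarray(u) ** 2)))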
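Code sketch for the inverse RL algorithm. The loop above can be written compactly using feature expectations μ(π) = E[Σ_t γ^t φ(s_t)], following the projection variant of the Abbeel & Ng (2004) algorithm. Here `solve_rl` (an RL solver returning sample trajectories of an optimal policy for R(s) = w^T φ(s)) and the feature map `phi` are caller-supplied stand-ins, not parts of the original system.

    import numpy as np

    def feature_expectations(trajs, phi, gamma=0.99):
        """Monte-Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) ]."""
        mus = [sum(gamma ** t * phi(s) for t, s in enumerate(traj))
               for traj in trajs]
        return np.mean(mus, axis=0)

    def apprenticeship_learning(mu_expert, solve_rl, phi, eps=1e-3, max_iter=50):
        """For t = 1, 2, ...:
          - inverse RL step: pick w so the expert outperforms all previously
            found policies (projection form: w = mu_expert - mu_bar);
          - RL step: solve_rl(w) gives trajectories of an optimal policy
            for R(s) = w^T phi(s)."""
        # Initialize with an arbitrary policy's feature expectations.
        mu = feature_expectations(solve_rl(np.zeros_like(mu_expert)), phi)
        mu_bar = mu
        for _ in range(max_iter):
            w = mu_expert - mu_bar                  # inverse RL step
            if np.linalg.norm(w) <= eps:            # expert matched within eps
                break
            mu = feature_expectations(solve_rl(w), phi)   # RL step
            # Project mu_expert onto the line through mu_bar and mu
            # (assumes mu != mu_bar, which holds until convergence).
            d = mu - mu_bar
            mu_bar = mu_bar + ((d @ (mu_expert - mu_bar)) / (d @ d)) * d
        return w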