
Bayesian Reinforcement Learning

Machine Learning RCC

16th June 2011

Outline

• Introduction to Reinforcement Learning

• Overview of the field

• Model-based BRL

• Model-free RL

References

• ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel

• Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto

Machine Learning

• Unsupervised Learning

• Reinforcement Learning

• Supervised Learning

Definitions

• State, Action, Reward

• Policy

• Reward function

Markov Decision Process

[Figure: MDP trajectory x0, a0, r0, x1, a1, r1, ...; the policy chooses each action, the transition probability generates the next state, and the reward function generates each reward]

Value Function
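The equations on this slide did not survive extraction; as a standard definition (discount factor γ and expectation notation assumed, not recovered from the slide), the value of a policy π is the expected discounted sum of rewards:

```latex
V^{\pi}(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x,\; a_t \sim \pi(\cdot \mid x_t) \right],
\qquad
Q^{\pi}(x, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x,\; a_0 = a \right].
```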

Optimal Policy

• Assume one optimal action per state


Value Iteration
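A minimal sketch of tabular value iteration, assuming the reward/transition models are known and stored in illustrative arrays R and P (these names and shapes are assumptions for the sketch, not from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration on a known MDP.

    P: transition probabilities, shape (S, A, S), each P[s, a] summing to 1
    R: expected immediate reward, shape (S, A)
    Returns the optimal value function and a greedy (deterministic) policy.
    """
    S, A = R.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * P.dot(V)      # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)         # one optimal action per state
    return V, policy
```

Extracting the greedy policy at the end matches the slide's assumption of one optimal action per state.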

Reinforcement Learning

• RL Problem: Solve the MDP when the reward/transition models are unknown

• Basic Idea: Use samples obtained from the agent's interaction with the environment

Model-Based vs Model-Free RL

• Model-Based: Learn a model of the reward/transition dynamics, then derive the value function/policy from it

• Model-Free: Learn the value function/policy directly, without an explicit model


RL Solutions

• Value Function Algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy

• SARSA, Q-learning

RL Solutions

• Actor-Critic
– Define a policy structure (actor)
– Define a value function (critic)
– Sample state-action-reward
– Update both actor & critic
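As a rough illustration of the loop above, a one-step actor-critic sketch for a small discrete problem; the softmax policy, the env.reset()/env.step() interface and all parameter names are assumptions of the sketch, not taken from the talk:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic(env, n_states, n_actions, gamma=0.99,
                 alpha_actor=0.01, alpha_critic=0.1, episodes=500):
    theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
    V = np.zeros(n_states)                    # critic: state-value estimates
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            probs = softmax(theta[x])
            a = np.random.choice(n_actions, p=probs)
            x_next, r, done = env.step(a)
            # critic: one-step TD error
            target = r + (0.0 if done else gamma * V[x_next])
            delta = target - V[x]
            V[x] += alpha_critic * delta
            # actor: policy-gradient step, using the TD error as the learning signal
            grad_log = -probs
            grad_log[a] += 1.0
            theta[x] += alpha_actor * delta * grad_log
            x = x_next
    return theta, V
```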

RL Solutions

• Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy

• PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
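PEGASUS's key idea (Ng & Jordan) is to evaluate every candidate policy on the same fixed set of random 'scenarios' (random seeds), so that comparisons between policies are no longer corrupted by sampling noise. Below is a toy sketch of that idea with a simple hill-climbing search; the simulate(params, seed) interface and parameter shape are hypothetical, not from the talk:

```python
import numpy as np

def evaluate(policy_params, simulate, scenarios):
    """Average return of a policy over a fixed set of scenarios (random seeds)."""
    return np.mean([simulate(policy_params, seed) for seed in scenarios])

def pegasus_style_search(simulate, dim, n_scenarios=20, iters=200, step=0.1):
    # Fix the scenarios once; every candidate policy sees the same randomness,
    # so evaluating a policy becomes a deterministic function of its parameters.
    scenarios = np.arange(n_scenarios)
    params = np.zeros(dim)
    best = evaluate(params, simulate, scenarios)
    for _ in range(iters):
        candidate = params + step * np.random.randn(dim)  # simple hill climbing
        score = evaluate(candidate, simulate, scenarios)
        if score > best:
            params, best = candidate, score
    return params, best
```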

Online - Offline

• Offline
– Use a simulator
– Policy is fixed for each 'episode'
– Updates are made at the end of the episode

• Online
– Interact directly with the environment
– Learning happens step-by-step

Model-Free Solutions

1. Prediction: Estimate V(x) or Q(x,a)

2. Control: Extract a policy
• On-policy
• Off-policy

Monte-Carlo Predictions

[Figure reconstructed as a table: driving back to Cambridge, with value predictions updated towards the Monte-Carlo target]

State                         Leave car park   Get out of city   Motorway   Enter Cambridge
Reward on the following leg        -13              -15             -61          -11
Predicted value (before)           -90              -83             -55          -11
Updated value (MC target)         -100              -87             -72          -11
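A small sketch of the update behind the table: with step size 1, each state's prediction is replaced by the actual return (the sum of the remaining rewards) observed on this single trip; the state names and numbers are the ones from the slide.

```python
rewards = {'Leave car park': -13, 'Get out of city': -15,
           'Motorway': -61, 'Enter Cambridge': -11}
values  = {'Leave car park': -90, 'Get out of city': -83,
           'Motorway': -55, 'Enter Cambridge': -11}

episode = list(rewards)   # the single trajectory above, in order
alpha = 1.0               # step size 1 reproduces the slide's numbers

# Monte-Carlo target: the actual return (sum of remaining rewards) from each state
for i, state in enumerate(episode):
    G = sum(rewards[s] for s in episode[i:])
    values[state] += alpha * (G - values[state])

print(values)  # {'Leave car park': -100.0, 'Get out of city': -87.0, 'Motorway': -72.0, ...}
```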

Temporal Difference Predictions

[Figure reconstructed as a table: the same journey, with value predictions updated towards the one-step TD target]

State                         Leave car park   Get out of city   Motorway   Enter Cambridge
Reward on the following leg        -13              -15             -61          -11
Predicted value (before)           -90              -83             -55          -11
Updated value (TD target)          -96              -70             -72          -11
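The corresponding TD(0) sketch: each prediction moves towards the next reward plus the current prediction of the next state (terminal value 0 after 'Enter Cambridge'); again step size 1 reproduces the slide's numbers.

```python
rewards = {'Leave car park': -13, 'Get out of city': -15,
           'Motorway': -61, 'Enter Cambridge': -11}
old_values = {'Leave car park': -90, 'Get out of city': -83,
              'Motorway': -55, 'Enter Cambridge': -11}

episode = list(rewards)
alpha = 1.0
values = dict(old_values)

# TD(0) target: immediate reward + current estimate of the next state's value
for i, state in enumerate(episode):
    next_value = old_values[episode[i + 1]] if i + 1 < len(episode) else 0.0
    target = rewards[state] + next_value
    values[state] += alpha * (target - values[state])

print(values)  # {'Leave car park': -96.0, 'Get out of city': -70.0, 'Motorway': -72.0, ...}
```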

Advantages of TD

• Don’t need a model of reward/transitions

• Online, fully incremental

• Proven to converge under standard conditions on the step size

• “Usually” faster than MC methods

From TD to TD(λ)

[Figure, shown over two slides: a chain of states and rewards ending in a terminal state]
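The diagrams did not survive extraction beyond the state/reward labels; the idea the title points to is the λ-return, which mixes n-step backups (standard definitions, not recovered from the slides):

```latex
G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(x_{t+n}),
\qquad
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}.
```

λ = 0 recovers the one-step TD target, while λ → 1 recovers the Monte-Carlo return.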

SARSA & Q-learning

TD-Learning

• SARSA: on-policy; estimates the value function of the current (behaviour) policy

• Q-Learning: off-policy; estimates the value function of the optimal policy
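A minimal tabular sketch of the two update rules; the environment's reset()/step() interface, the ε-greedy behaviour policy and all names are assumptions of the sketch, not from the talk. The only difference between the two methods is which next-state action value enters the target:

```python
import numpy as np

def epsilon_greedy(Q, x, n_actions, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[x]))

def td_control(env, n_states, n_actions, method='q_learning',
               gamma=0.99, alpha=0.1, episodes=500):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(), False
        a = epsilon_greedy(Q, x, n_actions)
        while not done:
            x_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, x_next, n_actions)
            if method == 'sarsa':
                # on-policy: bootstrap from the action the behaviour policy actually takes
                target = r + (0.0 if done else gamma * Q[x_next, a_next])
            else:
                # off-policy (Q-learning): bootstrap from the greedy action,
                # regardless of the behaviour policy
                target = r + (0.0 if done else gamma * Q[x_next].max())
            Q[x, a] += alpha * (target - Q[x, a])
            x, a = x_next, a_next
    return Q  # extract the policy by acting greedily with respect to Q
```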

GP Temporal Difference
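The equations and plots on these slides did not survive extraction. In the GPTD model of Engel, Mannor and Meir (Y. Engel co-authored the tutorial cited above), a Gaussian-process prior is placed on the value function and each observed reward is treated as a noisy measurement of a temporal difference of values:

```latex
r(x_t) = V(x_t) - \gamma V(x_{t+1}) + N_t,
\qquad
V \sim \mathcal{GP}\big(0,\, k(\cdot,\cdot)\big).
```

Because the observations are linear in V with Gaussian noise, conditioning on an observed trajectory gives a closed-form Gaussian posterior over V, i.e. a value estimate with error bars.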

