Introduction to Machine Learning: Reinforcement Learning (Barnabás Póczos)
Transcript
Page 1:

Introduction to Machine Learning

Reinforcement Learning

Barnabás Póczos


Page 2:

Contents

Markov Decision Processes:

State-Value function, Action-Value Function

Bellman Equation

Policy Evaluation, Policy Improvement, Optimal Policy

Dynamic Programming:

Policy Iteration

Value Iteration

Model-Free Methods:

Monte Carlo methods

TD Learning

Page 3:

RL Books

Page 4:

Introduction to Reinforcement Learning

Page 5:

Reinforcement Learning Applications

Finance

Portfolio optimization

Trading

Inventory optimization

Control

Elevator, Air conditioning, power grid, …

Robotics

Games

Go, Chess, Backgammon

Computer games

Chatbots

Page 6:

Reinforcement Learning Framework

[Figure: the agent-environment interaction loop between the Agent and the Environment]

Page 7:

Markov Decision Processes

RL Framework + Markov assumption

Page 8:

Discount Rates

An issue:

Solution:
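In standard notation (the slide's formulas are not legible in this transcript), the issue is that an undiscounted sum of rewards over an infinite horizon may diverge; the solution is to discount future rewards by a factor 0 \le \gamma < 1:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

If the rewards are bounded by R_max, this sum is bounded by R_max / (1 - \gamma).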

Page 9:

RL is different from Supervised/Unsupervised learning

Page 10:

State-Value Function

Bellman Equation of V state-value function:
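In standard notation, with transition probabilities P(s'|s,a), rewards R(s,a,s'), and discount \gamma, the Bellman equation for the state-value function of a policy \pi reads:

V^\pi(s) = E_\pi[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s ] = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V^\pi(s') ]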

Backup Diagram:

Page 11:

Proof of Bellman Equation:

Bellman Equation

Page 12:

Action-Value Function

Bellman Equation of the Q Action-Value function:
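In the same notation, the standard form is:

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') ]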

Backup Diagram:

Proof: similar to the proof of the Bellman Equation of V state-value function.

Page 13:

Relation between Q and V Functions

Q from V:

V from Q:
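Written out (standard forms, since the slide's formulas are images):

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V^\pi(s') ],    V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)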

Page 14:

The Optimal Value Function and Optimal Policy

Partial ordering between policies:

Some policies are not comparable!

Optimal policy and optimal state-value function:

V*(s) shows the maximum expected discounted reward that one can achieve from state s with optimal play
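In standard notation, \pi \ge \pi' iff V^\pi(s) \ge V^{\pi'}(s) for all s, and the optimal quantities are

V^*(s) = \max_\pi V^\pi(s),    \pi^* = \arg\max_\pi V^\pi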

Page 15:

The Optimal Action-Value Function

Similarly, the optimal action-value function:

Important Properties:
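The standard definition, and the properties this presumably refers to:

Q^*(s,a) = \max_\pi Q^\pi(s,a),    V^*(s) = \max_a Q^*(s,a),    \pi^*(s) = \arg\max_a Q^*(s,a)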

Page 16:

The Existence of the Optimal Policy

Theorem: For any Markov Decision Process:

(*) There is always a deterministic optimal policy.

Page 17:

Example

Goal = Terminal state

4 states

2 possible actions in each state. [E.g., in A: 1) go to B or 2) go to C]

P(s' | s, a) = (0.9, 0.1): with probability 10% we go in the wrong direction

Page 18:

Calculating the Value of Policy π

Goal

π1: always choosing Action 1

Page 19:

Calculating the Value of Policy π

Goal

π2: always choosing Action 2

Similarly as before:

Page 20:

Calculating the Value of Policy π

Goal

π3: mixed

Page 21:

Comparing the 3 policies

Page 22:

Theorem: Bellman optimality equation for V*:
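In standard notation:

V^*(s) = \max_a \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V^*(s') ]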

Backup Diagram:

Bellman optimality equation for V*

Similarly to how we derived the Bellman equations for V and Q, we can derive Bellman equations for V* and Q* as well.

We proved this for V:

Page 23:

Proof of Bellman optimality equation for V*:

Bellman optimality equation for V*

Page 24:

Bellman optimality equation for Q*:
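In standard notation:

Q^*(s,a) = \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') ]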

Backup Diagram:

Bellman optimality equation for Q*

Proof: similar to the proof of the Bellman Equation of V*.

Page 25:

Greedy Policy for V

Equivalently (greedy policy for a given V(s) function):
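In standard notation, the greedy policy with respect to a given value function V is:

\pi_{greedy}(s) = \arg\max_a \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V(s') ]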

Page 26:

The Optimal Value Function and Optimal Policy

Bellman optimality equation for V*:

Theorem: A greedy policy for V* is an optimal policy. Let us denote it by π*.

Theorem: A greedy optimal policy can be obtained from the optimal value function:

This is a nonlinear equation!

Page 27:

RL Tasks

Policy evaluation:

Policy improvement

Finding an optimal policy

Page 28:

Policy Evaluation

Page 29:

Policy Evaluation with Bellman Operator

This equation can be used as a fixed-point equation to evaluate policy π.

Bellman operator: (one step with π, then using V)

Iteration:

Theorem:

Bellman equation:
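A minimal sketch of this fixed-point iteration in Python/NumPy, assuming a tabular MDP given by arrays P[s, a, s'] (transition probabilities) and R[s, a, s'] (rewards); the array layout and function name are illustrative, not from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman operator T^pi until V converges to V^pi.

    P:  (S, A, S) transition probabilities P(s'|s,a)
    R:  (S, A, S) rewards R(s,a,s')
    pi: (S, A) stochastic policy pi(a|s)
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # One application of the Bellman operator: expected one-step
        # reward plus discounted value of the successor state.
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = np.einsum("sa,sa->s", pi, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```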

Page 30:

Policy Improvement

Page 31:

Policy Improvement

Theorem:
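The standard policy improvement theorem (the statement itself is an image in the slides): if \pi' is greedy with respect to Q^\pi, i.e. \pi'(s) = \arg\max_a Q^\pi(s,a), then V^{\pi'}(s) \ge V^\pi(s) for every state s, with strict improvement in some state unless \pi is already optimal.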

Page 32:

Proof of Policy Improvement

Proof:

Page 33:

Finding the Optimal Policy

Page 34:

Finding the Optimal Policy

First we will discuss methods that need to know the model:

Model-based approaches:

Policy Iteration

Value Iteration

Model-free approaches:

Monte Carlo Method

TD Learning

Page 35:

Policy Iteration

1. Initialization

2. Policy Evaluation

Page 36:

Policy Iteration

One drawback of policy iteration is that each iteration involves policy evaluation

3. Policy Improvement
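A compact sketch of the full loop (evaluation, then greedy improvement), reusing the policy_evaluation helper from the sketch above; again the array layout is an assumption for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement."""
    S, A = P.shape[0], P.shape[1]
    pi = np.full((S, A), 1.0 / A)                       # 1. initialization: uniform policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)          # 2. policy evaluation
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        greedy = np.argmax(Q, axis=1)                   # 3. policy improvement
        pi_new = np.eye(A)[greedy]                      # deterministic greedy policy
        if np.array_equal(pi_new, pi):                  # stop when the policy is stable
            return pi, V
        pi = pi_new
```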

Page 37:

Value Iteration

The greedy operator:

Main idea:

The value iteration update:
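In standard notation, the update applies the greedy Bellman operator repeatedly,

V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V_k(s') ],

and V_k converges to V^* for 0 \le \gamma < 1.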

Page 38:

Model-Free Methods

Page 39:

Monte Carlo Policy Evaluation

Page 40:

Monte Carlo Policy Evaluation

Without knowing the model

Page 41:

Empirical average: Let us use N simulations starting from state s, following policy π. The observed rewards are:

Let
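The estimator is presumably the empirical average of the observed returns R_1, ..., R_N:

\hat{V}(s) = \frac{1}{N} \sum_{i=1}^{N} R_i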

This is the so-called "Monte Carlo" method.

MC can estimate V(s) without knowing the model

Monte Carlo Estimation of V(s)

Page 42:

If we don’t want to store the N sample points:

Online Averages (=Running averages)

Similarly,
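The running-average recursion (the same trick applies to V) is:

\hat{V}_N = \hat{V}_{N-1} + \frac{1}{N} ( R_N - \hat{V}_{N-1} )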

Page 43:

From one single trajectory we can get many estimates of R:

Warning: these R(s_i) random variables might be dependent!

s_0 → s_1 → s_2 → … → s_T, with rewards r_1, r_2, r_3, r_4, …; the tail of the trajectory starting at each visited state gives the returns R(s_0), R(s_1), R(s_2), …

A better MC method

Page 44:

Temporal Differences method

We already know the MC estimation of V:

Here is another estimate:
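In standard notation, the TD(0) target replaces the full return with the one-step bootstrapped estimate r_{t+1} + \gamma V(s_{t+1}), giving the update

V(s_t) \leftarrow V(s_t) + \alpha [ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) ]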

Page 45:

Temporal difference:

Benefits

No need for a model! (Dynamic programming with Bellman operators needs one.)

No need to wait for the end of the episode! (MC methods need to.)

We use an estimator to create another estimator (= bootstrapping) … and it still works.

Instead of waiting for R_k, we estimate it using V_{k-1}.

Temporal Differences method

Page 46:

They all estimate V

DP:

Estimate comes from the Bellman equation

It needs to know the model

TD:

Expectation is approximated with random samples

Doesn’t need to wait for the end of the episodes.

MC:

Expectation is approximated with random samples

It needs to wait for the end of the episodes

Comparisons: DP, MC, TD

Page 47:

White circle: state

Black circle: action

T: terminal state

[Figure: backup diagrams over the MDP tree, rooted at state s_t and ending in terminal states T]

MDP Backup Diagrams

Page 48:

[Figure: a single complete sampled trajectory s_t, r_{t+1}, s_{t+1}, r_{t+2}, s_{t+2}, …, ending in a terminal state T]

Monte Carlo Backup Diagram

Page 49:

[Figure: a single sampled one-step transition s_t, r_{t+1}, s_{t+1}; the rest of the trajectory is replaced by the current value estimate]

Temporal Differences Backup Diagram

Page 50:

[Figure: a one-step full-width backup from s_t over all actions and all successor states s_{t+1}, with rewards r_{t+1}]

Dynamic Programming Backup Diagram

Page 51:

TD for function Q

This was our TD estimate for V:

We can use the same for Q(s,a):
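The analogous update for the action-value function (the SARSA form of the TD update) is:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ]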

Page 52:

Finding The Optimal Policy with TD

Page 53:

We already know the Bellman equation for Q*:

DP update:

TD update for Q [= Q Learning]
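In standard notation, the DP update applies the Bellman optimality operator with the known model, while the Q-learning update replaces the expectation with a single sampled transition (s, a, r, s'):

Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]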

Finding The Optimal Policy with TD

Page 54:

Q(s,a) arbitrary
For each episode
    s := s0; t := 0
    For each time step t in the current episode
        t := t + 1
        Choose action a according to a policy π (e.g., ε-greedy)
        Execute action a
        Observe reward r and new state s'
        Q(s,a) := Q(s,a) + α [ r + γ max_{a'} Q(s',a') - Q(s,a) ]
        s := s'
    End For
End For

Q Learning Algorithm
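A minimal, runnable sketch of the algorithm above, assuming an environment object with hypothetical reset() and step(a) methods (returning next state, reward, and a done flag) and discrete states and actions; this interface is illustrative, not from the slides:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # Q(s,a) arbitrary (here: zeros)
    for _ in range(episodes):
        s = env.reset()                          # s := s0
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # execute a, observe r and s'
            # off-policy TD target uses the max over next actions
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next                           # s := s'
    return Q
```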

Page 55:

Q-learning learns an optimal policy no matter which policy the agent is actually following (i.e., which action a it selects for any state s)

as long as there is no bound on the number of times it tries an action in any state (i.e., it does not always do the same subset of actions in a state).

Because it learns an optimal policy no matter which policy it is carrying out, it is called an off-policy method.

Q Learning Algorithm