Approximate Models for Batch RL
Emma Brunskill
2/18/15

Source: ...ebrun/15889e/lectures/Lecture5slides.pdf

Feb 16, 2019

Page 1:

Approximate Models for Batch RL

Emma Brunskill

Page 2:

[Figure: FVI / FQI, PI, and approximate model planners. Image from David Silver]

Policy Iteration maintains both an explicit representation of a policy and the value of that policy

Page 3:

Forward Search w/Generative Model

Slide modified from David Silver

[Figure: search tree from the root state, branching on actions a1, a2 and then on next states s1, s2]

Page 4:

Exact/Exhaustive Forward Search

Slide modified from David Silver

[Figure: search tree with a max node over actions a1, a2 and expectation (exp) nodes over next states s1, s2]

Page 5:

How many nodes in an H-depth tree (as a function of state space size |S| and action space size |A|)?

Slide modified from David Silver


Page 6:

How many nodes in an H-depth tree (as a function of state space size |S| and action space size |A|)? $(|S||A|)^H$

Slide modified from David Silver

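A minimal sketch of the exhaustive search above, assuming a tiny made-up MDP: the tables `P` and `R` and the function name `forward_search` are hypothetical, and no discounting is applied. Each level takes a max over actions and an exact expectation over next states, and the counter records how many leaves are reached.

```python
from typing import Dict, List, Tuple

# Hypothetical 2-state, 2-action MDP, for illustration only.
# P[s][a] lists (next_state, probability); R[s][a] is the immediate reward.
P: Dict[int, Dict[int, List[Tuple[int, float]]]] = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.1), (1, 0.9)]},
    1: {0: [(0, 0.8), (1, 0.2)], 1: [(0, 0.5), (1, 0.5)]},
}
R: Dict[int, Dict[int, float]] = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}


def forward_search(s: int, depth: int, counter: List[int]) -> float:
    """Exact depth-step value of s: max over actions of an expectation over next states."""
    if depth == 0:
        counter[0] += 1  # reached one leaf of the search tree
        return 0.0
    best = float("-inf")
    for a in P[s]:                      # max node: enumerate every action
        q = R[s][a]
        for s_next, prob in P[s][a]:    # exp node: enumerate every next state
            q += prob * forward_search(s_next, depth - 1, counter)
        best = max(best, q)
    return best


leaves = [0]
value = forward_search(0, 3, leaves)
print(leaves[0], value)
```

With |S| = |A| = 2 and H = 3, the search reaches (2 * 2)^3 = 64 leaves, which matches the $(|S||A|)^H$ growth on the slide.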

Page 7:

Sparse Sampling: Don't Enumerate All Next States, Instead Sample Next States $s' \sim P(s'|s,a)$

Sample n next states, $s_i \sim P(s'|s,a)$

Compute $\frac{1}{n} \sum_{i=1}^{n} V(s_i)$

Converges to expected future reward: $\sum_{s'} P(s'|s,a) V(s')$

Slide modified from David Silver

[Figure: the same max / exp search tree, with sampled next states (e.g. s1, s37) under an expectation node]
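To make the sampled backup concrete, here is a small sketch comparing the sample average with the exact expectation. The three-state distribution `probs` and the value table `V` are invented for the example; only the averaging step comes from the slide.

```python
import random

# Hypothetical next-state distribution P(s'|s,a) and value table V.
next_states = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
V = {0: 10.0, 1: 0.0, 2: 5.0}

# Exact backup: sum over s' of P(s'|s,a) * V(s').
exact = sum(p * V[s] for s, p in zip(next_states, probs))


def sampled_backup(n: int, rng: random.Random) -> float:
    """Average V over n next states drawn from P(s'|s,a)."""
    samples = rng.choices(next_states, weights=probs, k=n)
    return sum(V[s] for s in samples) / n


rng = random.Random(0)
estimate = sampled_backup(20000, rng)
print(exact, estimate)
```

The estimate approaches the exact value as n grows; for small n it can be noticeably off, which is why n has to be chosen with the desired accuracy in mind.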

Page 8:

Sparse Sampling: how many nodes if we sample n states at each action node? Independent of |S|! $O((n|A|)^H)$


Slide modified from David Silver


Page 9:

Sparse Sampling: how many nodes if we sample n states at each action node? Independent of |S|! $O((n|A|)^H)$

Upside: n can be chosen to bound the error of the value estimate at the root state, independent of the state space size

Downside: still exponential in the horizon H, and n must still be large for good bounds

Slide modified from David Silver

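A rough sketch of sparse sampling, assuming access to a generative model that returns a sampled next state and reward. The names `sparse_sampling` and `toy_model`, the undiscounted return, and the million-state toy simulator are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

# A generative model maps (state, action, rng) to a sampled (next_state, reward).
Model = Callable[[int, int, random.Random], Tuple[int, float]]


def sparse_sampling(s: int, depth: int, n: int, actions: List[int],
                    model: Model, rng: random.Random, calls: List[int]) -> Tuple[float, int]:
    """Return (estimated value of s, greedy action) from a depth-limited sampled tree."""
    if depth == 0:
        return 0.0, actions[0]
    best_q, best_a = float("-inf"), actions[0]
    for a in actions:
        total = 0.0
        for _ in range(n):  # sample n next states instead of enumerating all of S
            s_next, r = model(s, a, rng)
            calls[0] += 1   # count generative-model calls (one per non-root node)
            total += r + sparse_sampling(s_next, depth - 1, n, actions, model, rng, calls)[0]
        q = total / n
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a


def toy_model(s: int, a: int, rng: random.Random) -> Tuple[int, float]:
    """Hypothetical simulator over a million states; action 1 always pays reward 1."""
    return rng.randrange(1_000_000), float(a)


calls = [0]
value, action = sparse_sampling(0, 3, 2, [0, 1], toy_model, random.Random(0), calls)
print(value, action, calls[0])
```

With n = 2, |A| = 2 and H = 3 the search makes 4 + 16 + 64 = 84 model calls, regardless of the million states in the toy simulator. The cost still grows as (n|A|)^H in the horizon.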

Page 10:

Limitation of Sparse Sampling

Slide modified from Alan Fern

Page 11:

Monte Carlo Tree Search

Combine ideas of sparse sampling with an adaptive method for focusing on more promising parts of the tree

Here "more promising" means the actions that seem likely to yield higher long-term reward

Uses the idea of simulation search

Page 12:

Simulation Based Search

Slide modified from David Silver

Page 13:

Simulation Based Search

Slide modified from David Silver

Page 14:

Simple Monte Carlo Search

Slide modified from David Silver

Rollout policy

Greedy improvement with respect to a fixed rollout policy
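One way to read "greedy improvement with respect to a fixed rollout policy" is sketched below: for each candidate first action, run k simulated episodes that follow the rollout policy afterwards, then pick the action with the best mean return. The function names, the toy `chain_model`, and the always-0 rollout policy are assumptions for illustration, not taken from the slide.

```python
import random
from typing import Callable, Sequence, Tuple

Model = Callable[[int, int, random.Random], Tuple[int, float]]
Policy = Callable[[int, random.Random], int]


def rollout_return(s: int, a: int, model: Model, rollout_policy: Policy,
                   horizon: int, rng: random.Random) -> float:
    """Return of one simulated episode: take a first, then follow the rollout policy."""
    total = 0.0
    for _ in range(horizon):
        s, r = model(s, a, rng)
        total += r
        a = rollout_policy(s, rng)
    return total


def simple_mc_search(s: int, actions: Sequence[int], model: Model, rollout_policy: Policy,
                     horizon: int, k: int, rng: random.Random) -> int:
    """Pick the first action whose k rollouts have the highest mean return."""
    def mean_return(a: int) -> float:
        returns = [rollout_return(s, a, model, rollout_policy, horizon, rng) for _ in range(k)]
        return sum(returns) / k
    return max(actions, key=mean_return)


def chain_model(s: int, a: int, rng: random.Random) -> Tuple[int, float]:
    """Hypothetical deterministic simulator: action 1 moves right and pays reward 1."""
    return s + a, float(a)


def always_zero(s: int, rng: random.Random) -> int:
    """Hypothetical fixed rollout policy that always picks action 0."""
    return 0


best = simple_mc_search(0, [0, 1], chain_model, always_zero, horizon=5, k=10, rng=random.Random(0))
print(best)
```

The result is only one step of improvement over the rollout policy: the rollout policy itself never changes, which is the limitation that tree-based methods such as UCT address.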

Page 15:

Upper Confidence Tree (UCT) [Kocsis & Szepesvari, 2006]

Slide modified from Alan Fern

• Combine forward search and simulation search
• Instance of Monte-Carlo Tree Search
  • Repeated Monte Carlo simulation of rollout policy
  • Rollouts add one or more nodes to search tree
• UCT
  • Uses optimism under uncertainty idea
  • Some nice theoretical properties
  • Much better realtime performance than sparse sampling

Page 16:

Slide modified from Alan Fern

Page 17:

Slide modified from Alan Fern

Page 18:

Slide modified from Alan Fern

Page 19:

Slide modified from Alan Fern

Page 20:

Slide modified from Alan Fern

Page 21:

Slide modified from Alan Fern

(Upper Confidence Bound)
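The slide content here is an image, so the rule below is not transcribed from it. It is the usual UCB1-style form associated with UCT [Kocsis & Szepesvari, 2006]: choose the action maximizing Q(s,a) + c * sqrt(ln n(s) / n(s,a)). The function name `ucb_action`, the default exploration constant `c`, and the example numbers are assumptions.

```python
import math
from typing import Dict


def ucb_action(q: Dict[int, float], n_sa: Dict[int, int], c: float = math.sqrt(2)) -> int:
    """UCB1-style choice: mean value plus a bonus that shrinks with the visit count."""
    untried = [a for a in q if n_sa[a] == 0]
    if untried:
        return untried[0]        # try every action once before using the bound
    n_s = sum(n_sa.values())     # total visits to this state
    return max(q, key=lambda a: q[a] + c * math.sqrt(math.log(n_s) / n_sa[a]))


# Hypothetical statistics: action 0 looks slightly better but has been tried far more often.
q_values = {0: 0.6, 1: 0.5}
visits = {0: 100, 1: 2}
print(ucb_action(q_values, visits))          # optimistic choice
print(ucb_action(q_values, visits, c=0.0))   # purely greedy choice
```

With the bonus, the rarely tried action 1 is selected; with c = 0 the rule reduces to the greedy choice of action 0. This is the "optimism under uncertainty" idea in its simplest form.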

Page 22:

Slide modified from Alan Fern

Page 23:

Slide modified from Alan Fern

● Requires a simulator / generative model

● On each pass down the tree, follow the tree policy until reaching a leaf state where not all actions have been tried

● From that leaf state, simulate the result of taking an untried action
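The three points above can be sketched as a compact UCT loop. This is a simplified illustration, not the exact algorithm from the slides: nodes are keyed by (state, depth), so identical states reached by different paths share statistics, the rollout policy is uniformly random, returns are undiscounted, and the class name `UCT`, the constant `c`, and `toy_model` are all hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Callable, List, Tuple

Model = Callable[[int, int, random.Random], Tuple[int, float]]


class UCT:
    """Minimal UCT sketch for a finite-horizon MDP with a generative model."""

    def __init__(self, model: Model, actions: List[int], horizon: int,
                 c: float, rng: random.Random) -> None:
        self.model, self.actions, self.horizon = model, actions, horizon
        self.c, self.rng = c, rng
        self.n = defaultdict(int)     # visit counts for (state, depth, action)
        self.q = defaultdict(float)   # mean return for (state, depth, action)

    def rollout(self, s: int, depth: int) -> float:
        """Rollout policy: uniformly random actions until the horizon."""
        total = 0.0
        for _ in range(depth, self.horizon):
            s, r = self.model(s, self.rng.choice(self.actions), self.rng)
            total += r
        return total

    def simulate(self, s: int, depth: int) -> float:
        """One pass down the tree; returns the sampled return from (s, depth)."""
        if depth == self.horizon:
            return 0.0
        untried = [a for a in self.actions if self.n[(s, depth, a)] == 0]
        if untried:
            # Leaf where not all actions have been tried: try one, then roll out.
            a = untried[0]
            s_next, r = self.model(s, a, self.rng)
            ret = r + self.rollout(s_next, depth + 1)
        else:
            # Tree policy: action with the highest upper confidence bound.
            n_s = sum(self.n[(s, depth, b)] for b in self.actions)
            a = max(self.actions, key=lambda b: self.q[(s, depth, b)]
                    + self.c * math.sqrt(math.log(n_s) / self.n[(s, depth, b)]))
            s_next, r = self.model(s, a, self.rng)
            ret = r + self.simulate(s_next, depth + 1)
        # Back up: incremental mean of the returns seen for this (state, depth, action).
        key = (s, depth, a)
        self.n[key] += 1
        self.q[key] += (ret - self.q[key]) / self.n[key]
        return ret

    def plan(self, s: int, num_simulations: int) -> int:
        """Run simulations from s and return the root action with the best mean return."""
        for _ in range(num_simulations):
            self.simulate(s, 0)
        return max(self.actions, key=lambda a: self.q[(s, 0, a)])


def toy_model(s: int, a: int, rng: random.Random) -> Tuple[int, float]:
    """Hypothetical deterministic simulator: action 1 moves right and pays reward 1."""
    return s + a, float(a)


planner = UCT(toy_model, actions=[0, 1], horizon=2, c=math.sqrt(2), rng=random.Random(0))
best = planner.plan(0, num_simulations=200)
print(best)
```

Each pass updates statistics only along the path it followed inside the tree, so simulation effort concentrates on actions whose upper confidence bound stays high.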

Page 24:

Guarantees on UCT

Slide modified from Remi Munos

Page 25:

Slide modified from slides from Alan Fern & David Silver

Computer Go

Previous game-tree approaches fared poorly

Page 26:

Rules of Go

Slide modified from David Silver

Page 27:

Position Evaluation in Go

Slide modified from David Silver

Page 28:

Monte Carlo Evaluation in Go: a planning problem, just a very, very hard one

Slide modified from David Silver

Page 29:

Enormous Progress. MCTS Huge Impact

Slide modified from David Silver

Page 30:

Going Back to Batch RL...

• Use a supervised learning method to compute a model

• Use the learned model with MCTS planning
  • Note: error in the model will impact error in the estimated values!

• Compute an action for the current state, take the action, then redo planning for the next state
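The recipe above can be sketched for the tabular case as follows. Everything named here is an assumption for illustration: the count-based maximum-likelihood model stands in for "a supervised learning method", unseen state-action pairs are treated as zero-reward self-loops, `greedy_one_step` is a deliberately simple stand-in where an MCTS planner would normally go, and `env_step` is a hypothetical real environment.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float, int]  # (s, a, r, s_next)
Model = Callable[[int, int, random.Random], Tuple[int, float]]


def fit_tabular_model(batch: List[Transition]) -> Model:
    """Maximum-likelihood model from logged data: next-state counts and mean rewards."""
    counts = defaultdict(Counter)
    reward_sum = defaultdict(float)
    for s, a, r, s_next in batch:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r

    def model(s: int, a: int, rng: random.Random) -> Tuple[int, float]:
        seen = counts.get((s, a))
        if not seen:
            return s, 0.0  # assumption: unseen (s, a) self-loops with zero reward
        total = sum(seen.values())
        s_next = rng.choices(list(seen), weights=list(seen.values()), k=1)[0]
        return s_next, reward_sum[(s, a)] / total

    return model


def greedy_one_step(model: Model, actions: List[int], rng: random.Random, k: int = 10):
    """Stand-in planner (not MCTS): pick the action with the best sampled mean reward."""
    def plan(s: int) -> int:
        return max(actions, key=lambda a: sum(model(s, a, rng)[1] for _ in range(k)) / k)
    return plan


def run_with_replanning(s: int, steps: int, plan: Callable[[int], int],
                        env_step: Callable[[int, int], int]) -> int:
    """Plan for the current state, take the action, then replan from the next state."""
    for _ in range(steps):
        a = plan(s)
        s = env_step(s, a)
    return s


# Hypothetical logged batch of (s, a, r, s_next) tuples.
batch = [(0, 1, 1.0, 1), (0, 1, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 0.0, 1)]
rng = random.Random(0)
model = fit_tabular_model(batch)
plan = greedy_one_step(model, [0, 1], rng)
final_state = run_with_replanning(0, 2, plan, env_step=lambda s, a: s + a)
print(plan(0), final_state)
```

Any planner that only needs a generative model can be swapped in for `greedy_one_step`. Because the planner only ever sees the learned model, errors in the fitted transition and reward estimates carry over directly into the estimated values.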

Page 31:

Autonomous Driving using Texplore (Hester and Stone 2013)

Page 32:

[Figure: FVI / FQI, API, and approximate model planners. Image from David Silver]
