Transcript
Page 1

Passive Reinforcement Learning

Bert Huang
Introduction to Artificial Intelligence

Page 2

Notation Review

• Recall the Bellman Equation:

\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

Alternate version:

U(s) = \max_{a \in A(s)} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s') \Big]
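To make the notation concrete, here is a minimal sketch (not from the slides) of both expressions on a tiny hypothetical MDP; the state names, actions, rewards, and probabilities are made up for illustration.

```python
# Minimal sketch of the Bellman expressions above on a tiny made-up MDP.
GAMMA = 0.9                                   # discount factor (gamma)

R = {"s0": 0.0, "s1": 1.0}                    # R(s): reward of each state
A = {"s0": ["stay", "go"], "s1": ["stay"]}    # A(s): legal actions
P = {                                         # P(s'|s,a) as nested dicts
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}

def expected_utility(s, a, U):
    """sum over s' of P(s'|s,a) * U(s')"""
    return sum(p * U[s2] for s2, p in P[(s, a)].items())

def bellman_backup(s, U):
    """U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')"""
    return R[s] + GAMMA * max(expected_utility(s, a, U) for a in A[s])

def greedy_policy(s, U):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) U(s')"""
    return max(A[s], key=lambda a: expected_utility(s, a, U))

U = {"s0": 5.0, "s1": 10.0}      # some current utility estimates
print(bellman_backup("s0", U))   # 0.0 + 0.9 * max(5.0, 0.2*5 + 0.8*10) = 8.1
print(greedy_policy("s0", U))    # "go"
```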

Page 3

Value Iteration Drawbacks

• Computes utility for every state

• Needs exact transition model

• Needs to fully observe state

• Needs to know exact reward for each state

Page 4

Slippery Bridge, etc. (example gridworlds)

Page 5

Value Iteration vs. Passive Learning vs. Active Learning

States and rewards:
• Value iteration: observes all states and rewards in the environment
• Passive learning: observes only states (and rewards) visited by the agent
• Active learning: observes only states (and rewards) visited by the agent

Transitions:
• Value iteration: observes all action-transition probabilities
• Passive learning: observes only transitions that occur from chosen actions
• Active learning: observes only transitions that occur from chosen actions

Decisions:
• Value iteration: N/A
• Passive learning: learning algorithm does not choose actions
• Active learning: learning algorithm chooses actions

Page 6

Passive Learning

• Recordings of agent running fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning

Page 7

Direct Utility Estimation

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')

U^{\pi}(s) is the future reward of the state, assuming we follow this policy.

Direct utility estimation: use the observed rewards and future rewards to estimate U (i.e., take the average of the returns sampled in the data).
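A minimal sketch of direct utility estimation, assuming the recordings are given as lists of (state, reward) pairs per episode; the episode data and the helper name direct_utility_estimate are hypothetical, but the estimate is simply the sample average of the observed discounted return-to-go from each visited state.

```python
from collections import defaultdict

def direct_utility_estimate(episodes, gamma=0.9):
    """Estimate U^pi(s) as the average observed discounted return-to-go
    from s, over every visit to s in the recorded episodes."""
    totals = defaultdict(float)   # sum of returns observed from each state
    counts = defaultdict(int)     # number of returns observed from it
    for episode in episodes:      # episode = list of (state, reward) pairs
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G          # return-to-go from this state
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical recordings of the agent running its fixed policy.
episodes = [
    [("s0", -0.04), ("s1", -0.04), ("goal", 1.0)],
    [("s0", -0.04), ("goal", 1.0)],
]
print(direct_utility_estimate(episodes))
```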

Page 8

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

Page 9

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

Observed outcomes of taking the RIGHT action:

Action   Result
RIGHT    UP
RIGHT    RIGHT
RIGHT    RIGHT
RIGHT    DOWN
RIGHT    RIGHT
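From the five observed outcomes of the RIGHT action above, the maximum-likelihood transition estimate is just the observed frequency of each result: 3/5 for RIGHT, 1/5 for UP, 1/5 for DOWN. A minimal sketch, assuming these counts are all conditioned on a single (implicit) starting state:

```python
from collections import Counter

# The (action, result) pairs observed in the table above.
observations = [
    ("RIGHT", "UP"),
    ("RIGHT", "RIGHT"),
    ("RIGHT", "RIGHT"),
    ("RIGHT", "DOWN"),
    ("RIGHT", "RIGHT"),
]

outcome_counts = Counter(observations)               # counts per (action, result)
action_totals = Counter(a for a, _ in observations)  # counts per action

# Maximum-likelihood estimate: relative frequency of each outcome.
P_hat = {(a, r): n / action_totals[a] for (a, r), n in outcome_counts.items()}
print(P_hat)  # {('RIGHT', 'UP'): 0.2, ('RIGHT', 'RIGHT'): 0.6, ('RIGHT', 'DOWN'): 0.2}
```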

Page 10

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

U_{i+1}(s) \leftarrow \hat{R}(s) + \gamma \max_{a \in A(s)} \sum_{s'} \hat{P}(s' \mid s, a)\, U_i(s')

where \hat{R}(s) and \hat{P}(s' \mid s, a) are the estimates of the reward and of the transition probabilities.
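A minimal sketch of this ADP step, assuming the estimated model is stored in dictionaries R_hat and P_hat built from frequency counts like the ones above; the two-state example values are made up for illustration.

```python
def adp_value_iteration(R_hat, P_hat, actions, gamma=0.9, iterations=100):
    """Value iteration driven by the *estimated* rewards R_hat(s) and
    *estimated* transitions P_hat[(s, a)][s'] learned from experience."""
    U = {s: 0.0 for s in R_hat}
    for _ in range(iterations):
        U = {
            s: R_hat[s] + gamma * max(
                sum(p * U[s2] for s2, p in P_hat[(s, a)].items())
                for a in actions[s]
            )
            for s in R_hat
        }
    return U

# Hypothetical estimates gathered from the agent's own experience.
R_hat = {"s0": -0.04, "s1": 1.0}
actions = {"s0": ["RIGHT"], "s1": ["RIGHT"]}
P_hat = {
    ("s0", "RIGHT"): {"s0": 0.4, "s1": 0.6},   # e.g., frequency estimates
    ("s1", "RIGHT"): {"s1": 1.0},
}
print(adp_value_iteration(R_hat, P_hat, actions))
```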

Page 11

Temporal-Difference Learning

U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')

U^{\pi}(s) = R(s) + \gamma\, \mathbb{E}_{s'}[U^{\pi}(s')]

U^{\pi}(s) = \mathbb{E}_{s'}[R(s) + \gamma\, U^{\pi}(s')]

TD update:

U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \big( R(s) + \gamma U^{\pi}(s') - U^{\pi}(s) \big)

where R(s) + \gamma U^{\pi}(s') is the observed utility, U^{\pi}(s) is the current estimate of the utility, and \alpha is the learning rate parameter.

Page 12

Temporal-Difference Learning

U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha \big( R(s) + \gamma U^{\pi}(s') - U^{\pi}(s) \big)

Run this update each time we transition from state s to s'.

It converges more slowly than ADP, but the update is much simpler.

This leads to the famous Q-learning algorithm (next video).
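A minimal sketch of the TD update, assuming a tiny made-up chain of states under a fixed policy; the environment, rewards, and slip probability are hypothetical. Note that the update never uses an explicit transition model.

```python
import random

GAMMA = 0.9    # discount factor
ALPHA = 0.1    # learning rate

def td_update(U, s, r, s_next):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)),
    applied once each time we observe a transition from s to s'."""
    U[s] += ALPHA * (r + GAMMA * U[s_next] - U[s])

# Hypothetical chain: the fixed policy moves right, but slips back to s0
# from s1 twenty percent of the time. No transition model is ever stored.
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
U = {s: 0.0 for s in R}
U["goal"] = R["goal"]               # terminal state: utility is its reward

for _ in range(5000):               # many recorded episodes
    s = "s0"
    while s != "goal":
        s_next = "s1" if s == "s0" else random.choices(
            ["s0", "goal"], weights=[0.2, 0.8])[0]
        td_update(U, s, R[s], s_next)
        s = s_next
print(U)                            # sample-based estimate of U^pi
```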

Page 13

Passive Learning

• Recordings of agent running fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning