Page 1: Inverse Reinforcement Learning

CS885 Reinforcement Learning, Module 6: November 9, 2021
Pascal Poupart, University of Waterloo

Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML.

Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).

Page 2: Reinforcement Learning Problem

[Diagram: the agent sends an action to the environment; the environment returns a state and a reward.]

Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$
Goal: learn to choose actions that maximize rewards

Page 3: Imitation Learning

[Diagram: the expert sends the optimal action to the environment; the environment returns a state and a reward.]

Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$
Goal: learn to choose actions by imitating the expert's actions

Page 4: Problems

• Imitation learning: supervised learning formulation
  – Issue #1: the assumption that state-action pairs are independent and identically distributed (i.i.d.) is false:
    $(s_1, a_1^*) \to (s_2, a_2^*) \to \cdots \to (s_T, a_T^*)$
  – Issue #2: the learnt policy cannot easily be transferred to environments with different dynamics
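To make the supervised-learning formulation concrete, here is a minimal behaviour-cloning sketch. It is not from the slides: the majority-vote "classifier", the toy `demos` list, and the `cloned_policy` helper are illustrative assumptions for a small tabular setting.

```python
from collections import Counter, defaultdict

# Expert demonstrations: (state, expert_action) pairs flattened across trajectories.
demos = [(0, 1), (1, 0), (0, 1), (2, 1), (1, 0)]

# Behaviour cloning as supervised learning: for each state, predict the
# expert's action, here with a simple majority-vote "classifier".
counts = defaultdict(Counter)
for s, a_star in demos:
    counts[s][a_star] += 1

def cloned_policy(s, default_action=0):
    # i.i.d. assumption: each (s, a*) pair is treated as an independent
    # labelled example, ignoring that consecutive pairs come from the same
    # trajectory and that the learner's own mistakes shift the state
    # distribution at test time (Issue #1).
    if s in counts:
        return counts[s].most_common(1)[0][0]
    return default_action  # unseen state: no signal to fall back on

print([cloned_policy(s) for s in range(3)])
```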

Page 5: Inverse Reinforcement Learning (IRL)

RL: rewards → optimal policy $\pi^*$
IRL: optimal policy $\pi^*$ → rewards

Benefit: the reward function can easily be transferred to a new environment, where we can learn an optimal policy.

Page 6: Formal Definition

Reinforcement Learning (RL)
• States: $s \in S$
• Actions: $a \in A$
• Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
• Rewards: $r \in \mathbb{R}$
• Reward model: $\Pr(r_t \mid s_t, a_t)$
• Discount factor: $0 \le \gamma \le 1$
• Horizon (i.e., # of time steps): $h$
Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$
Goal: find the optimal policy $\pi^*$

Inverse Reinforcement Learning (IRL)
• States: $s \in S$
• Optimal actions: $a^* \in A$
• Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
• Rewards: $r \in \mathbb{R}$
• Reward model: $\Pr(r_t \mid s_t, a_t)$
• Discount factor: $0 \le \gamma \le 1$
• Horizon (i.e., # of time steps): $h$
Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$
Goal: find $\Pr(r_t \mid s_t, a_t)$ for which the expert actions $a^*$ are optimal
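For concreteness, a small sketch of the two data formats; the type aliases and example values are purely illustrative, not from the slides.

```python
from typing import List, Tuple

State, Action, Reward = int, int, float

# RL data: state-action-reward triples along a trajectory.
rl_data: List[Tuple[State, Action, Reward]] = [
    (0, 1, 0.0), (2, 0, 1.0), (3, 1, 0.0),
]

# IRL data: states paired with the expert's (assumed optimal) actions;
# no rewards are observed.
irl_data: List[Tuple[State, Action]] = [
    (0, 1), (2, 0), (3, 1),
]
```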

Page 7: IRL Applications

Advantages
• No assumption that state-action pairs are i.i.d.
• The reward function can be transferred to new environments/tasks

[Pictured application domains: autonomous driving, robotics.]

Page 8: IRL Techniques

General approach:
1. Find a reward function for which the expert actions are optimal.
2. Use the reward function to optimize a policy in the same or new environments.

[Diagram: expert trajectories $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$ → $R(s,a)$ or $\Pr(r \mid s,a)$ → $\pi^*$]

Broad categories of IRL techniques:
• Feature matching
• Maximum margin IRL
• Maximum entropy IRL
• Bayesian IRL

Page 9: Feature Expectation Matching

• Normally: find $R$ such that $\pi^*$ chooses the same actions $a^*$ as the expert.
• Problem: we may not have enough data for some states (especially continuous states) to properly estimate transitions and rewards.
• Note: rewards typically depend on features $\phi_i(s,a)$, e.g., $R(s,a) = \sum_i w_i \phi_i(s,a) = \mathbf{w}^\top \boldsymbol{\phi}(s,a)$.
• Idea: compute feature expectations and match them.

Page 10: Feature Expectation Matching

Let $\boldsymbol{\mu}_E(s_0) = \frac{1}{N} \sum_{n=1}^{N} \sum_t \gamma^t \boldsymbol{\phi}(s_t^{(n)}, a_t^{(n)})$
be the average feature count of the expert $E$ (where $n$ indexes the $N$ expert trajectories).

Let $\boldsymbol{\mu}_\pi(s_0)$ be the expected feature count of policy $\pi$.

Claim: if $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_E(s)\ \forall s$, then $V^\pi(s) = V^E(s)\ \forall s$.
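A minimal sketch of estimating the expert's average discounted feature count $\boldsymbol{\mu}_E$ from demonstrations, assuming a small discrete toy problem; the function and variable names (`expert_feature_expectation`, `demos`) are illustrative.

```python
import numpy as np

def expert_feature_expectation(trajectories, phi, gamma=0.95):
    """Average discounted feature count mu_E over the expert trajectories.

    trajectories: list of [(s, a), (s, a), ...] starting from s0
    phi: feature map phi(s, a) -> np.ndarray
    """
    total = None
    for traj in trajectories:
        disc = sum(gamma**t * phi(s, a) for t, (s, a) in enumerate(traj))
        total = disc if total is None else total + disc
    return total / len(trajectories)

# Toy example with 2-dimensional features over discrete states/actions.
phi = lambda s, a: np.array([float(s == 0), float(a == 1)])
demos = [[(0, 1), (1, 1), (0, 0)], [(0, 0), (1, 1), (1, 0)]]
mu_E = expert_feature_expectation(demos, phi, gamma=0.9)
print(mu_E)
```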

Page 11: Proof

Features: $\boldsymbol{\phi}(s,a) = (\phi_1(s,a), \phi_2(s,a), \phi_3(s,a), \ldots)^\top$
Linear reward function: $R_{\mathbf{w}}(s,a) = \sum_i w_i \phi_i(s,a) = \mathbf{w}^\top \boldsymbol{\phi}(s,a)$

Discounted state visitation frequency:
$\psi^\pi_{s_0}(s') = \delta(s', s_0) + \gamma \sum_s \psi^\pi_{s_0}(s) \Pr(s' \mid s, \pi(s))$

Value function:
$V^\pi(s) = \sum_{s'} \psi^\pi_s(s')\, R_{\mathbf{w}}(s', \pi(s'))
= \sum_{s'} \psi^\pi_s(s')\, \mathbf{w}^\top \boldsymbol{\phi}(s', \pi(s'))
= \mathbf{w}^\top \sum_{s'} \psi^\pi_s(s')\, \boldsymbol{\phi}(s', \pi(s'))
= \mathbf{w}^\top \boldsymbol{\mu}_\pi(s)$

Hence: $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_E(s)
\;\Rightarrow\; \mathbf{w}^\top \boldsymbol{\mu}_\pi(s) = \mathbf{w}^\top \boldsymbol{\mu}_E(s)
\;\Rightarrow\; V^\pi(s) = V^E(s)$
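The key step of the proof, $V^\pi(s) = \mathbf{w}^\top \boldsymbol{\mu}_\pi(s)$, can be checked numerically. The sketch below builds a random 3-state MDP (an assumption for illustration only), solves for the discounted visitation frequencies $\psi^\pi_{s_0}$, and compares $\mathbf{w}^\top \boldsymbol{\mu}_\pi(s_0)$ with the Bellman solution for $V^\pi$.

```python
import numpy as np

# Tiny MDP to check V^pi(s) = w . mu_pi(s) numerically.
gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))   # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
phi = rng.random((n_states, n_actions, 2))        # 2-dimensional features
w = np.array([1.0, -0.5])
R = phi @ w                                       # R[s, a] = w . phi(s, a)
pi = np.array([0, 1, 0])                          # deterministic policy

# Transition matrix, features and rewards along pi
P_pi = P[np.arange(n_states), pi]                 # P_pi[s, s']
phi_pi = phi[np.arange(n_states), pi]             # phi_pi[s, :]
r_pi = R[np.arange(n_states), pi]

# Discounted state visitation frequencies: psi_{s0} = e_{s0} + gamma * P_pi^T psi_{s0}
Psi = np.linalg.inv(np.eye(n_states) - gamma * P_pi.T)   # column s0 holds psi_{s0}

mu_pi = Psi.T @ phi_pi          # mu_pi[s0, :] = sum_s' psi_{s0}(s') phi(s', pi(s'))
V_from_mu = mu_pi @ w           # w . mu_pi(s0)
V_bellman = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

print(np.allclose(V_from_mu, V_bellman))   # True
```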

Page 12: Indeterminacy of Rewards

• Learning $R_{\mathbf{w}}(s,a) = \mathbf{w}^\top \boldsymbol{\phi}(s,a)$ amounts to learning $\mathbf{w}$.
• When $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_E(s)$, then $V^\pi(s) = V^E(s)$, but $\mathbf{w}$ can be anything, since
  $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_E(s) \;\Rightarrow\; \mathbf{w}^\top \boldsymbol{\mu}_\pi(s) = \mathbf{w}^\top \boldsymbol{\mu}_E(s)\ \ \forall \mathbf{w}$.
• We need a bias to determine $\mathbf{w}$.
• Ideas:
  – Maximize the margin
  – Maximize entropy

Page 13: Maximum Margin IRL

• Idea: select the reward function that yields the greatest minimum difference (margin) between the Q-values of the expert actions and those of the other actions:

$\text{margin} = \min_s \left[ Q(s, a^*) - \max_{a \ne a^*} Q(s, a) \right]$

Page 14: Maximum Margin IRL

Let $\boldsymbol{\mu}_\pi(s,a) = \boldsymbol{\phi}(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, \boldsymbol{\mu}_\pi(s')$.
Then $Q^\pi(s,a) = \mathbf{w}^\top \boldsymbol{\mu}_\pi(s,a)$.

Find the $\mathbf{w}^*$ that maximizes the margin:
$\mathbf{w}^* = \arg\max_{\mathbf{w}} \min_s \left[ \mathbf{w}^\top \boldsymbol{\mu}_\pi(s, a^*) - \max_{a \ne a^*} \mathbf{w}^\top \boldsymbol{\mu}_\pi(s, a) \right]$
s.t. $\boldsymbol{\mu}_\pi(s,a) = \boldsymbol{\mu}_E(s,a)\ \ \forall s,a$

Problem: maximizing the margin is somewhat arbitrary, since it does not allow suboptimal actions to have values that are close to optimal.
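A toy sketch of the quantities above: it computes $\boldsymbol{\mu}_\pi(s,a)$ by fixed-point iteration, forms $Q^\pi(s,a) = \mathbf{w}^\top \boldsymbol{\mu}_\pi(s,a)$, and evaluates the margin. The random MDP and the search over normalized candidate $\mathbf{w}$ (which ignores the feature-matching constraint) are illustrative simplifications, not the slides' actual optimization.

```python
import numpy as np

def state_action_feature_counts(P, phi, pi, gamma=0.9, iters=200):
    """mu_pi(s, a) = phi(s, a) + gamma * sum_s' P(s'|s,a) mu_pi(s', pi(s'))."""
    S, A, d = phi.shape
    mu = np.zeros((S, A, d))
    for _ in range(iters):                      # fixed-point iteration
        mu_pi_next = mu[np.arange(S), pi]       # mu_pi(s') = mu(s', pi(s'))
        mu = phi + gamma * np.einsum('sap,pd->sad', P, mu_pi_next)
    return mu

def margin(w, mu, expert_actions):
    """min_s [ Q(s, a*) - max_{a != a*} Q(s, a) ] with Q(s, a) = w . mu(s, a)."""
    Q = mu @ w
    gaps = []
    for s, a_star in enumerate(expert_actions):
        others = np.delete(Q[s], a_star)
        gaps.append(Q[s, a_star] - others.max())
    return min(gaps)

# Toy search: pick the normalized candidate w with the largest margin.
rng = np.random.default_rng(1)
P = rng.random((4, 2, 4)); P /= P.sum(axis=2, keepdims=True)
phi = rng.random((4, 2, 3))
expert_actions = np.array([0, 1, 1, 0])
mus = state_action_feature_counts(P, phi, expert_actions)
candidates = [v / np.linalg.norm(v) for v in rng.standard_normal((50, 3))]
w_star = max(candidates, key=lambda w: margin(w, mus, expert_actions))
```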

Page 15: Maximum Entropy

Idea: among the models that match the expert's average features, select the model with maximum entropy:

$\max_{P(\tau)} H(P(\tau))$ s.t. $\frac{1}{|data|} \sum_{\tau \in data} \boldsymbol{\phi}(\tau) = E[\boldsymbol{\phi}(\tau)]$

Trajectory: $\tau = (s_1^\tau, a_1^\tau, s_2^\tau, a_2^\tau, \ldots, s_h^\tau, a_h^\tau)$
Trajectory feature vector: $\boldsymbol{\phi}(\tau) = \sum_t \gamma^t \boldsymbol{\phi}(s_t^\tau, a_t^\tau)$
Trajectory cumulative reward: $R(\tau) = \mathbf{w}^\top \boldsymbol{\phi}(\tau) = \sum_t \gamma^t \mathbf{w}^\top \boldsymbol{\phi}(s_t^\tau, a_t^\tau)$
Probability of a trajectory: $P_{\mathbf{w}}(\tau) = \frac{e^{R(\tau)}}{\sum_{\tau'} e^{R(\tau')}} = \frac{e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau)}}{\sum_{\tau'} e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau')}}$
Entropy: $H(P(\tau)) = -\sum_\tau P(\tau) \log P(\tau)$
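A small sketch of the trajectory distribution $P_{\mathbf{w}}(\tau)$ and its entropy, assuming the set of candidate trajectories is small enough to enumerate (realistic only for toy problems); the feature map and trajectories are illustrative.

```python
import numpy as np

def traj_features(traj, phi, gamma=0.95):
    # phi(tau) = sum_t gamma^t phi(s_t, a_t)
    return sum(gamma**t * phi(s, a) for t, (s, a) in enumerate(traj))

def trajectory_distribution(trajs, phi, w, gamma=0.95):
    # P_w(tau) proportional to exp(w . phi(tau)), normalized over the listed trajectories
    returns = np.array([w @ traj_features(t, phi, gamma) for t in trajs])
    z = np.exp(returns - returns.max())        # subtract max for numerical stability
    return z / z.sum()

phi = lambda s, a: np.array([float(s == 0), float(a == 1)])
trajs = [[(0, 1), (1, 0)], [(0, 0), (1, 1)], [(1, 1), (0, 0)]]
w = np.array([0.5, 1.0])
P = trajectory_distribution(trajs, phi, w)
entropy = -(P * np.log(P)).sum()
print(P, entropy)
```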

Page 16: Maximum Likelihood

Maximum entropy:
$\max_{P(\tau)} H(P(\tau))$ s.t. $\frac{1}{|data|} \sum_{\tau \in data} \boldsymbol{\phi}(\tau) = E[\boldsymbol{\phi}(\tau)]$

Dual objective: this is equivalent to maximizing the log likelihood of the trajectories under the constraint that $P(\tau)$ takes an exponential form:

$\max_{\mathbf{w}} \sum_{\tau \in data} \log P_{\mathbf{w}}(\tau)$ s.t. $P_{\mathbf{w}}(\tau) \propto e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau)}$

Page 17: Maximum Log Likelihood (LL)

$\mathbf{w}^* = \arg\max_{\mathbf{w}} \frac{1}{|data|} \sum_{\tau \in data} \log P_{\mathbf{w}}(\tau)$
$= \arg\max_{\mathbf{w}} \frac{1}{|data|} \sum_{\tau \in data} \log \frac{e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau)}}{\sum_{\tau'} e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau')}}$
$= \arg\max_{\mathbf{w}} \frac{1}{|data|} \sum_{\tau \in data} \mathbf{w}^\top \boldsymbol{\phi}(\tau) - \log \sum_{\tau'} e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau')}$

Gradient:
$\nabla_{\mathbf{w}} LL = \frac{1}{|data|} \sum_{\tau \in data} \boldsymbol{\phi}(\tau) - \sum_{\tau''} \frac{e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau'')}}{\sum_{\tau'} e^{\mathbf{w}^\top \boldsymbol{\phi}(\tau')}} \boldsymbol{\phi}(\tau'')$
$= \frac{1}{|data|} \sum_{\tau \in data} \boldsymbol{\phi}(\tau) - \sum_{\tau''} P_{\mathbf{w}}(\tau'')\, \boldsymbol{\phi}(\tau'')$
$= E_{data}[\boldsymbol{\phi}(\tau)] - E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)]$
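The resulting gradient $E_{data}[\boldsymbol{\phi}(\tau)] - E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)]$ can be computed exactly when all trajectories can be enumerated, as in this sketch (the function name and toy inputs are illustrative, not from the slides):

```python
import numpy as np

def log_likelihood_gradient(expert_trajs, all_trajs, phi, w, gamma=0.95):
    """grad LL = E_data[phi(tau)] - E_w[phi(tau)], with E_w computed by enumeration."""
    feats = lambda t: sum(gamma**k * phi(s, a) for k, (s, a) in enumerate(t))
    # Empirical expectation over the demonstrations
    e_data = np.mean([feats(t) for t in expert_trajs], axis=0)
    # Model expectation under P_w(tau) proportional to exp(w . phi(tau))
    F = np.array([feats(t) for t in all_trajs])
    logits = F @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    e_model = p @ F
    return e_data - e_model
```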

Page 18: Gradient Estimation

Computing $E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)]$ exactly is intractable due to the exponential number of trajectories. Instead, approximate it by sampling:

$E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)] \approx \frac{1}{n} \sum_{\tau \sim P_{\mathbf{w}}(\tau)} \boldsymbol{\phi}(\tau)$

Importance sampling: since we do not have a simple way of sampling $\tau$ from $P_{\mathbf{w}}(\tau)$, sample $\tau$ from a base distribution $q(\tau)$ and then reweight each $\tau$ by $P_{\mathbf{w}}(\tau)/q(\tau)$:

$E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)] \approx \frac{1}{n} \sum_{\tau \sim q(\tau)} \frac{P_{\mathbf{w}}(\tau)}{q(\tau)} \boldsymbol{\phi}(\tau)$

We can choose $q(\tau)$ to be a) uniform, b) close to the demonstration distribution, or c) close to $P_{\mathbf{w}}(\tau)$.
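A sketch of the importance-sampled estimate of $E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)]$. One caveat: the slides weight each sample by $P_{\mathbf{w}}(\tau)/q(\tau)$, but the partition function of $P_{\mathbf{w}}$ is itself intractable, so this sketch self-normalizes the weights, a common practical substitute; `q_prob` is an assumed caller-supplied density.

```python
import numpy as np

def model_feature_expectation_is(sample_trajs, phi, w, q_prob, gamma=0.95):
    """Importance-sampled estimate of E_w[phi(tau)] using samples tau ~ q."""
    feats = lambda t: sum(gamma**k * phi(s, a) for k, (s, a) in enumerate(t))
    F = np.array([feats(t) for t in sample_trajs])
    # Unnormalized model weights exp(w . phi(tau)); the partition function
    # cancels once the importance weights are self-normalized.
    log_p_tilde = F @ w
    log_q = np.log(np.array([q_prob(t) for t in sample_trajs]))
    log_ratio = log_p_tilde - log_q
    ratio = np.exp(log_ratio - log_ratio.max())
    weights = ratio / ratio.sum()               # self-normalized importance weights
    return weights @ F
```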

Page 19: Maximum Entropy IRL Pseudocode

Assumption: linear rewards $R_{\mathbf{w}}(s,a) = \mathbf{w}^\top \boldsymbol{\phi}(s,a)$

Input: expert trajectories $\tau_e \sim \pi_{expert}$, where $\tau_e = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $\mathbf{w}$ at random
Repeat until stopping criterion:
    Expert feature expectation: $E_{\pi_{expert}}[\boldsymbol{\phi}(\tau)] = \frac{1}{|data|} \sum_{\tau_e \in data} \boldsymbol{\phi}(\tau_e)$
    Model feature expectation:
        Sample $n$ trajectories $\tau \sim q(\tau)$
        $E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)] = \frac{1}{n} \sum_{\tau} \frac{P_{\mathbf{w}}(\tau)}{q(\tau)} \boldsymbol{\phi}(\tau)$
    Gradient: $\nabla_{\mathbf{w}} LL = E_{\pi_{expert}}[\boldsymbol{\phi}(\tau)] - E_{\mathbf{w}}[\boldsymbol{\phi}(\tau)]$
    Update model: $\mathbf{w} \leftarrow \mathbf{w} + \alpha \nabla_{\mathbf{w}} LL$
Return $\mathbf{w}$
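A compact Python rendering of the pseudocode above for linear rewards. The sampler `sample_trajs_fn`, the base density `q_prob`, and the self-normalized importance weights are assumptions layered on top of the slide's outline, not part of it.

```python
import numpy as np

def maxent_irl(expert_trajs, sample_trajs_fn, q_prob, phi, d, gamma=0.95,
               lr=0.05, iters=200, n_samples=64):
    """Sketch of MaxEnt IRL with linear reward R_w(s, a) = w . phi(s, a)."""
    feats = lambda t: sum(gamma**k * phi(s, a) for k, (s, a) in enumerate(t))
    e_expert = np.mean([feats(t) for t in expert_trajs], axis=0)
    w = np.random.default_rng(0).standard_normal(d) * 0.01
    for _ in range(iters):
        trajs = sample_trajs_fn(n_samples)              # tau ~ q(tau)
        F = np.array([feats(t) for t in trajs])
        # Self-normalized importance weights standing in for P_w(tau)/q(tau)
        log_ratio = F @ w - np.log([q_prob(t) for t in trajs])
        wts = np.exp(log_ratio - log_ratio.max())
        wts /= wts.sum()
        e_model = wts @ F
        grad = e_expert - e_model                       # gradient of the log likelihood
        w += lr * grad                                  # gradient ascent step
    return w
```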

Page 20: Non-Linear Rewards

Suppose rewards are non-linear in $\mathbf{w}$, e.g., $R_{\mathbf{w}}(s,a) = \mathrm{neuralNet}_{\mathbf{w}}(s,a)$.
Then $R_{\mathbf{w}}(\tau) = \sum_t \gamma^t R_{\mathbf{w}}(s_t^\tau, a_t^\tau)$.

Likelihood: $LL(\mathbf{w}) = \frac{1}{|data|} \sum_{\tau \in data} R_{\mathbf{w}}(\tau) - \log \sum_{\tau'} e^{R_{\mathbf{w}}(\tau')}$

Gradient: $\nabla_{\mathbf{w}} LL = E_{data}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)] - E_{\mathbf{w}}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)]$
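A sketch of a non-linear reward $R_{\mathbf{w}}(s,a)$ and the trajectory return $R_{\mathbf{w}}(\tau)$, assuming PyTorch is available and that states and actions are given as feature vectors; the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Non-linear reward R_w(s, a) = neuralNet_w(s, a) over concatenated (s, a) features."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def traj_return(reward_net, states, actions, gamma=0.95):
    # R_w(tau) = sum_t gamma^t R_w(s_t, a_t)
    r = reward_net(states, actions)                      # shape [T]
    discounts = gamma ** torch.arange(len(r), dtype=r.dtype)
    return (discounts * r).sum()
```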

Page 21: Maximum Entropy IRL Pseudocode

General case: non-linear rewards $R_{\mathbf{w}}(s,a)$

Input: expert trajectories $\tau_e \sim \pi_{expert}$, where $\tau_e = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $\mathbf{w}$ at random
Repeat until stopping criterion:
    Expert gradient expectation: $E_{\pi_{expert}}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)] = \frac{1}{|data|} \sum_{\tau_e \in data} \nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau_e)$
    Model gradient expectation:
        Sample $n$ trajectories $\tau \sim q(\tau)$
        $E_{\mathbf{w}}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)] = \frac{1}{n} \sum_{\tau} \frac{P_{\mathbf{w}}(\tau)}{q(\tau)} \nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)$
    Gradient: $\nabla_{\mathbf{w}} LL = E_{\pi_{expert}}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)] - E_{\mathbf{w}}[\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)]$
    Update model: $\mathbf{w} \leftarrow \mathbf{w} + \alpha \nabla_{\mathbf{w}} LL$
Return $\mathbf{w}$
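One gradient update of the pseudocode above for a neural reward, again assuming PyTorch. Autograd supplies the $\nabla_{\mathbf{w}} R_{\mathbf{w}}(\tau)$ terms, and the sampled trajectories enter through a self-normalized `logsumexp`, as in the linear sketch; the batch formats and the `log_q` argument are assumptions, not part of the slides.

```python
import torch

def traj_return(reward_net, states, actions, gamma=0.95):
    # R_w(tau) = sum_t gamma^t R_w(s_t, a_t)
    r = reward_net(states, actions)
    return (gamma ** torch.arange(len(r), dtype=r.dtype) * r).sum()

def maxent_irl_step(reward_net, optimizer, expert_batch, sampled_batch, log_q, gamma=0.95):
    """One update for a neural reward.

    expert_batch / sampled_batch: lists of (states, actions) tensor pairs, one per trajectory.
    log_q: tensor of log q(tau) for each sampled trajectory.
    """
    expert_ret = torch.stack([traj_return(reward_net, s, a, gamma) for s, a in expert_batch])
    sample_ret = torch.stack([traj_return(reward_net, s, a, gamma) for s, a in sampled_batch])
    # Self-normalized importance weights approximating P_w(tau)/q(tau)
    log_w = sample_ret - log_q
    # Negative log likelihood: -E_data[R_w(tau)] + log sum_tau' exp(R_w(tau'))
    loss = -expert_ret.mean() + torch.logsumexp(log_w, dim=0)
    optimizer.zero_grad()
    loss.backward()           # autograd provides the grad_w R_w(tau) terms
    optimizer.step()
    return loss.item()
```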

Page 22: Policy Computation

Two choices:
1) Optimize a policy based on $R_{\mathbf{w}}(s,a)$ with your favorite RL algorithm.
2) Compute the policy induced by $P_{\mathbf{w}}(\tau)$.

Induced policy: the probability of choosing $a$ after $s$ in trajectories. Let $(s, a, \tau)$ denote a trajectory that starts with $s, a$ and then continues with the state-action pairs of $\tau$:

$\pi_{\mathbf{w}}(a \mid s) = P_{\mathbf{w}}(a \mid s)
= \frac{\sum_{\tau} P_{\mathbf{w}}(s, a, \tau)}{\sum_{a', \tau'} P_{\mathbf{w}}(s, a', \tau')}
= \frac{\sum_{\tau} e^{R_{\mathbf{w}}(s,a) + \gamma R_{\mathbf{w}}(\tau)}}{\sum_{a', \tau'} e^{R_{\mathbf{w}}(s,a') + \gamma R_{\mathbf{w}}(\tau')}}$
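A sketch of the induced policy $\pi_{\mathbf{w}}(a \mid s)$, assuming a small set of candidate continuation trajectories can be enumerated; `R_sa` and `R_traj` are illustrative placeholders for the learnt reward, not names from the slides.

```python
import numpy as np

def induced_policy(s, actions, continuations, R_sa, R_traj, gamma=0.95):
    """pi_w(a|s) from the maximum-entropy trajectory distribution.

    actions: candidate actions a
    continuations: candidate continuation trajectories tau
    R_sa(s, a): one-step reward; R_traj(tau): cumulative reward of the continuation
    """
    # Unnormalized weight of each (a, tau): exp(R_w(s, a) + gamma * R_w(tau))
    logits = np.array([[R_sa(s, a) + gamma * R_traj(tau) for tau in continuations]
                       for a in actions])
    z = np.exp(logits - logits.max())
    per_action = z.sum(axis=1)            # sum over continuations tau
    return per_action / per_action.sum()  # normalize over actions a
```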

Page 23: Demo: Maximum Entropy IRL

Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).