The Mathematical Foundations of Policy Gradient Methods
Sham M. Kakade, University of Washington & Microsoft Research

Jul 13, 2020

Transcript
Page 1:

The Mathematical Foundations of Policy Gradient Methods

Sham M. Kakade
University of Washington & Microsoft Research

Page 2:

Reinforcement (interactive) learning (RL):

Page 3:

Setting: Markov decision processes

S states; start with $s_0 \sim d_0$.

A actions; dynamics model $P(s'|s, a)$; reward function $r(s)$; discount factor $\gamma$.

Sutton, Barto ’18

Stochastic policy $\pi: s_t \to a_t$

Standard objective: find $\pi$ which maximizes:

$V^{\pi}(s_0) = \mathbb{E}\big[r(s_0) + \gamma r(s_1) + \gamma^2 r(s_2) + \cdots\big]$

where the distribution of $s_t, a_t$ is induced by $\pi$.


Markov Decision Processes: a framework for RL

• A policy: $\pi$: States → Actions

• We execute $\pi$ to obtain a trajectory: $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$

• Total $\gamma$-discounted reward:

$V^{\pi}(s_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0, \pi\Big]$

Goal: Find a policy that maximizes our value, $V^{\pi}(s_0)$.

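To make the discounted-return objective concrete, here is a minimal Monte Carlo sketch (not from the slides): it estimates $V^{\pi}(s_0)$ by averaging $\sum_t \gamma^t r_t$ over sampled trajectories. The function `sample_trajectory` is a hypothetical stand-in for executing $\pi$ in an environment.

```python
# Minimal sketch: estimate V^pi(s_0) = E[sum_t gamma^t r_t] by averaging sampled returns.
# `sample_trajectory` is a hypothetical stand-in for "execute pi in the MDP and record rewards".
import random

def sample_trajectory(horizon=200):
    # Placeholder environment: returns a list of rewards r_0, r_1, ..., r_{horizon-1}.
    return [random.random() for _ in range(horizon)]

def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t r_t, truncated at the trajectory length.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def monte_carlo_value(num_trajectories=1000, gamma=0.99):
    # Average discounted returns over many sampled trajectories.
    returns = [discounted_return(sample_trajectory(), gamma) for _ in range(num_trajectories)]
    return sum(returns) / len(returns)

print(monte_carlo_value())
```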

Page 4:

Challenges in RL

1. Exploration (the environment may be unknown)

2. Credit assignment problem (due to delayed rewards)

3. Large state/action spaces: hand state: joint angles/velocities; cube state: configuration; actions: forces applied to actuators

Dexterous Robotic Hand Manipulation, OpenAI, Oct 15, 2019

Page 5:

Values, State-Action Values, and Advantages

• Expectation with respect to sampled trajectories under $\pi$
• Have $S$ states and $A$ actions.
• Effective “horizon” is $1/(1-\gamma)$ time steps.

$V^{\pi}(s_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0, \pi\Big]$

$Q^{\pi}(s_0, a_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0, a_0, \pi\Big]$

$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$

(handwritten annotation: $Q^{\pi}(s_0, a_0)$ means start at $s_0$, take $a_0$, then follow $\pi$.)
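As a concrete companion to these definitions, here is a minimal tabular sketch (my own, not from the slides) that computes $V^{\pi}$, $Q^{\pi}$, and $A^{\pi}$ exactly for a tiny random MDP with known dynamics; the states, actions, and rewards are synthetic placeholders, not the robotic-hand example.

```python
# Exact V^pi, Q^pi, A^pi for a tiny random MDP:
#   V^pi = (I - gamma P_pi)^{-1} r_pi
#   Q^pi(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V^pi(s')
#   A^pi = Q^pi - V^pi
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = P(s'|s, a)
r = rng.random((S, A))                       # r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a] = pi(a|s)

P_pi = np.einsum("sa,sap->sp", pi, P)        # state-to-state transitions under pi
r_pi = np.einsum("sa,sa->s", pi, r)          # expected one-step reward under pi

V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi(s) for every state
Q = r + gamma * np.einsum("sap,p->sa", P, V)          # Q^pi(s, a) for every entry
Adv = Q - V[:, None]                                  # A^pi(s, a)
print(V)
print(Adv)
```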

Page 6:

The “Tabular” Dynamic Programming approach

• Table: ‘bookkeeping’ for dynamic programming (with known rewards/dynamics)

1. Estimate the state-action value $Q^{\pi}(s, a)$ for every entry in the table.

2. Update the policy $\pi$ & go to step 1.

• Generalization: how can we deal with this infinite table, using sampling/supervised learning?

State $s$ (joint angles, …, cube config, …) | Action $a$ (forces at joints) | $Q^{\pi}(s, a)$: state-action value, the “one step look-ahead value” using $\pi$
(31°, 12°, …, 8134, …) | (1.2 Newton, 0.1 Newton, …) | 8 units of reward
⋮ | ⋮ | ⋮
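A minimal sketch of this loop for the tabular case (assuming the same small random MDP `P`, `r` as in the earlier sketch, with known rewards/dynamics): evaluate $Q^{\pi}$ for every $(s,a)$, act greedily, and repeat.

```python
# Tabular policy iteration: evaluate Q^pi exactly, then improve greedily.
import numpy as np

def policy_iteration(P, r, gamma=0.9, iters=50):
    S, A, _ = P.shape
    pi = np.full((S, A), 1.0 / A)                             # start from the uniform policy
    for _ in range(iters):
        P_pi = np.einsum("sa,sap->sp", pi, P)
        r_pi = np.einsum("sa,sa->s", pi, r)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # policy evaluation
        Q = r + gamma * np.einsum("sap,p->sa", P, V)          # state-action values
        greedy = np.eye(A)[Q.argmax(axis=1)]                  # policy improvement
        if np.allclose(greedy, pi):
            break                                             # greedy w.r.t. its own Q: done
        pi = greedy
    return pi, Q

# Usage (with P, r from the previous sketch): pi_star, Q_star = policy_iteration(P, r)
```

The generalization question on the slide is exactly what breaks here: this table has one row per state, which is infeasible for the robotic-hand example.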

Page 7:

§ Part I: Basics
A. Derivation and Estimation
B. Preconditioning and the Natural Policy Gradient

§ Part II: Convergence and Approximation
A. Convergence: This is a non-convex problem!
B. Approximation: How to think about the role of deep learning?

This Tutorial: Mathematical Foundations of Policy Gradient Methods


Page 8:

Part-1: Basics

Page 9:

State-Action Visitation Measures!

• This helps to clean up notation!

• “Occupancy frequency” of being in state $s$ and action $a$, after following $\pi$ starting in $s_0$:

$d^{\pi}_{s_0}(s) = (1-\gamma)\, \sum_{t=0}^{\infty} \gamma^t\, \Pr(s_t = s \mid s_0, \pi)$

• $d^{\pi}_{s_0}$ is a probability distribution

• With this notation:

$V^{\pi}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi}_{s_0},\, a \sim \pi(\cdot|s)}\big[r(s, a)\big]$

(handwritten annotation: $d^{\pi}_{s_0}(s)$ is the discounted chance of visiting $s$.)
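For the tabular setup in the earlier sketches, the visitation measure has a closed form, $d^{\pi}_{s_0} = (1-\gamma)\, e_{s_0}^{\top}(I - \gamma P_{\pi})^{-1}$. A minimal sketch (mine, not the slides'):

```python
# Discounted state-visitation distribution d^pi_{s0}(s) = (1-gamma) sum_t gamma^t Pr(s_t = s | s0, pi).
import numpy as np

def visitation_distribution(P_pi, s0, gamma=0.9):
    S = P_pi.shape[0]
    e = np.zeros(S)
    e[s0] = 1.0
    # Solve (I - gamma P_pi^T) d = (1 - gamma) e_{s0}; d sums to 1, i.e. a distribution over states.
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e)

# Consistency check with the identity on this slide, using P_pi, pi, r, V from the earlier sketch:
#   V[s0] == (1 / (1 - gamma)) * visitation_distribution(P_pi, s0) @ (pi * r).sum(axis=1)
```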

Page 10:

Direct Policy Optimization over Stochastic Policies

• $\pi_\theta(a|s)$ is the probability of action $a$ given $s$, parameterized by $\theta$:

$\pi_\theta(a|s) \propto \exp(f_\theta(s, a))$

• Softmax policy class: $f_\theta(s, a) = \theta_{s,a}$
• Linear policy class: $f_\theta(s, a) = \theta \cdot \phi(s, a)$, where $\phi(s, a) \in \mathbb{R}^d$
• Neural policy class: $f_\theta(s, a)$ is a neural network

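A minimal sketch of the three parameterizations above, assuming a discrete action set; `phi` is a hypothetical feature map for the linear class, and any score network could play the role of $f_\theta$ in the neural class.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Softmax (tabular) class: one parameter theta[s, a] per state-action pair.
def pi_tabular(theta, s):
    return softmax(theta[s])           # pi_theta(.|s) proportional to exp(theta[s, a])

# Linear class: f_theta(s, a) = theta . phi(s, a), with phi(s, a) in R^d.
def pi_linear(theta, s, phi, actions):
    scores = np.array([theta @ phi(s, a) for a in actions])
    return softmax(scores)

# Neural class: f_theta(s, a) is a neural network; plug any score function into softmax the same way.
```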

Page 11:

In practice, policy gradient methods rule…

• Why do we like them?

• They easily deal with large state/action spaces (through the neural net parameterization)

• We can estimate the gradient using only simulation of our current policy $\pi_\theta$ (the expectation is under the states/actions visited under $\pi_\theta$)

• They directly optimize the cost function of interest!

They are the most effective methods for obtaining state-of-the-art results.

$\theta \leftarrow \theta + \eta\, \nabla V^{\pi_\theta}(s_0)$

Page 12:

Two (equal) expressions for the policy gradient!

(some shorthand notation in the expressions below)

• Where do these expressions come from?

• How do we compute this?

$\nabla V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[Q^{\pi_\theta}(s, a)\, \nabla \log \pi_\theta(a|s)\big]$

$\nabla V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[A^{\pi_\theta}(s, a)\, \nabla \log \pi_\theta(a|s)\big]$

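A short standard check (not spelled out on the slide) of why the two expressions agree: the baseline $V^{\pi_\theta}(s)$ can be subtracted because the score function has zero mean under $\pi_\theta$.

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[\nabla \log \pi_\theta(a|s)\big]
  &= \sum_a \pi_\theta(a|s)\,\frac{\nabla \pi_\theta(a|s)}{\pi_\theta(a|s)}
   = \nabla \sum_a \pi_\theta(a|s) = \nabla 1 = 0, \\[4pt]
\mathbb{E}\big[A^{\pi_\theta}(s,a)\,\nabla \log \pi_\theta(a|s)\big]
  &= \mathbb{E}\big[Q^{\pi_\theta}(s,a)\,\nabla \log \pi_\theta(a|s)\big]
   - \mathbb{E}_{s}\Big[V^{\pi_\theta}(s)\;\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[\nabla \log \pi_\theta(a|s)\big]\Big] \\
  &= \mathbb{E}\big[Q^{\pi_\theta}(s,a)\,\nabla \log \pi_\theta(a|s)\big].
\end{align*}
```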

Page 13:

Example: an important special case!

• Remember the softmax policy class (a “tabular” parameterization):

$\pi_\theta(a|s) \propto \exp(\theta_{s,a})$

• Complete class with $S \cdot A$ params: one parameter per state-action pair, so it contains the optimal policy.

• Expression for softmax class:

$\dfrac{\partial V^{\pi_\theta}(s_0)}{\partial \theta_{s,a}} = d^{\pi_\theta}(s)\, \pi_\theta(a|s)\, A^{\pi_\theta}(s, a)$

• Intuition: increase $\theta_{s,a}$ if the ‘weighted’ advantage is large.

Page 14:

Part-1A: Derivations and Estimation


Page 15:

General Derivation

\begin{align*}
\nabla V^{\pi_\theta}(s_0)
&= \nabla \sum_{a_0} \pi_\theta(a_0|s_0)\, Q^{\pi_\theta}(s_0, a_0) \\
&= \sum_{a_0} \big(\nabla \pi_\theta(a_0|s_0)\big)\, Q^{\pi_\theta}(s_0, a_0)
 + \sum_{a_0} \pi_\theta(a_0|s_0)\, \nabla Q^{\pi_\theta}(s_0, a_0) \\
&= \sum_{a_0} \pi_\theta(a_0|s_0)\big(\nabla \log \pi_\theta(a_0|s_0)\big)\, Q^{\pi_\theta}(s_0, a_0) \\
&\qquad + \sum_{a_0} \pi_\theta(a_0|s_0)\, \nabla \Big( r(s_0, a_0) + \gamma \sum_{s_1} P(s_1|s_0, a_0)\, V^{\pi_\theta}(s_1) \Big) \\
&= \sum_{a_0} \pi_\theta(a_0|s_0)\big(\nabla \log \pi_\theta(a_0|s_0)\big)\, Q^{\pi_\theta}(s_0, a_0)
 + \gamma \sum_{a_0, s_1} \pi_\theta(a_0|s_0)\, P(s_1|s_0, a_0)\, \nabla V^{\pi_\theta}(s_1) \\
&= \mathbb{E}\big[ Q^{\pi_\theta}(s_0, a_0)\, \nabla \log \pi_\theta(a_0|s_0) \big]
 + \gamma\, \mathbb{E}\big[ \nabla V^{\pi_\theta}(s_1) \big].
\end{align*}

(handwritten annotations: product rule; chain rule when taking the action; then unroll the recursion.)
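Unrolling the recursion in the last line (a standard step the slide leaves implicit, using the visitation measure $d^{\pi_\theta}_{s_0}$ from Page 9) gives the expression stated on Page 12:

```latex
\begin{align*}
\nabla V^{\pi_\theta}(s_0)
 &= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\big[Q^{\pi_\theta}(s_t, a_t)\,\nabla \log \pi_\theta(a_t|s_t)\big] \\
 &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_\theta}_{s_0},\, a \sim \pi_\theta(\cdot|s)}
    \big[Q^{\pi_\theta}(s, a)\,\nabla \log \pi_\theta(a|s)\big].
\end{align*}
```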

Page 16:

SL vs RL: How do we obtain gradients?

• In supervised learning, how do we compute the gradient of our loss, $\nabla L(\theta)$?

$\theta \leftarrow \theta + \eta\, \nabla L(\theta)$

• Hint: can we compute our loss?

• In reinforcement learning, how do we compute the policy gradient $\nabla V^{\pi_\theta}(s_0)$?

$\theta \leftarrow \theta + \eta\, \nabla V^{\pi_\theta}(s_0)$

$\nabla V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s, a}\big[Q^{\pi_\theta}(s, a)\, \nabla \log \pi_\theta(a|s)\big]$

(handwritten annotation: in RL, even computing the objective $V^{\pi_\theta}(s_0)$ is tricky.)

Page 17:

Monte Carlo Estimation

• Sample a trajectory: execute $\pi_\theta$ and observe $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$

• Lemma [Glynn ’90, Williams ’92]: This gives an unbiased estimate of the gradient: $\mathbb{E}\big[\widehat{\nabla V^{\theta}}\big] = \nabla V^{\theta}(s_0)$. This is the “likelihood ratio” method.

$\widehat{Q}(s_t, a_t) = \sum_{t'=0}^{\infty} \gamma^{t'}\, r(s_{t'+t}, a_{t'+t})$

$\widehat{\nabla V^{\theta}} = \sum_{t=0}^{\infty} \gamma^{t}\, \widehat{Q}(s_t, a_t)\, \nabla \log \pi_\theta(a_t|s_t)$

(handwritten annotations: truncation; exercise.)
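A minimal sketch of this likelihood-ratio estimator for the tabular softmax class, with the infinite sums truncated at a finite horizon as the annotation suggests; `env_reset` and `env_step` are hypothetical environment callbacks, not part of the slides.

```python
import numpy as np

def grad_log_pi_tabular(theta, s, a):
    # For the softmax class: d/d theta[s, a'] log pi_theta(a|s) = 1{a'=a} - pi_theta(a'|s).
    g = np.zeros_like(theta)
    p = np.exp(theta[s] - theta[s].max())
    p /= p.sum()
    g[s] = -p
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, env_reset, env_step, gamma=0.99, horizon=200):
    # Sample one trajectory under pi_theta.
    s = env_reset()
    states, actions, rewards = [], [], []
    for _ in range(horizon):                                  # truncate the infinite sums
        p = np.exp(theta[s] - theta[s].max()); p /= p.sum()
        a = np.random.choice(len(p), p=p)
        s_next, reward = env_step(s, a)
        states.append(s); actions.append(a); rewards.append(reward)
        s = s_next
    # hat{grad V} = sum_t gamma^t * hat{Q}(s_t, a_t) * grad log pi_theta(a_t | s_t)
    grad = np.zeros_like(theta)
    for t in range(horizon):
        q_hat = sum((gamma ** k) * rewards[t + k] for k in range(horizon - t))   # reward-to-go
        grad += (gamma ** t) * q_hat * grad_log_pi_tabular(theta, states[t], actions[t])
    return grad
```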

Page 18:

Back to the softmax policy class…

$\pi_\theta(a|s) \propto \exp(\theta_{s,a})$

• Expression for softmax class:

$\dfrac{\partial V^{\pi_\theta}(s_0)}{\partial \theta_{s,a}} = d^{\pi_\theta}(s)\, \pi_\theta(a|s)\, A^{\pi_\theta}(s, a)$

• What might be making gradient estimation difficult here? (hint: when does gradient descent “effectively” stop?)

(handwritten annotations: gradient descent effectively stops when the gradient is small; the $(s,a)$ component of the gradient is small whenever $\pi_\theta(a|s)$ is small, even if $A^{\pi_\theta}(s,a) \gg 0$.)

Page 19:

Part-1B: Preconditioning and the Natural Policy Gradient

Page 20:

A closer look at Natural Policy Gradient (NPG)

• Practice: (almost) all methods are gradient based, usually variants of: Natural Policy Gradient [K. ’01]; TRPO [Schulman ’15]; PPO [Schulman ’17]

• NPG warps the distance metric to stretch out the corners (using the Fisher information metric) and move ‘more’ near the boundaries. The update is:

$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla \log \pi_\theta(a|s)\, \nabla \log \pi_\theta(a|s)^{\top}\big]$

$\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\pi_\theta}(s_0)$

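A minimal sketch of one NPG step for the tabular softmax class (my own, reusing `grad_log_pi_tabular` from the Monte Carlo sketch above): estimate $F(\theta)$ from sampled $(s,a)$ pairs and precondition the policy gradient. The damping term is a standard numerical safeguard, not something the slide specifies.

```python
import numpy as np

def natural_gradient_step(theta, samples, policy_grad, eta=0.1, damping=1e-3):
    # samples: (s, a) pairs drawn from d^pi and pi; policy_grad: an estimate of grad V^pi(s0).
    d = theta.size
    F = np.zeros((d, d))
    for s, a in samples:
        g = grad_log_pi_tabular(theta, s, a).ravel()
        F += np.outer(g, g)                      # estimate of E[grad log pi  grad log pi^T]
    F /= max(len(samples), 1)
    # theta <- theta + eta * F(theta)^{-1} grad V  (damping keeps the solve well conditioned)
    step = np.linalg.solve(F + damping * np.eye(d), policy_grad.ravel())
    return theta + eta * step.reshape(theta.shape)
```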

Page 21:

TRPO (Trust Region Policy Optimization)

• TRPO [Schulman ’15] (related: PPO [Schulman ’17]): move while staying “close” in KL to the previous policy:

$\theta^{t+1} = \arg\max_{\theta}\; V^{\pi_\theta}(s_0) \quad \text{s.t.}\quad \mathbb{E}_{s \sim d^{\pi_{\theta^t}}}\big[\mathrm{KL}\big(\pi_\theta(\cdot|s)\,\|\,\pi_{\theta^t}(\cdot|s)\big)\big] \le \delta$

• NPG = TRPO: they are first-order equivalent (and have the same practical behavior)


Page 22:

NPG intuition. But first…

• NPG as preconditioning:

$\theta \leftarrow \theta + \eta\, F(\theta)^{-1}\, \nabla V^{\pi_\theta}(s_0)$

or

$\theta \leftarrow \theta + \frac{\eta}{1-\gamma}\, \mathbb{E}\big[\nabla \log \pi_\theta(a|s)\, \nabla \log \pi_\theta(a|s)^{\top}\big]^{-1}\, \mathbb{E}\big[\nabla \log \pi_\theta(a|s)\, A^{\pi_\theta}(s, a)\big]$

• What does the following problem remind you of?

$\mathbb{E}\big[x x^{\top}\big]^{-1}\, \mathbb{E}\big[x y\big]$

• What is NPG trying to approximate?

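The answer being hinted at (made explicit on the next slide): $\mathbb{E}[xx^{\top}]^{-1}\mathbb{E}[xy]$ is the ordinary least-squares solution, so with $x = \nabla \log \pi_\theta(a|s)$ and $y = A^{\pi_\theta}(s,a)$ the NPG direction is the best linear fit of the advantage in these “policy-space” features:

```latex
\[
  w^{\star} \;=\; \arg\min_{w}\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
  \Big[\big(w \cdot \nabla \log \pi_\theta(a|s) - A^{\pi_\theta}(s,a)\big)^2\Big]
  \;=\; \mathbb{E}\big[x x^{\top}\big]^{-1}\, \mathbb{E}\big[x\, y\big].
\]
```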

Page 23:

Equivalent Update Rule (for the softmax)

• Take the best linear fit of $A^{\pi_\theta}$ in the “policy-space” features: this gives

$w^{\star}_{s,a} = A^{\pi_\theta}(s, a)$

• Using the NPG update rule:

$\theta_{s,a} \leftarrow \theta_{s,a} + \frac{\eta}{1-\gamma}\, A^{\pi_\theta}(s, a)$

• And so an equivalent update rule to NPG is:

$\pi_\theta(a|s) \leftarrow \pi_\theta(a|s)\, \exp\!\Big(\frac{\eta}{1-\gamma}\, A^{\pi_\theta}(s, a)\Big) \Big/ Z_s$

• What algorithm does this remind you of?

Questions: convergence? General case/approximation?

(handwritten annotations: exercise; this is a “soft” policy iteration; as $\eta \to \infty$, the next policy is the policy iteration update.)
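A minimal sketch of this multiplicative update for the tabular case (mine, not the slides'; `pi` and `Adv` are arrays like those in the earlier tabular sketches). As $\eta \to \infty$ the exponential weights concentrate on $\arg\max_a A^{\pi}(s,a)$, recovering the policy iteration update noted in the annotation.

```python
import numpy as np

def soft_policy_iteration_step(pi, Adv, eta=0.5, gamma=0.9):
    # pi: (S, A) action probabilities; Adv: (S, A) advantages A^pi(s, a).
    new_pi = pi * np.exp((eta / (1 - gamma)) * Adv)       # pi(a|s) * exp(eta/(1-gamma) * A^pi(s,a))
    return new_pi / new_pi.sum(axis=1, keepdims=True)     # divide by Z_s to renormalize per state
```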

Page 24:

But does gradient descent even work in RL??

(figure: optimization landscapes under Supervised Learning vs. Reinforcement Learning)

What about approximation?

Stay tuned!!

Page 25:

Part-2: Convergence and Approximation

Page 26:

The Optimization Landscape

Supervised Learning:
• Gradient descent tends to ‘just work’ in practice and is not sensitive to initialization
• Saddle points are not a problem…

Reinforcement Learning:
• Local search depends on initialization in many real problems, due to “very” flat regions.

• Gradients can be exponentially small in the “horizon”

Page 27:

RL and the vanishing gradient problem

Reinforcement Learning:
• The random init. has “very” flat regions in real problems (lack of ‘exploration’)
• Lemma [Agarwal, Lee, K., Mahajan 2019]: With random init, all $k$-th higher-order gradients are $2^{-\Omega(H)}$ in magnitude for up to $k < H/\ln H$ orders, where $H = 1/(1-\gamma)$.
• This is a landscape/optimization issue (also a statistical issue if we used random init).

Prior work: The Explore/Exploit Tradeoff

Thrun ’92

Random search does not find the reward quickly.

(theory) Balancing the explore/exploit tradeoff: [Kearns & Singh ’02] E3 is a near-optimal algorithm. Sample complexity: [K. ’03; Azar ’17]. Model free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvari ’10; Lattimore et al. ’14; Jin et al. ’18].


(figure: a chain MDP with states $s_1, \ldots, s_n$ and actions $a_1, a_2, a_3$.)

Page 28:

§ A: Convergence • Let’s look at the tabular/“softmax” case

§ B: Approximation • “linear” policies and neural nets

Part 2: Understanding the convergence properties of the (NPG) policy gradient methods!

Page 29:

NPG: back to the “soft” policy iteration interpretation

• Remember the softmax policy class

$\pi_\theta(a|s) \propto \exp(\theta_{s,a})$, which has $S \cdot A$ params

• At iteration $t$, the NPG update rule

$\theta^{t+1} \leftarrow \theta^{t} + \eta\, F(\theta^{t})^{-1}\, \nabla V^{t}(s_0)$

is equivalent to a “soft” (exact) policy iteration update rule:

$\pi^{t+1}(a|s) \leftarrow \pi^{t}(a|s)\, \exp\!\Big(\frac{\eta}{1-\gamma}\, A^{\pi^t}(s, a)\Big) \Big/ Z_s$

• What happens for this non-convex update rule?
