Action and active inference: A free-energy formulation

Page 1

Abstract

This presentation questions the need for reinforcement learning and related paradigms from machine learning when trying to optimise the behaviour of an agent. We show that it is fairly simple to teach an agent complicated and adaptive behaviours under a free-energy principle. This principle suggests that agents adjust their internal states and their sampling of the environment to minimise their free energy. In this context, free energy represents a bound on the probability of being in a particular state, given the nature of the agent, or more specifically the model of the environment that the agent entails. We show that such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. The result is a policy that reproduces exactly the policies optimised by reinforcement learning and dynamic programming. Critically, at no point do we need to invoke the notions of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming, namely the mountain-car problem, using only the free-energy principle. The ensuing proof of concept is important because the free-energy formulation also provides a principled account of perceptual inference in the brain and furnishes a unified framework for action and perception.

Action and active inference: A free-energy formulation

Page 2

Perception, memory and attention (Bayesian brain):
\[ \mu = \arg\min_{\mu} D\big(q(\vartheta \mid \mu) \,\Vert\, p(\vartheta \mid s)\big) \]

Action and value learning (optimum control):
\[ a = \arg\max_{a} V\big(s(a)\big) \]

The free-energy principle subsumes both:
\[ \{\mu, a\} = \arg\min_{\mu, a} F\big(s(a), \mu\big) \]

[Slide schematic: in the perceptual scheme, sensory input generates a prediction error that is explained away by recognising its causes (ϑ) under the recognition density q(ϑ | μ); in the control scheme, a conditioned stimulus (CS) and reward (US) shape action through S-R and S-S learning.]
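To unpack the bound mentioned in the abstract, the standard decomposition of the variational free energy (a textbook identity, added here for clarity rather than taken from the slide) is:

\[
\begin{aligned}
F &= \big\langle -\ln p(s, \vartheta \mid m) \big\rangle_q - H\big[q(\vartheta \mid \mu)\big] \\
  &= -\ln p(s \mid m) + D\big(q(\vartheta \mid \mu) \,\Vert\, p(\vartheta \mid s, m)\big) \\
  &\ge -\ln p(s \mid m)
\end{aligned}
\]

Because the divergence term is non-negative, minimising F with respect to μ makes the recognition density approximate the posterior (perception), while minimising F with respect to a makes sensations more probable under the agent's model (action).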

Page 3

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 4

Exchange with the environment: the agent (m) and its environment are separated by a Markov blanket.

External states (environment): \( \dot{x} = f(x, a) + z \)
Sensation: \( s = g(x, a) + w \)
Internal states (perception): \( \mu = \arg\min_{\mu} F(s, \mu) \)
Action: \( a = \arg\min_{a} F(s, \mu) \)
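To make this exchange concrete, the following minimal simulation implements the two argmin operations by gradient descent for a single-state agent. The linear mappings g(x) = x and f(x, a) = a - x/2, the quadratic (Laplace-style) free energy, the reflex-arc assumption ds/da ≈ 1 and all constants are illustrative choices, not specifications from the slides.

import numpy as np

# A minimal active-inference loop: perception and action both perform
# gradient descent on the same quadratic free energy (all values assumed).
rng = np.random.default_rng(0)

dt, T = 0.01, 2000
pi_s, pi_p = 1.0, 1.0        # sensory and prior precisions (assumed)
eta = 1.0                    # prior expectation: the agent's set-point
x, mu, a = -1.0, 0.0, 0.0    # environmental state, internal state, action

for _ in range(T):
    s = x + 0.05 * rng.standard_normal()      # sensation: s = g(x) + w, with g(x) = x
    eps_s = s - mu                            # sensory prediction error
    eps_p = mu - eta                          # prior prediction error
    # F = pi_s/2 * eps_s**2 + pi_p/2 * eps_p**2
    mu += dt * (pi_s * eps_s - pi_p * eps_p)  # perception: mu follows -dF/dmu
    a -= dt * pi_s * eps_s                    # action: a follows -dF/da, taking ds/da ~ 1
    x += dt * (a - 0.5 * x)                   # environment: dx/dt = f(x, a), noise omitted

print(f"x = {x:.2f}, mu = {mu:.2f} (both drawn to the prior eta = {eta})")

Because the prior expectation η cannot be revised away, the only way to abolish prediction error is for action to move the world toward it: the agent's "goal" is just a prior.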

Page 6

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 7

Hierarchical model:
\[
\begin{aligned}
v^{(i-1)} &= g\big(x^{(i)}, v^{(i)}\big) + w^{(v,i)} \\
\dot{x}^{(i)} &= f\big(x^{(i)}, v^{(i)}\big) + w^{(x,i)}
\end{aligned}
\]

Bottom-up messages (precision-weighted prediction errors):
\[
\begin{aligned}
\xi^{(v,i)} &= \Pi^{(v,i)} \varepsilon^{(v,i)} = \Pi^{(v,i)}\big(\mu_v^{(i-1)} - g(\mu^{(i)})\big) \\
\xi^{(x,i)} &= \Pi^{(x,i)} \varepsilon^{(x,i)} = \Pi^{(x,i)}\big(D\mu_x^{(i)} - f(\mu^{(i)})\big)
\end{aligned}
\]

Top-down messages (recognition dynamics):
\[
\begin{aligned}
\dot{\mu}_v^{(i)} &= D\mu_v^{(i)} - \partial_v \varepsilon^{(i)T} \xi^{(i)} - \xi^{(v,i+1)} \\
\dot{\mu}_x^{(i)} &= D\mu_x^{(i)} - \partial_x \varepsilon^{(i)T} \xi^{(i)}
\end{aligned}
\]

Action suppresses sensory prediction error:
\[ \dot{a} = -\partial_a F = -\partial_a \tilde{s}^{T} \xi^{(s)} \]

[Slide schematic: sensory channels s1-s4 feed prediction errors up a hierarchy of hidden states x(1), x(2) and causes v(1); predictions descend as top-down messages.]

Active inference: closing the loop (synergy)
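A numerical sketch of this message passing for a static two-level linear hierarchy follows; the matrices, precisions and step size are illustrative assumptions, and the temporal derivative operator D is dropped for brevity.

import numpy as np

# Two-level linear predictive coding: bottom-up precision-weighted errors,
# top-down predictions, gradient descent on free energy (all values assumed).
W1 = np.array([[1.0, 0.5],
               [0.0, 1.0]])      # level 1: s predicted as W1 @ x
W2 = np.array([[1.0],
               [1.0]])           # level 2: x predicted as W2 @ v
pi1, pi2, pi3 = 1.0, 1.0, 0.1    # precisions at each level
s = np.array([1.0, 0.5])         # observed sensory input

mu_x = np.zeros(2)               # expectation of hidden states x
mu_v = np.zeros(1)               # expectation of causes v

for _ in range(2000):
    xi1 = pi1 * (s - W1 @ mu_x)       # sensory prediction error (bottom-up)
    xi2 = pi2 * (mu_x - W2 @ mu_v)    # error on hidden states
    xi3 = pi3 * mu_v                  # error on causes (weak prior at zero)
    mu_x += 0.05 * (W1.T @ xi1 - xi2) # descend the free-energy gradient
    mu_v += 0.05 * (W2.T @ xi2 - xi3)

print("mu_x =", mu_x.round(3), "mu_v =", mu_v.round(3))

Each unit only exchanges messages with the levels immediately above and below it, which is what makes the scheme neurally plausible.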

Page 10

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 11

Active inference under flat priors (movement with percept)

[Figure: time courses (over 60 time bins) of sensory prediction and error; hidden states (location); the cause (perturbing force); and perturbation and action. Schematic: a visual stimulus enters sensory channels s; inferred hidden states x1, x2 and cause v generate predictions g(μ); action a feeds back onto the stimulus.]

Page 12

Active inference under tight priors (no movement or percept)

[Figure: the same four panels as the previous slide (sensory prediction and error; hidden states (location); cause (perturbing force); perturbation and action), now with tight priors suppressing both movement and percept.]

Page 13

Retinal stabilisation or tracking induced by priors

[Figure: flat priors versus tight priors; panels show the displacement of the visual stimulus, action, and the perceived and real perturbation over time.]

Page 14

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 15

Active inference under tight priors (movement and percept)

[Figure: proprioceptive input enters sensory channels s1-s4; panels show sensory prediction and error; the cause (prior); perturbation and action; and hidden states (location) over time.]

Page 16

Self-generated movements induced by priors: robust to perturbation and change in motor gain

[Figure: three conditions; each shows real and perceived trajectories (displacement over time) alongside action and causes (action, perceived cause (prior), exogenous cause).]

Page 17

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 18

From reflexes to action: cued movements and sensorimotor integration

A two-joint arm anchored at (0, 0), with joint positions J1(x1) and J2(x1, x2), reaches for a visual target v = (v1, v2). Sensory input comprises proprioception and vision:
\[
\begin{aligned}
s_{pro} &= x + w_{pro} \\
s_{vis} &= J(x) + w_{vis}
\end{aligned}
\]

Recognition dynamics, with action driven by precision-weighted sensory prediction error:
\[
\begin{aligned}
\xi_s &= \Pi_s\big(s - g(\mu)\big) \\
\xi_x &= \Pi_x\big(D\mu_x - f(\mu)\big) \\
\dot{\mu}_v &= D\mu_v - \partial_v \varepsilon^{T} \xi \\
\dot{\mu}_x &= D\mu_x - \partial_x \varepsilon^{T} \xi
\end{aligned}
\]
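For concreteness, here is a small sketch of the forward kinematics J(x) for such an arm; the unit link lengths and the angle convention are assumptions, since the slide only names the quantities.

import numpy as np

# Forward kinematics of a two-joint planar arm anchored at (0, 0).
# x = (x1, x2) are joint angles; L1 and L2 are link lengths (assumed).
L1, L2 = 1.0, 1.0

def J(x):
    """Return the elbow position J1 and finger position J2."""
    x1, x2 = x
    J1 = np.array([L1 * np.cos(x1), L1 * np.sin(x1)])
    J2 = J1 + np.array([L2 * np.cos(x1 + x2), L2 * np.sin(x1 + x2)])
    return J1, J2

elbow, finger = J((np.pi / 4, np.pi / 4))   # visual prediction for one posture
print("elbow:", elbow.round(3), "finger:", finger.round(3))

Under active inference there is no explicit inverse kinematics here: action simply descends the prediction-error gradient until the predicted finger position matches the target.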

Page 19

Cued reaching with noisy proprioception

[Figure: time courses of prediction and error; hidden states (x1, x2); causal states (target v1,2 and cue v3); and perturbation and action (a1, a2), together with the arm's trajectory J(x, t) converging on the target (v1, v2) in position space.]

Page 20

Bayes optimal integration of sensory modalities

[Figure: reaching trajectories in position space under noisy proprioception versus noisy vision, together with the conditional precisions π_pro and π_vis over time.]
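The Bayes-optimal fusion behind these simulations is the standard precision-weighted combination of two Gaussian estimates; this identity is a textbook result, stated here for completeness rather than read off the slide:

\[
\mu = \frac{\pi_{pro}\, \mu_{pro} + \pi_{vis}\, \mu_{vis}}{\pi_{pro} + \pi_{vis}},
\qquad
\pi = \pi_{pro} + \pi_{vis}
\]

The noisier modality carries less precision and is automatically down-weighted, which is exactly what the conditional precisions in the figure encode.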

Page 21

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 22

The mountain car problem

Equations of motion:
\[
\dot{x} = f(x, v) =
\begin{bmatrix}
x_2 \\
a + v - G(x_1) - \tfrac{1}{4} x_2
\end{bmatrix},
\qquad
G(x_1) = \partial_{x_1} H =
\begin{cases}
2x_1 + 1 & : x_1 \le 0 \\
(1 + 5x_1^2)^{-1/2} - 5x_1^2 (1 + 5x_1^2)^{-3/2} & : x_1 \ge 0
\end{cases}
\]

[Figure: the height H(x) of the landscape and the forces G(x) as functions of position, with null-clines of the flow in position-velocity space; the desired location is at x = 1.]
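A runnable sketch of this flow follows; the 1/4 damping coefficient and the simple Euler integration are assumptions where the slide's typography was ambiguous.

import numpy as np

def G(x):
    """Gradient of the landscape: the force opposing the car (reconstructed)."""
    if x <= 0:
        return 2.0 * x + 1.0
    return (1 + 5 * x**2) ** -0.5 - 5 * x**2 * (1 + 5 * x**2) ** -1.5

def f(x, v=0.0, a=0.0):
    """Equations of motion; x = (position, velocity), v an exogenous force."""
    pos, vel = x
    return np.array([vel, a + v - G(pos) - vel / 4.0])  # damping term assumed

# Released at the origin with no control, the car settles at the fixed
# point G(x) = 0, i.e. x = -0.5, short of the desired location x = 1:
x = np.array([0.0, 0.0])
for _ in range(5000):
    x = x + 0.01 * f(x)
print("resting position ~", round(float(x[0]), 3))

The interesting feature of the problem is that the engine is too weak to climb directly to x = 1: the car must first move away from the target to gather momentum.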

Page 23

Flow and density null-clines: uncontrolled, controlled and expected

The ensemble density is the principal eigensolution of the Fokker-Planck operator implied by the flow, \( p(x \mid m) = \operatorname{eig}\big(\Lambda(f)\big) \). The controlled flow is optimised to minimise the divergence of this density from the desired (expected) density Q:
\[
D\big(p(x \mid m) \,\Vert\, Q(x \mid m)\big) = \int p(x \mid m) \ln \frac{p(x \mid m)}{Q(x \mid m)}\, dx
\]
while action performs a gradient descent on free energy, \( a = \arg\min_a F \).

[Figure: position-velocity phase portraits of flow and density for the uncontrolled, controlled and expected (Q) cases.]
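To make this divergence concrete, one can estimate the ensemble density of the uncontrolled car from a long noise-driven trajectory and measure its divergence from a desired density concentrated on the target. Everything below (the noise level, the histogram bins, the Gaussian Q) is an illustrative assumption; G(x) is repeated from the sketch above so the snippet is self-contained.

import numpy as np

rng = np.random.default_rng(2)

def G(x):
    return 2*x + 1 if x <= 0 else (1 + 5*x**2)**-0.5 - 5*x**2 * (1 + 5*x**2)**-1.5

dt = 0.01
x = np.array([0.0, 0.0])
positions = []
for _ in range(200_000):                       # long run to sample p(x | m)
    drift = np.array([x[1], -G(x[0]) - x[1] / 4.0])
    noise = np.array([0.0, np.sqrt(dt) * 0.25 * rng.standard_normal()])
    x = x + dt * drift + noise
    positions.append(x[0])

bins = np.linspace(-2, 2, 41)
dx = bins[1] - bins[0]
p, _ = np.histogram(positions, bins=bins, density=True)  # empirical p(x | m)
c = 0.5 * (bins[:-1] + bins[1:])
q = np.exp(-0.5 * ((c - 1.0) / 0.2) ** 2)                # desired density Q at x = 1
q /= q.sum() * dx
mask = p > 0
D = np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx     # D(p || Q)
print(f"D(p || Q) ~ {D:.1f} nats")

The divergence is large because the uncontrolled density sits in the valley around x = -0.5; shaping the flow to shrink this number is what constructing a controlled (training) environment means here.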

Page 24

Learning in a controlled environment; active inference in an uncontrolled environment

[Figure: for each phase, time courses of prediction and error; hidden states; causes (level 2); and perturbation and action.]

Page 25

Using only the free-energy principle and a simple gradient-ascent scheme, we have solved a benchmark problem in optimal control theory within a handful of learning trials. At no point did we use reinforcement learning or dynamic programming.

Goal-directed behaviour and trajectories

Page 26

Action under perturbation

[Figure: time courses of prediction and error, hidden states, and perturbation and action, with the ensuing behaviour plotted in position-velocity space.]

Page 27

Simulating Parkinson's disease?

[Figure: hidden states over time under three different levels of motor gain or precision.]

Page 28

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 29

Learning autonomous behaviour

[Figure: trajectories and densities in position-velocity space before and after learning, alongside the controlled flow and the desired density Q.]

Page 30

Autonomous behaviour under random perturbations

[Figure: time courses of prediction and error, hidden states, and perturbation and action, with the learnt trajectory shown in position-velocity space.]

Page 31

Overview

The free energy principle and action
Active inference and prediction error
Orientation and stabilization
Intentional movements
Cued movements
Goal-directed movements
Autonomous movements
Forward and inverse models

Page 32

Free-energy formulation versus forward-inverse formulation

Free-energy formulation: the environment {x, v, a} → s delivers sensory input s(x, v, a); a forward (generative) model {x, v} → s predicts it from the desired and inferred states (μx, μv); the sensory prediction error updates those states and is suppressed by the motor command (action), which plays the role of an inverse model s → a.

Forward-inverse formulation: an inverse model (control policy) {x, v} → a maps desired and inferred states to a motor command; the environment {x, v, a} → s returns sensory input; an efference copy of the command feeds a forward model {x, v, a} → s, whose prediction (corollary discharge) is compared with sensation to give the sensory prediction error.

Page 33

Summary

• Free energy can be minimised by action (through changes in the states generating sensory input) or by perception (through optimising the predictions of that input).

• The only way that action can suppress free energy is by reducing prediction error at the sensory level (speaking to a juxtaposition of motor and sensory systems).

• Action fulfils expectations: this can manifest as explaining away prediction error by re-sampling sensory input (e.g., visual tracking), or as intentional movement fulfilling expectations furnished by empirical priors.

• In an optimum-control setting, a training environment can be constructed by minimising the cross-entropy between the ensemble density and some desired density; the ensuing behaviour can then be learnt and reproduced under active inference.
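The last point can be made precise with a standard identity (added for clarity; it is not on the slide): the cross-entropy between the ensemble density and the desired density decomposes into the ensemble entropy plus the divergence minimised above,

\[
H(p, Q) = -\int p(x \mid m)\, \ln Q(x \mid m)\, dx
        = H\big[p(x \mid m)\big] + D\big(p(x \mid m) \,\Vert\, Q(x \mid m)\big)
\]

so driving the cross-entropy down both concentrates the ensemble density (low entropy) and aligns it with the desired density (low divergence).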