On Linking Reinforcement Learning with Unsupervised Learning
Cornelius Weber, FIAS
presented at Honda HRI, Offenbach, 17th March 2009
Jan 22, 2016
for taking action, we need only the relevant features
(figure labels: x, y, z)
unsupervised learning in cortex
reinforcement learning in basal ganglia
Doya, 1999
state space, actor
1-layer RL model of BG ...
go left? go right?
... is too simple to handle complex input
complex input (cortex)
need another layer(s) to pre-process complex data
feature detection
action selection
actor
state space
models’ background:
- gradient descent methods generalize RL to several layers: Sutton & Barto, RL book (1998); Tesauro (1992; 1995)
- reward-modulated Hebb: Triesch, Neur Comp 19, 885-909 (2007); Roelfsema & van Ooyen, Neur Comp 17, 2176-214 (2005); Franz & Triesch, ICDL (2007)
- reward-modulated activity leads to input selection: Nakahara, Neur Comp 14, 819-44 (2002)
- reward-modulated STDP: Izhikevich, Cereb Cortex 17, 2443-52 (2007); Florian, Neur Comp 19/6, 1468-502 (2007); Farries & Fairhall, J Neurophysiol 98, 3648-65 (2007); ...
- RL models learn partitioning of input space: e.g. McCallum, PhD Thesis, Rochester, NY, USA (1996)
sensory input
reward
action
scenario: bars controlled by the actions ‘up’, ‘down’, ‘left’, ‘right’;
reward given if the horizontal bar is at a specific position
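A minimal sketch of such a bars scenario, for illustration only. It assumes one horizontal and one vertical bar on a 12x12 grid, that 'up'/'down' move the horizontal bar and 'left'/'right' the vertical bar, and that bars wrap around at the borders. The class name `BarsEnv`, the parameter `goal_row` and the reward value 1.0 are hypothetical and not taken from the slides.

```python
import random


class BarsEnv:
    """Toy bars world: one horizontal and one vertical bar on a size x size grid."""

    ACTIONS = ("up", "down", "left", "right")

    def __init__(self, size=12, goal_row=0, seed=None):
        self.size = size
        self.goal_row = goal_row  # assumed rewarded position of the horizontal bar
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Place both bars at random positions and return the first image."""
        self.row = self.rng.randrange(self.size)  # position of the horizontal bar
        self.col = self.rng.randrange(self.size)  # position of the vertical bar
        return self.observe()

    def observe(self):
        """Flattened binary image: a pixel is 1.0 if it lies on either bar."""
        return [
            1.0 if (r == self.row or c == self.col) else 0.0
            for r in range(self.size)
            for c in range(self.size)
        ]

    def step(self, action):
        """Move one bar (with assumed wrap-around); return (image, reward)."""
        if action == "up":
            self.row = (self.row - 1) % self.size
        elif action == "down":
            self.row = (self.row + 1) % self.size
        elif action == "left":
            self.col = (self.col - 1) % self.size
        elif action == "right":
            self.col = (self.col + 1) % self.size
        else:
            raise ValueError(f"unknown action: {action}")
        # The reward depends only on the horizontal bar; the vertical bar is a distractor.
        reward = 1.0 if self.row == self.goal_row else 0.0
        return self.observe(), reward
```

In this sketch the vertical bar never influences the reward, so a learner only needs features that encode the horizontal bar's position.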
model that learns the relevant features
top layer: SARSA RL
lower layer: winner-take-all feature learning
both layers: modulate learning by δ
RL weights
feature weights
input
action
SARSA with WTA input layer
note: non-negativity constraint on weights
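The two-layer scheme can be sketched roughly as follows. This is an illustrative reading rather than the presented implementation: it assumes a hard winner-take-all feature layer, epsilon-greedy action choice, a feature-weight update for the winning unit proportional to δ times the current action value times the input, and rectification of the feature weights only. All function names, learning rates and the value of `gamma` are hypothetical.

```python
import random


def wta(W, x):
    """Index of the feature unit with the largest weighted input."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(h)), key=h.__getitem__)


def choose_action(Q, j, epsilon, rng):
    """Epsilon-greedy choice over the action values of winning unit j."""
    n_actions = len(Q)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda k: Q[k][j])


def sarsa_wta_update(W, Q, x, j, k, r, j_next, k_next,
                     gamma=0.9, lr_q=0.1, lr_w=0.01):
    """One delta-modulated update of both layers; returns the TD error delta.

    W[j][n]: feature weights (unit j, input pixel n)
    Q[k][j]: RL action weights (action k, feature unit j)
    """
    # SARSA temporal-difference error for the transition (j, k) -> (j_next, k_next)
    delta = r + gamma * Q[k_next][j_next] - Q[k][j]
    q_old = Q[k][j]
    # Top layer: move the value of the taken action towards the target
    Q[k][j] += lr_q * delta
    # Lower layer: Hebbian-like change of the winner's weights, gated by delta;
    # max(0.0, ...) enforces the non-negativity constraint (assumed for W only)
    W[j] = [max(0.0, w + lr_w * delta * q_old * xi) for w, xi in zip(W[j], x)]
    return delta
```

Because both updates are scaled by δ, features only change when the value prediction is wrong, which is one way to read "both layers: modulate learning by δ".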
Energy function: estimation error of state-action value
identities used:
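One common way to write such an energy function and the resulting updates, given here as a hedged sketch. It assumes the standard SARSA temporal-difference error δ, a one-hot action vector a, a feature-layer activation s computed from input I through weights W, the next-step target treated as a constant, and the approximation ∂s_j/∂W_{jn} ≈ s_j I_n. The symbols γ, ε_Q and ε_W are assumptions, not taken from the slides.

```latex
E = \tfrac{1}{2}\,\delta^{2}, \qquad
\delta = r + \gamma\, Q(s', a') - Q(s, a), \qquad
Q(s, a) = \sum_{k,j} a_k\, Q_{kj}\, s_j

\Delta Q_{kj} = -\varepsilon_Q \frac{\partial E}{\partial Q_{kj}}
             = \varepsilon_Q\, \delta\, a_k\, s_j

\Delta W_{jn} = -\varepsilon_W \frac{\partial E}{\partial W_{jn}}
             \approx \varepsilon_W\, \delta \Big( \sum_k a_k\, Q_{kj} \Big) s_j\, I_n
```

With a hard winner-take-all layer, s_j is 1 for the winning unit and 0 elsewhere, so under these assumptions only the winner's weights change.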
RL action weights
feature weights
data
learning the ‘short bars’ data
reward
action
short bars in 12x12; average # of steps to goal: 11
RL action weights
feature weights
input, reward, 2 actions (not shown)
data
learning ‘long bars’ data
WTA, non-negative weights
SoftMax, non-negative weights
SoftMax, no weight constraints
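The three conditions differ in the feature-layer activation and in whether weights are clipped. A small sketch of these ingredients, with hypothetical function names and an assumed inverse-temperature parameter `beta` for the softmax:

```python
import math


def winner_take_all(h):
    """One-hot vector: 1.0 for the largest input, 0.0 elsewhere."""
    winner = max(range(len(h)), key=h.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(h))]


def softmax(h, beta=1.0):
    """Graded activation that sums to 1; approaches winner-take-all as beta grows."""
    m = max(h)  # subtract the maximum for numerical stability
    e = [math.exp(beta * (v - m)) for v in h]
    z = sum(e)
    return [v / z for v in e]


def rectify(weights):
    """Non-negativity constraint: clip negative weights to zero."""
    return [max(0.0, w) for w in weights]
```

For example, `winner_take_all([0.2, 0.9, 0.1])` selects only the second unit, while `softmax` on the same input gives every unit a nonzero share.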
Discussion
- simple model: SARSA on winner-take-all network with δ-feedback
- learns only the features that are relevant for action strategy
- theory behind: derivation of value function estimation (approx.)
- non-negative coding aids feature extraction
- link between unsupervised and reinforcement learning
- demonstration with more realistic data needed
Sponsors
Bernstein Focus Neurotechnology, BMBF grant 01GQ0840
EU project 231722 “IM-CLeVeR”, call FP7-ICT-2007-3
Frankfurt Institute for Advanced Studies, FIAS
thank you ...