An Information-theoretic On-line Learning Principle for Specialization in Hierarchical Decision-Making Systems

Heinke Hihn, Sebastian Gottwald, and Daniel A. Braun
Ulm University, Institute for Neural Information Processing

December 12, 2019, 58th IEEE Conference on Decision and Control, Nice
Slide 2: Introduction
Emergence of Specialized Decision-Makers

Decision-maker: optimizes a utility $U$ in state $s$:

$$a^*_s = \arg\max_a U(s, a)$$

Central idea: limited resources such as
- linear decision-makers
- limited information processing

drive specialization [1, 2].

Motivation: Linear decision-makers are easy to analyze.
[1] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
[2] Hihn, H., Gottwald, S., and Braun, D. A. Bounded rational decision-making with adaptive neural network priors. IAPR Workshop on Artificial Neural Networks in Pattern Recognition, 2018.
Slide 3: Decision-making under Limited Resources
Bounded Rationality and Specialization

Herbert A. Simon coined the term Bounded Rationality: intelligent agents must invest their resources such that they optimally trade off utility against processing costs [3, 4].

Consequence: Specialization
[3] Simon, H. A. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99-118, 1955.
[4] Gershman, S. J., et al. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 2015.
Slide 4: Information-theoretic Bounded Rationality
Information-theoretic Bounded Rationality [5]
[Figure: action distributions $p(a|s)$ before and after an observation. With unlimited resources, high uncertainty collapses to low uncertainty; with limited resources, some uncertainty remains.]
$$\max_{p(a|s)} \; \mathbb{E}_{p(s)\,p(a|s)}[U(s,a)] \quad \text{s.t.} \quad I(S;A) \le C \qquad (1)$$

$$p^*(a|s) = \arg\max_{p(a|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta} I(S;A) \qquad (2)$$

Mutual information: $I(S;A) = \mathbb{E}_{p(s)}\left[ D_{\mathrm{KL}}\big(p(a|s) \,\|\, p(a)\big) \right]$
[5] Ortega, P. A., and Braun, D. A. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153), 2013.
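Not spelled out on the slide: the optimum of Eq. (2) satisfies the fixed-point equations $p^*(a|s) \propto p(a) \exp(\beta U(s,a))$ and $p(a) = \sum_s p(s)\, p^*(a|s)$ (cf. Genewein et al. [1]), which can be solved by Blahut-Arimoto-style iteration. A minimal NumPy sketch; the function name and iteration count are our choices:

```python
import numpy as np

def bounded_rational_policy(U, p_s, beta, iters=100):
    """Blahut-Arimoto-style iteration for Eq. (2).

    Alternates between the Boltzmann-like posterior
    p(a|s) ~ p(a) exp(beta * U(s,a)) and its marginal p(a).
    U: (n_states, n_actions) utility matrix; p_s: state prior.
    """
    n_a = U.shape[1]
    p_a = np.full(n_a, 1.0 / n_a)          # uniform initial action prior
    for _ in range(iters):
        logits = np.log(p_a) + beta * U     # unnormalized log posterior
        p_a_given_s = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_a_given_s /= p_a_given_s.sum(axis=1, keepdims=True)
        p_a = p_s @ p_a_given_s             # p(a) = sum_s p(s) p(a|s)
    return p_a_given_s, p_a

# beta -> infinity recovers the rational arg-max policy;
# beta -> 0 forces p(a|s) toward the prior p(a) (zero information).
```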
Slide 5: Information-theoretic Bounded Rationality
Hierarchical Decision-Making

Extend to a two-level hierarchy with experts $x \in X$ [6]:

$$S \rightarrow X \rightarrow A \qquad (3)$$

Extended objective:

$$\max_{p(a|s,x),\, p(x|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta_1} I(S;X) - \frac{1}{\beta_2} I(S;A|X) \qquad (4)$$
[6] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
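To make the two information costs in Eq. (4) concrete, the following sketch computes $I(S;X)$ and $I(S;A|X)$ for tabular distributions. The helper names are ours, and strictly positive probability tables are assumed so the logarithms stay finite:

```python
import numpy as np

def mutual_information(p_s, p_x_given_s):
    """I(S;X) = E_{p(s)}[ KL(p(x|s) || p(x)) ] in bits."""
    p_x = p_s @ p_x_given_s                      # marginal p(x)
    kl = np.sum(p_x_given_s * np.log2(p_x_given_s / p_x), axis=1)
    return p_s @ kl

def conditional_mutual_information(p_s, p_x_given_s, p_a_given_sx):
    """I(S;A|X) = E_{p(s,x)}[ KL(p(a|s,x) || p(a|x)) ] in bits."""
    p_sx = p_s[:, None] * p_x_given_s            # joint p(s, x)
    p_x = p_sx.sum(axis=0)
    p_s_given_x = p_sx / p_x                     # p(s|x), shape (S, X)
    # expert marginals p(a|x) = sum_s p(s|x) p(a|s,x)
    p_a_given_x = np.einsum('sx,sxa->xa', p_s_given_x, p_a_given_sx)
    kl = np.sum(p_a_given_sx * np.log2(p_a_given_sx / p_a_given_x[None]),
                axis=2)
    return np.sum(p_sx * kl)
```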
Slide 6: Learning
Learning via Gradient Descent

Parametrize the distributions with parameters $\theta$ and $\vartheta$:

$$J(s,x,a) = U(s,a) - \frac{1}{\beta_1} \log\frac{p_\theta(x|s)}{p(x)} - \frac{1}{\beta_2} \log\frac{p_\vartheta(a|s,x)}{p(a|x)} \qquad (5)$$

$$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\left[ f(x,s) - \frac{1}{\beta_1} \log\frac{p_\theta(x|s)}{p(x)} \right] \qquad (6)$$

$$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(a|x,s)}\left[ U(s,a) - \frac{1}{\beta_2} \log\frac{p_\vartheta(a|s,x)}{p(a|x)} \right]}_{\text{Expert Objective}} \qquad (7)$$

The prior distributions $p(x)$ and $p(a|x)$ are approximated by running means.
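The slides leave the running-mean approximation unspecified; one common choice is an exponential moving average. A minimal sketch, together with a single-sample evaluation of Eq. (5); the decay rate `tau` and all helper names are our assumptions:

```python
import numpy as np

class RunningMeanPrior:
    """Track a marginal prior such as p(x) or p(a|x) by an
    exponential running mean over the learner's posteriors."""
    def __init__(self, dim, tau=0.99):
        self.p = np.full(dim, 1.0 / dim)   # start from a uniform prior
        self.tau = tau

    def update(self, posterior):
        # p <- tau * p + (1 - tau) * p_theta(x|s_t)
        self.p = self.tau * self.p + (1.0 - self.tau) * posterior
        return self.p

def sample_objective(U_sa, log_px_s, log_px, log_pa_sx, log_pa_x,
                     beta1, beta2):
    """Single-sample estimate of J(s, x, a) from Eq. (5)."""
    return (U_sa
            - (log_px_s - log_px) / beta1
            - (log_pa_sx - log_pa_x) / beta2)
```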
Slide 7: Learning
Utilities for Classification and Regression

1. Cross-entropy loss: $L(\hat{y}, y) = \sum_i y_i \log\frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i$
2. Mean squared error: $L(\hat{y}, y) = \sum_i (\hat{y}_i - y_i)^2$

$$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\left[ f(x,s) - \frac{1}{\beta_1} \log\frac{p_\theta(x|s)}{p(x)} \right] \qquad (8)$$

$$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(\hat{y}|x,s)}\left[ -L(\hat{y}, y) - \frac{1}{\beta_2} \log\frac{p_\vartheta(\hat{y}|s,x)}{p(\hat{y}|x)} \right]}_{\text{Expert Objective}} \qquad (9)$$
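For concreteness, a small sketch of the two losses that stand in for the (negative) utility in Eqs. (8)-(9); function names and the log-guard `eps` are our choices:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y_hat, y) = -sum_i y_i log y_hat_i; eps guards log(0)."""
    return -np.sum(y * np.log(y_hat + eps))

def mean_squared_error(y, y_hat):
    """L(y_hat, y) = sum_i (y_hat_i - y_i)^2."""
    return np.sum((y_hat - y) ** 2)

# Substituting U = -L turns the expert objective of Eq. (7)
# into the supervised-learning objective of Eq. (9).
```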
Slide 8: Results
Classification

[Figure: classification results on three synthetic datasets (Circles, Half Moons, Blobs). For each dataset, panels show the input data, the learned state partition across experts, accuracy (%) for 1, 2, and 4 experts, the information costs I(W;X) and I(W;A|X) in bits, and the expert selection prior p(x).]
Slide 9: Results
Reinforcement Learning: Setup

A Markov Decision Process is a tuple $(S, A, P, r)$, where
- $S$ is the set of states
- $A$ is the set of actions
- $P : S \times A \times S \rightarrow [0,1]$ is the transition probability
- $r : S \times A \rightarrow \mathbb{R}$ is a reward function
Find the policy $\pi_\theta$ maximizing the expected reward:

$$\theta^* = \arg\max_\theta \; \underbrace{\mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} r(s_t, a_t) \right]}_{J(\pi_\theta)} \qquad (10)$$
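Eq. (10) is typically estimated from sampled trajectories; a minimal Monte Carlo sketch, assuming a Gym-like environment interface (our convention, not from the slides):

```python
import numpy as np

def estimate_return(env, policy, n_episodes=100, horizon=1000):
    """Monte Carlo estimate of J(pi) = E_tau[ sum_t r(s_t, a_t) ].

    Assumed interface: env.reset() -> s, env.step(a) -> (s, r, done),
    policy(s) -> a. The infinite horizon of Eq. (10) is truncated
    at `horizon` steps.
    """
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        total = 0.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)
            total += r
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```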
Slide 10: Results
RL Objective

Penalize deviation from a prior policy:

$$\arg\max_\pi \; \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) - \frac{1}{\beta} \log\frac{\pi(a_t|s_t)}{\pi(a_t)} \right) \right] \qquad (11)$$
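One way to implement Eq. (11) is as reward shaping: subtract the per-step information cost from the environment reward and hand the result to any standard RL algorithm. A minimal sketch; the helper name is ours:

```python
def penalized_reward(r, log_pi_a_s, log_prior_a, beta):
    """Shaped reward from Eq. (11): the agent pays
    (1/beta) * log(pi(a|s) / pi(a)) for deviating from the
    action prior pi(a), which can again be tracked by a
    running mean over the policy's outputs."""
    return r - (log_pi_a_s - log_prior_a) / beta
```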
This is similar to MaxEnt RL [7], Trust Region Policy Optimization [8], and Mutual Information Regularized RL [9].
[7] Eysenbach, B. and Levine, S. If MaxEnt RL is the Answer, What is the Question? arXiv preprint, 2019.
[8] Schulman, J., et al. Trust region policy optimization. International Conference on Machine Learning, 2015.
[9] Leibfried, F., and Grau-Moya, J. Mutual-information regularization in Markov decision processes and actor-critic learning. Conference on Robot Learning, 2019.