
An Information-theoretic On-line Learning Principle for Specialization in Hierarchical Decision-Making Systems

Heinke Hihn, Sebastian Gottwald, and Daniel A. Braun
Ulm University, Institute for Neural Information Processing

December 12, 2019
58th IEEE Conference on Decision and Control, Nice


2 Introduction | Hihn et al. | December 12, 2019

Emergence of Specialized Decision-Makers

Decision-Maker: optimizes a utility U in state s:

$a^*_s = \arg\max_a U(s,a)$

Central Idea: Limited resources such as
- Linear decision-makers
- Limited information processing
drive specialization [1, 2].

Motivation: Linear decision-makers are easy to analyze.

[1] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
[2] Hihn, H., Gottwald, S., and Braun, D. A. Bounded rational decision-making with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, 2018.


3 Decision-making under Limited Resources | Hihn et al. | December 12, 2019

Bounded Rationality and Specialization

Herbert A. Simon coined the term Bounded Rationality:

Intelligent agents must invest their resources such that they optimally trade off utility against processing costs [3, 4].

Consequence: Specialization

[3] Simon, H. A. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99-118, 1955.
[4] Gershman, S. J., et al. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 2015.


4 Information-theoretic Bounded Rationality | Hihn et al. | December 12, 2019

Information-theoretic Bounded Rationality [5]

[Figure: effect of an observation on the action distribution p(a|s). With unlimited resources, high uncertainty is reduced to low uncertainty; with limited resources, high uncertainty is only reduced to some remaining uncertainty.]

$\max_{p(a|s)} \; \mathbb{E}_{p(s),\,p(a|s)}[U(s,a)] \quad \text{s.t.} \quad I(S;A) \leq C$  (1)

$p^*(a|s) = \arg\max_{p(a|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta} I(S;A)$  (2)

Mutual Information: $I(S;A) = \mathbb{E}_{p(s)}\big[D_{\mathrm{KL}}\big(p(a|s)\,\|\,p(a)\big)\big]$
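For finite state and action sets, the optimum of (2) has the familiar Boltzmann form $p^*(a|s) \propto p(a)\exp(\beta U(s,a))$ with the marginal $p(a) = \sum_s p(s)\,p^*(a|s)$ as a self-consistency condition, and it can be computed by Blahut-Arimoto-style fixed-point iterations. A minimal sketch (the toy utility matrix, state distribution, and β values below are illustrative, not from the slides):

```python
import numpy as np

def bounded_rational_policy(U, p_s, beta, n_iters=200):
    """Fixed-point iteration for eq. (2): p*(a|s) ∝ p(a) exp(beta * U(s,a)),
    with the marginal p(a) = sum_s p(s) p*(a|s) as self-consistency condition."""
    n_states, n_actions = U.shape
    p_a = np.full(n_actions, 1.0 / n_actions)        # start from a uniform prior
    for _ in range(n_iters):
        # posterior: soft-maximizes utility while staying close to the prior
        logits = np.log(p_a)[None, :] + beta * U     # shape (n_states, n_actions)
        p_a_given_s = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_a_given_s /= p_a_given_s.sum(axis=1, keepdims=True)
        # prior: marginal of the posterior under p(s)
        p_a = p_s @ p_a_given_s
    return p_a_given_s, p_a

# toy problem: 3 states, 2 actions; utilities and betas are made up
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.4]])
p_s = np.array([0.4, 0.4, 0.2])
for beta in (0.1, 10.0):                             # scarce vs. ample resources
    post, prior = bounded_rational_policy(U, p_s, beta)
    print(f"beta={beta}:\n{np.round(post, 2)}")
```

With small β the policy stays close to the marginal prior (cheap but unspecific); with large β it approaches the deterministic maximizer of U.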

[5] Ortega, P. A., and Braun, D. A. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153), 2013.


5 Information-theoretic Bounded Rationality | Hihn et al. | December 12, 2019

Hierarchical Decision-Making

Extend to a two-level hierarchy with experts $x \in X$ [6]:

$S \rightarrow X \rightarrow A$  (3)

Extended objective:

$\max_{p(a|s,x),\,p(x|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta_1} I(S;X) - \frac{1}{\beta_2} I(S;A|X)$  (4)
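For reference, the two-stage objective (4) admits nested Boltzmann-type optimality conditions of the following form (a sketch in the spirit of Genewein et al. [6]; the marginal expressions are written out here for clarity and are not taken verbatim from the slides):

```latex
\begin{align*}
  p^*(a|s,x) &\propto p(a|x)\, e^{\beta_2 U(s,a)},\\[2pt]
  p^*(x|s)   &\propto p(x)\, e^{\beta_1 \Delta F(s,x)},\qquad
  \Delta F(s,x) = \mathbb{E}_{p^*(a|s,x)}\!\Big[U(s,a)
      - \tfrac{1}{\beta_2}\log\tfrac{p^*(a|s,x)}{p(a|x)}\Big],\\[2pt]
  p(x) &= \sum_s p(s)\, p^*(x|s),\qquad
  p(a|x) = \sum_s \tfrac{p(x|s)\,p(s)}{p(x)}\, p^*(a|s,x).
\end{align*}
```

The free energy $\Delta F(s,x)$ is exactly the "Expert Objective" $f(x,s)$ that reappears in the learning objective (7) below.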

[6] Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.


6 Learning | Hihn et al. | December 12, 2019

Learning via Gradient Descent

Parametrize the distributions with parameters $\theta$ and $\vartheta$:

$J(s,x,a) = U(s,a) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)} - \frac{1}{\beta_2}\log\frac{p_\vartheta(a|s,x)}{p(a|x)}$  (5)

$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\Big[f(x,s) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)}\Big]$  (6)

$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(a|x,s)}\Big[U(s,a) - \frac{1}{\beta_2}\log\frac{p_\vartheta(a|s,x)}{p(a|x)}\Big]}_{\text{Expert Objective}}$  (7)

Approximate the prior distributions p(x) and p(a|x) by running means.
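One concrete way to realize such running-mean priors is an exponential moving average of the batch posteriors. The sketch below assumes discrete experts and actions and an illustrative mixing rate `tau`; the slides do not specify these details:

```python
import numpy as np

class RunningPriors:
    """Running-mean estimates of p(x) and p(a|x) for discrete experts x and actions a,
    kept as exponential moving averages of the observed posteriors (an assumption)."""

    def __init__(self, n_experts, n_actions, tau=0.01):
        self.tau = tau                                      # mixing rate
        self.p_x = np.full(n_experts, 1.0 / n_experts)      # prior over experts
        self.p_a_given_x = np.full((n_experts, n_actions),
                                   1.0 / n_actions)         # per-expert action prior

    def update(self, post_x, post_a_given_x):
        """post_x: p_theta(x|s) averaged over the current batch;
        post_a_given_x: p_vartheta(a|s,x) averaged over the batch, one row per expert."""
        self.p_x = (1 - self.tau) * self.p_x + self.tau * post_x
        self.p_a_given_x = ((1 - self.tau) * self.p_a_given_x
                            + self.tau * post_a_given_x)

    def info_costs(self, post_x, post_a_given_x):
        """KL penalties of eqs. (6)-(7) for one sample's posteriors."""
        kl_select = np.sum(post_x * np.log(post_x / self.p_x))
        kl_expert = np.sum(post_a_given_x
                           * np.log(post_a_given_x / self.p_a_given_x), axis=1)
        return kl_select, kl_expert
```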



7 Learning | Hihn et al. | December 12, 2019

Utilities for Classification and Regression

1. cross-entropy loss: $\mathcal{L}(y,\hat{y}) = \sum_i y_i \log\frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i$
2. mean squared error: $\mathcal{L}(y,\hat{y}) = \sum_i (y_i - \hat{y}_i)^2$

$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\Big[f(x,s) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)}\Big]$  (8)

$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(y|x,s)}\Big[-\mathcal{L}(y,\hat{y}) - \frac{1}{\beta_2}\log\frac{p_\vartheta(y|s,x)}{p(y|x)}\Big]}_{\text{Expert Objective}}$  (9)
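Stated as code, the two utilities are simply negative losses; a minimal numpy sketch with illustrative example vectors:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss between target distribution y and prediction y_hat."""
    return -np.sum(y * np.log(y_hat + eps))

def squared_error(y, y_hat):
    """Squared-error regression loss, written as a sum over components as above."""
    return np.sum((y - y_hat) ** 2)

# the utility fed into the Expert Objective (9) is the negative loss
y = np.array([0.0, 1.0, 0.0])           # one-hot target
y_hat = np.array([0.1, 0.8, 0.1])       # expert prediction
utility = -cross_entropy(y, y_hat)      # higher is better
```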



8 Results | Hihn et al. | December 12, 2019

Classification

[Figure: results on three synthetic datasets (Circles, Half Moons, Blobs). For each dataset the panels show the classification accuracy for 1, 2, and 4 experts, the learned state partition, the information quantities I(W;A|X) and I(W;X) in bits, and the expert selection prior p(x).]


9 Results | Hihn et al. | December 12, 2019

Reinforcement Learning: Setup

Markov Decision Process as a tuple $(S, A, P, r)$, where
- $S$ is the set of states
- $A$ is the set of actions
- $P: S \times A \times S \rightarrow [0,1]$ is the transition probability
- $r: S \times A \rightarrow \mathbb{R}$ is a reward function

Find a policy $\pi_\theta$ maximizing the expected reward:

$\theta^* = \arg\max_\theta \; \underbrace{\mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty} r(s_t,a_t)\Big]}_{J(\pi_\theta)}$  (10)



10 Results | Hihn et al. | December 12, 2019

RL Objective

Penalize deviation from a prior policy:

$\arg\max_\pi \; \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \Big(r(s_t,a_t) - \frac{1}{\beta}\log\frac{\pi(a_t|s_t)}{\pi(a_t)}\Big)\Big]$  (11)

Similar to MaxEnt RL [7], Trust Region Policy Optimization [8], and Mutual-Information-Regularized RL [9].
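A minimal sketch of how the per-step penalty in (11) shapes the reward signal; the prior π(a) is treated as a running marginal as on the earlier slides, and all numbers and names are illustrative:

```python
import numpy as np

def penalized_rewards(rewards, logp_actions, logprior_actions, beta, gamma=0.99):
    """Per-step reward shaped by the information cost in eq. (11):
    r'(s_t, a_t) = r(s_t, a_t) - (1/beta) * log(pi(a_t|s_t) / pi(a_t)).
    Returns the shaped rewards and their discounted sum."""
    shaped = rewards - (1.0 / beta) * (logp_actions - logprior_actions)
    discounts = gamma ** np.arange(len(rewards))
    return shaped, np.sum(discounts * shaped)

# toy rollout of length 3 (numbers are made up for illustration)
r = np.array([1.0, 0.0, 1.0])
logp = np.log(np.array([0.9, 0.6, 0.8]))      # pi(a_t | s_t)
logprior = np.log(np.array([0.5, 0.5, 0.5]))  # running marginal pi(a_t)
shaped, ret = penalized_rewards(r, logp, logprior, beta=5.0)
```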

[7] Eysenbach, B., and Levine, S. If MaxEnt RL is the Answer, What is the Question? arXiv preprint, 2019.
[8] Schulman, J., et al. Trust region policy optimization. In International Conference on Machine Learning, 2015.
[9] Leibfried, F., and Grau-Moya, J. Mutual-information regularization in Markov decision processes and actor-critic learning. Conference on Robot Learning, 2019.


11 Results | Hihn et al. | December 12, 2019

RL Objectives

Advantage-Actor-Critic [10] selection stage objective:

$\max_\theta \; \mathbb{E}_{\pi_\theta(x|s)}\Big[f(s,x) - \frac{1}{\beta_1}\log\frac{\pi_\theta(x|s)}{\pi(x)}\Big]$  (12)

where

$f(s,x) = \underbrace{\mathbb{E}_{\pi_\vartheta(a|s,x)}\Big[r(s,a) - \frac{1}{\beta_2}\log\frac{\pi_\vartheta(a|s,x)}{\pi(a|x)}\Big]}_{\text{Expert Objective}}$  (13)
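A sketch of how (12)-(13) might be turned into per-transition actor losses in an advantage-actor-critic setup. The critic baselines `v_expert` and `v_select`, and the one-step estimate of f(s,x), are assumptions for illustration, not the slides' implementation; in practice the log-probabilities would be differentiable tensors and the critics trained on the same regularized returns:

```python
def hierarchical_ac_losses(r, logp_a, logprior_a, logp_x, logprior_x,
                           v_expert, v_select, beta1, beta2):
    """Single-transition sketch of actor losses for eqs. (12)-(13).
    All inputs are scalars; v_expert / v_select are critic estimates
    (the critics themselves are assumed, not shown)."""
    # expert stage, eq. (13): reward minus the expert's information cost
    r_x = r - (1.0 / beta2) * (logp_a - logprior_a)
    adv_x = r_x - v_expert
    expert_actor_loss = -logp_a * adv_x          # REINFORCE / A2C-style surrogate

    # selection stage, eq. (12): f(s,x) estimated by the expert's regularized
    # reward, minus the selector's own information cost
    adv_sel = (r_x - (1.0 / beta1) * (logp_x - logprior_x)) - v_select
    selection_actor_loss = -logp_x * adv_sel

    return expert_actor_loss, selection_actor_loss
```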

[10] Schulman, J., et al. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2015.


12 Results | Hihn et al. | December 12, 2019

Reinforcement Learning - State Partition

[Figure: pairwise scatter plots of the CartPole state variables (cart position, cart velocity, pole angle, pole velocity), with points colored by the applied control u = -1 vs. u = +1, showing how the experts partition the state space.]


13 Results | Hihn et al. | December 12, 2019

Reinforcement Learning - Continuous Control Problems

[Figure: task schematic (with quantities a, x, α) and results: cumulative reward per episode for 1 expert, 5 experts, and a TRPO baseline; the learned action priors p(a); and the mean expert DKL in bits over episodes.] See [11].

[11] Schulman, J., et al. Trust region policy optimization. In International Conference on Machine Learning, 2015.


14 Results | Hihn et al. | December 12, 2019

Gain Scheduling

$\dot{x} = A_i x + B_i u + \varepsilon, \quad \text{for } x \in X_i, \qquad B_i = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}$  (14)
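A minimal simulation sketch of such a switched linear plant under a gain-scheduled (state-dependent linear) controller; the region dynamics `A_i`, the feedback gain, the noise scale, and the Euler discretization are all illustrative assumptions, not the slides' plant:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u, dt=0.05, noise=0.05):
    """Euler step of eq. (14) for a scalar plant: dx/dt = A_i x + B_i u + eps.
    A_i and B_i switch with the sign of the state (illustrative values)."""
    A_i = 0.5 if x >= 0 else -0.2         # assumed region dynamics
    B_i = 1.0 if x >= 0 else -1.0         # input gain flips sign, as in (14)
    dx = A_i * x + B_i * u + noise * rng.standard_normal()
    return x + dt * dx

def gain_scheduled_control(x):
    """Linear feedback whose gain is scheduled on the active region,
    compensating the sign flip of B_i."""
    k = 2.0                               # illustrative gain
    return -k * x if x >= 0 else k * x

x = 15.0
trajectory = [x]
for t in range(64):
    x = step(x, gain_scheduled_control(x))
    trajectory.append(x)
# the state decays toward 0 despite the switching input gain
```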

[Figure: cumulative control cost over iterations; policy entropies H(X|S) and H(A;S|X) in bits; mean expert DKL and mean selector DKL over iterations; and a sample plant trajectory x(t) under the learned control compared with the optimal control.]


15 Conclusion | Hihn et al. | December 12, 2019

Conclusion

- Principled method applicable to a variety of tasks
- Resource limitation drives specialization
- No prior task information required: utility-driven partitioning
- Normative framework to analyze hierarchical structures
- System built only from linear decision-makers
- Open questions:
  - High-dimensional tasks
  - Sample efficiency in RL
