PhD Presentation Biologically-inspired Models for Learning Agents

PhD Presentation - ULisboaweb.ist.utl.pt/.../documents/phdpresentation4gaips.pdf · web.ist.utl.pt/~pedro.sequeira/phd Introduction Motivation Case Studies Conclusions

Jun 22, 2020

dariahiddleston
Transcript
web.ist.utl.pt/~pedro.sequeira/phd

Introduction

Motivation

Case Studies

Conclusions

Prof. Francisco Melo as thesis co-supervisor

CAT in mid-July

Objectives

General problem

General solution

Focus on case studies / experiments

Main idea

Provide learning models to autonomous agents

Inspired by biological models

Definitions [Franklin & Graesser, 1997; Maes, 1994]

situated in dynamic environments

have and actively pursue goals

satisfy their needs

respond to external events from the environment

MAS (multi-agent systems): live and interact with other agents

Requirements [Franklin & Graesser, 1997; Maes, 1994]

mechanisms to distinguish perceived features

focus on relevant features, ignore non-important ones

adapt to and learn new knowledge from the environment

take the right action at each decision time

structures that represent the acquired knowledge

update representations over time to reflect experience

Building Agents

Key is ADAPTATION

Provide prior knowledge

sufficient for the agent to perceive its environment

Use learning mechanisms

update the agent’s knowledge

Problems

Prior knowledge

lots of pre-programming of behaviors

large knowledge bases

Perceptual limitations

world dynamics, good states

Acting limitations

good actions

Learning

which paradigm / framework to use?

Parallel between natural and artificial agents

Inhabit highly dynamic environments

Have to make complex decisions under uncertainty

Limited perceptual and acting capabilities

Focus on important events

Live in organized societies

Inspiration from biological models

Evolutionary adaptive mechanisms

Simple but powerful survival tools

Improve performance with experience

Make the most of the perceived information

Lead to greater fitness

Inspiration from several research areas

Psychology

Biology

Ethology

Neuroscience

Classical conditioning in RL

Improve learning speed

State-space reduction

Emotion-based Intrinsically Motivated RL (IMRL)

Single-agent event-processing mechanism

Use emotions as intrinsic rewards

Clues from agent-environment relationship

Improve agent fitness

Socially-aware IMRL

Multi-agent social processing mechanism

Use affiliation / cooperation

Improve population fitness

Inspired by animal learning

Teach an animal to respond in a certain way

Provide reward and punishment appropriately

Main Ideas [Sutton & Barto, 1998]

Learn from experience

Situations + Actions → Reward

Reward is external feedback signal

Objective: maximize the reward received over time

Task: discover which actions maximize reward in each state

Trial-and-error search

Mind subsequent (delayed) rewards
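To make the point about delayed rewards concrete, here is a minimal sketch of the discounted return commonly used in RL. The discount factor `gamma` and the two sample reward sequences are illustrative assumptions, not values from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per step of delay."""
    total = 0.0
    # Walk backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sequences: a small reward now versus a larger one later.
immediate = discounted_return([1.0, 0.0, 0.0])
delayed = discounted_return([0.0, 0.0, 5.0])
```

With these made-up numbers the delayed sequence is worth more than the immediate one, which is why a learner that only looks at the next reward can choose poorly.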

Main Idea

Inspiration from classical conditioning paradigm

Partition observations into stimuli

Propose a measure for distance between states

Learn the value of states based on proximate states

Propagated learning

Reduce state space

Reduce learning time

Classical Conditioning [Pavlov, 1927]

Advantages

Contingency between stimuli in the environment

Independent of the animal's behavior

Animal does not learn behavior consequences

Predict the outcomes of new events from already-known situations

Create new contexts for behavior activation

[Diagram: US (food delivery) → UR (salivation); CS (bell) paired during training…; CS (bell) → CR (conditioned salivation)]

Model

Based on Sensory Pattern Mining [Sequeira & Antunes, 2010]

Partition observations into stimuli

e.g. see bone, has ball, hear “Fetch!”

Build tree containing frequent patterns

Model

Use the Jaccard index [Jaccard, 1912]

frequency of intersection between stimuli over frequency of union of the stimuli

Advantages

Sensitive to particular correlations between stimuli

Rapid access to frequent patterns
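The verbal definition of the Jaccard index can be sketched as follows. The function name, the representation of observations as sets of stimulus labels, and the sample data are assumptions for illustration. The slides' model reads frequencies from a pattern tree, which this sketch replaces with a plain scan.

```python
def jaccard_index(observations, stimulus_a, stimulus_b):
    """Co-occurrence frequency of two stimuli over the frequency of either."""
    both = sum(1 for obs in observations
               if stimulus_a in obs and stimulus_b in obs)
    either = sum(1 for obs in observations
                 if stimulus_a in obs or stimulus_b in obs)
    # Two stimuli that were never observed are treated as unrelated.
    return both / either if either else 0.0


# Hypothetical observation history; each observation is a set of stimuli.
observations = [
    {"see_bone", "hear_fetch"},
    {"see_bone", "hear_fetch", "has_ball"},
    {"see_bone"},
    {"has_ball"},
]
similarity = jaccard_index(observations, "see_bone", "hear_fetch")
```

Here the two stimuli co-occur in two observations and at least one of them appears in three, so the index is 2/3.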

Learning Model

Extend Q-learning algorithm [Watkins, 1989]

Determine similar states using the pattern tree

State distance measure:

Propagated multi-state update of values

New state receives information from similar states
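One possible reading of the propagated multi-state update is sketched below. It is not the thesis' algorithm: the state distance measure is not reproduced in this transcript, so the `similar_states` mapping (state to a similarity in [0, 1]) is a hypothetical stand-in. Sharing the same TD error scaled by similarity is likewise an assumption.

```python
from collections import defaultdict


def propagated_q_update(q, state, action, reward, next_state, actions,
                        similar_states, alpha=0.1, gamma=0.9):
    """Q-learning update on `state`, then a similarity-weighted share of
    the same TD error for each similar state (assumed propagation rule)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    for other, similarity in similar_states.items():
        # Less similar states receive a proportionally smaller update.
        q[(other, action)] += alpha * similarity * td_error
    return td_error


q = defaultdict(float)
actions = ["Pick", "Drop", "Eat"]
# "s2" is assumed to be half as similar to "s0" as "s0" is to itself.
propagated_q_update(q, "s0", "Eat", 1.0, "s1", actions, {"s2": 0.5})
```

After this single step, the never-visited state "s2" already holds a value estimate for Eat, which is the kind of head start the slides attribute to propagated learning.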

Experiment

Inspired by animal training

stimuli: 3 visual, 2 tactile, 2 auditory

actions: Pick, Drop, Eat, Approach Trainer, Approach Ball

4 phases: acquisition, extinction, association, substitution

Objectives

form associations between co-occurring stimuli

evoke innate responses to new stimuli

discover new contexts for already-known responses

Main results

Faster initial learning

Secondary conditioning (e.g. “Fetch!” heard in more cells)

New contexts for actions (e.g. Eat when bone is present)

Main Ideas [Singh et al., 2010]

Reward behaviors rather than consequences

Agent receives augmented reward

extrinsic reward

“normal” reward in RL, related to the task (e.g. fulfillment of needs)

intrinsic reward

not directly related to the task (e.g. play or explore)

Objective: maximize total reward

Main Idea

Inspiration from emotional appraisal mechanisms

Mathematical adaptation of dimensions

Emotions as intrinsic rewards

Integrate with IMRL framework

Provide clues from agent-environment relationship

Enhance single agent fitness

Emotions [Dawkins, 2000; Cardinal et al., 2002]

Evolutionary adaptive mechanism

Combined with learning, signal advantageous and dangerous situations

help when seeking food and avoiding harm

Bias decision making [Naqvi et al., 2006]

maximizing reward and minimizing punishment

In humans [Phelps & LeDoux, 2005]

memory enhancement, sensory plasticity, attention facilitation, regulation of social behavior, regulation and inhibition of emotional responses

Appraisal theories of emotion [Ellsworth & Scherer, 2003; Leventhal & Scherer, 1987]

Emotions arise from evaluations

Characterize subject-environment relationship

Significance for the person’s well-being or goals

Appraisal dimensions

each dimension evaluates a specific aspect

Model of emotions in IMRL

Inspired by appraisal theories of emotion

Adopt four common appraisal dimensions

novelty, motivation, valence, control

each evaluates agent-environment relationship

numerical value represents dimension activation

Use dimension adaptations as reward features

each feature is component of intrinsic reward
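A minimal sketch of how the four dimension activations could enter the reward. The linear weighting, the `extrinsic_weight` parameter, and all numbers are assumptions for illustration. The slides only state that each feature is a component of the intrinsic reward and that a weight vector is optimized.

```python
FEATURES = ("novelty", "motivation", "valence", "control")


def total_reward(extrinsic, activations, weights, extrinsic_weight=1.0):
    """Extrinsic reward plus a weighted sum of appraisal activations."""
    intrinsic = sum(weights[name] * activations[name] for name in FEATURES)
    return extrinsic_weight * extrinsic + intrinsic


# Hypothetical weight vector and activations for a single time step.
weights = {"novelty": 0.5, "motivation": 0.0, "valence": 0.25, "control": 0.0}
activations = {"novelty": 1.0, "motivation": 0.2, "valence": 0.4, "control": 0.9}
reward = total_reward(extrinsic=1.0, activations=activations, weights=weights)
```

Setting a weight to zero switches a dimension off, so under this reading the "only extrinsic" baseline corresponds to an all-zero intrinsic weight vector.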

Affective reward features

Adaptation from Major Dimensions of Appraisal [Ellsworth & Scherer, 2003]

intentionally did not adapt social dimensions

Problem: appraisal theories usually deal with high-level psychological processes

complex concepts (e.g. causal attribution, norms)

Solution: inspiration from the Multilevel Process Theory of Emotion [Leventhal & Scherer, 1987]

appraise events at different levels

emotions range from reflex-like responses to complex cognitive patterns

Evaluate aspects of the agent’s history of interaction

Affective reward features

Novelty: degree of familiarity of events

Valence: innate pleasure detector, learned preferences

Motivation: relevance of event for goals or needs

Control: degree of correctness of the world-model
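As a rough illustration of how such a feature can be computed from the agent's history of interaction, here is a count-based stand-in for novelty. It is not the definition used in the thesis, which this transcript does not reproduce. The class name and the 1/n decay are assumptions.

```python
from collections import Counter


class NoveltyFeature:
    """Assumed novelty measure: decays with how often a pair was seen."""

    def __init__(self):
        self.visits = Counter()

    def observe(self, state, action):
        # Record the visit, then return 1/n so familiar events score low.
        self.visits[(state, action)] += 1
        return 1.0 / self.visits[(state, action)]


novelty = NoveltyFeature()
first = novelty.observe((0, 0), "N")    # never seen before
second = novelty.observe((0, 0), "N")   # now familiar
```

The other three features would need similar history-based definitions, which are left out here rather than guessed.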

Experiments

Grid-world scenarios inspired by foraging environments

agent is a predator, tries to eat prey in the environment

observations: cell position, see prey

actions: N, S, E, W, Eat

Dyna-Q/prioritized sweeping algorithm [Moore & Atkeson, 1993]

Objectives

maximize the agent’s fitness (extrinsic reward)

optimize feature weight vector
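For readers unfamiliar with the learning algorithm, here is a minimal sketch of plain Dyna-Q with a deterministic model and uniformly sampled replay. It does not implement the priority queue of prioritized sweeping. The learning rate, discount, number of planning steps and the two sample transitions are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["N", "S", "E", "W", "Eat"]


def q_update(q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One-step Q-learning backup."""
    best_next = max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def dyna_q_step(q, model, s, a, r, s2, planning_steps=5, rng=random):
    q_update(q, s, a, r, s2)         # learn from the real transition
    model[(s, a)] = (r, s2)          # remember it (deterministic model)
    for _ in range(planning_steps):  # replay remembered transitions
        ps, pa = rng.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(q, ps, pa, pr, ps2)


q, model = defaultdict(float), {}
# Hypothetical experience: move east to the prey's cell, then eat it.
dyna_q_step(q, model, (0, 0), "E", 0.0, (0, 1))
dyna_q_step(q, model, (0, 1), "Eat", 1.0, (0, 1))
```

The planning loop is what lets the value of eating spread back to the earlier move without the agent having to repeat it in the environment.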

Exploration scenario

One prey

Eat prey, r_ext = 1

Non-Markovian

Results

optimal weight vector

optimal fitness: 1,902.2

"only extrinsic" fitness: 135.9

Persistence scenario

Two prey: rabbit and hare

Eat prey

rabbit: r_ext = 0.1; hare: r_ext = 1

Fence

crossing takes n North actions; the next time, n+1

Non-Markovian
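The fence rule can be sketched as a small piece of environment state, which also shows why the scenario is non-Markovian: the required number of actions is hidden from the agent's observation. The class name, the requirement that North actions be consecutive, and the starting value of n are assumptions.

```python
class Fence:
    """Fence that needs one more North action each time it is crossed."""

    def __init__(self, initial_required=1):
        self.required = initial_required   # hidden from the agent
        self.consecutive_north = 0

    def step(self, action):
        """Return True when this action completes a crossing."""
        if action != "N":
            self.consecutive_north = 0     # assumed: any other action resets
            return False
        self.consecutive_north += 1
        if self.consecutive_north < self.required:
            return False
        self.consecutive_north = 0
        self.required += 1                 # the next crossing is harder
        return True


fence = Fence(initial_required=2)
first_crossing = [fence.step("N") for _ in range(2)]
second_crossing = [fence.step("N") for _ in range(3)]
```

Because the same observation and action can lead to different outcomes depending on the hidden counter, an agent has to persist to reach the better reward.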

Results

optimal weight vector

optimal fitness: 1,020.8

"only extrinsic" fitness: 25.4

Prey-season scenario

Two prey: rabbit and hare

Eat prey

rabbit: r_ext = 0.1; hare: r_ext = 1

Two seasons: rabbit and hare

10,000 steps

if 10 rabbits eaten, r_ext = -1

Non-Markovian

Results

optimal weight vector

optimal fitness: 5,203.5

"only extrinsic" fitness: 334.2

Different rewards scenario

Two prey: rabbit and hare

always available

Eat prey

rabbit: r_ext = 0.1; hare: r_ext = 1

Markovian

Results

optimal weight vector

optimal fitness: 87,925.7

"only extrinsic" fitness: 87,890.8

Conclusions

Intrinsic reward features based on emotional appraisal

Guide the agent during learning

Focus on specific aspects of the environment

Balance between different strategies

Bring attention to advantageous states

Ignore less favorable states

Main Idea

Integrate with IMRL framework

Multi-agent scenarios

Inspiration from affiliation and altruism

Mathematical adaptation of social signals

Emergence of socially-aware behaviors

Raise the fitness of the population

And even the fitness of each individual agent

Affiliation [Dörner, 1999; Bach, 2009]

Urge to affiliate / interact with other agents

Send and receive legitimacy signals

reward socially-acceptable behaviors (l-signals)

punish unsuccessful interactions (anti l-signals)

internally reward or punish socially-aware behaviors (internal l-signals)

Altruism [de Waal, 2008]

Intrinsic reward when an action benefits the social group

initial cost but subsequent compensation

Model for socially-motivated learning

Fitness measured at the population level:

Intrinsic reward: two social features

external signal: received from other agents, based on l-signal

internal signal: generated by the agent, based on internal l-signal

represent level of satisfaction of affiliation need

Total reward

Social features for limited resource scenarios

Extrinsic reward

r_ext = IsFull - 0.1 × IsHungry

External reward feature

rsExt = LastToEat AND SeeFood AND SeeOther AND !Eat

Internal reward feature

rsInt = LastToEat AND SeeFood AND !Eat
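The three expressions above translate almost directly into code. In this sketch the `Observation` container and its field names are hypothetical, booleans are mapped to 0/1, and `!Eat` is read as "the agent chose an action other than Eat", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    is_full: bool
    is_hungry: bool
    last_to_eat: bool
    see_food: bool
    see_other: bool


def extrinsic_reward(obs):
    # IsFull - 0.1 * IsHungry, with booleans mapped to 0/1.
    return float(obs.is_full) - 0.1 * float(obs.is_hungry)


def external_social_feature(obs, action):
    # Active when the last eater sees food and the other agent, yet abstains.
    return float(obs.last_to_eat and obs.see_food and obs.see_other
                 and action != "Eat")


def internal_social_feature(obs, action):
    # Same condition without requiring the other agent to be visible.
    return float(obs.last_to_eat and obs.see_food and action != "Eat")
```

The only difference between the two social features is the `SeeOther` term: the external one requires the other agent to be in view, while the internal one rewards abstaining even when no one is watching.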

Experiments

Grid-world scenarios inspired by foraging environments

two predator agents

observations: position, SeeFood, SeeOther, LastToEat, IsHungry

actions: N, S, E, W, Eat

rewards: r_eat = 1, r_hungry = -0.1

agents become hungry after 30 timesteps

Objectives

maximize the population fitness (sum of extrinsic rewards)

optimize feature weight vector
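A small sketch of how population fitness could be tallied from per-agent extrinsic rewards. The hunger rule used here (hungry once more than 30 steps have passed since the last meal, counting from step 0) is one reading of the slide. The function names and the two sample agents are assumptions.

```python
HUNGER_DELAY = 30  # from the slide: agents become hungry after 30 timesteps


def extrinsic_trace(eat_steps, horizon):
    """Per-step extrinsic rewards: +1 when eating, -0.1 while hungry."""
    rewards, last_meal = [], 0
    for t in range(1, horizon + 1):
        if t in eat_steps:
            rewards.append(1.0)
            last_meal = t
        elif t - last_meal > HUNGER_DELAY:
            rewards.append(-0.1)
        else:
            rewards.append(0.0)
    return rewards


def population_fitness(traces):
    """Sum of extrinsic rewards over all agents and time steps."""
    return sum(sum(trace) for trace in traces)


# Hypothetical run: one agent eats once, the other never gets to eat.
fed = extrinsic_trace(eat_steps={10}, horizon=50)
starved = extrinsic_trace(eat_steps=set(), horizon=50)
fitness = population_fitness([fed, starved])
```

Under this tally a population in which one agent monopolizes the food can end up with negative fitness, which matches the sign of the "only extrinsic" results reported for the multi-agent scenarios.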

Single-food scenario

One food resource

Agent that eats starts closer to food resource (bottom-right)

Results

optimal weight vector

optimal fitness: 3,249.2

"only extrinsic" fitness: -19,991.3

Equal-resource scenario

Two food resources

Agent that eats starts bottom-right

Possibility of both eating

Results

optimal weight vector

optimal fitness: 18,178.8

"only extrinsic" fitness: -2,296.9

Stronger-agent scenario

One food resource

Both start bottom-right

One agent is stronger

When both try to eat, only one succeeds

Results

optimal weight vector

optimal fitness: 2,656.1

"only extrinsic" fitness: -1,164.9

Biologically-inspired learning models

Provide built-in prior knowledge

Learning framework based on RL and IMRL

Rewards based on agent-environment relationship

Results

Speed up learning

State-space reduction

Intrinsic features provide clues on important aspects

Lead to different strategies

Not directly related to, but increase fitness

Lead to “socially-aware” behaviors

Improve classical conditioning model

Support more learning paradigms

Improve multi-agent model

Inspiration from cooperation

Evolutionary Game Theory

CAT…
