The Learned Dog Class 8: Looking more closely at reinforcement...

Agenda

• Looking ahead...

• Where are we and where are we going?

• Papers

• Matching

• Lindsay’s critique

• Neural basis of reward

Looking ahead

Roadmap

• 4/7: stimulus control, discrimination, generalization, & PC/OC interactions.

• 4/14: Applications in Pet Dog Training

• 4/21: Aversive Control of Behavior (Problem Set 2 due)

• 4/28: Behavior Mod & Pharmacology

• 5/5: Applications: zoos

• 5/12: Ethological Perspective

• 5/19: Social Learning & Wrap-up

Final Paper

• Due 5/19

• 8 - 10 pages double-spaced

• Topic of your choice...

• Should be on a topic important to you.

• Should demonstrate your mastery of the material presented in the class, especially in its application to real world settings

Operant conditioning: explanations

Rule of thumb: when in doubt apply what you know about Pavlovian Conditioning (response => CS, outcome => US)

Contingency: learning the extent to which a reinforcer is contingent on performing a given response

Is P(cookie|’sit’) greater than, less than, or equal to P(cookie|’bark’)?

Contingency space

Figure: the contingency space, with one region where the response should increase in frequency and another where it should decrease in frequency

Contingency matrix

             Cookie   No Cookie
Sit          75%      25%
Background   75%      25%

In this case, sitting doesn’t seem to increase the chances of getting a cookie

Contingency matrix

             Cookie   No Cookie
Sit          75%      25%
Background   25%      75%

In this case, sitting definitely increases the chances of getting a cookie

Contingency matrix

Counts:

             Cookie   No Cookie
Sit          3        1
Background   20       60

As percentages:

             Cookie   No Cookie
Sit          75%      25%
Background   25%      75%

In this case, sitting also seems to increase the chances of getting a cookie
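
The comparison these matrices support can be sketched in a few lines of Python. This is only an illustration: the function names, and the idea of summarizing the matrix as the difference between the two conditional probabilities, are assumptions introduced here rather than anything stated on the slides.

```python
def p_cookie(cookie: int, no_cookie: int) -> float:
    """Proportion of trials in one row of the matrix that ended in a cookie."""
    total = cookie + no_cookie
    if total == 0:
        raise ValueError("no trials recorded")
    return cookie / total


def contingency(sit: tuple[int, int], background: tuple[int, int]) -> float:
    """P(cookie | sit) - P(cookie | background); each argument is (cookie, no_cookie)."""
    return p_cookie(*sit) - p_cookie(*background)


# Counts from the last matrix above: Sit 3/1, Background 20/60.
print(p_cookie(3, 1))                 # 0.75
print(p_cookie(20, 60))               # 0.25
print(contingency((3, 1), (20, 60)))  # 0.5
```

A difference near zero corresponds to the first matrix (sitting doesn't help); a clearly positive difference corresponds to the other two.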

Contingency Matrices don’t reflect time course (more recent experience vs. less recent experience)

Cookies earned, order 1:

             First 50 Repetitions   Next 50 Repetitions
Sit          50                     0
Background   0                      50

Cookies earned, order 2:

             First 50 Repetitions   Next 50 Repetitions
Sit          0                      50
Background   50                     0
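
A small Python sketch of why order matters. Pooled over all 100 repetitions the two orders look identical, but a learner that weights recent trials more heavily ends up in a very different place. The recency-weighted average and its `alpha` weight are assumptions chosen for illustration, not something specified on the slides.

```python
def pooled_rate(outcomes: list[int]) -> float:
    """Fraction of sits that earned a cookie, ignoring order."""
    return sum(outcomes) / len(outcomes)


def recency_weighted_rate(outcomes: list[int], alpha: float = 0.1) -> float:
    """Running estimate that moves a fraction alpha toward each new outcome."""
    estimate = 0.0
    for outcome in outcomes:
        estimate += alpha * (outcome - estimate)
    return estimate


# 1 = the sit earned a cookie, 0 = it did not.
rewarded_first = [1] * 50 + [0] * 50
rewarded_last = [0] * 50 + [1] * 50

print(pooled_rate(rewarded_first), pooled_rate(rewarded_last))  # 0.5 0.5
print(recency_weighted_rate(rewarded_first) < 0.01)  # True
print(recency_weighted_rate(rewarded_last) > 0.99)   # True
```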

Contingency Matrix doesn’t reflect signaling effect

             Cookie
Sit          0
Background   50

             Cookie
Sit          50
Background   0

                       Cookie
Sit                    0
Background w/o light   50

                      Cookie
Sit                   50
Background w/ light   0

Background may block acquisition of sit

Light signals no reinforcement so background doesn’t block acquisition

Are contingency matrices floating around in your dog’s brain...

• Models used to explain Pavlovian Conditioning (Rescorla-Wagner, Pearce-Hall, Gallistel, etc.) are also useful in explaining operant conditioning...

• Operant response => CS, Reinforcement => US

• These models also explain

• Effects of trial order

• Effect of signaling background reinforcement

• Accurate judgments require large numbers of trials

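Via learning, value of an action approximates value of reward

Animal mostly chooses action with highest value...

Learning equations:

$$\text{Surprise}_{\text{last trial}} = \left| \text{Actual}_{\text{last trial}} - \text{Expected}_{\text{last trial}} \right|$$

$$V_{\text{'sit'}:\text{change}} = \text{Surprise}_{\text{last trial}} \cdot V_{\text{cookie}}$$

$$V_{\text{'sit'}:\text{new}} = V_{\text{'sit'}:\text{old}} + V_{\text{'sit'}:\text{change}}$$

Choice of action:

Mostly, choose action with highest value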
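
A minimal sketch of the learn-then-choose loop in Python. Two things here are assumptions rather than the slide's own method: the update uses a signed surprise term scaled by a learning rate `alpha` (a standard delta-rule update, swapped in for the slide's equations, which use the absolute surprise), and "mostly choose the action with the highest value" is implemented as an epsilon-greedy rule. The action names, cookie probabilities, and parameter values are hypothetical.

```python
import random


def update_value(old_value: float, reward: float, alpha: float = 0.2) -> float:
    """Move the action's value toward the reward just received (delta rule)."""
    surprise = reward - old_value  # signed: positive if better than expected
    return old_value + alpha * surprise


def choose_action(values: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Mostly pick the highest-valued action; explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(sorted(values))
    return max(values, key=values.get)


def train(p_cookie: dict[str, float], trials: int = 500, seed: int = 0) -> dict[str, float]:
    """Simulate trials in which each action earns a cookie (reward 1.0) with its own probability."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in p_cookie}
    for _ in range(trials):
        action = choose_action(values, epsilon=0.1, rng=rng)
        reward = 1.0 if rng.random() < p_cookie[action] else 0.0
        values[action] = update_value(values[action], reward)
    return values


values = train({"sit": 0.75, "bark": 0.25})
print(values)
```

The `alpha` parameter is the "how much weight do I place on what just happened?" knob: small values learn slowly, large values chase the most recent trial.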

Some subtle but difficult challenges for learner...

• What exactly am I getting rewarded for?

• Nice accident that they agree with us...

• How much weight do I place on what just happened?

• Too little and you will be slow to learn or adjust to changes

• Too much and you may not see the forest for the trees

• How much weight do I place on recent experience vs. less recent experience?

• When do I stop learning?

What is learned?

What associations are made?

Diagram: stimulus (S), response (R), and reinforcer (S*), with three possible associations

• response - reinforcer: R-S*

• stimulus - reinforcer: S-S*

• stimulus - response: S-R

Response - Reinforcer Association (R-S*)

• Training phase: Lever -> pellets, Chain -> sugar water

• Devalue reinforcer

• Test phase: Put rats back in context with both a lever and a chain and see what they do.

• Answer: they stop working for the devalued reinforcer: if pellets devalued, the frequency of lever pushing goes way down (but doesn’t disappear).

• Evidence that they make an association between the action and ‘an image’ of its consequences

Stimulus-Reinforcer (S-S*) Association

Evidence for Stimulus - Reward association via Response

And evidence for Stimulus-Response association (S-R)

• In devaluation experiments, since the response doesn’t go away altogether, this is taken as evidence of a stimulus-response association...

• Stimulus: presence of lever

• Response: push it.

Operant conditioning is about learning about the ability to control important environmental events

Animals need to learn that they are in control of important environmental events

• Learned helplessness

• ‘when response and consequence are independent of one another, the organism learns that important environmental events are not subject to its control; this learning may produce a profound inability to learn in later situations in which important events are controllable’

• This has typically been demonstrated with the inability to control/escape from aversive stimuli, BUT I think you see a flavor of this when the animal never learns that the appearance of good things is contingent on its actions...

• Excessive luring

• Criteria vs. finding a reason to reward

Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.

Schedules of reinforcement

• The big idea is to characterize the nature of the relationship between a behavior (response) and an outcome...

• Can vary in terms of number of times behavior must be performed in order to achieve some outcome (Ratio schedules)

• Can vary in the length of the time interval after a previous outcome that the animal must wait before their next response will count (Interval schedules)

• Both can either be a fixed relationship (fixed schedules) or a varying relationship (variable schedules)

• How do these various relationships affect behavior?

Fixed and Variable Ratio Schedules: how many times do I have to X to get Y?

• Fixed Ratio Schedules (FR)

• Continuous reinforcement schedules (CFR): a ratio of 1, so every response is rewarded

• Ratios other than 1, e.g. FR6: need to do the behavior 6 times in order to get desired reward.

• Example: getting paid $5 for every 4 buckets of strawberries

• Time does not play a role!

Fixed and Variable Ratio Schedules: how many times do I have to X to get Y?

• Variable Ratio Schedules (VR)

• Ratio varies around a mean (average). E.g., VR6 means that on average the behavior needs to be repeated 6 times in order to get the desired reward. Sometimes only 5 responses will be required, sometimes 8, but on average 6 responses will be required.

• Example: slot machines

• Time does not play a role!
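
The two ratio schedules above can be sketched as simple response counters in Python; note that time is never consulted. The class names are hypothetical, and drawing each VR requirement uniformly between 1 and 2 x mean - 1 is just one assumed way to get a requirement that varies around the stated mean.

```python
import random


class FixedRatio:
    """FR-n: every n-th response is rewarded (n = 1 is continuous reinforcement)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.count = 0

    def respond(self) -> bool:
        self.count += 1
        if self.count == self.n:
            self.count = 0
            return True
        return False


class VariableRatio:
    """VR-n: the required number of responses varies around a mean of n."""

    def __init__(self, mean: int, seed: int = 0) -> None:
        self.mean = mean
        self.rng = random.Random(seed)
        self.count = 0
        self.required = self._draw()

    def _draw(self) -> int:
        # Uniform on 1 .. 2*mean - 1, which averages to `mean` (an assumed distribution).
        return self.rng.randint(1, 2 * self.mean - 1)

    def respond(self) -> bool:
        self.count += 1
        if self.count >= self.required:
            self.count = 0
            self.required = self._draw()
            return True
        return False


fr6 = FixedRatio(6)
print([fr6.respond() for _ in range(12)].count(True))  # 2

vr6 = VariableRatio(6)
rewards = sum(vr6.respond() for _ in range(6000))
print(rewards)  # roughly 1000 rewards in 6000 responses
```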

Fixed & variable ratios: a rule of thumb...

• What seems to matter is the number of times that a desired response has been rewarded. So, suppose 50 rewarded “sits” are required for a dog to learn to sit...

• If using a CFR schedule, it will take 50 repetitions to achieve this level

• If using an FR4 schedule, it will take 200 repetitions since only 1 out of 4 will be rewarded.

• But the same rule seems to apply for extinguishing behavior

• Behavior trained on an FR4 schedule will take 4X as many reps to extinguish as behavior trained on a CFR schedule.
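
The arithmetic behind this rule of thumb, as a tiny Python helper. The function name is hypothetical, and the rule itself is the slide's approximation, not an exact law.

```python
def repetitions_needed(rewarded_reps: int, ratio: int) -> int:
    """Total responses needed when only 1 in `ratio` responses is rewarded."""
    return rewarded_reps * ratio


print(repetitions_needed(50, 1))  # 50  (CFR)
print(repetitions_needed(50, 4))  # 200 (FR4)
```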

Fixed and Variable Interval Schedules: how long do I have to wait before my response counts?

• Interval schedules all involve a period of time after a reward has been received during which the animal’s actions are ignored. Once the interval has elapsed, then the first response is rewarded.

• Animals tend to anticipate (start responding toward end of interval)

• Fixed Interval Schedules (FI). The interval is fixed. E.g., with FI5, responses within the first 5 seconds after a reward are ignored.

• Pattern of work when papers are due every 2 weeks

Figure: timeline. Responses are ignored during the interval; once it has elapsed, the first response produces a reward.

Fixed and Variable Interval Schedules: how long do I have to wait before my response counts?

• Variable Interval Schedules (VI)

• Interval varies around a mean. E.g., VI30 means that the animal will have to wait 30 seconds on average before their response will count, but sometimes it will be 27 seconds, sometimes 32 seconds, but on average it will be 30 seconds.

• Fishing is an example of a VI schedule

• Tell me I am wrong, but interval schedules aren’t nearly as applicable to animal training as are ratio schedules...
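
For comparison with ratio schedules, an interval schedule can be sketched in Python as a clock check: a response only counts once the current interval has elapsed. The class name is hypothetical, the first interval is measured from time 0 as if a reward had just been delivered, and the uniform spread used for the variable case is an assumption.

```python
import random


class IntervalSchedule:
    """Rewards the first response made after the current interval has elapsed."""

    def __init__(self, mean_seconds: float, variable: bool = False, seed: int = 0) -> None:
        self.mean = mean_seconds
        self.variable = variable
        self.rng = random.Random(seed)
        self.available_at = self._next_interval()  # measured from time 0

    def _next_interval(self) -> float:
        if self.variable:
            # Assumed distribution: uniform between 0.5x and 1.5x the mean.
            return self.rng.uniform(0.5 * self.mean, 1.5 * self.mean)
        return self.mean

    def respond(self, now: float) -> bool:
        """Return True if a response at time `now` (seconds) is rewarded."""
        if now < self.available_at:
            return False  # still inside the interval: response ignored
        self.available_at = now + self._next_interval()
        return True


fi5 = IntervalSchedule(5)
print([fi5.respond(t) for t in (1, 4, 5, 6, 9, 10.5)])
# [False, False, True, False, False, True]
```

Responding faster during the interval earns nothing extra, which is one way to see why interval schedules support lower response rates than ratio schedules.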

Fixed schedules typically have a scalloped appearance; CFR is the exception

Note: Pattern of activity is yet another indication that animals can have a good sense of time/interval

Rules of thumb...

• Ratio schedules (FR & VR) produce higher levels of responding than do interval schedules (FI & VI)

• FR & FI produce alternating periods of inactivity followed by periods of high activity

• Changing schedule in VI & VR changes slope

• Changing schedule in FR has little effect, in FI affects period of inactivity


Matching

Matching: the set-up

• Animals have a choice of performing 2 actions, e.g., pushing the right lever or the left lever.

• Associated with each choice is a VI (variable interval) schedule, for example, the right lever might be VI10 and the left might be VI20.

• All things being equal, animals allocate their activity in direct proportion to the relative payoff of the 2 options.

• So in the case above, since the right lever pays off twice as often as the left lever, the animal will tend to push the right lever twice as much as the left lever.
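
The set-up above reduces to a proportion, sketched here in Python. The function name is hypothetical, and treating each option's payoff rate as the reciprocal of its mean interval is a simplifying assumption.

```python
def matching_allocation(interval_a: float, interval_b: float) -> tuple[float, float]:
    """Predicted share of responses on each option, given mean VI intervals in seconds."""
    rate_a = 1.0 / interval_a  # maximum payoffs per second on option A
    rate_b = 1.0 / interval_b
    total = rate_a + rate_b
    return rate_a / total, rate_b / total


right, left = matching_allocation(10, 20)  # right lever VI10, left lever VI20
print(round(right, 3), round(left, 3))  # 0.667 0.333
```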

Matching: the formula

This is common sense, even if it doesn’t look like it
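
In its standard textbook form (written out here as an assumption about what the slide displayed), the matching law says that the share of responses on option A equals the share of reinforcers earned on option A. Here $B_A, B_B$ are the numbers of responses and $R_A, R_B$ the numbers of reinforcers obtained on options A and B; these symbol names are chosen for illustration.

```latex
\frac{B_A}{B_A + B_B} = \frac{R_A}{R_A + R_B}
```

For example, with 9 reinforcers earned on A and 3 on B, the predicted share of responses on A is 9/(9+3) = 75%.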


Matching: relies on concurrent VI schedules

• Matching experiments always rely on concurrent VI schedules. Can you see why?

• What would a smart animal do if faced with 2 options, one that is FR2 and one that is FR10?

• Why doesn’t a smart animal just focus its attention on the lever with the best VI schedule?

25% = 3/(9+3)

75% = 9/(9+3)

Matching accounts for other factors...

Nature is smart about these kinds of things

M_A, M_B -> magnitude of reinforcement

D_A, D_B -> delay of reinforcement

T_A, T_B -> reinforcement time (time allowed to eat)


Matching Law and immediate gratification vs. delayed gratification...

• Animals behave as if they are using the matching law to weigh the option of a sooner but smaller pay-off vs. a later but larger pay-off.

• A smaller but immediate reward is almost always preferred to a larger reward some time in the future. BUT...

• Given the choice between 2 future rewards, one sooner but smaller and the other later but larger, animals may choose to defer gratification depending on the relative delays and differences in magnitude of reward.
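
The switch from immediate to deferred gratification can be illustrated with a short Python sketch. It assumes, for illustration only, that an option's value is its magnitude divided by its delay (one simple way to combine the magnitude and delay terms); the function names and the pellet and delay numbers are hypothetical.

```python
def value(magnitude: float, delay: float) -> float:
    """Assumed value of an option: reward magnitude divided by its delay in seconds."""
    return magnitude / delay


def preferred(sooner: tuple[float, float], later: tuple[float, float]) -> str:
    """Each option is (magnitude, delay); returns which one has the higher assumed value."""
    return "sooner-smaller" if value(*sooner) > value(*later) else "later-larger"


# Hypothetical numbers: 2 pellets after 1 s vs. 6 pellets after 6 s.
print(preferred((2, 1), (6, 6)))    # sooner-smaller
# Push both options 10 s further into the future.
print(preferred((2, 11), (6, 16)))  # later-larger
```

Adding the same delay to both options shrinks the relative advantage of the sooner reward, so the larger one wins.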

Matching vs. maximizing...

• This is a really subtle point in which they distinguish between the process and the outcome...

• With respect to Pavlovian Conditioning, the process is best modeled via Rescorla-Wagner or Pearce-Hall, but the effect is analogous to contingency tables.

• Here the effect is matching, but the process is best thought of as choosing the better of the alternatives facing it at the moment. Over time this produces an effect that looks like matching.

Matching & Economics: demand

• When demand is ‘elastic’, the demand for a good is highly dependent on its price.

• When demand is ‘inelastic’, the demand for a good is largely independent of its price.

• With animals, cost = effort, low cost means a low FR, and high cost means a high FR.

• Matching Law only holds when the demand curves for both reinforcers are similar

Figures: two demand curves (axes: Cost and Quantity Purchased), one for Elastic Demand and one for Inelastic Demand

The matching law and income...

• Think of income as the number of responses allowed.

• When there is no constraint on the number of responses an animal can make, it is as if they have a high income, so differential demand curves are less important.

• But if the animal can only make a fixed number of responses, then it is as if they have a low income and the differences in demand curves for one reinforcer vs. another matter.

The matching law and substitutability

• Reinforcers can be...

• Substitutable (food pellets and food pellets)

• Complementary (food pellets and water)

• Neither

• The Matching Law only holds for substitutable reinforcers

Matching & Open vs. closed economy

• Open economy: in the end, you do not go hungry

• Closed economy: you get what you work for

• Animal experiments tend to be ‘open economy’ since the animals are fed outside of the test setting regardless of how well they did.

Take home message on matching...

• Another example in which nature produces a remarkably efficient response to contingencies in the world.

• Lots of examples of ‘optimal’ behavior in nature especially with respect to foraging behavior

• Choosing the locally ‘better’ alternative often leads to globally best behavior.

• Deferred gratification is hard no matter the species :-)

Lindsay on Reinforcement...

‘Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means...’ - Bertrand Russell

Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.

Reinforcement and punishment couched in the language of probability, but...

• But its observed effect is after the fact, or post-hoc.

• Defined in terms of its effect on behavior in the future.

• As such, strictly speaking, you cannot say whether something is reinforcing or punishing at the time it occurs, since it is only in the future, looking back, that you can say, yes, the frequency of the behavior increased or decreased...

• What happens when the animal is already performing at high levels, so there is no measurable improvement? Can you say it is reinforcing?

• What does it mean when behavior is highly variable, as in shaping?

An alternative view on reinforcement

• ‘the goal of purposive behavior is to predict and control outcomes. Locating food when hungry and finding a successful route of escape when threatened are behaviors that are both strongly reinforced in the same general way.’

• ‘Essentially reinforcement occurs when an animal successfully controls any event in such a way that the animal’s self-interest are served (survival) and its well-being enhanced.’


How does punishment fit into this?

• ‘... punishment is defined as occurring whenever a behavior fails to anticipate and control a significant event adequately. Punishment is not something done to a behavior or to an animal but rather something that the behavior itself does or fails to do — that is, it fails to appropriate an important resource or escape or avoid an aversive or dangerous situation.’

• ‘Punishment resulting from a failure to predict a reinforcing event results in fear/anxiety, whereas a failure to control the occurrence of a reinforcing event results in frustration’


Control and prediction

• ‘Successful control depends on adequate prediction, and adequate prediction depends on successful control.’

• In other words it’s all about control & prediction...

• Control: if I want to attain a given outcome, do I have one or more reliable strategies for achieving that outcome. That is, if I perform the strategy, my expectation is that I will achieve the desired outcome.

• Prediction: can I predict the imminent/future occurrence of biologically significant events so as to be in a good position, now, to take advantage of them, or to avoid their occurrence.

• Note, prediction is only useful if there are reliable strategies to control outcomes based on those predictions.

Learning as the process of forming and refining expectations

• Learning is the process of forming and refining expectations in light of what actually happens...

• Was the actual outcome, better than expected, worse than expected, or exactly what was expected?

• Learning occurs most rapidly when the mismatch between expectation and reality is greatest, all things being equal.

When outcomes don’t match expectations

• Attractive Outcome

• Better than expected: surprise (R)

• Worse than expected: disappointment (P)

• Aversive Outcome

• Better than expected: relief (R)

• Worse than expected: startle (P)

Reinforcement and punishment, once again...

• ‘Reinforcement occurs when an instrumental effort succeeds in achieving more control over some attractive or aversive event than predicted by the operative expectancy,’

• ‘whereas punishment occurs when an instrumental effort achieves less control over some attractive or aversive event than predicted by the operative expectancy.’


Recasting classical and operant conditioning


Some examples...

• What is the dog controlling, what are the expectations and when are they confirmed or violated? When is the most learning occurring?

• Shaping a sit: reward the dog every time it sits

• Once the dog is sitting reliably...

• add a cue just as they are about ready to sit, and reward.

• stop rewarding spontaneous sits, and/or lower rate of reward
