The Learned Dog
Class 8: Looking more closely at reinforcement...
Agenda
Looking ahead... Where are we and where are we going?
Papers
Matching
Lindsay's critique
Neural basis of reward
Roadmap
4/7: Stimulus control, discrimination, generalization, & PC/OC interactions
4/14: Applications in Pet Dog Training
4/21: Aversive Control of Behavior (Problem Set 2 due)
4/28: Behavior Mod & Pharmacology
5/5: Applications: Zoos
5/12: Ethological Perspective
5/19: Social Learning & Wrap-up
Final Paper
Due 5/19; 8-10 pages, double-spaced.
Topic of your choice; it should be on a topic important to you.
It should demonstrate your mastery of the material presented in the class, especially in its application to real-world settings.
Operant conditioning: explanations
Rule of thumb: when in doubt, apply what you know about Pavlovian Conditioning (response => CS, outcome => US).
Contingency: learning the extent to which a reinforcer is contingent on performing a given response
Is P(cookie|sit) greater than, less than, or equal to P(cookie|bark)?
[Figure: contingency space. One region: response should increase in frequency. Other region: response should decrease in frequency.]
Contingency matrix
In this case, sitting doesn't seem to increase the chances of getting a cookie.
Contingency matrix
In this case, sitting definitely increases the chances of getting a cookie.
Contingency matrix
In this case, sitting also seems to increase the chances of getting a cookie.
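One way to put a number on a contingency matrix is the contingency ΔP = P(cookie|sit) - P(cookie|other): positive when sitting raises the chance of a cookie, negative when it lowers it, and zero when it makes no difference. Below is a minimal Python sketch; the cell counts are hypothetical, and lumping every non-sit behavior into one "other" row is an assumption.

```python
def delta_p(sit_cookie, sit_no_cookie, other_cookie, other_no_cookie):
    """Contingency Delta-P from a 2x2 matrix of trial counts.

    Rows: the dog sat vs. did something else (e.g., barked).
    Columns: a cookie followed vs. no cookie followed.
    """
    p_cookie_given_sit = sit_cookie / (sit_cookie + sit_no_cookie)
    p_cookie_given_other = other_cookie / (other_cookie + other_no_cookie)
    return p_cookie_given_sit - p_cookie_given_other


def predicted_change(dp):
    """Direction the response should move in, per the contingency-space rule."""
    if dp > 0:
        return "increase"
    if dp < 0:
        return "decrease"
    return "no change"


# Hypothetical counts: (sit & cookie, sit & no cookie, other & cookie, other & no cookie)
examples = {
    "sitting makes no difference": (10, 10, 10, 10),
    "sitting always pays, nothing else does": (20, 0, 0, 20),
    "sitting usually pays, other behavior rarely": (15, 5, 5, 15),
}
for label, counts in examples.items():
    dp = delta_p(*counts)
    print(f"{label}: dP = {dp:+.2f} -> {predicted_change(dp)}")
```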
Contingency matrices don't reflect time course (more recent experience vs. less recent experience)
[Tables: first 50 repetitions; next 50 repetitions]
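To see why, split a hypothetical history into two blocks of 50 repetitions and then pool them. In this made-up example, sitting always pays for the first 50 repetitions and never pays for the next 50, while other behavior does the opposite. The pooled matrix reports no contingency at all, even though the most recent experience says sitting no longer pays. A minimal Python sketch:

```python
def delta_p(counts):
    """Delta-P from (sit&cookie, sit&no-cookie, other&cookie, other&no-cookie) counts."""
    sc, sn, oc, on = counts
    return sc / (sc + sn) - oc / (oc + on)


def pool(*blocks):
    """Add matrices cell by cell, discarding when each trial happened."""
    return tuple(sum(cells) for cells in zip(*blocks))


first_50 = (25, 0, 0, 25)  # hypothetical: every sit paid, nothing else did
next_50 = (0, 25, 25, 0)   # hypothetical: sits stopped paying, other behavior paid
pooled = pool(first_50, next_50)

print("first 50:", delta_p(first_50))  # 1.0
print("next 50:", delta_p(next_50))    # -1.0
print("pooled:", pooled, delta_p(pooled))
```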
Contingency matrices don't reflect the signaling effect
Without a signal, the background may block acquisition of sit.
When a light signals no reinforcement, the background doesn't block acquisition.
Cookie: Sit 0, Background w/o light 50
Cookie: Sit 50, Background w/ light 0
Are contingency matrices floating around in your dog's brain?
Models used to explain Pavlovian Conditioning (Rescorla-Wagner, Pearce-Hall, Gallistel, etc.) are also useful in explaining operant conditioning:
Operant response => CS, Reinforcement => US
These models also explain:
Effects of trial order
Effect of signaling background reinforcement
Why accurate judgments require large numbers of trials
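Via learning, the value of an action approximates the value of its reward.
The animal mostly chooses the action with the highest value...
Learning equations:
$\text{Surprise}_{\text{last trial}} = \left| \text{Actual}_{\text{last trial}} - \text{Expected}_{\text{last trial}} \right|$
$V_{\text{sit:change}} = \text{Surprise}_{\text{last trial}} \times V_{\text{cookie}}$
$V_{\text{sit:new}} = V_{\text{sit:old}} + V_{\text{sit:change}}$
Choice of action: mostly, choose the action with the highest value.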
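The learning and choice rules can be sketched in Python. This version makes two assumptions that depart from the slide's equations: it uses a signed prediction error (actual minus expected, as in Rescorla-Wagner) scaled by a learning rate alpha, so that unrewarded trials can lower a value, and it implements "mostly choose the highest value" as an epsilon-greedy rule. The reward probabilities, alpha = 0.2, and epsilon = 0.1 are hypothetical.

```python
import random

ACTIONS = ["sit", "bark"]
# Hypothetical contingencies: sitting always earns a cookie, barking never does.
P_COOKIE = {"sit": 1.0, "bark": 0.0}


def choose(values, rng, epsilon=0.1):
    """Mostly pick the highest-valued action; occasionally try a random one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: values[a])  # ties go to the first action


def train(n_trials=200, alpha=0.2, cookie_value=1.0, seed=0):
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}
    for _ in range(n_trials):
        action = choose(values, rng)
        actual = cookie_value if rng.random() < P_COOKIE[action] else 0.0
        surprise = actual - values[action]  # signed prediction error (assumption)
        values[action] += alpha * surprise  # move value part-way toward the outcome
    return values


print(train())
```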
Some subtle but difficult challenges for the learner...
What exactly am I getting rewarded for? (It's a nice accident when the answer agrees with ours...)
How much weight do I place on what just happened?
Too little, and you will be slow to learn or to adjust to changes.
Too much, and you may not see the forest for the trees.
How much weight do I place on recent experience vs. less recent experience?
When do I stop learning?
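The "how much weight" questions map onto the learning rate alpha in a delta-rule update, V_new = V_old + alpha * (actual - V_old). A minimal Python sketch with two hypothetical learning rates and a made-up history of 20 rewarded sits followed by 5 unrewarded ones: the fast learner (alpha = 0.9) is thrown off by a single missed cookie, while the slow learner (alpha = 0.1) barely reacts to it but also takes a long time to notice that the cookies have really stopped.

```python
def track_value(outcomes, alpha):
    """Running value estimate after each trial using a delta-rule update."""
    v = 0.0
    history = []
    for actual in outcomes:
        v += alpha * (actual - v)  # move a fraction alpha toward what just happened
        history.append(v)
    return history


# Hypothetical history: 20 rewarded sits (1.0), then the cookies stop (0.0).
outcomes = [1.0] * 20 + [0.0] * 5
slow = track_value(outcomes, alpha=0.1)
fast = track_value(outcomes, alpha=0.9)

for trial in (20, 21, 25):  # 1-based trial numbers
    print(f"trial {trial}: slow={slow[trial - 1]:.3f} fast={fast[trial - 1]:.3f}")
```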
What is learned?
What associations are made?
Response-Reinforcer Association (R-S*)
Training phase: Lever -> pellets, Chain -> sugar water
Devalue one reinforcer
Test phase: Put the rats back in the context with both a lever and a chain and see what they do.
Answer: they stop working for the devalued reinforcer. If pellets are devalued, the frequency of lever pushing goes way down (but doesn't disappear).
Evidence that they make an association between the action and an image of its consequences
Stimulus-Reinforcer (S-S*) Association
Evidence for a stimulus-reward association via the response
And evidence for a Stimulus-Response association (S-R)
In devaluation experiments, since the response doesn't go away altogether, this is taken as evidence of a stimulus-response association...
Stimulus: presence of lever
Response: push it.
Operant conditioning is about learning about the ability to control important environmental events
Animals need to learn that they are in control of important environmental events
Learned helplessness: when response and consequence are independent of one another, the organism learns that important environmental events are not subject to its control; this learning may produce a profound inability to learn in later situations in which important events are controllable.
This has typically been demonstrated with the inability to control/escape from aversive stimuli, BUT I think you see a flavor of this when the animal never learns that the appearance of good things is contingent on its actions...
Excessive luring
Criteria vs. finding a reason to reward
Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.
Schedules of reinforcement
Schedules of reinforcement
The big idea is to characterize the nature of the relationship between a behavior (response) and an outcome...
Schedules can vary in the number of times the behavior must be performed in order to achieve some outcome (ratio schedules).
They can vary in the length of the time interval after a previous outcome that the animal must wait before its next response will count (interval schedules).
Both can be either a fixed relationship (fixed schedules) or a varying relationship (variable schedules).
How do these various relationships affect behavior?
Fixed and Variable Ratio Schedules: how many times do I have to X to get Y?
Fixed Ratio Schedules (FR)
Continuous reinforcement schedules (CFR): every response is reinforced (FR1).
Ratios other than 1, e.g., FR6: the behavior must be performed 6 times to get the desired reward.
Example: getting paid $5 for every 4 buckets of strawberries
Time does not play a role!
Fixed and Variable Ratio Schedules: how many times do I have to X to get Y?
Variable Ratio Schedules (VR)
The ratio varies around a mean (average). E.g., VR6 means that on average the behavior needs to be repeated 6 times to get the desired reward. Sometimes only 5 repetitions will be required, sometimes 8, but on average 6 responses will be required.
Example: slot machines
Time does not play a role!
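A minimal Python sketch of how FR and VR requirements could be generated for a training session. Drawing each VR requirement uniformly from 1 to 2 × mean - 1 is an assumption made for simplicity; any distribution with the right average fits the definition.

```python
import random


def fr_requirements(ratio, n_rewards):
    """FR: every reward costs exactly `ratio` responses."""
    return [ratio] * n_rewards


def vr_requirements(mean, n_rewards, seed=None):
    """VR: each reward costs a random number of responses averaging `mean`.

    Requirements are drawn uniformly from 1 .. 2*mean - 1 (an assumption).
    """
    rng = random.Random(seed)
    return [rng.randint(1, 2 * mean - 1) for _ in range(n_rewards)]


print("FR6:", fr_requirements(6, 8))
print("VR6:", vr_requirements(6, 8, seed=1))
```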
Fixed & variable ratios: a rule of thumb...
What seems to matter is the number of times that a desired response has been rewarded. So, suppose 50 rewarded sits are required for a dog to learn to sit...
If using a CFR schedule, it will take 50 repetitions to achieve this level.
If using an FR4 schedule, it will take 200 repetitions, since only 1 out of 4 will be rewarded.
But the same rule seems to apply to extinguishing behavior:
Behavior trained on an FR4 schedule will take about 4 times as many repetitions to extinguish as behavior trained on a CFR schedule.
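The rule of thumb is simple arithmetic; here it is as a tiny Python sketch. The 20-repetition CFR extinction figure is hypothetical, and scaling it by the ratio simply applies the same rule of thumb to extinction.

```python
def reps_to_acquire(rewarded_reps_needed, ratio):
    """Responses needed to collect enough rewards on an FR `ratio` schedule (CFR = ratio 1)."""
    return rewarded_reps_needed * ratio


def reps_to_extinguish(cfr_extinction_reps, ratio):
    """Rule of thumb: extinction takes `ratio` times as long as after CFR training."""
    return cfr_extinction_reps * ratio


print("CFR:", reps_to_acquire(50, 1))  # 50
print("FR4:", reps_to_acquire(50, 4))  # 200
print("extinction after FR4 (hypothetical 20 on CFR):", reps_to_extinguish(20, 4))  # 80
```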
Fixed and Variable Interval Schedules: how long do I have to wait before my response counts?
Interval schedules all involve a period of time after a reward has been received during which the animal's actions are ignored. Once the interval has elapsed, the first response is rewarded. Animals tend to anticipate (they start responding toward the end of the interval).
Fixed Interval Schedules (FI): the interval is fixed. E.g., on FI5, responses within the first 5 seconds after a reward are ignored.
[Figure: pattern of work when papers are due every 2 weeks. Time axis marked "Responses ignored during this period" and "First response produces a reward during this period".]
Fixed and Variable Interval Schedules: how long do I have to wait before my response counts?
Variable Interval Schedules (VI)
The interval varies around a mean. E.g., VI30 means that the animal will have to wait 30 seconds on average before its response will count; sometimes it will be 27 seconds, sometimes 32 seconds, but on average it will be 30 seconds.
Fishing is an example of a VI schedule.
Tell me I am wrong, but interval schedules aren't nearly as applicable to animal training as ratio schedules are...
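To see why interval schedules pay for a steady trickle of responses rather than a high rate, here is a minimal Python sketch comparing FI5 with FR5 over a hypothetical 60-second session. It assumes the first interval is timed from the start of the session and that responses happen at whole-second times; a VI version would draw each interval at random around its mean.

```python
def fixed_interval_rewards(response_times, interval):
    """Count rewards on an FI schedule.

    A response is rewarded only if at least `interval` seconds have passed
    since the previous reward (the session start counts as time 0).
    """
    rewards = 0
    last_reward = 0
    for t in sorted(response_times):
        if t - last_reward >= interval:
            rewards += 1
            last_reward = t
    return rewards


def fixed_ratio_rewards(response_times, ratio):
    """Count rewards on an FR schedule: every `ratio`-th response pays."""
    return len(response_times) // ratio


session = 60  # hypothetical 60-second session
every_second = list(range(1, session + 1))
every_5_seconds = list(range(5, session + 1, 5))

for label, times in [("every 1 s", every_second), ("every 5 s", every_5_seconds)]:
    print(label, "FI5:", fixed_interval_rewards(times, 5), "FR5:", fixed_ratio_rewards(times, 5))
```

Pressing five times as often earns nothing extra on FI5 but many more rewards on FR5, which fits the rule of thumb that ratio schedules support higher levels of responding.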
Cumulative records for fixed schedules typically show a pause after each reward followed by a run of responding (the "scalloped" appearance); CFR is the exception.
Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.
Note: the pattern of activity is yet another indication that animals can have a good sense of time/interval.
Rules of thumb...
Ratio schedules (FR & VR) produce higher levels of responding than do interval schedules (FI & VI).
FR & FI produce alternating periods of inactivity followed by periods of high activity.
Changing the schedule value in VI & VR changes the slope (response rate).
Changing the schedule value in FR has little effect; in FI it affects the period of inactivity.
Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.
Matching: the set-up
Animals have a choice of performing 2 actions, e.g., pushing the right lever or the left lever.
Associated with each choice is a VI (variable interval) schedule; for example, the right lever might be VI10 and the left might be VI20.
All else being equal, animals allocate their activity in direct proportion to the relative payoff of the 2 options.
So in the case above, since the right lever pays off twice as often as the left lever, the animal will tend to push the right lever twice as often as the left lever.
Matching: the formula
This is common sense, even if it doesn't look like it.
Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.
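For two options A and B, the matching law is usually written as below, with B_A and B_B the rates of responding on each option and R_A and R_B the rates of reinforcement each earns. Applied to the set-up's example, a VI10 lever can pay at most about 6 times a minute and a VI20 lever about 3 times a minute:

```latex
\frac{B_A}{B_A + B_B} = \frac{R_A}{R_A + R_B}
\qquad\Longrightarrow\qquad
\frac{B_{\text{right}}}{B_{\text{right}} + B_{\text{left}}} = \frac{6}{6 + 3} = \frac{2}{3}
```

Equivalently, B_A / B_B = R_A / R_B: twice the payoff, twice the pressing.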
Matching: relies on concurrent VI schedules
Matching experiments always rely on concurrent VI schedules. Can you see why?
What would a smart animal do if faced with 2 options, one that is FR2 and one that is FR10?
Why doesn't a smart animal just focus its attention on the lever with the best VI schedule?
[Figure: 25% = 3/(9+3); 75% = 9/(9+3)]
Matching accounts for other factors...
Nature is smart about these kinds of things.
M_A, M_B -> magnitude of reinforcement
D_A, D_B -> delay of reinforcement
T_A, T_B -> reinforcement time (time allowed to eat)
Schwartz, B., E. A. Wasserman, et al. (2002). Psychology of Learning and Behavior. New York, NY, W. W. Norton & Company, Inc.
Matching Law and immediate vs. delayed gratification...
Animals behave as if they are using the matching law to weigh the option of a sooner but smaller payoff vs. a later but larger payoff.
A smaller but immediate reward is almost always preferred to a larger reward some time in the future. BUT...
Given the choice between 2 future rewards, one sooner but smaller and the other later but larger, animals may choose to defer gratification, depending on the relative delays and the difference in reward magnitude.
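A minimal Python sketch of how this preference reversal can fall out of a matching-style valuation. It assumes each option's value is its magnitude divided by its delay (the M and D terms alone, with reinforcement rates held equal), treats an "immediate" reward as a 1-second delay to avoid dividing by zero, and uses hypothetical magnitudes and delays.

```python
def value(magnitude, delay_s):
    """Matching-style value: bigger is better, later is worse (assumed form M / D)."""
    return magnitude / delay_s


def preferred(small, large):
    """Return which of two (magnitude, delay) options has the higher value."""
    return "smaller-sooner" if value(*small) > value(*large) else "larger-later"


small_now = (2, 1)    # hypothetical: 2 pellets after 1 s
large_later = (5, 5)  # hypothetical: 5 pellets after 5 s
print("choice now:", preferred(small_now, large_later))

# Push both options 10 s into the future; the gap between them is unchanged.
small_future = (2, 1 + 10)
large_future = (5, 5 + 10)
print("choice in advance:", preferred(small_future, large_future))
```

With both rewards close at hand, 2/1 beats 5/5; pushed 10 s out, 2/11 loses to 5/15, so the same animal now waits for the larger reward.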
Matching vs. maximizing...
This is a really subtle point, in which they distinguish between the process and the outcome...
With respect to Pavlovian Conditioning, the process is best modeled via Rescorla-Wagner or Pearce-Hall, but the effect is analogous to contingency tables.
Here the effect is matching, but the process is best thought of as choosing the better of the alternatives facing the animal at the moment. Over time this produces an effect that looks like matching.
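A minimal, deterministic Python sketch of the "process" side: momentary maximizing on concurrent VI10 VI20 schedules. Assumptions: each VI schedule arms with a constant probability per second equal to 1 / mean interval, the animal makes exactly one response per second, and each second it picks whichever lever is currently more likely to have a reward waiting, given how long it has been since it last pressed there. No matching rule is built in, yet the long-run share of presses lands on the matching-law value.

```python
def momentary_maximizing(p_a, p_b, n_steps=3000):
    """Share of presses on lever A when always choosing the locally better lever.

    p_a, p_b: assumed per-second arming probabilities (1 / mean VI interval).
    """
    since_a = since_b = 0  # seconds since the last press on each lever
    presses_a = presses_b = 0
    for _ in range(n_steps):
        since_a += 1
        since_b += 1
        # Probability that a reward has been armed since the last press on each lever.
        armed_a = 1 - (1 - p_a) ** since_a
        armed_b = 1 - (1 - p_b) ** since_b
        if armed_a >= armed_b:
            presses_a += 1
            since_a = 0
        else:
            presses_b += 1
            since_b = 0
    return presses_a / (presses_a + presses_b)


p_right, p_left = 1 / 10, 1 / 20  # VI10 vs. VI20
share = momentary_maximizing(p_right, p_left)
print(f"share of presses on VI10 lever: {share:.3f}")
print(f"matching-law prediction:        {p_right / (p_right + p_left):.3f}")
```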
Matching & Economics: demand
When demand is elastic, the demand for a good is highly dependent on its price.
When demand is inelastic, the demand for a good is largely independent of its price.
With animals, cost = effort: low cost means a low FR, and high cost means a high FR.
The Matching Law only holds when the demand curves for both reinforcers are similar.
[Figures: Elastic Demand and Inelastic Demand, each plotting Quantity Purchased against Cost]
The matching law and income...
Think of income as the number of responses allowed.
When there is no constraint on the number of responses an animal can make, it is as if it has a high income, so differences in demand curves are less important.
But if the animal can only make a fixed number of responses, it is as if it has a low income, and the differences in demand curves for one reinforcer vs. another matter.
The matching law and substitutability
Reinforcers can be...
Substitutable (food pellets and food pellets)
Complementary (food pellets and water)
Neither
The Matching Law only holds for substitutable reinforcers.
Matching & open vs. closed economy
Open economy: in the end, you do not go hungry.
Closed economy: you get what you work for.
Animal experiments tend to be open economies, since the animals are fed outside of the test setting regardless of how well they did.
Take home message on matching...
Another example in which nature produces a remarkably efficient response to contingencies in the world.
Lots of examples of optimal behavior in nature, especially with respect to foraging behavior.
Choosing the locally better alternative often leads to globally best behavior.
Deferred gratification is hard no matter the species :-)
Lindsay on Reinforcement...
"Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means..." - Bertrand Russell
Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.
Reinforcement and punishment are couched in the language of probability, but...
Their observed effect is after the fact, or post hoc: they are defined in terms of their effect on behavior in the future.
As such, strictly speaking, you cannot say whether something is reinforcing or punishing at the time it occurs, since it is only in the future, looking back, that you can say yes, the frequency of the behavior increased or decreased...
What happens when the animal is already performing at high levels, so there is no measurable improvement? Can you say it is reinforcing?
What does it mean when behavior is highly variable, as in shaping...
An alternative view on reinforcement
"the goal of purposive behavior is to predict and control outcomes. Locating food when hungry and finding a successful route of escape when threatened are behaviors that are both strongly reinforced in the same general way. Essentially reinforcement occurs when an animal successfully controls any event in such a way that the animal's self-interests are served (survival) and its well-being enhanced."
Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.
How does punishment fit into this?
"...punishment is defined as occurring whenever a behavior fails to anticipate and control a significant event adequately. Punishment is not something done to a behavior or to an animal but rather something that the behavior itself does or fails to do; that is, it fails to appropriate an important resource or escape or avoid an aversive or dangerous situation."
Punishment resulting from a failure to predict a reinforcing event results in fear/anxiety, whereas a failure to control the occurrence of a reinforcing event results in frustration.
Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.
Control and prediction
Successful control depends on adequate prediction, and adequate prediction depends on successful control.
In other words, it's all about control & prediction...
Control: if I want to attain a given outcome, do I have one or more reliable strategies for achieving it? That is, if I perform the strategy, do I expect to achieve the desired outcome?
Prediction: can I predict the imminent/future occurrence of biologically significant events so as to be in a good position, now, to take advantage of them or to avoid them?
Note: prediction is only useful if there are reliable strategies to control outcomes based on those predictions.
Learning as the process of forming and refining expectations
Learning is the process of forming and refining expectations in light of what actually happens...
Was the actual outcome better than expected, worse than expected, or exactly what was expected?
All things being equal, learning occurs most rapidly when the mismatch between expectation and reality is greatest.
When outcomes don't match expectations
Attractive outcome: better than expected = surprise (R); worse than expected = disappointment (P)
Aversive outcome: better than expected = relief (R); worse than expected = startle (P)
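The four cells can be sketched as a small Python lookup. The inputs are assumptions: an outcome is described by a valence ("attractive" or "aversive") and a magnitude, and "better than expected" means more of an attractive event or less of an aversive one. The size of the mismatch is returned as well, since learning is fastest when it is largest.

```python
def appraise(valence, expected, actual):
    """Classify an outcome against expectation.

    Returns (label, effect, mismatch), where effect is "R" (reinforcement),
    "P" (punishment), or None when the outcome matched the expectation.
    """
    mismatch = abs(actual - expected)
    if mismatch == 0:
        return ("as expected", None, 0)
    if valence == "attractive":
        better = actual > expected  # more of a good thing
        return ("surprise", "R", mismatch) if better else ("disappointment", "P", mismatch)
    if valence == "aversive":
        better = actual < expected  # less of a bad thing
        return ("relief", "R", mismatch) if better else ("startle", "P", mismatch)
    raise ValueError("valence must be 'attractive' or 'aversive'")


print(appraise("attractive", expected=1, actual=3))  # ('surprise', 'R', 2)
print(appraise("aversive", expected=2, actual=5))    # ('startle', 'P', 3)
```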
Reinforcement and punishment, once again...
"Reinforcement occurs when an instrumental effort succeeds in achieving more control over some attractive or aversive event than predicted by the operative expectancy, whereas punishment occurs when an instrumental effort achieves less control over some attractive or aversive event than predicted by the operative expectancy."
Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.
Recasting classical and operant conditioning
Lindsay, S. R. (2000). Applied Dog Behavior and Training. Ames, IA, Iowa State University Press.
Some examples...
What is the dog controlling? What are the expectations, and when are they confirmed or violated? When is the most learning occurring?
Shaping a sit: reward the dog every time it sits.
Once the dog is sitting reliably:
add a cue just as it is about to sit, and reward;
stop rewarding spontaneous sits, and/or lower the rate of reward.