Hierarchically organized behavior and its neural foundations:
A reinforcement-learning perspective
Matthew M. Botvinick and Yael Niv
Princeton University
Department of Psychology and Institute for Neuroscience

Andrew C. Barto
University of Massachusetts, Amherst
Department of Computer Science
August 13, 2007
Running head: Hierarchical reinforcement learning
Word count: 8709
Corresponding author:
Matthew Botvinick
Princeton University
Department of Psychology
Green Hall
Princeton, NJ 08540
(609) 258-1280
[email protected]
Abstract
Research on human and animal behavior has long emphasized its hierarchical structure,
according to which tasks are composed of subtask sequences, which are themselves built
of simple actions. The hierarchical structure of behavior has also been of enduring
interest within neuroscience, where it has been widely considered to reflect prefrontal
cortical functions. In this paper, we reexamine behavioral hierarchy and its neural
substrates from the point of view of recent developments in computational reinforcement
learning. Specifically, we consider a set of approaches known collectively as
hierarchical reinforcement learning, which extend the reinforcement learning paradigm
by allowing the learning agent to aggregate actions into reusable subroutines or skills. A
close look at the components of hierarchical reinforcement learning suggests how they
might map onto neural structures, in particular regions within the dorsolateral and orbital
prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement
learning might provide a complement to existing psychological models of hierarchically
structured behavior. A particularly important question that hierarchical reinforcement
learning brings to the fore is that of how learning identifies new action routines that are
likely to provide useful building blocks in solving a wide range of future problems. Here
and at many other points, hierarchical reinforcement learning offers an appealing
framework for investigating the computational and neural underpinnings of hierarchically
structured behavior.
In recent years, it has become increasingly common within both psychology and
neuroscience to explore the applicability of ideas from machine learning. Indeed, one
can now cite numerous instances where this strategy has been fruitful. Arguably,
however, no area of machine learning has had as profound and sustained an impact on
psychology and neuroscience as that of computational reinforcement learning (RL). The
impact of RL was initially felt in research on classical and instrumental conditioning
where t_tot is the number of time-steps elapsed since the relevant option was selected (one
for primitive actions); s_{t_init} is the state in which the option was selected; o_ctrl is the option
whose policy selected the option that is now terminating (or the root value function if the
terminating option was selected by the root policy); and r_cum is the cumulative discounted
reward for the duration of the option:

Eq. 7    r_cum = Σ_{i=1}^{t_tot} γ^{i−1} r_{t_init+i}

Note that r_{t_init+i} incorporated pseudo-reward only if s_{t_init+i} was a subgoal state for o_ctrl.
Thus, pseudo-reward was used to compute prediction errors ‘within’ an option, i.e., when
updating the option’s policy, but not ‘outside’ the option, at the next level up. It should
also be remarked that, at the termination of non-primitive options, two TD prediction
errors were computed, one for the last primitive action selected under the option and one
for the option itself (see Figure 3).
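For concreteness, the computation in Eq. 7, including the restriction of pseudo-reward to subgoal states of the controlling option, can be sketched in Python. All names here (rewards, states, subgoals, pseudo_reward) are illustrative and are not taken from the original implementation:

```python
def cumulative_reward(rewards, states, gamma, subgoals, pseudo_reward=1.0):
    """Eq. 7 sketch: r_cum = sum_{i=1}^{t_tot} gamma^(i-1) * r_{t_init+i}.

    rewards[i-1] and states[i-1] correspond to step t_init + i. Pseudo-reward
    is added only when the visited state is a subgoal of the controlling
    option o_ctrl (represented here by the `subgoals` set), so it affects
    learning 'within' the option but not at the next level up.
    """
    r_cum = 0.0
    for i, (r, s) in enumerate(zip(rewards, states), start=1):
        if s in subgoals:  # gate pseudo-reward by subgoal membership
            r = r + pseudo_reward
        r_cum += gamma ** (i - 1) * r
    return r_cum
```

For example, with rewards [0, 0, 1] over three steps, γ = 0.9, and the final state a subgoal, the return is 0.9² × (1 + 1) = 1.62.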
Following calculation of each δ, value functions and option strengths were updated:
Eq. 8    V_{o_ctrl}(s_{t_init}) ← V_{o_ctrl}(s_{t_init}) + α_C δ

Eq. 9    W_{o_ctrl}(s_{t_init}, o) ← W_{o_ctrl}(s_{t_init}, o) + α_A δ
The time index was then incremented and a new option/action selected, with the entire
cycle continuing until the top-level goal was reached.
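The update step in this cycle can be rendered as a minimal tabular sketch of Eqs. 8 and 9. The container names (V, W) and defaults are our own choices, and δ is taken as given, since its computation is defined earlier in the Appendix:

```python
from collections import defaultdict

# Minimal tabular sketch of Eqs. 8 and 9 (names are illustrative):
# V holds critic values keyed by (controlling option, initiation state);
# W holds actor strengths keyed by (controlling option, initiation state,
# selected option/action).
V = defaultdict(float)
W = defaultdict(float)

def apply_updates(o_ctrl, s_init, o, delta, alpha_C=0.1, alpha_A=0.01):
    """Eq. 8: V <- V + alpha_C * delta;  Eq. 9: W <- W + alpha_A * delta."""
    V[(o_ctrl, s_init)] += alpha_C * delta       # critic update (Eq. 8)
    W[(o_ctrl, s_init, o)] += alpha_A * delta    # actor update (Eq. 9)
```

Per the text, at the termination of a non-primitive option this update would be applied once for each of the two prediction errors, before the time index is incremented and the next option or action is selected.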
In our simulations, the model was first pre-trained for a total of 50000 time-steps without
termination or reward delivery at G. This allowed option-specific action strengths and
values to develop, but did not lead to any change in strengths or values at the root level.
Thus, action selection at the top level was random during this phase of training. To obtain
the data displayed in Figure 4C, training with pseudo-reward only was conducted, for
clarity of illustration, with small learning rates (α_A = 0.01, α_C = 0.1), reinitializing
to a random state whenever the relevant option reached its subgoal.
Notes
1. An alternative term for temporal abstraction is thus policy abstraction.
2. Some versions of HRL allow for options to be interrupted at points where another
option or action is associated with a higher expected value. See, e.g., Sutton et al.,
(1999).
3. For other work translating HRL into an actor-critic format, see Bhatnagara and
Panigrahi (2006).
4. It is often assumed that the utility attached to rewards decreases with the length of
time it takes to obtain them, and in such cases the objective is to maximize the
discounted long-term reward. As reflected in the Appendix, our implementation
assumes such discounting. For simplicity, however, discounting is ignored in the
main text.
5. The termination function may be probabilistic.
6. As discussed by Sutton et al. (1999), it is possible to update the value function based
only on comparisons between states and their immediate successors. However, the
relevant procedures, when combined with those involved in learning option-specific
policies (as described later), require complicated bookkeeping and control operations
for which neural correlates seem less plausible.
7. If it is assumed that option policies can call other options, then the actor must also
keep track of the entire set of active options and their calling relations.
8. Mean solution time over the last 10 episodes (of 500 total, averaged over 100
simulation runs) was 11.79 with the doorway options (passageway state visited on 0% of
episodes), compared with 9.73 with primitive actions only (passageway visited on 79%
of episodes). Note that, given a certain set of assumptions, convergence on the optimal
(shortest-path) policy can be guaranteed in RL algorithms, including those involved in
HRL. However, this holds strictly only under boundary conditions involving extremely
slow learning, due to an extremely slow transition from exploration to exploitation.
Away from these extreme conditions, there is a marked tendency for HRL systems to
"satisfice," as illustrated in the passageway simulation.
9. These studies, directed at facilitating the learning of environmental models, are also
relevant to learning of option hierarchies.
10. For different approaches to the mapping between HRL and neuroanatomy, see De
Pisapia (2003) and Zhou and Coggins (2002; 2004).
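The interruption scheme mentioned in note 2 can be illustrated with a small sketch. The function and the Q table here are hypothetical stand-ins; the actual mechanism is defined by Sutton et al. (1999):

```python
# Hypothetical sketch of option interruption (note 2): the executing option
# is abandoned whenever some other option or action carries a higher expected
# value in the current state. Q maps state -> {option: expected value}.

def maybe_interrupt(current_option, state, Q):
    """Return the option that should be executing in `state`."""
    best = max(Q[state], key=Q[state].get)
    if Q[state][best] > Q[state][current_option]:
        return best            # interrupt: switch to the better option
    return current_option      # otherwise keep executing
```

Under this scheme an option runs to its termination condition only if no competing option overtakes it in value along the way.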
Author Note
The present work was completed with support from the National Institute of Mental
Health, grant number P50 MH062196 (M.M.B.), and from the National Science
Foundation, grant number CCF-0432143 (A.C.B.). Any opinions, findings and
conclusions or recommendations expressed in this material are those of the authors and
do not necessarily reflect the views of the funding agencies. The authors thank Carlos
Brody, Jonathan Cohen, Scott Kuindersma, Ken Norman, Randy O’Reilly, Geoff
Schoenbaum, Asvin Shah, Ozgur Simsek, Andrew Stout, Chris Vigorito, and Pippin
Wolfe for useful comments on the work reported.
References
Agre, P. E. (1988). The dynamic structure of everyday life (Tech. Rep. No. 1085). Cambridge, MA: Massachusetts Institute of Technology, Artificial Intelligence Laboratory.

Aldridge, J. W., Berridge, K. C., & Rosen, A. R. (2004). Basal ganglia neural mechanisms of natural movement sequences. Canadian Journal of Physiology and Pharmacology, 82, 732-739.

Aldridge, W. J., & Berridge, K. C. (1998). Coding of serial order by neostriatal neurons: a "natural action" approach to movement sequence. Journal of Neuroscience, 18, 2777-2787.

Alexander, G. E., Crutcher, M. D., & DeLong, M. R. (1990). Basal ganglia-thalamocortical circuits: parallel substrates for motor, oculomotor, "prefrontal" and "limbic" functions. Progress in Brain Research, 85, 119-146.

Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9, 357-381.

Allport, A., & Wylie, G. (2000). Task-switching, stimulus-response bindings and negative priming. In S. Monsell & J. Driver (Eds.), Control of Cognitive Processes: Attention and Performance XVIII. Cambridge, MA: MIT Press.

Anderson, J. R. (2004). An integrated theory of mind. Psychological Review, 111, 1036-1060.

Ansuini, C., Santello, M., Massaccesi, S., & Castiello, U. (2006). Effects of end-goal on hand shaping. Journal of Neurophysiology, 95, 2456-2465.

Arbib, M. A. (1985). Schemas for the temporal organization of behaviour. Human Neurobiology, 4, 63-72.

Asaad, W. F., Rainer, G., & Miller, E. K. (2000). Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 84, 451-459.

Averbeck, B. B., & Lee, D. (2007). Prefrontal neural correlates of memory for sequences. Journal of Neuroscience, 27, 2204-2211.

Balleine, B. W., & Dickinson, A. (1998). Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 37, 407-419.

Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. Davis & D. Beiser (Eds.), Models of Information Processing in the Basal Ganglia (pp. 215-232). Cambridge, MA: MIT Press.
Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems: Theory and Applications, 13, 343-379.

Barto, A. G., Singh, S., & Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills. Proceedings of the 3rd International Conference on Development and Learning (ICDL 2004).

Barto, A. G., & Sutton, R. S. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135-170.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 834-846.

Berlyne, D. E. (1960). Conflict, Arousal and Curiosity. New York: McGraw-Hill.

Bhatnagara, S., & Panigrahi, J. R. (2006). Actor-critic algorithms for hierarchical Markov decision processes. Automatica, 42, 637-644.

Bor, D., Duncan, J., Wiseman, R. J., & Owen, A. M. (2003). Encoding strategies dissociate prefrontal activity from working memory demand. Neuron, 37, 361-367.

Botvinick, M., & Plaut, D. C. (2002). Representing task context: proposals based on a connectionist model of action. Psychological Research, 66(4), 298-311.

Botvinick, M., & Plaut, D. C. (2004). Doing without schema hierarchies: a recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111(2), 395-429.

Botvinick, M., & Plaut, D. C. (2006). Such stuff as habits are made on: A reply to Cooper and Shallice (2006). Psychological Review, under review.

Botvinick, M. M. (in press). Multilevel structure in behaviour and the brain: a model of Fuster's hierarchy. Philosophical Transactions of the Royal Society (London), Series B.

Bruner, J. (1973). Organization of early skilled action. Child Development, 44, 1-11.

Bunge, S. A. (2004). How we use rules to select actions: a review of evidence from cognitive neuroscience. Cognitive, Affective & Behavioral Neuroscience, 4, 564-579.

Bunzeck, N., & Duzel, E. (2006). Absolute coding of stimulus novelty in the human substantia nigra/VTA. Neuron, 51, 369-379.
Cohen, J. D., Braver, T. S., & O'Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: recent developments and current challenges. Philosophical Transactions of the Royal Society (London), Series B, 351, 1515-1527.

Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: a parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332-361.

Conway, C. M., & Christiansen, M. H. (2001). Sequential learning in non-human primates. Trends in Cognitive Sciences, 5, 539-546.

Cooper, R., & Shallice, T. (2000). Contention scheduling and the control of routine activities. Cognitive Neuropsychology, 17, 297-338.

Courtney, S. M., Roth, J. K., & Sala, J. B. (in press). A hierarchical biased-competition model of domain-dependent working memory maintenance and executive control. In N. Osaka, R. Logie & M. D'Esposito (Eds.), Working Memory: Behavioural and Neural Correlates. Oxford: Oxford University Press.

D'Esposito, M. (2007). From cognitive to neural models of working memory. Philosophical Transactions of the Royal Society (London), Series B, 362, 761-772.

Daw, N. D., Courville, A. C., & Touretzky, D. S. (2003). Timing and partial observability in the dopamine system. Advances in Neural Information Processing Systems 15 (pp. 99-106). Cambridge, MA: MIT Press.

Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 8, 1704-1711.

Daw, N. D., Niv, Y., & Dayan, P. (2006). Actions, policies, values and the basal ganglia. In E. Bezard (Ed.), Recent Breakthroughs in Basal Ganglia Research. New York: Nova Science Publishers.

De Pisapia, N., & Goddard, N. H. (2003). A neural model of frontostriatal interactions for behavioral planning and action chunking. Neurocomputing, 52-54, 489-495.

Dehaene, S., & Changeux, J.-P. (1997). A hierarchical neuronal network for planning behavior. Proceedings of the National Academy of Sciences, 94, 13293-13298.

Dell, G. S., Berger, L. K., & Svec, W. R. (1997). Language production and serial order. Psychological Review, 104, 123-147.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303.
Elfwing, S., Uchibe, K., & Christensen, H. I. (2007). Evolutionary development of hierarchical learning structures. IEEE Transactions on Evolutionary Computation, 11, 249-264.

Estes, W. K. (1972). An associative basis for coding and organization in memory. In A. W. Melton & E. Martin (Eds.), Coding processes in human memory (pp. 161-190). Washington, D. C.: V. H. Winston & Sons.

Fischer, K. W. (1980). A theory of cognitive development: the control and construction of hierarchies of skills. Psychological Review, 87, 477-531.

Fischer, K. W., & Connell, M. W. (2003). Two motivational systems that shape development: Epistemic and self-organizing. British Journal of Educational Psychology: Monograph Series II, 2, 103-123.

Frank, M. J., & Claus, E. D. (2006). Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113, 300-326.

Fugelsang, J. A., & Dunbar, K. N. (2005). Brain-based mechanisms underlying complex causal thinking. Neuropsychologia, 43, 1204-1213.

Fujii, N., & Graybiel, A. M. (2003). Representation of action sequence boundaries by macaque prefrontal cortical neurons. Science, 301, 1246-1249.

Fuster, J. M. (1997). The prefrontal cortex: Anatomy, physiology, and neuropsychology of the frontal lobe. Philadelphia, PA: Lippincott-Raven.

Fuster, J. M. (2001). The prefrontal cortex--an update: Time is of the essence. Neuron, 30, 319-333.

Fuster, J. M. (2004). Upper processing stages of the perception-action cycle. Trends in Cognitive Sciences, 8, 143-145.

Gergely, G., & Csibra, G. (2003). Teleological reasoning in infancy: the naive theory of rational action. Trends in Cognitive Sciences, 7, 287-292.

Gopnik, A., Glymour, C., Sobel, D., Schulz, T., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 111, 1-31.

Gopnik, A., & Schulz, L. (2004). Mechanisms of theory formation in young children. Trends in Cognitive Sciences, 8, 371-377.

Grafman, J. (2002). The human prefrontal cortex has evolved to represent components of structured event complexes. In J. Grafman (Ed.), Handbook of Neuropsychology. Amsterdam: Elsevier.
Graybiel, A. M. (1995). Building action repertoires: memory and learning functions of the basal ganglia. Current Opinion in Neurobiology, 5, 733-741.

Graybiel, A. M. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of Learning and Memory, 70, 119-136.

Greenfield, P. M. (1984). A theory of the teacher in the learning activities of everyday life. In B. Rogoff & J. Lave (Eds.), Everyday cognition: Its development in social context (pp. 117-138). Cambridge, MA: Harvard University Press.

Greenfield, P. M., Nelson, K., & Saltzman, E. (1972). The development of rulebound strategies for manipulating seriated cups: a parallel between action and grammar. Cognitive Psychology, 3, 291-310.

Greenfield, P. M., & Schneider, L. (1977). Building a tree structure: the development of hierarchical complexity and interrupted strategies in children's construction activity. Developmental Psychology, 13, 299-313.

Grossberg, S. (1986). The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E. C. Schwab & H. C. Nusbaum (Eds.), Pattern Recognition by Humans and Machines, Vol. 1: Speech Perception (pp. 187-294). New York: Academic Press.

Harlow, H. F., Harlow, M. K., & Meyer, D. R. (1950). Learning motivated by a manipulation drive. Journal of Experimental Psychology, 40, 228-234.

Haruno, M., & Kawato, M. (2006). Heterarchical reinforcement-learning model for integration of multiple cortico-striatal loops: fMRI examination in stimulus-action-reward association learning. Neural Networks, 19, 1242-1254.

Hayes-Roth, B., & Hayes-Roth, F. (1979). A cognitive model of planning. Cognitive Science, 3, 275-310.

Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. Proceedings of the International Conference on Machine Learning, 19, 243-250.

Holroyd, C. B., & Coles, M. G. H. (2002). The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4), 679-709.

Hoshi, E., Shima, K., & Tanji, J. (1998). Task-dependent selectivity of movement-related neuronal activity in the primate prefrontal cortex. Journal of Neurophysiology, 80, 3392-3397.

Houk, J. C., Adams, C. M., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk & D. G. Davis (Eds.), Models of Information Processing in the Basal Ganglia (pp. 249-270). Cambridge: MIT Press.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535-547.

Johnston, K., & Everling, S. (2006). Neural activity in monkey prefrontal cortex is modulated by task context and behavioral instruction during delayed-match-to-sample and conditional prosaccade-antisaccade tasks. Journal of Cognitive Neuroscience, 18, 749-765.

Jonsson, A., & Barto, A. (2001). Automated state abstraction for options using the U-Tree algorithm. Advances in Neural Information Processing Systems 13 (pp. 1054-1060). Cambridge, MA: MIT Press.

Jonsson, A., & Barto, A. (2005). A causal approach to hierarchical decomposition of factored MDPs. Proceedings of the International Conference on Machine Learning, 22.

Kaplan, F., & Oudeyer, P.-Y. (2004). Maximizing learning progress: an internal reward system for development. In F. Iida, R. Pfeifer & L. Steels (Eds.), Embodied Artificial Intelligence. Springer-Verlag.

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209-232.

Koechlin, E., Ody, C., & Kouneiher, F. (2003). The architecture of cognitive control in the human prefrontal cortex. Science, 302(5648), 1181-1185.

Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in Soar: the anatomy of a general learning mechanism. Machine Learning, 1, 11-46.

Landrum, E. R. (2005). Production of negative transfer in a problem-solving task. Psychological Reports, 97, 861-866.

Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior: The Hixon symposium (pp. 112-136). New York, NY: Wiley.

Lee, F. J., & Taatgen, N. A. (2003). Production compilation: a simple mechanism to model complex skill acquisition. Human Factors, 45, 61-76.

Lee, I. H., Seitz, A. R., & Assad, J. A. (2006). Activity of tonically active neurons in the monkey putamen during initiation and withholding of movement. Journal of Neurophysiology, 95, 2391-3403.

Lehman, J. F., Laird, J., & Rosenbloom, P. (1996). A gentle introduction to Soar, an architecture for human cognition. In S. Sternberg & D. Scarborough (Eds.), Invitation to Cognitive Science (Vol. 4, pp. 212-249). Cambridge, MA: MIT Press.
Li, L., & Walsh, T. J. (2006). Towards a unified theory of state abstraction for MDPs. Paper presented at the Ninth International Symposium on Artificial Intelligence and Mathematics.

Logan, G. D. (2003). Executive control of thought and action: in search of the wild homunculus. Current Directions in Psychological Science, 12, 45-48.

Luchins, A. S. (1942). Mechanization in problem solving. Psychological Monographs, 248, 1-95.

MacDonald, A. W., 3rd, Cohen, J. D., Stenger, V. A., & Carter, C. S. (2000). Dissociating the role of the dorsolateral prefrontal and anterior cingulate cortex in cognitive control. Science, 288(5472), 1835-1838.

MacKay, D. G. (1987). The organization of perception and action: a theory for language and other cognitive skills. New York: Springer-Verlag.

Mannor, S., Menache, I., Hoze, A., & Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. Proceedings of the Twenty-First International Conference on Machine Learning (pp. 560-567). ACM Press.

McGovern, A. (2002). Autonomous discovery of temporal abstractions from interaction with an environment. University of Massachusetts.

Meltzoff, A. N. (1995). Understanding the intentions of others: re-enactment of intended acts by 18-month-old children. Developmental Psychology, 31, 838-850.

Menache, I., Mannor, S., & Shimkin, N. (2002). Dynamic discovery of sub-goals in reinforcement learning. Proceedings of the Thirteenth European Conference on Machine Learning, 295-306.

Middleton, F. A., & Strick, P. L. (2002). Basal-ganglia 'projections' to the prefrontal cortex of the primate. Cerebral Cortex, 12, 926-935.

Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167-202.

Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior. New York: Holt, Rinehart & Winston.

Miyamoto, H., Morimoto, J., Doya, K., & Kawato, M. (2004). Reinforcement learning with via-point representation. Neural Networks, 17, 299-305.

Monsell, S. (2003). Task switching. Trends in Cognitive Sciences, 7, 134-140.

Monsell, S., Yeung, N., & Azuma, R. (2000). Reconfiguration of task-set: is it easier to switch to the weaker task? Psychological Research, 63, 250-264.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936-1947.

Morris, G., Arkadir, D., Nevet, A., Vaadia, E., & Bergman, H. (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133-143.

Muhammad, R., Wallis, J. D., & Miller, E. K. (2006). A comparison of abstract rules in the prefrontal cortex, premotor cortex, inferior temporal cortex, and striatum. Journal of Cognitive Neuroscience, 18, 974-989.

Nason, S., & Laird, J. E. (2005). Soar-RL: integrating reinforcement learning with Soar. Cognitive Systems Research, 6, 51-59.

Nau, D., Au, T.-C., Ilghami, O., Kuter, U., Murdock, J. W., Wu, D., & Yaman, F. (2003). SHOP2: An HTN planning system. Journal of Artificial Intelligence Research, 20, 379-404.

Newell, A., & Simon, H. A. (1963). GPS, a program that simulates human thought. In E. A. Feigenbaum & J. Feldman (Eds.), Computers and Thought (pp. 279-293). New York: McGraw-Hill.

Newtson, D. (1976). Foundations of attribution: The perception of ongoing behavior. In J. H. Harvey, W. J. Ickes & R. F. Kidd (Eds.), New Directions in Attribution Research (pp. 223-248). Hillsdale, NJ: Erlbaum.

O'Doherty, J., Critchley, H., Deichmann, R., & Dolan, R. J. (2003). Dissociating valence of outcome from behavioral control in human orbital and ventral prefrontal cortices. Journal of Neuroscience, 23, 7931-7939.

O'Doherty, J., Dayan, P., Schultz, P., Deichmann, J., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452-454.

O'Reilly, R. C., & Frank, M. J. (2005). Making working memory work: a computational model of learning in prefrontal cortex and basal ganglia. Neural Computation, 18, 283-328.

Oudeyer, P.-Y., Kaplan, F., & Hafner, V. (2007). Intrinsic motivation systems for autonomous development. IEEE Transactions on Evolutionary Computation, 11, 265-286.

Parent, A., & Hazrati, L. N. (1995). Functional anatomy of the basal ganglia. I. The cortico-basal ganglia-thalamo-cortical loop. Brain Research Reviews, 20, 91-127.

Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems, 10, 1043-1049.
Pashler, H. (1994). Dual-task interference in simple tasks: data and theory. Psychological Bulletin, 116, 220-244.

Petrides, M. (1995). Impairments on nonspatial self-ordered and externally ordered working memory tasks after lesions to the mid-dorsal part of the lateral frontal cortex in the monkey. Journal of Neuroscience, 15, 359-375.

Piaget, J. (1936/1952). The origins of intelligence in children (M. Cook, Trans.). New York: International Universities Press. (Originally published, 1936).

Pickett, M., & Barto, A. G. (2002). PolicyBlocks: An algorithm for creating useful macro-actions in reinforcement learning. In C. Sammut & A. Hoffmann (Eds.), Machine Learning: Proceedings of the Nineteenth International Conference on Machine Learning (pp. 506-513). San Francisco: Morgan Kaufmann.

Postle, B. R. (2006). Working memory as an emergent property of the mind and brain. Neuroscience, 139, 23-28.

Ravel, S., Sardo, P., Legallet, E., & Apicella, P. (2006). Influence of spatial information on responses of tonically active neurons in the monkey striatum. Journal of Neurophysiology, 95, 2975-2986.

Rayman, W. E. (1982). Negative transfer: a threat to flying safety. Aviation, Space and Environmental Medicine, 53, 1224-1226.

Reason, J. T. (1992). Human Error. Cambridge, England: Cambridge University Press.

Redgrave, P., & Gurney, K. (2006). The short-latency dopamine signal: a role in discovering novel actions? Nature Reviews Neuroscience, 7, 967-975.

Roesch, M. R., Taylor, A. R., & Schoenbaum, G. (2006). Encoding of time-discounted rewards in orbitofrontal cortex is independent of value. Neuron, 51, 509-520.

Rolls, E. T. (2004). The functions of the orbitofrontal cortex. Brain and Cognition, 55, 11-29.

Rougier, N. P., Noelle, D. C., Braver, T. S., Cohen, J. D., & O'Reilly, R. C. (2005). Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102, 7338-7343.

Ruh, N. (2007). Acquisition and control of sequential routine activities: Modelling and empirical studies. University of London.

Rumelhart, D., & Norman, D. A. (1982). Simulating a skilled typist: a study of skilled cognitive-motor performance. Cognitive Science, 6, 1-36.
Rushworth, M. F. S., Walton, M. E., Kennerley, S. W., & Bannerman, D. M. (2004). Action sets and decisions in the medial frontal cortex. Trends in Cognitive Sciences, 8, 410-417.

Ryan, R. M., & Deci, E. L. (2000). Intrinsic and extrinsic motivations: classic definitions and new directions. Contemporary Educational Psychology, 25, 54-67.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

Saffran, J. R., & Wilson, D. P. (2003). From syllables to syntax: multilevel statistical learning by 12-month-old infants. Infancy, 4, 273-284.

Salinas, E. (2004). Fast remapping of sensory stimuli onto motor actions on the basis of contextual modulation. Journal of Neuroscience, 24, 1113-1118.

Schank, R. C., & Abelson, R. P. (1977). Scripts, plans, goals and understanding. Hillsdale, NJ: Erlbaum.

Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (pp. 222-227). Cambridge: MIT Press.

Schneider, D. W., & Logan, G. D. (2006). Hierarchical control of cognitive processes: switching tasks in sequences. Journal of Experimental Psychology: General, 135, 623-640.

Schoenbaum, G., Chiba, A. A., & Gallagher, M. (1999). Neural encoding in orbitofrontal cortex and basolateral amygdala during olfactory discrimination learning. Journal of Neuroscience, 19, 1876-1884.

Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900-913.

Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.

Schultz, W., Tremblay, K. L., & Hollerman, J. R. (2000). Reward processing in primate orbitofrontal cortex and basal ganglia. Cerebral Cortex, 10, 272-283.

Shallice, T., & Burgess, P. W. (1991). Deficits in strategy application following frontal lobe damage in man. Brain, 114, 727-741.

Shima, K., Isoda, M., Mushiake, H., & Tanji, J. (2007). Categorization of behavioural sequences in the prefrontal cortex. Nature, 445, 315-318.
Shima, K., & Tanji, J. (2000). Neuronal activity in the supplementary and presupplementary motor areas for temporal organization of multiple movements. Journal of Neurophysiology, 84, 2148-2160.
Shimamura, A. P. (2000). The role of the prefrontal cortex in dynamic filtering. Psychobiology, 28, 207-218.
Simsek, O., Wolfe, A., & Barto, A. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 05).
Singh, S., Barto, A. G., & Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In L. K. Saul & Y. Weiss & L. Bottou (Eds.), Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference (pp. 1281-1288). Cambridge: MIT Press.
Sirigu, A., Zalla, T., Pillon, B., Dubois, B., Grafman, J., & Agid, Y. (1995). Selective impairments in managerial knowledge in patients with pre-frontal cortex lesions. Cortex, 31, 301-316.
Sommerville, J., & Woodward, A. L. (2005a). Pulling out the intentional structure of action: the relation between action processing and action production in infancy. Cognition, 95, 1-30.
Sommerville, J. A., & Woodward, A. L. (2005b). Infants' sensitivity to the causal features of means–end support sequences in action and perception. Infancy, 8, 119-145.
Suri, R. E., Bargas, J., & Arbib, M. A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience, 103, 65-85.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 497-537). Cambridge: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.
Tenenbaum, J. B., & Saxe, R. R. (Eds.). (2006). Bayesian models of action understanding. Cambridge, MA: MIT Press.
Thrun, S. B., & Schwartz, A. (1995). Finding structure in reinforcement learning. In G. Tesauro & D. S. Touretzky & T. Leen (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference. Cambridge, MA: MIT Press.
Wallis, J. D., Anderson, K. C., & Miller, E. K. (2001). Single neurons in prefrontal cortex encode abstract rules. Nature, 411, 953-956.
Wallis, J. D., & Miller, E. K. (2003). From rule to response: Neuronal processes in the premotor and prefrontal cortex. Journal of Neurophysiology, 90, 1790-1806.
Ward, G., & Allport, A. (1997). Planning and problem-solving using the five-disc Tower of London task. Quarterly Journal of Experimental Psychology, 50A, 59-78.
White, I. M., & Wise, S. P. (1999). Rule-dependent neuronal activity in the prefrontal cortex. Experimental Brain Research, 126, 315-335.
White, R. W. (1959). Motivation reconsidered: the concept of competence. Psychological Review, 66, 297-333.
Wickens, J., Kotter, R., & Houk, J. C. (1995). Cellular models of reinforcement. In J. L. Davis & D. G. Beiser (Eds.), Models of Information Processing in the Basal Ganglia (pp. 187-214). Cambridge: MIT Press.
Wolpert, D., & Flanagan, J. (2001). Motor prediction. Current Biology, 11, R729-R732.
Wood, J. N., & Grafman, J. (2003). Human prefrontal cortex: processing and representational perspectives. Nature Reviews Neuroscience, 4, 139-147.
Woodward, A. L., Sommerville, J. A., & Guajardo, J. J. (2001). How infants make sense of intentional action. In B. F. Malle & L. J. Moses & D. A. Baldwin (Eds.), Intentions and Intentionality: Foundations of Social Cognition. Cambridge, MA: MIT Press.
Yan, Z., & Fischer, K. (2002). Always under construction: dynamic variations in adult cognitive microdevelopment. Human Development, 45, 141-160.
Zacks, J. M., Braver, T. S., Sheridan, M. A., Donaldson, D. I., Snyder, A. Z., Ollinger, J. M., Buckner, R. L., & Raichle, M. E. (2001). Human brain activity time-locked to perceptual event boundaries. Nature Neuroscience, 4, 651-655.
Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., & Reynolds, J. R. (2007). Event perception: A mind/brain perspective. Psychological Bulletin, 133, 273-293.
Zacks, J. M., & Tversky, B. (2001). Event structure in perception and conception. Psychological Bulletin, 127, 3-21.
Zalla, T., Pradat-Diehl, P., & Sirigu, A. (2003). Perception of action boundaries in patients with frontal lobe damage. Neuropsychologia, 41, 1619-1627.
Zhou, W., & Coggins, R. (2002). Computational models of the amygdala and the orbitofrontal cortex: a hierarchical reinforcement learning system for robotic control. In R. I. McKay & J. Slaney (Eds.), Lecture Notes AI: LNAI 2557 (pp. 419-430).
Zhou, W., & Coggins, R. (2004). Biologically inspired reinforcement learning: reward-based decomposition for multi-goal environments. In A. J. Ijspeert & M. Murata & N. Wakamiya (Eds.), Biologically Inspired Approaches to Advanced Information Technology (pp. 80-94). Berlin: Springer-Verlag.
Figure Captions
1. An illustration of how options can facilitate search. (A) A search tree with arrows
indicating the pathway to a goal state. A specific sequence of seven independently
selected actions is required to reach the goal. (B) The same tree and trajectory, with
colors indicating that the first four and the last three actions have been aggregated
into options. Here, the goal state is reached after only two independent choices
(selection of the options). (C) Illustration of search using option models, which
allow the ultimate consequences of an option to be forecast without requiring
consideration of the lower-level steps that would be involved in executing the option.
2. An actor-critic implementation. (A) Schematic of the basic actor-critic architecture.