
Developmental Science 9:2 (2006), pp 125–157

© 2006 The Authors. Journal compilation © 2006 Blackwell Publishing Ltd. Published by Blackwell Publishing Ltd., 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

TARGET ARTICLE WITH COMMENTARIES AND RESPONSE

Gaze following: why (not) learn it?

Jochen Triesch,1,2 Christof Teuscher,3 Gedeon O. Deák1 and Eric Carlson1

1. Department of Cognitive Science, University of California, San Diego, USA
2. Frankfurt Institute for Advanced Studies, Johann Wolfgang Goethe University, Germany
3. Los Alamos National Laboratory, Los Alamos, USA

For commentaries on this article see Csibra (2006), Moore (2006) and Richardson and Thomas (2006).

Abstract

We propose a computational model of the emergence of gaze following skills in infant–caregiver interactions. The model is based on the idea that infants learn that monitoring their caregiver’s direction of gaze allows them to predict the locations of interesting objects or events in their environment (Moore & Corkum, 1994). Elaborating on this theory, we demonstrate that a specific Basic Set of structures and mechanisms is sufficient for gaze following to emerge. This Basic Set includes the infant’s perceptual skills and preferences, habituation and reward-driven learning, and a structured social environment featuring a caregiver who tends to look at things the infant will find interesting. We review evidence that all elements of the Basic Set are established well before the relevant gaze following skills emerge. We evaluate the model in a series of simulations and show that it can account for typical development. We also demonstrate that plausible alterations of model parameters, motivated by findings on two different developmental disorders – autism and Williams syndrome – produce delays or deficits in the emergence of gaze following. The model makes a number of testable predictions. In addition, it opens a new perspective for theorizing about cross-species differences in gaze following.

Introduction

The capacity for shared attention is a cornerstone of social intelligence. It plays a crucial role in the communication between infant and caregiver (Brazelton, Koslowski & Main, 1974; Kaye, 1982; Adamson & Bakeman, 1991; Adamson, 1995; Moore & Dunham, 1995). By 9–12 months most infants can follow adults’ gaze and pointing gestures, and monitor a caregiver’s affect and use it to modulate their own response to an ambiguous stimulus. These behaviors emerge and coalesce on a predictable schedule (e.g. Butterworth & Itakura, 2000; Deák, Flom & Pick, 2000), although specific milestones show considerable individual differences in age of attainment (Mundy & Gomes, 1998; Markus, Mundy, Morales, Delgado & Yale, 2000). Shared attention skills allow the young of our species to learn what is important in the environment, based on the patterns of attention in older, more expert individuals. In conjunction with a shared language, these skills allow children to communicate what they perceive and think about, and to construct mental representations of what others perceive and think about. Consequently, shared attention is crucial for language and communication (Bruner, 1983; Baldwin, 1993; Tomasello, 1999).

The term shared attention is typically used to denote a set of different skills comprising gaze following, pointing and requesting behaviors. While some authors use the terms joint and shared attention interchangeably to refer to the matching of one’s focus of attention with that of another person, other authors make a subtle distinction between the two. ‘Shared’ attention is sometimes reserved for the more complex form of communication, wherein two individuals attend to the same object, and each has knowledge of the other’s attention to this object (Tomasello, 1995; Emery, 2000). In this paper, we will be concerned with joint attention more broadly, which we view as an important precursor to the emergence of true shared attention. Our particular focus is on gaze following, which may be defined as looking where somebody else is looking. Gaze following is a good starting point for investigations into shared attention, because it develops early in life and is a precedent for other shared attention skills.

Address for correspondence: Jochen Triesch, Department of Cognitive Science, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0515, USA; e-mail: [email protected]

How does gaze following emerge?

Starting with a pioneering study by Scaife and Bruner (1975), the emergence of gaze following has been investigated in many studies. There has been some debate about when gaze following emerges in human infants, with most estimates ranging from 3 to 12 months (e.g. Butterworth & Cochran, 1980; D’Entremont, Hains & Muir, 1997; Hood, Willen & Driver, 1998; Morales, Mundy & Rojas, 1998). The reasons for this wide range are threefold. First, researchers have used different criteria to define gaze following (Tomasello, 1995). Second, different levels of sophistication of gaze following can be distinguished. Third, different experimental paradigms may differ in sensitivity. The earliest signs or precursors of gaze following can be observed around 3 months of age, and some very rudimentary skills are even present in newborns (Farroni, Massaccesi, Pividori & Johnson, 2004). In particular, D’Entremont et al. (1997) showed that 3-month-olds will turn their eyes in the direction of an adult’s head turn more frequently than in the opposite direction. Their observation requires rather ideal conditions, such as targets that are well within the infant’s visual field. In addition, these demonstrations of ‘gaze following’ seem to rely on more basic visual tracking mechanisms that facilitate gaze shifts in the direction of motion of a centrally located stimulus. In fact, such motion cueing may initially be necessary, but by around 9 months static head pose alone can be sufficient for gaze following (Moore, Angelopoulos & Bennett, 1997).

Beyond these first signs of gaze following, Butterworth and Jarrett (1991) proposed three different stages of gaze following emerging around 6, 12 and 18 months, respectively (but also see Deák et al., 2000). These stages are defined by infants’ new abilities, first to ignore distracting visual objects, and later to follow adults’ gaze to locations outside of their visual field.

An important line of research is concerned with the specific features that infants use to establish the adult’s direction of gaze. There is evidence that younger infants rely more on the caregiver’s head pose than the eyes, whereas between 12 and 14 months there is a significant increase in sensitivity to eye orientation (Caron, Butler & Brooks, 2002). By 18 months, gaze following is reliably produced on the basis of eye movements alone (Butterworth & Jarrett, 1991). This body of work suggests that limitations of the infant’s developing face processing skills may play an important role in their ability to follow gaze.

A rather difficult question is what gaze following skills imply about how infants at various ages conceptualize their caregivers’ looking behavior. Although early accounts interpreted gaze following skills as indicating considerable social understanding or even a theory of mind, it has been argued that young infants may learn to follow gaze without such an understanding (Moore & Corkum, 1994; Corkum & Moore, 1995). More recently, Woodward (2003) demonstrated that infants need not have an understanding of the relation between a person who looks and the object of his or her gaze. In addition, early gaze following skills may not even require a representational strategy involving the identification of the caregiver as an intentional, perceiving individual (Leekam, Hunnisett & Moore, 1998). Certainly, such representations will emerge over time in older infants, but they might not be necessary to explain the emergence of gaze following behaviors.

Gaze following in other species

Humans are not the only species that exhibit gaze following. Gaze following has been demonstrated in a number of other species, including some (but not all) non-human primates (e.g. Itakura, 1996, 2004; Emery, Lorincz, Perrett, Oram & Baker, 1997; Tomasello, Call & Hare, 1997). Chimpanzees even seem to exhibit the more advanced level of gaze following that requires ignoring a distractor object along the scan path – Butterworth’s geometric stage of gaze following (see above) (Tomasello, Hare & Agnetta, 1999). In addition, Hare, Call, Agnetta and Tomasello (2000) demonstrated that chimpanzees know what conspecifics can and cannot see. There has also been some work with non-primates. Domestic dogs, for example, are capable of following the gaze of humans at about the level of 6- to 9-month-old human infants (but are not capable of shared attention) (Hare & Tomasello, 1999; Agnetta, Hare & Tomasello, 2000). In contrast, wolves don’t seem to follow the gaze of humans (Hare, Brown, Williamson & Tomasello, 2002). Why some species are able to follow gaze while other species are not is currently unclear. Behavioral research has been cataloging cross-species differences, but little is known about the underlying reasons for them.

The role of learning

Early attempts to explain gaze following postulated the existence of innate modules. Examples of strongly nativist theories have been articulated by Leslie and Baron-Cohen (Leslie, 1987; Baron-Cohen, 1995). Such approaches have marginalized the role of learning in the development of cognitive skills. One line of critique against modular accounts is that they tend to have little predictive power, because it is typically not made explicit how the modules work internally and exactly what information is passed between them (see Deák & Triesch, in press, for detailed analysis). In principle, however, this criticism can be overcome, and recent computational and robotic modeling work has started to address this question (Scassellati, 2002).

An alternative view explains the emergence of gaze following by postulating that infants gradually discover that monitoring their caregiver’s direction of gaze allows them to predict where interesting visual events will be. This idea was first articulated by Moore and Corkum (1994; Corkum & Moore, 1995). Note that while this view highlights the role of learning processes, it does not preclude an evolved propensity to follow gaze in certain situations, which depends only minimally or not at all on early social experiences. Such mechanisms may be important in jump-starting the learning process. There is substantial evidence consistent with a learning account. In particular, Corkum and Moore (1998) (C&M) demonstrated that 8-month-old infants can be trained to follow their caregiver’s gaze in a contingent reinforcement paradigm, where an interesting visual stimulus was shown if the infant followed the adult’s gaze to the stimulus location. C&M concluded that ‘learning could be involved in the acquisition of gaze following’ (p. 37). A second experiment by C&M, however, seems somewhat inconsistent with a pure learning account. Specifically, they found it more difficult to train infants to look to the location opposite of where the adult turned. This prompts C&M to claim that ‘simple learning is not sufficient as the mechanism through which joint attention cues acquire their signal value’ (p. 28). In our view, however, C&M’s second experiment is quite difficult to interpret and the results appear still consistent with a learning account (see footnote 1).

The importance of learning is also supported by some evidence, albeit preliminary, that gaze following skills emerge gradually through social experience. Deák et al. (2000) found that 12- and 18-month-old infants’ gaze following diminished less across trials if targets were novel and distinctive, than if targets were repetitive and identical. This suggests that even in a single interaction with as few as 12 trials, infants adjust their expectations about the validity of adults’ social cues for predicting visual reward. Also, Deák et al. (Deák, Wakabayashi, Sepeta & Triesch, 2004) reported preliminary observational data showing that gaze and gesture following skills emerge somewhat gradually between 5 and 10 months of age, which is consistent with an ongoing learning process. In sum, then, there is intriguing evidence to suggest that learning models might explain how gaze following and other joint attention skills emerge in the first 18 months. However, existing models are too vague to specify the kinds of data that would help us sharpen a powerful, predictive account of how these skills emerge.

The need for computational models

Our ultimate goal is to explain how and why gaze following (in its different forms) emerges at a level that reveals the underlying mechanisms of change in the brain and their relation to changes in overt social behavior. A theory of the emergence of gaze following should account for the experimental findings obtained in behavioral experiments, be consistent with known neuroscience data, and make specific predictions that can be used to falsify it. It should offer plausible explanations for differences in populations with developmental disorders and in other species. All else being equal, it should be as simple and parsimonious as possible.

In this paper we propose an account of the emergence of gaze following and evaluate its plausibility through computational modeling. Like many others, we believe that computational models can be a great aid in theorizing about developmental phenomena. The benefits of such an approach have been adequately discussed in several places (e.g. Elman, Bates, Johnson, Karmiloff-Smith, Parisi & Plunkett, 1996; O’Reilly & Munakata, 2002). For instance, computational models can be very helpful in bridging the explanatory gap between biological mechanisms and observed behaviors. Importantly, computational approaches can be useful in analyzing the causal structure of developmental processes, that is, which changes may be necessary or sufficient for developmental events like the emergence of a new cognitive skill. These questions cannot easily be studied experimentally because (1) changes to individual neural processes are not readily observable or manipulable, and (2) there are typically many processes changing at the same time, making it very difficult to answer questions about cause and effect relations. Computational modeling may be particularly helpful in studying such relations because one can easily monitor all changes in the model, and systematically prohibit or promote certain changes in order to study how this alters the developmental trajectory.

Footnote 1. There are at least two questions about the proper interpretation of Experiment 2 in Corkum and Moore (1998). First, it is unclear to what extent the participants could already follow gaze, because the exclusion measure was not very powerful. Corkum and Moore’s interpretation rests on the assumption that the tested infants were incapable of any gaze following. Second, motion cues may have facilitated gaze shifts in the direction of the caregiver’s head turn, but Corkum and Moore’s interpretation rests on the assumption that turns in the opposite direction are equally likely a priori. This does not consider that motion cueing facilitates gaze shifts in the same direction, which is supported by current evidence (e.g. Farroni et al., 2000).

The specific approach described in the following is comparable to other modeling work in the area of cognitive development. To some extent our approach is inspired by connectionist models (Elman et al., 1996) and dynamical systems approaches to development (Thelen & Smith, 1994). We share with connectionist modelers the desire to explain behavior in terms of underlying neural structures. In contrast to classical connectionist models of development, however, our approach emphasizes aspects of the embodied nature of cognitive development (Clark, 1997; Wilson, 2002). In particular, we consider the role of the learner’s situated real-time interaction with its environment. A good understanding and careful modeling of this interaction is a central goal of our approach (see Schlesinger & Parisi, 2001, for another example of this approach). These issues have also been addressed to some extent within the dynamic systems approach (Thelen, Schöner, Scheier & Smith, 2000), but our approach emphasizes the role of biologically plausible reward-driven learning processes. It is surprising to us that reward-driven learning mechanisms such as Temporal Difference learning (see below) are rarely used in computational models of infant development. For example, connectionist style models typically utilize supervised learning (often using the backpropagation learning mechanism), which is not applicable to many developmental learning contexts. Similarly, in dynamical systems approaches, goal-directed learning is frequently not addressed either. Instead, the transition from one (younger and less capable) developmental state to the next (older and more capable) state is often modeled by changing a control parameter of the dynamical system in order to account for different performance levels. What is not addressed is what forces may drive these changing control parameters in developing infants. We feel that computational models that aim to carefully capture the affect-driven learning during situated, real-time interactions with the environment hold much promise for advancing our understanding of early cognitive development. The account that follows is an attempt to evaluate the promise of such models in the context of gaze following.

The Basic Set account of gaze following

At the heart of our account lies the idea that infants learn gaze following because they discover that monitoring their caregiver’s direction of gaze allows them to predict where interesting visual sights occur. Elaborating on this idea, we propose that gaze following (and other attention-sharing skills) emerge through the interplay of a Basic Set of structures and mechanisms. This set includes perceptual skills and preferences, reward-driven learning, habituation and a structured social environment (Fasel, Deák, Triesch & Movellan, 2002). In the following, we will briefly discuss each component of this Basic Set, and review evidence that each of these is functioning in normally developing infants before the time that the first solid gaze following skills emerge. This is crucial for establishing the viability of this set as a causal precursor for the emergence of gaze following skills. We will then describe how these components may interact to allow for the learning of gaze following.

Perceptual skills and preferences

Several perceptual skills and preferences that are in place by 3 months of age or earlier might be important for shared attention skills to develop. Even the youngest infants prefer human stimuli, especially their caregivers’ faces and voices (Brazelton et al., 1974; DeCasper & Fifer, 1980; Pascalis, de Schonen, Morton, Deruelle & Fabre-Grenet, 1995). One interpretation is that social stimuli have a higher salience than competing inanimate stimuli (Bates, 1979). Infants also generally enjoy social interaction. Around 2–3 months, infants begin responding in a more consistent and focused way to caregivers. At the same time most infants produce their first social smiles, and parents report greater engagement and ‘presence’ during interactions (Cole & Cole, 1996). Infants as young as 3 months prefer looking at the eyes of an approaching person, rather than the mouth (Haith, Bergman & Moore, 1979).

Attention-shifting skills (critical for following gaze or pointing cues) begin to mature around 3–4 months (e.g. Butcher, Kalverboer & Geuze, 2000; Farroni, Johnson, Brockbank & Simion, 2000; Johnson, Posner & Rothbart, 1994), but other, more complex perceptual skills will continue to undergo significant changes. A skill that is highly relevant to the development of gaze following and other attention-sharing skills is face processing, or more specifically, head pose and eye direction perception (i.e. discriminating the rotational angles of the face, and estimating the line of gaze). One study found that 1-month-olds prefer a photograph of their caregiver’s face in frontal to profile poses, suggesting that even young infants can discriminate extreme differences in caregivers’ head poses (Sai & Bushnell, 1998). But this finding has not been extended, so we do not know how well infants of different ages can discriminate different head poses. It appears that 8–10-month-olds use head pose, not eye direction, to estimate adults’ gaze direction (Moore et al., 1997). Robust use of the eyes seems to emerge later, with significant improvement between 12 and 14 months (Caron et al., 2002). Thus, by this age, face processing skills must be sufficiently well developed to allow for robust gaze following even in somewhat ambiguous circumstances. However, for gaze following to be successful, the ability to accurately encode the caregiver’s head pose needs to be mapped to the proper motor behaviors, which requires additional learning processes.

Reward-driven learning

Reward-driven learning, we claim, is important for learning attention-sharing. Reward-driven or reinforcement learning occurs in 2- and 3-month-olds (Kaye, 1982) and may even be present at birth (Floccia, 1997) (see footnote 2). Two-month-olds can, for example, learn within minutes to predict the location of the next interesting event in a simple repeated sequence (Haith, Hazan & Goodman, 1988). We propose that the principal learning mechanisms used for acquiring attention-sharing behaviors are neurally plausible processes of Reinforcement Learning called Temporal Difference or TD learning (Sutton, 1988; Sutton & Barto, 1998). These processes are not merely Skinnerian, nor are they anti-mentalistic, but they have the goal of formalizing the relation between an agent’s affect-laden experienced outcomes (positive or negative) and the agent’s means of adapting behavior to increase positive outcomes and decrease negative ones. TD learning in particular has been tied to specific neuromodulatory systems (Schultz, Dayan & Montague, 1997), and recent models are neurally plausible (Montague, Hyman & Cohen, 2004). In particular, the firing of dopaminergic neurons in parts of the basal ganglia has been associated with the temporal difference signal from which TD learning methods derive their name. Although TD learning has previously played almost no role in developmental models, it holds promise for understanding the development of behaviors in all contexts that involve affectively valued outcomes. Reward-driven learning, however, may not be the only learning mechanism that is important for the emergence of gaze following.
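To make the temporal difference signal mentioned above concrete, the following is a minimal sketch of a tabular TD(0) value update in the generic textbook form (Sutton & Barto, 1998). It is not necessarily the variant used in the model; the state labels, the learning rate alpha and the discount factor gamma are illustrative assumptions.

```python
def td_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the TD error.

    values:     dict mapping state labels to estimated future reward
    alpha:      learning rate (assumed value, not taken from the paper)
    gamma:      discount factor (assumed value, not taken from the paper)
    """
    # The temporal difference signal: obtained reward plus discounted
    # prediction for the next state, minus the current prediction.
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical two-state example: shifting gaze from the caregiver to an
# interesting target yields a reward of 1.0.
values = {"look_at_caregiver": 0.0, "look_at_target": 0.0}
delta = td_update(values, "look_at_caregiver", reward=1.0,
                  next_state="look_at_target")
```

A positive TD error means the outcome was better than predicted, so the value estimate of the preceding state is raised; repeated over many interactions, such updates let the learner anticipate which situations lead to reward.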

Habituation

Habituation also plays an important role in our theory as a fundamental learning process. Habituation processes have complex dynamics that are in themselves challenging to understand and to model (Sirois & Mareschal, 2002). In most previous modeling attempts, habituation was related directly to the behavioral responses of the organism, e.g. the strength or probability of a motor response to a certain stimulus. Our view is somewhat different in that we relate habituation processes to changes in the internal evaluation or reward of a stimulus. Together, habituation and reward-driven learning (see above) will produce certain behavioral sequences and modify them adaptively. For example, when an infant looks at a caregiver’s face, or at a toy held by the caregiver, habituation will systematically occur, which we interpret as a systematically declining reward value over time for looking at this object. Dishabituation, conversely, amounts to a recovery of this reward. Because TD learning predicts future rewards, habituation will facilitate attention shifts away from the current target so that a new, more rewarding target can be fixated. Dishabituation leads to a relative recovery of the reward value of an object when a different stimulus is attended. These processes, in conjunction with reward-driven learning of behavioral policies, will produce cycles of attention-shifting between interesting social objects in the visual environment, such as the caregiver, and various other objects with properties that infants find interesting. The utility of these cycles for learning to follow gaze will depend on predictable behavior patterns provided by the caregiver.
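The idea of habituation as a declining reward value, with dishabituation as its recovery, can be illustrated with a small sketch. The exponential form and the rate constants below are assumptions chosen for illustration only; the text above does not specify the functional form.

```python
def update_reward(reward, baseline, attended, decay=0.2, recovery=0.05):
    """Return the reward value of one object after one time step.

    While the object is attended, its reward decays toward zero
    (habituation). While another stimulus is attended, it recovers
    toward its baseline value (dishabituation).
    The decay and recovery rates are hypothetical.
    """
    if attended:
        return reward * (1.0 - decay)
    return reward + recovery * (baseline - reward)


def simulate(baseline=1.0, look_steps=10, away_steps=10):
    """Look at an object for look_steps, then look elsewhere for away_steps."""
    reward = baseline
    trace = []
    for _ in range(look_steps):
        reward = update_reward(reward, baseline, attended=True)
        trace.append(reward)
    for _ in range(away_steps):
        reward = update_reward(reward, baseline, attended=False)
        trace.append(reward)
    return trace


trace = simulate()
```

In this sketch the reward for continued looking falls steadily during fixation and climbs back once gaze moves elsewhere, which is the kind of dynamics that would favor cycles of attention shifting between the caregiver and other objects.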

Structured social environment

We posit that the most relevant situations for learning shared attention skills include interactions such as face-to-face play, feeding, diaper changing and bathing, which make up a high proportion of infants’ waking time. What is important about such interactions, we hypothesize, is their predictable event-contingency structure. This structure is learnable, by means of reinforcement learning and habituation, and infants can learn to maximize their positive engagement in such interactions. Studies on the statistical structure of infant–parent interactions generally show that each participant synchronizes his or her actions with the other, and selects actions based partly on the other’s recent actions, emotions and messages (Watson & Ramey, 1985). We hypothesize that infants soon start to predict where interesting objects and events will be, based on their caregivers’ gaze patterns. The caregiver’s gaze is predictive of interesting sights because caregivers will tend to look at other people or at objects they are manipulating (Land, Mennie & Rusted, 1999), and infants are interested in such stimuli.

Footnote 2. Sometimes the term contingency learning is used in the developmental literature. We use reinforcement learning because it is more common in neuroscience, cognitive science and machine learning, and because it makes explicit an assumption that is implicit in the idea of contingency learning – specifically, that the learner is motivated or affectively driven to predict, and experience, certain outcomes.



The emergence of gaze following

How can the Basic Set elements (perceptual skills and preferences, TD learning, habituation, structured social environment provided by caregivers) act in concert to allow gaze following to emerge? Our claim is that infants (or other developing organisms, or even robots) with these ‘ingredients’ will learn to anticipate the locations of interesting visual stimuli based on caregivers’ attentive behaviors, both intentional (e.g. pointing) and unintentional (e.g. reflexive looking). They will learn to parse social events into conditions and outcomes, each associated with a hedonic value. A typical social sequence that supports learning might include the following events:

1. Initially, the caregiver and infant are looking at one another, in part because the infant has a preference for looking at social stimuli (i.e. it is rewarding to do so).

2. The caregiver looks away toward an object (possibly while holding or pointing to it). This has two effects: first, it reduces the reward value of the caregiver’s face (making the infant more likely to search for other stimuli); second, it produces directional motion of the head or eyes, which can trigger a same-direction attention shift by the infant (Farroni et al., 2000). Also, the infant may start to habituate to the caregiver’s face, further biasing the infant to make a gaze shift.

3. In some of these cases, due to ‘noisy’ action selection or random exploration of different behaviors (e.g. Sutton & Barto, 1998), the infant makes a gaze shift in the same direction as the adult. This can result in the infant looking directly at the rewarding sight, or it can bring the sight into the field of view so that a subsequent eye movement can bring it to the center of gaze.

4. In these cases, the infant on average receives a relatively greater reward (in terms of interesting sights) than if he or she had selected other actions. In a ‘high-reward sequence’, infants receive information about contingencies between the caregiver’s head pose and the presence of interesting visual events in a certain location. This allows infants to learn that it is beneficial to follow caregivers’ gaze shifts by shifting their own gaze to the same regions of space.
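The ‘noisy’ action selection in step 3 can take several forms. One common choice in the reinforcement learning literature is softmax (Boltzmann) selection, sketched below. Whether the model uses this particular scheme is not stated in the text above, and the action labels and temperature values are hypothetical.

```python
import math
import random


def softmax_choice(action_values, temperature, rng):
    """Sample one action with probability proportional to exp(value / T).

    A high temperature gives nearly random exploration; a low temperature
    gives nearly greedy selection of the highest-valued action.
    """
    actions = list(action_values)
    max_value = max(action_values.values())
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((action_values[a] - max_value) / temperature)
               for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
# Hypothetical learned values for two gaze actions.
action_values = {"follow_gaze": 1.0, "look_elsewhere": 0.0}
exploring = {softmax_choice(action_values, 100.0, rng) for _ in range(200)}
exploiting = {softmax_choice(action_values, 0.01, rng) for _ in range(200)}
```

Early in learning, when all action values are similar, such a rule produces occasional gaze shifts in the caregiver's direction by chance; once those shifts are rewarded and their values rise, the same rule selects them more and more reliably.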

In summary, we propose that the Basic Set of structures and mechanisms outlined above allows infants to learn to follow gaze because they learn to exploit the caregiver’s tendency to look at things that are interesting (rewarding) for the infant. This theory is geared to explain the basic phenomenon of gaze following, i.e. how the infant learns to associate the head pose of others with gaze shifts to certain locations inside or outside of its own visual field. Ultimately, the test of this theory will be whether it can be extended to explain many of the interesting subtleties such as the ordered sequence of the development of gaze following skills, or the value of different caregiver cues (eyes, face, body posture) for joint attention, or the later development of theory-of-mind-like representations. We are optimistic that our framework provides a good starting point for this endeavor, and that we will eventually be able to account for a large range of empirical phenomena, including ‘higher’ shared attention skills. We will return to this point in the discussion.

Computational model

We now present a simple computational model to test whether the mechanisms of the Basic Set can lead to the emergence of gaze following and to explore how alterations of model parameters can simulate some developmental disorders that are characterized by delays in the emergence of gaze following (see footnote 3). The goal of this inquiry is to determine under what conditions the Basic Set is sufficient for the emergence of gaze following. We do not suggest, however, that all of the Basic Set elements are strictly necessary – some might be replaceable by alternative mechanisms. Also, we do not claim that this set is sufficient for a comprehensive account of all human attention-sharing behaviors. It merely attempts to explain the basic gaze following behaviors that progressively emerge during the first year in typically developing infants, and, hopefully, disruptions of this progression that occur in certain developmental disorders. Future work will establish whether the model can also explain, for example, point-following behaviors.

The model was implemented in Matlab. The source code is available at http://mesa.ucsd.edu

Environment and caregiver model

The simulation comprises a model of the infant (referred to simply as ‘infant’, merely for expositional fluency), a model of the caregiver (the ‘caregiver’) and a model of the environment in which they interact. An overview of the model is given in Figure 1. As a simplification in the model, we assume that infant and caregiver are facing each other and remain in the same position. The space surrounding infant and caregiver is discretized into N distinct regions. The caregiver can look at any of these regions or at the infant. The infant can look at any of these regions or at the caregiver. The infant’s and caregiver’s shifting of gaze are the only ways they interact with each other and the environment. Time runs in discrete steps, each corresponding to roughly a quarter of a second. Each gaze shift is assumed to take one time step.

Footnote 3. An initial account of the model was given in Carlson and Triesch (2003).

At any time there is one interesting object present or event occurring in one of the N regions of the environment. This could be an interesting toy, a third social agent, the caregiver’s hand manipulating an object or performing a gesture, or other stimuli that the infant would find interesting. We will refer to this object or event as the target. (Below we will also consider environments with multiple targets.) After some minimum time T_min at one location, the interesting target is relocated to a randomly chosen new location with some probability p_shift per time step.
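The target dynamics just described can be sketched in a few lines. This is an illustrative Python rendering, not the authors' Matlab code. It assumes that regions are numbered 0 to N − 1, that T_min is counted in time steps, and that the ‘new’ location always differs from the current one; the function and variable names are hypothetical.

```python
import random


def step_target(location, time_at_location, n_regions, t_min, p_shift, rng):
    """Advance the target by one time step.

    Returns the (possibly new) location and the number of time steps the
    target has spent at that location. Requires n_regions >= 2.
    """
    if time_at_location >= t_min and rng.random() < p_shift:
        # Assumption: the target always moves to a different region.
        candidates = [r for r in range(n_regions) if r != location]
        return rng.choice(candidates), 0
    return location, time_at_location + 1


# Example: with p_shift = 1.0 the target moves as soon as t_min has elapsed.
rng = random.Random(0)
location, elapsed = 0, 0
for _ in range(3):
    location, elapsed = step_target(location, elapsed, n_regions=10,
                                    t_min=3, p_shift=1.0, rng=rng)
stayed_location, stayed_elapsed = location, elapsed
location, elapsed = step_target(location, elapsed, n_regions=10,
                                t_min=3, p_shift=1.0, rng=rng)
```

Under these assumptions, the time the target spends at one location is T_min plus a geometrically distributed waiting time with mean 1/p_shift steps.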

Whenever the target moves, the caregiver model shifts its direction of gaze. There is a certain probability pvalid that the caregiver will be looking at the new location of the target. Otherwise, the caregiver’s new direction of gaze is drawn from a uniform distribution over all of the other N locations (one for the infant plus N − 1 locations not containing the target). Thus, the parameter pvalid models how predictive the caregiver’s direction of gaze is for indicating the location of the interesting target.
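The target and caregiver dynamics described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' Matlab implementation; the function names, the `INFANT` marker, and the reading that a ‘new’ location always differs from the old one are assumptions made here for illustration.

```python
import random

INFANT = "infant"  # hypothetical marker: the caregiver looks at the infant


def step_target(target, dwell, n_regions, t_min, p_shift, rng):
    """Advance the target by one time step.

    Returns (target, dwell, moved). Assumption: the target may move only
    once it has stayed t_min steps, and it always moves to a different region.
    """
    if dwell >= t_min and rng.random() < p_shift:
        candidates = [r for r in range(n_regions) if r != target]
        return rng.choice(candidates), 0, True
    return target, dwell + 1, False


def sample_caregiver_gaze(target, n_regions, p_valid, rng):
    """With probability p_valid look at the target; otherwise pick uniformly
    among the other N locations (the infant plus the N - 1 empty regions)."""
    if rng.random() < p_valid:
        return target
    others = [INFANT] + [r for r in range(n_regions) if r != target]
    return rng.choice(others)


def run_environment(n_steps, n_regions=10, t_min=4, p_shift=0.5,
                    p_valid=0.75, seed=0):
    """Return a list of (target, caregiver_gaze) pairs, one per time step."""
    rng = random.Random(seed)
    target = rng.randrange(n_regions)
    gaze = sample_caregiver_gaze(target, n_regions, p_valid, rng)
    dwell = 0
    history = []
    for _ in range(n_steps):
        target, dwell, moved = step_target(target, dwell, n_regions,
                                           t_min, p_shift, rng)
        if moved:  # the caregiver re-orients only when the target moves
            gaze = sample_caregiver_gaze(target, n_regions, p_valid, rng)
        history.append((target, gaze))
    return history
```

The default arguments follow the default values listed in Table 1.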

The parameter pvalid also has a second function. We can use it to model inaccuracies in the infant’s head pose discrimination. Consider the case where the caregiver is always looking at the target. Even in this case, if the infant’s head pose discrimination is inaccurate or noisy, the infant will not be able to correctly infer the caregiver’s head pose and, as a consequence, the estimated head pose will not be very predictive of rewarding sights. Thus, a not-so-predictive caregiver whose head pose can be estimated accurately and a highly predictive caregiver whose head pose we can only infer correctly some fraction of the time will produce the same net effect, and we can model both situations with the same parameter pvalid.

Note that this environment and caregiver model is extremely simple. In particular, the caregiver is not responding to the infant in any way. This is obviously a gross simplification of the complex, reciprocal dynamics of infant–caregiver interactions (e.g. Kaye, 1982), but as we will demonstrate below, even this kind of social environment can be sufficient for gaze following to emerge. More complex, interactive caregiver models have also recently been investigated, and these show that the caregiver’s behavior plays an important role (Teuscher & Triesch, 2004). In particular, the caregiver’s behavior has to be properly matched to the parameters of the infant model for optimal learning speed, although gaze following will emerge under a wide range of caregiver behaviors.

Infant model

Our infant model is essentially that of a pleasure-driven agent. There are many ways of formalizing this idea but a particularly appropriate formal framework is reinforcement learning (Sutton & Barto, 1998). Besides being the basis for modern theories of learning under rewards and punishments, reinforcement learning is also an important subfield of machine learning with some impressive application successes (Sutton & Barto, 1998). In particular, our model uses temporal difference learning (TD learning) algorithms, which have been proposed as models for certain basal ganglia functions (Schultz et al., 1997). A detailed description of the equations of the model is given in the Appendix.

We conceive the infant as a reinforcement learning system that learns to make two kinds of decisions. First, at any given time it decides whether to shift gaze or keep fixating the same location. Second, it decides where to look next, once the decision to shift the direction of gaze has been made. The information available to the infant includes the identity of its current object of fixation, its associated reward value, and the length of time the infant has been fixating this object. If and only if the fixated object is the caregiver, the infant will know the caregiver’s current head pose.

Looking, reward and habituation

The infant model receives rewards for looking at interesting things. The amount of reward received depends on the contents of the infant’s gaze and how habituated the infant is to those contents. There are four possible things for the infant to see: (1) a frontal view of the caregiver (in case the caregiver is also looking at the infant), (2) a non-frontal view of the caregiver, which we simply refer to as a profile view (in case the caregiver is not looking at the infant), (3) the target or (4) nothing. Associated with these sights are the base rewards Rfrontal, Rprofile, Rtarget, Rnothing. The actual reward received by the infant is the base reward attenuated by habituation. As the infant looks at a location, the infant habituates to its contents in the sense that the actual reward for any object at this location will decrease over time. Similarly, dishabituation is modeled as a recovery of the actual reward for objects at other locations.

Figure 1 Overview of the model showing infant, caregiver and interesting object. Corresponding model parameters are given in brackets. Note that while we draw the spatial locations as arranged in a hexagonal fashion, the model does not assume or use any specific topological relations between these locations.

For each object in the environment, including the caregiver, the infant has a habituation value $h_{\mathrm{fix}}(t) \in [0,1]$, indicating the fraction of the base reward the infant receives for looking at this object. A value of $h_{\mathrm{fix}} = 1$ means that the infant is not habituated to the object, while a value of $h_{\mathrm{fix}} = 0$ means that the infant is completely habituated to the object. As the infant continues to fixate on an object its habituation value decreases according to $h_{\mathrm{fix}}(t) = h_{\mathrm{fix}}(0)\,e^{-\beta t}$, where $h_{\mathrm{fix}}(0)$ is the habituation level at the beginning of the current fixation, $t$ is the time since the start of the fixation, and $\beta$ is the habituation rate. Thus, the actual reward received by the infant at time $t$ is $r_{\mathrm{actual}}(t) = R_{\mathrm{fix}}\,h_{\mathrm{fix}}(t)$, where $R_{\mathrm{fix}} \in \{R_{\mathrm{frontal}}, R_{\mathrm{profile}}, R_{\mathrm{target}}, R_{\mathrm{nothing}}\}$ is the base reward. At the same time, the reward levels for objects at locations not being fixated recover in a corresponding fashion, modeling dishabituation. In particular, when the infant is not looking at an object it dishabituates according to $h_{\mathrm{nofix}}(t) = 1 - h_{\mathrm{nofix}}(0)\,e^{-\beta t}$, where $t$ is the time since last looking at that object and $h_{\mathrm{nofix}}$ is the level of habituation of this object currently not being fixated.
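The habituation and dishabituation rules can be illustrated with a short Python sketch. The function names are hypothetical. For the recovery rule, the sketch assumes that the ‘level of habituation’ at the moment the infant looks away equals one minus the habituation value at that moment, so that the value recovers continuously towards 1; the exact convention is specified in the Appendix and may differ.

```python
import math


def habituate(h_start, beta, t):
    """Habituation value after fixating for t steps, starting from h_start:
    h_fix(t) = h_fix(0) * exp(-beta * t)."""
    return h_start * math.exp(-beta * t)


def dishabituate(h_start, beta, t):
    """Habituation value after NOT fixating for t steps.

    Assumption: the level of habituation when looking away is 1 - h_start,
    which gives h(t) = 1 - (1 - h_start) * exp(-beta * t).
    """
    return 1.0 - (1.0 - h_start) * math.exp(-beta * t)


def actual_reward(base_reward, h):
    """Actual reward: the base reward attenuated by the habituation value."""
    return base_reward * h
```

With the default habituation rate of 1, `actual_reward(1.0, habituate(1.0, 1.0, 2))` gives the reward for a fully novel target on the second step after fixation onset under these assumptions.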

One infant, two agents: when and where

Inspired by the proposal that the decisions of when to shift gaze and where to shift gaze are made in separate neural pathways (Findlay & Walker, 1999), the infant model consists of two separate agents. The state space of the when-agent, which decides whether to continue to fixate on the same location or shift gaze, has two dimensions. The first dimension represents the time the infant has been fixating at the same location, discretized as the number of time steps (0, 1, 2, . . . , 8, 9 or more). The second dimension is the actual reward received by the infant. This is the total reward the infant receives on that time step, taking habituation into account, discretized uniformly into ten bins between the maximum and minimum possible actual rewards.

If the when-agent makes the decision to shift gaze, the where-agent determines the target of the gaze shift. The state space of this agent has only a single dimension: the caregiver’s head pose. Importantly, unless the infant is looking at the caregiver, the caregiver’s head pose will be unknown to the infant. Concretely, this agent distinguishes N + 2 different states: N for the N different head poses observed when the caregiver looks at the N regions of space, plus one for the caregiver’s head pose when the caregiver is facing the infant, plus one state to represent that the head pose of the caregiver is unknown to the infant. The where-agent learns to map these states onto N + 1 different actions: one action for looking at each of the N regions of space and one action for looking at the caregiver. Note that we assume a one-to-one correspondence between a caregiver head pose and the region of space the caregiver looks at. In reality, this mapping is ambiguous and the ambiguity can produce characteristic errors in gaze following (Butterworth & Jarrett, 1991). Modeling this ambiguity and how the infant learns to resolve it is the subject of a separate paper (Lau & Triesch, 2004).
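One possible encoding of the two state spaces is sketched below in Python. The index layout (regions first, then ‘facing infant’, then ‘unknown’), the `"infant"` marker and the function names are assumptions made for illustration and are not taken from the authors' code.

```python
def when_state(fixation_steps, actual_reward, r_min, r_max, n_bins=10):
    """State of the when-agent as (time_bin, reward_bin).

    time_bin counts fixation steps 0..8, with 9 meaning '9 or more'.
    reward_bin splits [r_min, r_max] uniformly into n_bins bins.
    """
    time_bin = min(fixation_steps, 9)
    if r_max == r_min:
        return time_bin, 0
    fraction = (actual_reward - r_min) / (r_max - r_min)
    reward_bin = min(max(int(fraction * n_bins), 0), n_bins - 1)
    return time_bin, reward_bin


def where_state(caregiver_head_pose, looking_at_caregiver, n_regions):
    """State of the where-agent as an index in 0..N+1.

    0..N-1 : caregiver looks at that region
    N      : caregiver faces the infant
    N+1    : head pose unknown (infant is not looking at the caregiver)
    """
    if not looking_at_caregiver:
        return n_regions + 1
    if caregiver_head_pose == "infant":
        return n_regions
    return caregiver_head_pose
```

With N = 10 this yields 100 states for the when-agent and 12 states for the where-agent.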

Learning in both agents occurs through the SARSA algorithm (see Appendix), which was chosen because of its simplicity. Both agents balance exploration vs. exploitation by selecting actions with a softmax action selection mechanism (see Appendix). It should be noted that separating the infant model into two separate learning agents is not strictly necessary. We would expect similar results for a simpler model that uses a single reinforcement learning agent to model the infant, whose state space is the product space of the state spaces of the when- and where-agents, and whose possible actions are to shift gaze to any of the N + 1 locations. However, the learning time would be expected to increase because of the higher dimensionality of the resulting state space.
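For readers unfamiliar with these methods, the following Python sketch shows the textbook forms of softmax action selection and the SARSA update (Sutton & Barto, 1998). It is assumed here, not confirmed, that the Appendix uses these standard forms; the function names and the table layout `q[state][action]` are illustrative.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Action probabilities proportional to exp(Q / tau).

    The maximum is subtracted before exponentiating for numerical stability;
    this does not change the resulting probabilities.
    """
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choice(q_values, tau, rng):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def sarsa_update(q, s, a, reward, s_next, a_next, alpha, gamma):
    """Standard SARSA update on a table q[state][action]:
    Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).
    Returns the temporal difference error."""
    td_error = reward + gamma * q[s_next][a_next] - q[s][a]
    q[s][a] += alpha * td_error
    return td_error
```

A low temperature τ makes the choice nearly greedy, while a high temperature makes it nearly uniform; the defaults in Table 1 are α = 0.0025, γ = 0.8 and τ = 0.095.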

Experiments

Normal emergence of gaze following

In this section we describe a first analysis of the model and the effects of some model parameters on its learning behavior. For easy reference, all parameters, their default values, and their allowed ranges are listed in Table 1. In the following, default parameter values are used unless otherwise indicated. The effect of changing several parameters is discussed below. Generally speaking, the model is robust to changes in the parameters over wide ranges. The parameters Tmin, pshift and pvalid were set ad hoc but could eventually be set in accordance with data from an observational study of naturalistic infant–caregiver interactions that is currently under way (Deák et al., 2004).

To quantify the emergence of gaze following in the model and its dependence on model parameters we use the following approach. At specific points during the learning process we temporarily ‘freeze’ the model and evaluate its behavior for 1000 time steps (which corresponds to slightly more than 4 minutes of simulated interaction), after which the learning process resumes. The model behavior at these stages of the learning process is analyzed by observing the infant model interacting with the environment and computing two statistics. The caregiver index CGI is defined as the frequency of the infant’s gaze shifts towards the caregiver:

$$\mathrm{CGI} = \frac{\#\ \text{gaze shifts to caregiver}}{\#\ \text{gaze shifts}} \qquad (1)$$

The gaze following index GFI is the frequency of gaze shifts that lead from the location of the caregiver to where the caregiver is looking:

$$\mathrm{GFI} = \frac{\#\ \text{gaze shifts from caregiver to correct location}}{\#\ \text{gaze shifts}} \qquad (2)$$
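As an illustration, both indices can be computed from a recorded gaze trace as in the Python sketch below. The trace format, the `CAREGIVER` marker, and the choice to compare the infant's new fixation with the caregiver's gaze direction on the preceding step (when the infant was still looking at the caregiver) are assumptions made here.

```python
CAREGIVER = "caregiver"  # hypothetical marker: the infant fixates the caregiver


def gaze_indices(infant_gaze, caregiver_gaze):
    """Return (CGI, GFI) for two equally long per-step traces.

    infant_gaze[t]    : region index or CAREGIVER
    caregiver_gaze[t] : region index (or any marker for 'looks at infant')
    A gaze shift is counted whenever the infant's fixation changes.
    """
    shifts = to_caregiver = followed = 0
    for t in range(1, len(infant_gaze)):
        prev, cur = infant_gaze[t - 1], infant_gaze[t]
        if cur == prev:
            continue
        shifts += 1
        if cur == CAREGIVER:
            to_caregiver += 1
        elif prev == CAREGIVER and cur == caregiver_gaze[t - 1]:
            followed += 1
    if shifts == 0:
        return 0.0, 0.0
    return to_caregiver / shifts, followed / shifts
```

Because both indices share the total number of gaze shifts as denominator, an infant that strictly alternates between the caregiver and the gazed-at location would reach a value of 0.5 on each index under this definition.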

An example run of the system with the default parameters is shown in Figure 2. The model first learns to alternate gaze between the caregiver and other locations. In terms of the model, the when-agent discovers that it is best not to continue staring at a single location for too long. At the same time, the where-agent discovers that if the infant is not looking at the caregiver it tends to be rewarding to make a gaze shift back to the caregiver. After this has been achieved, gaze following behavior slowly emerges. Here, the where-agent discovers that unexpectedly high rewards tend to follow gaze shifts to certain locations, depending on the caregiver’s head pose. It learns to correctly map the caregiver’s head pose to gaze shifts to the locations that the caregiver looks at. The increasing average reward the model obtains per time step during this phase confirms that gaze following is in fact beneficial for the model under these parameters. Note that for a model without habituating rewards it would be optimal to continually stare at the caregiver.

A microscopic view of the behavior of the infant model is shown in Figure 3 (top). It shows the fixation behavior of the infant during various stages of the learning process. Fixations on the caregiver are indicated by white pixels, target fixations by black pixels, and fixations on other regions of space by grey pixels. The quick development of a preference for looking at the caregiver is visible as the increase in the amount of white pixels (caregiver fixations) during the first few rows. The subsequent increase in target fixations (black pixels) is the effect of the emergence of gaze following. Gaze following episodes are shown by black pixels to the right of white pixels.4 The increase in the number of such episodes during learning directly reflects the increasing GFI (compare Figure 2).

Table 1 Overview of model parameters, their allowed ranges and default values

Symbol    Explanation                                       Range         Default
N         number of spatial regions                         1, 2, . . .   10
∆t        duration of one simulation step                   arbitrary     ∼250 ms
α         learning rate                                     [0,1]         0.0025
β         habituation rate                                  [0,∞]         1
γ         discount factor for future rewards                [0,1]         0.8
τ         temperature (randomness of action selection)      [0,∞]         0.095
Rfrontal  reward for looking at frontal view of caregiver   [−∞,∞]        1
Rprofile  reward for looking at profile view of caregiver   [−∞,∞]        1
Rtarget   reward for looking at target                      [−∞,∞]        1
Rnothing  reward for looking at other region                [−∞,∞]        0
Tmin      minimum target stationary time (steps)            [0,∞]         4
pshift    probability of target shift per time step         [0,1]         0.5
pvalid    predictiveness of caregiver gaze                  [0,1]         0.75

Figure 2 Emergence of gaze following in simple environment with just one interesting target present at any time. The solid curve plots the caregiver index (CGI), the solid curve with circles plots the gaze following index (GFI) and the dotted curve plots average reward per time step, as functions of the number of learning iterations. Error bars indicate standard deviations across 15 simulations.

Figure 4 shows that gaze following will still be learned in more complex environments, where multiple interesting events occur simultaneously. In this case, the learning is somewhat slower because the infant may temporarily learn incorrect associations between a particular caregiver head pose and a gaze shift to a location not looked at by the caregiver but that nevertheless contains an interesting event.

4 Note that there can be instances of black pixels to the right of white pixels that do not correspond to gaze following. This occurs when the infant looks away from the caregiver to a location not looked at by the caregiver that happens by chance to hold the interesting object. These instances are comparatively rare, however. More precisely, the probability of the infant finding the target this way is only (1 − pvalid)/(N − 1), where N is the number of locations in the environment.
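As a worked example, inserting the default values from Table 1 (N = 10, pvalid = 0.75) into the expression given in footnote 4 yields:

$$\frac{1 - p_{\mathrm{valid}}}{N - 1} = \frac{1 - 0.75}{10 - 1} = \frac{0.25}{9} \approx 0.028$$

Under the default parameters, fewer than 3% of such gaze shifts would therefore reach the target by chance.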

Figure 4 Gaze following in the presence of multiple targets for various values of pvalid. The gaze following performance averaged over 100 000 steps (y-axis) is plotted as a function of the number of targets that are present simultaneously (x-axis). Error bars indicate standard error across 15 simulations. Gaze following is diminished if significant ambiguities due to multiple targets exist. Also, a reduced predictiveness of the caregiver pvalid has a negative impact on gaze following performance. The dashed horizontal line marks the ‘chance level’ of gaze following expected for an infant who first looks to the caregiver and then shifts gaze randomly to any of the N locations.

Figure 3 Microscopic analysis of model behavior for normally developing (top), autism-like (center) and Williams-like (bottom) model. Each row of pixels shows the target of the infant’s gaze as a function of time (for 50 steps). The gaze target is color coded, with white corresponding to the caregiver, black corresponding to the target, and grey corresponding to other regions of space. In particular, an instance of gaze following is represented by a black pixel lying to the right of a white pixel. Different rows show the behavior at different times during the learning process (every 4000 steps).



We have also experimented with making Rprofile smaller than Rfrontal to capture infants’ preference for frontal faces (Sai & Bushnell, 1998). We found that gaze following performance is largely determined by Rprofile, with higher Rprofile values leading to faster learning. The value of Rfrontal plays a comparatively small role, because the current caregiver model only looks at the infant infrequently. A systematic analysis of learning speed as a function of caregiver reward is given below in the context of modeling developmental disorders.

Analysis of model parameters

Predictiveness of caregiver’s gaze

An important parameter of the model is pvalid (see Figure 4). Unless pvalid is high enough, gaze following will not emerge. For pvalid = 0.25, the GFI remains very poor, even when there is only one interesting target in the environment. There are two interpretations of this result, corresponding to the two interpretations of pvalid (see above). First, a highly informative caregiver, i.e. one who frequently looks at the interesting target, facilitates the acquisition of gaze following. This confirms the importance of one component of the Basic Set: a structured social environment. Second, limitations of the infant’s ability to discriminate head poses will delay the infant’s acquisition of gaze following. Currently, little is known about how real infants’ ability to discriminate head poses develops, but such data would be most useful in constraining the model (see also Lau & Triesch, 2004).

Speed of learning: learning rate and habituation

We hypothesized that the learning rate α and the habituation rate β might both influence the speed with which gaze following can be acquired. In the trivial case of α = 0 no learning takes place at all, and gaze following obviously cannot emerge. However, too high a learning rate can also cause problems. This is illustrated in Figure 5, top. In general, an intermediate value for the learning rate seems to be optimal, which is common for reinforcement learning models.

Figure 5, bottom, shows the effect of the habituation rate β on the learning process. It shows that an infant that habituates faster (high β) learns to follow gaze more quickly. By contrast, slow habituation (low β) will result in less frequent gaze shifts between objects and therefore in fewer opportunities for the necessary learning experiences. Interestingly, however, even without any habituation (β = 0) gaze following is still learned, but very slowly. In this case, gaze shifts away from the most rewarding object occur only through the random selection of exploratory actions. The infant will spend most time looking at the caregiver, which is the optimal thing to do. Due to the random softmax action selection mechanism, however, which sometimes explores the consequences of seemingly suboptimal actions, the infant will look away from the caregiver, which creates an opportunity to discover the benefit of following gaze. We conclude that although habituation is not strictly necessary if there are other mechanisms for exploratory gaze shifting, learning may be very slow without it. The model thus predicts that infants who habituate quickly (in the sense of the model) may learn gaze following faster than their peers. This prediction is consistent with some evidence that infants who are ‘fast habituators’ at 5 months have better social and communicative skills at 13 months (Tamis-LeMonda & Bornstein, 1989), although care has to be taken because our notion of habituation as a decaying reward for a visual stimulus is not identical to the common behavioral measures of habituation.

Figure 5 Top: Effect of learning rate on emergence of gaze following. A higher learning rate α leads to accelerated initial learning as measured by the gaze following index (GFI). However, a high learning rate can lead to problems in the long run. The infant may never acquire a high level of gaze following. Error bars indicate standard errors across 15 runs. Bottom: Effect of habituation rate on learning of gaze following. Faster habituation leads to accelerated learning as measured by the gaze following index (GFI). Even without any habituation gaze following is still learned, albeit very slowly. Error bars indicate standard error across 15 simulations.

In summary, both learning rate and habituation rate influence the speed of learning and may be related to individual differences in the emergence of gaze following in real infants. However, they act on the learning process in different ways. The learning rate α determines how much an individual learning experience changes the infant’s future behavior. The habituation rate β determines how many relevant learning experiences the infant encounters during a fixed amount of time.

Modeling failures of the emergence of gaze following in autism and Williams syndrome

Any account of gaze following should answer why gaze following emerges, and why gaze following may not emerge under certain circumstances. An important line of research concerns differences in shared attention skills in developmental disorders such as autism and Williams syndrome. Autism is a Pervasive Developmental Disorder characterized by impairment in social interactions and communication (e.g. Dawson, Meltzoff, Osterling, Rinaldi & Brown, 2004), as well as atypical cognitive processing. Shared attention deficits are the most consistent early predictors of the social and language deficits of autism (Osterling & Dawson, 1994). Thus, a critical test of our model is its capacity to simulate autistic failure of gaze following.

A more subtle challenge is to test the model’s capacity to simulate a disorder that is associated with less striking and more idiosyncratic differences in joint attention. Williams syndrome is a rare genetic disorder that is characterized by (among other things) hypersocial behavior, differences in face processing and deficits in learning and attention. Most importantly for us, there is also some evidence for deficits in triadic shared attention skills (Bertrand, Mervis, Rice & Adamson, 1993; Laing, Butterworth, Ansari, Gsödl, Longhi, Panagiotaki, Paterson & Karmiloff-Smith, 2002; Mervis, Morris, Klein-Tasman, Bertrand, Kwitny, Appelbaum & Rice, 2003), although more research is needed in this area.

While traditional nativist/modularist accounts typically propose broken or missing modules as the origin of developmental disorders (Baron-Cohen, 1995), our account prompts us to look for potential differences in the components of the Basic Set that may lead to different developmental trajectories. The goal here is not to provide a comprehensive model of these developmental disorders, but to show how specific aspects of these disorders may contribute to deficits in gaze following.

Changes in the reward structure

In the last section we have already seen how differences in learning rate or habituation rate can slow down or even prevent the emergence of gaze following. For autism spectrum disorders and Williams syndrome, however, a particularly interesting candidate is the reward structure of the model, because in both kinds of disorders the affective value of faces may be altered.

An intriguing attribute of autism is disinterest in faces. In general, the interest in or appeal of social stimuli is diminished in autism (Adrien, Lenoir, Martineau, Perrot, Hameury, Larmande & Sauvage, 1993; Chawarska, Klin & Volkmar, 2003; Maestro, Muratori, Cavallaro, Pei, Stern, Golse & Palacio-Espasa, 2002; Tantam, Holmes & Cordess, 1993; Klin, Jones, Schultz & Volkmar, 2003; Dawson, Meltzoff, Osterling, Rinaldi & Brown, 1998). For some (but not all) individuals with autism, direct eye contact even seems to be aversive, a phenomenon known as gaze avoidance (Hutt & Ounsted, 1966; Richer & Coss, 1976; Langdell, 1978). It has been proposed many times that a disruption in face processing may be an underlying cause for social deficits in autism (e.g. Trepagnier, 1996; Howard, Cowell, Boucher, Broks, Mayes, Farrant & Roberts, 2000; Klin et al., 2003). Why faces are in some ways less salient or rewarding to individuals with autism is not clear. It may be that faces are too unpredictable for autistics, an idea consistent with the hypothesis that autistics prefer highly predictable stimuli (Gergely & Watson, 1999); it may also be that anatomical differences in the amygdala (which participates in processing facial affect displays) play a role (e.g. Howard et al., 2000; Baumann & Kemper, 2005). Regardless of the cause, this symptom, and its long-term effect on social learning, bears more precise (ideally quantitative) specification.

In contrast to the disinterest in faces in autism, children with Williams syndrome show a high preference for looking at faces over looking at other objects (Bertrand et al., 1993; Bellugi, Lichtenberger, Jones, Lai & St George, 2000; Mervis et al., 2003). In addition, altered as well as delayed emergence of face processing skills has been reported (Karmiloff-Smith, Thomas, Annaz, Humphreys, Ewing, Brace, Van Duuren, Pike, Grice & Campbell, 2004).



What would happen in the model if looking at the caregiver was made aversive, as for an atypical baby who finds faces unpredictable and overstimulating, or made highly positive, as for a hypersocial infant with an extreme preference for human faces over other sights?

To test the effect of different reward structures on the learning process, we systematically varied the reward parameters Rfrontal, Rprofile and Rtarget over a range of values. For simplicity we restricted ourselves to the case where Rprofile = Rfrontal. Figure 6 summarizes the results. For each combination of reward values we ran the simulation for 10^5 time steps and measured the GFI at the end of this time. Figure 6 plots the GFI averaged over 10 experiments as a function of Rfrontal and Rtarget.

For Rtarget ≤ 0 no gaze following behavior emerges. This makes intuitive sense because if the targets that the caregiver tends to look at are not rewarding for the infant, there is no benefit in gaze following behavior. That is, no additional reward can be obtained by following the caregiver’s gaze. If Rfrontal and Rprofile are small or even negative, modeling reduced interest in or aversion to faces as seen in autism, gaze following behavior does not develop normally. Depending on the caregiver and target reward, the infant model will spend little time looking at the caregiver. For example, while the ‘normal’ model with a base reward of 1 for the caregiver (frontal and profile) and for the target spends 49% of its time looking at the caregiver and 14% of the time looking at the target (averaged over the entire learning period), the ‘autistic-like’ model with caregiver reward of −1 will spend only 1% of its time looking at the caregiver and 11% looking at the target (which it occasionally finds by chance without utilizing the caregiver’s gaze). As a consequence, the learning process is slowed down or even prevented, and the GFI stays close to zero. The microscopic behavior of such a model is shown in Figure 3 (middle). Thus, a reduced reward for looking at the caregiver’s face or aversiveness of the caregiver is sufficient to explain delays or complete failure in the emergence of gaze following.

It is interesting to note that an analysis of the model shows that even for negative caregiver rewards, the model will nevertheless slowly learn how to follow gaze, even if it does not exhibit the behavior on a regular basis. By analyzing the infant’s action selection probabilities we found that the probability for following the caregiver’s gaze once the infant is looking at the caregiver slowly but clearly rises above those for other actions. However, the model rarely executes a complete gaze following sequence because it is unrewarding to do so, due to first having to look at the aversive caregiver. This behavior of the model might explain a puzzling finding by Leekam, Baron-Cohen, Perret, Milders and Brown (1997) that autistic children can follow gaze if explicitly told to do so, though they may rarely do it spontaneously. This finding is very problematic for previous accounts of the emergence of gaze following. We know of no theory that offers a satisfactory explanation for it. Subsequent studies by Leekam and colleagues (Leekam et al., 1998; Leekam, López & Moore, 2000) suggest that autistic children can be trained to follow gaze through contingent presentation of rewarding visual stimuli (Whalen & Schreibman, 2003), but that a lack of motivation to engage with the experimenter may impede learning. These findings are also consistent with our account. The association from caregiver head pose to regions in space is learned (although slowly) due to the constant low level of random exploration, but gaze following is simply not rewarding enough to be produced on a regular basis. If, however, an additional incentive for following gaze is present (e.g. being asked to look where another person is looking, or being trained via operant conditioning), the behavior can be elicited. Our account is also in line with the finding that gaze following in response to static pictures may be ‘easier’, if we make the additional assumption that static pictures of faces are not as aversive as dynamic displays (Klin et al., 2003).

It should be noted that an infant who looks less at faces due to a diminished reward for faces can be expected to develop deficits in face processing skills such as fine discrimination of head poses or estimation of the direction of gaze. This will likely compound delays in the emergence of gaze following. The model could capture this by making the parameter pvalid a function of the total amount of time the infant has been looking at the caregiver.

Figure 6 Learning performance as a function of caregiver and target reward. For the caregiver reward we use Rfrontal = Rprofile ≡ Rcaregiver. The z-axis corresponds to the GFI after 10^5 time steps of learning, averaged over 10 repetitions of the experiment.



We also tested what would happen if the reward for looking at the caregiver is much higher than the reward for looking at the target. This manipulation may be thought of as an attempt to model differences in Williams syndrome, where children exhibit an abnormally high preference for faces. Our experiments with the model show that in this case, somewhat surprisingly, the learning of gaze following can be substantially delayed (Figure 6). To give an example, a ‘Williams-like’ model with a base reward of 5 for looking at the caregiver and a base reward of 0.5 for looking at the target will spend 51% of its time looking at the caregiver but only 5% looking at the target. Thus, little gaze following will be observed, as illustrated in Figure 3 (bottom). The reason is that because the caregiver is relatively so rewarding to look at, it makes little difference to the infant where it looks in between fixations on the caregiver: the probability of looking at the target is only slightly higher than the probability of looking at any other region of space under the model’s probabilistic action selection rule.

Deficits in attention-shifting

Another important aspect of autism spectrum disorders is a deficit in shifting attention. For example, many studies have shown that people with autism are slower to shift attention between targets (e.g. Casey, Gordon, Manheim & Rumsey, 1993; Wainwright-Sharp & Bryson, 1993; Goldberg, Lasker, Zee, Garth, Tien & Landa, 2002; Landry & Bryson, 2004). This deficit might be related to cerebellar abnormalities (Harris, Courchesne, Townsend, Carper & Lord, 1999). Slow attention shifting can be incorporated into the model in the following way. Instead of gaze shifts taking effect immediately, we introduce a latency Tlat of 1 to 3 time steps. After the infant makes a decision to shift gaze, it has to wait Tlat time steps before the gaze shift takes effect. Figure 7 shows how this affects the emergence of gaze following. In these experiments all other parameters were set to their default values. The error bars indicate standard errors of 15 independent simulations per condition. As can be seen in the figure, the additional latency can slow down or even prevent the emergence of gaze following behavior, because there is a growing probability that by the time the infant shifts gaze, the rewarding sight has moved to a different location. This effect is clearly visible in infants with a normal, positive caregiver reward (Figure 7, top). However, it is more pronounced for a caregiver reward of zero, i.e. infants who find their caregivers uninteresting but not aversive (Figure 7, bottom), and it is even more pronounced for a model with negative caregiver reward (not shown). These results and the previous ones show that either different reward structures, or poor attention-shifting, or both, can explain gaze following deficits in autism within the proposed model.
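The latency mechanism described above can be sketched as a simple delay line. This is an illustration under stated assumptions rather than the simulation code itself: the class `DelayedGaze`, the function `chance_sight_unmoved` and the per-step relocation probability `p_move` are hypothetical names, and the rewarding sight is assumed to move independently at each time step.

```python
from collections import deque

class DelayedGaze:
    """Gaze controller whose decisions take effect t_lat steps later."""

    def __init__(self, initial_region, t_lat):
        self.current = initial_region
        # Decisions waiting to be executed; pre-filled so that the first
        # real decision has to wait t_lat time steps.
        self.pending = deque([initial_region] * t_lat)

    def step(self, decided_region):
        """Queue a new decision and execute the oldest pending one."""
        self.pending.append(decided_region)
        self.current = self.pending.popleft()
        return self.current

def chance_sight_unmoved(p_move, t_lat):
    """Probability that a rewarding sight is still in place after t_lat
    steps, assuming it moves with probability p_move at each step."""
    return (1.0 - p_move) ** t_lat

gaze = DelayedGaze("caregiver", t_lat=2)
print([gaze.step("left") for _ in range(3)])
print([round(chance_sight_unmoved(0.1, t), 3) for t in range(4)])
```

With a latency of two steps, the decision to look left only takes effect on the third step, and the chance that the sight is still there shrinks geometrically with the latency.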

Regarding Williams syndrome, a noteworthy recent report on the perception of faces in adults with Williams syndrome finds less accuracy in determining the direction of gaze, and significantly longer response latencies during face perception (Mobbs, Garrett, Menon, Rose, Bellugi & Reiss, 2004). Given our results above, we can conclude that both of these symptoms, if present in infants, would exacerbate problems in the emergence of gaze following. Less accuracy in determining the direction of gaze will lower the predictiveness of the caregiver (smaller pvalid), while longer response latencies can be thought of as increasing Tlat. In a similar vein, recently observed inaccuracies of saccade targeting and a higher number of corrective saccades in Williams syndrome (van der Geest, van Haselen, van Hagen, Govaerts, de Coo, de Zeeuw & Frens, 2004) may also contribute to longer latencies before the target of a gaze shift is reached, compounding difficulties in learning to follow gaze.

Figure 7 Learning performance for infant models with attention shifting deficits of varying degree. Top: for normal, positive caregiver reward. Bottom: for zero caregiver reward. Note the different scales on the axes. Error bars indicate standard error across 15 simulations.

Summary

To summarize, simple manipulations to the reward structure and attention shifting behavior of the model motivated by findings on two very different developmental disorders lead to deficits in the emergence of shared attention. What is needed for further constraining the model is more experimental data on how, for example, the accuracy of infants’ head pose discrimination, or the preference for viewing frontal vs. profile faces develops for normally and atypically developing infants.

Summary of model predictions

Although our model is simple and incorporates only well-known and accepted infant skills, it makes a number of novel predictions, summarized below. The list is certainly not exhaustive, since there are many ways of manipulating the model (we invite readers to download the software from http://mesa.ucsd.edu and derive new predictions). Of course, not all predictions of the model will lend themselves to experimental investigation, and some manipulations would be unethical to do with real infants. Leaving these concerns aside, the model makes the following predictions.

1. Fast habituation leads to quicker acquisition of gaze following. The systematic variation of the habituation parameter β showed an advantage in learning speed for faster habituation. Fast habituation in the model leads to more gaze shifts per time interval on average, which produces more opportunities to learn the predictive value of the caregiver’s direction of gaze, all else being equal.

2. Face perception skills should correlate with gaze following ability. One interpretation of the parameter pvalid was that it reflected accuracy of head pose estimation in infants. The model showed that without a sufficiently high pvalid, gaze following will not emerge.

3. Infants with general learning deficits should also have an impairment in the acquisition of gaze following. Choosing too small a learning rate in the model leads to delays in the emergence of gaze following. Not surprisingly, though, too high a learning rate was also found to be maladaptive.

4. Infants whose visual preferences do not match their caregivers’ should have deficits in gaze following. The model shows that if the reward values associated with the objects/events that caregivers tend to look at are not higher than those for random locations, gaze following will not emerge. By the same token, infants whose caregivers produce few predictive gaze cues (e.g. due to visual deficits) should also learn gaze following more slowly.

5. Infants who find faces too attractive should have deficits in gaze following. Using a caregiver reward much higher than the target reward leads to deficits in gaze following in the model.

6. Infants who find faces uninteresting or aversive should have deficits in gaze following. Using small positive or negative rewards for looking at the caregiver leads to gradual deficits in the emergence of gaze following. This problem may be compounded by a poor development of face processing skills caused by aversiveness (or even neutrality) of faces.

7. Infants with deficits in attention-shifting should exhibit delays in learning gaze following. The model shows that slow attention-shifting (Tlat > 0) leads to a sluggish emergence of gaze following behavior.

8. Amount of caregiver contact should influence emergence of gaze following. An infant who experiences few face-to-face interactions with caregivers may be slower to acquire gaze following because of a shortage of relevant learning experiences.

9. Differences in caregiver behavior can aid or hinder the emergence of gaze following. Varying the model parameters related to the caregiver behavior (pshift, Tmin) while keeping the parameters of the infant identical, leads to differences in learning speed. It is likely that ‘optimal’ caregiver behavior depends on particular infant parameters. Thus, the optimal caregiver behavior will generally be different for each infant – especially in the case of abnormally developing infants. More work is needed to understand these issues and their potential ramifications for therapeutic interventions (Teuscher & Triesch, 2004).

10. Lesioning certain neural pathways should impair gaze following behavior. We assume that information about the caregiver’s direction of gaze is extracted from face processing areas including (but not necessarily limited to) the Fusiform Face Area (Kanwisher, McDermott & Chun, 1997). Control of gaze shifts is assumed to be mediated through areas such as the Frontal Eye Fields (Tehovnik, Sommer, Chou, Slocum & Schiller, 2000). Our temporal difference learning model assumes that pathways between these sites (direct or indirect) are modified during learning and lesioning these pathways may impair gaze following.
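Several of these predictions concern how learned associations between the caregiver's head pose and the infant's gaze shifts depend on parameters such as pvalid and the learning rate. The following toy sketch illustrates the idea with a one-step temporal-difference-style value update in which each trial ends after a single gaze shift, so no bootstrapped next-state term appears. It is not the simulation code of the model. It assumes a one-to-one correspondence between head poses and regions, replaces the model's probabilistic action selection with epsilon-greedy exploration, and omits habituation and the caregiver reward. All names and parameter values are hypothetical.

```python
import random

def train_gaze_following(n_regions=4, p_valid=0.9, learning_rate=0.1,
                         target_reward=1.0, epsilon=0.1,
                         n_trials=5000, seed=0):
    """Learn values q[pose][region] for shifting gaze to `region`
    after seeing the caregiver's head pose `pose`."""
    rng = random.Random(seed)
    q = [[0.0] * n_regions for _ in range(n_regions)]
    for _ in range(n_trials):
        target = rng.randrange(n_regions)  # where the interesting sight is
        # The caregiver looks at the target with probability p_valid,
        # otherwise at a random region (assumption of this sketch).
        if rng.random() < p_valid:
            pose = target
        else:
            pose = rng.randrange(n_regions)
        # Epsilon-greedy choice of the gaze shift.
        if rng.random() < epsilon:
            action = rng.randrange(n_regions)
        else:
            action = max(range(n_regions), key=lambda a: q[pose][a])
        reward = target_reward if action == target else 0.0
        # The trial ends here, so the TD error reduces to reward - value.
        q[pose][action] += learning_rate * (reward - q[pose][action])
    return q

q = train_gaze_following()
learned = [max(range(len(row)), key=lambda a: row[a]) for row in q]
print(learned)  # preferred gaze shift for each caregiver head pose
```

Lowering `p_valid` towards chance level, or making `learning_rate` very small, weakens or slows the learned pose-to-gaze mapping in this toy setting, in the spirit of predictions 2 and 3.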



Discussion

We have proposed a model of the emergence of gaze following in situated infant–caregiver interactions. Our account is an elaboration of ideas that explain the emergence of gaze following as a learning process driven by hedonic principles (Moore & Corkum, 1994). Infants are viewed as pleasure-driven agents, who learn to exploit information about their caregiver’s head movement and head pose (and, later, eye direction) to find interesting sights in their environment. More specifically, we have proposed a Basic Set of structures and mechanisms that allow the infant to succeed in learning in an appropriately structured environment where the caregiver tends to look at things that the infant will find interesting. The proposed Basic Set has a small number of elements but, as our computer simulations demonstrate, it is sufficient for gaze following to emerge. In particular, no additional specialized cognitive modules are necessary to explain the emergence of gaze following in infant–caregiver interactions. Note that all elements of our proposed Basic Set are established within days of birth (or, for attention-shifting, at around 3 months) in typically developing infants. This does not mean that we think all other mechanisms are unimportant for a comprehensive account of the emergence of gaze following. It merely means that other mechanisms are not required for explaining the basic gaze following phenomenon.

We have used the model to demonstrate how the Basic Set mechanisms are sufficient to allow an infant to learn to associate a particular head pose of the caregiver with a gaze shift to a location outside of the infant’s field of view. This specific ability emerges rather late in normal development. Earlier signs of gaze following may be learned in a very similar way, however. The presence of the Basic Set mechanisms in even very young infants makes a learning account of any earlier gaze following competence plausible. For example, in the context of the present model it is easy to see that, say, gaze following to targets inside the infant’s field of view may be learned with the same mechanisms – only more easily and faster/earlier. The only Basic Set element for which there is no evidence of its presence within days of birth is the ability to shift gaze away from a central stimulus. Indeed, all demonstrations of very early ‘gaze following’ have to remove the face stimulus after the gaze shift to facilitate a gaze shift to the periphery. Overall, we find it hard to envision an account of the progressive expansion of gaze following competence in infancy that is not based on a gradual learning process. Again, as stated in the introduction, this view does not at all preclude the presence of evolved rudimentary propensities that contribute to gaze following in specific situations, but it places a clear emphasis on learning, especially for the emergence of more advanced gaze following skills.

It has been noted that infants will follow not only the line of regard of humans, but also that of non-human objects with face-like features, or objects that behave contingently to them (Johnson, Slaughter & Carey, 1998). This suggests that infants’ capacity for joint attention is a generalizable skill that is not tightly tied to specific situations with specific caregivers. Rather, it is a robust skill that extends flexibly to various social interactions. Our model readily accounts for these findings, if the additional assumption is made that such non-human objects may be able to activate some of the same head pose and gaze direction sensitive neurons in the infant’s face processing areas that are utilized for following the gaze of humans.

Related work

A few related models have recently been proposed in the literature. The idea of using temporal difference learning to model the acquisition of gaze following was first mentioned by Matsuda and Omori (2001). They model a learning situation as used by Corkum and Moore (1998), where an experimenter monitors the infant’s behavior and gives visual rewards to the infant when it follows the caregiver’s gaze. Their paper lacks details, however, and does not explicitly model how the caregiver’s direction of gaze becomes associated with certain gaze shifts. We consider explaining this process to be the central problem of learning gaze following.

A recent model by Nagai, Hosoda, Morita and Asada (2003) has been implemented in a robot. Their model, which was developed concurrently with ours, shares a number of aspects of our model (Fasel et al., 2002; Carlson & Triesch, 2003). In Nagai et al.’s model the infant also learns to associate head poses of the caregiver with appropriate gaze shifts based on the success or failure of finding a visually appealing stimulus. To this end, a neural network is trained to map the robot’s current gaze direction and an image of the caregiver’s face onto the desired gaze shift. Their model, however, does not utilize temporal difference learning, but rather an ad hoc learning mechanism. Also, no attempts are made to explain failures of the emergence of gaze following in either developmental disorders or in other species. On the positive side, the authors do not make the simplifying assumption that caregiver head poses have a one-to-one correspondence with regions in space, which we have used here. Nagai et al. also attempt to explain the progressive development of gaze following skills as described by Butterworth and Jarrett (1991). However, a closer look at their model reveals that the most sophisticated so-called representational stage cannot be achieved. In contrast, new models of our group correctly capture the sequential emergence of all skill levels described by Butterworth and Jarrett (Lau & Triesch, 2004; Jasso, Triesch & Teuscher, 2005). Interestingly, these models predict that limitations in head pose discrimination ability and/or depth perception ability may be the factors preventing younger infants from learning advanced gaze following skills (Butterworth’s geometric and representational stages). Taken together, the current study and our more recent ones point to the possibility that simple perceptual limitations may limit the emergence of advanced gaze following skills. We think it is crucial for the field to carefully study how perceptual skills (head pose discrimination, gaze direction estimation, depth perception) and gaze following skills co-develop in the same individual, in order to test the predicted causal relation between these factors.

Developmental disorders

Our account of the emergence of gaze following offers new perspectives on failures of its emergence in developmental disorders. If a small Basic Set of ‘ingredients’ is demonstrably sufficient for the emergence of gaze following, then in situations where the learning process does not succeed, one or several elements of the Basic Set, or their interaction, must have been compromised. Elaborating on this idea, we showed how changes to the model motivated by two different developmental disorders (autism and Williams syndrome) can lead to delays or deficits in learning gaze following. In particular, our model is consistent with the idea that in autism an initial reduction in preference for faces might be at the root of a cascade of problems leading to deficits in gaze following and attention-sharing. Our account is also consistent with evidence of the success of therapeutic interventions where infants are explicitly rewarded for a desired behavior such as following gaze (Whalen & Schreibman, 2003). Finally, the model points to the possibility that various combinations of a few small alterations in the developing infant, none of which may be critical by itself, could conspire to produce severe deficits. This is consistent with the characterization of autism as a spectrum disorder.

While our account of deficits in gaze following in different developmental disorders may seem simplistic, it nevertheless offers important lessons. Most prominently, the model shows that very different causes can lead to deficits in the emergence of gaze following. These causes include (but are not limited to) parameters related to face perception, learning, habituation and value/reward systems. Given that several completely independent causes can all lead to deficits in gaze following, it appears ill-advised to use deficits in gaze following to define a disorder. This is still the case in autism, where deficits in social interaction skills such as gaze following are used to define the syndrome. Our hope is that computational modeling efforts like ours will help in understanding complex developmental disorders by helping to better differentiate symptoms and narrow down their primary causes. This, in turn, will suggest promising avenues for treatment and early diagnosis.

Cross-species differences

A good account of the emergence of gaze following should also explain differences in the emergence of gaze following behavior, or the complete absence of it, in other species. Since a simple Basic Set of structures and mechanisms is sufficient for gaze following to emerge, any species with the Basic Set should be able to acquire gaze following to some degree. Deficits or differences in the Basic Set may limit the emergence of gaze following, as seen in our discussion of developmental disorders.

Across vertebrate species some Basic Set elements such as habituation and reward-driven learning are essentially ubiquitous, suggesting that these are likely not the missing factors. This inference demands some caution, however, because the presence of, say, reward-driven learning does not mean that just any contingencies can be learned. Nevertheless, we feel that differences in other Basic Set elements are more relevant.

Regarding perceptual skills and preferences, the basic questions are how infants of other species might prefer to look at conspecifics, and how well they might distinguish different head or eye orientations. The first question can be studied with controlled preferential looking paradigms to evaluate visual preferences for looking at conspecifics (or humans) (e.g. Bard, Platzman, Lester & Suomi, 1992). Our model predicts that a (not too big) preference for looking at conspecifics’ faces is beneficial (although not strictly necessary) for gaze following to emerge.

In terms of the ability to distinguish different head or eye poses of conspecifics, there is evidence that, for example, many primate species can do so to some extent (Itakura, 2004). Interestingly, eye direction may be particularly easy to discern for humans because of the white sclera (Kobayashi & Kohshima, 1997; Emery, 2000). We assume that gaze direction (orientation of the eyes) is more informative than just head pose, but it is also harder to perceptually discriminate, because the eyes are small. A first attempt to relate such differences to our model is as follows. If an animal with a weaker perceptual system can only inaccurately estimate a conspecific’s head position, then this cue will be less predictive of interesting sights compared to accurate knowledge of the conspecific’s direction of gaze. Thus, as explained above, we can attempt to model limited perceptual skills by reducing the predictiveness of the caregiver’s gaze, pvalid. As our experiments showed, reducing pvalid slows the emergence of gaze following or prevents it altogether. Thus, some species may not learn to follow gaze at all or may only learn primitive forms of gaze following because their perceptual apparatus does not allow them to gather sufficiently accurate information about conspecifics’ gaze direction to make gaze following worthwhile. A more detailed analysis of the perceptual requirements for higher gaze following skills specifically implicates depth perception abilities and accuracy of gaze direction estimation as possible culprits (Lau & Triesch, 2004). Generally speaking, we can expect advanced gaze following skills only in those species that have adequate perceptual abilities.
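Why a low pvalid can make gaze following not worthwhile can be seen in a back-of-the-envelope calculation. The sketch below is a deliberate simplification and not part of the model: it assumes a single rewarding region among `n_regions` equally likely ones, that a valid cue always leads to the reward, and that an invalid cue never does. The function name and parameters are hypothetical.

```python
def following_advantage(p_valid, n_regions, target_reward=1.0):
    """Expected reward gained by following the gaze cue instead of
    looking at a randomly chosen region (simplified, see lead-in)."""
    expected_if_following = p_valid * target_reward
    expected_if_random = target_reward / n_regions
    return expected_if_following - expected_if_random

for p_valid in (1.0, 0.5, 0.1):
    print(p_valid, round(following_advantage(p_valid, n_regions=10), 3))
```

Under these assumptions the advantage of following the cue shrinks linearly as the cue becomes less reliable and vanishes once `p_valid` drops to chance level (one over the number of regions).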

A related issue is foveation. The more foveation there is in an animal’s visual system, the more important it is to look directly at the most relevant regions of the environment. Gaze following can help to identify such regions. At the same time, a more foveated vision system will be better at making fine discriminations, say, of a conspecific’s direction of gaze, which benefits gaze following. Thus, we suspect that there may be a correlation between the degree of foveation of a species’ visual system and its propensity to follow gaze.

Regarding a structured social environment, a first condition for the emergence of gaze following is that species must live in social groups. Further, the gaze of conspecifics must be predictive of informative events. Note that gaze can have a number of other meanings in social species that could potentially impact gaze following. For instance, gaze aversion is found in several monkey species (Argyle & Cook, 1976). In such species, direct eye contact is a gesture of aggression and it is particularly important for members of such species to be sensitive to direct versus averted gaze, as indicated by head and eye direction (Coss, 1978; Emery, 2000).

Point following

Although we have focused on gaze following in this paper, note that point following may be learned based on the same principles. Pointing with an outstretched and aligned arm, hand and finger is the most natural way to intentionally direct another’s attention to a new target, and caregivers and (older) infants do produce pointing gestures to direct each other’s attention (Bates, Camaioni & Volterra, 1975; Lempers, 1979; Leung & Rheingold, 1981). To model the emergence of point following, we could simply choose to identify different caregiver head poses in the current model with different pointing gestures performed by the caregiver. However, there are certain differences to consider. First, while the caregiver frequently shifts gaze, pointing gestures during naturalistic exchanges are rare by comparison (Deák et al., 2004). Second, pointing gestures are likely to be more salient for infants because of the large amount of movement involved. Third, infants may be better at discriminating pointing direction than head direction because the extended arm provides a better directional cue (Deák et al., 2000). Fourth, pointing gestures are likely to be more predictive of interesting events, because caregivers will tend to engage in this ‘effort’ only when a particularly relevant environmental stimulus is present. All but the first of these four points suggest that it might be easier for infants to learn point following. In fact, human infants by 9 months follow gaze much more reliably when it is accompanied by a point (Flom, Deák, Phill & Pick, 2003), and a quasi-naturalistic observational study shows that infants from 5 to 10 months are far more likely to follow a parent’s point than a parent’s gaze shift (Deák et al., 2004).

Future work

Of course, our model and the ones discussed above must be seen as only first steps towards a full computational account of the emergence of gaze following. In many respects, these models are still overly simplistic. Examples of simplifications in our model are the restriction to a small set of discrete spatial regions, the absence of peripheral vision and the stereotypic, non-interactive behavior of the caregiver model, just to name a few. Recent work has started to address some of these issues (Lau & Triesch, 2004; Teuscher & Triesch, 2004; Jasso & Triesch, 2004). Another limitation is that the model currently does not address how higher attention sharing skills may emerge. Future work needs to demonstrate that models such as the present one can be scaled up to explain the emergence of more advanced attention sharing skills. Despite these shortcomings and limitations, we think our model is a useful step in theorizing about the emergence of gaze following and share
