
Adapting the Turing Test for Embodied Neurocognitive Evaluation of Biologically-Inspired Cognitive Agents

Shane T. Mueller, Ph.D.∗
Klein Associates Division, ARA Inc.

Fairborn, OH 45324
[email protected]

Brandon S. Minnery, Ph.D.†
The MITRE Corporation

1750 Colshire Drive, McLean, VA
[email protected]

Abstract

The field of artificial intelligence has long surpassed the notion of verbal intelligence envisioned by Turing (1950). Consequently, the Turing Test is primarily viewed as a philosopher's debate or a publicity stunt, and has little relevance to AI researchers. This paper describes the motivation and design of a set of behavioral tests called the Cognitive Decathlon, which was developed to be a usable version of an embodied Turing Test that is relevant to, and achievable by, state-of-the-art AI algorithms in the next five years. We describe some of the background motivation for developing this test, and then provide a detailed account of the tasks that make up the Decathlon and the types of results that should be expected.

Can the Turing Test be Useful and Relevant?

Alan Turing (1950) famously suggested that a reasonable test for artificial machine intelligence is to compare the machine to a human (who we agree is intelligent): if their verbal behaviors and interactions are indistinguishable from one another, the machine might be considered intelligent. Turing proposed that the test should be limited to verbal interactions alone, and this is how the test is typically interpreted in common usage. For example, the $100,000 Loebner Prize is essentially a competition for designing the best chatbot. However, although linguistics remains an important branch of modern AI, the field has expanded into many non-verbal domains related to embodied intelligent behavior. These include the specialized fields of robotics, image understanding, motor control, and active vision. Consequently, it is reasonable to ask whether the Turing Test, and especially the traditional Verbal Turing Test (VTT), is still relevant today.

∗ Part of the research reported here was conducted as part of the U.S. DARPA program Biologically Inspired Cognitive Architectures, contract FA8650-05-C-7257, and presented at the 2007 BRIMS conference and the 2008 MAICS conference. Approved for Public Release, Distribution Unlimited.

† Part of the research reported here was conducted as part of the U.S. DARPA program Biologically Inspired Cognitive Architectures. Approved for public release, distribution unlimited. No. 07-0258.
Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Indeed, it is fair to say that almost no cutting-edge research in cognitive science or AI has the goal of passing the VTT. Some observers have suggested that the VTT is a stunt or a joke (e.g., Sundman, 2003), or an impossible goal that is not useful for current research (Shieber, 1994). Yet some have argued that the test is indeed relevant to the types of research being produced today. For example, Harnad (1990, 1991, 2001, 2004) argued that an embodied version of the Turing Test is consistent with Turing's original thought experiment, and that such a version matches the domains of today's research. This argument (expanded in the next section) suggests that we can still look to the Turing Test as a way to measure intelligence, but it presents a challenge as well. Given that even the VTT seems to be an impossible goal, embodied versions of the Turing Test (which are supersets of the VTT) would seem an even greater challenge. Yet, perhaps by relaxing some of the properties of the Turing Test, a version that is both relevant and useful to today's researchers can be framed.

Adapting the Turing Test for Modern Artificial Intelligence

A general statement of the Turing Test has three important aspects, each of which is somewhat ambiguous:

A machine can be considered intelligent if its behavior in (1) a specified domain is (2) indistinguishable from (3) human behavior.

The Domain of the Turing Test. The first aspect describes the domain of the test. Harnad (2004) argued that Turing's writings are consistent with the domain being a sliding scale, and he described five levels of Turing Test domains: (1) for limited tasks; (2) for verbal context; (3) for sensori-motor context; (4) for internal structure; and (5) for physical structure. Harnad argued that although Turing did not mean the first level (Turing-1), the Turing-2 test (which is the most common interpretation) is susceptible to gaming. A more powerful and relevant version consistent with Turing's argument is Turing-3: a sensori-motor Turing Test. This argument is useful because it means it is possible to develop versions of the Turing Test that are relevant to today's researchers.


However, because Turing-3 is a superset of Turing-2, it would be a greater challenge and perhaps even less useful than Turing-2, because it would be an even more difficult test to pass. Yet, the other two aspects of the test may suggest ways to design and implement a useful version of the test.

The Meaning of Indistinguishable. A second aspect of the Turing Test is that it looks for "indistinguishable" behavior. On any task, the range of human behavior across the spectrum of abilities can span orders of magnitude, and there are artificial systems today that outperform humans on quite complex but limited tasks. So, we might also specify a number of levels of "indistinguishable." At the minimum, consider the criterion of competence: the artificial system produces behavior that is at least as good as (and possibly better than) that of a typical human. This is a useful criterion in many cases (and is one that has placed humans and machines in conflict at least since John Henry faced off against the steam hammer). A more stringent criterion might be called resemblance, requiring that the system also exhibit the typical inadequacies shown by humans, such as characteristic time profiles and error rates. Here, the reproduction of robust qualitative trends may be sufficient to pass the test. A test with higher fidelity than resemblance might be called verisimilitude. For example, suppose a test required the agent to produce behavior such that, if its responses were presented along with corresponding responses from a set of humans on the same tasks, its data could not be picked out as anomalous.

The criterion of verisimilitude might be viewed as somewhat contentious, because an artificial agent that is smarter/stronger/better than its human counterpart might still be considered to exhibit embodied intelligence. Nevertheless, there are a number of contexts in which one might prefer verisimilitude over competence. For example, if one's goal is to develop an artificial agent that can replace a human as a teammate or adversary (e.g., for training, design, or planning), it can be useful for the agent to fail in the same ways a human fails. In other cases, if the agent is being used to make predictive assessments of how a human would behave in a specific situation, verisimilitude would be a benefit as well. Finally, this criterion can provide some tests of how an agent processes information and reasons: for example, if one's goal is to create a system that processes information like the human brain, verisimilitude can improve the chances of developing the right algorithms without having to understand exactly how the brain achieves the processing.

A criterion more stringent than verisimilitude might be called distributional: predicting distributions of human behavior. Given multiple repeated tests, the agent's behavior would reproduce the same distribution of results as a sample of humans produces.
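To make the distributional criterion concrete, the sketch below (ours, not part of the original test specification) shows one plausible way it might be operationalized: comparing an agent's response-time sample against a pooled human sample with a two-sample Kolmogorov-Smirnov test. The data, the lognormal distributions, and the acceptance logic are all illustrative assumptions.

```python
# A minimal sketch of a distributional check: compare agent and human
# response-time samples with a two-sample KS test. Data are synthetic.
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
human_rts = rng.lognormal(mean=-0.5, sigma=0.3, size=200)  # pooled human RTs (s)
agent_rts = rng.lognormal(mean=-0.5, sigma=0.3, size=200)  # agent RTs, same task

stat, p = stats.ks_2samp(human_rts, agent_rts)
# A high p-value means the agent's RT distribution cannot be distinguished
# from the human sample at this sample size; a low one means it can.
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```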

The Target of Intelligent Behavior. A third important aspect of the general Turing Test stated above is that an intelligent target that produces behavior must be specified. There is a wide range of abilities possessed by humans, and if we observe behavior that we consider intelligent in a non-human animal or system, it could equally well serve as a target for the Turing Test. So, at one end of the spectrum, there are the behaviors of top experts in narrow domains (e.g., chess grandmasters or baseball power hitters); at the other end of the spectrum, there are physically disabled individuals, toddlers, and perhaps even other animals who exhibit intelligent behavior. So, one way to frame a usable Turing-3 test is to choose a target that might be easier than an adult able-bodied human expert. The different versions of these three concepts are shown in Table 1.

This framework suggests that the Turing Test is indeed a reasonable criterion for assessing artificial intelligence, and that it is relevant for embodied AI. By considering a generalized form, there are a number of ways the test can be implemented with present technology that allow an embodied Turing-3 test to be constructed, tested, and possibly passed, even though the state of AI research is nowhere close to passing the traditional VTT.

In the remainder of this report, we describe such a plan for testing the embodied intelligence of artificial agents. It was an attempt to go beyond the VTT by incorporating a wide range of embodied cognitive tasks. In order to meet this goal, we chose a target at the lower end of the capability spectrum: performance that might be expected of a typical two-year-old human toddler. In addition, we relaxed the fidelity requirement to initially require competence, and later to require the reproduction of robust qualitative trends.

The Cognitive Decathlon

This research effort was funded as part of the first phase of DARPA's BICA program (Biologically-Inspired Cognitive Architectures). Phase I of the BICA program was the design phase, during which the set of tests described here was selected. Later phases of the program were not funded, and so these tests have not been used as a comprehensive evaluation suite for embodied intelligence. The simulated BICA agents were planned to be embodied either in a photorealistic virtual environment or on a robotic platform with controllable graspers, locomotion, and orientation effectors, with on the order of 20-40 degrees of freedom. The EU RobotCub project (Sandini, Metta, & Vernon, 2004) is perhaps the most similar effort, although that effort is focused on building child-like robots rather than designing end-to-end cognitive-biological architectures.

Goals

The primary goals of the BICA program were to develop comprehensive biological embodied cognitive agents that could learn and be taught like a human. The test specification was designed to promote these goals, encouraging the construction of models that were capable of a wide range of tasks, but that approached them as a coherent system rather than as a loose collection of subsystems designed to solve each individual task. Thus, we designed the test specification to: (1) encourage the development of a coherent, consistent, systematic cognitive system that can achieve complex tasks; (2) promote procedural and semantic knowledge acquisition through learning, rather than through programming or endowment by modelers; (3) involve tasks that go beyond the capabilities of traditional cognitive architectures, toward a level of embodiment inspired by human biology; and (4) promote and assess the use of processing and control algorithms inspired by neurobiological processes.


Table 1: Variations on three aspects of the Turing Test.

  Target            Fidelity                                               Domain (Harnad, 2000)
  1. Lower animals  1. Competence: can accomplish task target achieves     1. Local indistinguishability for specific task
  2. Mammals        2. Domination: behavior better than target             2. Global verbal performance
  3. Children       3. Resemblance: reproduces robust qualitative trends   3. Global sensorimotor performance
  4. Typical adult  4. Verisimilitude: cannot distinguish measured         4. External & internal structure/function
                       behavior from target behavior
  5. Human expert   5. Distributional: produces range of behavior          5. Physical structure/function
                       for target population


To achieve these goals, we designed three types of tests: the Cognitive Decathlon (which is the focus of this report); integrative "challenge scenarios"; and a set of Biovalidity Assessments. The Challenge Scenarios were designed to require interaction among different subsystems in order to achieve a high-level task. The Biovalidity Assessments were designed to determine the extent to which the artificial systems used computational systems inspired by neurobiology. The Cognitive Decathlon was intended to provide detailed tests of core cognitive functions, and to provide stepping stones along the way to achieving the more complex Challenge Scenario tasks.

Design of the Cognitive Decathlon

Like the Olympic decathlon, which attempts to measure the core capabilities of an athlete or warrior, the Cognitive Decathlon attempts to measure the core capabilities of an embodied cognitive human or agent. To enable an achievable capability level within the scope of the program, the target behavior of a two-year-old human toddler was selected. There were many motivations for this target, but one central notion is that if one could design a system with the capabilities of a two-year-old, it might be possible to essentially grow a three-year-old, given realistic experiences in a simulated environment. The tasks we chose covered a broad spectrum of verbal, perceptual, and motor tasks, attempting to cover many of the intelligent behaviors of a toddler, and the assessment criteria were planned to require competence in early years, and reproduction of robust qualitative trends in later years.[1]

Research on human development has shown that by 24 months, children are capable of a large number of cognitive, linguistic, and motor skills. For example, according to the Hawaii Early Learning Profile development assessment, the linguistic skills of a typical 24-month-old child include the ability to name pictures, use jargon, use 2-3 word sentences, produce 50 or more words, answer questions, and coordinate language and gestures.

[1] In the scope of the BICA program, the ability of agents to achieve specific performance criteria was a requirement for continued funding in subsequent phases.

Their motor skills include walking, throwing, kicking, and catching balls, building towers, carrying objects, folding paper, simple drawing, climbing, walking down stairs, and imitating manual and bilateral movements. Their cognitive skills include matching (names to pictures, sounds to animals, identical objects, etc.), finding and retrieving hidden objects, understanding most nouns, pointing to distant objects, and solving simple problems using tools (Parks, 2006).

To develop the Decathlon, we began by examining hundreds of empirical tasks studied by psychologists in research laboratories. From these, we selected a set of specific tests for which (1) human performance was fairly well understood; (2) there typically existed computational or mathematical models accounting for the behavior; (3) the skills involved related to the core abilities of a two-year-old child; and (4) the skills were components frequently integrated to accomplish more complex tasks. Basic descriptions of these tasks are provided below, along with some information regarding human performance on the tasks.

Our basic taxonomy of tasks is shown in Table 2. We identified six taxons that describe basic skill types, and which are tied back to distinct neural or biological systems. A number of other taxonomies of cognitive skill have been used in other contexts. For example, the Hawaii Early Learning Profile (Parks, 2006) describes six taxons: cognitive, language, gross motor, fine motor, social, and self-help. Our taxons focus on the first four of these, and view social interaction as a ubiquitous manner of interacting that falls outside the scope of the taxonomy. As another example, the Army's IMPRINT tool recognizes nine taxons: visual, numerical, cognitive, fine motor discrete, fine motor continuous, gross motor heavy, gross motor light, communication (reading and writing), and communication (oral) (Allender, Salvi, & Promisel, 1997). Our taxonomy covers more than half of these domains, avoiding reading, writing, and numerical skills. Our selection of tasks was guided by the desire to have fairly comprehensive coverage of low-level core cognitive skills, while highlighting tasks that standard AI approaches would accomplish in ways fundamentally different from human performers.

Visual Identification

The ability to identify visual aspects of the environment is a critical skill used for many tasks faced by humans. In the Decathlon, this skill is captured in a graded series of tests that determine if an agent can tell whether two objects or events are identical.


Table 2: Component tasks of the Cognitive Decathlon.

  Task                               Level
  1. Vision                          Invariant Object Identification
                                     Object ID: Size Discrimination
                                     Object ID with Rotation
                                     Object ID: Relations
                                     Visual Action/Event Recognition
  2. Search and Navigation           Visual Search
                                     Simple Navigation
                                     Traveling Salesman Problem
                                     Embodied Search
                                     Reinforcement Learning
  3. Manual Control and Learning     Motor Mimicry
                                     Simple (1-hand) Manipulation
                                     Two-hand Manipulation
                                     Device Mimicry
                                     Intention Mimicry
  4. Knowledge Learning              Episodic Recognition Memory
                                     Semantic Memory/Categorization
  5. Language and Concept Learning   Object-Noun Mapping
                                     Property-Adjective Mapping
                                     Relation-Preposition Mapping
                                     Action-Verb Mapping
                                     Relational Verb-Coordinated Action
  6. Simple Motor Control            Eye Movements
                                     Aimed Manual Movements


The notion of sameness is an ill-defined and perhaps socially constructed concept (cf. French, 1995), and this ambiguity helped structure a series of graded tests related to visual identification. Typically, objects used for identification should be comprised of two or more connected components, have one or more axes of symmetry, and have color and weight properties. Objects can differ in color, weight, size, component structure, relations between components, time of perception, movement trajectory, location, or orientation. In these tasks, color, mass, size, and component relations are defined as integral features of an object, and differences along these dimensions should be deemed sufficient to consider two objects different. Neuropsychological findings (e.g., Wallis & Rolls, 1997) show that sameness detection is invariant to differences in translation, visual size, and view, so differences along these dimensions should not be considered sufficient to indicate difference.

The object recognition tasks are important tests of biological intelligence, both because object recognition is a fundamental means by which we interact with the world, and because the machine vision community has developed many successful algorithms that are not inspired by biological structures.

In the basic task, the agent should be shown two objects, and be required to determine whether the objects are the same or different. For each variation, both "same" and "different" trials should be presented. The different variations include:

Invariant Object Recognition. The goal of this trial type is to provide a simple task that rudimentary visual systems can accomplish. On "same" trials, the objects should be oriented in the same direction. On "different" trials, objects should differ along color, visual texture, or shape properties.

Size Differences. An object is perceived as maintaining a constant size even when its distance to the observer (and thus the size of its proximal stimulus) changes. In fact, neural mechanisms have developed that are sensitive to shape similarities regardless of size (Wallis & Rolls, 1997). This type of trial should test the ability to discriminate size differences between two identically shaped objects. Success in the task is likely to require incorporating at least one other type of information, such as body position, binocular vision, or other depth cues.

Identification Requiring Rotation. Complex objects often need to be aligned and oriented in some way to detect sameness. Adult humans can often accomplish this through "mental rotation" (Shepard & Metzler, 1971), although other strategies (physical rotation, or even moving to different viewing positions) can also succeed. On these trials, identical objects should be rotated along two orthogonal axes, so that physical or mental rotation is required to correctly identify whether they are the same or different. Typical human response times for both same and different trials increase as the angle of rotation increases, a result that may be diagnostic of the computational representations used by the agent.
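As an illustration of how this diagnostic might be scored, the sketch below (ours, not part of the original specification) regresses hypothetical agent response times on angular disparity; the timing values are illustrative assumptions, and the roughly one second per 60 degrees human rate (Shepard & Metzler, 1971) frames what a qualitatively human-like slope would look like.

```python
# A minimal sketch of the mental-rotation diagnostic: regress agent RTs on
# rotation angle and inspect the slope. All RT values here are synthetic.
import numpy as np

angles = np.array([0, 40, 80, 120, 160])  # angular disparity (degrees)
agent_rt = 1.0 + 0.01 * angles + np.random.default_rng(1).normal(0, 0.05, 5)

slope, intercept = np.polyfit(angles, agent_rt, 1)
# A clearly positive slope is the qualitative trend the resemblance criterion
# asks for; a flat profile would suggest a non-rotational (e.g.,
# feature-matching) strategy instead.
print(f"RT slope = {slope * 1000:.1f} ms/degree")
```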

Relation Identification. As described earlier, the objects used in these tasks should have multiple components, and identification requires an understanding of the relations between these components. As a greater challenge, simple spatial relations among sets of objects should also be tested. These should map onto the prepositions tested in the language skills tasks.

Event Recognition. Perceptual identification is not just static in time; it also includes events that occur as a sequence of movements along a trajectory in time. This trial type examines the agent's ability to represent and discriminate such events. The two objects should repeat through a short, equally timed event loop (e.g., rotating, moving, bouncing, etc.), and the agent should be required to determine whether the two events are the same or different.

Search and Navigation

A critical skill for embodied agents is the ability to navigate through and learn about their environment. Search and navigation tasks form a fundamental cognitive skillset used by lower animals and adult humans alike. Furthermore, many automated search and navigation systems employ optimization techniques, or require GPS navigation or terrain databases to succeed. A fundamental property of human navigation is that we do not require these external aids; indeed, we learn the terrain by experiencing it.


Thus, search and navigation tasks can be useful in discriminating biological from non-biological spatial reasoning systems. A graded series of Decathlon events tests these abilities.

Visual Search. A core skill required for many navigation tasks is the spatial localization of a target. In the visual search task, the agent should view a visual field containing a number of objects, including a well-learned target. The agent should determine whether the target is or is not present. Behavior similar to human performance for simple task manipulations should be expected (e.g., both color-based pop-out and deliberate search strategies should be observed; cf. Treisman & Gelade, 1980).
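The pop-out versus deliberate-search contrast can be summarized by RT-by-set-size slopes, in the spirit of Treisman and Gelade (1980). Below is a minimal sketch, with purely illustrative timing parameters of our own, of the diagnostic one might compute from an agent's search data: a near-zero slope for feature pop-out, and tens of milliseconds per item for conjunction search.

```python
# A minimal sketch of the feature-integration diagnostic: RT should be
# nearly flat in set size for "pop-out" targets but grow roughly linearly
# for conjunction targets. All timing values are synthetic assumptions.
import numpy as np

set_sizes = np.array([4, 8, 16, 32])
popout_rt = 0.45 + 0.001 * set_sizes       # ~flat: target found preattentively
conjunction_rt = 0.45 + 0.020 * set_sizes  # ~linear: serial deliberate search

for label, rts in [("pop-out", popout_rt), ("conjunction", conjunction_rt)]:
    slope = np.polyfit(set_sizes, rts, 1)[0] * 1000
    print(f"{label:12s} search slope: {slope:5.1f} ms/item")
```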

Simple Navigation. In this task, the agent should have the goal of finding and moving to a target (e.g., a red light) in a room containing obstacles. Obstacles of different shapes and sizes should be present in the room (to allow landmark-based navigation), and should change from trial to trial (to prevent learning of specific configurations). For simple versions of the task, the target should be visible to the agent from its starting point, but difficulty can be increased by allowing obstacles to occlude the target either at the beginning of the trial or at intermediate points. Agents should be assessed on their competency in the task, as well as on performance profiles in comparison to human solution paths.

Traveling Salesman Problem. A skill required for many spatial reasoning tasks is the ability to plan an efficient path through multiple points of interest. This skill has been studied in humans in the context of the Traveling Salesman Problem (TSP).

The TSP belongs to the class of problems that are NP-complete, which means that algorithmic solutions potentially require exhaustive search through all possible paths to find the best solution. This is computationally intractable for large problems, and so it presents an interesting challenge for problem-solving approaches that rely on search through a problem space. Such approaches could produce solution times that scale as a power of the number of cities, and would never succeed at finding efficient solutions to large problems. Yet human solutions to the problem are typically close to optimal (about 5% longer than the minimum path) and efficient (solution times are linear in the number of cities), suggesting that human solutions to the task are fundamentally different from traditional approaches in computer science. Recent research (e.g., Pizlo et al., 2006) has suggested that the multi-layered pyramid structure of the visual system enables efficient solutions to the task, and that such skills may form the basis of many human navigation abilities.

For this task, the agent should have the goal of visiting a set of target locations in a room. Once visited, each target light can disappear, to enable task performance without the need to remember all previously visited locations. The agents' performance should be assessed primarily on competence (the ability to visit all targets), and secondarily on comparison to robust behavioral findings regarding this task (solution paths close to optimal, with solution times roughly linear in the number of targets).
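For concreteness, the sketch below (an illustration of ours, not the evaluation code, and not an implementation of the pyramid model of Pizlo et al.) contrasts exhaustive search, whose cost explodes with the number of targets, against a simple greedy nearest-neighbor heuristic, and reports the competence measure suggested above: percent excess over the optimal path length.

```python
# A minimal sketch contrasting exhaustive TSP search with a greedy
# nearest-neighbor heuristic on a small random instance of target points.
import itertools, math, random

def path_length(points, order):
    """Total length of the open tour visiting points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def exhaustive(points):
    """Optimal open tour by brute force; cost grows factorially with targets."""
    return min(itertools.permutations(range(len(points))),
               key=lambda o: path_length(points, o))

def nearest_neighbor(points):
    """Greedy heuristic: repeatedly visit the closest unvisited target."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(points[order[-1]], points[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(8)]
opt = path_length(pts, exhaustive(pts))
greedy = path_length(pts, nearest_neighbor(pts))
print(f"greedy tour is {100 * (greedy / opt - 1):.1f}% longer than optimal")
```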

Embodied Search. Search ability requires some amount of metaknowledge, such as the ability to remember locations that have already been searched. In this task, the agent must find a single target light occluded in such a way that it can only be seen when approached. Multiple occluders not containing the target should be present in the search area. Performance should be expected to be efficient, with search time profiles and perseveration errors (repeated examination of individual occluders) resembling human data.

Reinforcement Learning. The earlier search tasks have fairly simple goals, yet our ability to search and navigate often supports higher-order goals such as hunting, foraging, and path discovery. Reinforcement learning plays an important role in these more complex search tasks, guiding exploration to produce procedural skill, and tying learning to motivational and emotional systems. To better test the ways reinforcement learning contributes to search and navigation, this task requires the agents to perform a modified search task that closely resembles tasks such as the N-armed bandit (e.g., Sutton & Barto, 1998) or the Iowa Gambling Task (e.g., Bechara et al., 1994).

The task is similar to the Embodied Search task, but the target light should be hidden probabilistically in different locations on each trial. Different locations should be more or less likely to contain the hidden object, which the agent is expected to learn and exploit accordingly. The probabilistic structure of the environment may change mid-task, as happens in the Wisconsin Card Sort (Berg, 1948), and behavior should be sensitive to such changes, moving away from exploitation toward exploration in response to repeated search failures.
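A minimal sketch of the kind of learner this task probes is given below: an epsilon-greedy agent on a three-location bandit whose payoff probabilities swap mid-task. All parameter values and the specific update rule are illustrative assumptions on our part; the point is that an incremental value update lets behavior recover after the unsignaled change.

```python
# A minimal sketch: epsilon-greedy learning on a bandit whose reward
# probabilities shift halfway through (a WCST-like contingency change).
import random

def run_bandit(trials=2000, epsilon=0.1, alpha=0.1):
    probs = [0.2, 0.8, 0.5]          # hidden reward probability per location
    q = [0.0] * len(probs)           # learned value estimates
    rewards = []
    for t in range(trials):
        if t == trials // 2:         # unsignaled mid-task change
            probs = [0.8, 0.2, 0.5]
        arm = (random.randrange(len(q)) if random.random() < epsilon
               else max(range(len(q)), key=q.__getitem__))
        r = 1.0 if random.random() < probs[arm] else 0.0
        q[arm] += alpha * (r - q[arm])   # incremental update tracks the change
        rewards.append(r)
    return sum(rewards[-200:]) / 200     # late-task exploitation rate

print(f"mean reward near end of task: {run_bandit():.2f}")
```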

Reinforcement learning goes beyond just spatial reasoning, and indeed is an important skill in its own right. Although machine learning has long been tied closely to psychological and biological learning theory (cf. Bush & Mosteller, 1951; Rescorla & Wagner, 1972), advances in machine learning have produced systems that can out-learn humans in limited domains. Thus, tests of learning can provide a good test of biological inspiration, and can discriminate between biological and non-biological mechanisms.

Simple Motor Control

A critical aspect of embodied intelligence is the ability to control motor systems. These tests are designed to compare some aspects of low-level motor control to human counterparts; later tests (in the section "Manual Control & Learning") require more complex motor skills. The motivation for these tasks is that low-level performance constraints imposed by these control mechanisms can have cascading effects that impact performance on many higher-level tasks. These biological factors place strong constraints on task performance that are not necessarily faced by robotic or engineered control mechanisms, and so they offer discriminative tests of biological inspiration.


Saccadic and Smooth-Pursuit Eye Movements. Humans use two basic forms of voluntary eye movement (cf. Krauzlis, 2005): saccades, which are ballistic movements to a specific location or target, occurring with low latency and brief duration; and pursuit movements, which are smooth, continuous movements that follow specific targets. Saccadic movements should be tested by presenting target objects in the visual periphery, to which the agent should shift its eyes in discrete movements, with time and accuracy profiles similar to humans'. Pursuit movements should be tested by requiring the agent to track objects with its eyes, moving along trajectories and at velocities similar to those humans are capable of tracking.

Aimed Manual Movement. Fitts's (1954) law states that the time required to make an aimed movement is proportional to the log of the ratio between the distance moved and the size of the target. Agents should be tested on their ability to make aimed movements to targets of varying sizes and distances, and should be expected to reproduce Fitts's law at a qualitative level.
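As a concrete reference point, the sketch below computes predicted movement times using the common Shannon formulation of Fitts's law, MT = a + b log2(D/W + 1). The coefficients a and b are illustrative assumptions; an agent's measured times could be regressed against the index of difficulty log2(D/W + 1) to check for the qualitatively linear relationship the test expects.

```python
# A minimal sketch of the Fitts's law diagnostic using the Shannon
# formulation; the coefficients a and b are illustrative, not fitted.
import math

def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Predicted movement time (s); distance and width in the same units."""
    return a + b * math.log2(distance / width + 1)

# Larger distances and smaller targets raise the index of difficulty,
# and hence the predicted movement time.
for d, w in [(8, 2), (16, 2), (32, 2), (32, 8)]:
    print(f"D={d:2d} W={w}: predicted MT = {fitts_movement_time(d, w):.2f} s")
```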

Manual Control & Learning

Building on these simple motor skills, embodied agents should have the ability to control arms and graspers to manipulate the environment. The following tasks evaluate these skills in a series of increasingly complex tests.

Motor Mimicry. One pathway to procedural skill is the ability to mimic the actions of others. This task tests this skill by evaluating the agent's ability to copy manual movements. For this task, the agent should replicate hand movements of an instructor (with identical embodiment), including moving fingers, rotating hands, moving arms, touching locations, etc. This test should not include the manipulation of artifacts, or require moving two hands/arms in a coordinated manner. Mimicry should be egocentric and not driven by shared attention to absolute locations in space, and criteria regarding left-right (mirror-image) errors can be relaxed. Agents should be assessed on their ability to mimic these novel actions, and on the complexity of the actions that can be mimicked.

Simple (One-hand) Manipulation. A more complex form of mimicry involves interacting with objects in a dexterous fashion. The agent should be expected to grasp, pick up, rotate, move, put down, push, or otherwise manipulate objects, copying the actions of an instructor. Given the substantial skill required to coordinate two hands, all manipulations in this version of the task should involve a single arm/grasper. The agent should be expected to copy the instructor's action with its own facsimile of the object. Mimicry is expected to be egocentric and not based on shared attention, although produced actions can be mirror images of the instructor's. Agents should be assessed on their ability to mimic these novel manipulations, and on the complexity of the actions they are able to produce.

Two-hand Manipulation. With enough skill, an agent should be able to mimic two-hand coordinated movement and construction. Actions could include picking up objects that require two hands, assembling or breaking apart two-piece objects, etc. Evaluation should be similar to the Simple Manipulation task, but with these more complex objects and actions.

Device Mimicry. Although the ability to mimic the actions of a similar instructor is a critical sign of intelligence, human observational learning allows for more abstract mimicry. For example, a well-engineered mirror-neuron system might be able to map observed actions onto the motor commands used to produce them, but might fail if the observed actions are produced by a system that physically differs from the agent, if substantial motor noise exists, or if the objects the teacher is manipulating differ from the ones the learner is using. This task goes beyond direct mimicry of action to tasks that require the mimicry of complex tools and devices, and (in a subsequent task) of the teacher's intent.

The task involves learning how a novel motor action maps onto a physical effect in the environment. The agent should control a novel mechanized device (e.g., an articulated arm or a remote-control vehicle) by pressing several action buttons, with the goal of accomplishing some task. The agent should be given the opportunity to explore how the actions control the device. When it has sufficiently explored the control of the device, the agent should be tested by an instructor who controls the device to achieve a specific goal (e.g., moving to a specific location). The instructor's control operations should be visible to the agent, so that it can repeat the operations exactly if it chooses. The instructor should demonstrate the action, and should repeat the sequence if requested.

Intention Mimicry. This task is based on the Device Mimicry task, but tests more abstract observational learning, in order to promote understanding of the intent and goals of the teacher. The agent should observe a controlled simulated device (robot arm/remote-control vehicle) accomplish a task that requires solving a number of sub-goals. The instructor's operation sequence should not be visible to the agent, but the agent should be expected to (1) achieve the same goal, and (2) do so in a way similar to how the instructor did. Performance success and deviation from the standard should be assessed.

Knowledge Learning

Humans learn incidentally about their environment, without needing to explicitly decide that objects and events must be committed to memory. The tests described next include several memory assessments that determine the extent to which the knowledge memory system produces results resembling robust human behavioral findings.

Episodic Recognition Memory. A key capability required for episodic memory is the ability to remember a specific occurrence of known objects or events in a specific context. For this test, an agent should be allowed to explore a room containing a series of configurations of objects.


After a short break, the agent should be shown a new set of object configurations and be required to determine which of them had been seen during the learning period. Agents should display the robust qualitative trends exhibited by humans in such tasks. For example, they should be better at identifying objects that were given more study time, and should show increased false alarms for new configurations of previously seen objects.

Semantic Gist/Category Learning. An important aspect of human semantic memory is the ability to extract the basic gist or meaning from complex and isolated episodes. This skill is useful in determining where to look for objects in search tasks, and it supports the ability to form concept ontologies and fuzzy categories.

The agent should view a series of objects formed from a small set of primitive components. Each object should be labeled verbally by the instructor, and the objects should fall into a small number of categories (e.g., 3-5). No two objects should be identical, and the distinguishing factors should be both qualitative (e.g., the type of component or the relation between two components) and relative (e.g., the size of components). Following study, the agent should be shown novel objects and asked whether each belongs to a specific category ("Is this a DAX?"). Category membership should not be exclusive, should be hierarchically structured, and could depend probabilistically on the presence of features and on the co-occurrence of and relationships between features. Agents should be expected to categorize novel objects in ways similar to human categorization performance.
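One way to picture the graded, non-exclusive category judgments described above is an exemplar model in the spirit of the generalized context model. The sketch below is an illustrative assumption on our part, not a prescribed scoring method; the feature vectors, labels, and similarity parameter are all invented for the example.

```python
# A minimal sketch of exemplar-based categorization: a probe's category
# probability comes from summed similarity to stored labeled exemplars.
import math

def similarity(x, y, c=2.0):
    """Exponential similarity decaying with distance between feature vectors."""
    return math.exp(-c * math.dist(x, y))

def category_probability(probe, exemplars):
    """P(label | probe) from summed similarity to each labeled exemplar."""
    totals = {}
    for features, label in exemplars:
        totals[label] = totals.get(label, 0.0) + similarity(probe, features)
    z = sum(totals.values())
    return {label: s / z for label, s in totals.items()}

# Each exemplar: (feature vector such as [size, n_components], category label).
study = [([1.0, 2.0], "DAX"), ([1.2, 2.1], "DAX"), ([3.0, 5.0], "BLICKET")]
print(category_probability([1.1, 2.2], study))  # graded, non-exclusive membership
```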

Language/Concept Learning

Language understanding plays a central role in instruction and tasking, and language ability opens up the domain of tasks that can be performed by the agents. Furthermore, traditional versions of the Turing Test were solely linguistic, which makes language an important skill for intelligent agents. Language grounding is a critical aspect of language acquisition (cf. Landau et al., 1998), and the following series of tests evaluates an agent's ability to learn mappings between physical objects or events and the words used to describe them. For each test type, the agent should be shown examples with verbal descriptions, and later be tested on yes-no transfer trials. Brief descriptions of each test type are given below.

Object-Noun Mapping. One early language skill developed by children is the ability to name objects (Gasser & Smith, 1998), and even small children can learn object names quickly from few examples. This test examines the ability to learn the names of objects.

Property-Adjective Mapping. A greater challenge is learning how adjectives refer to properties of objects and can apply to a number of objects. Such skill follows object naming (e.g., Gasser & Smith, 1998) and typically requires more repetitions to master. This test examines the ability of an agent to learn adjectives, and to recognize their corresponding properties in novel objects.

Spatial Relation-Preposition Mapping. Research has suggested that many relational notions are tied closely to the language used to describe them. Spatial relations involve relations between objects, and so rely not just on the presence of components but on their relative positions. This test examines the ability of an agent to infer the meaning of a relation, and to recognize that relation in new episodes.

Action-Verb Mapping. Recognition is not just static in time; it also involves events occurring over time. Furthermore, verbs describing these events are abstracted from the actor objects performing them, and represent a second type of relation that must be learned about objects (Gentner, 1978). This test examines the ability of the agent to represent such events and the verb labels given to them, and to recognize the action taking place with new actors in new situations.

Multi-object Action to Relational Verb Mapping. The most complex linguistic structure tested should involve relational verbs, which can describe multi-object actions whose relationship is critical to correct interpretation. For example, in the statement "The cat chased the dog," the mere co-presence of dog and cat does not unambiguously define the relationship. This test examines the ability of the agents to understand these types of complex linguistic structures and how they relate to events in the visual world.

Connections Between Tasks

The previous section provided a very elementary description of the set of empirical tasks that we proposed to use for measuring comprehensive embodied intelligence of cognitive agents. Within each group, there are obvious relations between tasks, and many sub-tasks are simply elaborations or variations of other sub-tasks. However, an important aspect of human intelligence is how we use multiple systems together. For example, research on "active vision" has shown the importance of understanding how visual processing and motor control together provide simple accounts of phenomena that appear complex when approached from traditional visual processing perspectives.

Figure 1 depicts some of the strong connections between tasks in different domains. For example, there is a strong correspondence between the visual identification of objects, relations, and events, and the use of linguistic forms such as nouns, adjectives, and verbs. As a result, a strong emphasis was placed on language tasks that were grounded in the environment, or that could be used as a means to instruct the agent to perform specific tasks.

The connections between tasks are best illustrated by describing some of the integrated "challenge scenarios" that were also part of the BICA evaluation but are not described in detail here. For example, in one scenario, called "The Egg Hunt," the agent was expected to search for an object in a set of rooms with obstacles. For advanced variations of the task, the agent would be given a verbal instruction describing the object ("Bring me the red basket"). A surprising number of core Decathlon tasks would be required to accomplish this fairly simple task.


Figure 1: Graphical depiction of the Cognitive Decathlon. Grey rounded boxes indicate individual tasks that require the same basic procedural skills. Black rectangles indicate individual trial types or task variations. Lines indicate areas where there are strong relationships between tasks.

For example, in the language tasks, the agent would have learned the color property red, the name basket, and perhaps the meaning of the word "find"; in the knowledge tasks, the agent may have learned the basic shape category of a basket; and searching the rooms requires skills tested in the visual search, embodied search, simple navigation, and TSP tasks. To identify the basket, the agent would draw on skills required for invariant object recognition, as well as for object identification requiring rotation or involving size differences. Along with the eye movements required to perform visual search, the agent would require at least the skill of simple manipulation, and possibly aspects of motor mimicry and device mimicry if it needed to be taught how to carry a basket.

Biovalidity Assessment

As a complement to the Challenge Scenarios and Cognitive Decathlon, a parallel evaluation plan was developed for the BICA program to assess the degree to which an agent's cognitive architecture reflects brain-based design principles, computations, and mechanisms. These Biovalidity Assessments are intended not so much as a "Neural Turing Test," but rather as a means to (1) compel teams to explore neurobiologically inspired design solutions, and (2) enable comparisons between an agent's cognitive architecture and that of a mammalian brain. The Biovalidity Assessments are structured to occur in three consecutive stages over the course of a five-year program, with the idea that teams will continually refine their architectures based on insights from biological comparisons. During this timeframe, emphasis gradually shifts from evaluations that permit each team to define and test its own claims to biological validity, toward evaluations that require all teams to test their architectures against common neural data sets, including functional neuroimaging data recorded from human subjects as they perform Challenge Scenario and Decathlon tasks. The use of common neural data sets is intended to facilitate comparison across teams, and to better focus discussion as to which approaches are most successful on certain tasks and why.

Stage 1: Overall Neurosimilitude (Year 1)

Neurosimilitude refers to the degree to which a model incorporates the design principles, mechanisms, and computations characteristic of neurobiological systems. To effectively demonstrate neurosimilitude, teams should describe in detail the mapping of model components to brain structures, and comment on the connective topology of their model with respect to that of the brain. Assertions should be backed by references to the neuroscience literature, including both human and animal studies. Teams should not be required to capture neurobiological details at very fine scales (e.g., multi-compartment Hodgkin-Huxley-type models); however, to the extent that teams can demonstrate that modeling micro-level details of neural systems contributes to behavioral success beyond what can be accomplished with more coarse-grained models, the inclusion of such details should be encouraged.


Stage 2: Task-Specific Assessments (Years 2-3)

Stage 2, Year 2, affords each team the opportunity to compare the activity of its model to data from the existing neuroscience literature in a task-specific context. First, each team should select several cognitive functions, or skills, that feature prominently within one of the Challenge Scenarios or Decathlon events. It would be expected that teams would select those skills/tasks that highlight the biologically inspired capabilities of their own architectures. For instance, a team whose architecture includes a detailed model of the hippocampus might choose a task involving spatial navigation, and might choose to show that path integration in their model occurs via the same mechanisms as in the rat hippocampus. Similarly, a team whose architecture employs a temporal-difference reinforcement learning algorithm to perform a task might compare prediction-error signaling in their model to that reported in neuroscience studies involving similar tasks. Teams should not be required to perform parametric fits to published data sets; rather, teams should be assessed according to how well their models capture important qualitative features of the neural processes known to support key behaviors. Since, in the first year of Stage 2, teams would select for themselves the sub-tasks against which their models will be assessed, each team would in effect have considerable influence over how its architecture is evaluated.

In Stage 2, Year 3, teams would again compare model performance to existing neuroscience data in the context of the Challenge Scenarios and/or Decathlon tasks. This time, however, all teams would be required to focus on the same set of tasks, which would be selected by the evaluation team. The emphasis on a common set of tasks is meant to facilitate comparison across models, and to compel each team to begin thinking about biological inspiration in domains other than those at which their models already excel.

Stage 3: Human Data Comparisons (Years 4-5)

In Stage 3, teams should compare model activity to human functional neuroimaging (e.g., fMRI) data recorded from subjects performing actual Challenge Scenario and Decathlon tasks. Whereas Stage 2 involves comparisons to previously published neuroscience data, Stage 3 would allow for a more direct comparison between model and neural data, since models and humans would be performing very similar, if not identical, tasks.

To allow for comparisons with fMRI data, teams should generate a simulated BOLD signal using methods of their own choosing, and should compare the performance profile of their model to that of the human brain during discrete task elements, with a focus on identifying which model components are most strongly correlated with which brain areas during which tasks, and on how variations in the patterns of correspondence between model and brain activity predict performance across a range of tasks. (For examples of simulated brain imaging studies, see Arbib et al., 2000, and Sohn et al., 2005.) Such comparisons would provide a solid empirical platform from which teams could demonstrate the incorporation of neurobiological design principles.

Moreover, it is anticipated that Stage 3 comparisons would generate new insights as to how teams might further incorporate biologically inspired ideas to enhance the functionality of their models.
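As one concrete possibility (teams were explicitly free to choose their own methods), a simulated BOLD signal is often produced by convolving a model component's activity trace with a canonical double-gamma hemodynamic response function, as sketched below. The HRF shape and parameters here are conventional illustrative values, not a BICA requirement.

```python
# A minimal sketch of one common BOLD simulation: convolve model activity
# with a double-gamma hemodynamic response function (illustrative parameters).
import numpy as np

def canonical_hrf(dt=0.1, duration=30.0):
    """Double-gamma HRF sampled every dt seconds: early peak, late undershoot."""
    t = np.arange(0, duration, dt)
    peak = t ** 5 * np.exp(-t)        # gamma-like peak around 5 s
    under = t ** 15 * np.exp(-t)      # later undershoot around 15 s
    hrf = peak / peak.max() - 0.35 * under / under.max()
    return hrf / hrf.sum()

def simulated_bold(activity, dt=0.1):
    """Convolve a model activity time series with the canonical HRF."""
    return np.convolve(activity, canonical_hrf(dt))[: len(activity)]

# Example: a model module active for 2 s starting at t = 5 s.
activity = np.zeros(300)              # 30 s sampled at 10 Hz
activity[50:70] = 1.0
bold = simulated_bold(activity)
print(f"peak BOLD response at t = {bold.argmax() * 0.1:.1f} s")
```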

As in Stage 2, the first year of Stage 3 would require each team to identify several cognitive skills/tasks of its own choosing for which it would demonstrate a compelling relationship between model activity and neural data. Likewise, the second year of Stage 3 would involve a common set of tasks, so as to facilitate comparisons across teams. In order to take advantage of access to human brain data, the selected tasks would be expected to differentially involve higher-order cognitive faculties associated with human intelligence (e.g., language, symbol manipulation). It is expected that there would be significant methodological challenges involved in parsing and interpreting data from tasks as open-ended as the Challenge Scenarios, in which a subject may select from a near-infinite repertoire of actions at any point within a continuum of events. However, the risks involved in this approach are outweighed by the potential insights that may be gained from the ability to compare the dynamics of model activity and human brain activity in the same task environment.

Discussion

This report describes the motivation and design of the "Cognitive Decathlon," an embodied version of the Turing Test designed to be useful and relevant for the current domains of study in artificial intelligence. The goal was to design a comprehensive set of tests that could be accomplished by a single intelligent agent using technology available in the next five years. Although the program for which the test was developed was not continued beyond its first phase, it is hoped that this work (1) provides a new approach that allows the Turing Test to be useful and relevant for today's researchers; (2) suggests a comprehensive set of tests that cover a wide range of embodied cognitive skills; and (3) identifies ways in which these core skills are interrelated, providing a rationale for tests of embodied intelligence.

References

Allender, L., Salvi, L., & Promisel, D. (1997). Evaluation of human performance under diverse conditions via modeling technology. In Proceedings of the Workshop on Emerging Technologies in Human Engineering Testing and Evaluation, NATO Research Study Group 24, Brussels, Belgium, June 1997.

Arbib, M. A., Billard, A., Iacoboni, M., & Oztop, E. (2000). Synthetic brain imaging: Grasping, mirror neurons and imitation. Neural Networks, 13, 975-997.

Bechara, A., Damasio, A. R., Damasio, H., & Anderson, S. W. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50, 7-15.

Berg, E. A. (1948). A simple objective technique for measuring flexibility in thinking. Journal of General Psychology, 39, 15-22.

Busemeyer, J. & Wang, Y. (2000). Model comparisons and model selections based on generalization criterion methodology. Journal of Mathematical Psychology, 44, 171-189.


Bush, R. R. & Mosteller, F. (1951). A mathematical model of simple learning. Psychological Review, 58, 313-323.

Fitts, P. M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47, 381-391. (Reprinted in Journal of Experimental Psychology: General, 121(3), 262-269, 1992.)

French, R. M. (1995). The Subtlety of Sameness. Cambridge, MA: The MIT Press.

Gasser, M. & Smith, L. B. (1998). Learning nouns and adjectives: A connectionist account. Language and Cognitive Processes, 13, 269-306.

Gentner, D. (1978). On relational meaning: The acquisition of verb meaning. Child Development, 48, 988-998.

Gluck, K. A. & Pew, R. W. (2005). Modeling Human Behavior with Integrated Cognitive Architectures. Mahwah, NJ: Lawrence Erlbaum.

Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335-346.

Harnad, S. (1991). Other bodies, other minds: A machine incarnation of an old philosophical problem. Minds and Machines, 1, 43-54.

Harnad, S. (2001). Minds, machines and Turing: The indistinguishability of indistinguishables. Journal of Logic, Language, and Information.

Harnad, S. (2004). The annotation game: On Turing (1950) on computing, machinery, and intelligence. In Epstein, R. & Peters, G. (Eds.), The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Kluwer.

Krauzlis, R. J. (2005). The control of voluntary eye movements: New perspectives. Neuroscientist, 11, 124-137.

Landau, B., Smith, L., & Jones, S. (1998). Object shape, object function, and object name. Journal of Memory and Language, 38, 1-27.

Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190-204.

Parks, S. (2006). Inside HELP: Administrative and Reference Manual. Palo Alto, CA: VORT Corp.

Pizlo, Saalweachter, & Stefanov (2006). Visual solution to the traveling salesman problem. Journal of Vision, 6(6), 964. http://www.journalofvision.org/6/6/964/

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical Conditioning II (pp. 64-99). Appleton-Century-Crofts.

Sandini, G., Metta, G., & Vernon, D. (2004). RobotCub: An open framework for research in embodied cognition. International Journal of Humanoid Robotics, 8, 1-20.

Shepard, R. & Metzler, J. (1971). Mental rotation of three-dimensional objects. Science, 171, 701-703.

Shieber, S. (1994). Lessons from a restricted Turing Test. Communications of the Association for Computing Machinery, 37, 70-78.

Sohn, M. H., Goode, A., Stenger, V. A., Jung, K. J., Carter, C. S., & Anderson, J. R. (2005). An information-processing model of three cortical regions: Evidence in episodic memory retrieval. NeuroImage, 25, 21-33.

Sundman, J. (2003). Artificial stupidity. Salon, February 2003.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Treisman, A., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97-136.

Turing, A. (1950). Computing machinery and intelligence. Mind, LIX, 433-460.

Wallis, G. & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167-194.