

Teaching Robots by Moulding Behavior and Scaffolding the Environment

Joe Saunders*
[email protected]

Chrystopher L. Nehaniv
[email protected]

Kerstin Dautenhahn
[email protected]

Adaptive Systems Research Group
University of Hertfordshire

Hatfield, AL10 9AB, United Kingdom

ABSTRACT

Programming robots to carry out useful tasks is both a complex and non-trivial exercise. A simple and intuitive method to allow humans to train and shape robot behaviour is clearly a key goal in making this task easier. This paper describes an approach to this problem based on studies of social animals where two teaching strategies are applied to allow a human teacher to train a robot by moulding its actions within a carefully scaffolded environment. Within these environments, sets of competences can be built by constructing state/action memory maps of the robot's interaction within that environment. These memory maps are then polled using a k-nearest neighbour based algorithm to provide a generalised competence. We take a novel approach in building the memory models by allowing the human teacher to construct them in a hierarchical manner. This mechanism allows a human trainer to build and extend an action-selection mechanism into which new skills can be added to the robot's repertoire of existing competencies. These techniques are implemented on physical Khepera miniature robots and validated on a variety of tasks.

Categories and Subject Descriptors

I.2.9 [Artificial Intelligence]: [robotics]; I.2.6 [Artificial Intelligence]: [learning]; I.2.m [Artificial Intelligence]: [miscellaneous - imitation, programming by demonstration]

General Terms

Performance

Keywords

Social Robotics, Imitation, Teaching, Memory-based learning, Scaffolding, Zone of Proximal Development

∗corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HRI'06, March 2–4, 2006, Salt Lake City, Utah, USA.
Copyright 2006 ACM 1-59593-294-1/06/0003 ...$5.00.

1. INTRODUCTION

Imagine a scenario where your brand new domestic robot has just been delivered. The factory has pre-programmed it to carry out some useful tasks around the home, e.g. collecting cutlery, cups and plates and placing them in a dishwasher, or tidying up by picking up clothes left on the floor and placing them in a washing basket. You unpack the robot, press the "on" button, and the robot efficiently carries out these tasks whilst being safe to both you and itself. Later, however, you find that although the robot performs to the manufacturer's specifications there are some tasks which it does not carry out. It fails to tidy up the children's toys into the toy cupboard, or it fails to recognise that a particular and expensive glass should not be placed in the dishwasher. After a call to the manufacturer you discover that there is another button on the robot marked "learn". When this button is pressed the robot can be taught additional skills. This paper presents research on how such a teaching mechanism might be implemented.

In section 2 we suggest that it is the social dimension of behaviour that holds the key to making robots behave more intelligently [6], an approach inspired by studies of social animals (e.g. apes) and the 'social intelligence hypothesis' [4], which proposes that intelligent behaviour in primates has its origins in dealing with complex social dynamics. We discuss how the social aspects of teaching, learning and imitation are used by some social animals to expand their repertoire of skills. From this work we extract the developmental concepts of "scaffolding" and "putting-through"/"moulding" as mechanisms which may prove useful for robot teaching. Section 3 discusses related work where observation, imitation and direct teaching are used. We conclude this review by outlining the computational techniques that we will use in creating a novel learning architecture which will allow new robot skills to be taught whilst retaining or improving existing skills. Section 4 details the realisation of this architecture on physical Khepera miniature robots. Section 5 gives the experimental validation of the work by showing examples of how behaviours can be created in a robot and how additional skills can then be added to an existing robot skill repertoire. Finally, we discuss some of our findings and possible directions for further research in this area.

2. TEACHING AND IMITATION IN ANIMALS

Moore [13] proposes a six-step hypothesis for the evolution of imitation in nature. The process starts with Thorndikian conditioning, where existing motor actions are associated and reinforced based on particular environmental conditions. This step is later enhanced by operant (or Skinner) conditioning, where novel motor responses are formed based on combinations of existing actions. The next evolutionary step is an implicit reinforcement cycle leading to "skills", where the animal is able to perfect the novel act. The fourth stage introduces the teacher. The teacher essentially guides the pupil by physically "moulding" or "putting-through" the actions of the pupil given particular environmental stimuli. This can be considered as self-imitation by the animal as it repeats the actions that it has experienced. Visual imitation of others is the next evolutionary stage; in this case the animal now only has to see an act to be able to repeat it. The final process is called cross-modal imitation, where an animal is able to match features of its body with corresponding features of another animal. For example, human babies touch parts of the faces of their parents and can then locate the same features on their own face. In figure 1 we summarise and segment these stages into prime, taught and imitative sections.

The study presented in this paper bases its mechanisms for robot teaching on the self-imitation stage; however, each of the earlier stages is also used. For example, we simplify the actions available from the Thorndikian stage by considering them to be part of the robot's existing repertoire of motor skills. This existing set of skills, over and above basic motor actions, is called "primitives". Explicit combinations of primitives can be specified by the teacher. We call these "sequences", but they are equivalent to the novel sets of responses available at the operant conditioning stage. Skill learning is the essential building block upon which the teacher's directions are built. The skill reinforcement stage will therefore form the association between sensed stimuli and action. The fourth, self-imitation, stage is based on moulding or putting-through: the training example is provided by the teacher by putting the robot through the range of actions required. Our previous work [21] considered aspects of observational/imitative learning at the imitation stage. Our current work develops these ideas in a further investigation of the spectrum of learner/pupil relationships.

Evidence for teaching in the animal kingdom comes mainly from studies of primates [4]. However, there is also evidence from carnivores including domestic cats, tigers, cheetahs, otters, dolphins, orca whales and some bird species [24]. An example from cheetahs is where the mother, rather than killing prey, will capture and release live prey to the cheetah cubs when they are about 3 months old. The behaviour is also selective: only prey species which the cubs are likely to catch are released. It appears that the cubs' experience results in faster learning and a more skilled performance. This was tested with domestic cats whose kittens were brought live mice by their mothers at an early age. By 6 months old the kittens showed superior skills to a test group who had not been exposed to the mice [23].

Compelling evidence of intentional teaching comes from studies of primate behaviour. Fouts et al. report on the chimpanzees Washoe and Loulis, Loulis being the adopted infant chimp of the mother Washoe. Washoe had previously been taught American Sign Language (ASL); however, the human carers made no attempt to teach Loulis ASL and did not use ASL in Loulis' presence. Nevertheless Washoe succeeded in teaching Loulis ASL, both by demonstration and by moulding of Loulis' hands [18]. Moulding had also been used by the human carers to originally teach Washoe.

Figure 1: Proposed evolutionary stages with techniques required to implement them. This paper deals with the taught skill set.

Scaffolding is where a physical situation is artificially modified, typically by the mother, to make it much easier for her child to complete the task when the child is at a developmental stage where it could not perform the appropriate acts or sequence its actions correctly. Scaffolding of tasks, together with observational learning and moulding, has been observed in wild chimpanzees [4]. Cracking nuts with a hammerstone is an especially difficult task for a chimpanzee to learn, taking up to 14 years to perfect in some cases. A number of observations have been recorded where the mother will clean the anvil, reposition the nut or re-orient the hammerstone to favourable orientations for the infant. Scaffolding is also a familiar concept in human development and is emphasised in Vygotsky's idea of the "zone of proximal development" in his theory of the child in society [27]. Teaching and social interaction allow higher competence levels to be achieved through staged learning and building upon existing skills.

We take inspiration from these examples in social animals to study how moulding/putting-through and scaffolding can be used to good effect in teaching robots new skills, and to allow existing skills to be modified.

3. RELATED WORK

Even with explicit programming, robot control is hard due to sensor noise, the non-deterministic state of the environment, the inability to ensure that the robot's actions are deterministic, and the need for real-time responses. There are generally a number of problems which need to be solved:

i) how can the human teach the robot? What mechanisms can be used to make the robot match the intentions of the teacher? How can the robot learn when the task is complete?

ii) what techniques can the robot use to learn? How can the machine generalise and execute the new task?

iii) how can the robot incorporate the new experiences into its existing competencies? What sort of structure is necessary to ensure that new tasks can co-exist with existing tasks?

iv) how can it select the right action at the right time? Given a learned set of competencies, which one is appropriate?

Approaches include topics such as programming by demonstration, imitation learning, learning from observation and robot shaping. Typically the observational and imitative approaches attempt to match the behaviour of the demonstrator and so construct an appropriate control policy. Schaal et al. [22] provide an overview where approaches to the problem are classified as follows:

i) direct policy learning, where supervised learning is used to learn a control policy directly.

ii) learning policies from demonstrated trajectories; this assumes that the task goal is known and uses sample trajectories to learn a control policy.

iii) model based policy learning, where a predictive model of the control problem is constructed.

All of these approaches face two difficult problems. Firstly, by observation alone the internal proprioceptive feedback that the teacher experiences cannot be directly experienced by the pupil [20]; secondly, there may be a mismatch between the external and internal sensorimotor spaces of the teacher and pupil - the correspondence problem [14].

In the direct policy approach these issues can be avoided by having the pupil experience the same set of actions and sensory states as the teacher, with the correspondence problem solved by ensuring that both teacher and pupil have a similar embodiment. This approach has been used by a number of groups, including Billard & Dautenhahn [3] and Hayes & Demiris [9]. In both cases a student robot followed a teacher robot and learned to associate imitated actions with perceived environmental state. Saunders et al. [20], however, have demonstrated that there can be limitations in this approach due to reactive impersistence and teacher interference when using a pure following approach.

In recent work by Nicolescu et al. [15] a mobile robot tracks a teacher's movements, matching predicted postconditions against the robot's current proprioceptive state. It then builds a hierarchical behaviour-based network based on "Strips" [17] style production rules. This work attempts to provide a natural interface between robot and teacher whilst automatically constructing an appropriate action-selection framework for the robot.

Another way to allow a robot to experience the appropriate sensory state is to let the teacher manipulate the robot directly via a form of tele-operation and record the sensory state of the robot. Although not using a robot, this method is closely related to Sammut's [19] "learning-to-fly" application, where recordings of control parameters in a flight simulator flown by a number of human subjects were analysed using Quinlan's C4.5 induction algorithm [16]. The algorithm extracted a set of "if-then" control rules. Van Lent [26] also used this approach but provided a user interface which could be marked with goal transition information. This allowed an action-selection architecture to be constructed using "Strips" [17] style production rules. However, in both of these research areas the full "state" of the system (both internal and external) is available to the trainer. This may not be the case when teaching robots.

A long line of research into teaching service robots by observing humans has also been carried out by Dillman [7], where after observation production rules are generated to produce grammatical formalisms held in a knowledge database of actions.

Dorigo and Colombetti [8] use decomposition of tasks by a trainer to "shape" robot behaviour. We take a similar approach; however, we use neither evolutionary nor reinforcement learning techniques to create or modify robot behaviour.

Learning policies from demonstrated trajectories seems to be appropriate when the goals of the task are known and the task itself is self-contained. For example, when learning to duplicate human movements [11] or play tennis strokes [10] the goal of the task is already known or programmed into the learning mechanism. It is made explicit by the programmer for the specific (although mechanically complex) task to be solved. It is difficult to see how a new task could be incorporated into the existing learned policy without further explicit programming.

Bentivenga et al. [2] use model based policy learning to construct a learning framework using a memory-based approach. A humanoid robot learns to play games of "marble maze" and "air hockey" by recording exteroceptive data (ball angle/velocity, board tilt angles) and primitive type (roll ball away from corner, roll ball off wall) from a human demonstrator. The robot is able to select the appropriate primitive by analysing a memory model to find the nearest example to the current state. Parameters for the primitive are constructed using locally weighted regression on points nearest the selected primitive. This technique is also related to loose-perceptual matching methods described in [1].

Memory based learning approaches have a number of technical advantages. Firstly, complex functions can be learned by focusing on sets of less complex local approximations. Secondly, the local approximation for the target query (based on the current sensory state) is built from the training data at the time of the query and not from a pre-built function approximation. This means that additional training instances can be added immediately, without the need to rebuild a target function (which would be the case for an inductive or neural network approach).

It is noticeable that many of the example applications described above have the ability to learn complex tasks based on some form of observation (where observation can be both direct and from post-processing of sensory data). However, with the exception of [26, 7, 15], there are few mechanisms which allow another task to be both learned and included in the repertoire of previously learned functions. Our approach is to provide an interface which will both learn a particular task and have the ability to add this task to an existing control mechanism. This requires a number of steps:

i) a policy needs to be learned based on the sensory state of the robot itself. The correspondence between the human teacher and robot also needs to be solved. We address both of these points by the simple process of moulding the robot by teleoperation.

ii) the robot needs to be aware of when tasks have a specific goal; we make this an explicit part of the training sequence.

iii) learning must be carried out in real-time and be subsequently modified or enhanced with additional learning experiences; this is made possible by using memory based learning methods.

iv) the new learning experience should not corrupt other previously learned experiences; we allow the construction of a hierarchy of memory models to provide this.

Figure 2: A typical environment showing a Khepera with vision sensor and gripper, objects with different electrical resistance and bar-coded containers.

One of the key points in addressing many of the issues described is that a teacher constructs an appropriate learning environment for the robot. We do this while exploiting and extending some of the techniques already used by the practitioners above in a new framework.

4. FRAMEWORK

For this study we have used physical Khepera miniature robots (see figure 2) on a desk in a typical busy academic environment. Kheperas are 5cm diameter non-holonomic robots equipped with eight IR sensors placed at intervals around the base, an arm/gripper and a K213 linear vision system. The IR sensors are capable of detecting both ambient light and short range (10cm) obstacles. The arm/gripper arrangement can detect when an object is within the gripper and also the electrical resistivity of the object grasped. The K213 vision system provides a one-dimensional line of 64 grey-scale values subtending an angle of 36° from the front of the robot. Commands to control the robot can be sent from a remote PC either via a radio signal or over a directly connected serial cable.

The learning environment we choose is based around the capabilities of the Khepera. To provide a reasonably complex learning environment, the Khepera is placed in a walled "room" with various objects of different conductivity and some bar-coded containers.

4.1 Learning Mechanism

We use a memory based "lazy" learning method [12] to allow the robot to learn tasks. This is a simple k-nearest neighbour (kNN) approach where the value of each feature in the robot's state vector (see Scaffolding below) is regarded as a point in n-dimensional space, where n is the number of features in the state vector (see table 1). For each chosen task we collect a set of training examples (as described in Moulding below) together with their target primitives, each primitive being chosen by the human trainer when moulding the robot's actions.

Table 1: State vector used in the experiments

  State                 Description
  Repulsive Force       Vector of IR sensors
  Repulsive Angle       Angle of IR vector
  Light Distance        Distance to light
  Light Angle           Angle to light
  Bars Seen             Number of bars seen by K213
  Bar Size              Average bar size seen by K213
  Bar Std. Dev.         Standard deviation of bar size
  Arm Up/Down           Whether the arm is up or down
  Gripper Open/Closed   Whether the gripper is open or closed
  Object in Gripper     Whether an object is in the gripper
  Resistivity           Resistivity of the grasped object

When the task is executed the robot continually computes its current state vector. It then computes the distance from the current state to each of the training examples, the distance between the state vector and a training example being the sum of the distances between their features, as follows:

    distance(X, S) = Σ_{i=1..n} W_i |x_i − s_i| / (max_i − min_i)

Where X is an instance of the training examples and S is the robot's current sensory state. W is a non-negative vector of real numbers used to weight each of the dimensions; this weighting is discussed in the scaffolding section below. Setting k to 1 will result in the nearest point in the training examples being used, yielding a single primitive as the target function. Where k is greater than 1 the algorithm will yield a set of primitives; we choose the most common primitive from the set as the target function. Note that this method will always result in a primitive being chosen. In environmental situations not previously experienced by the robot, generalisation occurs as the primitive nearest to the current state is chosen. Thus performance is based on the similarity of new situations to those already experienced.

In work to date the k value has been set experimentally to approximately correlate with the number of state vector entries in each memory table. For a small number of entries k is set to 1. For larger tables k has been set to higher values, but not exceeding 5. We make use of the Tilburg University Memory Based Learner [5] to provide the kNN functionality. This system has the advantage of providing a very efficient tree-based coding structure for the training examples so as to speed up performance.
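The weighted distance metric and majority vote can be sketched in Python as follows. This is an illustrative re-implementation, not the TiMBL-based code used in the experiments; the feature ranges and primitive labels are hypothetical.

```python
from collections import Counter

def knn_primitive(state, examples, weights, mins, maxs, k=3):
    """Select a motor primitive for the current state by weighted kNN.

    state     - the robot's current state vector (list of feature values)
    examples  - list of (feature_vector, primitive_label) training pairs
    weights   - per-feature weights W_i (e.g. information gain)
    mins/maxs - per-feature ranges used to normalise each dimension
    """
    def distance(x):
        # Sum of weighted, range-normalised per-feature distances.
        return sum(
            w * abs(xi - si) / (hi - lo) if hi > lo else 0.0
            for xi, si, w, lo, hi in zip(x, state, weights, mins, maxs)
        )

    # Take the k training examples nearest to the current state.
    nearest = sorted(examples, key=lambda ex: distance(ex[0]))[:k]
    # The most common primitive among the k neighbours wins the vote,
    # so a primitive is always chosen, even in unseen situations.
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A current state close to previously moulded "turn" states would thus select the turning primitive, which is how generalisation to new situations arises.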

4.2 Moulding

The concepts of scaffolding and moulding can play an important part in animal learning. They support a form of self-imitation that may be the natural precursor to more complex forms of imitative learning. In our framework we use the idea of moulding or putting-through directly. The human has the ability to control the robot by remotely moving it through a set of pre-defined basic primitives. This set of primitives comprises the basic actions available to the robot (see table 2). The human teacher has no access to the internal state of the robot. By manipulating the robot in this manner we also avoid both the problem of observation of the human actions by the robot and the correspondence problem between robot and human. During the moulding process a snapshot of the robot's proprioceptive and exteroceptive state (see table 1) is recorded together with the directed primitive on each human command to the robot. For each human-defined task we can therefore build a memory model of state/primitive combinations.

Table 2: Pre-defined primitives

  Primitive        Description
  Move Forwards    Move forwards 1cm or continuously
  Move Backwards   Move backwards 1cm or continuously
  Turn Right       Turn right by 5° or continuously
  Turn Left        Turn left by 5° or continuously
  Raise Arm        Raise arm, if not already raised
  Lower Arm        Lower arm, if not already lowered
  Open Gripper     Open gripper, if not already open
  Close Gripper    Close gripper, if not already closed
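The recording step can be sketched as below; `read_state` and `execute` are hypothetical callables standing in for the Khepera's serial/radio interface, not part of the actual implementation.

```python
def record_moulding_step(memory, read_state, primitive, execute):
    """Record one moulding step: snapshot the robot's sensed state,
    pair it with the primitive the trainer commanded, then execute it.

    memory     - list of (state_vector, primitive) pairs for this task
    read_state - callable returning the current state vector (table 1)
    primitive  - label of the primitive chosen by the trainer (table 2)
    execute    - callable that sends the primitive to the robot
    """
    state = list(read_state())          # proprio-/exteroceptive snapshot
    memory.append((state, primitive))   # state/primitive pair for the task
    execute(primitive)                  # robot performs the moulded action
    return memory
```

Calling this on every trainer command accumulates exactly the state/primitive memory map that the kNN algorithm later polls.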

4.3 Scaffolding

All of the states perceived by the robot are recorded in the state vector; however, different attributes of this vector are relevant to different tasks. For example, to avoid obstacles the attributes of the IR sensors are of more importance than the position of the gripper, whereas to track an object the perceived orientation of the object is more relevant than the values of the IR sensors. Here we capture a pre-defined set of states, some of which are numeric summaries pertinent to the expected applications and realisable by the sensor arrangement of the robot (see table 1). For example, rather than storing 64 grey-scale values for the K213 linear vision sensor, we pre-process the K213 data to specifically detect bar-coded items. In this case, when no such items can be detected the K213 values are set to zero. Repulsive and ambient light sources from the IR sensors are formed into vectors. Apart from the process of vector creation no further pre-processing is carried out. Thus these sensors are always "on" and not specifically programmed to detect particular objects or environmental items.

We use two mechanisms to ensure that the appropriate attributes are chosen. The first is a technical solution originally used in Quinlan's C4.5 induction algorithm [16]. This is based on computing information gain to measure how well a given attribute separates the set of recorded state vectors according to the target primitive. This is defined as follows:

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where S is the collection of training examples, Entropy(x) is a function returning the entropy of x in bits, Values(A) is the set of all possible values for a particular state attribute A, and S_v is the subset of S for which attribute A has value v. Further explanations of this metric can be found in [16, 12]. The information gain measurement allows particular attributes in the state vector to have greater relevance by using it to weight the appropriate dimensional axes in the kNN algorithm (by setting W_i above). This has the effect of either lengthening or shortening the axes in Euclidean space, thus reducing the impact of irrelevant state attributes.
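For discrete-valued attributes, the gain computation can be sketched as follows; function names are illustrative, and the actual system applies the C4.5-style treatment of [16] rather than this simplified version.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of target primitive labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr):
    """Gain(S, A) for attribute index `attr` over (state, label) pairs."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    # Partition S by the value v of attribute A and subtract the
    # weighted entropy of each subset S_v.
    for v in set(state[attr] for state, _ in examples):
        subset = [label for state, label in examples if state[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain
```

An attribute that perfectly separates the primitives yields the full entropy as gain, while a constant or irrelevant attribute yields a gain near zero; using these gains as the weights W_i suppresses the irrelevant axes.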

The second mechanism for attribute selection is the human trainer. It is assumed that the trainer already understands the task (from an external viewpoint) that the robot must carry out and is therefore able to construct the training environment appropriately so as to ensure that irrelevant features are removed. This allows the technical selection of relevant state features to be enhanced, as the other features will now tend to have constant values and therefore a low information gain.

As an example, consider training the robot to perform a "wall following" behaviour. The teacher might remove extraneous objects, such as the bar-coded containers, from the training area. By moving the robot through a number of wall following experiences, the set of sensory states recorded will then be primarily based on the IR sensors (which resolve to the repulsive vector/angle attributes). These attributes will then be automatically selected by the extended kNN algorithm based on their higher information gain. As discussed in section 2 above, this process of scaffolding, or creating favourable conditions for learning, would seem a quite natural phenomenon in social animals and is of course fundamental to all forms of human teaching.

4.4 Learning New Tasks
We are now in a position to define the mechanisms available to the human trainer. The trainer directs the robot using a screen-based interface which provides a number of buttons used to set operation modes such as "execute" and "start/stop learning", plus an edit field to label actions and a list from which to choose existing labelled actions and primitive operations.

The robot can be in one of three modes. The first is execution mode, its normal mode of operation, in which its current behaviour is executed. Alternatively the robot can be in training mode, where the human trainer can mould, scaffold and create new activities for the robot to eventually use in execution mode. An intermediate mode lets the trainer execute one of the set of available competencies; for example, by selecting the primitive "Move Forward" in this mode the robot will execute the move-forward primitive. This is useful for placing the robot in an appropriate state prior to training.

In "learning" mode the robot can learn new competences at one of three training levels determined by the trainer: sequence, task and behaviour. All three training levels are started by pressing a "start learning" button and terminated by pressing a "stop learning" button. For each new competence (either a behaviour, task or sequence) the trainer explicitly provides an appropriate label, for example "PickUpCup". When training is complete the label is added to the set of actions available to the trainer and can thus be used immediately for further training sessions. Existing labelled actions can also be modified with additional training episodes as required. In training mode the trainer has the option to execute the selected labelled competence so that the results of the robot's actions can be assessed immediately.

The first competence level is the sequence. This is where the robot can be directed through a given sequence of primitives which it records without reference to its state. An example of a sequence might be to lower the arm and close the gripper; this could be labelled as the 'grab' sequence. The grab sequence would then become part of the set of competences available for the trainer to use. These new sequences could also then be used in combination with other primitives and other sequences to create further sequences. Note that sequences are entirely deterministic: when requested to perform a sequence, the robot will simply execute the recorded list of competences taught by the trainer sequentially, making no reference to the environmental state. Each primitive, when executed by the robot, can be run in two further modes: discrete or continuous. In discrete mode the primitive executes, followed immediately by a "stop" instruction; continuous mode does not issue the "stop" instruction. Discrete mode allows the trainer to put the robot through its range of actions step by step. Continuous mode is typically used after training is complete and enables the robot to execute the primitives without the jerkiness caused by the "stop" instructions above.

Figure 3: An example of a trained hierarchy of primitives, primitive sequences, learned goal-directed tasks and the final behaviour.
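Since a sequence is replayed blindly, its execution logic reduces to iterating over a recorded list. A minimal sketch follows, where the `execute`/`stop` robot interface is a hypothetical stand-in for illustration, not the actual Khepera API:

```python
def execute_sequence(robot, sequence, discrete=True):
    """Replay a recorded list of primitives with no reference to state.

    In discrete mode every primitive is followed by a "stop" instruction
    (step-by-step training); continuous mode omits the stops, avoiding
    the jerkiness they cause.
    """
    for primitive in sequence:
        robot.execute(primitive)
        if discrete:
            robot.stop()
```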

The second level for learning is called a goal-directed task, or simply a task. This differs from a sequence in that during training the actions taken by the robot will depend on the environmental state at that time. The trainer now has the opportunity to select not only basic primitives, but sequences and other goal-directed tasks. The tasks are goal-directed because the trainer also has the opportunity to inform the robot when the task has been completed. This goal state is paired with the robot state and becomes a further training record in the memory model for that particular task. In execution mode the task is iterated until the environmental state is close to a goal state, at which point the task terminates. As an example, consider an obstacle avoidance behaviour. The trainer would place the robot in an obstacle-facing situation, choose the "task" level, label it "Obstacle Avoidance" and press the "start learning" button. The robot can then be moulded into a situation where it no longer faces the obstacle, and the trainer would then signal that the goal state was reached. This training regime would be repeated for many obstacle avoidance situations, with many obstacle-recognition states, appropriate avoidance actions and goal states being recorded into the Obstacle Avoidance memory model.

The final mechanism for learning is a behaviour. This allows the trainer to construct the complete behaviour for the robot from the component set of tasks, sequences and primitives. The construction of a behaviour is the same as for a task except that no goal state is required. The behaviour will run continually in execute mode, choosing which task, sub-task, sequence or primitive to use based on the current environmental state. With careful training the trainer can now build a hierarchy of tasks, sequences and primitives as required (see figure 3).

4.5 Action Selection
The trainer, by constructing a hierarchy of tasks, sequences and primitives, is now effectively building an action-selection architecture for the robot. At the top behavioural level a decision is made, based on the robot's current state, as to what to execute next (based on the kNN selection). If the selection is a primitive or sequence it will be executed and the next state cycle will begin. Alternatively the selection could be a task. Within the task the robot state selects the next appropriate action, which again could be a primitive, sequence or task. Working down through the hierarchy eventually results in the execution of a primitive. Note that each task executed in the hierarchy will only terminate when its goal condition is selected based on the current robot state; thus within the lowest selected task the state will be polled after each executed primitive. This method of action selection is similar to the extended feed-forward free-flow hierarchy proposed by Tyrrell [25], who demonstrates how hierarchical approaches to action selection can often exhibit better performance than "Strips"-style production rule methods. Precedence of one action over another is based entirely on the current environmental state: the stored memory state most similar to the current state is chosen at each level within the framework.
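The descent through the hierarchy can be sketched as a loop over nearest-neighbour selections. The class and method names below are our illustrative assumptions, not the paper's implementation, and a 1-NN lookup stands in for the full weighted kNN selection:

```python
class Primitive:
    """Leaf of the hierarchy: an atomic robot action."""
    is_primitive = True

    def __init__(self, name):
        self.name = name

class CompetenceNode:
    """Behaviour, task or sequence node holding a state/child memory model."""
    is_primitive = False

    def __init__(self, memory):
        self.memory = memory  # list of (stored_state, child_node) pairs

    def nearest(self, state):
        """1-NN selection: the child paired with the stored state most
        similar to the current state."""
        return min(self.memory,
                   key=lambda sc: sum((a - b) ** 2
                                      for a, b in zip(sc[0], state)))[1]

def select_action(node, state):
    """Work down the hierarchy until a primitive is reached."""
    while not node.is_primitive:
        node = node.nearest(state)
    return node
```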

5. VALIDATION OF FRAMEWORK
We illustrate the successful functioning of the implemented social learning architecture by using the system on two scaffolded behaviours. The first is simple and illustrates the different ways that a trainer could proceed in training the robot. The second is more complex and shows how a new skill can be added to an existing set of actions. Please note that, for reasons of clarity, the diagrams only show each unique sequence, task or primitive per memory model. In reality each memory model may hold a great many instances of different states for the same sequence, task or primitive.

The first behaviour is called "Scared of Light": the aim was to train the robot to move forward when a light was off, move backwards when a light was on, and avoid bumping into obstacles in all cases (note that the robot had no pre-built competencies other than the basic set of primitives at this stage). Figure 4 shows two different approaches to the task: the first exploits the hierarchy by separating out an "avoid obstacles" sub-task, while the second combines both competencies into one behaviour. Both training regimes are successful, although further training episodes may become more difficult with the latter approach. Training of the robot was carried out by two individuals who had not previously used the system. Observation of each person's approach is informal but illuminating. The first user pre-constructed a possible solution using one behaviour and one task before implementing it on the robot. The second user took an entirely different approach: she first trained the robot to respond correctly to the light and then subsequently added training episodes to cope with the obstacle avoidance behaviour. This resulted in a single behaviour with no sub-tasks. Both users nevertheless successfully trained the robot to complete the task. As part of our future research we intend to carry out further trials of the system with increasingly complex tasks to ascertain whether there is a natural point at which users start to automatically construct sub-tasks and scaffold each task appropriately.


Figure 4: Different teaching styles. The upper part of the diagram shows the avoid task being taught first, followed by scaffolding to recognise when to move forward and backward. In the lower part of the diagram the trainer made no attempt to segment the behaviour; all competencies are added to a single behaviour.

The second behaviour is called "Tidy Up". This behaviour is a proxy for the household robot described in the introduction to this paper. We provide two containers: one we call the "cupboard", the other the "basket". There are a number of objects, either plastic or fitted with copper strips. The training regime is much more complex in this instance (see figure 5), not only because there is more to teach but also because we need some negative examples. This is important to ensure consistent behaviour. For example, we had to train the robot to do something sensible if it dropped the object. This situation was scaffolded by initially running the "Tidy Up" behaviour (having already trained the robot to grasp the object) and then removing the object from the gripper. At this point we terminated execution, pressed the learning button, selected the "GripperOpenArmUp" sequence and then terminated learning. The initial "Tidy Up" task needed seven steps, with up to three scaffolding experiences per task and up to fifteen moulding experiences per scaffold. The robot nevertheless executed the behaviour successfully.

Figure 5: The hierarchy created after successfully training the robot to place plastic containers into the basket (detailed states not shown).

Figure 6: The hierarchy has been extended to allow the robot to successfully place copper objects in the cupboard whilst still placing plastic objects in the basket (detailed states not shown).

Figure 6 shows the "Tidy Up" task extended by training the robot to recognise the copper objects and place them in the "cupboard". The training sequence here involved creating a new task, "MoveToCupDropObject", and extending the "Tidy Up" task to execute the "MoveToCupDropObject" task if the robot could see the cupboard. Two negative examples were also required: the robot is trained to ignore the cupboard if it has the plastic object, and similarly to ignore the basket if it has the copper object.

In some of the training episodes there were indications that certain tasks can be very difficult to demonstrate. For example, the alignment of the robot to successfully pick up a film canister must be precise: if the robot is too close the gripper cannot grasp it, and if the robot is slightly misaligned the canister can be knocked over. Demonstrating this ability to the robot as a sub-task proved difficult, as the range of possible state attributes was very small in this instance. We think that certain useful component tasks such as these may be better pre-coded as basic primitives, i.e. as factory presettings.

6. DISCUSSION
We have described and implemented a robot social learning architecture, inspired by the study of social animals, that allows a human trainer to teach a physical robot without explicit programming. The teaching is based on building up hierarchical sets of reusable competences via interactive scaffolding. Each competence is based on the assumption that experiences captured by the robot as a result of directed human training can be re-applied when the robot encounters a new situation similar to those in its set of stored experiences. Thus it "self-imitates", generalising by reproducing its own behaviour in new contexts. The training takes place in real time and, although relatively new, the architecture appears to scale successfully from simple to moderately complex tasks. However, further experimentation will be necessary to assess performance on tasks of very high complexity.

To date we have also obtained limited feedback on the use of the system by non-roboticists. Informal tests have indicated that it may not be obvious to a non-technical trainer that a robot may need a developmental program to learn to carry out complex tasks, although this seems a natural assumption when training other adults, children or animals. This may simply be due to inexperience with "intelligent" machines, or to the fact that the robot itself does not "advertise" that it lacks basic skills. Machines up to now have been engineered mostly to work precisely as specified; they are usually not expected to have to be taught or developed in any way.

In our future research we intend to study these issues further and also to use the architecture to investigate how robots could learn from each other.

7. ACKNOWLEDGMENTS
The work described in this paper was partially conducted within the EU Integrated Project COGNIRON ("The Cognitive Robot Companion") and funded by the European Commission Division FP6-IST Future and Emerging Technologies under Contract FP6-002020.

8. REFERENCES
[1] A. Alissandrakis, C. L. Nehaniv, K. Dautenhahn, and J. Saunders. An approach for programming robots by demonstration: Generalization across different initial configurations of manipulated objects. In 6th IEEE Int. Symp. Computational Intelligence in Robotics and Automation (CIRA'05), pages 61–66. IEEE, 2005.
[2] D. C. Bentivegna and C. G. Atkeson. A framework for learning from observation using primitives. In Proc. RoboCup Int. Symp., Japan, 2002.
[3] A. Billard and K. Dautenhahn. Experiments in learning by imitation - grounding and use of communication in robotic agents. Adaptive Behaviour Journal, 7(3/4), 1999.
[4] R. W. Byrne. The Thinking Ape: Evolutionary Origins of Intelligence. Oxford University Press, 1995.
[5] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg Memory-Based Learner. Technical Report ILK 04-02, Tilburg University, 2004. Available from http://ilk.uvt.nl/.
[6] K. Dautenhahn. Getting to know each other – artificial social intelligence for autonomous robots. Robotics and Autonomous Systems, 16:333–356, 1995.
[7] R. Dillmann. Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, 47:109–116, 2004.
[8] M. Dorigo and M. Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1998.
[9] G. Hayes and J. Demiris. A robot controller using learning by imitation. In Proc. Int. Symp. Intelligent Robotic Systems, Grenoble, pages 198–204, 1994.
[10] H. Miyamoto and M. Kawato. A tennis serve and upswing learning robot based on bi-directional theory. Neural Networks, 11:1331–1344, 1998.
[11] J. A. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In IEEE Int. Conf. Robotics and Automation, 2002.
[12] T. M. Mitchell. Machine Learning. McGraw-Hill International, 1997.
[13] B. R. Moore. Social Learning in Animals: The Roots of Culture, chapter The Evolution of Imitative Learning, pages 245–265. Academic Press Inc., 1996.
[14] C. L. Nehaniv and K. Dautenhahn. The correspondence problem. In K. Dautenhahn and C. L. Nehaniv, editors, Imitation in Animals and Artifacts, pages 41–61. MIT Press, 2002.
[15] M. N. Nicolescu and M. M. Mataric. Learning and interacting in human-robot domains. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 31(5):419–430, 2001.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[17] R. E. Fikes and N. J. Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189–208, 1971.
[18] R. S. Fouts, D. H. Fouts, and T. Cantfort. Teaching Sign Language to Chimpanzees, chapter The Infant Loulis Learns Signs from Cross-Fostered Chimpanzees, pages 280–292. State University of New York Press, 1989.
[19] C. Sammut, S. Hurst, D. Kedzier, and D. Michie. Learning to fly. In Proc. Ninth Int. Conf. on Machine Learning, pages 385–393. Morgan Kaufmann, 1992.
[20] J. Saunders, C. L. Nehaniv, and K. Dautenhahn. An experimental comparison of imitation paradigms used in social robotics. In Proc. IEEE Robot and Human Interactive Communication (ROMAN '04), pages 691–696. IEEE Press, September 2004.
[21] J. Saunders, C. L. Nehaniv, and K. Dautenhahn. An examination of the static to dynamic imitation spectrum. In Proc. 3rd Int. Symp. on Animals and Artifacts at AISB 2005, pages 109–118, 2005.
[22] S. Schaal, A. Ijspeert, and A. Billard. The Neuroscience of Social Interaction, chapter Computational Approaches to Motor Learning by Imitation, pages 199–218. Oxford University Press, 2004.
[23] T. M. Caro. Predatory behaviour in domestic cat mothers. Behaviour, 74:128–147, 1980.
[24] T. M. Caro and M. D. Hauser. Is there teaching in non-human animals? Quarterly Review of Biology, 67:151–174, 1992.
[25] T. Tyrrell. Computational Mechanisms for Action Selection. PhD thesis, Centre for Cognitive Science, University of Edinburgh, 1993.
[26] M. van Lent and J. E. Laird. Learning procedural knowledge through observation. In K-CAP 2001: Proc. Int. Conf. Knowledge Capture, 2001.
[27] J. V. Wertsch. Vygotsky and the Social Formation of the Mind. Harvard University Press, 1985.