Robotics and Autonomous Systems 38 (2002) 157–170

Visual approach skill for a mobile robot using learning and fusion of simple skills

M.J.L. Boada∗, R. Barber, M.A. Salichs

System Engineering and Automation Division, Carlos III University, Madrid, Spain

Received 30 August 2001; received in revised form 30 September 2001; accepted 30 November 2001

Abstract

This paper presents a reinforcement learning algorithm which allows a robot, with a single camera mounted on a pan-tilt platform, to learn simple skills such as watch and orientation and to obtain the complex skill called approach by combining the previously learned ones. The reinforcement signal the robot receives is a real continuous value, so it is not necessary to estimate an expected reward. Skills are implemented with a generic structure which permits complex skill creation from sequencing, output addition and data flow of available simple skills. © 2002 Published by Elsevier Science B.V.

Keywords: Mobile robots; Reinforcement learning; Visual approach; Control architecture; Neural networks; Adaptive behavior

1. Introduction

An autonomous mobile robot must be able to modify or adapt its skills in order to react adequately in complex, unknown and dynamic environments. A good method for achieving this goal is reinforcement learning [11], because a complete knowledge of the environment is not necessary and it also permits the robot to learn on-line.

The more complex the task performed by a robot, the slower the learning, because the number of states increases, making it difficult to find the best action. Decomposing the task into simpler sub-tasks improves the learning because each skill learns in a subset of the possible states, so that the search space is reduced.

∗ Corresponding author. E-mail addresses: [email protected] (M.J.L. Boada), [email protected] (R. Barber), [email protected] (M.A. Salichs).

The current tendency is to define basic robot behaviors, which are combined to execute more complex tasks. Thus, Brooks [6] proposes a hierarchical decomposition based on behaviors. Gachet et al. [8] define a set of elemental or primitive behaviors; the robot learns to merge these behaviors adequately in order to execute more complex tasks called emergent behaviors. Becker et al. [3] define a flexible library of robot skills that can be easily recombined to obtain a variety of useful behaviors.

Several authors have proposed different methods to learn not only simple behaviors, but also how to combine them to obtain more complex ones. Michaud and Matarić [12] provide a memory-based approach to dynamically adapt behavior selection according to the history of their use. Ryan and Pendrith [15] introduce the RL-TOPs architecture, which allows the automatic construction of appropriate hierarchies of learned behaviors and provides an approach to re-use learned behaviors for solving other tasks, which improves the learning time. Hasegawa and Fukuda [10] propose a hierarchical behavior controller and a learning algorithm to obtain more complex behaviors by combining previously obtained behaviors.

Our research is based on the work done by Barber [2], who defines an architecture consisting of a deliberative and an automatic level. The deliberative level is formed by the reasoning skills, which require a high computing time. The automatic level is formed by skills which interact with the sensors and the actuators. One of the objectives of our research is to develop a reinforcement learning algorithm which allows a mobile robot, equipped with a vision system formed by a single CCD camera mounted on a pan-tilt platform, to learn simple skills such as watch and orientation. The vision system is the only means by which the robot is able to obtain information from the environment. In the proposed learning algorithm, the robot receives a real continuous reinforcement every time it performs an action. Once the robot has learned the previous simple skills, it performs the complex skill called approach by coordinating them. This paper also presents the generic skill structure in the AD architecture and three different methods to obtain complex skills.

2. AD architecture

The control architecture used in this work is the AD architecture [2]. The AD architecture is motivated by the human reasoning and actuation capacities. According to the theories of modern psychology of Shiffrin and Schneider [16,17], there are two mechanisms for processing information: the reflexive processes and the automatic ones. Therefore, two mental levels of activity can be differentiated in human beings: the deliberative level and the automatic one. These two levels are related to how the reasoning capacity and the actuation capacity are distributed. The three-layer architectures developed by Firby [7], Gat [9] and Bonasso and co-workers [4] are an example of the current trend to incorporate deliberation and reactivity in the same architecture. These architectures can be considered a precedent of the AD architecture.

According to the criteria exposed, only two levels can be established, as shown in Fig. 1. The deliberative level is associated with the reflexive processes, and the automatic level with the automatic ones. The deliberative level is composed of those processes which require a long calculation time as a consequence of reasoning. The path planner, the environment modelling and the task supervisor are skills found in this level. The automatic level is formed by the skills which interact with the sensors and the actuators and which require minimum time to process the information they work with. Among them, modules which provide the sensorial information and action modules acting upon the different mechanical elements of the robot can be found. Both levels present the same characteristic: they are formed by skills. Skills consist of the different reasoning capacities or the sensorial and motor capacities of the system. These skills are activated by execution orders produced by other skills or by a sequencer, and they return data and events to the skills or sequencers which have activated them. Skills are the base of the AD architecture.

Fig. 1. AD architecture levels.

2.1. Deliberative level

In this level we find modules that require reasoning or decision capacity. These modules do not produce immediate responses; they need to process the information they work with. They form the deliberative skills, and they are activated by a sequencer which is in charge of managing the correct performance of these skills. Fig. 2 shows this level.

This level is formed by a series of skills named deliberative skills, a long-term memory from which information is obtained, and a sequencer that activates and deactivates the deliberative skills:

• Deliberative skills. These are each of the capacities of reasoning and learning which the autonomous system has. Examples of these skills are the planners and the relocalization systems. These skills require a long period of calculation. They activate and deactivate one by one; concurrency is not possible.

Fig. 2. Deliberative level.

Fig. 3. Automatic level.

• Long-term memory. It contains information which can be considered more stable through time, i.e., it does not depend on the robot's state. This type of memory is only accessible by the deliberative skills, which can perform reasoning upon that information, modifying it when necessary. The a priori information, such as maps, and the information originating from the reasoning or learning of the different deliberative skills are included in this memory.

• Sequencer. The sequencer is in charge of managing the deliberative skills, giving the execution order to each one when necessary. The sequencer is given a priori and it is the one which defines the system behavior. It decides which skills should be active depending on their sequence and on the events of the skills of this level as well as those of the automatic level.

2.2. Automatic level

In this level there exist low-level control modules which act directly upon the actuators, as well as modules that collect data from the different sensors of the system. Fig. 3 shows the different elements that form the automatic level:

• Automatic skills. These are the sensorial and motor capacities of the system. The automatic skills' response is faster than the deliberative skills' response.

• Reflex actions. These are involuntary and priority responses to a stimulus. The intensity and duration of the response are governed by the intensity and duration of the stimulus.

3. Automatic skills

Automatic skills are in charge of processing sensorial information and/or executing actions upon the robot actuators. Bonasso et al. [5] define skills as the robot's connection with the world. For Chatila and co-workers [1], skills are all the built-in robot action and perception capacities.

Skills are classified as perceptive and sensorimotor. Perceptive skills interpret the information perceived from the sensors, other perceptive skills or sensorimotor skills. Sensorimotor skills perceive information from the sensors, perceptive skills or sensorimotor skills and, on the basis of that information, perform an action upon the actuators. Automatic skills can be combined to obtain more complex ones.

3.1. Skill’s structure

Skills are server/client modules. Each module contains an active object, an event manager object and data objects. Objects are separate units of software with an identity, interfaces and state. The active object has its own thread of control and is in charge of processing. The processing results are stored in the data objects. These objects contain different data structures depending on the type of data stored, but the interfaces are similar. During the processing, the active object can generate events. Events are sent to the event manager object, which is in charge of notifying the skills which have registered with it. CORBA is used to communicate among objects of the same module or of different modules.

Skills can be activated or deactivated by skills situated in the same level or in higher levels. During the activation, some parameters can be sent to the activated skill. When a skill is activated, it connects to the data objects of other skills or to sensor servers, as required by the skill. Then it processes the received input information and finally stores the output results in its data objects. When the skill is sensorimotor, it can connect to actuator servers in order to send them movement commands.
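This lifecycle can be summarized in code. The following Python sketch is illustrative only: the class and method names are our own assumptions, and the real system communicates through CORBA objects rather than direct calls:

```python
import threading

class Skill:
    """Minimal sketch of the generic skill module: an active object
    (its own thread of control), data objects holding the processing
    results, and an event manager notifying registered skills."""

    def __init__(self, name):
        self.name = name
        self.data_objects = {}           # output results, readable by other skills
        self.registered = []             # skills that asked to be notified of events
        self._active = threading.Event()

    def activate(self, **params):
        """Called by a skill of the same or a higher level, or by the sequencer."""
        self.params = params
        self._active.set()
        threading.Thread(target=self._run, daemon=True).start()

    def deactivate(self):
        self._active.clear()             # a final state report could be emitted here

    def notify(self, event):
        """Event manager object: forward an event to the registered skills."""
        for skill in self.registered:
            skill.on_event(self.name, event)

    def _run(self):
        while self._active.is_set():
            self.step()                  # connect to inputs, process, store outputs

    def step(self):
        raise NotImplementedError        # each concrete skill implements its processing
```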

A skill can send a report about its state while it is active or when it is deactivated. For example, the skill called gotogoal can report whether the robot has achieved the goal or not. When this skill is deactivated, it might inform about the error between the current robot position and the goal.

Skills must define an interface independent of the robot's physical characteristics to allow, on the one hand, the communication among skills and, on the other hand, the portability of software from one robot to another.

Fig. 4 shows the skill's structure. The black squares represent the data objects where the output results are stored. The ellipse represents the active object; if a skill lacks the white circle, it is always active. The white square represents the event manager object. Solid, dashed and dotted arrows represent the data flow, the performance flow and the events flow, respectively.

Fig. 4. Skill's structure.

3.2. Complex skills

Skills can be combined to obtain complex skills, and these can in turn be recursively combined to obtain more complex skills. This presents the following advantages:

• Re-use of software. A skill can be used by different complex skills.

• Reduced programming complexity. The problem is divided into smaller and simpler problems.

• Improved learning rate. Each skill is learned in a subset of the possible states, so that the search space is reduced.

3.3. Methods for generating complex skills

We propose three different methods to generate complex skills: sequencing, output addition (fusion) and data flow.

• Sequencing. The sequencer is responsible for deciding which skills have to be activated at each moment, and it must avoid activating at the same time skills which act upon the same actuator, see Fig. 5.

• Output addition. Fusion allows combining the outputs of several sensorimotor skills that act upon the same actuator at the same time. The resultant movement commands are obtained by combining the movement commands of each skill. In this case, the skills are activated at the same time, see Fig. 6.

In the sequencing method, simple skills connect directly to the actuator servers. In the output addition method, simple skills have to store the movement commands in their data objects so that they can be used by the complex skill. When simple skills are activated, they do not know whether they have to send commands to the actuators or store them in their data objects. In order to solve this problem, simple skills check whether there is another skill connected to the actuators. In the negative case, they connect to them and send movement orders. In the affirmative case, they store the commands in their data objects. In the output addition method, the complex skill first connects to the actuators and then activates the simple skills (a minimal sketch of this fusion appears after this list).

• Data flow. A complex skill can also be made up of skills which send information to one another, see Fig. 7. The difference with the above methods is that the complex skill does not have to be responsible for activating all the skills; simple skills can activate other simple skills.
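As a concrete illustration of the output addition method, the sketch below (same hypothetical Python conventions as the skill sketch above) averages the movement commands that simultaneously active sensorimotor skills have stored in their data objects; the equal weighting is our assumption, since the text only states that the commands are combined:

```python
def fuse_commands(skills, weights=None):
    """Output addition: combine the movement commands stored in the data
    objects of several simultaneously active sensorimotor skills."""
    commands = [s.data_objects["command"] for s in skills]
    if weights is None:
        weights = [1.0 / len(commands)] * len(commands)   # assumed equal weighting
    v = sum(w * c["v"] for w, c in zip(weights, commands))          # linear velocity
    omega = sum(w * c["omega"] for w, c in zip(weights, commands))  # angular velocity
    # The complex skill, already connected to the actuator server, sends the result.
    return {"v": v, "omega": omega}
```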

Fig. 5. Sequencing method for generating a complex skill.

Fig. 6. Output addition method for generating a complex skill.

Fig. 7. Data flow method for generating a complex skill.

The three methods are not exclusive; they can occur in the same skill. A generic complex skill must have a structure which allows its generation by one or more of the methods described above. The complex skill connects to the actuator servers before activating any simple skill. The movement commands sent by the simpler skills are passed through the complex skill. Fig. 8 shows the generic structure with the three possible methods of generating a complex skill.

Fig. 8. Generic structure of a complex skill.

3.4. Skill learning

An autonomous mobile robot must be able to modify and improve its skills in order to carry out its tasks well in complex, dynamic and unknown environments. In this case, learning from experience plays a very important role. A robot should be able to learn simple skills, i.e., which outputs it has to generate from the perceived inputs, and it should be able to learn how to combine skills to obtain more complex ones. The work presented in this paper focuses on simple skill learning.

4. Approaching skill description

Approaching a target means moving towards a stationary object [14]. The process a human performs to execute this skill using visual feedback is, first of all, to move his eyes and head to center the object in the image, and then to align the body with the head while moving towards the target. Humans are not able to perform complex skills when they are born; they undergo a development process in which they become able to perform more complex skills through the merging or coordination of skills learned previously.

According to these ideas, our robot learns independently to maintain the object in the image center and to turn its base to align the body with the vision system, and finally it executes the approaching skill by coordinating the learned skills. Fig. 9 shows the diagram for the approach skill. This skill is generated by the data flow method. As the figure shows, the object center skill is not activated by the complex skill; it is activated by the skill called watch. There is a data flow from one skill to the other.

4.1. Watching skill

Watching a target means keeping the eyes on it. The received inputs are the object center coordinates in the image plane and the performed outputs are the pan-tilt velocities. The information is not obtained from the camera sensor directly; it is obtained by the skill called object center. This skill receives an image, finds the searched-for object in it and calculates the object center. This last skill is perceptive because it does not produce any action upon the actuators; it only interprets the information obtained from the sensors.

4.2. Robot orientation skill

Orientating the robot means turning the robot's body to align it with the vision system. The turret is mounted on the robot, so the angle formed by the robot body and the turret coincides with the turret angle. The input is the turret pan angle and the output is the robot angular velocity. The information about the angle is obtained from the encoder placed on the pan-tilt platform.

Fig. 9. Approach skill.

4.3. Object center skill

Object center means searching the image for a previously defined object. The input is the image recorded by the camera and the output is the object center position in the image, in pixels. If the object is not found, the skill sends a report saying that the object has not been found.

5. Reinforcement learning

Reinforcement learning is a learning technique based on trial and error. A well-performed action provides a reward, increasing its probability of recurrence. A badly performed action provides punishment, decreasing that probability. Reinforcement learning is used when there is no detailed information about the desired output. The system learns the correct mapping from situations to actions by maximizing a scalar called the reinforcement signal, without a priori knowledge of its environment. Another advantage of reinforcement learning is that the system is able to learn on-line, so that it can adapt to changes produced in the environment.

The external reinforcement signal is normally not enough. In the worst cases, it only says whether the system is still working or has crashed. In these cases, the reinforcement signal is a scalar value, typically {0, 1}, where 0 means bad performance and 1 means good performance, and/or it is delayed in time. To cope with this problem, reinforcement learning algorithms such as TD(λ) [19] and Q-learning [20] are based on estimating an expected reward.

In our learning algorithm, when the robot performs an action, it receives a reinforcement, a real value between 0 and 1. This value shows how well the robot has performed the action. The robot can compare the action result with the result of the last action performed in the same state, so that it is not necessary to estimate an expected reward.

In our case, both the state space and the output space are continuous. If the set of actions were discrete, some feasible solutions would not be taken into account.


Fig. 10. Structure of the neural net. Shaded RBF nodes of the input layer represent the ones activated for a perceived situation. Only the activated nodes will update their reinforcement values.

6. Network architecture

Fig. 10 shows the neural network architecture used. The input layer consists of radial basis function (RBF) nodes. This layer is used to discretize the input space. The activation value for each node depends on the proximity of the input vector to the center of the node: an activation level of 0 means that the perceived situation is outside its receptive field, while 1 means that the perceived situation corresponds to the node center. The output layer consists of linear stochastic units allowing the search for better responses in the action space. Each output unit represents an action. There is complete connectivity between the two layers.

6.1. Input layer

The input space is divided into discrete, overlapping regions using RBF nodes. The activation value for each node is

$$i_j = \exp\left(-\frac{\|\vec{i}-\vec{c}_j\|^2}{\sigma_{\mathrm{RBF}}^2}\right) \quad (1)$$

where $\vec{i}$ is the input vector, $\vec{c}_j$ is the center of each node and $\sigma_{\mathrm{RBF}}$ the width of the activation function. The obtained activation values are normalized:

$$i_j^n = \frac{i_j}{\sum_{k=1}^{n_n} i_k} \quad (2)$$

where $n_n$ is the number of created nodes. The nodes are allocated dynamically where they are necessary [13], keeping the network structure as small as possible. Each time a situation is presented to the network, the activation value of each node is calculated. If all the values are lower than a threshold, $a_{\min}$, a new node is created. The center of this new node coincides with the input vector presented to the neural network, $\vec{c}_j = \vec{i}$. The connection weights between the new node and the output layer are initialized to small random values.
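A minimal sketch of this input layer, implementing Eqs. (1) and (2) together with the dynamic node allocation (NumPy assumed; the random initialization of the new node's output weights is left to the surrounding network):

```python
import numpy as np

class RBFLayer:
    """Input layer of dynamically allocated RBF nodes, Eqs. (1)-(2)."""

    def __init__(self, sigma_rbf, a_min):
        self.sigma_rbf = sigma_rbf    # width of the activation function
        self.a_min = a_min            # threshold below which a new node is created
        self.centers = []             # one center vector per created node

    def activations(self, x):
        """Return normalized activations for input x, allocating a node if needed."""
        x = np.asarray(x, dtype=float)
        act = np.array([np.exp(-np.sum((x - c) ** 2) / self.sigma_rbf ** 2)
                        for c in self.centers])                      # Eq. (1)
        if act.size == 0 or act.max() < self.a_min:
            self.centers.append(x.copy())     # new node centered on the input vector
            act = np.append(act, 1.0)         # its activation is exp(0) = 1
        return act / act.sum()                # Eq. (2)
```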


6.2. Output layer

The output layer must find the best action for each situation. The recommended action is a weighted sum of the values given by the input layer:

$$o_k^r = \sum_{j=1}^{n_n} w_{kj}\, i_j^n, \qquad 1 \le k \le n_o, \quad (3)$$

where $n_o$ is the number of output layer nodes. During the learning process, it is necessary to explore, for the same situation, all the possible actions to discover the best one. The real final action is drawn from a normal distribution centered on the recommended value with variance $\sigma$:

$$o_k^f = N(o_k^r, \sigma). \quad (4)$$

As the system learns a suitable action for each situation, the value of $\sigma$ is decreased, so that the system ends up performing the same action for a learned situation [18].
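In code, the stochastic action selection of Eqs. (3) and (4) reduces to a few lines (a sketch under the same assumptions as above; W is the matrix of connection weights):

```python
import numpy as np

def select_action(W, act_n, sigma):
    """W: (n_o, n_n) output weights; act_n: normalized activations from Eq. (2);
    sigma: exploration spread, decreased as the skill is learned."""
    o_r = W @ act_n                         # Eq. (3): recommended action
    o_f = np.random.normal(o_r, sigma)      # Eq. (4): Gaussian exploration around it
    return o_r, o_f
```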

In order to improve the results, the weights of the output layer are adapted according to

$$w_{kj}(t+1) = w_{kj}(t) + \Delta w_{kj}(t) \quad (5)$$

$$\Delta w_{kj}(t) = \beta \cdot (r_{kj}(t) - r_{kj}(t-1)) \cdot \frac{\mu_{kj}(t)}{\sum_l \mu_{lj}(t)} \quad (6)$$

$$e_{kj}(t) = \frac{o_k^f - o_k^r}{\sigma} \cdot i_j^n \quad (7)$$

$$\mu_{kj}(t+1) = \nu \cdot \mu_{kj}(t) + (1-\nu) \cdot e_{kj}(t) \quad (8)$$

where $\beta$ is the learning rate, $\mu_{kj}$ is the eligibility trace, $e_{kj}$ is the eligibility of the weight $w_{kj}$, and $\nu$ is a value in the [0, 1] range. The weight eligibility measures how much the weight influences the action, and the eligibility trace allows rewarding or punishing not only the last action but also the previous ones. $r_{kj}$ is obtained from the expression

$$r_{kj}(t) = \begin{cases} r_{\mathrm{ext}}(t) & \text{if } i_j^n \neq 0 \\ r_{kj}(t-1) & \text{otherwise} \end{cases} \quad (9)$$

where $r_{\mathrm{ext}}$ is the external reinforcement. The results of an action depend on the activated states, so only the reinforcement values associated with these states are updated.
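The following sketch gathers Eqs. (5)-(9) into one update step. It is an interpretation, not the authors' code: in particular, we read the condition in Eq. (9) as selecting the nodes with nonzero activation, and r_ext is the external reinforcement defined in Section 6.3:

```python
import numpy as np

def update_weights(W, mu, r_prev, act_n, o_r, o_f, r_ext, sigma, beta, nu):
    """One learning step for the output layer.

    W, mu, r_prev : (n_o, n_n) weights, eligibility traces and the last
                    reinforcement received by each weight (Eq. (9))."""
    e = np.outer((o_f - o_r) / sigma, act_n)        # Eq. (7): weight eligibilities
    r_new = r_prev.copy()                           # Eq. (9): only activated nodes
    r_new[:, act_n != 0] = r_ext                    # receive the new reinforcement
    denom = mu.sum(axis=0, keepdims=True)           # normalization over outputs l
    denom[denom == 0] = 1.0                         # guard against division by zero
    W += beta * (r_new - r_prev) * mu / denom       # Eqs. (5)-(6)
    mu[:] = nu * mu + (1 - nu) * e                  # Eq. (8): trace update
    return W, mu, r_new
```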

6.3. Reinforcement signal

The robot receives an external reinforcement signal, $r_{\mathrm{ext}}$, after every action it performs, which increases the learning speed. Its value lies in the [0, 1] range, where 0 means a bad evaluation and 1 a good one. Our reinforcement signal has the following form:

$$r_{\mathrm{ext}} = \exp\left(-\frac{k\,\mathrm{error}^2}{2}\right) \quad (10)$$

For the watching skill the error is

$$\mathrm{error} = \sqrt{(x_{oc}-x_{ic})^2 + (y_{oc}-y_{ic})^2} \ \text{ in pixels} \quad (11)$$

where $x_{oc}$ and $y_{oc}$ are the object center coordinates in the image plane and $x_{ic}$ and $y_{ic}$ are the image center coordinates.

For the robot orientation skill the error is

$$\mathrm{error} = \mathrm{angle}_{\text{turret-robot}} \ \text{ in radians} \quad (12)$$

where $\mathrm{angle}_{\text{turret-robot}}$ is the angle formed by the robot body and the turret.
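Both error measures and the shared reinforcement of Eq. (10) are straightforward to compute; in this sketch the gain k is a free parameter whose value the text does not specify:

```python
import math

def reinforcement(error, k):
    """Eq. (10): external reinforcement in [0, 1]; 1 means a perfect action."""
    return math.exp(-k * error ** 2 / 2)

def watching_error(x_oc, y_oc, x_ic=0.0, y_ic=0.0):
    """Eq. (11): pixel distance from the object center to the image center."""
    return math.hypot(x_oc - x_ic, y_oc - y_ic)

def orientation_error(body_turret_angle):
    """Eq. (12): body-turret angle in radians, read from the pan encoder."""
    return body_turret_angle
```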

Fig. 11. Robot RWI-B21.


7. Experimental results

The experiments have been carried out on an RWI-B21 mobile robot, shown in Fig. 11, equipped with a vision system formed by a single monochrome CCD camera mounted on a pan-tilt platform. The robot has two computers. One PC is dedicated to image acquisition and processing; the vision processing hardware is a Matrox card. The other PC is dedicated to running the movement modules of the robot and of the pan-tilt platform and to executing the robot skills.

In the watching skill, the robot must learn the mapping from the object center coordinates (x, y) to the turret motor commands (pan, tilt). In our experiment, a cycle starts with the target on the image plane at an initial position of (243, 82) pixels, and ends when the target leaves the image or when the target reaches the image center (0, 0) and stays there. The turret pan and tilt movements are coupled, so that an x-axis movement involves a y-axis movement and vice versa. This makes the learning task difficult. Fig. 12 shows the robot performance while learning the watching skill. The plots represent the X–Y object coordinates on the image plane. As seen in the figure, the robot improves its performance while it is learning. In the first cycles, the target leaves the image after a few learning steps, while in cycle 6 the robot is able to center the target on the image rapidly. The robot learns to center the object from different positions on the image without forgetting what it has already learned. The learning parameter values are β = 0.01, µ = 0.3, σ_RBF = 0.2 and a_min = 0.2.

Fig. 12. Learning results in the watching skill.

Once the robot has achieved a good level of performance in the watching skill, it learns the orientation skill. In this case, the robot must learn the mapping from the turret angle (pan) to the robot angular velocity (θ). To align the robot's body with the turret while maintaining the target in the center of the image, the robot has to turn an angle. Because the turret is mounted on the robot's body, the target is displaced on the image. The learned watching skill obliges the turret to turn to center the object, so the robot body-turret angle decreases. The experimental results for this skill are shown in Fig. 13. The plots represent the robot's angle as a function of the number of learning steps. In this case, a cycle starts with the robot's angle at −0.61 radians, and it ends when the body is aligned with the pan-tilt platform. As Fig. 13 shows, the number of learning steps decreases. The learning parameter values are β = 1.0, µ = 0.3, σ_RBF = 0.1 and a_min = 0.2.

Fig. 13. Learning results in the orientation skill.

Once the robot has learned the above skills, it is able to perform the approaching skill by coordinating them. Fig. 14 shows the results. This experiment consists of the robot going towards a goal which is a visual target. First, the robot moves the turret to center the target on the image, and then the robot moves towards the target. The plot represents the robot's X–Y positions, where the circle is the robot and the X is the goal. The robot starts with its angle at 0 radians and its position at (0, 0). Position information is obtained from the robot odometry.

Fig. 14. Approaching skill results.

8. Summary and conclusions

We have proposed a reinforcement learning algorithm which allows a robot to learn simple skills. The main advantages of this algorithm are that it is not necessary to estimate an expected reward, because the robot receives a real continuous reinforcement signal each time it performs an action, and that the learning is on-line, so that the system can adapt to changes produced in the environment.

This paper also presents a generic structure definition to implement perceptive and sensorimotor skills in a robot. This structure allows the generation of complex skills by three different methods. All skills have the same characteristics: they can be activated by other skills from the same level or from a higher level, their output data can be stored in data objects in order to be used by other skills, and skills notify events to other skills which want to receive notification.

We have applied our algorithm to an autonomous mobile robot equipped with a vision system, which is able to correctly perform the complex skill called approach from other learned simple skills such as watch and orientation.

Acknowledgements

The authors gratefully acknowledge the funds provided by the Spanish Government through the CICYT project TAP1999-214.

References

[1] R. Alami, R. Chatila, S. Fleury, M. Ghallab, F. Ingrand, An architecture for autonomy, The International Journal of Robotics Research 17 (4) (1998) 315–337.

[2] R. Barber, M.A. Salichs, A new human based architecture for intelligent autonomous robots, in: The Fourth IFAC Symposium on Intelligent Autonomous Vehicles, IAV 01, 2001, pp. 85–90.

[3] M. Becker, E. Kefalea, E. Maël, C. Von Der Malsburg, M. Pagel, J. Triesch, J.C. Vorbrüggen, R.P. Würtz, S. Zadel, GripSee: A gesture-controlled robot for object perception and manipulation, Autonomous Robots 6 (1999) 203–221.

[4] J. Firby, E. Gat, D. Kortenkamp, D.P. Miller, M.G. Slack, R.P. Bonasso, Experiences with an architecture for intelligent reactive agents, Journal of Experimental and Theoretical Artificial Intelligence 9 (1997) 237–256.

[5] R.P. Bonasso, J. Firby, E. Gat, D. Kortenkamp, D.P. Miller, M.G. Slack, Experiences with an architecture for intelligent reactive agents, Journal of Experimental and Theoretical Artificial Intelligence 9 (1997) 237–256.

[6] R.A. Brooks, A robust layered control system for a mobile robot, IEEE Journal of Robotics and Automation 2 (1) (1986) 14–23.

[7] R.J. Firby, Task networks for controlling continuous processes, in: K. Hammond (Ed.), Proceedings of the Second International Conference on AI Planning Systems, AAAI Press, 1994, pp. 49–54.

[8] D. Gachet, M.A. Salichs, L. Moreno, J.R. Pimentel, Learning emergent tasks for an autonomous mobile robot, in: Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, Advanced Robotic Systems and the Real World, 1994, pp. 290–297.

[9] E. Gat, Integrating planning and reacting in a heterogeneous asynchronous architecture for controlling real-world mobile robots, in: Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, July 1992.

[10] Y. Hasegawa, T. Fukuda, Learning method for hierarchical behavior controller, in: Proceedings of the 1999 IEEE International Conference on Robotics and Automation, Detroit, MI, 1999, pp. 2799–2804.

[11] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research 4 (1996) 237–285.

[12] F. Michaud, M. Matarić, Learning from history for behavior-based mobile robots in non-stationary conditions, Autonomous Robots 4 (1998) 335–354.

[13] D. Obradovic, On-line training of recurrent neural networks with continuous topology adaptation, IEEE Transactions on Neural Networks 7 (1) (1996) 222–228.

[14] M.J. Swain, P.N. Prokopowicz, R.E. Kahn, Task and environment-sensitive tracking, in: Proceedings of the Workshop on Visual Behaviours, 1994, pp. 73–78.

[15] M.R.K. Ryan, M.D. Pendrith, RL-TOPs: An architecture for modularity and re-use in reinforcement learning, in: Machine Learning, Proceedings of the 15th International Conference, ICML'98, Madison, WI, 1998, pp. 481–487.

[16] R.M. Shiffrin, Attention, in: R.C. Atkinson, R.H. Herrnstein, G. Lindzey, R.D. Luce (Eds.), Stevens' Handbook of Experimental Psychology, 2nd Edition, Wiley, New York, 1988, pp. 739–811.

[17] W. Schneider, R.M. Shiffrin, Controlled and automatic human information processing, II: Perceptual learning, automatic attending and a general theory, Psychological Review 84 (1977) 127–190.

[18] A.S. Soembagijo, H. Van Brussel, Robot visual tracking control using neural networks, in: Intelligent Autonomous Systems, 1995, pp. 562–568.

[19] R.S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning 3 (1) (1988) 9–44.

[20] C.J.C.H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (3) (1992) 279–292.

Maria Jesus Lopez Boada received the Industrial Engineering degree from Carlos III University of Madrid in 1996. Since 1997 she has been a research assistant in Systems Engineering and Automation at the University Carlos III of Madrid. Currently she is a PhD candidate working on the development of skills in an autonomous mobile robot and on skill learning.


Ramon Barber is a research assistant at the System Engineering and Automation Unit at the University Carlos III of Madrid, Spain. He received his BSc degree in Industrial Engineering from the Polytechnic University of Madrid in 1994, and his PhD degree in Industrial Technologies from the University Carlos III in 2000, where he developed a new control architecture for mobile robots based on topological navigation. His current research area is the automatic generation of topological maps. He is a member of the International Federation of Automatic Control (IFAC).

Miguel A. Salichs received the Electrical Engineering and PhD degrees from the Polytechnic University of Madrid in 1978 and 1982, respectively. He is currently a Full Professor at the Systems Engineering and Automation Unit at the University Carlos III of Madrid. He is Chairman of the Technical Committee on Intelligent Autonomous Vehicles of the International Federation of Automatic Control (IFAC). He has published more than 80 papers on robotics and automation. His primary research interests are mobile robotics, intelligent autonomous systems, and service and personal robots.