
Behavioural Cloning for Driving Robots over Rough Terrain

Raymond Sheh, Bernhard Hengst and Claude Sammut

Abstract— Controllers for autonomous mobile robots that operate in rough terrain must consider the shape of the surrounding terrain and its impact on the robot’s movements. For complex terrain, these interactions are extremely difficult to model in a way that allows traditional controllers to be built.

We have used Behavioural Cloning, a type of learning by imitation that produces rules that clone the skills of an expert human operator. We have also developed an autonomous instructor in simulation and used it to generate training data from which we have cloned controllers. The resulting controllers perform at a level comparable to that of a human expert. The controllers behave similarly both in simulation, where they were developed, and on the physical robot without the need for further modification or training.

I. INTRODUCTION

We use Machine Learning to build a controller for a mobile robot traversing rough terrain that it is unable to traverse using a conventional controller. Examples of such robots include those used for fire-fighting, urban search and rescue and explosive ordnance disposal. Complex interactions between mobile robots and these environments cannot be easily modelled, thus making it difficult to construct autonomous controllers that may be used to replace or assist human operators. Instead of attempting to model these difficult interactions, most approaches make simplifying assumptions, such as all wheels staying on the ground or the body not making contact with the terrain. They may then attempt to detect and avoid terrain that is likely to result in these assumptions being violated. Such an approach excludes the robot from terrain that it may be physically able to traverse using a more capable controller.

We describe a controller that uses Behavioural Cloning to learn its control strategy, instead of mathematical modelling. It is given high-level goals and makes decisions on low-level control actions in order to achieve those goals. Its source of training data is usually a behavioural trace provided by a human expert. We have also investigated the use of a simulated agent as the source of training data.

Novel contributions of this work include the use of Behavioural Cloning to interpret and make decisions based on a high-dimensional set of features extracted from range images, where the detailed mapping between features and actions is poorly understood; the application to both simulated and real robots without further modification; and the use of a simulated demonstrator that can provide the training data for cloning.

R. Sheh, B. Hengst and C. Sammut are with the ARC Centre of Excellence for Autonomous Systems, School of Computer Science and Engineering, The University of New South Wales, UNSW SYDNEY NSW 2052, AUSTRALIA. [rsheh|bernhardh|claude]@cse.unsw.edu.au

A. Behavioural Cloning

In Behavioural Cloning (Sammut [1], Michie [2], Michie and Camacho [3], Bain and Sammut [4], Bratko et al. [5]), a human instructor first demonstrates the control task by observing a visualisation of the same sensor data that is available to an autonomous controller and selecting the appropriate action that the controller should perform. A machine learning algorithm takes these examples of the performance of the task and builds a model of the instructor’s decision making process. The model generalises these examples. At runtime, the controller uses this model to evaluate novel situations that it encounters, with the aim of choosing the actions that the instructor would have chosen.

The models used in Behavioural Cloning are intended to be meaningful to humans, providing insight into the instructor’s decision making processes. By the time a human is expert at a task, their decision making is often unavailable to introspection, i.e. humans can perform many tasks without thinking. By creating models that are meaningful to humans, Behavioural Cloning allows us to reconstruct this decision making. The use of readable models can also make it easier to check the control strategy and fine-tune its training. In some cases it may be necessary, for reasons of safety, to be able to inspect the control strategy to predict what it will do or to justify an observed action. This contrasts with other kinds of learning by imitation (Pomerleau [6], Abbeel and Ng [7], Hamner et al. [8], Šuc and Bratko [9], Bentivegna et al. [10]) where the ability to obtain an insight into the decision making process is not a primary objective.

B. Robot and Application

We use “Emu”, a four-wheel-drive robot shown in Figure 1 that was built for autonomous operations in the RoboCup Rescue Robot League competition. It is capable of traversing a variety of terrain, from flat flooring to low rubble. It is equipped with a MESA SR-3100 range imager for terrain sensing, an XSens MTx heading-attitude sensor and a SLAM-based position tracker that uses an on-board Hokuyo URG-04LX range scanner on a self-levelling mount.

The range imager suffers from motion blur. To obtain sufficient resolution, the robot must be stationary when it observes the terrain¹. For our experiments, the robot can perform one of eight possible atomic actions. Each action drives the wheels on each side of the robot at a pre-set speed for one second before the robot stops to take another observation.

¹Recent experiments with newer sensors such as the Kinect exhibit less motion blur. Newer imaging technology may eliminate the need for stop-and-go operation.


Fig. 1. The real (left) and simulated (right) Emu robot. The range imager is on top of the sensor pole. The simulated robot is shown with an example of an Irregular stepfield.

The eight actions, when performed on flat ground, result in a forward-left-turn, straight-forward-drive, forward-right-turn, spin-left, spin-right, backward-left-turn, straight-backward-drive and backward-right-turn respectively. On uneven and rough terrain, the result of each action becomes very difficult to predict using analytical methods.
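As a concrete illustration, the following Python sketch encodes the eight atomic actions as differential-drive speed pairs. The paper only states that each side is driven at a pre-set speed for one second; the specific speed values, sign conventions and function names below are illustrative assumptions, not the authors' implementation.

```python
from enum import Enum

FULL, HALF = 1.0, 0.5   # assumed normalised pre-set speeds; the paper does not give values

class Action(Enum):
    """The eight atomic actions as (left-side, right-side) wheel speed pairs.
    Magnitudes and signs are illustrative; each action runs for one second."""
    FORWARD_LEFT   = (HALF,  FULL)
    FORWARD        = (FULL,  FULL)
    FORWARD_RIGHT  = (FULL,  HALF)
    SPIN_LEFT      = (-FULL, FULL)
    SPIN_RIGHT     = (FULL, -FULL)
    BACKWARD_LEFT  = (-HALF, -FULL)
    BACKWARD       = (-FULL, -FULL)
    BACKWARD_RIGHT = (-FULL, -HALF)

def wheel_speeds(action: Action) -> tuple:
    """Return the (left, right) side speeds to command for the one-second step."""
    return action.value
```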

The aim of this work is not to build the most capable system. Rather, it is to demonstrate that behavioural cloning may be used to close the “performance gap” between the performance of a robot under the control of conventional controllers and its performance under the control of a human operator, in a task that requires interpretation of complex terrain. Although Emu is mechanically simple and must be operated in a stop-start fashion, the setup is sufficient for the purposes of the demonstration.

We trained and evaluated the controllers using a sequence of three stepfields, as described by Jacoff et al. [11], with a total size of 1.1×3.3m. Stepfields are designed to be experimentally reproducible versions of rubble and unstructured terrain. They consist of wooden blocks that are reconfigured between experiments to produce a wide variety of novel terrains. An example of a stepfield sequence is shown in Figure 1. We developed a stepfield specification that consists of a fixed primary pattern surrounded by random blocks. The robot’s task is to drive in a specified direction through a prescribed navigation corridor, which takes the robot over the length of the stepfield sequence and the areas of flat ground at either end. The controller is deemed to have failed if the robot flips over, becomes stuck or leaves the corridor. This task is very difficult, even for robot operators who have eyes-on control and who have experience driving “Emu” in easier terrain. We used a human expert who has won awards for driving robots over rough terrain in the RoboCup Rescue Robot League competition.

Both the robot and stepfields were reproduced in a simulator, based on the JMonkeyEngine Physics 2 development environment, which incorporates an ODE-based physics engine [12]. We used the procedure outlined by Pepper et al. [13] to validate the physics model of the robot. The performance of the resulting simulated robot matches that of the real robot over a variety of test terrains, including terrains on and beyond the limit of the robot’s capabilities.

We used the simulator to test and fine-tune our training and learning techniques, but we also evaluated the entire system by training and testing on the real robot. Policies trained in simulation were evaluated on the real robot without modifications. We also used the simulator to implement another controller based on a forward search. This controller is used as an automated instructor from which our cloned controllers learn their behaviour. We compare the performance of the resulting controllers with those trained on demonstrations from the human instructor.

II. APPROACH

The controller is limited to observing sensor data and deciding on one of the eight actions to move the robot in the desired direction within a navigation corridor provided by the higher levels of the control stack. Path re-planning, or backing out and re-trying after failure on impassable terrain, is the responsibility of the higher levels of the control stack. The state consists of the robot’s pitch and roll, determined from the heading-attitude unit, terrain data from the range imager, and the robot’s heading relative to the desired direction and offset within the navigation corridor, derived from a combination of the higher-level goal and the position tracker. Due to sensing limitations, only the 3D shape of the terrain is considered and we disregard variations in surface deformation and friction.
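The state described above can be gathered into a single record per control cycle. The following minimal sketch shows one possible structure; the field names and types are illustrative assumptions, since the paper specifies the content of the state but not a concrete data layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ControllerState:
    pitch: float            # from the heading-attitude unit (radians, assumed)
    roll: float             # from the heading-attitude unit (radians, assumed)
    heading_error: float    # heading relative to the desired direction of travel
    corridor_offset: float  # lateral offset within the navigation corridor (metres)
    height_map: np.ndarray  # robot-relative terrain heights fused from range images
```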

A. Terrain Processing and Feature Extractor

The controller must consider the terrain around the robot in order to make its decisions. This requires a memory of past sensor data so that it can reason about what is under and around the robot. To allow situation-action rules to work, we fuse the past and present data from the range imager, heading-attitude unit and on-board position tracker to generate a world model consisting of a robot-relative, grid-based height map of the terrain around the robot. This map is very high dimensional, with a 5mm-10mm cell size in order to capture the high spatial frequency in the terrain. We have developed two different feature extractors that reduce the dimensionality of these data to make learning tractable. The first makes use of domain knowledge. The second does not and was developed so that it could be compared against the knowledge-based method.
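A minimal sketch of this kind of world model is shown below, assuming range-image points have already been transformed into the world frame using the position tracker and attitude sensor. The 5mm cell size is from the paper; the sparse dictionary representation, the max-height fusion rule and the helper names are illustrative assumptions.

```python
import math
import numpy as np

CELL = 0.005  # 5 mm grid cells, at the fine end of the 5-10 mm range in the paper

def accumulate(world_grid: dict, points_world: np.ndarray) -> None:
    """Fuse world-frame points (N x 3) into a sparse height grid keyed by cell index."""
    cells = np.floor(points_world[:, :2] / CELL).astype(int)
    for (i, j), z in zip(map(tuple, cells), points_world[:, 2]):
        world_grid[(i, j)] = max(world_grid.get((i, j), z), z)  # keep the highest return

def robot_relative_height(world_grid: dict, robot_xy, robot_yaw,
                          forward: float, left: float):
    """Look up the fused height at a point given in the robot's frame (None if unseen)."""
    wx = robot_xy[0] + forward * math.cos(robot_yaw) - left * math.sin(robot_yaw)
    wy = robot_xy[1] + forward * math.sin(robot_yaw) + left * math.cos(robot_yaw)
    return world_grid.get((math.floor(wx / CELL), math.floor(wy / CELL)))
```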

The first feature extractor, “Planes and Deviations”, is an evolution of that discussed by Kadous et al. [14], [15]. It was inspired by interviews and eye tracking experiments with expert operators. These showed that the overall strategy is governed by the general layout of the terrain, such as a hill, flat or valley. Specific deviations, such as protrusions and holes, vary this overall strategy. The terrain map is divided into the regions of interest shown in Figure 2. For each region, three numeric features represent the average robot-relative height, steepness of slope and direction of slope of a plane of best fit. Another three numeric features represent the co-ordinates of the greatest deviation. Features are set to unknown for regions with <10% sensed coverage.
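The sketch below illustrates these six features for a single region of interest: a least-squares plane z = ax + by + c is fitted to the observed cells, and the mean height, slope steepness, slope direction and the largest deviation from the plane are reported. The <10% coverage rule is from the paper; the use of None for unknown and the exact fitting and encoding choices are assumptions.

```python
import numpy as np

def planes_and_deviations(xs, ys, zs, region_area_cells: int):
    """xs, ys, zs: 1-D arrays of observed cell centres and heights within one region."""
    if len(zs) < 0.10 * region_area_cells:
        return None                               # unknown: under 10% sensed coverage
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    (a, b, c), *_ = np.linalg.lstsq(A, zs, rcond=None)  # plane of best fit z = ax + by + c
    mean_height = float(np.mean(zs))
    steepness = float(np.hypot(a, b))             # magnitude of the plane gradient
    direction = float(np.arctan2(b, a))           # direction of steepest ascent
    residuals = zs - A @ np.array([a, b, c])
    k = int(np.argmax(np.abs(residuals)))         # greatest deviation from the plane
    return (mean_height, steepness, direction,
            float(xs[k]), float(ys[k]), float(residuals[k]))
```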

The second feature extractor, “Block Patterns”, divides the terrain around the robot into a regular grid. The numeric features are the average robot-relative height of each cell. We use a 16×30 grid of 5×5cm cells centred on the robot. Any cell with <10% observed area is set to unknown.
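A possible reading of this extractor is sketched below: the fine 5mm height grid is pooled into a 16×30 grid of 5cm cells centred on the robot, each feature being the mean height of the fine cells that fall inside it. The grid dimensions, cell size and coverage threshold are from the paper; the axis assignment (rows lateral, columns longitudinal) and the NaN encoding of unknown cells are assumptions.

```python
import math
import numpy as np

ROWS, COLS, CELL = 16, 30, 0.05     # 16 x 30 grid of 5 cm cells (from the paper)
FINE = 0.005                        # 5 mm world-model cells
PER_CELL = int((CELL / FINE) ** 2)  # fine cells per feature cell (100)

def block_pattern_features(fine_map: dict) -> np.ndarray:
    """fine_map: {(i, j): height} robot-relative 5 mm grid with its origin at the robot.
    Returns a ROWS x COLS array of mean heights, with np.nan marking unknown cells."""
    feats = np.full((ROWS, COLS), np.nan)
    sums = np.zeros((ROWS, COLS))
    counts = np.zeros((ROWS, COLS))
    for (i, j), z in fine_map.items():
        r = math.floor(j * FINE / CELL + ROWS / 2)   # lateral index (assumed)
        c = math.floor(i * FINE / CELL + COLS / 2)   # longitudinal index (assumed)
        if 0 <= r < ROWS and 0 <= c < COLS:
            sums[r, c] += z
            counts[r, c] += 1
    seen = counts >= 0.10 * PER_CELL                 # <10% observed area => unknown
    feats[seen] = sums[seen] / counts[seen]
    return feats
```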


Fig. 2. Overlapping Regions of Interest for the Planes and Deviations feature extractor, separated into the smaller (A) and larger (B). Figure (C) shows the robot to scale (forward is up).

B. Human Instructor

In each control cycle, the human instructor observes visualisations, shown in Figure 3(A), of the same state data that the controller uses to form its world model. It is important that the human operator only has access to the same information as the autonomous controller. If the instructor uses other information, it may not be possible for a learning algorithm to infer that information and therefore the task may not be learnable. So the world model must be sufficient for learning. In response to the sensor data, the instructor chooses the action most appropriate to move the robot in the desired direction within the navigation corridor. This set of records of state and action forms the training data.

The instructor demonstrates the control task separately on real and simulated robots over randomly generated stepfield sequences. The simulator enables the automatic generation of many random stepfields and the gathering of large quantities of training data. We discard runs where the human instructor failed, as our learning system is unable to learn from failures². However, we deliberately retain runs where the human instructor makes a non-fatal error and recovers. As discussed in Sammut et al. [17], these examples are valuable as they show how to recover from mistakes in areas of the state space that would otherwise be unexplored. The initial mistake is also incorporated in the training data; however, we observe that in most cases there are other, correct examples from other demonstrations in similar surroundings. As observed by Michie et al. [18], under such circumstances behavioural cloning exhibits a “clean-up” effect and is still able to learn the correct decisions.

C. Autonomous Instructor

In addition to the human instructor, the simulator allows us to implement an automated instructor based on a forward search, like that used by Green [19].

²In this paper, we focus on Behavioural Cloning. However, we have also investigated an alternative method that incorporates principles from Reinforcement Learning and allows the agent to learn from failures. It is described in detail by Sheh [16], including comparative results.

Fig. 3. (A) The user interface showing the position of the corridor relative to the robot, false colour images representing the range and reflectance outputs of the range imager and artificial horizon indicators. (B) The robot and its environment. Note that the operator cannot see the robot directly.

A random sequence of stepfields is generated in the simulator and an A* search is performed for a sequence of actions that will drive the robot over the terrain to a goal line one metre from the robot’s starting position, while keeping within the navigation corridor. The short goal distance was chosen to match the narrow field of view of the range imager. The nodes in the A* search graph represent locations on the stepfield while the edges represent actions that move the robot from one location to another. We cannot represent the problem as a discrete grid of nodes in space. Variations in position of as little as 1cm can make a big difference to the outcome of an action on uneven terrain, so the required resolution would result in a prohibitive number of nodes in the search graph.

Instead, the search graph is dynamically grown as the search proceeds. To expand a node, the simulated robot is “teleported” to the position and orientation represented by that node and each action is taken in turn. The position of the robot at the end of each atomic action is represented by a new node and the search proceeds. The A* search is quite slow, with each node expansion requiring around a second and with a potential branching factor of 8.

Once a path has been found through a terrain in simulation, the path is replayed. Simulated sensor data, representing the robot state, are gathered at each step and stored along with the best action found in the course of the forward search.

An A* search requires a deterministic process but the simulator is stochastic. To make the process deterministic, each time a node, representing a particular position and orientation (and, thus, identical surrounding terrain), is revisited, the result of each action is taken from the graph rather than by re-simulating. From the perspective of the A* search, the system is deterministic. The graph is re-built for each 1m run, so while the simulator is stochastic between runs, it is deterministic within the search for a given run.
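The sketch below illustrates this style of search under the assumptions stated above: nodes are continuous robot poses created as the search expands, expanding a node simulates each atomic action once from that pose, and every (pose, action) outcome is cached so that revisits replay the stored result rather than re-simulating. The function simulate_action stands in for the physics engine, and the pose rounding, corridor width and step-count heuristic are illustrative assumptions rather than the authors' values.

```python
import heapq
import itertools

GOAL_X, CORRIDOR_HALF_WIDTH, MAX_STEP = 1.0, 0.55, 0.15  # metres (illustrative)

def a_star_instructor(start_pose, actions, simulate_action):
    """start_pose: (x, y, yaw) with the goal line at x = GOAL_X ahead of the start.
    Returns a list of actions reaching the goal line, or None if no path is found."""
    cache = {}                                    # (pose key, action) -> resulting pose
    key = lambda pose: tuple(round(v, 3) for v in pose)
    h = lambda pose: max(0.0, GOAL_X - pose[0]) / MAX_STEP   # optimistic steps-to-go
    counter = itertools.count()                   # tie-breaker for the heap
    frontier = [(h(start_pose), next(counter), start_pose, [])]
    best_g = {key(start_pose): 0}
    while frontier:
        _, _, pose, plan = heapq.heappop(frontier)
        if pose[0] >= GOAL_X:
            return plan                           # crossed the one-metre goal line
        g = len(plan)                             # one second of driving per action
        for a in actions:
            k = (key(pose), a)
            if k not in cache:                    # simulate once, then reuse the result
                cache[k] = simulate_action(pose, a)
            nxt = cache[k]
            if abs(nxt[1]) > CORRIDOR_HALF_WIDTH:
                continue                          # left the navigation corridor
            nk = key(nxt)
            if g + 1 < best_g.get(nk, float("inf")):
                best_g[nk] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), next(counter), nxt, plan + [a]))
    return None
```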

This approach is similar to that used by Ng and Jordan [20], where they freeze the stochasticity of a system in order to enable a search that requires determinism. Note that we cannot cache the result of a single simulation from a single location and use it throughout the graph, as each node in the graph represents a different robot location and, thus, different surroundings that may affect the robot’s motions differently.

The forward search does not make use of sensor data at all. It simply tries different sequences of actions in the simulated environment until it succeeds. The final path only considers terrain that the search was able to range over. The A* search has an effective “sensing footprint” that is its search area. In contrast, the real sensors are limited in field of view to 0.5m-1m depending on the robot’s pitch. Therefore, we only take training examples from nodes along the found path where the A* search from that point has a subsequent search range, or “sensing footprint”, of between 0.5m and 1m.
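Expressed as a filter over the found path, this selection rule might look like the following sketch, where search_extent_m is a hypothetical stand-in for a measurement of how far ahead of a node the search expanded.

```python
def select_training_nodes(path_nodes, search_extent_m,
                          min_range: float = 0.5, max_range: float = 1.0):
    """Keep only nodes whose subsequent search range matches the real sensing footprint."""
    return [n for n in path_nodes if min_range <= search_extent_m(n) <= max_range]
```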

The search has “perfect” knowledge of the robot’s future behaviour, which is clearly unrealistic. This causes an aliasing problem in which different situations, where the autonomous instructor takes different actions, may appear the same to the sensors. With the few training examples provided by the human expert, this would be undesirable. However, the autonomous demonstrator may be run in parallel over long periods and so it generates significantly more data. Therefore, the controller has the opportunity to learn the most likely action to take given a particular state, across all the underlying aliased states that it may represent.

D. Machine Learning

For the purpose of machine learning, the features extracted from the state form the attributes and the action selected by the instructor forms the class; thus there are 8 mutually exclusive classes. Decision tree learning was selected because of its speed and stability and because decision trees are easy to read. Decision trees are also able to gracefully handle unknown attributes, both during training and classification. We use J48, based on C4.5 by Quinlan [21], implemented in the Weka machine learning toolkit by Hall et al. [22].
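The paper's classifier is Weka's J48. As a rough Python stand-in only, the sketch below trains a scikit-learn decision tree on the same attribute/class layout; unlike J48, it does not natively treat unknown attributes, so unknowns are crudely imputed here. The data shapes and parameter choices are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def train_clone(X: np.ndarray, y: np.ndarray):
    """X: one row of extracted features per demonstration step (np.nan = unknown);
    y: the instructor's chosen action, one of the eight atomic action labels."""
    clone = make_pipeline(
        SimpleImputer(strategy="mean"),              # crude handling of unknown attributes
        DecisionTreeClassifier(min_samples_leaf=2),  # illustrative pruning setting
    )
    return clone.fit(X, y)
```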

Cross-validation is often used to evaluate and select the various parameters in machine learning algorithms. However, our preliminary experiments showed that cross-validation accuracy does not directly correspond to performance on this task. Instead, we evaluated and tuned parameters on the performance of the complete system in simulation. Our metric was the success rate of a controller, using a particular set of parameters, in completing the terrain traversal task over many examples of randomly generated terrain.

The J48 decision tree was evaluated with a variety of training and pruning parameters. We found that changing these parameters yielded no significant improvement over their default values. We also generated decision trees by randomly sampling differing numbers of training examples from the available training data, evaluated them on the task in simulation and used the best performing tree. In this manner, we can determine the best number of training examples to use. We can avoid overfitting as the candidate decision trees are evaluated on the task, rather than on a test set.
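A minimal sketch of this task-based model selection is shown below: trees trained on random samples of differing sizes are each scored by their success rate on the simulated traversal task, and the best is kept. The function run_traversal_trial, which would run one dash over a freshly generated stepfield and return whether it succeeded, is a hypothetical stand-in for the simulator harness, and the candidate sizes and trial count are illustrative.

```python
import numpy as np

def select_by_task_success(X, y, train_clone, run_traversal_trial,
                           sizes=(250, 500, 1000, 2000), trials=500, seed=0):
    """Pick the training-set size whose resulting tree best completes the task."""
    rng = np.random.default_rng(seed)
    best_rate, best_model = -1.0, None
    for n in sizes:
        idx = rng.choice(len(y), size=min(n, len(y)), replace=False)
        model = train_clone(X[idx], y[idx])
        rate = float(np.mean([run_traversal_trial(model) for _ in range(trials)]))
        if rate > best_rate:
            best_rate, best_model = rate, model
    return best_model, best_rate
```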

We used the same evaluation method for controllers that were generated by different machine learning algorithms, including Support Vector Machines, Locally Weighted Regression and meta-learning techniques such as AdaBoost.

We found there to be no significant difference between the performance of J48 and the best alternative learning techniques. J48 was one of the fastest to learn and run. Sheh [16] presents detailed results of comparisons between different learning algorithms.

E. Controller

At runtime, a world model is constructed from sensor data and features are extracted, just as during demonstration. The features are input as attributes into the learned decision tree and the resulting classification corresponds to the action that the robot should perform. This controller has no memory, so if it makes an incorrect choice of action that results in no movement, it will keep taking that same action and become stuck in an infinite loop. This problem is not unique to this controller: a human expert occasionally selects actions that do not move the robot. However, the expert remembers past actions and notices that the robot doesn’t move. Although the state is the same, they will choose the “next best” action.

Replanning over multiple steps is the responsibility of the higher levels. However, our controller has the ability to recover from a single-step “Robot Stuck” mistake by applying a heuristic that penalises actions that have been tried and failed. The decision tree is used to obtain the probability P(a|s) that each action a is the action that the instructor would have chosen in the current state, s. This probability is estimated from the training data that reaches the same leaf as the features extracted from the current state. The controller also records the positions that the robot has visited, using information from the position tracker, and the actions that were performed in each position. It computes a penalty for each action: Penalty(a) = p_f^(n_a), where n_a is the number of times action a has been attempted in this position and p_f is the penalty factor, which we set to 0.8. If an action has never been performed in the current position, Penalty(a) = 1.

For each action, we multiply the penalty by the probability that it is the action that the instructor would have chosen and take the action with the greatest product. As a result, if an action is very likely to be correct, the controller will persist with it for several control cycles, as it is possible for an action to fail randomly. In contrast, if several actions are equally desirable, on detecting a lack of motion the controller will tend to alternate between them. Naturally, if this condition persists, P(a|s)×Penalty(a) will eventually approach a constant for all a and the controller will tend to alternate between all possible actions. While this is similar to a human operator having no idea what to do and trying random actions, we regard this as a failure condition.
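This single-step recovery heuristic is sketched below: each action's class-probability estimate P(a|s) is multiplied by 0.8^(n_a), where n_a counts how often that action has already been tried at the current position, and the highest product wins. The penalty factor is from the paper; the way the "same position" is discretised and the class names are assumptions.

```python
from collections import defaultdict

PENALTY_FACTOR = 0.8  # p_f, as set in the paper

class StuckAwareSelector:
    def __init__(self):
        self.tried = defaultdict(int)     # (position key, action) -> attempts n_a

    def select(self, action_probs: dict, position) -> str:
        """action_probs: {action: P(a|s)} from the decision-tree leaf; position: (x, y)."""
        pos_key = tuple(round(v, 2) for v in position)   # cm-scale position bins (assumed)

        def score(a):
            return action_probs[a] * PENALTY_FACTOR ** self.tried[(pos_key, a)]

        best = max(action_probs, key=score)
        self.tried[(pos_key, best)] += 1
        return best
```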

III. EVALUATION

We tested several variants of the controller: it may be trained by a human or autonomous instructor, use the Planes and Deviations or the Block Patterns feature extractor, and may or may not use Robot Stuck detection, yielding eight permutations which were evaluated in Sheh [16]. All results are from evaluations on the actual task, in simulation or reality. The test terrains were randomly generated and were not used to generate the training data.

The learned controllers were compared against three baseline controllers: a naïve controller that always drives forward (“Always Forward”), a “blind” controller that uses a short-range Markov Decision Process (MDP) planner to plan a path assuming a flat floor (“Default MDP”), and the human expert operator (“Human Expert”). We also performed some experiments against an MDP-based planner that used a learned cost function based on the terrain map. As documented in Sheh [16], this cost function is extremely complex, as it must capture the response of the robot to the terrain, and had impractically long graph generation times. We also performed preliminary experiments using the A* search in simulation, over terrain reconstructed from the world model. However, as we discussed in Section II-C, this is very time consuming. Each node expansion in the search graph takes a second and the graph is re-generated at each step. In preliminary experiments, an A*-based controller took over an hour to evaluate the best action for a single step.

For this comparison, the human expert was instructed not to reverse long distances (greater than 2 control cycles) nor to perform random actions with the intent of freeing the robot. If such actions were necessary, the robot was deemed to be stuck. This was to prevent the human operator from inadvertently performing the role of the higher level behaviours. These are outside the scope of this controller, which was provided with a fixed higher level command.

We evaluated each controller over many different types of stepfields. In this paper we present the most interesting, the “Irregular” stepfields, based on the NIST “Orange” random stepfield specification and consisting of specific raised patterns, such as rails or pillars³. The areas around these patterns are filled in with blocks of random heights. These random blocks significantly affect the robot’s behaviour and must be taken into account for successful traversal. An example of an irregular stepfield is shown in Figure 1.

In Figure 4 we show the best performing representative of each of the eight permutations of controller, compared with the performance of the three baseline controllers. The evaluations were run in simulation over 500 runs of the stepfield dash, on randomly generated novel terrain. To avoid fatigue from contaminating the results, the human expert was only evaluated over 70 runs over the course of 2 weeks.

The learned controllers performed significantly better than the baseline controllers. While they did not perform quite as well as the human expert, the difference in performance was not significant, within the limitations of a practical number of trials of the human expert. This is encouraging as it suggests that Behavioural Cloning is able to extract meaningful rules from the human expert and apply them to novel states in this application. We discuss the interpretation of these rules in Sheh [16]. This also suggests that the autonomous instructor is a viable alternative to the human expert where it is possible to construct a simulation of the desired task. The eight variants of the controller have similar levels of performance even though some have more background information (in the form of more advanced feature detection and the addition of “Robot Stuck” detection) than others. This suggests that the machine learning is able to compensate for the relative lack of background information in the simpler controllers.

³It was found that truly random patterns were easier, as the resulting holes in the terrain tended to be smaller than the length of the wheels.

Fig. 5. Results comparing real and simulated runs over four different Irregular stepfields using a representative learned controller.

We also evaluated a representative controller on the real robot. Because of time requirements and wear on the robot, it was not possible to evaluate every variant of the controller on the real robot with a statistically significant number of trials. The aim of the experiments on the real robot was to verify that the simulation results are consistent with what we would expect from experiments on the real robot. We chose a controller that was trained on data from the autonomous instructor and used the Planes and Deviations feature extractor. Due to the large number of experimental runs required and the fragile state of Emu’s motors, we did not use “Robot Stuck” detection.

The controller was evaluated in simulation over several randomly generated irregular stepfields. Four stepfields were selected based on the simulated rate of success: one where the controller usually succeeded, one where it usually failed and two in between. We then evaluated the controller over these four terrains over 300 runs each in simulation. We reproduced the four terrains in real life and used the same controller to drive the real robot over the terrain. The aim was to determine whether the difference in performance between the four terrains observed in simulation would be reflected in reality. The results are presented in Figure 5. In total, the controller drove the real robot over ten runs of each of the four real terrains. After running as many experiments as were practical, the performance of the simulated robot appears to agree well with the performance of the real robot.

IV. CONCLUSION AND FUTURE WORK

Behavioural Cloning has been applied to many control applications ranging from industrial process control (Michie and Camacho [3]) to flying aircraft (Sammut et al. [23]) and controlling container cranes (Urbancic and Bratko [24]). In those applications, the state could be described by a set of well-defined variables that are relatively easy to obtain. In contrast, the state information in our application must be obtained by interpreting complex data from noisy sensors. The experiments described above show that behavioural cloning is a promising method for building control policies for an autonomous robot. However, there are still many alternatives to investigate for the state representation and the implementation of actions. We intend to look at continuous actions and actions that reconfigure the robot. For example, a multi-tracked vehicle has movable “flippers” that can be used to overcome obstacles or free itself from obstructions. The additional degrees of freedom increase the search space of the learning algorithm, requiring more training examples unless a smarter algorithm can be devised. On a more practical side, newer range imagers are less prone to motion blur than the one used in these experiments. These will allow us to drive the robot continuously, rather than using the stop-and-go method that was forced upon us by the limitations of the older equipment.

Fig. 4. Performance of the eight permutations of controller, compared to three baseline controllers, evaluated on the task of driving the simulated robot. The human expert was evaluated over 70 trials; all other controllers were evaluated over 500 trials.

REFERENCES

[1] C. Sammut, “Automatic construction of reactive control systems using symbolic machine learning,” Knowledge Engineering Review, vol. 11, pp. 27–42, 1996.

[2] D. Michie, “Learning concepts from data,” Expert Systems with Applications, vol. 15, no. 3-4, pp. 193–204, 1998.

[3] D. Michie and R. Camacho, “Building symbolic representations of intuitive real-time skills from performance data,” in Machine Intelligence 13: Machine Intelligence and Inductive Learning, pp. 385–418, 1995.

[4] M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine Intelligence 15. Oxford University Press, 1996, pp. 103–129.

[5] I. Bratko, T. Urbancic, and C. Sammut, “Behavioural cloning of control skill,” in Machine Learning and Data Mining: Methods and Applications, pp. 335–351, 1998.

[6] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.

[7] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-first International Conference on Machine Learning, 2004.

[8] B. Hamner, S. Scherer, and S. Singh, “Learning obstacle avoidance parameters from operator behaviour,” in Proceedings of the NIPS 2005 Workshop on Machine Learning Based Robotics in Unstructured Environments, 2005.

[9] D. Šuc and I. Bratko, “Modelling of control skill by qualitative constraints,” in Thirteenth International Workshop on Qualitative Reasoning, C. Price, Ed., 1999, pp. 212–220.

[10] D. C. Bentivegna, C. G. Atkeson, and G. Cheng, “Learning tasks from observation and practice,” Robotics and Autonomous Systems, vol. 47, pp. 163–169, 2004.

[11] A. Jacoff, A. Downs, A. Virts, and E. Messina, “Stepfield pallets: Repeatable terrain for evaluating robot mobility,” in Performance Metrics for Intelligent Systems Workshop, 2008.

[12] JMonkeyEngine Community, “JMonkeyEngine,” http://www.jmonkeyengine.com/, 2008.

[13] C. Pepper, S. Balakirsky, and C. Scrapper, “Robot simulation physics validation,” in Performance Metrics for Intelligent Systems Workshop, 2007.

[14] M. W. Kadous, C. Sammut, and R. Sheh, “Behavioural cloning for robots in unstructured environments,” in Advances in Neural Information Processing Systems Workshop, 2005.

[15] ——, “Autonomous traversal of rough terrain using behavioural cloning,” in Proceedings of the 3rd International Conference on Autonomous Robots and Agents, 2006.

[16] R. K. Sheh, “Learning robot behaviours by observing and envisaging,” Ph.D. dissertation, School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia, December 2010.

[17] C. Sammut, S. Hurst, D. Kedzier, and D. Michie, “Learning to fly,” in Proceedings of the Ninth International Conference on Machine Learning. Aberdeen: Morgan Kaufmann, 1992.

[18] D. Michie, M. Bain, and J. Hayes-Michie, “Cognitive models from subcognitive skills,” in Knowledge-based Systems in Industrial Control, M. Grimble, S. McGhee, and P. Mowforth, Eds. Peter Peregrinus, 1990.

[19] A. R. Green, “Path planning for vehicles in difficult unstructured environments,” Ph.D. dissertation, Australian Centre for Field Robotics, The University of Sydney, 2007.

[20] A. Y. Ng and M. Jordan, “PEGASUS: A policy search method for large MDPs and POMDPs,” in Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference, 2000.

[21] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.

[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explorations, vol. 11, no. 2, 2009.

[23] C. Sammut, S. Hurst, D. Kedzier, and D. Michie, “Learning to fly,” in Proceedings of the Ninth International Conference on Machine Learning, D. Sleeman and P. Edwards, Eds. Aberdeen: Morgan Kaufmann, 1992.

[24] T. Urbancic and I. Bratko, “Reconstructing human skill with machine learning,” in Proceedings of the Eleventh European Conference on Artificial Intelligence, A. Cohn, Ed. John Wiley and Sons, 1994.