Using Q-learning to Control Robot Goal Keeper Behaviour

Joscha-David Fossel

September 15, 2010

Abstract

In this paper we tackle the problem of robot goalkeeping on the Aldebaran Robotics' Nao. To this end we define the input and output available to the Nao. We then present two approaches to creating goal keeper behaviour control. The first one uses reinforcement learning (in particular Q-learning), the second one uses hard-coded rules. For the Q-learning based approach, we test two action policies, ε-Greedy and Softmax. The experiments reveal that Softmax is better suited for robot soccer goal keeping than ε-Greedy. When testing the Q-learning based approach against the hard-coded one, we were not able to discover an advantage of Q-learning in terms of performance.

1 Introduction

In August 2007, Aldebaran Robotics' Nao [6] was introduced as the robot used in the Robot Soccer World Cup (RoboCup) Standard Platform League, an international robotics competition. In the Standard Platform League, a designated player of every team (the goal keeper) has the duty of actively preventing the opponent from scoring a goal. Furthermore, as in many real-life sports, special rules apply to the goal keeper position in the case of RoboCup: the goal keeper is the only player allowed to stay within the penalty area of its own team and, furthermore, to touch the ball with its arms and hands while being in its penalty area [ref].

This paper presents two approaches, a learning and a non-learning one, to create robot behaviour control for operating the goal keeper. The learning approach discussed in this paper uses Q-learning [11] to evaluate which movement is appropriate depending on the position the Nao is in. The non-learning approach is to manually set the behavioural rules for the robot (hard-coded).

However, in order to employ either of those approaches, certain ways of perceiving (input) and acting (output) in the environment the agent is in are required. Input and output are discussed in Section 2. Afterwards, Section 3 presents the learning and Section 4 the non-learning approach. Subsequently, experiments and results are given in Section 5. Finally, this paper closes by drawing conclusions on the presented approaches to controlling goal keeper behaviour in Section 6.

2 Interaction between Nao and Environment

In this section the Nao's ways of interacting with its environment are specified, namely:

• Input: Perception, which enables the Nao to gather information about the environment.

• Output: Motions, which enable the Nao to change the environment.

2.1 Input

To determine the position of the ball, the image provided by the Nao's camera is scanned for a blob whose RGB components match those of the ball. If such a blob is found, the distance and angle between the Nao and the ball can be estimated depending on which pixels are occupied by the blob.
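As a concrete illustration of this colour-blob scan, the following sketch finds candidate ball pixels in an RGB image with NumPy; the thresholds and helper names are our own placeholders, and the calibration from blob position and size to distance and angle is omitted because the paper does not specify it.

    # Minimal sketch (not the paper's implementation) of colour-blob detection.
    import numpy as np

    def ball_pixel_mask(image, lower=(180, 60, 0), upper=(255, 160, 80)):
        """Boolean mask of pixels whose RGB values fall inside the assumed ball colour range."""
        lower, upper = np.array(lower), np.array(upper)
        return np.all((image >= lower) & (image <= upper), axis=-1)

    def estimate_ball_blob(image):
        """Return the blob centroid (row, col) and its pixel count, or None if no blob is found."""
        mask = ball_pixel_mask(image)
        if not mask.any():
            return None
        rows, cols = np.nonzero(mask)
        # Distance and angle would follow from the centroid and blob size via a
        # camera calibration, which is not detailed in the paper.
        return (rows.mean(), cols.mean()), int(mask.sum())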

Using distance and angle, the position of the ball relative to the goal keeper's position can be derived:

Xpos = ballDist ∗ cos(ballAngle)   (1)

Ypos = ballDist ∗ sin(ballAngle)   (2)

By taking the (X,Y)-position of the ball at two different points in time (P1(X1, Y1); P2(X2, Y2), see Figure 1), the direction and also the speed of the ball can be derived. Additionally, the direction of the ball can be used to calculate the (impact) point of intersection between the ball's trajectory and the ground line. From this, the distance of the point of impact to the goal keeper can be calculated, determining whether the ball will be within reach to save it.

ballSpeed = √(ΔX² + ΔY²)   (3)

a = ΔY / ΔX   (4)

impactPointY = a ∗ X2 + Y2   (5)
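The following minimal sketch implements Equations (1)–(5) literally; the function names are ours, and the sign in Equation (5) is kept exactly as printed, so it depends on the chosen coordinate frame and Δ convention.

    # Sketch of the ball-state estimation in Equations (1)-(5); names are ours.
    import math

    def ball_position(ball_dist, ball_angle):
        """Relative (X, Y) position of the ball, Equations (1) and (2)."""
        return ball_dist * math.cos(ball_angle), ball_dist * math.sin(ball_angle)

    def ball_speed_and_impact(p1, p2):
        """Speed and impact point from two observations P1 and P2, Equations (3)-(5)."""
        (x1, y1), (x2, y2) = p1, p2
        dx, dy = x2 - x1, y2 - y1                    # ΔX, ΔY between the observations
        speed = math.hypot(dx, dy)                   # Equation (3), per observation interval
        a = dy / dx if dx != 0 else float("inf")     # Equation (4), trajectory slope
        impact_y = a * x2 + y2                       # Equation (5), sign as printed in the paper
        return speed, impact_y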

Figure 1: Impact point calculation

2.2 Output

To enable the goal keeper to interact with the environment, it is provided with the following set of motions:

Strafe: Move sidewards without rotating.

Stand-still: Hold position.

Forward: Walk forwards.

Backward: Walk backwards.

Figure 2: Dive motion

Figure 3: Roll motion clearing the ball

Stand-up: Return to an erect position after diving.

Dive: Jump either to the left or to the right, resulting in the robot lying horizontally on the ground. See Figure 2.

Roll: Roll sidewards to remove the ball from the penalty box. Requires the goal keeper to lie on the ground. See Figure 3.

While this set of motions is quite basic, it allows the goal keeper to pursue its task of preventing the opponent from scoring a goal. Adding extra motions, e.g. omni-directional walking [8] or a sophisticated shooting motion, would most likely increase the performance, but these are so complex that they need to be treated separately.

3 Machine learning

This section introduces a way of enabling the goal keeper to learn which movements are appropriate depending on the position it is in. Options for implementing this are [7]:

• Supervised learning: Training data including both inputs and desired outputs is required. From that training data the learner learns a function that should allow it to generalize from the training data to unseen examples. An example of a supervised learning method is Artificial Neural Networks [5].

• Unsupervised learning: The learner is only provided with examples from the input space. From these it seeks to determine how the data is organized. Examples are data mining [2] or Kohonen maps [4].

• Reinforcement learning [1]: Seeking to maximize the reward, the learner explores an environment and receives rewards or penalties. However, it can be unknown to the learner for which action in particular it receives the reward or penalty. An example of this principle is chess, where the player knows the outcome of the game, but not which moves in particular were good or not.

We opt for reinforcement learning, because it does not require correct input/output pairs and has a focus on on-line performance.

To implement reinforcement learning, a Markov Decision Process representing the domain of robot soccer goalkeeping is presented in Subsection 3.1. In Subsection 3.2, Q-learning, the reinforcement learning technique we apply to the Markov Decision Process, is presented. Two action policies (ε-Greedy and Softmax) are given in Subsection 3.3.

3.1 Markov Decision Process

The task of robot soccer goalkeeping can be translated into a Markov Decision Process, i.e. it satisfies the Markov property [9]: the effects of an action taken in a state depend only on that state and not on the prior history.

The basic reinforcement learning model applied to Markov Decision Processes [1] consists of the following items:

• A set of possible world states S.

Figure 4: Playing field distributed into buckets

Figure 5: Robot soccer MDP

• A set of possible actions A.

• A real-valued reward function R(s).

If S, A, and R(s) are defined as follows, the Markov property is fulfilled in robot soccer:

The state s ∈ S of the world is defined by the position the agent is in (e.g. lying on the floor covering the left side of the goal), and by the position, speed and impact point of the ball (see Subsection 2.1). These continuous-valued properties are distributed into buckets (see Figure 4) in order to obtain a finite number of discrete states.
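To illustrate the discretisation, the sketch below maps the continuous ball properties and the keeper pose to one discrete state tuple; the bucket boundaries are made-up placeholders, since the paper only states that 12 buckets cover the playing field (Figure 4).

    # Illustrative state discretisation; all bucket edges are invented placeholders.
    def bucket(value, edges):
        """Index of the bucket a continuous value falls into, given sorted edge values."""
        for i, edge in enumerate(edges):
            if value < edge:
                return i
        return len(edges)

    def discretise_state(ball_x, ball_y, ball_speed, impact_y, keeper_pose):
        """Map continuous ball properties plus the keeper pose to one discrete state."""
        x_idx = bucket(ball_x, edges=[1.0, 2.0, 3.0])          # field depth
        y_idx = bucket(ball_y, edges=[-1.0, 0.0, 1.0])         # field width
        speed_idx = bucket(ball_speed, edges=[0.1, 0.5])       # slow / medium / fast
        impact_idx = bucket(impact_y, edges=[-0.7, 0.0, 0.7])  # left to right across the goal
        return (x_idx, y_idx, speed_idx, impact_idx, keeper_pose)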

The set of possible actions A the agent can perform is defined as the available motions discussed in Subsection 2.2. See also Figure 5.

The reward function R(s) is defined depending on the outcome of an attack. The rewards are set as follows:

• No goal, Nao lying on the ground: Reward 75.

• No goal, Nao in erect position: Reward 100.

• Goal: Penalty -10.

These reward/penalty values encode the information that preventing a goal by diving is less favorable (due to the limited mobility afterwards) than preventing a goal without diving. Obviously, not preventing a goal at all is the least favorable course of action, and therefore results in a penalty.
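A minimal encoding of this reward function might look as follows; the numeric values are the ones listed above, while the outcome labels are our own.

    # R(s) for the terminal state of an attack episode; outcome labels are ours.
    REWARDS = {
        "no_goal_keeper_on_ground": 75,   # goal prevented, but the Nao ended up lying down
        "no_goal_keeper_erect": 100,      # goal prevented while staying upright
        "goal": -10,                      # goal conceded
    }

    def reward(outcome):
        return REWARDS[outcome]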

3.2 Q-learning

In 1989, Watkins [11] introduced a reinforcement learning technique called Q-learning. It allows an agent to optimize its performance through a system of rewards and punishments without being supervised. By trial-and-error search the agent explores which actions are suitable depending on the position it is in. Q-learning enables the agent to optimize not only immediate, but also delayed rewards. To do so, Q-learning learns an action-value function that approximates the utility of taking an action in a certain position. This action-value function is learned by exploring the state space following a certain action policy (defined in Subsection 3.3).

We apply Q-learning to the Markov Decision Process modeling robot soccer defined in Subsection 3.1.

To implement Q-learning we introduce the following two matrices, R and Q:

• The environment reward matrix R stores state-dependent rewards/penalties. If, for example, the ball enters the goal, a penalty of -10 is applied. See R(s) in Subsection 3.1.

• Matrix Q, initialized as a zero matrix, represents what the agent has learned so far. It consists of the states the agent can be in on one axis, and the actions it can perform on the other.

In the beginning the agent assumes that all actions are equally good (the Q matrix is initialized as a zero matrix; the agent has no experience about which actions will yield reward). Therefore the agent selects random actions and updates the Q matrix depending on the reward (or punishment) that results from its actions (see Formula 6). This procedure is repeated until the agent encounters a state where the Q matrix is not zero. In that case it may choose not to perform a random action, but to act according to its action policy (see Subsection 3.3). Continuing this, the agent learns which actions lead to a maximization of rewards, so that it can act appropriately in every position.

Algorithm 1 Q-learning Algorithm

    Set parameters γ, α, and the reward matrix R
    Initialize matrix Q as a zero matrix
    For each episode:
        Determine the initial state s
        Do while the goal state is not reached:
            Select action a according to the action policy
            Take action a, observe r and s′
            Compute the new Q(s, a) with Formula (6)
            s ← s′
        End Do
    End For

The associated algorithm is shown as Algorithm 1, where γ is the so-called discount factor with 0 ≤ γ < 1. A γ near 0 results in the agent considering only immediate rewards, while a γ towards 1 makes it weight future rewards more heavily. Parameter α is the learning rate, which determines the importance of newly acquired information. A value of 0 for α prevents the agent from learning anything, while a value of 1 makes the agent consider only the newest information. The transition rule [9] that approximates the Q-values is shown as Formula 6.

E(s, a) = γ · max_a′ Q(s′, a′) − Q(s, a)

Q(s, a) = Q(s, a) + α [R(s, a) + E(s, a)]   (6)
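A tabular sketch of the update in Formula (6) might look as follows; the dictionary-based Q table, the function name q_update, and the default parameter values are our own choices rather than the paper's code.

    # Minimal sketch of the update in Formula (6) over a tabular Q function.
    # The Q table is a dict keyed by (state, action) pairs; names are ours.
    from collections import defaultdict

    Q = defaultdict(float)  # zero-initialised Q matrix

    def q_update(Q, state, action, r, next_state, actions, gamma=0.5, alpha=0.5):
        """Formula (6): Q(s,a) <- Q(s,a) + alpha * (R(s,a) + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q[(next_state, a)] for a in actions)
        error = gamma * best_next - Q[(state, action)]   # E(s, a)
        Q[(state, action)] += alpha * (r + error)        # Formula (6)
        return Q[(state, action)]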

3.3 Action Policies

The process of selecting one action among all possible actions needs to be elucidated. So-called action policies define which action is to be chosen in any position encountered. These action policies define when to explore and when to exploit, i.e. when not to select the action that is so far known to give the highest reward, and instead explore whether there might be a strategy that yields even more reward.

Two common action policies are described below; a code sketch of both follows the list.

• ε-Greedy [9]: Most of the time the action with the highest reward known so far is selected (greedy), but with a certain probability ε a uniformly distributed random action is chosen instead, in order to provide the possibility of exploration. Given infinitely many runs, the optimal policy will therefore be discovered, since all possible actions are explored.

• Softmax [9]: The drawback of ε-Greedy is that it explores evenly among all actions; therefore the worst action has the same likelihood of being selected as the second best. When dealing with a problem where the worst possible action is very bad, this might be unwise. When exploring, the Softmax policy differs from ε-Greedy in that it considers the actions' estimated values and weights them accordingly, linking the probability of an action being chosen for exploration to its value estimate:

W(s, a) = e^(Q(s,a)/r) / Σ_{b=1..n} e^(Q(s,b)/r)   (7)

The positive parameter r is the temperature, which determines how sensitive the weights are to differences in the estimated values. A high temperature makes the weights nearly equal regardless of the value estimates, while a temperature approaching zero makes the selection increasingly greedy.
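Both policies can be sketched over the same tabular Q function introduced in Subsection 3.2; the helper names are ours, while ε and the temperature r follow the notation above.

    # Sketch of the two action policies for a dict-based Q table (see Section 3.2).
    import math
    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def softmax_policy(Q, state, actions, r=0.5):
        """Sample an action with probability proportional to exp(Q(s,a)/r), Equation (7)."""
        weights = [math.exp(Q[(state, a)] / r) for a in actions]
        return random.choices(actions, weights=weights)[0]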

4 Hard-coded Behaviour

As an alternative to the Q-learning based approach introduced in the previous section, this section introduces a rather simple way of controlling the goal keeper: manually defining how to behave in certain situations, deduced from what seems reasonable from a human point of view.

This approach is implemented by defining rules that cover all relevant situations.

1. First it is determined whether the goal keeper needs to act at all. For this, the estimated impact point is taken into account. If the impact point lies outside of the goal plus a certain safety threshold (to absorb inaccuracy in perception), the goal keeper obviously does not have to engage, and step 1 is repeated. Otherwise step 2 is executed.

2. If, on the other hand, the impact point is inside the goal area, the distance and speed of the ball are used to decide whether the ball will enter the penalty box or not. In order to determine whether the ball is fast enough to reach the goal instead of stopping in front of it, a threshold can be defined, i.e. if the ball exceeds a certain velocity v at a certain distance d, the goal keeper is required to act; if not, the ball will stop before reaching the goal. Identifying v and d can be accomplished by trial and error. An alternative is using the second derivative to determine whether the ball is fast enough; however, we were unable to obtain good results with that approach. If the ball is predicted to enter the penalty box, the goal keeper either dives to the left or right according to the impact point prediction, or holds its position in case the ball is estimated to hit the centre of the goal. Afterwards step 3 is executed. If the ball is predicted not to enter the penalty box, step 1 is repeated.

3. The goal keeper holds its current position (i.e. either lying on the ground or standing in the centre of the goal) until the ball has entered the penalty box. If that happens within 5 seconds, step 4 is executed. If 5 seconds pass and the ball does not enter the penalty box, step 5 is executed.

4. The goal keeper attempts to clear the ball by either walking towards the ball and kicking it (if in an erect position), or by rolling towards the ball, both resulting in the ball being removed from the penalty area. Then it proceeds to step 5.

5. If necessary the goal keeper starts the stand-up routine, moves back to the initial position and goes to step 1 again.

Figure 6 shows the flowchart of this procedure; a code sketch of the same rules is given below.

Figure 6: Flowchart of hard-coded behaviour
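The sketch below compresses these five rules into a single per-step decision function; the threshold values, the attribute names on the assumed ball and keeper objects, and the omission of the 5-second timeout in rule 3 are simplifications of ours, not the values tuned for the robot.

    # Rough sketch of the hard-coded rules. All thresholds and attribute names
    # are placeholders; the returned strings stand for the motions of Subsection 2.2.
    GOAL_HALF_WIDTH = 0.7   # half of the goal width, measured from the centre
    SAFETY_MARGIN = 0.1     # absorbs perception inaccuracy (rule 1)
    V_THRESHOLD = 0.3       # minimum ball speed v at distance d that requires acting (rule 2)
    D_THRESHOLD = 1.5
    CENTRE_WIDTH = 0.2      # impact points this close to the centre are handled by holding position

    def keeper_step(ball, keeper):
        # Rules 3-4: once the ball is inside the penalty box, clear it (roll or kick).
        if ball.in_penalty_box:
            return "roll" if keeper.on_ground else "kick"
        # Rule 5: stand up and return to the initial position once the attack is over.
        if keeper.on_ground and not ball.incoming:
            return "stand_up"
        # Rule 1: ignore shots whose impact point misses the goal plus a safety margin.
        if abs(ball.impact_y) > GOAL_HALF_WIDTH + SAFETY_MARGIN:
            return "stand_still"
        # Rule 2: act only if the ball is fast enough to actually reach the goal.
        if ball.distance < D_THRESHOLD and ball.speed > V_THRESHOLD:
            if abs(ball.impact_y) < CENTRE_WIDTH:
                return "stand_still"   # shot aimed at the centre: hold position
            # The left/right sign convention here is arbitrary.
            return "dive_left" if ball.impact_y > 0 else "dive_right"
        return "stand_still"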

5 Experiments

In this section we try to answer the following two questions:

1. Which of the previously introduced action policies, namely ε-Greedy and Softmax (see Subsection 3.3), is better suited to solving the problem of robot goalkeeping?

2. How does the Q-learning based approach per-form compared to the hard-coded approach?

In order to answer these questions, we conduct a series of experiments described in Subsection 5.1. In Subsection 5.2 the results of these experiments are shown and discussed.

5.1 Set-up

All experiments are conducted on the Cyberbotics Webots simulator [12] platform, for several reasons: it prevents hardware damage; it allows camera noise to be bypassed by using both the real position values provided by the simulator and the noisy data provided by the simulated camera; and, most importantly, it enables a high number of unsupervised test runs that would require extensive time if done in reality.

In order to compare ε-Greedy and Softmax, the agent is initialized with an empty Q-matrix for both action policies. Then 100 random shots towards, but not necessarily on, the goal are performed (each attack is one episode). For both action policies the cumulated Q-values and the number of errors (goals) are observed. The agent is trained with the following parameters, which were set intuitively (a sketch wiring them into a training loop follows the list):

• γ = 0.5: the agent considers both immediate and future rewards.

• α = 0.5: the agent weights old and newly acquired information evenly.

• ε = 0.1: the agent explores instead of exploiting with a probability of 10%.

• r = 0.5: under Softmax, actions with a high estimated value are more likely to be chosen.
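As a usage example, one training episode with these parameters could be wired together as follows, reusing the q_update, epsilon_greedy and softmax_policy sketches from Sections 3.2 and 3.3; the simulator interface (reset/step) and the action labels are hypothetical stand-ins for the Webots setup.

    # One training episode (one random shot) with the parameters listed above.
    GAMMA, ALPHA, EPSILON, TEMPERATURE = 0.5, 0.5, 0.1, 0.5
    ACTIONS = ["strafe_left", "strafe_right", "stand_still", "forward",
               "backward", "dive_left", "dive_right", "roll", "stand_up"]

    def run_episode(simulator, Q, use_softmax=True):
        """Play one attack episode and update Q after every step."""
        state = simulator.reset()                 # random shot towards the goal
        done = False
        while not done:
            if use_softmax:
                action = softmax_policy(Q, state, ACTIONS, r=TEMPERATURE)
            else:
                action = epsilon_greedy(Q, state, ACTIONS, epsilon=EPSILON)
            next_state, r, done = simulator.step(action)
            q_update(Q, state, action, r, next_state, ACTIONS,
                     gamma=GAMMA, alpha=ALPHA)
            state = next_state
        return Q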

To compare the Q-learning based with the hard-coded approach, the agent is initially trained using the Softmax policy (same parameters as in the previous experiment, 1000 episodes). Afterwards 400 evaluation episodes are conducted. The hard-coded approach is also tested by carrying out 400 evaluation episodes, and afterwards the numbers of errors are compared. These two experiments are conducted both with the Nao camera and with the real positions provided by the Webots simulator, for training the agent as well as when running the tests.

Figure 7: Number of committed errors (goals), ε-Greedy vs. Softmax

In order to ensure comparability, Common Random Numbers [3] are used, so that the same random shots are performed.

5.2 Results & Discussion

5.2.1 ε-Greedy vs. Softmax

Figure 7 shows the number of errors the agent committed on one axis and the number of episodes on the other. When following the ε-Greedy policy the agent commits more errors than when using Softmax.

This is a result of the possibility of making a 'fatal error' in the domain of robot soccer. If, for example, the ball is aimed at the left corner of the goal and is still at a sufficient distance, both waiting for the ball to come closer and then acting, as well as diving to the left corner, are suitable. However, if the agent dives right, there is nothing it can do anymore to gain a reward in that episode. As it is more likely to commit such a 'fatal error' using the ε-Greedy action policy, Softmax performs better.

Figure 8 shows the cumulated Q-values for both ε-Greedy and Softmax. The graph shows that under the Softmax action policy higher cumulated Q-values are achieved. Therefore we may conclude that when using Softmax, ways to prevent a goal are discovered more quickly.

Figure 8: Cumulated Q-values, ε-Greedy vs. Softmax

5.2.2 Q-learning based vs. Hard-coded Approach

As can be seen in Figure 9, the Q-learning based approach does not perform better than the hard-coded approach. When trained with and using the real position values as input (Q-real), both perform equally well.

There are several reasons for this. For one, while Q-learning has been proven to converge [10] to the optimal solution, the discretisation of the originally continuous values that define the state the agent is in leads to a loss of precision. To compensate, a larger number of states could be introduced, though that would in turn increase the required training time because of the additional states. In this implementation we used a rather low number of buckets (12 buckets for the playing field, Figure 4). This might be a reason for the poor performance of the Q-learning based approach, as it may prevent it from handling camera noise sufficiently. The experiment shows that the Q-learning based approach is more affected by camera noise than the hard-coded approach. In contrast, the hard-coded behaviour was manually adjusted to reduce the number of flawed actions resulting from imprecise sensor readings. It is, for example, programmed to dive even when the shot is (only due to imprecise sensor readings) calculated to slightly miss the goal, while Q-learning has no way of determining whether a shot is predicted to slightly or clearly miss the goal (due to the small number of buckets).

Figure 9: Q-learning based vs. Hard-coded Approach

6 Conclusions & Future Research

In this paper we proposed a Q-learning based and a hard-coded approach to solving the problem of robot soccer goal keeping. We furthermore tested two action policies for the Q-learning based approach. The experiments show that Softmax is an adequate action policy in this domain. The experiments furthermore show that our Q-learning based approach does not outperform our hard-coded, rule-based approach. However, the reason for that might be our implementation, which divides the continuous states into only a small number of buckets. Also, a learning based approach has the advantage that it is able to adapt to changes, while the hard-coded approach does not. If, for example, the basic image recognition is improved, the hard-coded approach would need to be changed. In contrast, we expect the Q-learning based approach to be able to cope with new image recognition easily.

For future research we suggest improving the image recognition, and also integrating more information (for example about friendly and/or enemy agents) into the input. This increases the complexity of the problem, and hard-coded behaviour might become infeasible, necessitating another form of controlling the agent, such as Q-learning. Likewise, the number of buckets should be increased.

We presented a basic framework with decent performance that offers possibilities for future improvements. Therefore we may conclude that Q-learning is a suitable approach to robot soccer goal keeping.

References

[1] Kaelbling, Leslie Pack, Littman, Michael, and Moore, Andrew (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, Vol. 4, pp. 237–285.

[2] Kantardzic, Mehmed (2002). Data Mining: Concepts, Models, Methods and Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

[3] Kleijnen, Jack P. C. (1975). Antithetic variates, common random numbers and optimal computer time allocation in simulation. Management Science, Vol. 21, No. 10, pp. 1176–1185.

[4] Kohonen, T., Schroeder, M. R., and Huang, T. S. (eds.) (2001). Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[5] Lawrence, Jeannette (1993). Introduction to Neural Networks. California Scientific Software, Nevada City, CA, USA.

[6] Aldebaran Robotics. Nao academics datasheet.

[7] Sewell, Martin (2007). Machine learning.

[8] Strom, Johannes, Slavov, George, and Chown, Eric. Omnidirectional walking using ZMP and preview control for the Nao humanoid robot.

[9] Sutton, Richard S. and Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press.

[10] Watkins, Christopher J.C.H. and Dayan, Peter (1992). Technical note: Q-learning. Machine Learning.


[11] Watkins, Christopher J.C.H. (1989). Learning from delayed rewards. Ph.D. thesis, University of Cambridge, Psychology Department.

[12] www.cyberbotics.com (2010). Webots refer-ence manual.
