
Heuristic Q-Learning Soccer Players: a new Reinforcement Learning approach to RoboCup Simulation

Luiz A. Celiberto Jr. 1,2, Jackson Matsuura 2, and Reinaldo A. C. Bianchi 1

1 Centro Universitário da FEI, Av. Humberto de Alencar Castelo Branco, 3972. 09850-901 – São Bernardo do Campo – SP, Brazil.

2 Instituto Tecnológico de Aeronáutica, Praça Mal. Eduardo Gomes, 50. 12228-900 – São José dos Campos – SP, Brazil.

[email protected], [email protected], [email protected]

Abstract. This paper describes the design and implementation of a 4-player RoboCup Simulation 2D team, which was built by adding Heuristically Accelerated Reinforcement Learning capabilities to basic players of the well-known UvA Trilearn team. The implemented agents learn by using a recently proposed Heuristic Reinforcement Learning algorithm, the Heuristically Accelerated Q–Learning (HAQL), which allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q–Learning. A set of empirical evaluations was conducted in the RoboCup 2D Simulator, and the experimental results obtained while playing against other teams show that the approach adopted here is very promising.

Keywords: Reinforcement Learning, Cognitive Robotics, RoboCup Simulation 2D.

1 Introduction

Modern Artificial Intelligence textbooks such as AIMA [11] introduced a unified picture of the field, proposing that the typical problems of AI should be approached by multiple techniques, with different methods being appropriate depending on the nature of the task. This is the result of the belief that AI must not be seen as a segmented domain. Following this tendency, the application domains probed by the field have also changed: chess-playing programs that are better than a human champion, a traditional AI domain, are now a reality, and new domains have become a necessity.

The RoboCup Robotic Soccer Cup domain was proposed by several researchers [6] in order to provide a new long-term challenge for Artificial Intelligence research. The development of soccer teams involves a wide range of technologies, including: design of autonomous agents, multiagent collaboration, strategy definition and acquisition, real-time reasoning, robotics, and sensor fusion.

Soccer games between robots constitute real experimentation and testing activities for the development of intelligent, autonomous robots that cooperate with each other to achieve a goal. This domain has become highly relevant in Artificial Intelligence since it possesses several characteristics found in other complex real-world problems; examples of such problems are robotic automation systems, which can be seen as a group of robots performing an assembly task, and space missions with multiple robots, to mention but a few.

The goal of the RoboCup Simulation League is to provide an environment where teams can be created in order to compete against each other in a simulated soccer championship. Since the simulator provides the entire environment of the players, teams have only one task: to develop the strategies of their players.

However, this task is not trivial. To solve this problem, several researchers have been applying Reinforcement Learning (RL) techniques, which have been attracting a great deal of attention in the context of multiagent robotic systems. The reasons frequently cited for such attractiveness are the existence of strong theoretical guarantees on convergence [14], ease of use, and model-free learning of adequate control strategies. Besides that, RL techniques have also been successfully applied to solve a wide variety of control and planning problems.

One of the main problems with RL algorithms is that they typically suffer from very slow learning rates, requiring a huge number of iterations to converge to a good solution. This problem becomes worse in tasks with high-dimensional or continuous state spaces and when the system is given sparse rewards. One of the reasons for the slow learning rates is that most RL algorithms assume that neither an analytical model nor a sampling model of the problem is available a priori. However, in some cases, there is domain knowledge that could be used to speed up the learning process: “Without an environment model or additional guidance from the programmer, the agent may literally have to keep falling off the edge of a cliff in order to learn that this is bad behavior” [4].

As a way to add domain knowledge to the solution of the RL problem, a recently proposed Heuristic Reinforcement Learning algorithm – the Heuristically Accelerated Q–Learning (HAQL) [1] – uses a heuristic function that influences the choice of actions to speed up the well-known RL algorithm Q–Learning. This paper investigates the use of HAQL to speed up the learning process of teams of mobile autonomous robotic agents acting in a concurrent multiagent environment, the RoboCup 2D Simulator. The paper is organized as follows: Section 2 describes the Q–learning algorithm. Section 3 describes HAQL and its formalization using a heuristic function. Section 4 describes the robotic soccer domain used in the experiments, presents the experiments performed, and shows the results obtained. Finally, Section 5 summarizes some important points learned from this research and outlines future work.


2 Reinforcement Learning and the Q–learning algorithm

Reinforcement Learning is the area of Machine Learning [9] that is concerned with an autonomous agent interacting with its environment via perception and action. On each interaction step the agent senses the current state s of the environment and chooses an action a to perform. The action a alters the state s of the environment, and a scalar reinforcement signal r (a reward or penalty) is provided to the agent to indicate the desirability of the resulting state. In this way, “the RL problem is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal” [13].

Formally, the RL problem can be formulated as a discrete-time, finite-state, finite-action Markov Decision Process (MDP) [9]. Given:

– A finite set of possible actions a ∈ A that the agent can perform;
– A finite set of states s ∈ S that the agent can achieve;
– A state transition function T : S × A → Π(S), where Π(S) is a probability distribution over S;
– A finite set of bounded reinforcements (payoffs) R : S × A → ℜ;

the task of an RL agent is to find a stationary policy of actions π∗ : S → A that maps the current state s into an optimal action a to be performed in s, maximizing the expected long-term sum of values of the reinforcement signal, from any starting state.

In this way, the policy π is some function that tells the agent which actions should be chosen, under which circumstances [8]. In RL, the policy π should be learned through trial-and-error interactions of the agent with its environment, that is, the RL learner must explicitly explore its environment.

The Q–learning algorithm was proposed by Watkins [15] as a strategy to learn an optimal policy π∗ when the model (T and R) is not known in advance. Let Q∗(s, a) be the reward received upon performing action a in state s, plus the discounted value of following the optimal policy thereafter:

$$Q^{*}(s, a) \equiv R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{*}(s'). \tag{1}$$

The optimal policy π∗ is π∗ ≡ arg max_a Q∗(s, a). Rewriting Q∗(s, a) in a recursive form:

$$Q^{*}(s, a) \equiv R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^{*}(s', a'). \tag{2}$$

Let Q be the learner’s estimate of Q∗(s, a). The Q–learning algorithm iteratively approximates Q, i.e., the Q values will converge with probability 1 to Q∗, provided the system can be modeled as an MDP, the reward function is bounded (∃c ∈ R; (∀s, a), |R(s, a)| < c), and actions are chosen so that every state–action pair is visited an infinite number of times. The Q–learning update rule is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{3}$$


where s is the current state; a is the action performed in s; r is the reward received; s′ is the new state; γ is the discount factor (0 ≤ γ < 1); and α = 1/(1 + visits(s, a)) is the learning rate, where visits(s, a) is the total number of times this state–action pair has been visited up to and including the current iteration.

An interesting property of Q–learning is that, although the exploration–exploitation tradeoff must be addressed, the Q values will converge to Q∗ independently of the exploration strategy employed (provided all state–action pairs are visited often enough) [9].
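To make the update rule concrete, the sketch below applies equation (3) with the visit-count learning rate described above. It is a minimal illustration, not the team's implementation: the state and action ids, the map-based table and the six-action assumption are placeholders.

// Minimal sketch of the tabular Q-learning update in equation (3), using the
// visit-count learning rate alpha = 1/(1 + visits(s, a)) defined above.
// State/action ids and the table layout are illustrative placeholders.
#include <algorithm>
#include <map>
#include <utility>

using StateAction = std::pair<int, int>;   // (state id, action id)

struct QLearner {
    std::map<StateAction, double> Q;       // Q(s, a); absent entries read as 0
    std::map<StateAction, int> visits;     // visits(s, a)
    double gamma = 0.9;                    // discount factor used in the experiments
    int nActions = 6;                      // e.g. the six actions of each soccer agent

    double maxQ(int s) {
        double best = Q[{s, 0}];
        for (int a = 1; a < nActions; ++a) best = std::max(best, Q[{s, a}]);
        return best;
    }

    // One application of the update rule (3).
    void update(int s, int a, double r, int sNext) {
        int v = ++visits[{s, a}];          // counts the current visit, as in the text
        double alpha = 1.0 / (1.0 + v);
        double& q = Q[{s, a}];
        q += alpha * (r + gamma * maxQ(sNext) - q);
    }
};

In HAQL the same update is kept unchanged; only the action selection, described next, is modified.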

3 The Heuristically Accelerated Q–Learning Algorithm

The Heuristically Accelerated Q–Learning algorithm [1] was proposed as a way of solving the RL problem which makes explicit use of a heuristic function H : S × A → ℜ to influence the choice of actions during the learning process. Ht(st, at) defines the heuristic, which indicates the importance of performing the action at when in state st.

The heuristic function is strongly associated with the policy: every heuristic indicates that an action must be taken regardless of the others. In this way, it can be said that the heuristic function defines a “heuristic policy”, that is, a tentative policy used to accelerate the learning process. It appears in the context of this paper as a way to use knowledge about the policy of an agent to accelerate the learning process. This knowledge can be derived directly from the domain (prior knowledge) or from existing clues in the learning process itself.

The heuristic function is used only in the action choice rule, which defines which action at must be executed when the agent is in state st. The action choice rule used in HAQL is a modification of the standard ε-greedy rule used in Q–learning, but with the heuristic function included:

$$\pi(s_t) = \begin{cases} \arg\max_{a_t} \left[ Q(s_t, a_t) + \xi H_t(s_t, a_t) \right] & \text{if } q \leq p, \\ a_{\mathrm{random}} & \text{otherwise,} \end{cases} \tag{4}$$

where:

– H : S × A → ℜ is the heuristic function, which influences the action choice. The subscript t indicates that it can be non-stationary.
– ξ is a real variable used to weight the influence of the heuristic function.
– q is a random value with uniform probability in [0, 1] and p (0 ≤ p ≤ 1) is the parameter which defines the exploration/exploitation trade-off: the greater the value of p, the smaller the probability of a random choice.
– arandom is a random action selected among the possible actions in state st.
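The action choice rule (4) can be sketched in a few lines of code. The sketch below is an illustration under assumed names (the chooseAction function and the per-state vectors of Q and H values); it is not taken from the team's source code.

// Sketch of the HAQL action choice rule (equation 4): with probability p the
// agent is greedy with respect to Q(s, a) + xi * H(s, a); otherwise it picks a
// random action. All names are illustrative.
#include <random>
#include <vector>

int chooseAction(const std::vector<double>& Q,   // Q(s, a) for each action in state s
                 const std::vector<double>& H,   // H(s, a) for each action in state s
                 double xi,                      // weight of the heuristic
                 double p,                       // exploitation probability
                 std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    int nActions = static_cast<int>(Q.size());
    if (unif(rng) <= p) {
        int best = 0;
        for (int a = 1; a < nActions; ++a)
            if (Q[a] + xi * H[a] > Q[best] + xi * H[best]) best = a;
        return best;
    }
    std::uniform_int_distribution<int> pick(0, nActions - 1);
    return pick(rng);                            // a_random
}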

As a general rule, the value of the heuristic Ht(st, at) used in HAQL must be higher than the variation among the Q(st, a) values for the same st ∈ S, so that it can influence the choice of actions, and it must be as low as possible in order to minimize the error. It can be defined as:

$$H(s_t, a_t) = \begin{cases} \max_{a} Q(s_t, a) - Q(s_t, a_t) + \eta & \text{if } a_t = \pi^{H}(s_t), \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$


where η is a small real value and πH(st) is the action suggested by the heuristic. For instance, suppose the agent can execute 4 different actions when in state st, the values of Q(st, a) for the actions are [1.0 1.1 1.2 0.9], and the action suggested by the heuristic is the first one. If η = 0.01, the values to be used are H(st, 1) = 0.21, and zero for the other actions.
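A small sketch of equation (5), reproducing the worked example above (the function name and the 0-based action indexing are illustrative):

// Heuristic values as in equation (5): the action suggested by the heuristic
// policy receives max_a Q(s, a) - Q(s, a_H) + eta; all other actions receive 0.
#include <algorithm>
#include <vector>

std::vector<double> heuristicValues(const std::vector<double>& Q,  // Q(s, a) in state s
                                    int suggested,                 // action given by the heuristic policy
                                    double eta) {
    std::vector<double> H(Q.size(), 0.0);
    double maxQ = *std::max_element(Q.begin(), Q.end());
    H[suggested] = maxQ - Q[suggested] + eta;
    return H;
}

// Example from the text: Q = {1.0, 1.1, 1.2, 0.9}, suggested = 0 (the first
// action) and eta = 0.01 give H = {0.21, 0.0, 0.0, 0.0}.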

As the heuristic is used only in the choice of the action to be taken, the proposed algorithm differs from the original Q–learning only in the way exploration is carried out. Since the operation of the RL algorithm is not modified (i.e., updates of the function Q are as in Q–learning), this proposal allows many of the conclusions obtained for Q–learning to remain valid for HAQL.

The use of a heuristic function by HAQL exploits an important characteristic of some RL algorithms: the free choice of training actions. The consequence is that a suitable heuristic speeds up the learning process, while an unsuitable heuristic merely causes a delay that does not stop the system from converging to an optimal value.

The idea of using heuristics with a learning algorithm has already been considered by other authors, as in the Ant Colony Optimization presented in [2] or the use of initial Q-values [7]. However, the possibilities of this use have not yet been properly explored. The complete HAQL algorithm is presented in Table 1. It can be noticed that the only differences from the Q–learning algorithm are the action choice rule and the existence of a step for updating the function Ht(st, at).

Table 1. The HAQL algorithm.

Initialize Q(s, a).
Repeat:
    Visit state s.
    Select an action a using the action choice rule (equation 4).
    Receive the reinforcement r(s, a) and observe the next state s′.
    Update the values of Ht(s, a).
    Update the values of Q(s, a) according to:
        Q(s, a) ← Q(s, a) + α [r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)].
    Update the state: s ← s′.
Until some stopping criterion is reached.

where: s = st, s′ = st+1, a = at and a′ = at+1.

4 Experiment in the RoboCup 2D Simulation domain

One experiment was carried out using the RoboCup 2D Soccer Server [10]: the implementation of a four-player team, with a goalkeeper, a first defender (fullback) and two forward players (strikers), that has to learn how to maximize the number of goals it scores while minimizing the number of goals scored by the opponent. In this experiment, the implemented team has to learn while playing against a team composed of one goalkeeper, one defender and two striker agents from the UvA Trilearn 2001 team [3].

The Soccer Server is a system that enables autonomous agent programs to play a match of soccer against each other. A match is carried out in a client/server style: a server provides a virtual field and calculates the movements of the players and of the ball, according to several rules. Each client is an autonomous agent that controls the movements of one player. Communication between the server and each client is done via TCP/IP sockets. The Soccer Server system works in real time with discrete time intervals, called cycles. Usually, each cycle takes 100 milliseconds, and at the end of each cycle the server executes the actions requested by the clients and updates the state of the world.

The state space of the defending agents is composed of the agent’s position in a discrete grid with N × M positions it can occupy, the position of the ball in the same grid, and the direction the agent is facing. This grid is different for the goalkeeper and the defender: each agent has a different area where it can move, which it cannot leave. These grids, shown in figure 1, are partially overlapping, allowing both agents to work together in some situations. The state space of the attacking agents is composed of the agent’s position in a discrete grid (shown in figure 2), the direction the agent is facing, the distance between the agents, and the distance to the ball. The direction the agent can be facing is also discrete, reduced to four values: north, south, east or west.
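One simple way to turn such a discretized description into a single Q-table index is sketched below; the struct, the field names and the four-valued facing direction for the defender are illustrative assumptions, since the paper does not give the grid dimensions N and M.

// Illustrative packing of a defender's discrete state (agent cell, ball cell,
// facing direction) into one integer index for the Q table.
struct DefenderState {
    int agentCell;   // 0 .. (N*M)-1, cell occupied by the agent in its grid
    int ballCell;    // 0 .. (N*M)-1, cell occupied by the ball in the same grid
    int facing;      // 0 = north, 1 = south, 2 = east, 3 = west (assumed)
};

int encodeState(const DefenderState& s, int cells /* = N*M */) {
    return (s.agentCell * cells + s.ballCell) * 4 + s.facing;
}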

The defender can execute six actions: turnBodyToObject, which keeps the agent at the same position but always facing the ball; interceptBall, which moves the agent in the direction of the ball; driveBallForward, which allows the agent to move with the ball; directPass, which executes a pass to the goalkeeper; kickBall, which kicks the ball away from the goal; and markOpponent, which moves the defender close to one of the opponents.

The goalkeeper can also perform six actions: turnBodyToObject, interceptBall, driveBallForward and kickBall, which are the same actions that the defender can execute, plus two specific actions: holdBall, which holds the ball, and moveToDefensePosition, which moves the agent to a position between the ball and the goal. Finally, the strikers can execute six actions: turnBodyToObject, interceptBall and markOpponent, which are the same as described for the defender; kickBall, which kicks the ball in the direction of the goal; dribble, which allows the agent to dribble with the ball; and holdBall, which holds the ball.

All these actions are implemented using pre-defined C++ methods defined in the BasicPlayer class of the UvA Trilearn 2001 team. “The BasicPlayer class contains all the necessary information for performing the agents individual skills such as intercepting the ball or kicking the ball to a desired position on the field” [3, p. 50].
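For reference, the three role-specific action sets described above can be organized as simple enumerations, as in the sketch below; these are illustrative identifiers for the learning agents' discrete action spaces, not the actual BasicPlayer interface.

// Illustrative enumeration of the six discrete actions available to each role.
enum class DefenderAction { TurnBodyToObject, InterceptBall, DriveBallForward,
                            DirectPass, KickBall, MarkOpponent };
enum class GoalieAction   { TurnBodyToObject, InterceptBall, DriveBallForward,
                            KickBall, HoldBall, MoveToDefensePosition };
enum class StrikerAction  { TurnBodyToObject, InterceptBall, MarkOpponent,
                            KickBall, Dribble, HoldBall };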

Fig. 1. Discrete grids that compose the state space of the goalkeeper (left) and the defender (right).

Fig. 2. Discrete grid that composes the state space of the strikers.

The reinforcements given to the agents were inspired by the definitions of rewards presented in [5], and are different for each agent. For the goalkeeper, the rewards are: ball caught, kicked or driven by the goalie = 100; ball with any opponent player = -50; goal scored by the opponent = -100. For the defender, the rewards are: ball with any opponent player = -100; goal scored by the opponent = -100. For the strikers, the rewards are: ball with any opponent player = -50; goal scored by the agent = 100; goal scored by the agent’s teammate = 50.
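The reward signals above can be written as one small function per role, as sketched below; the event enumeration and the way events would be detected from the simulator's world model are assumptions made for illustration.

// Reward signals for each role, following the values listed in the text.
enum class Event { BallWithGoalie, BallWithOpponent, OpponentScored,
                   AgentScored, TeammateScored, None };

double goalieReward(Event e) {
    switch (e) {
        case Event::BallWithGoalie:   return 100.0;   // ball caught, kicked or driven by the goalie
        case Event::BallWithOpponent: return -50.0;
        case Event::OpponentScored:   return -100.0;
        default:                      return 0.0;
    }
}

double defenderReward(Event e) {
    switch (e) {
        case Event::BallWithOpponent: return -100.0;
        case Event::OpponentScored:   return -100.0;
        default:                      return 0.0;
    }
}

double strikerReward(Event e) {
    switch (e) {
        case Event::BallWithOpponent: return -50.0;
        case Event::AgentScored:      return 100.0;
        case Event::TeammateScored:   return 50.0;
        default:                      return 0.0;
    }
}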

The heuristic policy used for all the agents is described by two rules: if the agent is not near the ball, run in the direction of the ball; if the agent is close to the ball, do something with it. Note that the heuristic policy is very simple, leaving the task of learning what to do with the ball and how to get past the opponent to the learning process. The values associated with the heuristic function are defined using equation 5, with the value of η set to 500. This value is computed only once, at the beginning of the game. In all the following episodes, the value of the heuristic is kept fixed, allowing the learning process to overcome bad indications.
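A sketch of this two-rule heuristic policy is given below; the distance threshold and the action identifiers are illustrative assumptions, and “do something with the ball” maps to the ball-handling action of each role.

// The heuristic policy: run towards the ball when far from it, otherwise act on it.
int heuristicPolicy(double distanceToBall,
                    int interceptBallAction,      // id of interceptBall for this role
                    int ballHandlingAction,       // e.g. kickBall or driveBallForward
                    double nearThreshold = 1.5)   // meters; assumed value
{
    if (distanceToBall > nearThreshold)
        return interceptBallAction;   // rule 1: not near the ball, run in its direction
    return ballHandlingAction;        // rule 2: close to the ball, do something with it
}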

In order to evaluate the performance of the HAQL algorithm, this experiment was performed with teams of agents that learn using the Q–learning algorithm and with teams of agents that learn using the HAQL algorithm. The results presented are based on the average of 10 training sessions for each algorithm. Each session is composed of 100 episodes, each consisting of a match lasting 3000 cycles. During the simulation, when a team scores a goal, all agents are transferred back to a pre-defined start position, presented in figure 3.

Fig. 3. Position of all the agents at the beginning of an episode.

The parameters used in the experiments were the same for the two algorithms, Q–learning and HAQL: the learning rate α = 1.25, the exploration/exploitation rate p = 0.05 and the discount factor γ = 0.9. Values in the Q table were randomly initialized, with 0 ≤ Q(s, a) ≤ 1. The experiments were programmed in C++ and executed on a Pentium IV 2.8 GHz with 1 GB of RAM, running Linux.

Figure 4 shows the learning curves for both algorithms when the agents learn how to play against a team composed of one goalkeeper, one defender and two strikers from the UvA Trilearn 2001 team [3]. It presents the average goals scored by the learning team in each episode (left) and the average goals scored by the opponent team (right), using the Q–Learning and the HAQL algorithms. It can be verified that Q–learning performs worse than HAQL in the initial learning phase and that, as the matches proceed, the performance of the two algorithms becomes more similar, as expected.

Student’s t-test [12] was used to verify the hypothesis that the use of heuristics speeds up the learning process. The t-test is a statistical test used to determine whether the mean values of two sets of results are significantly different from each other. Given two data sets, a T value is computed using both sets’ mean values, standard deviations and numbers of data points. If the T value is above a pre-defined threshold (usually the 95% confidence level), then it can be stated that the two algorithms differ.
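The paper does not spell out the exact form of the statistic; a standard two-sample form, assuming n1 and n2 training sessions per episode with means x̄1, x̄2 and standard deviations s1, s2, is:

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}.$$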

For the experiments described in this section, the absolute value of T was computed for each episode using the same data presented in figure 4. The result, presented in figure 5, shows that HAQL performs clearly better than Q–learning until the 20th episode, with a level of confidence greater than 95%. After the 60th episode the results become closer, but it can be seen that HAQL still performs better than Q–learning in some cases.


Fig. 4. Average goals scored by the learning team (left) and scored against it (right), using the Q–Learning and the HAQL algorithms, for training sessions against UvA Trilearn agents. (Axes: goals per episode vs. episodes 0–100; one curve per algorithm.)

Fig. 5. Results from Student’s t-test between the Q–learning and HAQL algorithms, for the number of goals scored (left) and conceded (right). (Axes: T value vs. episodes 0–100, with the 95% confidence level marked.)

5 Conclusion and Future Work

This paper presented the use of the Heuristically Accelerated Q–Learning (HAQL) algorithm to speed up the learning process of teams of mobile autonomous robotic agents acting in the RoboCup 2D Simulator.

The experimental results obtained in this domain showed that agents using the HAQL algorithm learned faster than the ones using Q–learning when trained against the same opponent. These results are a strong indication that the performance of the learning algorithm can be improved using very simple heuristic functions.

Because reinforcement learning requires a large number of training episodes, the HAQL algorithm has so far been evaluated only in simulated domains. Among the actions that need to be taken for a better evaluation of this algorithm, the most important ones include:

– The development of teams composed of agents with a more complex state space representation and a larger number of players.


– Obtaining results in more complex domains, such as RoboCup 3D Simulation and Small Size League robots [6].

– Comparing the use of more suitable heuristics in these domains.
– Validating HAQL by applying it to other domains, such as the “car on the hill” and the “cart-pole”.

Future work also includes the incorporation of heuristics into other well-known RL algorithms, such as SARSA, Q(λ), Minimax-Q and Minimax-QS, and conceiving ways of obtaining the heuristic function automatically.

References

[1] R. A. C. Bianchi, C. H. C. Ribeiro, and A. H. R. Costa. Heuristically Accelerated Q-Learning: a new approach to speed up reinforcement learning. Lecture Notes in Artificial Intelligence, 3171:245–254, 2004.

[2] E. Bonabeau, M. Dorigo, and G. Theraulaz. Inspiration for optimization from social insect behaviour. Nature, 406(6791), 2000.

[3] R. de Boer and J. Kok. The Incremental Development of a Synthetic Multi-Agent System: The UvA Trilearn 2001 Robotic Soccer Simulation Team. Master’s thesis, University of Amsterdam, 2002.

[4] S. W. Hasinoff. Reinforcement learning for problems with hidden state. Technical report, University of Toronto, 2003.

[5] S. Kalyanakrishnan, Y. Liu, and P. Stone. Half field offense in RoboCup soccer: A multiagent reinforcement learning case study. In G. Lakemeyer, E. Sklar, D. Sorrenti, and T. Takahashi, editors, RoboCup 2006: Robot Soccer World Cup X. Springer Verlag, Berlin, 2007.

[6] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa. RoboCup: a challenge problem for AI. AI Magazine, 18(1):73–85, 1997.

[7] S. Koenig and R. G. Simmons. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22:227–250, 1996.

[8] M. L. Littman and C. Szepesvári. A generalized reinforcement learning model: Convergence and applications. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML’96), pages 310–318, 1996.

[9] T. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[10] I. Noda. Soccer Server: a simulator of RoboCup. In Proceedings of the AI Symposium of the Japanese Society for Artificial Intelligence, pages 29–34, 1995.

[11] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995.

[12] M. R. Spiegel. Statistics. McGraw-Hill, 1998.

[13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[14] C. Szepesvári and M. L. Littman. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report CS-96-11, Department of Computer Science, Brown University, Providence, RI, 1996.

[15] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.