
A Graph-Based Evolutionary Algorithm: Genetic Network Programming (GNP) and Its Extension Using Reinforcement Learning

Shingo Mabu [email protected] School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Kotaro Hirasawa [email protected] School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Jinglu Hu [email protected] School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Abstract

This paper proposes a graph-based evolutionary algorithm called Genetic Network Programming (GNP). Our goal is to develop GNP, which can deal with dynamic environments efficiently and effectively, based on the distinguished expression ability of the graph (network) structure. The characteristics of GNP are as follows. 1) GNP programs are composed of a number of nodes which execute simple judgment/processing, and these nodes are connected to each other by directed links. 2) The graph structure enables GNP to re-use nodes, thus the structure can be very compact. 3) The node transition of GNP is executed according to its node connections without any terminal nodes, thus the past history of the node transition affects the current node to be used, and this characteristic works as an implicit memory function. These structural characteristics are useful for dealing with dynamic environments. Furthermore, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. In this paper, we applied GNP to the problem of determining agents' behavior in order to evaluate its effectiveness. Tileworld was used as the simulation environment. The results show some advantages of GNP over conventional methods.

Keywords: Evolutionary computation, graph structure, reinforcement learning, agent, Tileworld.

    1 Introduction

A large number of studies have been conducted on evolutionary optimization techniques. Genetic Algorithm (GA) (Holland, 1975), Genetic Programming (GP) (Koza, 1992, 1994) and Evolutionary Programming (EP) (Fogel et al., 1966; Fogel, 1994) are typical evolutionary algorithms. GA evolves strings and is mainly applied to optimization problems. GP was devised later in order to expand the expression ability of GA by using tree structures. EP is a graph-structural system creating finite state machines by evolution. In this paper, a new graph-based evolutionary algorithm named Genetic Network Programming (GNP) (Katagiri et al., 2000, 2001; Hirasawa et al., 2001; Mabu et al., 2002, 2004) is described.


Our aim in developing GNP is to deal with dynamic environments efficiently and effectively by using the higher expression ability of the graph structure and the functions inherently equipped in it.

The distinguishing functions of the GNP structure are its directed graph expression, the reusability of nodes, and an implicit memory function. The directed graph expression can realize some repetitive processes, and it can be effective because it works like the Automatically Defined Functions (ADFs) in GP. The node transition of GNP starts from a start node and continues based on the node connections, thus it can be said that an agent's¹ actions in the past are implicitly memorized in the network flow.

In addition, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. The distinguishing functions of GNP-RL are the combination of offline and online learning, and of diversified and intensified search. Although we have already proposed a method, online learning of GNP (Mabu et al., 2003), which uses only reinforcement learning to select the node connections, this method has a problem in that the Q table becomes very large, and the calculation time and memory occupation also become large. Thus, in this paper we propose a method using both evolution and reinforcement learning. Evolutionary algorithms are superior in terms of wide space search ability because they continue to evolve various individuals and select better ones (offline learning), while RL can learn incrementally, based on rewards obtained during task execution (online learning). Therefore, the combination of evolution and RL can cooperatively make good graph structures. In fact, the proposed evolutionary algorithm (diversified search) makes rough graph structures by selecting some important functions among many kinds of functions and connecting them based on fitness values after task execution, thus the Q table becomes quite compact. Then RL (intensified search) selects the best function during task execution, i.e., determines an appropriate node transition.

This paper is organized as follows. Section 2 provides the related work and comparisons with GNP. In Section 3, the details of GNP and GNP-RL are described. Section 4 explains the Tileworld problem and the available function sets, and shows the simulation results. Section 5 discusses future work and remaining problems. Section 6 is devoted to conclusions.

    2 Related Work and Comparisons

A GP tree can be used as a decision tree when all function nodes are if-then type functions and all terminal nodes are concrete action functions. In this case, a tree is executed from the root node to a certain terminal node in each step, so the behaviors of agents are determined mainly by the current information. On the other hand, since GNP has an implicit memory function, GNP can determine an action using not only the current, but also the past information. The most important problem for GP is the bloat of the tree. The increase in depth causes an exponential enlargement of the search space, the occupation of large amounts of memory, and an increase in calculation time. Constraining the depth of the tree is one of the ways to overcome the bloat problem. Since the graph structure of GNP has an implicit memory function and the ability to re-use nodes, GNP is expected to use necessary nodes repeatedly and create compact structures. We will have a discussion on the program size in the simulation section.

¹ An agent is a computer system that is situated in some environment, and it is capable of autonomous action in this environment in order to meet its design objectives (Weiss, 1999). In this paper, the autonomous action is determined by GNP.

    370 Evolutionary Computation Volume 15, Number 3

  • 7/31/2019 A Graph-Based Evolutionary Algorithm

    3/30

    Genetic Network Programming and Its Extension

Evolutionary Programming (EP) is a graph-structural system used for the automatic synthesis of finite state machines (FSMs). For example, FSM programs are evolved in the iterated prisoner's dilemma game (Fogel, 1994; Angeline, 1994) and the ant problem (Angeline and Pollack, 1993). However, there are some essential differences between EP and GNP. Generally, an FSM must define its transition rules for all combinations of states and possible inputs, thus the FSM program will become large and complex when the number of states and inputs is large. In GNP, the nodes are connected by necessity, so it is possible that only the essential inputs obtained in the current situation are used in the network flow. As a result, the graph structure of GNP can be quite compact.

PADO (Teller and Veloso, 1995, 1996; Teller, 1996) is also a graph-based algorithm, but its fundamental concept is different from that of GNP. Each node in a PADO program has two functional parts, an action part and a branch decision part, and PADO also has both a start node and an end node. The state transition of PADO is based on stack and indexed memory. Since PADO has been successfully applied to image and sound classification problems, it can be said that PADO has a splendid ability for static problems. GNP is designed mainly to deal with problems in dynamic environments. First, the main concept of GNP is to make use of the implicit memory function. Therefore, GNP does not presuppose that it uses explicit memories such as stack and indexed memories. Second, GNP has judgment nodes and processing nodes, which correspond to the branch decision parts and action parts in PADO, respectively. Note that GNP separates judgment and processing functions, while both functions of PADO are in a single node. Therefore GNP can create more complex combinations/rules of judgment and processing. Finally, the nodes of GNP have unique node numbers and the number of nodes is the same in all the individuals. This characteristic contributes to executing RL effectively in GNP-RL (see Section 3.6). Some information on the techniques of graph- or network-based GP is given in Luke and Spector (1996).

Finally, we explain the methods that combine evolution and learning. In Downing (2001), special terminal nodes for learning are introduced to GP, and the contents of the terminal nodes, i.e., the actions of agents, are determined by Q learning. In Iba (1998), a Q table is produced by GP to make an efficient state space for Q learning; e.g., if the GP program is (TAB (× x y) (+ z 5)), it represents a 2-dimensional Q table having two axes, x × y and z + 5. In Kamio and Iba (2005), the terminal nodes of GP select appropriate Q tables and the agent action is determined by the selected Q table. The most important difference between these methods and GNP-RL is how the state-action spaces (Q tables) are created. GNP creates Q tables using its graph structures. Concretely speaking, an activated node corresponds to the current state, and the selection of a function in the activated node corresponds to an action. In the other methods, the current state is determined by the combination of the inputs, and the actions are actual actions such as move forward, turn right and so on.

    3 Genetic Network Programming

In this section, Genetic Network Programming is explained in detail. GNP is an extension of GP in terms of gene structures. The original motivation for developing GNP is based on the more general representation ability of graphs as opposed to that of trees in dynamic environments.


[Figure 1: Basic structure of GNP. The figure shows the directed graph structure (start node, judgement nodes J1, J2, processing nodes P1, P2, and time delays) together with the corresponding gene structure (node gene: K_i, ID_i, d_i; connection gene: C^A_i, d^A_i, C^B_i, d^B_i, . . .), the LIBRARY of judgment functions J1, . . . , Jk and processing functions P1, . . . , Pl, and the agent-environment loop that returns a fitness value.]

    3.1 Basic Structure of GNP

    3.1.1 Components

Fig. 1 shows the basic structure of GNP. A GNP program is composed of one start node, plural judgment nodes and plural processing nodes. In Fig. 1, there are one start node, two judgment nodes and two processing nodes, and they are connected to each other. The start node has no function and no conditional branch; its only role is to determine the first node to be executed. Judgment nodes have conditional branch decision functions. Each judgment node returns a judgment result and determines the next node to be executed. Processing nodes work as action/processing functions. For example, processing nodes determine the agent's actions such as go forward, turn right and turn left. In contrast to judgment nodes, processing nodes have no conditional branch. By separating processing and judgment functions, GNP can handle various combinations of judgment and processing. That is, how many judgments and which kinds of judgment should be used can be determined by evolution. Suppose there are eight kinds of judgment nodes (J1, . . . , J8) and four kinds of processing nodes (P1, . . . , P4). Then, GNP can make a node transition by selecting the necessary nodes, e.g., J1 -> J5 -> J3 -> P1. Here, it is said that judgment nodes J1, J5 and J3 are needed for processing node P1. By selecting the necessary nodes, the GNP program can be quite compact and can be evolved efficiently.

In this paper, as described above, each processing node determines an agent's action such as go forward, turn right and so on, and each judgment node determines the next node after judging "what is in front?", "what is to the right?" and so on. However, in other applications, they could be applied to other functions, such as "judge sensor values" (judgment) and "determine wheel speed" (processing) of the Khepera robot (by K-Team Corp.), or "judge whether stocks rise or drop" (judgment) and "determine a buy or sell strategy" (processing) in stock markets.


GNP evolves the graph structure with a predefined number of nodes, so it never causes bloat². In addition, GNP has the ability to use certain judgment/processing nodes repeatedly to achieve a task. Therefore, even if the number of nodes is predefined and small, GNP can perform well by making effective node connections based on re-using nodes. As a result, we do not have to prepare an excessive number of nodes. The compact structure of GNP is a quite important and distinguishing characteristic, because it contributes to saving memory consumption and calculation time.

    3.1.2 Memory Function

The node transition begins from the start node, but there are no terminal nodes. After the start node, the current node is transferred according to the node connections and judgment results; in other words, the selection of the current node is influenced by the node transitions of the past. Therefore, the graph structure itself has an implicit memory function of the past agent actions. Although a judgment node is a conditional branch decision function, the GNP program is not merely an aggregate of if-then rules, because it includes information on past judgment and processing. For example, in Fig. 1, after node 1 (processing node P1) is executed, the next node becomes node 2 (judgment node J2). Therefore, when the current node is node 2, we can know that the previous processing was P1.

The node transition of GNP ends when the end condition is satisfied, e.g., when the time step reaches the preassigned one or the GNP program completes the given task.

    3.1.3 Time Delays

GNP has two kinds of time delays: the time delay GNP spends on judgment or processing, and the one it spends on node transitions. In real-world problems, agents need time to judge environments, prepare for actions and take actions. For example, when a man is walking and sees a puddle before him, he will avoid it. At that moment, it takes some time to judge the puddle (time delay of judgment), to put the judgment into action (time delay of transition from judgment to processing) and to avoid the puddle (time delay of processing). Since time delays are listed in each node gene and are unique attributes of each node, GNP can evolve flexible programs considering time delays. In this paper, the time delay of each node transition is set at zero time units, that of each judgment node is one time unit, that of each processing node is five time units, and that of the start node is zero time units. In addition, one step of an agent's behavior is defined in such a way that one step ends when an agent uses five or more time units. Thus an agent should do fewer than five judgments and one processing, or five judgments, in one step. Suppose there are three agents (agent 0, agent 1, agent 2) in an environment. During one step, first agent 0 takes an action, next agent 1, and finally agent 2. In this way, the agents repeatedly take actions until reaching the maximum preassigned number of steps. Another important role of time delays and steps is to prevent the program from falling into deadlocks. For example, if an agent cannot execute processing because of a judgment loop, then one step ends after five judgments. Such a program is removed from the population in the evolutionary process, or its node transition is changed by the learning process of GNP-RL, as described later.

² A phenomenon where the program size, i.e., the number of nodes, becomes too large as the generations go on.
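To make the step accounting of Section 3.1.3 concrete, the following sketch counts time units until one agent step ends. It is only an illustration under the delay values stated above (one unit per judgment, five per processing, zero per transition); the helper name and representation are ours, not the authors' code.

```python
# Sketch of the step accounting described above (illustrative helper, not the authors' code).
# Time delays: judgment = 1 unit, processing = 5 units, node transition = 0 units.
# One agent step ends as soon as 5 or more time units have been consumed.

JUDGMENT_DELAY = 1
PROCESSING_DELAY = 5
STEP_LIMIT = 5

def run_one_step(node_sequence):
    """node_sequence: iterable of 'J' (judgment) or 'P' (processing) node types
    visited in transition order. Returns the nodes executed within one step."""
    used, executed = 0, []
    for kind in node_sequence:
        used += JUDGMENT_DELAY if kind == "J" else PROCESSING_DELAY
        executed.append(kind)
        if used >= STEP_LIMIT:   # e.g. 4 judgments + 1 processing = 9 units,
            break                # or 5 judgments = 5 units (the deadlock-loop case)
    return executed, used

print(run_one_step(["J", "J", "J", "P", "J"]))   # -> (['J', 'J', 'J', 'P'], 8)
print(run_one_step(["J"] * 10))                  # -> (['J', 'J', 'J', 'J', 'J'], 5)
```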


    3.2 Gene Structure

The graph structure of GNP is determined by the combination of the following node genes. The genetic code of node i (0 ≤ i ≤ n − 1)³ is also shown in Fig. 1.

K_i represents the node type: K_i = 0 means start node, K_i = 1 means judgment node and K_i = 2 means processing node. ID_i represents the identification number of the node function, e.g., K_i = 1 and ID_i = 2 mean the node is J2. d_i is the time delay spent on judgment or processing. C^A_i, C^B_i, . . . show the node numbers connected from node i. d^A_i, d^B_i, . . . mean the time delays spent on the transition from node i to node C^A_i, C^B_i, . . ., respectively. Judgment nodes determine the upper suffix of the connection genes to refer to depending on their judgment results. For example, if the judgment result is B, GNP refers to C^B_i and d^B_i. However, the start node and processing nodes use only C^A_i and d^A_i, because they have no conditional branch.

    3.3 Initialization of a GNP Population

Fig. 2 shows the whole flowchart of GNP. An initial population is produced according to the following rules. First, we determine the number of each kind of node⁴; therefore, all programs in a population have the same number of nodes, and the nodes with the same node number have the same function. However, the extended algorithm GNP-RL, described later, determines the node functions automatically, so we only need to determine the number of judgment nodes and processing nodes, e.g., 40 judgment nodes and 20 processing nodes. The connection genes C^A_i, C^B_i, . . . are set at values selected randomly from 1, . . . , n − 1 (except i, in order to avoid self-loops).

    3.4 A Run of a GNP Program

The node transition of GNP is based on C_i. If the current node i is a judgment node, GNP executes the judgment function ID_i and determines the next node using its result. For example, when the judgment result is B, the next node becomes C^B_i. When the current node is a processing node, after executing the processing function ID_i, the next node becomes C^A_i.
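A minimal execution loop corresponding to this description might look as follows. The dictionary-based node table and the placeholder judge/process functions are ours, introduced only for illustration; the actual functions depend on the application (see Table 1 for Tileworld).

```python
import random

# Sketch of the node-transition rule of Section 3.4 (illustrative, not the authors' code).
# Each node: {"K": type, "ID": function id, "C": [next-node numbers per branch]}.
nodes = {
    0: {"K": 0, "ID": 0, "C": [1]},        # start node: only determines the first node
    1: {"K": 1, "ID": 2, "C": [2, 3]},     # judgment node with branches A and B
    2: {"K": 2, "ID": 1, "C": [1]},        # processing nodes have a single branch
    3: {"K": 2, "ID": 3, "C": [1]},
}

def judge(func_id):
    """Placeholder judgment: returns the index of the branch to follow (0 = A, 1 = B, ...)."""
    return random.randrange(2)

def process(func_id):
    """Placeholder processing: would make the agent act (move forward, turn, ...)."""
    pass

current = nodes[0]["C"][0]                 # the start node fixes the first node
for _ in range(10):                        # run until an end condition (here: 10 transitions)
    node = nodes[current]
    if node["K"] == 1:                     # judgment node: branch on the judgment result
        current = node["C"][judge(node["ID"])]
    else:                                  # processing node: execute, then follow branch A
        process(node["ID"])
        current = node["C"][0]
```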

    3.5 Genetic Operators

In each generation, the elite individuals are preserved and the rest of the individuals are replaced with new ones generated by crossover and mutation. In Simulation I (Section 4.3), first, 179 individuals are selected from the population by tournament selection⁵ and their genes are changed by mutation. Then, 120 individuals are also selected from the population and their genes are exchanged by crossover. Finally, the 299 individuals generated by mutation and crossover and one elite individual form the next population; a minimal sketch of this generation assembly is given below.

    3.5.1 Mutation

Mutation is executed on one individual and a new one is generated [Fig. 3]. The procedure of mutation is as follows (a minimal code sketch follows the list).

³ Each node in a program has a unique node number from 0 to n − 1 (n: total number of nodes).

⁴ Five of each kind in this paper. It could be determined experimentally; however, in this paper, previous experience indicates that five nodes per kind (J1, J2, . . . , P1, P2, . . .) can keep a reasonable balance between expression ability and search speed.

⁵ The calculation cost of tournament selection is relatively small, because it simply compares the fitness values of some individuals, and we can easily control the selection pressure by the tournament size. Thus, we use tournament selection in this paper. The tournament size is set at six.


[Figure 2: Flowchart of the GNP system: generate an initial population; for each individual (ind = 1 to the number of individuals), execute judgement/processing until the trial ends; then apply reproduction, mutation and crossover; repeat until the last generation.]

1. Select one individual using tournament selection and reproduce it as a parent.

2. Each connection of each node (C_i) is selected with the probability P_m. Each selected C_i is changed to another value (node number) randomly.

3. The generated new individual becomes a member of the next generation.
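A sketch of this connection mutation is given below. The dictionary representation of an individual, the deep copy of the parent, and the exclusion of node 0 as a connection target (following the initialization rule of Section 3.3) are our assumptions.

```python
import copy
import random

def mutate(parent, n_nodes, p_m):
    """Connection mutation of standard GNP (sketch): each branch of each node is,
    with probability p_m, re-connected to a randomly chosen different node."""
    child = copy.deepcopy(parent)            # parent: {node_number: {"C": [...], ...}, ...}
    for i, gene in child.items():
        for b in range(len(gene["C"])):
            if random.random() < p_m:
                # candidate targets: nodes 1..n-1, excluding i to avoid self-loops
                choices = [j for j in range(1, n_nodes) if j != i]
                gene["C"][b] = random.choice(choices)
    return child
```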

    3.5.2 Crossover

Crossover is executed between two parents and generates two offspring [Fig. 4]. The procedure of crossover is as follows (a minimal code sketch follows the list).

1. Select two individuals using tournament selection twice and reproduce them as parents.

2. Each node i is selected as a crossover node with the probability P_c.

3. The two parents exchange the genes of the corresponding crossover nodes, i.e., the nodes with the same node number.

4. The generated new individuals become members of the next generation.

Fig. 4 shows a crossover example of a graph structure with three processing nodes, for simplicity. If GNP exchanges the genes of judgment nodes, it must exchange all the genes with suffixes A, B, C, . . . simultaneously.
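A sketch of this node-wise crossover follows, using the same illustrative dictionary representation as above; swapping the whole per-node gene at once corresponds to exchanging the function ID, the time delays and all branch connections together.

```python
import copy
import random

def crossover(parent1, parent2, p_c):
    """Node-wise crossover of GNP (sketch): each node number is chosen as a crossover
    node with probability p_c, and the two parents swap the WHOLE gene of that node
    (function ID, time delays and all branch connections A, B, C, ... together)."""
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    for i in child1:                         # the same node numbers exist in every individual
        if random.random() < p_c:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2
```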

3.6 Extended Algorithm: GNP with Reinforcement Learning (GNP-RL)

In this subsection, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL).


[Figure 3: Mutation. Each branch is selected with the probability P_m, and a selected branch is re-connected to another node chosen randomly (e.g., C^A_3 = 2 becomes C^A_3 = 1).]

Standard GNP (SGNP), described in the previous section, is based on a general evolutionary framework of selection, crossover and mutation. GNP-RL is based on both evolution and reinforcement learning (Sutton and Barto, 1998). The aim of combining RL with evolution is to produce programs using the current information (state and reward) obtained during task execution. Evolution-based methods change their programs mainly after task execution or after enough trials, i.e., offline learning. On the other hand, GNP-RL can change its programs incrementally based on rewards obtained during task execution, i.e., online learning. For example, when an agent takes a good action with a positive reward at a certain state, the action is reinforced and will be adopted with higher probability when visiting the state again. Online learning is one of the advantages of GNP-RL.

The other advantage is the combination of the diversified search of evolution and the intensified search of RL. The role of evolution is to make rough structures, i.e., plural paths of node transition, through selection, crossover and mutation. The role of RL is to determine one appropriate path in a structure made by evolution. Because RL is executed based on immediate rewards obtained after taking actions, an intensified search, i.e., local search, can be executed efficiently. Evolution changes programs more largely than RL, and the programs (solutions) can thereby escape from local minima, so we call evolution a diversified search.

The nodes of GNP have unique node numbers and the number of nodes (states) is the same in all the individuals. In addition, the crossover operator exchanges the nodes with the same node number. Therefore, large changes of the Q tables do not occur, and the knowledge obtained in the previous generation can be used effectively in the current generation.

    3.6.1 Basic Structure of GNP-RL

Fig. 5 shows the basic structure of GNP-RL. The difference between GNP-RL and SGNP is whether or not plural functions exist in a node. Each node of SGNP has one function, but a node of GNP-RL has several functions, and one of them is selected based on a policy. K_i represents the node type, which is the same as in SGNP. ID_ip (1 ≤ p ≤ m_i⁶) shows the identification number of the node function. In Fig. 5, m_i of all nodes is set at 2, i.e., GNP can select the node function ID_i1 or ID_i2. Q_ip is a Q value which is assigned to each state and action pair.

⁶ m_i (1 ≤ m_i ≤ M, M: maximum number of functions in a node, e.g., M = 4) shows the number of node functions GNP can select at the current node i. m_i is determined randomly at the beginning of the first generation, but it can be changed by mutation.


[Figure 4: Crossover. Each node is selected as a crossover node with the probability P_c, and the two parents exchange all the genes (ID_3, d_3, C^A_3, d^A_3, . . .) of the corresponding crossover nodes, generating two offspring.]

In reinforcement learning, state and action must be defined. Generally, the current state is determined by the combination of the current information, e.g., sensor inputs, and an action is an actual action an agent takes, e.g., go forward. However, in GNP-RL, the current state is defined as the current node, and the selection of a node function (ID_ip) is defined as an action. d_ip is the time delay spent on judgment or processing. C^A_ip, C^B_ip, . . . show the node numbers of the next nodes. d^A_ip, d^B_ip, . . . mean the time delays spent on the transition from node i to node C^A_ip, C^B_ip, . . ., respectively.

    3.6.2 A Run of GNP with Reinforcement Learning

The node transition of GNP-RL also starts from the start node and continues depending on the node connections and judgment results.

If the current node i is a judgment node, first, one Q value is selected from Q_i1, . . . , Q_im_i based on the ε-greedy policy. That is, the maximum Q value among Q_i1, . . . , Q_im_i is selected with the probability 1 − ε, or a random one is selected with the probability ε, and then the corresponding ID_ip is selected. GNP executes the selected judgment function ID_ip and determines the next node depending on the judgment result. For example, if the selected function is ID_i2 and the judgment result is B, the next node becomes node C^B_i2.

If the current node is a processing node, GNP selects and executes a processing function in the same way as judgment nodes, and the next node becomes node C^A_i2 when the selected function is ID_i2.

Here, a concrete example of node transition is explained using Fig. 6. The first node is judgment node 2, which holds the functions JF and TD (see Table 1). Suppose JF is selected based on the ε-greedy policy and the judgment result is D (= floor). Then the next node number becomes C^D_21 = 4.


[Figure 5: Basic structure of GNP with Reinforcement Learning: the directed graph structure and the gene structure in the case of m_i = 2. Each node gene holds K_i and the function entries (ID_ip, Q_ip, d_ip); each connection gene holds the branches (C^A_ip, d^A_ip), (C^B_ip, d^B_ip), . . .; for judgment nodes, one branch is selected according to the judgement result.]

In node 4, the processing function MF is selected, so the agent moves forward, and the next node becomes node C^A_42 = 9.
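A sketch of the ε-greedy function selection used at each node is given below; ε = 0.1 follows the setting reported in Section 4.2.1, while the function name and the list layout of the Q values are ours.

```python
import random

def select_function(q_values, epsilon=0.1):
    """epsilon-greedy selection among the m_i functions stored in one GNP-RL node (sketch):
    with probability 1 - epsilon take the function with the largest Q value,
    otherwise take a random one. Returns the index p of the selected function."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda p: q_values[p])

# Node 2 of Fig. 6: functions (JF, TD) with Q values (1.0, 0.1); JF is usually selected.
p = select_function([1.0, 0.1])
```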

    3.6.3 Genetic Operators

The crossover operator in GNP-RL is the same as in SGNP, i.e., all the genes of the selected nodes are exchanged. However, GNP-RL has its own mutation operators. The procedure is as follows.

1. Select one individual using tournament selection and reproduce it as a parent.

2. Mutation operator: there are three kinds of mutation operators [Fig. 7], and one of them, selected uniformly, is executed.

(a) Connection of functions: each node connection is re-connected to another node (C_ip is changed to another node number) with the probability P_m.

(b) Content of functions: each function is selected with the probability P_m and changed to another function, i.e., ID_ip and d_ip are each changed.


[Figure 6: An example of node transition using the nodes for the Tileworld problem: judgment node 2 holds JF (judge forward) and TD (direction of the nearest tile from the agent) with Q_21 = 1.0 and Q_22 = 0.1, with branches for the judgment results tile, hole, obstacle, floor and agent; processing node 4 holds MF (move forward) and TL (turn left) with Q_41 = 0.02 and Q_42 = 0.5.]

(c) Number of functions: each node i is selected with the probability P_m, and the number of functions m_i is changed to 1, . . . , or M randomly. If the revised m_i is larger than the previous m_i, then one or more new functions selected from the LIBRARY are added to the node so that the number of functions becomes the revised m_i. If the revised m_i is smaller, then one or more functions are deleted from the node.

3. The generated new individual becomes a member of the next generation.

    3.7 Learning Phase

Reinforcement learning is carried out while the agents are carrying out their tasks, and it terminates when the time step reaches the predefined number of steps. The learning phase of GNP is based on the basic Sarsa algorithm (Sutton and Barto, 1998). Sarsa calculates Q values, which are functions of state s and action a. Q values estimate the sum of the discounted rewards to be obtained in the future. Suppose that an agent selects an action a_t at state s_t at time t, a reward r_t is obtained, and an action a_{t+1} is taken at the next state s_{t+1}. Then Q(s_t, a_t) is updated as follows.

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]   (1)

α is a step size parameter, and γ is a discount rate which determines the present value of future rewards: a reward received k time steps later is worth only γ^(k−1) times the reward that would be received at the current step.

As described before, a state means the current node and an action means the selection of a function. Here, the procedure for updating the Q values is explained using Fig. 8, which shows the states, the actions and an example of node transition; a minimal sketch follows the procedure.

1. At time t, GNP refers to Q_i1, Q_i2, . . . , Q_im_i and selects one of them based on the ε-greedy policy. Suppose that GNP selects Q_ip and the corresponding function ID_ip.

2. GNP executes the function ID_ip, gets the reward r_t, and the next node j becomes C^A_ip.

3. At time t + 1, GNP selects one Q value in the same way as in step 1. Suppose that Q_jp′ is selected.


[Figure 7: Mutation of GNP with Reinforcement Learning: (a) a connection is changed randomly (e.g., C^A_31 = 2 becomes C^A_31 = 1); (b) the content of a function (ID_i, d_i) is changed randomly; (c) the number of functions m_i is changed to a value selected from 1, . . . , M (e.g., m_i = 2 becomes m_i = 3).]

4. The Q value is updated as follows:

Q_ip ← Q_ip + α [r_t + γ Q_jp′ − Q_ip]

5. t ← t + 1, i ← j, p ← p′, then return to step 2.

In this example, node i is a processing node, but if it is a judgment node, the next current node is selected among C^A_ip, C^B_ip, . . . depending on the judgment result.
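The update of steps 1 to 5 can be sketched as follows. α and γ are the step size and discount rate of Eq. (1) (0.9 each in Section 4.2.1); the Q-table layout Q[node][p] and the function name are our own illustrative choices.

```python
def sarsa_update(Q, i, p, reward, j, p_next, alpha=0.9, gamma=0.9):
    """One Sarsa update for GNP-RL (Eq. (1)), sketched: the state is the current node i,
    the action is the selected function index p, and (j, p_next) is the
    state-action pair actually chosen at the next time step."""
    td_target = reward + gamma * Q[j][p_next]
    Q[i][p] += alpha * (td_target - Q[i][p])

# Q[node][function]: e.g. two functions per node, as in Fig. 6.
Q = {2: [1.0, 0.1], 4: [0.02, 0.5]}
sarsa_update(Q, i=2, p=0, reward=1.0, j=4, p_next=1)   # reinforces function 0 at node 2
```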

    4 Simulations

To confirm the effectiveness of the proposed method, simulations determining agents' behavior using the Tileworld problem (Pollack and Ringuette, 1990) are described in this section.

4.1 Tileworld

Tileworld is well known as a testbed for agent problems. Fig. 9 shows an example of Tileworld, which is a 2D grid world including multiple agents, obstacles, tiles, holes and floor cells. Agents can move to a contiguous cell in one step.


[Figure 8: An example of node transition. At time t, the state s_t is the current node i and the action a_t is the selection of Q_ip (function ID_ip); a reward r_t is obtained, and the next node j (= C^A_ip) is the state s_{t+1}, where a_{t+1} selects Q_jp′.]

Moreover, agents can push a tile into the next cell, except when an obstacle or another agent exists in that cell. When a tile is dropped into a hole, the hole and the tile vanish, i.e., the hole is filled with the tile. Agents have some sensors and action abilities, and their aim is to drop as many tiles into holes as fast as possible. Therefore, agents are required to use their sensors and take actions properly according to their situations. Since the given sensors and simple actions are not enough to achieve the tasks directly, agents must make clever combinations of judgment and processing.

The nodes used by the agents are shown in Table 1. The judgment nodes { JF, JB, JL, JR } return { tile, hole, obstacle, floor, agent }, and { TD, HD, THD, STD } return { forward, backward, left, right, nothing } as judgment results, like A, B, . . . in Fig. 1. Fig. 10 shows the four directions an agent can perceive when it faces north.

    4.1.1 Fitness and Reward

A trial ends when the time step reaches the preassigned number of steps, and then the fitness is calculated. Fitness is used in the evolutionary processes, and Reward is used in the learning phase of GNP-RL.

Fitness = the number of dropped tiles
Reward = 1 (when an agent drops a tile into a hole)

    4.2 Simulation Conditions

The simulation conditions are shown in Table 2. For comparison, the simulations are carried out by SGNP, GNP-RL, standard GP, GP with ADFs, and EP evolving FSMs.

    4.2.1 Conditions of GNP

As shown in Table 2, the number of nodes in a program is 61. In the case of SGNP, the number of each kind of judgment and processing node is fixed at five. In the case of GNP-RL, 40 judgment nodes and 20 processing nodes are used, but the numbers of the different kinds of nodes (ID_i) are changed through evolution. At the first generation, all nodes have functions randomly selected from the LIBRARY, but the ID_i are exchanged by crossover and also changed by mutation, thus the appropriate kinds of nodes are selected as a result of evolution.


[Figure 9: Tileworld: a 2D grid world containing tiles (T), holes, obstacles, agents and floor cells.]

[Figure 10: The four directions agents perceive (forward, backward, left, right) relative to the direction the agent faces (e.g., north).]

The number of elite individuals and of offspring generated by crossover and mutation is predefined. Simulation I uses the environment shown in Fig. 9, where the positions of the tiles and holes are fixed, so the same environment is used every generation. On the other hand, in Simulation II, the positions of the tiles and holes are determined randomly, so the problem becomes more difficult and complex. In Simulation I, the best individual is preserved as the elite one, but in Simulation II, five elite individuals are preserved, because the environment changes generation by generation. In fact, the best individual in the previous generation does not always show a good result in the current generation. Therefore, in order to make the performance of each method stable, we preserve five good individuals in Simulation II. In GNP-RL, when creating offspring in the evolution, the offspring inherit the Q values of their parents and use them as initial values. That is, an offspring generated by mutation has the same Q values as its parent, because mutation does not operate on Q values. Hence, the Q values of an elite individual carry over to the next generation. Furthermore, an offspring generated by crossover has the exchanged Q values of its parents. In addition, the three agents in the Tileworld share the Q values.

The crossover rate P_c and the mutation rate P_m are determined appropriately through our experiments, so that the variation of the population is maintained but the programs are not changed too much. The settings of the parameters used in the learning phase of GNP-RL are as follows. The step size parameter α is set at 0.9 in order to find solutions quickly, and the discount rate γ is set at 0.9 in order to sufficiently consider future rewards. ε is set at 0.1 experimentally, considering the balance between exploitation and exploration. In fact, programs with a lower ε fall into local minima with higher probability, and those with a higher ε take too many random actions. M (the maximum number of functions in a node) is set at the best value among M = 2, 3 and 4.

4.2.2 Conditions of GP

We use GP as a decision maker, so the function nodes are used as if-then type branch decision functions, and the terminal nodes are used as action selection functions.


Table 1: Function set.

Judgment nodes
J    symbol   content
J1   JF       judge FORWARD
J2   JB       judge BACKWARD
J3   JL       judge LEFT side
J4   JR       judge RIGHT side
J5   TD       direction of the nearest TILE from the agent
J6   HD       direction of the nearest HOLE from the agent
J7   THD      direction of the nearest HOLE from the nearest TILE
J8   STD      direction of the second nearest TILE from the agent

Processing nodes
P    symbol   content
P1   MF       move forward
P2   TR       turn right
P3   TL       turn left
P4   ST       stay

The terminal nodes of standard GP are composed of the processing nodes of GNP, {MF, TL, TR, ST}, and the function nodes are the judgment nodes of GNP, {JF, JB, JL, JR, TD, HD, THD, STD}. Terminal nodes have no arguments, and function nodes have five arguments corresponding to the judgment results. In the case of GP with ADFs, the main tree uses {ADF1, . . ., ADF10}⁷ as terminal nodes in addition to the terminal and function nodes of standard GP. The ADF trees use the same nodes as standard GP. The genetic operators of GP used in this paper are crossover (Poli and Langdon, 1998, 1997), mutation and inversion (Koza, 1992). In the simulations, the maximum depth of the trees is fixed in order to avoid bloat, but the setting of the maximum depth is very important, because the expression ability improves as the depth becomes larger, while the search space also increases. Therefore, we try various depths in the range permitted by machine memory and calculation time, and use the "full" and "ramped half-and-half" initialization methods (Koza, 1992) in order to produce trees with various sizes and shapes in the initial population.

    4.2.3 Conditions of EP

EP uses the same sensor information as the judgment nodes of GNP use, and the outputs are the same as the contents of the processing nodes. Generally, EP must define transitions and outputs for all combinations of states and inputs. Here, we would like to discuss how the complexity of the EP and GNP programs differs depending on the problem. Table 3 shows the number of outputs/connections for each individual in EP and GNP. Fig. 11 shows the number of outputs at each state/node. In case 1, there is only one sensor, which can distinguish two objects, and the number of states/nodes is 60⁸. Then, the number of outputs of EP becomes 120, that of SGNP becomes 100, and that of GNP-RL becomes 100-400 (variable). However, as the number of sensors

⁷ The number of ADFs in each individual is 10, and each ADF is called by the terminal nodes of the main tree.

⁸ The start node of GNP is not counted, because it has only one branch determining the first judgment or processing node and does not have any functions. EP has a branch determining the first state, but it is not counted as an output.


Table 3: The relation between the number of inputs and outputs in EP and GNP.

                        case 1               case 2    case 3    case 4
X                       1                    4         8         8
Y                       2                    2         2         5
Z   EP                  120                  960       15,360    23,437,500
Z   SGNP                100                  100       100       220
Z   GNP-RL (M = 4)      100-400 (variable)   100-400   100-400   220-880

X: the number of inputs (sensors)
Y: the number of objects each sensor can distinguish
Z: the total number of outputs (connections)
The number of states/nodes is 60 (judgment nodes: 40, processing nodes: 20 in the case of GNP).

[Figure 12: An example of an EP program in the case where one input is dealt with: state 1 uses sensor JF with transitions labeled tile/MF, hole/TR, obstacle/TL, floor/MF, agent/TR, and state 2 uses sensor TD with transitions labeled forward/MF, backward/TR, left/TL, right/TR, nothing/MF.]

should take. If the number of inputs at each state is two instead of one, the number of transitions and outputs becomes 25 each.

At the beginning of the first generation, the predefined number of sensors is assigned to each state randomly, but the types of sensors are changed by the mutation operator as the generations go on. As shown in Table 2, the number of sensors (inputs) and the maximum number of states are set at several values in order to find good settings for EP. The mutation operators used in the simulations are {ADD STATE, REMOVE STATE, CHANGE TRANSITION, CHANGE OUTPUT, CHANGE INITIAL STATE, CHANGE INPUT}. The last operator is an additional one adopted in this paper, and the others are the same as the ones used in Angeline (1994).

    4.3 Simulation I

Simulation I uses the environment shown in Fig. 9, where there are 30 tiles and 30 holes. The three agents have the same program made by GNP, GP or EP. In this environment, since each input to the judgment nodes is not the complete information needed to distinguish the various situations, each method is required to judge its situations and take actions properly by combining various kinds of judgment and processing nodes. The maximum number of time steps is set at 150.


Table 4: Data on the fitness of the best individuals at the last generation in Simulation I.

                            GNP-RL   SGNP           GP-ADFs        GP             EP
average                     21.23    18.00          15.43          14.00          16.30
standard deviation          2.73     1.88           1.94           2.00           1.99
t-test      GNP-RL          -        1.04 × 10^-6   3.03 × 10^-13  3.13 × 10^-17  5.31 × 10^-11
(p-value)   SGNP            -        -              1.32 × 10^-6   3.17 × 10^-11  5.95 × 10^-4

Figs. 13, 14 and 15 show the fitness curves of the best individuals at each generation, averaged over 30 independent simulations. From the results, GNP-RL shows the best fitness value at generation 5000. In early generations, SGNP exhibited better fitness, because the Q values in GNP-RL are set at zero in the first generation and must be updated gradually. However, GNP-RL produces the better result in the later generations thanks to appropriately learned Q values.

Table 4 shows the average fitness values of the best individuals over 30 simulations⁹ at the last generation, their standard deviations, and the results of a t-test (one-sided test). The results of the t-test show the p-values between GNP-RL and the other methods, and between SGNP and the other methods. There are significant differences between GNP-RL and the other methods, and between SGNP and GP, GP with ADFs, and EP.

Although it seems natural that the method using RL can obtain better solutions than the other methods without it, the aim of developing GNP-RL is to solve problems faster than the others within the same time limit of actions (same steps). In other words, GNP-RL aims to make full use of the information obtained during task execution in order to make appropriate node transitions.

From Fig. 14, we see that standard GP of depth four initialized by the full method (GP-full4) shows better results than the other standard GP programs, and GP with ADFs of depth three (main tree) and depth two (ADF tree) initialized by the full method (GP-ADF-full3-2) produces the best result of all the GP programs. However, in this problem, the arity of the function nodes of GP is relatively large (five), so the total number of nodes of GP becomes quite large as the depth becomes large. For example, GP-full4 has 781 nodes, GP-full5 has 3,906 nodes, and GP-full6 has 19,531 nodes. Although GP programs can have higher expression ability as the number of nodes increases, they take much time to explore and much memory is needed. For example, GP-full6 takes too much time to execute the programs, and GP of depth 7 cannot be executed because of the lack of memory in our machine (Pentium 4 2.5 GHz, DDR-SDRAM PC2100 512 MB). On the other hand, GNP can obtain good results using a relatively small number of nodes.

As shown in Fig. 15, EP using three inputs and five states shows better results, so this setting is suitable for the environment. EP uses a graph structure, so it can also execute state transitions considering the agents' past actions. Furthermore, as the number of states increases, EP can implicitly memorize the past action sequences. However, if there are many inputs, this causes a large number of outputs and state transition rules, and the programs then become impractical to explore and execute. The structure of GNP does not become exponentially large even if the number of inputs increases, as

⁹ The results of the best settings, i.e., GP-full4, GP-ADF-full3-2 and EP-input3-state5. Tables 5, 6 and 7 also show the results of the best settings.


[Figure 13: Fitness curves of GNP in Simulation I (fitness vs. generation, 0-5000). Fitness at the last generation: GNP-RL 21.23, SGNP 18.00.]

[Figure 14: Fitness curves of GP in Simulation I. Fitness at the last generation: GP-ADF-full3-2 15.43, GP-ADF-full4-3 14.46, GP-full4 14.00, GP-ADF-ramp3-2 13.86, GP-full5 13.76, GP-ADF-ramp4-3 13.5, GP-ramp5 10.30, GP-ramp4 10.10.]

[Figure 15: Fitness curves of EP in Simulation I. Fitness at the last generation: EP-input3-state5 16.30, EP-input4-state5 14.93, EP-input2-state30 13.70, EP-input1-state60 13.30.]


Table 5: Calculation time for 5000 generations in Simulation I.

                                 GNP-RL   SGNP    GP-ADFs   GP      EP
Calculation time [s]             1,364    1,019   3,252     3,281   2,802
Ratio of each method to SGNP     1.34     1       3.19      3.22    2.75

[Figure 16: Change of the average number of functions m_i in each node in GNP-RL over the generations.]

described in Section 4.2.3; therefore, many more states (nodes) can be used in GNP than in EP. As a result, the implicit memory function of GNP becomes more effective in dynamic environments than that of EP.

Table 5 shows the calculation time for 5,000 generations. SGNP is the fastest, GNP-RL is second and EP is third. GNP-RL takes more time than SGNP because it executes RL during the tasks; however, it does not take very much more time. The maximum number of functions (M) in each node is 4 and one of them is selected as an action, and this procedure does not take much time. In addition, and more importantly, m_i tends to decrease as the generations go on, as shown in Fig. 16 (1.75 at the last generation), because the appropriate number and contents of functions are selected automatically in the evolutionary processes. Therefore, reinforcement learning just selects one function from 1.75 functions on average. This tendency contributes to saving time. In addition, the relatively large ε (= 0.1) succeeded in achieving the tasks thanks to this tendency, because fewer than two functions (actions) are in a node on average at the last generation, while the relatively large ε is useful at the beginning of the generations, because the agents can try many kinds of actions and find good ones by RL. EP takes more time than GNP and GNP-RL, but it can save calculation time compared with ordinary EP, because the number of inputs is limited to three and the structure becomes compact. Actually, the calculation time of EP using four inputs is 4,184 seconds. GP and GP with ADFs have many nodes, thus they take more time than the others in the evolutionary processes.

Fig. 17 (a) shows the typical node transition of the upper-left agent (in Fig. 9) operated by SGNP. The x-axis shows the symbols of the nodes, and the y-axis distinguishes the same kind of nodes, i.e., there are five nodes per kind/symbol¹⁰ and they are numbered 0, 1, 2, 3 and 4. For example, (x, y) = (4, 1) shows the second JF node. From the figure, we can see that specific nodes are used repeatedly.

¹⁰ Five JF nodes, five JB nodes, . . . , five MF nodes, five TL nodes, . . . are used in SGNP.


[Figure 17: Node transition of standard GNP: (a) the whole node transition for 150 steps, plotted over the node symbols MF, TL, TR, ST, JF, JB, JL, JR, TD, HD, THD, STD (x-axis) and the node index 0-4 within each kind (y-axis); (b) a partial node transition extracted from (a), annotated with judgment results and rewards (tile drops).]

Fig. 17 (b) shows a partial node transition extracted from the whole node transition [Fig. 17 (a)]. In Fig. 17 (b), the first node is MF (0,1), so the agent moves forward, and the next node becomes JF (4,1) according to the node connection from MF (0,1). Thus the points (0,1) and (4,1) in Fig. 17 (a) are connected with a line. Next, the judgment JF (4,1) is executed and the judgment result is floor, thus the corresponding node branch (connected to TD (8,0)) is selected. Then the points (4,1) and (8,0) in Fig. 17 (a) are connected. After executing the judgment TD (8,0) (judgment result: forward), the agent goes forward, judges forward (judgment result: obstacle), judges the tile direction (judgment result: right), and so on.

Finally, simulations using Environments A, B and C [Figs. 18, 19 and 20] are carried out. The condition of each method is the same as that showing the best result in the previous environment [Fig. 9]. From Figs. 21, 22 and 23, GNP-RL and SGNP show better results than the other methods.


[Figure 18: Environment A.]

[Figure 19: Environment B.]

[Figure 20: Environment C.]

    4.4 Simulation II

    In Simulation II, we use an environment whose size (20x20) and distribution of theobstacles are the same as Simulation I. However, 20 tiles and 20 holes are set at randompositions at the beginning of each generation. In addition, when an agent drops a tileinto a hole, the tile and the hole disappear; however, a new tile and a new hole appearat random positions. Therefore, the individuals obtained in the previous generationare required to show good performance in the new, unexperienced environment. Thisproblem is more dynamic and suitable than Simulation I in terms of confirming thegeneralization ability of each method. The maximum number of steps is set at 300.

Figs. 24, 25,¹¹ and 26 show the averaged fitness curves of the best individuals over 30 independent simulations at each generation. From the figures, we can see that GNP-RL obtains the highest fitness value at the last generation because the information obtained during task execution is used for making node transitions efficiently. From Table 6, we can also see that there are significant differences between GNP-RL and the other methods. SGNP obtains better fitness values than GP and GP-ADFs at the last generation, but Table 6 shows that there is no significant difference between SGNP and EP-input1-state60. In the case of EP, it is interesting to note that the programs in Simulation II show the opposite results to those in Simulation I, i.e., the program using one input shows better results in Simulation II, while that using three inputs shows better results in Simulation I. Therefore, for EP in this environment, it is recommended that an action be determined by one input and that a relatively large number of states be used. In other words, this EP makes many simple rules and combines them considering the past state transition. This special structure of EP is similar to that of GNP. However, the advantage of GNP is that it automatically selects the necessary number of inputs and actions depending on the situation; moreover, GNP programs with 61 nodes show good results in both Simulation I and II, so we do not need to worry about the setting of the number of nodes. In fact, there are significant differences between SGNP and EP-input3-state5 (p-value = 8.67 × 10⁻⁶), which shows the best fitness value in Simulation I, as well as EP-input2-state30 (1.12 × 10⁻³) and EP-input4-state5 (1.29 × 10⁻⁹).

In the case of GP, it is difficult to find effective programs because the environment changes randomly generation by generation. In addition, GP has relatively complex structures and a wide search space compared to GNP and EP, so it is more difficult for GP to explore solutions.

¹¹Fig. 25 shows the fitness curves of GP-full5 and GP-ADF-full3-2, and the fitness values of the other settings at the last generation. Because the fitness curves of the GP settings overlap each other, only the best two results (GP-full5 and GP-ADF-full3-2) are shown as curves.


Figure 21: Fitness curves in Environment A. [Fitness vs. generation; final values: GNP-RL 20.80, SGNP 16.87, EP 15.97, GP-ADF 14.20, GP 13.80.]

Figure 22: Fitness curves in Environment B. [Fitness vs. generation; final values: GNP-RL 24.37, SGNP 18.83, EP 17.00, GP 16.93, GP-ADF 16.00.]

Figure 23: Fitness curves in Environment C. [Fitness vs. generation; final values: GNP-RL 22.83, SGNP 19.87, EP 19.17, GP-ADF 16.43, GP 13.73.]


Figure 24: Fitness curves of GNP in Simulation II. [Fitness vs. generation; final values: GNP-RL 19.93, SGNP 15.30.]

Figure 25: Fitness curves of GP in Simulation II. [Fitness vs. generation; final values: GP-ADF-full3-2 6.67, GP-full5 6.10, GP-full4 5.90, GP-ADF-ramp3-2 5.73, GP-ADF-ramp4-3 5.60, GP-ADF-full4-3 5.56, GP-ramp5 5.20, GP-ramp4 5.13.]

Figure 26: Fitness curves of EP in Simulation II. [Fitness vs. generation; final values: EP-input1-state60 14.40, EP-input2-state30 12.77, EP-input3-state5 9.40, EP-input4-state5 7.67.]


Table 6: The data on the fitness of the best individuals at the last generation in Simulation II.

                               GNP-RL    SGNP           GP-ADFs         GP              EP
    average                    19.93     15.30          6.67            6.10            14.40
    standard deviation         2.43      3.88           3.19            1.75            2.54
    t-test (p-value) GNP-RL    -         5.90 × 10⁻⁸    7.46 × 10⁻²⁶    1.53 × 10⁻³¹    2.90 × 10⁻¹²
    t-test (p-value) SGNP      -         -              1.36 × 10⁻¹³    5.91 × 10⁻¹⁵    1.46 × 10⁻¹
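The p-values in Table 6 come from t-tests comparing the 30 best-individual fitness values of two methods. A minimal sketch of such a comparison follows; the sampled arrays are placeholders that only reproduce the means and standard deviations of Table 6 (the actual 30 values per method are not available here), and an unpaired two-sample t-test is assumed since the paper does not state the exact variant.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # placeholder samples with the means/standard deviations of Table 6
    fitness = {
        "GNP-RL":  rng.normal(19.93, 2.43, 30),
        "SGNP":    rng.normal(15.30, 3.88, 30),
        "GP-ADFs": rng.normal(6.67, 3.19, 30),
        "GP":      rng.normal(6.10, 1.75, 30),
        "EP":      rng.normal(14.40, 2.54, 30),
    }

    for other in ("SGNP", "GP-ADFs", "GP", "EP"):
        t, p = stats.ttest_ind(fitness["GNP-RL"], fitness[other])
        print(f"GNP-RL vs {other}: p = {p:.2e}")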

Table 7: Calculation time for 5,000 generations in Simulation II.

                                     GNP-RL    SGNP     GP-ADFs    GP        EP
    Calculation time [s]             2,734     1,177    5,921      12,059    1,584
    Ratio of each method to SGNP     2.32      1        5.03       10.25     1.35

Table 7 shows the calculation time for 5,000 generations. SGNP is the fastest, as in Simulation I. However, EP requires less calculation time than GNP-RL because EP deals with only one input at each node, so the amount of calculation in one step decreases; in fact, the total calculation time of EP is smaller than in Simulation I. Although GP-full5 shows the best fitness value of all the settings of standard GP, its calculation time becomes quite large because each individual of GP-full5 has 3,906 nodes.

Fig. 27 shows the ratio of the nodes used by the best individual of GNP-RL, in order to see which nodes are used and which are most efficient for solving the problem. The x-axis represents the unique node number. In GNP-RL, the total number of nodes is 61; each processing node has a node number (1-20) and each judgment node has a node number (21-60). In addition, the symbol (function) of each node changes through evolution, so the x-axis also shows the node symbols in order to indicate which symbols are selected more often as a result of evolution. For example, if the width of TD is wider than that of STD, we know that GNP-RL selects TD more often than STD as a result of evolution. The y-axis shows the ratio of the used nodes.

In the first generation, GNP uses various kinds of nodes randomly and thus cannot obtain effective node transition rules. However, in the last generation, MF, JF, JB, TD and THD are used frequently. So the basic behavior obtained by evolution and RL turns out to be that each agent first judges the direction of the nearest tile from the agent and of the nearest hole from the nearest tile, then turns toward the tile, looks forward and backward, and moves to the forward cell. The other kinds of nodes are used as necessary. It is interesting that GNP rarely uses HD but frequently uses THD. An agent can know the direction of the nearest hole from the agent by using HD, but the aim of the agent is not to reach the hole position or drop itself into holes. The agent must drop tiles into holes, thus it is important to know the direction of the nearest hole from the nearest tile rather than from the agent.
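The usage ratios plotted in Fig. 27 can be obtained by counting how often each of the 61 nodes appears in a logged node-transition sequence. The sketch below assumes a simple list-of-node-numbers log format, which is an illustration rather than the authors' code.

    from collections import Counter

    def node_usage_ratio(visited_node_ids, num_nodes=61):
        counts = Counter(visited_node_ids)                  # executions per node
        total = sum(counts.values())
        return [counts.get(i, 0) / total for i in range(num_nodes)]

    # example with a short, made-up transition log over node numbers 0..60
    log = [3, 21, 45, 3, 21, 45, 7, 3, 21]
    ratios = node_usage_ratio(log)
    print(ratios[21])   # share of executions spent in node 21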

Finally, the simulations using the other three environments (Environment A-II, B-II and C-II) are carried out. The positions of the obstacles in these environments are the same as in Environments A, B and C [Figs. 18, 19 and 20], respectively. However, as in the previous environment, 20 tiles and 20 holes are set at random positions at the beginning of each generation, and a new tile and a new hole appear at random positions when an agent drops a tile into a hole.

Figs. 28, 29 and 30 show the fitness curves for each environment.


Figure 27: Ratio of nodes used by GNP with Reinforcement Learning. [Three panels (Generation 1, Generation 100, Generation 5000) plotting the ratio of used nodes against the node number and symbol (MF, TL, TR, ST, JF, JB, JL, JR, TD, HD, THD, STD).]


Figure 28: Fitness curves in Environment A-II. [Fitness vs. generation; final values: GNP-RL 22.63, EP-input1-state60 16.87, SGNP 16.17, EP-input2-state30 12.53, EP-input3-state5 11.83, EP-input4-state5 9.93, GP-ADF 6.20, GP 6.20.]

Figure 29: Fitness curves in Environment B-II. [Fitness vs. generation; final values: GNP-RL 16.07, EP-input1-state60 11.63, SGNP 9.50, EP-input3-state5 8.37, EP-input2-state30 7.67, EP-input4-state5 7.00, GP-ADF 5.30, GP 4.80.]

Figure 30: Fitness curves in Environment C-II. [Fitness vs. generation; final values: GNP-RL 11.87, EP-input1-state60 8.47, SGNP 8.17, EP-input3-state5 5.13, EP-input2-state30 5.10, GP-ADF 4.63, GP 4.46, EP-input4-state5 3.70.]


SGNP and GNP-RL use the same settings as in the previous simulations. GP and GP-ADFs use the settings showing the best results in the previous simulations. EP uses four settings in order to see the differences between GNP and EP. From the figures, GNP-RL shows the best fitness values in all the environments. However, SGNP does not show better fitness values than EP-input1-state60, although it shows better fitness values than GP, GP-ADFs and the EPs with other settings. We used the special EP in this paper because ordinary EP, i.e., EP using eight inputs, is impractical to use, and it is found that if we determine the appropriate number of inputs and states, it shows good results. However, GNP can automatically determine the necessary number of inputs and actions, and this characteristic is an advantage of GNP.

    5 Discussion

This paper describes a basic analysis of GNP with the aim of applying GNP to more complex dynamic problems in the future, such as elevator group supervisory control systems (Crites and Barto, 1998), stock prediction (Potvin et al., 2004), and RoboCup soccer (Luke, 1998). GP and EP, for example, must prepare a large number of nodes if a problem is complex, and therefore the search space and calculation time become very large. GNP results in compact structures compared to GP and EP, and can consider its past judgment and processing when determining the next judgment or processing.

GNP has some problems which remain to be solved. First, we set out the judgment nodes and processing nodes using raw information obtained from the environment. In the Tileworld problem, for example, since GNP can obtain eight kinds of information and execute four kinds of processing, it uses eight kinds of judgment nodes and four kinds of processing nodes. However, when we apply GNP to more complex problems such as the elevator group supervisory control systems and stock prediction we are now studying, there is too much information to consider, and the number of nodes therefore increases. Although GNP can select the necessary nodes as a result of evolution, the calculation time becomes large. In order to solve this problem and apply GNP to real-world problems, we will combine some important judgments and processing using foresight information and set out new judgment nodes and processing nodes, like macro function nodes. Secondly, the current GNP judges discrete information, e.g., "the direction of the nearest tile is right." We will make the GNP system deal with continuous inputs, such as an angle of 23 degrees to the tile, using a reinforcement learning architecture, e.g., actor-critic. Finally, we must compare GNP with other evolutionary and RL methods on more realistic problems to confirm the applicability of the proposed method to a great variety of applications.

    6 Conclusions

In this paper, a graph-based evolutionary algorithm called Genetic Network Programming (GNP) and its extended algorithm, GNP with Reinforcement Learning (GNP-RL), are proposed. From the simulation results on the Tileworld problem, GNP shows good results in terms of fitness values and calculation time. It is well known that the expression ability of GP improves as the number of nodes increases, but larger trees use greater amounts of memory and take more time to explore and execute. GNP, on the other hand, shows good results using a comparatively small number of nodes. It is also clarified that GNP selects important kinds of nodes and uses them repeatedly, while the unnecessary nodes gradually become unused as the generations go on. GNP-RL shows much better fitness values than SGNP because GNP-RL can obtain and utilize more information than SGNP during task execution.


    References

Angeline, P. J. (1994). An alternate interpretation of the iterated prisoner's dilemma and the evolution of non-mutual cooperation. In Brooks, R. and Maes, P., editors, Proceedings of the 4th Artificial Life Conference, pages 353-358. MIT Press.

Angeline, P. J. and Pollack, J. B. (1993). Evolutionary module acquisition. In Fogel, D. and Atmar, W., editors, Proceedings of the Second Annual Conference on Evolutionary Programming, pages 154-163, La Jolla, CA.

Crites, R. H. and Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262.

Downing, K. L. (2001). Adaptive genetic programming via reinforcement learning. In Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E., editors, Proceedings of the 3rd Genetic and Evolutionary Computation Conference, pages 19-26. Morgan Kaufmann.

Fogel, D. B. (1994). An introduction to simulated evolutionary optimization. IEEE Transactions on Neural Networks, 5(1):3-14.

    Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution.John Wiley & Sons.

Hirasawa, K., Okubo, M., Katagiri, H., Hu, J., and Murata, J. (2001). Comparison between genetic network programming (GNP) and genetic programming (GP). In Proceedings of Congress on Evolutionary Computation, pages 1276-1282. IEEE Press.

    Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press,Ann Arbor.

Iba, H. (1998). Multi-agent reinforcement learning with genetic programming. In Proceedings of the Third Annual Conference of Genetic Programming, pages 167-172.

Kamio, S. and Iba, H. (2005). Adaptation technique for integrating genetic programming and reinforcement learning for real robots. IEEE Transactions on Evolutionary Computation, 9(3):318-333.

Katagiri, H., Hirasawa, K., and Hu, J. (2000). Genetic network programming - application to intelligent agents. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pages 3829-3834. IEEE Press.

Katagiri, H., Hirasawa, K., Hu, J., and Murata, J. (2001). Network structure oriented evolutionary model - genetic network programming - and its comparison with genetic programming. In Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E., editors, 2001 Genetic and Evolutionary Computation Conference Late Breaking Papers, pages 219-226. Morgan Kaufmann.

    Koza, J. R. (1992). Genetic Programming, on the Programming of Computers by Means of NaturalSelection. MIT Press, Cambridge, MA.

    Koza, J. R. (1994). Genetic Programming II, Automatic Discovery of Reusable Programs. MIT Press,Cambridge, MA.

Luke, S. (1998). Genetic programming produced competitive soccer softbot teams for RoboCup97. In Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., and Riolo, R., editors, Proceedings of the Third Annual Genetic Programming Conference. Morgan Kaufmann.

Luke, S. and Spector, L. (1996). Evolving graphs and networks with edge encoding: Preliminary report. In Koza, J. R., editor, Late Breaking Papers at the Genetic Programming 1996 Conference, Stanford University, July 28-31, 1996, pages 117-124, Stanford University, CA.


Mabu, S., Hirasawa, K., and Hu, J. (2004). Genetic network programming with reinforcement learning and its performance evaluation. In Proceedings, Part II, of the 2004 Genetic and Evolutionary Computation Conference, pages 710-711, Seattle, WA.

Mabu, S., Hirasawa, K., Hu, J., and Murata, J. (2002). Online learning of genetic network programming. In Proceedings of Congress on Evolutionary Computation, pages 321-326. IEEE Press.

Mabu, S., Hirasawa, K., Hu, J., and Murata, J. (2003). Online learning of genetic network programming and its application to prisoner's dilemma game. Transactions of IEE Japan, 123-C(3):535-543.

Poli, R. and Langdon, W. B. (1997). Genetic programming with one-point crossover. In Chawdhry, P. K., Roy, R., and Pant, R. K., editors, Second On-line World Conference on Soft Computing in Engineering Design and Manufacturing, pages 180-189. Springer-Verlag, London.

Poli, R. and Langdon, W. B. (1998). Schema theory for genetic programming with one-point crossover and point mutation. Evolutionary Computation, 6(3):231-252.

Pollack, M. E. and Ringuette, M. (1990). Introducing the tile-world: Experimentally evaluating agent architectures. In Dietterich, T. and Swartout, W., editors, Proceedings of the Conference of the American Association for Artificial Intelligence, pages 183-189. AAAI Press.

Potvin, J.-Y., Soriano, P., and Vallee, M. (2004). Generating trading rules on the stock markets with genetic programming. Computers & Operations Research, 31:1033-1047.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Teller, A. (1996). Evolving programmers: The co-evolution of intelligent recombination operators. In Angeline, P. J. and Kinnear, Jr., K. E., editors, Advances in Genetic Programming, chapter 3, pages 45-68. MIT Press, Cambridge, MA.

    Teller, A. and Veloso, M. (1995). PADO, learning tree-structured algorithm for orchestration intoan object recognition system. Technical report, Carnegie Mellon University.

    Teller, A. and Veloso, M. (1996). PADO: A new learning architecture for object recognition. InIkeuchi, K. and Veloso, M., editors, Symbolic Visual Learning. Oxford University Press, NewYork.

    Weiss, G., editor (1999). Multiagent Systems, A Modern Approach to Distributed Artificial Intelligence.MIT Press, Cambridge, MA.
