Study on Genetic Network Programming (GNP) with Learning and Evolution
Hirasawa laboratory, Artificial Intelligence section, Information architecture field
Graduate School of Information, Production and Systems, Waseda University
Study on Genetic Network Programming (GNP) with Learning and Evolution
• Solutions (programs) are represented by genes.
• The programs are evolved (changed) by selection, crossover and mutation.
Structure of GNP
Graph structure and its gene structure (each row encodes one node):

0 0 3 4
0 1 1 6
0 2 5 7
1 0 8 0
1 0 0 4
1 5 1 2
… … … …
• GNP represents its programs using directed graph structures.
• The graph structures can be represented as gene structures.
• The graph structure is composed of processing nodes and judgment nodes.
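As an illustration, the gene rows above can be held in a simple data structure. The column meanings used here (node type, function id, and two connection targets) are assumptions for illustration; the slide does not label the columns.

```python
# Hypothetical sketch of a GNP gene structure: each node is one row of
# integers. The column interpretation (node_type, function_id, conn_a,
# conn_b) is an assumption, not the slide's exact encoding.

GENE = [
    # (node_type, function_id, conn_a, conn_b)
    (0, 0, 3, 4),   # node 0
    (0, 1, 1, 6),   # node 1
    (0, 2, 5, 7),   # node 2
    (1, 0, 8, 0),   # node 3
    (1, 0, 0, 4),   # node 4
    (1, 5, 1, 2),   # node 5
]

def connections(gene, node_id):
    """Return the nodes reachable from node_id in one step."""
    _, _, a, b = gene[node_id]
    return (a, b)
```

With this encoding, the directed graph and the integer table are two views of the same program, which is what makes genetic operators on the table meaningful.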
Khepera robot
• The Khepera robot is used for the performance evaluation of GNP.
[Figure: Khepera robot with infrared sensors and two wheels]
• Sensor value: close to 0 when far from obstacles, close to 1023 when close to obstacles
• Speed of the right wheel VR: -10 (back) ~ 10 (forward)
• Speed of the left wheel VL: -10 (back) ~ 10 (forward)
Node functions
• Processing node: each node determines an agent action.
• Judgment node: each node selects a branch based on the judgment result.

Ex) Khepera robot behavior
• Processing node: "Set the speed of the right wheel at 10"
• Judgment node: "Judge the value of sensor 1", branching on "500 or more" / "less than 500"
An example of node transition
• Judge sensor 1: branch on "the value is 700 or more" / "the value is less than 700"
• Judge sensor 5: branch on "80 or more" / "less than 80"
• Set the speed of the right wheel at 5
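A transition like the one above can be sketched as a small interpreter that walks the graph: judgment nodes read a sensor and choose a branch, processing nodes emit an action and continue. The wiring, thresholds, and node names below are illustrative, loosely based on the slide's example.

```python
# Minimal node-transition sketch. Each judgment node compares a sensor
# value against a threshold and picks the next node; each processing
# node performs an action and moves on. The graph here is illustrative.

def run(nodes, start, sensors, steps):
    actions = []
    node = start
    for _ in range(steps):
        spec = nodes[node]
        if spec["kind"] == "judge":
            value = sensors[spec["sensor"]]
            node = spec["ge"] if value >= spec["threshold"] else spec["lt"]
        else:  # processing node
            actions.append(spec["action"])
            node = spec["next"]
    return actions

NODES = {
    0: {"kind": "judge", "sensor": 1, "threshold": 700, "ge": 1, "lt": 2},
    1: {"kind": "judge", "sensor": 5, "threshold": 80, "ge": 2, "lt": 0},
    2: {"kind": "proc", "action": "right_wheel=5", "next": 0},
}

# With sensor 1 reading 400 (< 700), the transition alternates between
# the first judgment node and the processing node.
trace = run(NODES, 0, {1: 400, 5: 60}, 4)
```

Note that execution never terminates on its own: the program is a graph, not a tree, so the agent keeps transitioning for as long as the task runs.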
Flowchart of GNP

start
→ Generate an initial population (initial programs)
→ Task execution (Reinforcement Learning)
→ Evolution (Selection / Crossover / Mutation)
→ repeat (task execution + evolution = one generation) until the last generation
→ stop
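The flowchart can be summarized as a main loop. This is a schematic skeleton only: `execute_task`, `learn`, `select`, `crossover`, and `mutate` are placeholders for the steps named above, passed in as functions.

```python
import random

def evolve_gnp(population, generations,
               execute_task, learn, select, crossover, mutate):
    """Schematic GNP main loop: task execution with reinforcement
    learning inside each generation, followed by evolution."""
    for _ in range(generations):
        # Task execution: each individual runs the task while
        # reinforcement learning tunes its node parameters.
        fitnesses = [execute_task(learn(ind)) for ind in population]
        # Evolution: selection, then crossover and mutation to
        # produce the next generation of the same size.
        parents = select(population, fitnesses)
        offspring = []
        while len(offspring) < len(population):
            a, b = random.sample(parents, 2)
            offspring.append(mutate(crossover(a, b)))
        population = offspring
    return population
```

The key structural point, visible in the loop body, is that learning happens during task execution (within one individual's lifetime) while evolution happens between generations.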
Evolution of GNP: selection
• Select good individuals (programs) from the GNP population based on their fitness.
• Fitness indicates how well each individual achieves a given task.
• The selected individuals are used for crossover and mutation.
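Fitness-based selection can be sketched, for example, as tournament selection. This is one common scheme chosen for illustration; the slides do not specify which selection method is used.

```python
import random

def tournament_select(population, fitnesses, n_selected, tournament_size=2):
    """Pick n_selected individuals; each pick keeps the fittest member
    of a small random tournament. Higher fitness is better."""
    selected = []
    indices = range(len(population))
    for _ in range(n_selected):
        contenders = random.sample(indices, tournament_size)
        winner = max(contenders, key=lambda i: fitnesses[i])
        selected.append(population[winner])
    return selected
```

Selection is done with replacement here, so a strong individual can be chosen several times, which is typical for evolutionary algorithms.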
Evolution of GNP: crossover
• Some nodes and their connections are exchanged between Individual 1 and Individual 2.

Evolution of GNP: mutation
• Change connections
• Change node function (Ex: "Speed of right wheel: 5" → "Speed of left wheel: 10")
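Both operators act directly on the row-per-node gene encoding. The sketch below uses uniform node exchange for crossover and random reconnection for mutation; the rates and details are illustrative assumptions, not the slides' exact settings.

```python
import random

def crossover(parent1, parent2, rate=0.1):
    """Exchange whole nodes (gene rows) between two individuals at the
    same positions, each with probability `rate`."""
    child1, child2 = list(parent1), list(parent2)
    for i in range(len(child1)):
        if random.random() < rate:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def mutate_connections(gene, n_nodes, rate=0.05):
    """Rewire each connection to a random node with a small probability.
    Assumes rows of (node_type, function_id, conn_a, conn_b)."""
    mutated = []
    for (kind, func, a, b) in gene:
        if random.random() < rate:
            a = random.randrange(n_nodes)
        if random.random() < rate:
            b = random.randrange(n_nodes)
        mutated.append((kind, func, a, b))
    return mutated
```

Because nodes are exchanged at the same positions, crossover preserves the program size of both children, while mutation only redirects edges within the existing node set.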
The role of Learning
Example)
• Processing node: "Set the speed of the right wheel at 10" → Collision! → 10 is changed to 5 so as not to collide with the obstacle.
• Judgment node: "Judge sensor 0", branching on "1000 or more" / "less than 1000" → 1000 is changed to 500 in order to judge obstacles more sensitively.

Node parameters are changed by reinforcement learning.
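One simple way to picture the parameter update is a value-based choice among candidate parameter values, where rewards shift the choice toward values that work better. This is a hedged sketch of the idea only; the slides do not detail the actual learning rule, and the `NodeParameter` class and its candidate values are hypothetical.

```python
import random

class NodeParameter:
    """A node keeps candidate parameter values (e.g. wheel speeds 10 or 5,
    thresholds 1000 or 500), each with an estimated value; reinforcement
    learning shifts the choice toward candidates that earn more reward."""

    def __init__(self, candidates, alpha=0.1, epsilon=0.1):
        self.q = {c: 0.0 for c in candidates}   # value estimate per candidate
        self.alpha = alpha                      # learning rate
        self.epsilon = epsilon                  # exploration probability

    def choose(self):
        # Epsilon-greedy: usually exploit the best-valued candidate.
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, chosen, reward):
        # Incremental update of the chosen candidate toward the reward.
        self.q[chosen] += self.alpha * (reward - self.q[chosen])

# Example mirroring the slide: the right-wheel-speed processing node.
speed = NodeParameter(candidates=[10, 5])
speed.update(10, reward=-1.0)   # collision while using speed 10
speed.update(5, reward=1.0)     # safe movement with speed 5
```

After a few such updates the node prefers speed 5, which is the behavior change described in the example above.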
The aim of combining evolution and learning
• Create efficient programs
• Search for solutions faster

Evolution uses many individuals, and better ones are selected after task execution.
Learning uses one individual, and better action rules can be determined during task execution.
VI Simulation
• Wall-following behavior
1. All the sensor values must not be more than 1000.
2. At least one sensor value is more than 100.
3. Move straight.
4. Move fast.

[Figure: Simulation environment]
Reward = \sum_{t=1}^{1000} C(t) \cdot \frac{v_R(t)+v_L(t)}{20} \cdot \left(1-\frac{|v_R(t)-v_L(t)|}{20}\right)

fitness = Reward / 1000

C(t) = 1 if conditions 1 and 2 are satisfied at time t; C(t) = 0 otherwise.
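The reward and fitness definitions above translate directly into code. This is a minimal sketch; the thresholds follow conditions 1 and 2, and the `history` format is an assumption for illustration.

```python
def step_reward(v_r, v_l, sensors):
    """Per-step reward term: C(t) * (vR+vL)/20 * (1 - |vR-vL|/20).
    C(t) = 1 when no sensor value exceeds 1000 (condition 1) and at
    least one sensor value exceeds 100 (condition 2)."""
    c = 1 if max(sensors) <= 1000 and max(sensors) > 100 else 0
    return c * (v_r + v_l) / 20 * (1 - abs(v_r - v_l) / 20)

def fitness(history):
    """history: list of (vR, vL, sensors) tuples over the 1000 steps."""
    reward = sum(step_reward(v_r, v_l, s) for v_r, v_l, s in history)
    return reward / 1000
```

The formula rewards exactly the listed goals: the (vR+vL)/20 factor favors moving fast, the (1 - |vR-vL|/20) factor favors moving straight, and C(t) requires staying near a wall without touching it.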
Node functions
• Processing node (2 kinds): determine the speed of the right wheel; determine the speed of the left wheel
• Judgment node (8 kinds): judge the value of sensor 0, ..., judge the value of sensor 7
Simulation result
• Conditions:
  – The number of individuals: 600
  – The number of nodes: 34 (judgment nodes: 24, processing nodes: 10)

[Figure: fitness (0–0.8) vs. generation (0–1000); fitness curves of the best individuals averaged over 30 independent simulations, for "GNP with learning and evolution" and "Standard GNP (GNP with evolution)"]
[Figure: Track of the robot from the start position]
Simulations in the inexperienced environments
Simulation on the generalization ability
• The best program obtained in the previous environment is executed in an inexperienced environment.
• The robot can still show the wall-following behavior.
VII Conclusion
• An algorithm of GNP combining evolution and reinforcement learning is proposed.
  – The simulation results show that the proposed method can learn wall-following behavior well.
• Future work
  – Apply GNP with evolution and reinforcement learning to real-world applications:
    • Elevator control system
    • Stock trading model
  – Compare with other evolutionary algorithms
VIII Other simulations
Example of tileworld
• The tileworld consists of walls, floors, tiles, holes and agents.
• The agent can push a tile and drop it into a hole.
• The aim of the agent is to drop as many tiles into holes as possible.
• Fitness = the number of dropped tiles
• Reward rt = 1 (when dropping a tile into a hole)
Node functions
• Processing node: go forward, turn right, turn left, stay
• Judgment node:
  – What is in the forward cell? (floor, tile, hole, wall or agent)
  – What is in the backward cell / left cell / right cell?
  – The direction of the nearest tile (forward, backward, left, right or nothing)
  – The direction of the nearest hole
  – The direction of the nearest hole from the nearest tile
  – The direction of the second nearest tile
Example of node transition