Runtime Analysis of (1 + 1) Evolutionary Algorithm Controlled with Q-learning using Greedy Exploration Strategy on OneMax+ZeroMax Problem

Denis Antipov¹, Maxim Buzdalov¹, and Benjamin Doerr²

¹ ITMO University, 49 Kronverkskiy av., Saint-Petersburg, Russia, 197101
[email protected], [email protected]
² LIX, École polytechnique, 91128 Palaiseau Cedex
[email protected]
Abstract. There exist optimization problems with the target objective, which is to be optimized, and several extra objectives. The extra objectives may or may not be helpful in the optimization process in terms of the number of objective evaluations necessary to reach an optimum of the target objective. OneMax+ZeroMax is a previously proposed benchmark optimization problem where the target objective is OneMax and a single extra objective is ZeroMax, which is equal to the number of zero bits in the bit vector. This is an example of a problem where extra objectives are not good, and objective selection methods should ignore the extra objectives. The EA+RL method is a method which selects objectives to be optimized by evolutionary algorithms (EA) using reinforcement learning (RL). Previously it was shown that it runs in Θ(N log N) on OneMax+ZeroMax when configured to use the randomized local search algorithm and the Q-learning algorithm with the greedy exploration strategy. We present the runtime analysis for the case when the (1 + 1)-EA algorithm is used. It is shown that the expected running time is at most 3.12eN log N.
1 Introduction

Single-objective optimization can often benefit from multiple objectives [6,7,9]. Different approaches are known from the literature. Some researchers introduce additional objectives to escape from plateaus [1]. Decomposition of the primary objective into several objectives also helps in many problems [5,7]. Additional objectives may also arise from the problem structure [8].
1.1 Approaches for Extra Objectives

Different approaches may be applied to a problem with the “original” objective, which can be called the target objective, and some extra objectives. The multi-objectivization approach is to optimize all extra objectives at once using a multi-objective optimization algorithm [5,7]. The helper-objective approach is to optimize simultaneously the target objective and some (not necessarily all; in some cases, only one is preferable) extra objectives, switching between them from time to time [6].
The approaches above are designed under the assumption that the extra objectives are crafted to help optimizing the target objective. However, this is not always true, especially when their properties are unknown. In fact, the extra objectives may support or obstruct the process of optimizing the target objective. The EA+RL method [3] was developed to cope with such situations. The idea of this method is to use a single-objective optimization algorithm and switch between the objectives (which include the target one and the extra ones). To find the most suitable objective for the optimization, reinforcement learning algorithms are used [10].
1.2 Reinforcement Learning and EA+RL

Reinforcement learning [10] uses the concepts of state, action and reward. A reinforcement learning algorithm is often thought to control an agent which interacts with a certain environment. The agent receives the current state from the environment as input. It should return an action to apply to the environment. For that action, it receives a reward. The aim of the reinforcement learning algorithm is to maximize the total reward by choosing appropriate actions in different states. The total reward can be treated as a sum of all rewards received by the algorithm, or as a discounted reward, when a reward for the i-th step from the end is taken with a weight of γ^i, where 0 < γ < 1.
In the EA+RL method, actions are objectives to choose, while states and rewards are defined depending on the problem. A good choice of the reward can be the value of the target objective after the selection of an objective minus the value of the target objective before it. The sum of all rewards during the optimization is equal to the difference between the final value of the target objective and its initial value, so optimization of the reward leads to optimization of the target objective.
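With this choice of the reward, the total undiscounted reward telescopes. A one-line LaTeX sketch of this observation, where f denotes the target objective and x_t the current individual after iteration t (notation ours, for illustration only):

\[
\sum_{t=0}^{T-1} r_t = \sum_{t=0}^{T-1} \bigl(f(x_{t+1}) - f(x_t)\bigr) = f(x_T) - f(x_0).
\]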
2 Analyzed Problem and Algorithm

OneMax is a well-known optimization problem widely used in theoretical research on evolutionary computation. It can be defined as “maximize the number of one-bits in a bit vector of length N”. It is known that simple evolutionary algorithms, such as randomized local search (RLS) or the (1 + 1) evolutionary algorithm ((1 + 1)-EA), solve this problem in Θ(N log N) function evaluations [11].

We define ZeroMax, a counterpart of OneMax, as follows: the number of zero-bits in a bit vector needs to be maximized. Clearly, the maximum point of ZeroMax is the same as the minimum point of OneMax, and vice versa. Moreover, any change that increases the OneMax fitness decreases the ZeroMax fitness at the same time.
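Both objectives are trivial to implement. A minimal Python sketch (function names are ours, not from the paper):

def one_max(x):
    # Target objective: the number of one-bits in the bit vector.
    return sum(x)

def zero_max(x):
    # Extra objective: the number of zero-bits. Maximizing it pulls
    # the search away from the OneMax optimum.
    return len(x) - sum(x)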
In paper [2], OneMax+ZeroMax was defined as an optimization problem with extra objectives where OneMax is the target objective and ZeroMax is the extra objective. Clearly, for this problem every objective selection algorithm should eventually manage to ignore the offensive extra objective. It was shown that the EA+RL method [3], the objective selection method based on reinforcement learning, indeed learns to ignore the ZeroMax objective when randomized local search (RLS) is used as an optimizer and finds the OneMax optimum in Θ(N log N), more precisely, at most twice slower than RLS itself does when optimizing OneMax. The proof in [2] was done by analysing the Markov chain which modeled the optimization process. Since this Markov chain possessed a simple linear structure, the proof was relatively easy. However, when the optimizer is changed to the (1 + 1) evolutionary algorithm, which is able to flip more than one bit at a time, using Markov chains becomes prohibitively complicated.
In this paper, we consider the (1 + 1) evolutionary algorithm with the fixed probability of flipping a bit p = 1/N as a single-objective optimization algorithm. To select which objective to optimize at each iteration, we use the EA+RL method [3], which internally uses a reinforcement learning (RL) algorithm [10] to do this. The actions of the RL algorithm are the possible objectives, so the set of possible actions for the considered problem is {OneMax, ZeroMax}. The choices for the EA+RL method are fixed in this paper to the following values:

– RL algorithm: the Q-learning algorithm with greedy exploration strategy [10];
  • the learning rate: an arbitrary α ∈ (0; 1);
  • the discount factor: an arbitrary γ ∈ [0; 1];
– RL state: the value of the OneMax fitness;
– RL reward for the action: the difference of the OneMax fitness values after the action and before the action.

The pseudocode for the (1 + 1)-EA controlled by the EA+RL method with the parameters listed above is shown in Fig. 1.
3 Learning Lemma

The Q-learning algorithm stores estimations of action rewards as a Q(s, a) matrix, where Q(s, a) is the expected reward for applying an action a in a RL state s. When randomized local search is used, it was shown in [2] that EA+RL learns to ignore the offensive ZeroMax objective in the following way: once it leaves each RL state for the first time, it maintains Q(s, 1) > Q(s, 0), which makes it select OneMax each time it enters the same state the next time. The same idea is true in the current situation, but in a more complicated manner.

Lemma 1 (Learning lemma). If the algorithm has not reached the terminal state (where s = N) and γ < 1/(N − 1), then for every non-terminal state s it is true that:

\[
Q(s, 0) \le 0 \le Q(s, 1) < N - 1 - s.
\]

Additionally, if the algorithm has never left a state s, it is true that Q(s, 0) = Q(s, 1) = 0. Otherwise, Q(s, 0) < Q(s, 1).
1: X ← current individual, vector of N zeros
2: Q ← transition quality matrix, N × 2, filled with zeros
3: f1 ← OneMax, the target objective
4: f0 ← ZeroMax, the extra objective
5: Mutate(X) ← mutation operator: inverts each bit with p = 1/N
6: while f1(X) < N do
7:     s ← f1(X)
8:     Y ← Mutate(X)
9:     f, i: chosen fitness function and its index
10:    if Q(s, 0) > Q(s, 1) then
11:        i ← 0
12:    else if Q(s, 0) < Q(s, 1) then
13:        i ← 1
14:    else
15:        i ← Random(0, 1)
16:    end if
17:    f ← fi
18:    if f(Y) ≥ f(X) then
19:        X ← Y
20:    end if
21:    s′ ← f1(X)
22:    r ← s′ − s
23:    Q(s, i) ← (1 − α)Q(s, i) + α(r + γ · maxj Q(s′, j))
24: end while

Fig. 1. (1 + 1)-EA controlled by EA+RL using the greedy Q-learning algorithm
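For readers who prefer runnable code, the following Python sketch mirrors Fig. 1 (a minimal sketch under our reading of the pseudocode; all names are ours):

import random

def ea_rl_onemax_zeromax(N, alpha=0.5, gamma=0.0):
    # (1+1)-EA on OneMax+ZeroMax controlled by EA+RL with greedy
    # Q-learning; returns the number of iterations until the optimum.
    one_max = lambda x: sum(x)
    zero_max = lambda x: N - sum(x)
    f = [zero_max, one_max]              # f[0] = ZeroMax, f[1] = OneMax
    X = [0] * N                          # current individual, all zeros
    Q = [[0.0, 0.0] for _ in range(N)]   # Q-values for states 0..N-1
    iterations = 0
    while one_max(X) < N:
        s = one_max(X)
        # Mutation: flip each bit independently with probability 1/N.
        Y = [b ^ (random.random() < 1.0 / N) for b in X]
        # Greedy objective selection; ties are broken uniformly.
        if Q[s][0] > Q[s][1]:
            i = 0
        elif Q[s][0] < Q[s][1]:
            i = 1
        else:
            i = random.randint(0, 1)
        if f[i](Y) >= f[i](X):
            X = Y
        s2 = one_max(X)
        # The terminal state N has no Q-row; its value is taken as zero.
        q_next = 0.0 if s2 == N else max(Q[s2])
        Q[s][i] = (1 - alpha) * Q[s][i] + alpha * (s2 - s + gamma * q_next)
        iterations += 1
    return iterations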
Proof. We use mathematical induction. The base is obvious: in the very beginning, all Q(i, j) are zeros and the algorithm has never left any state, so the lemma statement is true. Assume that the lemma statement was true for all previous algorithm iterations. The current iteration can have the following forms:

1. The algorithm has never left the current state and remains there.
2. The algorithm has left the current state for the first time.
3. The algorithm has left the current state before.

Below we denote the state before the current iteration as s, the state after the current iteration as s′, the Q-values before the current iteration as Q(i, j) and the Q-values after the current iteration as Q′(i, j).
Case 1. By induction hypothesis, Q(s, 0) = Q(s, 1) = 0. If the algorithm remains at the state s, the reward is zero, so all the components of the expression at line 23 are zeros. Whatever objective i was selected, Q′(s, i) is set to zero, and Q′(s, 1 − i) remains zero as well, so the induction hypothesis is proven for the current iteration.
Case 2. By induction hypothesis, Q(s, 0) = Q(s, 1) = 0. This means that the objective is OneMax with the probability of 0.5 and ZeroMax with the probability of 0.5. As s′ ≠ s, the change was accepted by the selected objective, so for OneMax (i = 1) s′ ≥ s + 1, while for ZeroMax (i = 0) s′ ≤ s − 1.

By induction hypothesis, Q(s′, 0) ≤ 0 ≤ Q(s′, 1) < N − s′ − 1. This means that for the chosen objective i the following will be true:

\[
Q'(s, i) = (1 - \alpha)Q(s, i) + \alpha\Bigl(s' - s + \gamma \max_j Q(s', j)\Bigr) = \alpha\bigl(s' - s + \gamma Q(s', 1)\bigr).
\]

The upper bound on s′ − s + γQ(s′, 1), provided that s′ < N, is:

\[
s' - s + \gamma Q(s', 1) < s' - s + \frac{N - s' - 1}{N - 1}
= s'\left(1 - \frac{1}{N-1}\right) + 1 - s
= s'\,\frac{N-2}{N-1} + 1 - s
\le \frac{(N-1)(N-2)}{N-1} + 1 - s = N - s - 1.
\]

It follows that Q′(s, i) < N − s − 1 as well. The lower bound is Q′(s, i) ≥ α(s′ − s). For i = 1 these two bounds immediately yield that 0 < Q′(s, 1) < N − s − 1. For i = 0 we should additionally use the fact that s′ ≤ s − 1, which brings:

\[
Q'(s, 0) < s'\,\frac{N-2}{N-1} + 1 - s \le s' + 1 - s \le 0.
\]

To sum up, after an iteration which leaves a state s for the first time it will be that either 0 < Q′(s, 1) < N − s − 1 and Q′(s, 0) = 0, or Q′(s, 0) < 0 and Q′(s, 1) = 0. This proves the induction hypothesis for the current iteration in the considered case.
Case 3. By induction hypothesis, Q(s, 0) < Q(s, 1), so the OneMax objective is selected. As a result, s′ ≥ s. Using the upper bound on s′ − s + γQ(s′, 1) proven in the previous case (which still holds under the assumptions of the current case), the fact that it is non-negative and the induction assumption that Q(s′, 0) ≤ Q(s′, 1), we get the bounds on Q′(s, 1):

\[
Q'(s, 1) = (1 - \alpha)Q(s, 1) + \alpha\bigl(s' - s + \gamma Q(s', 1)\bigr)
< (1 - \alpha)(N - s - 1) + \alpha(N - s - 1) = N - s - 1,
\]
\[
Q'(s, 1) \ge (1 - \alpha)Q(s, 1).
\]

In any case, Q′(s, 1) < N − s − 1. If Q(s, 1) > 0, then Q′(s, 1) > 0. If Q(s, 1) = 0, then Q′(s, 1) ≥ 0. This proves the induction hypothesis for the current iteration in the considered case.
In all three possible cases the induction hypothesis is proven, which completes the proof. □

This lemma lets us describe each RL state, at any moment of time, either as learned or unlearned. In the learned state the algorithm always selects the correct objective, OneMax. In the unlearned state, it selects either OneMax or ZeroMax with equal probabilities. Each unlearned state becomes learned when the algorithm leaves this state and enters another one. All these considerations are true when γ < 1/(N − 1), so we consider it to be so in the rest of the paper until explicitly noted otherwise.
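As a sanity check, the invariants of Lemma 1 can be tested empirically by instrumenting the loop of Fig. 1. A minimal Python sketch (entirely ours); the strict upper bound is asserted only for s < N − 1, since Q(N − 1, 1) stays zero until the run ends, and the lemma only speaks about runs that have not yet reached the terminal state:

import random

def check_learning_lemma(N=50, alpha=0.3, seed=1):
    gamma = 0.5 / (N - 1)               # the lemma requires gamma < 1/(N-1)
    rng = random.Random(seed)
    X = [0] * N
    Q = [[0.0, 0.0] for _ in range(N)]
    while sum(X) < N:
        s = sum(X)
        Y = [b ^ (rng.random() < 1.0 / N) for b in X]
        if Q[s][0] > Q[s][1]:
            i = 0
        elif Q[s][0] < Q[s][1]:
            i = 1
        else:
            i = rng.randint(0, 1)
        f = (lambda x: sum(x)) if i == 1 else (lambda x: N - sum(x))
        if f(Y) >= f(X):
            X = Y
        s2 = sum(X)
        q_next = 0.0 if s2 == N else max(Q[s2])
        Q[s][i] = (1 - alpha) * Q[s][i] + alpha * (s2 - s + gamma * q_next)
        if s2 < N:                      # terminal state not reached yet
            for t in range(N):
                assert Q[t][0] <= 0.0 <= Q[t][1]
                if t < N - 1:           # Q(N-1, 1) stays 0 during the run
                    assert Q[t][1] < N - 1 - t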
4 Transition Probabilities

What is the exact probability that the independent bit-flip mutation (with a probability of flipping each bit equal to 1/N) constructs a bit string with j one-bits from a bit string with i one-bits? Consider the situation where i < j: this means that j − i + k zeros and k ones are flipped, for some k ≥ 0. The exact expressions for these probabilities are given below.

\[
P^{i,j} =
\begin{cases}
\displaystyle\sum_{k=0}^{\min(N-j,\,i)} \binom{N-i}{j-i+k}\binom{i}{k}\left(\frac{1}{N}\right)^{j-i+2k}\left(1-\frac{1}{N}\right)^{N-(j-i+2k)} & \text{if } i < j,\\[2ex]
\displaystyle\sum_{k=0}^{\min(N-i,\,j)} \binom{i}{i-j+k}\binom{N-i}{k}\left(\frac{1}{N}\right)^{i-j+2k}\left(1-\frac{1}{N}\right)^{N-(i-j+2k)} & \text{if } i > j,\\[2ex]
\displaystyle\sum_{k=0}^{\min(N-i,\,i)} \binom{N-i}{k}\binom{i}{k}\left(\frac{1}{N}\right)^{2k}\left(1-\frac{1}{N}\right)^{N-2k} & \text{if } i = j.
\end{cases}
\tag{1}
\]

In an unlearned state, OneMax or ZeroMax is chosen with the probability of 0.5. Together with the probabilities given above, the transition probability P^{i,j}_U from an unlearned state i to a state j is (1/2)P^{i,j} if i ≠ j, and 1 − (1/2)∑_{k≠i} P^{i,k} = (1/2)(1 + P^{i,i}) if i = j.

In a learned state, OneMax is always chosen, so the transition probability P^{i,j}_L from a learned state i to a state j is P^{i,j} if i < j, 1 − ∑_{k=i+1}^{N} P^{i,k} if i = j, and zero otherwise.
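Equation (1) is easy to evaluate numerically. A minimal Python sketch (function name is ours), which also lets one check that every row of the transition matrix sums to one:

from math import comb

def transition_prob(N, i, j):
    # P^{i,j} from equation (1): the probability that standard bit
    # mutation (each bit flipped with probability 1/N) turns a string
    # with i one-bits into a string with j one-bits.
    p, q = 1.0 / N, 1.0 - 1.0 / N
    total = 0.0
    if i < j:
        for k in range(min(N - j, i) + 1):
            flips = j - i + 2 * k       # j-i+k zeros and k ones flipped
            total += comb(N - i, j - i + k) * comb(i, k) \
                     * p**flips * q**(N - flips)
    elif i > j:
        for k in range(min(N - i, j) + 1):
            flips = i - j + 2 * k       # i-j+k ones and k zeros flipped
            total += comb(i, i - j + k) * comb(N - i, k) \
                     * p**flips * q**(N - flips)
    else:
        for k in range(min(N - i, i) + 1):
            total += comb(N - i, k) * comb(i, k) * p**(2 * k) * q**(N - 2 * k)
    return total

N = 20
for i in range(N + 1):
    row = sum(transition_prob(N, i, j) for j in range(N + 1))
    assert abs(row - 1.0) < 1e-9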
4.1 Lower and Upper Bound on P^{i,j}

The expressions for P^{i,j} are rather complex. The following theorem gives a lower and an upper bound on P^{i,j}.

Theorem 1. Assume that i ≠ j. Let S^{i,j} be the following:

\[
S^{i,j} =
\begin{cases}
\displaystyle\binom{N-i}{j-i}\left(\frac{1}{N}\right)^{j-i}\left(1-\frac{1}{N}\right)^{N-(j-i)} & \text{if } i < j,\\[2ex]
\displaystyle\binom{i}{i-j}\left(\frac{1}{N}\right)^{i-j}\left(1-\frac{1}{N}\right)^{N-(i-j)} & \text{if } i > j.
\end{cases}
\tag{2}
\]

Then S^{i,j} ≤ P^{i,j} ≤ (8/7)S^{i,j}.

Proof. The lower bounds are proven easily, since S^{i,j} are the addends for k = 0 in (1), and all these addends are positive.

We denote as S^{i,j}_k the k-th addend of the sum in (1) corresponding to P^{i,j}. Specifically, S^{i,j} = S^{i,j}_0. Consider the case of i < j. The ratio of the k-th addend to the (k + 1)-th addend is:

\[
\frac{S^{i,j}_k}{S^{i,j}_{k+1}}
= \frac{\binom{N-i}{j-i+k}\binom{i}{k}\left(\frac{1}{N}\right)^{j-i+2k}\left(1-\frac{1}{N}\right)^{N-(j-i+2k)}}
       {\binom{N-i}{j-i+k+1}\binom{i}{k+1}\left(\frac{1}{N}\right)^{j-i+2k+2}\left(1-\frac{1}{N}\right)^{N-(j-i+2k)-2}}
= \frac{(j-i+k+1)(k+1)}{(N-j-k)(i-k)}\,N^2\left(1-\frac{1}{N}\right)^2
= \frac{(j-i+k+1)(k+1)}{(N-j-k)(i-k)}\,(N-1)^2.
\]
When i and j are fixed, this ratio grows as k grows, so

\[
\frac{S^{i,j}_k}{S^{i,j}_{k+1}} \ge \frac{S^{i,j}_0}{S^{i,j}_1} = \frac{(j-i+1)(N-1)^2}{(N-j)\,i}.
\]

When i is fixed, this ratio grows as j grows, so we replace j with its minimum possible value i + 1 and then minimize the result with i = (N − 1)/2:

\[
\frac{S^{i,j}_k}{S^{i,j}_{k+1}} \ge \frac{2(N-1)^2}{(N-i-1)\,i} \ge \frac{2(N-1)^2}{\frac{N-1}{2}\cdot\frac{N-1}{2}} = \frac{8(N-1)^2}{(N-1)^2} = 8.
\]

This means that P^{i,j} can be bounded by the sum of a geometric progression:

\[
P^{i,j} = \sum_{k=0}^{\min(N-j,\,i)} S^{i,j}_k
\le \sum_{k=0}^{\min(N-j,\,i)} \left(\frac{1}{8}\right)^k S^{i,j}_0
\le \sum_{k=0}^{\infty} \left(\frac{1}{8}\right)^k S^{i,j}_0
= \frac{8}{7}\,S^{i,j}_0 = \frac{8}{7}\,S^{i,j}.
\]

The case of i > j is proven in a similar way. □
In the rest of the paper, we denote the 8/7 constant from this theorem by R.
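A quick numeric check of Theorem 1, reusing the transition_prob sketch from above:

from math import comb

def s_bound(N, i, j):
    # S^{i,j} from equation (2): the k = 0 addend of P^{i,j}.
    p, q = 1.0 / N, 1.0 - 1.0 / N
    d = abs(i - j)
    top = N - i if i < j else i
    return comb(top, d) * p**d * q**(N - d)

N = 20
for i in range(N + 1):
    for j in range(N + 1):
        if i != j:
            s = s_bound(N, i, j)
            assert s <= transition_prob(N, i, j) <= 8.0 / 7.0 * s + 1e-15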
4.2 Lower and Upper Bounds on Partial Sums of P^{i,j}

Theorem 2. If V_i = ((N−1)/N)^{N−i} − ((N−1)/N)^N, then V_i ≤ ∑_{j=0}^{i−1} P^{i,j} ≤ R·V_i.

Proof. Considering the definition of S^{i,j} from Theorem 1, we get that:

\[
\sum_{j=0}^{i-1} S^{i,j}
= \sum_{j=0}^{i-1} \binom{i}{i-j}\left(\frac{1}{N}\right)^{i-j}\left(1-\frac{1}{N}\right)^{N-(i-j)}
= \left(1-\frac{1}{N}\right)^N \sum_{j=0}^{i-1} \binom{i}{i-j}\left(\frac{1}{N-1}\right)^{i-j}
= \left(1-\frac{1}{N}\right)^N \left(\left(\frac{N}{N-1}\right)^i - 1\right) = V_i.
\]

As P^{i,j} ≥ S^{i,j}, it follows that ∑_{j=0}^{i−1} P^{i,j} ≥ ∑_{j=0}^{i−1} S^{i,j} = V_i. Similarly, as P^{i,j} ≤ R·S^{i,j}, it follows that ∑_{j=0}^{i−1} P^{i,j} ≤ R·∑_{j=0}^{i−1} S^{i,j} = R·V_i. □
Theorem 3. If W_i = ((N−1)/N)^i − ((N−1)/N)^N, then W_i ≤ ∑_{j=i+1}^{N} P^{i,j} ≤ R·W_i.

4.3 Lower and Upper Bounds on Other Expressions

Theorem 4. If Y_i = (i/(N−1))·((N−1)/N)^{N−i+1}, then Y_i ≤ ∑_{j=0}^{i−1} P^{i,j}(i − j) ≤ R·Y_i.
Proof. Consider ∑_{j=0}^{i−1} S^{i,j}(i − j), where (·)′_N denotes the derivative with respect to N:

\[
\begin{aligned}
\sum_{j=0}^{i-1} S^{i,j}(i-j)
&= \sum_{j=0}^{i-1} \binom{i}{i-j}\left(\frac{1}{N}\right)^{i-j}\left(1-\frac{1}{N}\right)^{N-(i-j)}(i-j)\\
&= \left(1-\frac{1}{N}\right)^N \sum_{j=0}^{i-1} \binom{i}{j}\,\frac{i-j}{(N-1)^{i-j}}\\
&= \left(1-\frac{1}{N}\right)^N (1-N) \sum_{j=0}^{i-1} \binom{i}{j}\left(\frac{1}{(N-1)^{i-j}}\right)'_N\\
&= \left(1-\frac{1}{N}\right)^N (1-N) \left(\sum_{j=0}^{i-1} \binom{i}{j}\,\frac{1}{(N-1)^{i-j}}\right)'_N\\
&= \left(1-\frac{1}{N}\right)^N (1-N) \left(\left(\frac{N}{N-1}\right)^i - 1\right)'_N\\
&= \left(1-\frac{1}{N}\right)^N \frac{i}{N-1}\left(\frac{N}{N-1}\right)^{i-1}
= \frac{i}{N-1}\left(\frac{N-1}{N}\right)^{N-i+1} = Y_i.
\end{aligned}
\]

Similarly to Theorem 2, we prove the bounds on the required sum. □
Theorem 5. If Z_i = ((N−i)/(N−1))·((N−1)/N)^{i+1}, then Z_i ≤ ∑_{j=i+1}^{N} P^{i,j}(j − i) ≤ R·Z_i.
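All four partial-sum bounds are easy to spot-check numerically; a minimal sketch, again reusing the transition_prob function from Section 4 (R = 8/7):

R = 8.0 / 7.0
N = 20
q = (N - 1.0) / N
for i in range(1, N):                     # interior states only
    down  = sum(transition_prob(N, i, j) for j in range(i))
    up    = sum(transition_prob(N, i, j) for j in range(i + 1, N + 1))
    downw = sum(transition_prob(N, i, j) * (i - j) for j in range(i))
    upw   = sum(transition_prob(N, i, j) * (j - i) for j in range(i + 1, N + 1))
    V = q**(N - i) - q**N                 # Theorem 2
    W = q**i - q**N                       # Theorem 3
    Y = i / (N - 1.0) * q**(N - i + 1)    # Theorem 4
    Z = (N - i) / (N - 1.0) * q**(i + 1)  # Theorem 5
    eps = 1e-12
    assert V - eps <= down <= R * V + eps
    assert W - eps <= up <= R * W + eps
    assert Y - eps <= downw <= R * Y + eps
    assert Z - eps <= upw <= R * Z + eps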
5 Drift analysis

We analyse the running time of the algorithm using the additive drift theorem [4]. To do that, we construct the following potential function:

\[
\Phi(i, l) = \sum_{t=i}^{N-1} \frac{N}{N-t} + CN \sum_{t=0}^{N-1} \frac{1 - l(t)}{N-t},
\]

where i is the current state (equal to the number of one-bits), l(t), the learn indicator, is equal to one if the state t is a learned state and to zero otherwise, and C is a constant.
Such a function rewards the algorithm not only for getting closer to the optimum, but also for learning a state. Note that each time Φ(i, l) = 0, the algorithm is at the optimum; however, the opposite is not true: the optimum can be reached while not all states have become learned. This does not hurt anything: the additive drift theorem gives an upper bound on the number of iterations until the optimum is reached and all the states are learned, which, in turn, is an upper bound on the actual running time of the algorithm.
We can treat Φ(i, l) as a sum of two functions, Φ1(i) = ∑_{t=i}^{N−1} N/(N−t) and Φ2(l) = CN ∑_{t=0}^{N−1} (1 − l(t))/(N−t). As Φ1 is upwards convex, from Jensen's inequality it follows that Φ1(i) − E(Φ1(i′)) ≥ Φ1(i) − Φ1(E(i′)), if Φ1 is extended to non-integer arguments by linear interpolation.
5.1 Drift from a learned state

In a learned state the situation resembles how the (1+1)-EA works on OneMax [11]. In this case, the learn indicator l does not change, so the drift of Φ2 is zero. The lower bound on E(i′) is (using Theorem 5):

\[
E(i') = i + \sum_{j=0}^{N} (j-i)\,P^{i,j}_L = i + \sum_{j=i+1}^{N} (j-i)\,P^{i,j} \ge i + \frac{N-i}{N-1}\left(\frac{N-1}{N}\right)^{i+1}.
\]

The lower bound on E(i′) − i is at most one. The drift of Φ1 (and thus of Φ) is:

\[
\Phi_1(i) - E(\Phi_1(i')) \ge \Phi_1(i) - \Phi_1(E(i'))
\ge \frac{N}{N-i}\cdot\frac{N-i}{N-1}\left(\frac{N-1}{N}\right)^{i+1}
= \left(\frac{N-1}{N}\right)^{i} \ge e^{-1},
\]

where the last step uses i ≤ N − 1 and (1 − 1/N)^{N−1} ≥ e^{−1}.
5.2 Drift from an unlearned state

In an unlearned state, the expected drift of Φ2(l) can be estimated as the reward for learning the current state i multiplied by the probability that the algorithm leaves the state i (using Theorems 2 and 3). Φ2(l) − E(Φ2(l′)) is at least:

\[
\frac{CN}{N-i}\cdot\frac{\sum_{j\neq i} P^{i,j}}{2}
\ge \frac{C}{2}\cdot\frac{N}{N-i}\left(\left(\frac{N-1}{N}\right)^{N-i} + \left(\frac{N-1}{N}\right)^{i} - 2\left(\frac{N-1}{N}\right)^{N}\right).
\]

The lower bound on E(i′) is (using Theorems 4 and 5):

\[
E(i') = i + \sum_{j=0}^{N} P^{i,j}_U (j-i) = i + \frac{1}{2}\sum_{j\neq i} P^{i,j}(j-i)
= i + \frac{1}{2}\left(\sum_{j=i+1}^{N} P^{i,j}(j-i) - \sum_{j=0}^{i-1} P^{i,j}(i-j)\right)
\ge i + \frac{1}{2}\left(\frac{N-i}{N-1}\left(\frac{N-1}{N}\right)^{i+1} - R\,\frac{i}{N-1}\left(\frac{N-1}{N}\right)^{N-i+1}\right).
\]

As R ≤ 8/7 and Y_i ≤ 1, this lower bound on E(i′) − i is at least −4/7; in particular, E(i′) ≥ i − 1, so the drift of Φ1 can be bounded using the slope N/(N − i + 1) of the interpolated Φ1 on the segment [i − 1, i]:

\[
\begin{aligned}
\Phi_1(i) - E(\Phi_1(i')) &\ge \Phi_1(i) - \Phi_1(E(i'))\\
&\ge \frac{1}{2}\cdot\frac{N}{N-i+1}\left(\frac{N-i}{N-1}\left(\frac{N-1}{N}\right)^{i+1} - \frac{Ri}{N-1}\left(\frac{N-1}{N}\right)^{N-i+1}\right)\\
&= \frac{1}{2}\cdot\frac{N-i}{N-i+1}\left(\frac{N-1}{N}\right)^{i} - \frac{R}{2}\cdot\frac{i}{N-i+1}\left(\frac{N-1}{N}\right)^{N-i}.
\end{aligned}
\]

The total drift of Φ, namely D = Φ(i, l) − E(Φ(i′, l′)), is bounded from below by the sum of the drifts for Φ1 and Φ2:

\[
\begin{aligned}
D &\ge \frac{C}{2}\cdot\frac{N}{N-i}\left(\left(\frac{N-1}{N}\right)^{N-i} + \left(\frac{N-1}{N}\right)^{i} - 2\left(\frac{N-1}{N}\right)^{N}\right)
 + \frac{1}{2}\cdot\frac{N}{N-i+1}\left(\frac{N-i}{N}\left(\frac{N-1}{N}\right)^{i} - \frac{Ri}{N}\left(\frac{N-1}{N}\right)^{N-i}\right)\\
&\ge \frac{C}{2}\cdot\frac{N}{N-i+1}\left(\left(\frac{N-1}{N}\right)^{N-i} + \left(\frac{N-1}{N}\right)^{i} - 2\left(\frac{N-1}{N}\right)^{N}\right)
 + \frac{1}{2}\cdot\frac{N}{N-i+1}\left(\frac{N-i}{N}\left(\frac{N-1}{N}\right)^{i} - \frac{Ri}{N}\left(\frac{N-1}{N}\right)^{N-i}\right)\\
&= \frac{\left(\frac{N-1}{N}\right)^{N-i}(CN - Ri) + \left(\frac{N-1}{N}\right)^{i}(CN + N - i) - \left(\frac{N-1}{N}\right)^{N}(2CN)}{2(N-i+1)}.
\end{aligned}
\]
If C ≥ R, then CN − Ri is positive. As ((N−1)/N)^x is convex downwards in x, we can use Jensen's inequality to bound a part of the latter expression, which we call G:

\[
\begin{aligned}
G &= \left(\frac{N-1}{N}\right)^{N-i}(CN - Ri) + \left(\frac{N-1}{N}\right)^{i}(CN + N - i)\\
&= \bigl((2C+1)N-(R+1)i\bigr)\left(\frac{CN-Ri}{(2C+1)N-(R+1)i}\left(\frac{N-1}{N}\right)^{N-i} + \frac{CN+N-i}{(2C+1)N-(R+1)i}\left(\frac{N-1}{N}\right)^{i}\right)\\
&\ge \bigl((2C+1)N-(R+1)i\bigr)\left(\frac{N-1}{N}\right)^{\frac{(N-i)(CN-Ri)+i(CN+N-i)}{(2C+1)N-(R+1)i}}.
\end{aligned}
\]

To find a lower bound on G, one needs to find an upper bound on the exponent in the expression above, assuming that 0 ≤ i ≤ N. Recall that R = 8/7, which makes the exponent equal to

\[
\frac{i^2 - iN + 7CN^2}{14CN + 7N - 15i}.
\]

The derivative of this exponent with respect to i has no roots in [0, N], at least when C ≥ 1. The value for i = 0 is N·C/(2C+1), and for i = N it is N·7C/(14C−8); the latter is the larger of the two. We continue with the lower bound on D:

\[
D \ge \frac{\bigl((2C+1)N-(R+1)i\bigr)\left(\frac{N-1}{N}\right)^{N\frac{7C}{14C-8}} - \left(\frac{N-1}{N}\right)^{N}(2CN)}{2(N-i+1)}
= \frac{\left(1 + 2C\left(1 - \left(\frac{N-1}{N}\right)^{N\frac{7C-8}{14C-8}}\right)\right)N - (R+1)i}{2(N-i+1)}\left(\frac{N-1}{N}\right)^{N\frac{7C}{14C-8}}.
\]
If we choose C such that 1 + 2C(1 − ((N−1)/N)^{N(7C−8)/(14C−8)}) > R + 1, then, starting from some N, the fraction will be greater than one. For this purpose we approximate ((N−1)/N)^N by e^{−1}. The problem then reduces to finding the minimum C such that 1 − 4/(7C) > e^{(7C−8)/(8−14C)}. This can be done by a binary search, which yields C = 2.115188060… ≈ 2.12. Consequently, when C is at least this large, the drift from an unlearned state is at least ((N−1)/N)^{N·7C/(14C−8)} ≈ ((N−1)/N)^{0.6845N} ≥ e^{−1}.
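The constant can be reproduced with a few lines of Python; a minimal sketch of the binary search described above (the bracketing interval and iteration count are our choices):

from math import exp

def find_C(lo=1.0, hi=10.0, iters=100):
    # Minimal C with 1 - 4/(7C) > exp((7C - 8)/(8 - 14C)); the gap
    # appears to be increasing in C on this interval, so bisection works.
    def gap(C):
        return 1.0 - 4.0 / (7.0 * C) - exp((7.0 * C - 8.0) / (8.0 - 14.0 * C))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return hi

print(find_C())  # prints approximately 2.115188060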
Together with all previous analysis, this proves the following theorem:
Theorem 6. The expected running time of the (1 + 1) evolutionary algorithm controlled by greedy Q-learning on the OneMax+ZeroMax problem with a value of the discount factor γ < 1/(N − 1) is at most:

\[
(1 + C)\,eN \log N \approx 3.12\,eN \log N.
\]
6 Experimental evaluation

We conducted an experimental evaluation on big problem sizes to see how precise our estimations are. Table 1 presents the results. For each N, there were 100 runs, and the numbers of fitness function calls were averaged. We tested two values of the discount factor γ, namely 1/N and 1. For the latter value, we additionally track the number of situations when Q-values were tuned wrongly and ZeroMax was chosen intentionally (the “Average false queries” column). Additionally, we track the value of 2eN log N, which was the common belief for the “right” expression for the algorithm's runtime.

Table 1. Experiment results. C ≈ 2.12 is the constant proven in Section 5.

N       | Average FF calls, γ=1/N | Average FF calls, γ=1 | Average false queries, γ=1 | 2eN log N  | (1+C)eN log N | Ratio to γ=1/N
1·10^1  | 9.892·10^1              | 9.648·10^1            | 3.000·10^−2                | 1.252·10^2 | 1.950·10^2    | 1.97
3·10^1  | 4.673·10^2              | 4.855·10^2            | 1.900·10^−1                | 5.547·10^2 | 8.640·10^2    | 1.85
1·10^2  | 2.382·10^3              | 2.389·10^3            | 7.800·10^−1                | 2.504·10^3 | 3.900·10^3    | 1.64
3·10^2  | 9.340·10^3              | 9.335·10^3            | 1.690·10^0                 | 9.303·10^3 | 1.449·10^4    | 1.55
1·10^3  | 3.910·10^4              | 3.925·10^4            | 8.280·10^0                 | 3.755·10^4 | 5.849·10^4    | 1.50
3·10^3  | 1.389·10^5              | 1.441·10^5            | 2.414·10^1                 | 1.306·10^5 | 2.034·10^5    | 1.46
1·10^4  | 5.585·10^5              | 5.461·10^5            | 7.443·10^1                 | 5.007·10^5 | 7.799·10^5    | 1.40
3·10^4  | 1.882·10^6              | 1.901·10^6            | 2.330·10^2                 | 1.681·10^6 | 2.619·10^6    | 1.39
1·10^5  | 7.225·10^6              | 7.108·10^6            | 7.780·10^2                 | 6.259·10^6 | 9.749·10^6    | 1.35
3·10^5  | 2.376·10^7              | 2.376·10^7            | 2.325·10^3                 | 2.057·10^7 | 3.204·10^7    | 1.35
1·10^6  | 8.726·10^7              | 8.648·10^7            | 7.632·10^3                 | 7.511·10^7 | 1.170·10^8    | 1.34
First, it turned out that for large N the algorithm needs more than 2eN log N iterations to find an optimum, which was surprising. Second, the runtimes for different values of γ seem to be the same. The false queries column provides an insight into this phenomenon: for γ = 1, the number of mismatches with the learning lemma seems to be Θ(N) with a constant approximately equal to 7.6·10^−3. We have no proof for this yet, but it seems that the probability of “learning the wrong way” is very small and in most cases the algorithm can later “self-heal” from wrong decisions. Finally, our estimation appears to be quite a good one, as it is only 35% pessimistic for large N.
7 Conclusion

We presented a proof that the (1+1) evolutionary algorithm, when controlled by the EA+RL method which uses Q-learning with a greedy exploration strategy, small values of the discount factor γ < 1/(N − 1), the difference in the target fitness function as a reward, and reinforcement learning states determined by the target fitness function, solves the OneMax+ZeroMax problem in O(N log N), more precisely, in at most 3.12eN log N fitness function calls in expectation.
Experiments show that early thoughts on the actual expected running time, 2eN log N, are wrong for N ≥ 10^5. The current upper bound seems to be only 35% worse than the observed running time. What is more, the influence of big values of γ is shown to be negligible. We hope that these results will show the way for proving bounds on the running time of the EA+RL method on more complex problems.

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.
References

1. Brockhoff, D., Friedrich, T., Hebbinghaus, N., Klein, C., Neumann, F., Zitzler, E.: On the Effects of Adding Objectives to Plateau Functions. IEEE Transactions on Evolutionary Computation 13(3), 591–603 (2009)
2. Buzdalov, M., Buzdalova, A., Shalyto, A.: A First Step towards the Runtime Analysis of Evolutionary Algorithm Adjusted with Reinforcement Learning. In: Proceedings of the International Conference on Machine Learning and Applications. vol. 1, pp. 203–208. IEEE Computer Society (2013)
3. Buzdalova, A., Buzdalov, M.: Increasing Efficiency of Evolutionary Algorithms by Choosing between Auxiliary Fitness Functions with Reinforcement Learning. In: Proceedings of the International Conference on Machine Learning and Applications. vol. 1, pp. 150–155 (2012)
4. Hajek, B.: Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability 14(3), 502–525 (1982)
5. Handl, J., Lovell, S.C., Knowles, J.D.: Multiobjectivization by Decomposition of Scalar Cost Functions. In: Parallel Problem Solving from Nature X, pp. 31–40. No. 5199 in Lecture Notes in Computer Science, Springer (2008)
6. Jensen, M.T.: Helper-Objectives: Using Multi-Objective Evolutionary Algorithms for Single-Objective Optimisation. Journal of Mathematical Modelling and Algorithms 3(4), 323–347 (2004)
7. Knowles, J.D., Watson, R.A., Corne, D.: Reducing Local Optima in Single-Objective Problems by Multi-objectivization. In: Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization. pp. 269–283. Springer-Verlag (2001)
8. Lochtefeld, D.F., Ciarallo, F.W.: Helper-Objective Optimization Strategies for the Job-Shop Scheduling Problem. Applied Soft Computing 11(6), 4161–4174 (2011)
9. Neumann, F., Wegener, I.: Can Single-Objective Optimization Profit from Multiobjective Optimization? In: Multiobjective Problem Solving from Nature, pp. 115–130. Natural Computing Series, Springer Berlin Heidelberg (2008)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA (1998)
11. Witt, C.: Optimizing Linear Functions with Randomized Search Heuristics – the Robustness of Mutation. In: Proceedings of the 29th Annual Symposium on Theoretical Aspects of Computer Science. pp. 420–431 (2012)