Run-Time Analysis of the (1+1) Evolutionary Algorithm
Optimizing Linear Functions Over a Finite Alphabet
Benjamin Doerr
Max Planck Institute for Computer Science,
Campus E1 4,
66123 Saarbrücken,
Germany
Sebastian Pohl
Max Planck Institute for Computer Science,
Campus E1 4,
66123 Saarbrücken,
Germany
GECCO'12, July 07–11, 2012, Philadelphia, USA. Copyright 2012 ACM 978-1-4503-1177-9/12/07.

ABSTRACT
We analyze the run-time of the (1+1) Evolutionary Algorithm optimizing an arbitrary linear function f : {0, 1, . . . , r}^n → R. If the mutation probability of the algorithm is p = c/n, then (1 + o(1))(e^c/c)rn log n + O(r^3 n log log n) is an upper bound on the expected time needed to find the optimum. We also give a lower bound of (1 + o(1))(1/c)rn log n. Hence for constant c and all r slightly smaller than (log n)^{1/3}, our bounds deviate by only a constant factor, which is e(1 + o(1)) for the standard mutation probability of 1/n. The proof of the upper bound uses multiplicative adaptive drift analysis as developed in a series of recent papers. We cannot close the gap for larger values of r, but we find indications that multiplicative drift is not the optimal analysis tool for this case.

Categories and Subject Descriptors
F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity

General Terms
Theory, algorithms

Keywords
Run-time analysis, theory, drift analysis, linear functions

1. INTRODUCTION
Drift analysis, introduced to the field of evolutionary computation by He and Yao [10, 11], is one of the most powerful methods to analyze the run-time of evolutionary algorithms. The last two years saw several deep results showing unexpected limitations of the classical method (e.g., that it only works for small mutation probabilities) and giving ways to overcome these. All this work is directed towards a deeper understanding of drift analysis and developing the method further; hopefully to be able to solve some of the great open problems, such as a tight analysis of the run-time of evolutionary algorithms for the minimum spanning tree or the single-source shortest path problem.

In this work, we make further progress along this line. The work [5] showed that the classic drift methods used to analyze evolutionary algorithms (EAs) optimizing linear functions f : {0, 1}^n → R cannot be extended to the natural generalization of the problem to linear functions f : {0, . . . , r}^n → R already for r ≥ 43; in this work, based on Witt's recent breakthrough [15], we succeed in giving a relatively tight run-time analysis for this problem. While we do have results for all r and all mutation probabilities p, the most interesting case might be that of r being an arbitrary constant and p being the standard mutation probability 1/n. For this setting, we show that the classic (1+1) evolutionary algorithm optimizes any linear function f : {0, . . . , r}^n → R in time (1 + o(1))ern ln(n). We have an almost matching lower bound, also valid for all linear functions, of (1 + o(1))rn ln(n).

1.1 Drift Analysis and Pseudo-Boolean Linear Functions
What led to the invention of the drift analysis method was the innocent-looking question how fast the (1+1) EA optimizes an arbitrary linear pseudo-Boolean function. These are functions of the form

f : {0, 1}^n → R, x ↦ w_0 + Σ_{i=1}^n w_i x_i,

where the weights w_i are arbitrary real values. For all evolutionary algorithms regarded in this paper, we may (and shall) without loss of generality assume that w_0 = 0 and w_i ≥ 0 for all i ∈ [n] := {1, . . . , n}, and that we want to find a minimum. In this case, (0, . . . , 0) is an obvious optimum of any such f. Also, it is easy to establish that randomized local search (RLS), which in each iteration flips a single random bit and accepts the offspring if it is not worse, with high probability finds an optimum in (1 + o(1))n ln(n) iterations.

It then came as some surprise that the same question for the simple (1+1) EA, flipping each bit with probability 1/n instead of flipping exactly one bit, was much harder to answer. In their seminal paper [8], Droste, Jansen, and Wegener proved that the (1+1) EA finds an optimum of any pseudo-Boolean linear function in an expected number of at most O(n log n) iterations. The proof of this result is long and complicated.

He and Yao [11] overcame these difficulties by introducing drift analysis to the field of evolutionary computation. The
central idea of drift analysis is to observe a drift towards the optimum, that is, an expected decrease of the distance to the optimum, and then to use deeper tools from probability theory to transfer this information into a result on the (expected) run-time of the algorithm.

What makes this method strong, but at the same time non-trivial to use, is that we have a choice of how to measure the distance to the optimum. Often, as demonstrated in [11], quite complicated distance measures are needed to obtain the necessary drift. Usually, these distance measures are derived from potential functions on the search space, a concept implicitly already used by Droste, Jansen, and Wegener [8], though without drift arguments.

While the bounds for optimizing linear functions by a (1+1) EA were known, the obtained leading constants were still impractical and not sharp. Jägersküpper [12] approached this problem by analyzing the underlying Markov chain of the optimization process of the (1+1) EA. Through this approach he was able to give an explicit and practical upper bound for the expected run-time of the (1+1) EA optimizing linear functions over {0, 1}^n, namely 2.016 e n ln n + O(n).

The next progress triggered by the linear functions problem was the introduction of multiplicative drift analysis by Doerr, Johannsen, and Winzen [6, 7]. The main difference to the previously used additive drift is that the progress is now measured relative to the current distance from the optimum. Such multiplicative progress seems quite natural, and it is also encountered naturally in the linear functions problem. By better capturing the drift behavior, sharper bounds for the linear functions problem were derived, proving that the leading constant lies in the interval [1, 1.39] when the standard mutation rate of 1/n is used. However, Doerr, Johannsen, and Winzen [6, 7] also observed that none of the drift approaches known at that time can analyze the linear functions problem when the mutation probability is a small constant factor larger than 1/n.

Doerr and Goldberg overcame this problem by introducing adaptive drift analysis in [4], which allowed them to analyze the (1+1) EA optimizing linear functions with arbitrary mutation rates c/n. Adaptive drift analysis uses not one universal potential function for every linear pseudo-Boolean function; instead, for each linear function a specifically tailored potential function is used. Their analysis obtained Θ(n log n) for the optimization time of this class of functions. In [4], the same authors gave a different proof of the multiplicative drift theorem that yields not only bounds for the expected optimization time, but also bounds that hold with high probability.

By giving a very clever definition of adaptive drift functions, Witt [15] gave a very precise run-time bound for general mutation probabilities. For the mutation probability being p = c/n, c a constant, he proved a run-time of (1 ± o(1))(e^c/c)n log n.
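To make the object of study concrete, the following is a minimal sketch (our own illustrative code, with hypothetical function names) of the classic (1+1) EA minimizing a linear pseudo-Boolean function; the O(n log n) behavior discussed above can be observed by timing it for growing n:

```python
import random

def one_plus_one_ea_bits(weights, max_iters=10**6, seed=None):
    """(1+1) EA minimizing f(x) = sum(w_i * x_i) over {0,1}^n.

    Each bit is flipped independently with the standard probability 1/n;
    the offspring replaces the parent if it is not worse.
    Returns the number of iterations until the optimum (0,...,0) is found.
    """
    rng = random.Random(seed)
    n = len(weights)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = sum(w * b for w, b in zip(weights, x))
    for t in range(1, max_iters + 1):
        y = [b ^ 1 if rng.random() < 1.0 / n else b for b in x]
        fy = sum(w * b for w, b in zip(weights, y))
        if fy <= fx:            # accept if not worse
            x, fx = y, fy
        if fx == 0:             # optimum for non-negative weights
            return t
    return max_iters

# Example: BinVal-like weights w_i = 2^i.
iterations = one_plus_one_ea_bits([2**i for i in range(20)], seed=1)
```

Averaging `iterations` over many runs and several n illustrates the (1 ± o(1))(e^c/c)n log n scaling for c = 1; the sketch is not part of the analysis in this paper.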
1.2 Linear Functions Over Larger Domains
All previous results refer to linear functions defined over the search space {0, 1}^n. Doerr, Johannsen, and Schmidt [5] were the first to regard the problem for linear functions of the form

f : {0, 1, . . . , r}^n → R, x ↦ Σ_{i=1}^n w_i x_i,

where n and r are positive integers and the weights w_i are real-valued. However, their results were mainly negative. For the case r = 2 they obtained the bound O(n log n) in a highly technical way. For r ≥ 43, they even showed that there do not exist universal potential functions leading to an O(n log n) bound.

Besides the interest in gaining a deeper understanding of the strengths and limitations of existing drift methods, several combinatorial optimization problems also motivate the study of such domains. For example, Gunia [9] modelled the minimum makespan problem for scheduling n jobs over k machines over the search space {1, . . . , k}^n. Highly interesting from the theory perspective is the work of Scharnow, Tinnefeld, and Wegener [14], who analyzed the single-source shortest path problem (SSSP). The natural representation of the shortest path tree immediately yields the search space {1, . . . , n − 1}^n for n-vertex graphs. Their work applies a multi-objective fitness function, which cannot be translated into the context of a single linear function. In [2], Baswana, Biswas, Doerr, Friedrich, Kurur, and Neumann analyzed the SSSP problem with a single-objective fitness function and the same representation as elements of the set {1, . . . , n − 1}^n. Their result established an efficient (1+1) EA for the SSSP with a single-objective fitness function with the expected run-time bound O(n^3 (log n + log w_max)), where w_max is the maximal edge weight of the graph. It is a great open problem whether the w_max term in this bound is necessary or not. Baswana et al. did not use drift analysis, so there is some hope that suitable drift methods can remove the w_max term.

1.3 Our Results
In this work, we make a step forward in analyzing run-times for function classes defined on domains {0, . . . , r}^n for r larger than one. In particular, we use multiplicative adaptive drift analysis together with an extension of Witt's [15] potential functions to general r to analyze the general linear functions problem.

For all r and all mutation probabilities p = c/n, c not necessarily a constant, we show that the (1+1) EA finds the optimum of any linear function defined on {0, . . . , r}^n in (1 + o(1))(e^c/c)rn log n + O(r^3 n log log n) iterations.

To complement the upper bound, we also give a simple, but effective argument to show that the (1+1) EA needs at least (1 + o(1))(1/c)rn log n time to find the optimum of any linear function defined on {0, 1, . . . , r}^n. Thus for the standard mutation probability 1/n, our bounds for all r and n deviate by a factor of at most (1 + o(1))e only.

These results show that the existing methods can be adapted to deal with optimization over {0, . . . , r}^n. They give some hope that drift analysis can also be employed in combinatorial optimization problems, for example, the single-source shortest path problem.
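The algorithm analyzed in this work (Algorithm 1 of Section 2) differs from the bit-string case only in its mutation operator: each position is, independently with probability p, redrawn uniformly from the r other values. A minimal Python sketch under our own naming conventions:

```python
import random

def mut_p(x, r, p, rng):
    """Each entry mutates independently with prob. p to a uniform value
    in {0,...,r} \\ {x_i} (each of the r alternatives with prob. 1/r)."""
    y = list(x)
    for i in range(len(x)):
        if rng.random() < p:
            v = rng.randrange(r)              # uniform over r alternatives
            y[i] = v if v < x[i] else v + 1   # skip the current value x_i
    return y

def one_plus_one_ea(weights, r, p, max_iters=10**6, seed=None):
    """(1+1) EA minimizing f(x) = sum(w_i * x_i) over {0,...,r}^n."""
    rng = random.Random(seed)
    n = len(weights)
    x = [rng.randint(0, r) for _ in range(n)]
    fx = sum(w * v for w, v in zip(weights, x))
    for t in range(1, max_iters + 1):
        y = mut_p(x, r, p, rng)
        fy = sum(w * v for w, v in zip(weights, y))
        if fy <= fx:
            x, fx = y, fy
        if fx == 0:
            return t
    return max_iters
```

The index-shift trick in `mut_p` realizes the uniform choice over the r values different from x_i without rejection sampling.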
2. THE (1+1) EA OPTIMIZING FUNCTIONS OVER A FINITE ALPHABET
The main focus of this paper is the (1+1) EA optimizing functions defined on search spaces of type {0, 1, . . . , r}^n for some r ≥ 1. This is a natural extension of the usual bit-string setting (r = 1). To this aim, we have to define a suitable mutation operator. Since the standard bit mutation for bit strings flips each bit independently with probability p (for some p > 0, typically p = 1/n), we define the following mutation operator for general r ≥ 1.

Definition 1 (mut_p(x)). Let r and n be positive integers, let x ∈ {0, 1, . . . , r}^n be a vector, and let 0 ≤ p ≤ 1. The mutation operator mut_p computes the image y ∈ {0, 1, . . . , r}^n of x in the following way. Independently for each element x_i of x (1 ≤ i ≤ n), the probability that x_i mutates is p. If x_i mutates, it mutates to a value y_i ∈ {0, 1, . . . , r} \ {x_i}, where y_i is chosen uniformly at random, that is, each of the r alternative values with probability 1/r.

We then use this mutation operator mut_p in the iteration steps of a standard (1+1) EA. This yields the following (1+1) EA (Algorithm 1) minimizing a given function f : {0, 1, . . . , r}^n → R. It was studied before by Gunia [9] to analyze the minimum makespan scheduling problem with r + 1 machines.

Algorithm 1: The (1+1) EA minimizing a function f : {0, . . . , r}^n → R.
1  input f : {0, 1, . . . , r}^n → R;
2  initialize: choose x ∈ {0, 1, . . . , r}^n uniformly at random;
3  repeat
4      y := mut_p(x);
5      if f(y) ≤ f(x) then
6          x := y;
7  until forever;

3. UPPER BOUNDS: DRIFT ANALYSIS
Drift analysis has become a central method for the rigorous study of the run-time of evolutionary algorithms. The following multiplicative drift theorem with tail bounds stems from [4, 7].

Theorem 2 (Drift Theorem). Let S ⊆ R be a finite set of positive numbers with minimum s_min. Let {X^(t)}_{t∈N} be a sequence of random variables over S ∪ {0}. Let T be the random variable that denotes the first point in time t ∈ N for which X^(t) = 0. Suppose that there exists a real number δ > 0 such that

E[X^(t) − X^(t+1) | X^(t) = s] ≥ δs

holds for all s ∈ S with Pr[X^(t) = s] > 0. Then for all s_0 ∈ S with Pr[X^(0) = s_0] > 0, we have

E[T | X^(0) = s_0] ≤ (1 + ln(s_0/s_min))/δ.

Moreover, for any λ > 0, the probability that the optimization time exceeds

(λ + ln(s_0/s_min))/δ

is at most e^{−λ}.

For the remainder of this paper let f : {0, 1, . . . , r}^n → R, x ↦ Σ_{i=1}^n w_i x_i, be an arbitrary linear function over the index set I = {1, . . . , n}. Without loss of generality, we assume w_n ≥ . . . ≥ w_1 ≥ 0. Our central goal is to give upper bounds on the optimization time of Algorithm 1 minimizing f.

We use the following notation. We denote by a^(t) ∈ {0, 1, . . . , r}^n the search point after t iterations of Algorithm 1 starting with the initial search point a^(0). Obviously, for all t ≥ 0, a^(t+1) is the search point after one iteration of Algorithm 1 starting with a^(t).

To apply the drift theorem, we assign each search point x ∈ {0, 1, . . . , r}^n a non-negative potential g(x). This potential function g will be defined depending on f (adaptive drift method). We then analyze the stochastic process (X^(t))_{t≥0}, where X^(t) := g(a^(t)) for all t, and we define ∆_t := X^(t) − X^(t+1). The crucial part of the proof, besides defining the potential function g suitably, is showing that the drift condition in Theorem 2 is fulfilled.

The proof of the following lemma is moved to Section 4 due to its technicality and size, to ease the reading.

Lemma 3. In the notation from above and for arbitrary α (dependent on n and r),

E[∆_t | X^(t) = s] ≥ p · (1 − p)^{n−1} · (1/r − 1/α) · s.

Theorem 2 now yields the following run-time bounds for the (1+1) EA optimizing linear functions.

Theorem 4. Let λ > 0. The optimization time T of the (1+1) EA optimizing an arbitrary linear function f : {0, 1, . . . , r}^n → R is bounded from above by

b(λ) := (1 − p)^{1−n} · (rα/(α − r)) · ( nαr(1 − p)^{1−n} + (ln(1/p) + (n − 1) ln(1 − p) + λ)/p )   (1)

with probability at least 1 − e^{−λ}, where α > r can be chosen arbitrarily depending on n and r. The expected optimization time is at most b(1).

Proof of Theorem 4. We apply Theorem 2 together with Lemma 3. Observe that X^(0) is the initial potential in Algorithm 1, so we can apply Lemma 7 from Section 4. Because of the monotonicity of the natural logarithm and since ln(1/α) ≤ 0, we obtain

ln X^(0) ≤ nαpr(1 − p)^{1−n} + ln(1/α) + ln(1/p) + ln((1 − p)^{n−1})
≤ nαpr(1 − p)^{1−n} + ln(1/p) + ln((1 − p)^{n−1}).

By Definition 6, the minimal weight in our potential function is g_1 = 1, that is, the minimal potential value X^min is 1 at the search point (0, . . . , 0, 1). Applying Theorem 2, we see that the optimization time T of Algorithm 1 with probability
1 − e^{−λ} is at most

T ≤ (ln(X^(0)/X^min) + λ) / (p(1 − p)^{n−1}(1/r − 1/α))
≤ rα( nαpr(1 − p)^{1−n} + ln(1/p) + ln((1 − p)^{n−1}) + λ ) / ((α − r) p (1 − p)^{n−1})
= (1 − p)^{1−n} · (rα/(α − r)) · ( nαr(1 − p)^{1−n} + (ln(1/p) + (n − 1) ln(1 − p) + λ)/p )
= b(λ).

Setting λ = 1, we also obtain from Theorem 2 the upper bound b(1) for the expected optimization time E[T | X^(0)].

For mutation probabilities of order Θ(1/n), these bounds read as follows.

Corollary 5. Let c > 0 be constant and let p = c/n be the mutation probability of Algorithm 1. Then the expected optimization time of Algorithm 1 is

(1 + o(1))(e^c/c) rn log n + O(r^3 n log log n).

The same bound holds with probability 1 − o(1).

Proof. We bound the terms in (1) from above. We have (1 − p)^{1−n} ≤ e^c ∈ O(1), which we obtain from the inequality 1 + x ≤ e^x, valid for all x ∈ R. Now choose α = r(1 + ln ln(n + 1)). Then

rα/(α − r) = r^2(1 + ln ln(n + 1)) / (r ln ln(n + 1)) = r(1 + 1/ln ln(n + 1)) ∈ (1 + o(1)) r.

We also have

nαr = r^2 n(1 + ln ln(n + 1)) ∈ O(r^2 n ln ln n).

For the last term in (1) note that ln(1 − p) < 0. Hence we obtain

(ln(1/p) + (n − 1) ln(1 − p) + 1)/p ≤ (1/p) ln(1/p) + 1/p = (n/c) ln(n/c) + n/c = (1 + o(1))(n/c) ln n.

Consequently, we bound (1) from above by

(1 − p)^{1−n} · (rα/(α − r)) · ( nαr(1 − p)^{1−n} + (ln(1/p) + (n − 1) ln(1 − p) + 1)/p )
= e^c (1 + o(1)) r ( e^c O(r^2 n ln ln n) + (1 + o(1))(n/c) ln n )
= (1 + o(1))(e^c/c) rn ln n + O(r^3 n ln ln n).

Note that we could replace the ln ln(n + 1) term in the proof (and the corollary) by any other term that is ω(1). We chose ln ln n for mere convenience.

4. PROOF OF LEMMA 3
The proof of Lemma 3 involves analyzing the structure of the function f in a most general way. To enable this, we introduce some notation that eases the understanding of the proof itself. As a first step, we construct an adaptive linear potential function g : {0, 1, . . . , r}^n → R for an arbitrary linear input function f : {0, 1, . . . , r}^n → R.

Definition 6. We define the following sequence over i ∈ N:

γ_i := (1 + αpr/(1 − p)^{n−1})^{i−1},

where α is some (not necessarily constant) factor. Given this sequence, we establish a sequence of weights g_i as follows:

g_i := min{ γ_i, g_{i−1} · w_i/w_{i−1} }

for all i > 1, and g_1 := γ_1. Through this we obtain the new linear function

g : {0, 1, . . . , r}^n → R, x ↦ Σ_{i=1}^n g_i x_i.

For any i ∈ {1, . . . , n} we further define the following terms:
• k(i) := max{ j ≤ i | g_j = γ_j }, the first capping index according to the sequence γ_i,
• L(i) := {n, . . . , k(i)}, the non-capped indices situated left from k(i),
• R(i) := {k(i) − 1, . . . , 1}, the capped indices situated right from k(i).

Observe that either R(i) or L(i) can be empty, but never both. We also obtain for any two indices i ≤ j, i, j ∈ {1, . . . , n}, that k(i) ≤ k(j); thus the sequence k(i) is non-decreasing. Accordingly, L(j) ⊆ L(i) and R(i) ⊆ R(j) follow directly. Furthermore, the construction of g obviously depends primarily on the weights w_i of the linear input function f, but it also depends on the mutation probability p and on n and r. The main idea for g is based upon the work of Witt [15] and mimics the growth ratios of the weights of our initial fitness function f. At the same time, we want to prevent arbitrarily large growth of the weight sequence g_i.

For the sake of completeness, we state the lemma which we already used in the proof of Theorem 4.

Lemma 7. For the potential X^(0) of the initial search point in Algorithm 1, we have

X^(0) ≤ e^{nαpr(1−p)^{1−n}} / (αp(1 − p)^{1−n}).

Proof. Because of the construction of γ_i and g we have

X^(0) ≤ r · Σ_{i=1}^n g_i ≤ r · Σ_{i=1}^n γ_i ≤ r · ( (1 + αpr/(1 − p)^{n−1})^n − 1 ) / (αpr(1 − p)^{1−n}) ≤ e^{nαpr(1−p)^{1−n}} / (αp(1 − p)^{1−n}),

where the last inequality stems from the well-known fact that 1 + x ≤ e^x for all x ∈ R.
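The construction in Definition 6 is straightforward to compute. The sketch below (our own code with hypothetical names, assuming non-negative weights sorted in increasing order) builds the sequences γ_i and g_i and checks the bound of Lemma 7 numerically against r · Σ g_i, which upper-bounds the initial potential X^(0):

```python
import math

def potentials(w, r, p, alpha):
    """gamma_i = (1 + alpha*p*r/(1-p)^(n-1))^(i-1);
    g_1 = gamma_1 = 1 and g_i = min(gamma_i, g_{i-1} * w_i/w_{i-1})."""
    n = len(w)
    q = 1.0 + alpha * p * r / (1.0 - p) ** (n - 1)
    gamma = [q ** i for i in range(n)]   # gamma_1 = 1 stored at index 0
    g = [gamma[0]]                        # g_1 := gamma_1 = 1
    for i in range(1, n):
        g.append(min(gamma[i], g[i - 1] * w[i] / w[i - 1]))
    return gamma, g

n, r, p = 50, 3, 1.0 / 50
alpha = r * (1 + math.log(math.log(n + 1)))       # the choice from Corollary 5
w = sorted(float(i + 1) ** 2 for i in range(n))   # an arbitrary sorted weight vector
gamma, g = potentials(w, r, p, alpha)

# X^(0) <= r * sum(g_i), which Lemma 7 bounds by
# exp(n*alpha*p*r*(1-p)^(1-n)) / (alpha*p*(1-p)^(1-n)).
x0_bound = r * sum(g)
lemma7 = math.exp(n * alpha * p * r * (1 - p) ** (1 - n)) / (alpha * p * (1 - p) ** (1 - n))
assert g[0] == 1.0 and x0_bound <= lemma7
```

By construction g_i ≤ γ_i for all i, which is what the geometric-sum step in the proof of Lemma 7 exploits.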
The following lemma shows an important property of the defined k(i).

Lemma 8. For the index set I of the linear functions f and g with respective weights (w_i)_{i∈I} and (g_i)_{i∈I} we have
• g_j/g_{k(i)} ≤ w_j/w_{k(i)} for n ≥ j > i (right side of i),
• g_j/g_{k(i)} = w_j/w_{k(i)} for i ≥ j ≥ k(i) (the non-capped left side of i),
• g_j/g_{k(i)} ≥ w_j/w_{k(i)} for k(i) > j ≥ 1 (the capped left side of i).

The properties are obvious from the definition of g_i and the definition of k(i).

Definition 9. Given our two vectors a^(t) and a^(t+1), we define the following items:
• ℓ(i) := a_i^(t) − a_i^(t+1), the change distance of index i,
• I_D := {i ∈ I | ℓ(i) > 0}, the downflipping (good) indices,
• I_U := {i ∈ I | ℓ(i) < 0}, the upflipping (bad) indices,
• I_0 := {i ∈ I | ℓ(i) = 0}, the non-mutating indices.

It is easy to observe that I_D, I_U, and I_0 partition I, and that the indices in I_0 contribute nothing to the expected change ∆_t. With this notation, we can now prove Lemma 3.

Proof of Lemma 3. Let A be the event that a^(t) ≠ a^(t+1) and X^(t) = s. By the law of total expectation, we have

E[∆_t | X^(t) = s] = E[∆_t | A] · Pr(A) + E[∆_t | (X^(t) = s) ∧ Ā] · Pr((X^(t) = s) ∧ Ā).

Clearly, E[∆_t | (X^(t) = s) ∧ Ā] = 0, because if Ā occurs, then ∆_t = 0. Thus

E[∆_t | X^(t) = s] = E[∆_t | A] · Pr(A).

For the event A to occur, two things are necessary: first, that I_D is not the empty set, and second, that

Σ_{j∈I_D} ℓ(j)w_j + Σ_{j∈I_U ∩ L(i)} ℓ(j)w_j ≥ 0

for at least one i ∈ I_D. Now we define the following events A_i separately for every i ∈ I, where I is the set of indices of the linear function f. The event A_i occurs exactly if the following three conditions hold simultaneously:
(i) i = max I_D,
(ii) Σ_{j∈I_D} ℓ(j)w_j + Σ_{j∈I_U ∩ L(i)} ℓ(j)w_j ≥ 0,
(iii) X^(t) = s.

Clearly, the event A is the disjoint union of the events A_i. For all i ∈ I let

∆_L(i) := Σ_{j∈I_D} ℓ(j)g_j + Σ_{j∈I_U ∩ L(i)} ℓ(j)g_j,
∆_R(i) := Σ_{j∈I_U ∩ R(i)} ℓ(j)g_j.

Clearly, we have ∆_t = ∆_L(i) + ∆_R(i). By the law of total probability and the linearity of expectation, we obtain

E[∆_t | X^(t) = s] = E[∆_t | A] · Pr(A)
= Σ_{i∈I, a_i^(t)>0} E[∆_t | A_i] · Pr(A_i)
= Σ_{i∈I, a_i^(t)>0} ( E[∆_L(i) | A_i] · Pr(A_i) + E[∆_R(i) | A_i] · Pr(A_i) ).   (2)

We first show that ∆_L(i) | A_i is non-negative for all i ∈ I. To this end, assume that A_i holds. Then

∆_L(i) = Σ_{j∈I_D} ℓ(j)g_j + Σ_{j∈I_U ∩ L(i)} ℓ(j)g_j
≥ Σ_{j∈I_D} ℓ(j)g_{k(i)} · w_j/w_{k(i)} + Σ_{j∈I_U ∩ L(i)} ℓ(j)g_{k(i)} · w_j/w_{k(i)}
= (g_{k(i)}/w_{k(i)}) · ( Σ_{j∈I_D} ℓ(j)w_j + Σ_{j∈I_U ∩ L(i)} ℓ(j)w_j ) ≥ 0.

Now let S_i be the event that |I_U ∩ L(i)| = 0, that is, that no index in L(i) is upflipping. Then

E[∆_L(i) | A_i] · Pr(A_i) = E[∆_L(i) | A_i ∩ S_i] · Pr(A_i ∩ S_i) + E[∆_L(i) | A_i ∩ S̄_i] · Pr(A_i ∩ S̄_i)

by the law of total expectation. Since ∆_L(i) | A_i is non-negative, all these conditional expectations are non-negative as well. From (2) we thus derive

E[∆_t | X^(t) = s] ≥ Σ_{i∈I, a_i^(t)>0} ( E[∆_L(i) | A_i ∩ S_i] · Pr(A_i ∩ S_i) + E[∆_R(i) | A_i] · Pr(A_i) ).   (3)

We will bound this central inequality further to obtain our result. First observe that for A_i to occur it is necessary that the index i with a_i^(t) > 0 is downflipping. Thus we obtain Pr(A_i) ≤ p. For A_i ∩ S_i to occur it is necessary and sufficient that i is downflipping and that no j ∈ L(i) is upflipping. Consequently, Pr(A_i ∩ S_i) ≥ (p/r) · a_i^(t) · (1 − p)^{n−k(i)} ≥ (p/r) · a_i^(t) · (1 − p)^{n−1}, where a_i^(t) is the i-th entry of the current search point a^(t); note that index i may flip down in a_i^(t) different ways. We also estimate E[∆_L(i) | A_i ∩ S_i] ≥ 1 · g_i. By pessimistically assuming that every index j ∈ R(i) that is touched by the mutation flips up with ℓ(j) = −r, we obtain E[∆_R(i) | A_i] ≥ −r · p · Σ_{j∈R(i)} g_j. These estimates and (3) yield

E[∆_t] ≥ Σ_{i∈I, a_i^(t)>0} ( (p/r)(1 − p)^{n−1} a_i^(t) g_i − rp^2 Σ_{j∈R(i)} g_j )
≥ Σ_{i∈I, a_i^(t)>0} ( (p/r)(1 − p)^{n−1} (a_i^(t) g_i/g_{k(i)}) γ_{k(i)} − rp^2 Σ_{j=1}^{k(i)−1} γ_j ).

Now we look at each inner term individually and thus fix i ∈ I with a_i^(t) > 0. To ease reading, let δ := p · (1 − p)^{n−1}.
The i-th term can then be transformed as follows:

(δ/r) · (a_i^(t) g_i/g_{k(i)}) · γ_{k(i)} − rp^2 Σ_{j=1}^{k(i)−1} γ_j
= (δ/r) · (a_i^(t) g_i/g_{k(i)}) · γ_{k(i)} − (δ/α)(γ_{k(i)} − 1)
≥ (1/r − 1/α) · δ · (a_i^(t) g_i/g_{k(i)}) · γ_{k(i)}
= (1/r − 1/α) · p · (1 − p)^{n−1} · a_i^(t) · g_i.

The inequality stems from Lemma 8, which gives us g_i/g_{k(i)} ≥ 1, as well as from the fact that a_i^(t) ≥ 1. Computing the sum gives us the lower bound

E[∆_t | X^(t) = s] ≥ Σ_{i∈I, a_i^(t)>0} (1/r − 1/α) · p · (1 − p)^{n−1} · a_i^(t) · g_i
= (1/r − 1/α) · p · (1 − p)^{n−1} · Σ_{i∈I, a_i^(t)>0} a_i^(t) g_i
= (1/r − 1/α) · p · (1 − p)^{n−1} · g(a^(t))
= (1/r − 1/α) · p · (1 − p)^{n−1} · s,

which is exactly what we wanted to show.

5. LOWER BOUNDS
In this section, we give a simple proof showing that the upper bound claimed in the previous section is asymptotically optimal for mutation probability p = c/n, where c > 0 is constant. Thus for the rest of this section let p = c/n. Throughout this section, we consider a typical run of Algorithm 1. We call an index i ∈ {1, . . . , n} good if x_i was zero in the random initial individual, or if x_i was mutated to 0 at some previous time t. Otherwise we call the index interesting.

Lemma 10. After the initialization of Algorithm 1, with probability 1 − e^{−(1/36)n} at least (1/3)n indices are interesting.

Lemma 11. Let ε > 0. The probability that after T = (1 − ε)(nr/c − 1) ln n iterations all indices from a fixed set of n/3 initially interesting indices are good, is at most e^{−(1/3)n^ε}.

Combining Lemma 10 and Lemma 11, we immediately obtain that with high probability, after T = (1 − ε)(1/c)(nr − c) ln n iterations not all interesting indices have become good; hence Algorithm 1 cannot have reached its optimum.

Corollary 12. With high probability, the optimization time of Algorithm 1 for an arbitrary linear function f : {0, 1, . . . , r}^n → R is in Ω(rn log n).

We conclude by proving the lemmata. Lemma 10 uses the following theorem due to Chernoff [1, 3].

Theorem 13 (Chernoff bound). Let X_1, . . . , X_n be independent random variables taking values in [0, 1]. Let X = Σ_{i=1}^n X_i. Let δ ≥ 0. Then

Pr(X ≤ (1 − δ) E[X]) ≤ e^{−E[X] δ^2/2}.

Proof of Lemma 10. For each i ∈ {1, . . . , n} define a binary random variable X_i such that X_i = 1 if and only if index i is interesting after initialization. Obviously, we have E[X_i] = Pr(X_i = 1) = r/(r + 1) ≥ 1/2. Observe that the X_i are mutually independent by the definition of Algorithm 1. Now let X := Σ_{i=1}^n X_i be the number of interesting indices; then E[X] ≥ n/2. Applying Theorem 13 with δ = 1/3, we obtain

Pr(X ≤ n/3) ≤ Pr(X ≤ (2/3) E[X]) ≤ e^{−(1/18) E[X]} ≤ e^{−(1/36)n},

which proves the claim.

Proof of Lemma 11. Consider an index i that is interesting in the initial individual. First observe that 1 − c/(nr) is the probability for i to stay interesting during an iteration. Thus after nr/c − 1 iterations the probability for i to have stayed interesting is (1 − c/(nr))^{nr/c − 1}. Observing that (1 − 1/n)^{n−1} ≥ 1/e for all n ∈ N, we compute

(1 − c/(nr))^{nr/c − 1} ≥ 1/e.

Consequently, the probability that index i remains interesting after T := (1 − ε)(nr/c − 1) ln n iterations is

(1 − c/(nr))^T = ( (1 − c/(nr))^{nr/c − 1} )^{(1−ε) ln n} ≥ (1/e)^{(1−ε) ln n} = n^{−1+ε}.

Let I be a set of n/3 indices which are all interesting after initialization. Then the probability that all of them are good after T iterations is

(1 − (1 − c/(nr))^T)^{n/3} ≤ (1 − n^{−1+ε})^{n/3} ≤ e^{−n^{−1+ε} · n/3} = e^{−(1/3)n^ε}.

6. EXPERIMENTS
We ran several simple experiments to see whether our upper or our lower bound is closer to the truth. In Fig. 1, we depict the outcomes of the following experiment. In dimension n = 100, for r = 1 (the classic bit-string case) and each r ∈ {5, 10, 15, . . . , 100}, we measured the run-times (number of iterations) in 500 runs of the (1+1) EA with standard mutation probability p = 1/n optimizing the function (r + 1)-VAL : {0, . . . , r}^n → R, x ↦ Σ_{i=1}^n (r + 1)^{i−1} x_i. This could be one of the more difficult linear functions, given that it is as different from OneMax as possible, and OneMax provably is the easiest linear function in the case r = 1. Depicted in Fig. 1 are the average run-times together with the variances.

Comparing the shape of the plots and the values, we feel that the O(r^3 n log log n) term in our upper bound most likely is not needed; rather, we see a linear dependence on r. This is supported by the fact that the least-squares linear approximation r ↦ 1568.97r − 1962.22 has a coefficient of determination of over 0.998.
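The experiment can be reproduced with a short script. The following sketch (our own code, with far smaller run counts and dimension than in the paper so that it finishes quickly) measures the average run-time of the (1+1) EA on (r + 1)-VAL:

```python
import random

def rtime_on_rval(n, r, runs=10, seed=0, cap=10**6):
    """Average number of iterations of the (1+1) EA with p = 1/n minimizing
    (r+1)-VAL(x) = sum_i (r+1)^(i-1) * x_i over {0,...,r}^n."""
    rng = random.Random(seed)
    w = [(r + 1) ** i for i in range(n)]
    total = 0
    for _ in range(runs):
        x = [rng.randint(0, r) for _ in range(n)]
        fx = sum(wi * v for wi, v in zip(w, x))
        t = 0
        while fx > 0 and t < cap:
            t += 1
            y = list(x)
            for i in range(n):
                if rng.random() < 1.0 / n:
                    v = rng.randrange(r)              # uniform alternative value
                    y[i] = v if v < x[i] else v + 1
            fy = sum(wi * v for wi, v in zip(w, y))
            if fy <= fx:
                x, fx = y, fy
        total += t
    return total / runs

# Compare, e.g., r = 1 and r = 4 at small n; the averages grow roughly linearly in r.
avg1, avg4 = rtime_on_rval(20, 1), rtime_on_rval(20, 4)
```

With the paper's parameters (n = 100, 500 runs per r value) the same loop reproduces the data behind Fig. 1, at the cost of a noticeably longer running time.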
Figure 1: Run-time of the (1+1) EA optimizing (r + 1)-VAL for varying r. In all runs, we have n = 100 and a mutation probability of 1/n. Plotted are the averages over 500 runs for each r value, while the vertical bars represent the respective variance. The straight line is y = 1249.3x.

The straight line depicted represents the (e^c/c)rn ln(n) part of our theoretical upper bound, which for c = 1 is approximately the linear function r ↦ 1249.3r, since 1249.3 ≈ e · 100 ln 100. Hence our experimental results also suggest that the O(r^3 n log log n) part of our upper bound hides a term linear in r.

7. CONCLUSION
In this work, we analyzed the run-time of the classic (1+1) EA when optimizing arbitrary linear functions defined over {0, 1, . . . , r}^n. For r = 1, that is, pseudo-Boolean linear functions, this is maybe the most notorious problem in the theory of evolutionary computation, which after a series of papers over the last ten years is now well understood.

For larger values of r, only the case r = 2 was previously analyzed. Also, it was proven that previous approaches via universal drift functions cannot work for r ≥ 43. In this work, we show that the adaptive drift analysis method [4], and in particular a natural extension of Witt's adaptive potential functions [15], can be used to overcome these difficulties.

We derive run-time bounds for all r and all mutation probabilities p. For the reasonable setting that p = c/n for an arbitrary constant c > 0, our bound is O(rn log n) + O(r^3 n log log n). We give a simple lower bound of Ω(rn log n), showing that our analysis is tight apart from constant factors if r = o((log n)^{1/3}). These constant factors depend on the mutation probability only. For the common case of p = 1/n, this constant asymptotically is only e ≈ 2.718. Hence the good news stemming from this work is that the methods developed in the last ten years also allow us to obtain good results for search spaces larger than bit-strings.

We are optimistic that the constant-factor gaps between upper and lower bounds for r = o((log n)^{1/3}) can be closed with established methods: first, show that OneMax is the easiest linear function also for r > 1, then analyze OneMax exploiting its nice symmetric structure. See [6, 15] for the details.

More challenging, and also more important for a deeper understanding of optimization on larger search spaces, is determining the influence of r on the optimization time for r = ω((log n)^{1/3}). Here we foresee two difficulties. The first is that the drift behavior is not linear anymore. Consider the one-dimensional toy case f : {0, . . . , r} → R, x ↦ x. Taking f itself as potential function, we easily compute E[X^(t) − X^(t+1) | X^(t) = s] = ps(s + 1)/2, that is, we observe a quadratic drift behavior. To exploit this optimally, one most likely needs Johannsen's variable drift theorem [13].

To make things worse, these quadratic behaviors mix with linear ones. Let f : {0, 1, . . . , n}^n → N, x ↦ Σ_{i=1}^n x_i, be the analogue of the OneMax function defined for r = n. From a search point like (n, 0, . . . , 0), we do expect a quadratic behavior as above (the analysis is harder due to the possibility of 0-entries flipping in the wrong direction when making profit in the first entry). On the other hand, for x = (1, . . . , 1), we rather expect a linear drift behavior similar to the one we observe in this paper for smaller r. It will be challenging to find potential functions properly representing both types of behavior.

8. REFERENCES
[1] A. Auger and B. Doerr, editors. Theory of Randomized Search Heuristics, volume 1 of Series on Theoretical Computer Science. World Scientific, 2011.
[2] S. Baswana, S. Biswas, B. Doerr, T. Friedrich, P. P. Kurur, and F. Neumann. Computing single source shortest paths using single-objective fitness. In FOGA '09: Proceedings of the 10th ACM Workshop on Foundations of Genetic Algorithms, pages 59–66. ACM, 2009.
[3] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statistics, 23(4):493–507, 1952.
[4] B. Doerr and L. A. Goldberg. Adaptive drift analysis. In PPSN '10: Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, pages 32–41, 2010.
[5] B. Doerr, D. Johannsen, and M. Schmidt. Runtime analysis of the (1+1) evolutionary algorithm on strings over finite alphabets. In FOGA '11: Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, pages 119–126. ACM, 2011.
[6] B. Doerr, D. Johannsen, and C. Winzen. Drift analysis and linear functions revisited. In CEC '10: Proceedings of the 2010 IEEE Congress on Evolutionary Computation, pages 1967–1974. IEEE, 2010.
[7] B. Doerr, D. Johannsen, and C. Winzen. Multiplicative drift analysis. In GECCO '10: Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, pages 1449–1456. ACM, 2010.
[8] S. Droste, T. Jansen, and I. Wegener. On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science, 276(1–2):51–81, 2002.
[9] C. Gunia. On the analysis of the approximation capability of simple evolutionary algorithms for scheduling problems. In GECCO '05: Proceedings of the 7th Annual Genetic and Evolutionary Computation Conference, pages 571–578. ACM, 2005.
[10] J. He and X. Yao. Drift analysis and average time complexity of evolutionary algorithms. Artificial Intelligence, 127(1):51–81, 2001.
[11] J. He and X. Yao. A study of drift analysis for estimating computation time of evolutionary algorithms. Natural Computing, 3(1):21–35, 2004.
[12] J. Jägersküpper. A blend of Markov-chain and drift analysis. In PPSN '08: Proceedings of the 10th International Conference on Parallel Problem Solving from Nature, pages 41–51. Springer, 2008.
[13] D. Johannsen. Random Combinatorial Structures and Randomized Search Heuristics. PhD thesis, Universität des Saarlandes, 2010. Available online at http://scidok.sulb.uni-saarland.de/volltexte/2011/3529/pdf/Dissertation 3166 Joha Dani 2010.pdf.
[14] J. Scharnow, K. Tinnefeld, and I. Wegener. The analysis of evolutionary algorithms on sorting and shortest path problems. Journal of Mathematical Modelling and Algorithms, 3(4):349–366, 2004.
[15] C. Witt. Optimizing linear functions with randomized search heuristics - the robustness of mutation. In STACS '12: Proceedings of the 29th Annual Symposium on Theoretical Aspects of Computer Science, pages 420–431, 2012.