
A Method to Derive Fixed Budget Results
From Expected Optimisation Times
Benjamin Doerr
Max Planck Institute for Informatics
Saarbrücken, Germany
[email protected]

Thomas Jansen
Dep. of Computer Science
Aberystwyth University
Aberystwyth, SY23 3DB, UK
[email protected]

Carsten Witt
DTU Compute
Technical University of Denmark
2800 Kgs. Lyngby, Denmark
[email protected]

Christine Zarges
School of Computer Science
University of Birmingham
Birmingham B15 2TT, UK
[email protected]
ABSTRACT
At last year's GECCO a novel perspective for theoretical performance analysis of evolutionary algorithms and other randomised search heuristics was introduced that concentrates on the expected function value after a pre-defined number of steps, called budget. This is significantly different from the common perspective where the expected optimisation time is analysed. While there is a huge body of work and a large collection of tools for the analysis of the expected optimisation time, the new fixed budget perspective introduces new analytical challenges. Here it is shown how results on the expected optimisation time that are strengthened by deviation bounds can be systematically turned into fixed budget results. We demonstrate our approach by considering the (1+1) EA on LeadingOnes and significantly improving previous results. We prove that deviating from the expected time by an additive term of ω(n^{3/2}) happens only with probability o(1). This is turned into tight bounds on the function value using the inverse function. We use three increasingly strong or general approaches to proving the deviation bounds, namely via Chebyshev's inequality, via Chernoff bounds for geometric random variables, and via variable drift analysis.

Keywords
runtime analysis, fixed budget computation, tail bounds, (1+1) EA, LeadingOnes

1. INTRODUCTION
Evolutionary algorithms (like all randomised search heuristics) are applied to find solutions for difficult problems in
situations where no good problem-specific algorithms are
available. They are incomplete optimisers in the sense that
it is not guaranteed that an optimal solution is found in
any given run of finite length. The general hope is that they
find sufficiently good solutions in reasonable time. Note that
even if an optimal solution is found the user has usually no
way of knowing this. In stark contrast to this way of applying evolutionary algorithms their theoretical analysis often
concentrates on the expected optimisation time [1]. For this
scenario many results and analytical methods are known,
for typical example problems [9] as well as for combinatorial
optimisation problems [12]. This analytical perspective has
contributed to the gap between theory and practice
in the field.
Addressing this problem, Jansen and Zarges [10] have introduced a novel perspective, called fixed budget computations. They suggest considering the function value that has
been achieved after a fixed number b of function evaluations.
This perspective is similar to the consideration of best-so-far curves, a popular way of evaluating the performance of
evolutionary algorithms in empirical studies [16]. In [10],
two particularly simple randomised search heuristics, random local search and the (1+1) EA, are considered for two
well-known example functions, OneMax and LeadingOnes.
The analysis is rather long and complicated and does not
lead to tight bounds in most cases. One of the conclusions
and observed open problems is the need for the development
of novel analytical methods to facilitate analysis in the fixed
budget scenario. We address this need here.
We present a method that allows one to turn results on
the expected time needed to reach a specific function value
into a fixed budget result, provided that the original result
is accompanied by tail bounds. Starting from commonly available
results on the expected time, three major steps are involved.
Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems
General Terms
Algorithms, Design, Performance, Theory
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
GECCO’13, July 6–10, 2013, Amsterdam, The Netherlands.
Copyright 2013 ACM 978-1-4503-1963-8/13/07 ...$10.00.
First, the results on the expected optimisation time need to
be generalised into results on the expected time needed to
reach a solution of a pre-specified quality. This can usually
easily be done. Second, the results need to be accompanied by tail bounds for the probability to deviate from the
expected times. We present three concrete examples for doing this for LeadingOnes, based on Chebyshev’s inequality,
based on a rarely used Chernoff bound for sums of geometric random variables, and based on variable drift analysis.
Using Chebyshev's inequality
we prove that the probability to deviate from the expected time to reach a leading 1-bits
on LeadingOnes by an additive term of d is bounded above
by O(n^3/d^2), where n is the number of bits. The exact result allows for concrete statements for rather small values of
n. Via Chernoff bounds for sums of independent geometric
random variables, we can reduce the above
error probability to a much stronger exp(-Θ(d^2/n^3)), however giving
weaker bounds for concrete small values of d^2/n^3. Both the
Chebyshev and the Chernoff approach exploit the particular knowledge of the distribution of the random variables
describing the runtime of the algorithm. Since such knowledge may often not be present, we also develop a proof of
concept for a more universal approach to obtain tail bounds
from variable drift analysis. This allows us to bound the
error probability by exp(-Ω(d/n^{3/2+ε})), which is still exponentially small if d = Ω(n^{3/2+2ε}).
In the next section we formally define the notions, algorithm, and function. Section 3 introduces our approach via
the inverse function of the expected optimisation time. This
is followed by our results obtained using Chebyshev’s inequality (Section 5), Chernoff bounds (Section 6), and variable drift analysis (Section 7). We conclude with a comparison of our bounds with the previous results in Section 8.
2. FRAMEWORK AND DEFINITIONS
Our main contribution is a general method to derive fixed
budget results from results about the expected optimisation time. We demonstrate the approach using a specific
simple evolutionary algorithm, the (1+1) EA, and a specific well-known example function, LO (often called LeadingOnes). This addresses an open problem posed by Jansen and
Zarges [10] when introducing the fixed budget perspective.

We concentrate on the simplest evolutionary algorithm,
the (1+1) EA. It is similar to random local search since it
also has a population of size only 1, creates one new search
point in each generation, and deterministically replaces the
current population with the new search point if this is of at
least equal fitness. It differs from random local search since
the new search point is not selected from a neighbourhood
but generated by means of standard bit mutations. This
difference can make the two algorithms perform very differently [6]. Specifically in the fixed budget framework this
difference can complicate the analysis drastically [10].

The example function we consider yields as function value
the number of leading one bits counted from left to right.
Its formal definition is given below.

Definition 2. The function LO: {0, 1}^n → R is defined by

    LO(x) = Σ_{i=1}^{n} Π_{j=1}^{i} x[j].

It has long been known that the expected optimisation
time of the (1+1) EA on LO is Θ(n^2) and that the expectation is sharply concentrated [7]. Böttcher et al. [2] determined the exact expected optimisation time of the (1+1) EA
on LO (even for arbitrary mutation probabilities). We will
build on their results in Section 4.

We adopt the fixed budget perspective and analyse the
function value that can be achieved with b function evaluations. For the (1+1) EA, this means we are interested in
f(x_b). Since f(x_b) is a random variable we are interested in
its expectation E(f(x_b)) as well as properties of its distribution, e. g., bounds on the probabilities to achieve at least or
at most a given function value.

3. FIXED-BUDGET RESULTS VS. OPTIMISATION TIMES
We want to obtain results on the function value after b
function evaluations, f(x_b), for the (1+1) EA. Theoretical
analysis usually concentrates on the time needed for optimisation. It has been pointed out [9] that it is not difficult to
generalise this to optimisation goals other than exact optimisation. The same methods used for analysing the expected
optimisation time can still be applied without any change.
Consider T_{A,f}(a), the number of function evaluations some
specific algorithm A needs to reach function value a for some
specific fitness function f. Formally,

    T_{A,f}(a) = min{t | f(x_t) ≥ a}.
If the algorithm A is deterministic then the inverse function T_{A,f}^{-1}(b) yields the function value reached after b function evaluations have been made. However, we are dealing with randomised search heuristics and T_{A,f}(a) is a random variable. Note that if we consider the expected number of function evaluations needed to reach function value a,
E(T_{A,f}(a)), its inverse function does not necessarily coincide
with the expected function value after b function evaluations,
as the following tiny example shows.

Consider an algorithm A and a function f where f(x_0) =
0 with probability 1. For t > 0, it is defined as follows.
If f(x_{t-1}) = 0 then f(x_t) = 1 with probability 1/2 and
f(x_t) = 0 otherwise. If f(x_{t-1}) = 1 then also f(x_t) = 1 with
probability 1. Clearly, E(T_{A,f}(0)) = 0 and E(T_{A,f}(1)) = 2
holds. For the expected function value we have E(f(x_b)) =
Pr(f(x_b) = 1) = 1 - (1/2)^b. We notice two things. First,
the inverse function E_{T_{A,f}}^{-1}(b) of E(T_{A,f}(a)) is only defined
for b ∈ {0, 2}. Second, if we consider E_{T_{A,f}}^{-1}(2) = 1, we see
that this is different from E(f(x_2)) = 1 - (1/2)^2.
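The tiny example above can be checked with a few lines of exact arithmetic (a minimal sketch; the function name `expected_value_after` is ours, not the paper's):

```python
from fractions import Fraction

def expected_value_after(b):
    """E(f(x_b)) for the toy process: f starts at 0, flips to 1 with
    probability 1/2 in each step, and is absorbed at 1."""
    # Pr(f(x_b) = 1) = 1 - (1/2)^b, and f only takes values 0 and 1.
    return 1 - Fraction(1, 2) ** b

# E(T(1)) = 2: the hitting time of value 1 is geometric with p = 1/2.
expected_time_to_reach_1 = 2

# The inverse of a -> E(T(a)) maps budget 2 to value 1, but the
# expected function value after 2 steps is smaller:
print(expected_value_after(2))  # 3/4, not 1
```

This makes the discrepancy concrete: the inverse of the expected time yields 1, while the expected value after two steps is only 3/4.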
Things are different if we have bounds on the time needed.
Suppose we know an "upper tail" function t_u(a) such that
Pr(T_{A,f}(a) ≥ t_u(a)) ≤ p_u. Then Pr(f(x_b) ≤ t_u^{-1}(b)) ≤ p_u
holds. Analogously, if we know a "lower tail" t_l(a) s. t.
Pr(T_{A,f}(a) ≤ t_l(a)) ≤ p_l, then Pr(f(x_b) ≥ t_l^{-1}(b)) ≤ p_l.
Bounds for the probability of T_{A,f}(a) being above or below
some values are typically derived as bounds that T_{A,f}(a)
Algorithm 1. (1+1) Evolutionary Algorithm ((1+1) EA)
1. t := 0. Select x_t ∈ {0, 1}^n uniformly at random. Evaluate f(x_t).
2. While t + 1 < b do
3.   t := t + 1.
4.   y := x_{t-1}. For each i ∈ {1, 2, ..., n}: with probability 1/n flip the i-th bit in y.
5.   Evaluate f(y).
6.   If f(y) ≥ f(x_{t-1}) then x_t := y, else x_t := x_{t-1}.
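For concreteness, Algorithm 1 on LeadingOnes can be sketched in a few lines of Python (a minimal simulation for illustration; the function names are ours):

```python
import random

def leading_ones(x):
    """LO(x): the number of leading 1-bits, counted from the left."""
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count

def one_plus_one_ea(n, budget, rng):
    """Run the (1+1) EA with standard bit mutation (rate 1/n) for
    `budget` fitness evaluations and return the final LO value."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = leading_ones(x)
    for _ in range(budget - 1):   # the initial evaluation counts, too
        y = [1 - b if rng.random() < 1.0 / n else b for b in x]
        fy = leading_ones(y)
        if fy >= fx:              # accept on ties, as in Algorithm 1
            x, fx = y, fy
    return fx

rng = random.Random(42)
print(one_plus_one_ea(100, 5000, rng))
```

Note that the loop body performs exactly one function evaluation, so the budget b counts evaluations, not generations.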
deviates from its expected value E(TA,f (a)). Since the fixed
budget perspective requires results to be as tight as possible
we are interested in determining the expectation E(TA,f (a))
as exactly as we can and derive bounds for the probability
to deviate from it that are as sharp as possible. For our
work here we start with analysing T(1+1) EA,LO (a), in the
following abbreviated as T (a), exactly.
4. EXACT OPTIMISATION TIME FOR LO
We use the precise analysis by Böttcher et al. [2] as starting point. Let A_i denote the random number of mutations
needed to increase the function value from i to something
larger. Böttcher et al. [2] prove that E(A_i) = 1/((1 - p_i)^i p_i)
holds (Lemma 2 and Theorem 1 in [2]), where p_i denotes the
mutation probability. We consider the case of a constant
mutation probability of 1/n, so this simplifies to E(A_i) =
n/(1 - 1/n)^i. They carry on determining the expected number of mutations until the function value is increased from
i to n (Lemma 3 and Theorem 2 in [2]). This is easily generalised to the expected number of mutations needed to increase the function value from i to at least j. If B_{i,j} denotes
the random number of mutations needed we have

    E(B_{i,j}) = E(A_i) + (1/2) · Σ_{k=i+1}^{j-1} E(A_k).

From this, we now derive an expression for E(T(a)). If
LO(x_0) ≥ a, the expected number of mutations is 0. Hence,

    E(T(a)) = Σ_{i=0}^{a-1} Pr(LO(x_0) = i) · E(B_{i,a})
            = Σ_{i=0}^{a-1} (1/2)^{i+1} · (E(A_i) + (1/2) · Σ_{j=i+1}^{a-1} E(A_j))
            = (1/2) · Σ_{i=0}^{a-1} E(A_i)

holds. From E(A_i) = n/(1 - 1/n)^i, we derive

    E(T(a)) = ((n^2 - n)/2) · ((1 + 1/(n-1))^a - 1).

Since our focus in this work is on results more precise than
merely computing expectations, we shall discuss the actual
random variables now. To this aim, we regard a modified optimisation process. It is identical to the one of the (1+1) EA
optimising LO with a random initial search point apart from
the fact that the mutation operator does not flip bits with
index higher than the current LO value plus one. Since in
both processes these bits are independent and uniformly distributed in {0, 1}, both processes are equivalent in the sense
that the random search point at any time t is equally distributed in both processes. In particular, all statements on
runtimes and fitnesses reached after certain times hold for
both processes.

For this reason, let us now analyse the runtime of the
modified process. Denote by x = (x[1], ..., x[n]) the random initial search point. Since in the modified process bits
higher than the LO value plus one never change, we observe
that to gain a fitness level of at least a exactly those fitness
levels i have to be left that satisfy i < a and x[i] = 0. The
waiting time to leave the i-th level again is a geometric random variable A_i with success probability q_i := (1 - 1/n)^i/n.
Consequently, when conditioning on x, the time T(a) needed
to reach a fitness level of a or higher is Σ_{i ∈ {j ∈ [a-1] | x[j]=0}} A_i.
Since the x[j] are independent and uniformly distributed in
{0, 1}, we have the following.

Lemma 5. Let a ∈ {0, ..., n}. The random time T(a)
taken by the (1+1) EA started with a random search point
to find a solution with LO value at least a can be written as

    T(a) = Σ_{i=0}^{a-1} X_i A_i,

where the X_i are uniformly distributed in {0, 1}, the A_i
are geometric random variables with parameter q_i = (1 - 1/n)^i/n,
and all X_i and A_i are mutually independent.
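The closed form for E(T(a)) can be verified against the sum (1/2)·Σ E(A_i) with exact rational arithmetic (a small sanity check of our own, not part of the paper):

```python
from fractions import Fraction

def expected_A(i, n):
    """E(A_i) = n / (1 - 1/n)^i: expected waiting time on level i."""
    return Fraction(n) / (1 - Fraction(1, n)) ** i

def expected_T_sum(a, n):
    """E(T(a)) = (1/2) * sum_{i=0}^{a-1} E(A_i)  (Section 4)."""
    return Fraction(1, 2) * sum(expected_A(i, n) for i in range(a))

def expected_T_closed(a, n):
    """Closed form ((n^2 - n)/2) * ((1 + 1/(n-1))^a - 1)."""
    return Fraction(n * n - n, 2) * ((1 + Fraction(1, n - 1)) ** a - 1)

# The geometric series identity makes the two expressions agree exactly.
for n in (5, 20, 100):
    for a in (1, n // 2, n):
        assert expected_T_sum(a, n) == expected_T_closed(a, n)
print(expected_T_closed(100, 100))  # E(T(n)) for n = 100
```

Since both expressions are rational numbers, the comparison is exact rather than approximate.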
5. BOUNDS WITH THE CHEBYSHEV INEQUALITY
We have demonstrated in Section 3 that knowledge of
E(T_{A,f}(a)) and its inverse function is not sufficient to compute E(f(x_b)). Therefore, we want to derive bounds for the
event that T(a) deviates from its expectation. The simplest way of doing this would be Markov's inequality [3].
However, since this only delivers bounds for deviations by
at least constant factors (Pr(T(a) ≥ c·E(T(a))) ≤ 1/c) it is
too weak for our purposes. Chebyshev's inequality [3] is of
similar generality as Markov's inequality but delivers much
tighter bounds. However, it requires at least bounds
on the variance of the random variable.

For the expected function values some results are known
[10]. We cite the results here and show how our methods
help to improve over them in the following.

Theorem 3 (Theorem 8 from [10]). Let the budget b
= (1 - β)n^2/α(n) for any β with (1/2) + β' < β < 1,
where β' is a positive constant and α(n) = ω(1), α(n) ≥ 1. Then
E(LO(x_b)) = 1 + (2b/n) - O(b/(nα(n))) = 1 + (2b/n) - o(b/n).

Theorem 4 (Theorem 9 from [10]). Let the budget b
= cn^2 for any constant c with 0 < c < 1/2. Then

    Pr( ce^{-c}(1 + e^{-ce^{-c}}) ≤ LO(x_b)/n ≤ c(1 + e^{-ce^{-c}}) ) = 1 - o(1).

Note that Theorem 3 is already tight but only holds for
budgets b = o(n^2). For budgets b = Θ(n^2) we have bounds
in Theorem 4 but these are not tight.

Theorem 6. For any d = ω(n^{3/2}) and b ≤ E(T(n)) the
following holds: Pr(LO(x_b) = n·ln(1 + 2b/n^2) ± Θ(d/n)) =
1 - o(1).

Note that our approach via the inverse function of E(T(a))
only makes sense for budgets b ≤ E(T(n)). Larger budgets
are beyond the domain of E(T(a)) so that its inverse function does not yield useful values. Therefore, we work with
budgets b ≤ E(T(n)) in the following.

Proof. We use E(T(a)) from Lemma 5. Since A_i is geometrically distributed with parameter q_i = (1 - 1/n)^i/n,
E(A_i) = 1/q_i and Var(A_i) = 1/q_i^2 - 1/q_i are well-known.
Consequently, we compute

    E(A_i^2) = Var(A_i) + E(A_i)^2 = 2/q_i^2 - 1/q_i,
    E((X_i A_i)^2) = E(X_i A_i^2) = (1/2)·E(A_i^2) = 1/q_i^2 - 1/(2q_i),
    E(X_i A_i) = (1/2)·E(A_i) = 1/(2q_i).
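These moment identities can be sanity-checked by Monte Carlo simulation (our own hedged illustration with an arbitrary parameter q; the paper's argument is exact and does not rely on this):

```python
import random

def sample_geometric(q, rng):
    """Number of Bernoulli(q) trials up to and including the first success."""
    t = 1
    while rng.random() >= q:
        t += 1
    return t

rng = random.Random(7)
q, trials = 0.2, 200_000
samples = [(rng.randint(0, 1), sample_geometric(q, rng)) for _ in range(trials)]
mean_xa = sum(x * a for x, a in samples) / trials
mean_xa_sq = sum((x * a) ** 2 for x, a in samples) / trials

print(mean_xa)     # should be close to 1/(2q) = 2.5
print(mean_xa_sq)  # should be close to 1/q^2 - 1/(2q) = 22.5
```

With a fixed seed the estimates are deterministic and land within a few percent of the exact values.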
    Var(X_i A_i) = E((X_i A_i)^2) - E(X_i A_i)^2
                 = 1/q_i^2 - 1/(2q_i) - 1/(4q_i^2)
                 = (3/4)(1/q_i^2) - (1/2)(1/q_i).

Since the (X_i A_i) are independent, we have Var(T(a)) =
Σ_{i=0}^{a-1} Var(X_i A_i), which after a somewhat lengthy and tedious
calculation gives

    Var(T(a)) = n^3·(3/8)·((1 + 1/(n-1))^{2a-1} - 1) - Θ(n^2).

By Chebyshev's inequality, this yields

    Pr(|T(a) - E(T(a))| ≥ d_a) ≤ (3·((1 + 1/(n-1))^{2a-1} - 1)·n^3) / (8·d_a^2),

so that for any d_a = ω(n^{3/2}) we have for the expected time
upper and lower bounds

    ((n^2 - n)/2)·((1 + 1/(n-1))^a - 1) ± d_a

that hold with probability converging to 1, i. e., 1 - o(1).

To obtain bounds on the expected function value we need
to compute the inverse function. If we use for d_a some d
independent of a we obtain

    Pr( ln((n^2 - n)/(n^2 - n + 2b - 2d))/ln(1 - 1/n) ≤ LO(x_b)
        ≤ ln((n^2 - n)/(n^2 - n + 2b + 2d))/ln(1 - 1/n) ) = 1 - o(1)

for any d = ω(n^{3/2}). If n is not very small we have ln(1 - 1/n) ≈ -1/n (since for x ≈ 0 we have e^x ≈ 1 + x), so that we
obtain Pr(LO(x_b) = n·ln(1 + 2b/(n^2 - n) ± Θ(d/n^2))) = 1 - o(1) for any d = ω(n^{3/2}). We observe that 2b/(n^2 - n) = (2b/n^2) + O(b/n^3) holds and we can replace 2b/(n^2 - n)
by 2b/n^2: the introduced error is O(b/n^3) = O(1/n) and
subsumed in Θ(d/n^2). To simplify further to ln(1 + 2b/n^2) ± Θ(d/n^2), we first observe that we can restrict our interest to d = O(n^2). For larger values of d the error term
Θ(d/n^2) dominates and the theorem's statement becomes
trivial. With d = O(n^2) we have 1 + (2b/n^2) ± Θ(d/n^2) =
Θ(1) and the claim follows by applying the Taylor expansion
of the natural logarithm.

6. USING CHERNOFF BOUNDS FOR SUMS OF GEOMETRIC RANDOM VARIABLES
The bounds on the optimisation time derived in the previous section hold with probability 1 - O(n^3/d^2). Note that
the error term is a polynomial in d, which is due to Chebyshev's inequality. However, it has already been proven using
Chernoff bounds that the actual optimisation time of the
(1+1) EA on LO is within the interval [c_1 n^2, c_2 n^2] for two
constants c_1, c_2 with probability 1 - 2^{-Ω(n)} [7], i. e., with
probability exponentially close to 1. Hence, we could hope
for much stronger tail bounds than proved by Chebyshev's
inequality.

Unfortunately, the result from [7] is too weak for our
purposes since it only bounds the probability of deviations
d = Ω(n^2) from the expectation. As pointed out in the previous section, meaningful fixed-budget results are only obtained if we bound the probability of deviations d = o(n^2),
i. e., deviations by a lower-order term.

In this section, we solve this problem and prove that the
actual runtime deviates from the expected one by an additive term of d only with probability exp(-Θ(d^2/n^3)). A
main difficulty here is that the independent random variables X_i A_i have an unbounded range. This forbids a straightforward application of classic Chernoff bounds. To overcome
this, we use a seemingly lesser known Chernoff bound for
sums of independent random variables having a geometric
distribution, which is due to Scheideler [15] (see [3, Section 1.4.4] for an elementary proof of a similar result for
identically distributed geometric random variables). It takes
some more work to handle the fact that the X_i A_i are not
truly geometric random variables. To this aim, we regard
the conditional expectations E(T(a) | X_1, ..., X_n). All this
leads to the following result.

Theorem 7. For all d ≤ 2n^2, the probability that T(a)
deviates from its expectation of (1/2)(n^2 - n)((1 + 1/(n-1))^a - 1) by at least d, is at most 4·exp(-d^2/(20e^2 n^3)).

To prove this result, let a ∈ [n]. Let X_1, ..., X_n and
A_1, ..., A_n be as in Lemma 5. Write X = (X_1, ..., X_n).
Assume that all these random variables are defined over a
common probability space Ω. Let T = Σ_{i=0}^{a-1} X_i A_i be a
random variable describing the distribution of T(a).

The conditional expectation E(T | X) is a random variable
defined over Ω by E(T | X)(ω) = E(T | X = X(ω)) =
Σ_{ω' ∈ X^{-1}(X(ω))} (Pr(ω')/Pr(X^{-1}(X(ω))))·T(ω'). A simple application of the triangle inequality gives

    |T - E(T)| ≤ |T - E(T|X)| + |E(T|X) - E(T)|.

Consequently, we have

    Pr(|T - E(T)| ≥ 2d) ≤ Pr(|T - E(T|X)| ≥ d) + Pr(|E(T|X) - E(T)| ≥ d)   (1)

for all d. We show that both terms are small when d = ω(n^{3/2}).

Tail Bounds for |T - E(T|X)| via Chernoff Bounds
for Geometric Random Variables. By the law of total
probability, we have Pr(|T - E(T|X)| ≥ d) = Σ_{x ∈ {0,1}^n} Pr(X =
x)·Pr(|T - E(T)| ≥ d | X = x). Let us condition on X = x.
Let S = {i ∈ {0, ..., a-1} | x_i = 1}. Then T = Σ_{i∈S} A_i
is a sum of independent geometric random variables and
E(T) = Σ_{i∈S} E(A_i) is its expectation. We use the following
Chernoff bound for sums of such random variables due to
Scheideler [15, Theorem 3.38].

Theorem 8. Let Y_1, Y_2, ..., Y_s be independent, geometrically distributed random variables with parameters q_1, q_2,
..., q_s ∈ (0, 1). Let q = min{q_i | i ∈ {1, ..., s}}. Let
Y := Σ_{i=1}^{s} Y_i and µ = E(Y).
(a) For all δ > 0,
    Pr(Y ≥ (1 + δ)µ) ≤ (1 + δµq/s)^s · exp(-δµq).
(b) For all δ ∈ [0, 1],
    Pr(Y ≤ (1 - δ)µ) ≤ exp(-δ^2 µq/2).
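Theorem 8(a) can be illustrated empirically for one arbitrary choice of parameters (our own hedged sketch; the specific values of q_i and δ below are illustrative, not from the paper):

```python
import math
import random

def sample_geometric(q, rng):
    """Trials up to and including the first success of a Bernoulli(q)."""
    t = 1
    while rng.random() >= q:
        t += 1
    return t

rng = random.Random(3)
qs = [0.5, 0.3, 0.2, 0.1]          # parameters q_1, ..., q_s
s, q = len(qs), min(qs)
mu = sum(1.0 / qi for qi in qs)    # mu = E(Y)
delta = 0.5

# Theorem 8(a): Pr(Y >= (1 + delta) mu) <= (1 + delta*mu*q/s)^s * exp(-delta*mu*q)
bound = (1 + delta * mu * q / s) ** s * math.exp(-delta * mu * q)
runs = 100_000
hits = sum(
    sum(sample_geometric(qi, rng) for qi in qs) >= (1 + delta) * mu
    for _ in range(runs)
)
empirical = hits / runs
print(empirical, bound)
```

As expected, the empirical upper-tail frequency stays below the stated bound; the bound is loose for such small s, which is consistent with its asymptotic nature.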
We apply this theorem to the A_i, i ∈ S, to q = min{q_i | i ∈
S}, and d ≤ n^2. Note that thus d ≤ n/q. Using (i) that
(1 + c/s)^s for all c ≥ 0 is increasing in s, (ii) the estimate
(1 + δ)/exp(δ) ≤ (1 + δ)/(1 + δ + δ^2/2) = 1 - (δ^2/2)/(1 +
δ + δ^2/2) ≤ 1 - δ^2/5 ≤ exp(-δ^2/5) valid for all δ ∈ [0, 1],
and (iii) finally q ≥ 1/(en), we compute

    Pr(T ≥ E(T) + d) ≤ (1 + dq/s)^s · exp(-dq)
        ≤ ((1 + dq/n)·exp(-dq/n))^n ≤ exp(-(dq/n)^2/5)^n
        ≤ exp(-d^2/(5e^2 n^3)).

For the lower tail, still conditioning on X = x and using
µ ≤ (e - 1)n^2/2, we compute

    Pr(T ≤ E(T) - d) ≤ exp(-d^2 q/(2µ))
        ≤ exp(-d^2/(e(e - 1)n^3)).

Consequently, conditional on X = x, we have Pr(|T -
E(T)| ≥ d) ≤ 2·exp(-d^2/(5e^2 n^3)), which is independent of
x. Therefore,

    Pr(|T - E(T|X)| ≥ d)
        = Σ_{x ∈ {0,1}^n} Pr(X = x)·Pr(|T - E(T)| ≥ d | X = x)
        ≤ Σ_{x ∈ {0,1}^n} 2^{-n}·2·exp(-d^2/(5e^2 n^3))
        = 2·exp(-d^2/(5e^2 n^3)).   (2)

Tail Bounds for |E(T|X) - E(T)|. Note that the random variable E(T|X) by definition only depends on X_1, ...,
X_n. We have E(T|X) - E(T) = Σ_{i=0}^{a-1} (X_i - 1/2)·E(A_i).
Here, the (X_i - 1/2) are independent random variables uniformly distributed in {-1/2, +1/2} and the E(A_i) simply
are numbers of order Θ(n). Hence E(T|X) - E(T) is a sum
of independent random variables each with expectation zero
and range at most E(A_{n-1}) ≤ en. Hence a standard Chernoff bound (like Theorem 1.11 of [3]) for all d ≥ 0 yields

    Pr(|E(T|X) - E(T)| ≥ d) ≤ 2·exp(-2d^2/(a(en)^2))
        ≤ 2·exp(-2d^2/(e^2 n^3)).   (3)

From (1) to (3), Theorem 7 immediately follows.

Notes. We stated Theorem 7 only for d ≤ 2n^2, which is
enough to obtain an exponential failure probability. Larger
d seem therefore not too interesting. They could be handled,
though, easily by exploiting (1 + δ)/exp(δ) = exp(-δ(1 -
o(1))) in the estimates for the upper tail bounds for the sums
of geometric random variables. We did not try to optimise
the constant in the exp(-Θ(d^2/n^3)) term.

7. USING DRIFT ANALYSIS WITH TAIL BOUNDS
The results from the previous two sections heavily rely on
knowing the precise distribution of the random variable describing the runtime (Lemma 5). Such an exact description
is usually not available. It is therefore worth investigating
whether sharp tail bounds can be obtained by other, potentially more general, methods. We show that drift analysis,
which has become the key method for the runtime analysis of randomised search heuristics, is suitable as well for
our purposes. As a proof of concept, we derive in Theorem 9 sharp tail bounds on the optimisation time of the
(1+1) EA on LO, with deviations in lower-order terms only,
using drift analysis combined with the exponential-moment
method (which was only recently introduced to the runtime
analysis of evolutionary algorithms in [4]). Although this result is weaker than the one obtained in the previous section,
two advantages in other respects can be identified: (1) the
method does not need the discussed characterisation of runtime and (2) it can also be used if the (1+1) EA is initialised
with a fixed starting point (instead of a uniform one).

Theorem 9. The time for the (1+1) EA to reach a LO-value of at least a, where 0 < a ≤ n - log n, is bounded from
above by

    ((n^2 - n)/2)·((1 + 1/(n-1))^a - 1)·(1 + O(1/√n)) + d

with probability at least 1 - e^{-dn^{-3/2-ε}}, and from below by

    ((n^2 - n)/2)·((1 + 1/(n-1))^a - (1 + 1/(n-1))^{√n})·(1 - O(1/√n)) - d

with probability at least 1 - e^{-dn^{-3/2-ε} + O(log n)} - 2^{-Ω(√n)}, for
an arbitrarily small constant ε > 0 and any d > 0.

Before we prove the theorem, we look into its consequences
for the function value at time b. As long as the following
bounds do not exceed n - log n, we obtain

    Pr( ln((n^2 - n)/(n^2 - n + 2b - 2d - o(d)))/ln(1 - 1/n) ≤ LO(x_b)
        ≤ ln((n^2 - n)/(n^2 - n + 2b + 2d + o(d)))/ln(1 - 1/n) ) = 1 - 2^{-Ω(n^ε)}

for any d = Ω(n^{3/2+ε}), where ε is an arbitrarily small constant. This is slightly weaker than the bound from the previous section since Theorem 7 only requires d = ω(n^{3/2}).

To prove Theorem 9, we use Lehre's [11] proof of the variable drift theorem and combine it with ideas implicit in Hajek's [8] paper on drift analysis, including the exponential-moment method. We start by defining some notation.

Definition 10. Let x_t denote the current search point of
the (1+1) EA at time t. Its distance to the target a is defined
by d(x_t) := max{0, a - LO(x_t)}.
Given a search point x_t of distance d(x_t) > 0, we denote
by δ(x_t) := d(x_t) - d(x_{t+1}) the decrease in distance in the
next step. Slightly abusing notation, we also write δ(i) := i -
d(x_{t+1} | d(x_t) = i) for the decrease in distance if d(x_t) = i.

We now bound the drift E(δ(i)) (assuming a random x_t
satisfying d(x_t) = i, but omitting this in the expectation for
the sake of readability). At a later point of time, a potential
function based on δ(i) will be defined, and drift analysis will
be carried out with respect to that potential function.

Lemma 11. For all i > 0, E(δ(i)) = (2 - 2^{-n+a-i+1}) ·
(1 - 1/n)^{a-i} · (1/n).

Proof. The leftmost zero-bit is at position a - i + 1. To
increase the LO-value (it cannot decrease), it is necessary
to flip this bit and not to flip the first a - i bits, which is
reflected by the last two terms in the lemma. The first term
is due to the expected number of free-rider bits. Note that
there can be between 0 and n - a + i - 1 such bits. By
the usual argumentation using a geometric distribution, the
expected number of free riders in an improving step equals

    Σ_{k=0}^{n-a+i-1} k·(1/2)^{min{n-a+i-1, k+1}} = 1 - 2^{-n+a-i+1},

hence the expected progress in an improving step is 2 -
2^{-n+a-i+1}.

We now turn to the potential function, which is obtained
from the drift δ(i) in a similar way as in the proof of the
variable drift theorem (cf. [14]).

Definition 12. Given a search point x_t, its potential is
defined by g(x_t) := Σ_{k=1}^{d(x_t)} 1/E(δ(k)). We also write g(i) :=
g(x_t | d(x_t) = i) for the potential at distance i. The random
change of the potential is defined by ∆_t := g(x_t) - g(x_{t+1}).
We also write ∆_t(i) := g(i) - g(x_{t+1} | d(x_t) = i).

The drift E(∆_t) of the potential (more precisely, we mean
E(∆_t | x_0, ..., x_t), which is the same as E(∆_t | x_t) since we
are dealing with a Markov chain) is crucial in the following.
Note that E(δ(i)) is non-decreasing in i. Hence, as in the
proof of the variable drift theorem, we can obtain E(∆_t) ≥ 1.
We make this now explicit and prove an upper bound as well.

Lemma 13. If i > 0 then E(∆_t(i)) ≥ 1. If i > 0 and
a ≤ n - log n then E(∆_t(i)) = 1 + O(1/√n).

Proof. By Lemma 11, E(δ(i)) is non-decreasing in its argument. Hence, if δ(i) = j, we get ∆_t = g(i) - g(i-j) =
Σ_{k=0}^{j-1} 1/E(δ(i-k)) ≥ Σ_{k=0}^{j-1} 1/E(δ(i)) = j·(1/E(δ(i))). This implies
E(∆_t) ≥ Σ_{j=1}^{i} (j/E(δ(i)))·Pr(δ(i) = j) = E(δ(i))/E(δ(i)) = 1, which proves
the lower bound.

For the upper bound, we study the distribution of δ(i).
Using ideas from the proof of Lemma 11, we get Pr(δ(i) =
j) ≤ (1 - 1/n)^{a-i}·(1/n)·2^{-j}. As a ≤ n - log n, we get
E(δ(i)) = (2 - O(1/n))·(1 - 1/n)^{a-i}·(1/n) from Lemma 11.
In the remainder of this proof, we omit the -O(1/n) term
in all calculations involving E(δ(·)). It is clear that this only
can introduce an additive error of O(1/n).

We bound the change of potential, given δ(i) = j. Then

    g(i) - g(i-j) = Σ_{k=0}^{j-1} 1/E(δ(i-k))
        = (1/E(δ(i)))·Σ_{k=0}^{j-1} (1 - 1/n)^{-k}
        = (1/E(δ(i)))·((1 - 1/n)^{-j} - 1)/((1 - 1/n)^{-1} - 1)
        = (1/E(δ(i)))·(n-1)·((1 + 1/(n-1))^j - 1)
        ≤ (1/E(δ(i)))·((n-1)·(1 + j/(n-1) + (j/(n-1))^2) - (n-1))
        = (j + j^2/(n-1))/E(δ(i)),

where the last inequality used (1 + x)^y ≤ e^{xy} ≤ 1 + (xy) +
(xy)^2 for 0 ≤ xy ≤ 1. Hence,

    E(∆_t) ≤ Σ_{j=1}^{i} Pr(δ(i) = j)·(j + j^2/(n-1))/E(δ(i)).

If j < √n, then j + j^2/(n-1) = j·(1 + O(1/√n)). If j ≥ √n,
then Pr(δ(i) = j)·j^2/E(δ(i)) = 2^{-Ω(√n)}. Altogether,
E(∆_t) = 1 + O(1/√n) follows.

We now bound the value of the potential function depending on the distance. As a by-product from additive drift
analysis (using the result from Lemma 13), a bound on g(a)
is also a bound on the expected time to reach at least the
target value a.

Lemma 14. For all i ≥ 0 and a ≤ n - log n, it holds

    g(i) ≤ ((n^2 - n)/2)·((1 - 1/n)^{-a} - (1 - 1/n)^{-a+i})·(1 + O(1/n))

and

    g(i) ≥ ((n^2 - n)/2)·((1 - 1/n)^{-a} - (1 - 1/n)^{-a+i}).

The proof of the lemma is elementary and has been omitted due to space restrictions.

Recall that our aim is to show tail bounds. To this end,
we consider the moment-generating function of the drift.
The following statement (where we again omit the condition "| x_t" for readability) bounds the moment-generating
function of the negated and unnegated drift away from 1.

Lemma 15. If λ = n^{-3/2-ε} for an arbitrarily small constant ε > 0 and g(x_t) > 0, then there is a constant c > 0
such that E(e^{-λ∆_t}) ≤ 1 - λ(1 - cn^{-1/2}) and E(e^{λ∆_t}) ≤
1 + λ(1 + cn^{-1/2}).

Proof. We start with the bound on E(e^{-λ∆_t}). From
Lemma 11, we get E(δ(i)) ≥ 1/(en) for any i > 0, which
implies g(i+j) - g(i) ≤ ejn for any j > 0. Moreover,
similarly as argued before, the probability of decreasing the
distance (i. e., increasing the LO-value) by at least j is at
most 2^{-j+1}/n as j - 1 free-riders and a flip of the leftmost
zero-bit are necessary. Hence, Pr(∆_t ≥ ejn) ≤ 2^{-j+1}/n.
We use this exponential decay to bound the higher-order
moments of ∆_t. More precisely, similarly as in the proof of
the simplified drift theorem [13], we get

    E(e^{-λ∆_t}) ≤ 1 - λ + Σ_{k=2}^{∞} λ^k·E(|∆_t|^k)/k!,

where we have used that E(∆_t) ≥ 1 (Lemma 13). The sum
for k ≥ 2 is at most

    Σ_{k=2}^{∞} (λ^k/k!)·Σ_{j=0}^{n} (e(j+1)n)^k·Pr(∆_t^k ≥ (ejn)^k)

as ∆_t does not take on negative values and the maximum
change of distance is at most n. Now, for any k, the inner
sum is at most

    Σ_{j=0}^{n} (e(j+1)n)^k·(2^{-j+1}/n) = (2(en)^k/n)·Σ_{j=0}^{n} (j+1)^k·2^{-j}.

To bound the sum over j, we distinguish several cases. If
k is constant then obviously Σ_{j=0}^{n} (j+1)^k·2^{-j} = O(1). If
k ≤ n^{1/2+ε/3}, then Σ_{j=0}^{n} (j+1)^k·2^{-j} = O(n^{(1/2+2ε/3)k}) as
2^{-j} = o(j^{-n^{1/2+ε/3}}) for j ≥ n^{1/2+2ε/3}. Otherwise, we use
the trivial bound Σ_{j=0}^{n} (j+1)^k·2^{-j} = O(n^k). This, along
with our choice λ = n^{-3/2-ε}, bounds the outer sum by

    Σ_{k=2}^{6/ε} (C^k·n^{-(1/2+ε)k})/(n·k!)
        + Σ_{k=6/ε+1}^{n^{1/2+ε/3}} ((C^k·n^{-(1/2+ε)k})/(n·k!))·n^{(1/2+2ε/3)k}
        + Σ_{k=n^{1/2+ε/3}+1}^{∞} (C^k·n^{-(1/2+ε)k}·n^k)/(n·k!)
    ≤ O(n^{-2-ε}) + Σ_{k=n^{1/2+ε/3}+1}^{∞} (C^k·n^{(1/2-ε)k})/n^{(1/2+ε/3)k}
    ≤ c/n^{2+ε} = cλ/√n

for a sufficiently large constant c.

Then for all s ≤ t, Pr(g(x_s) = 0) ≤ e^{-λd}. Now the lower
tail follows by the above union bound over t ≤ en^2 steps.
8. CONCLUSIONS
We have presented a general method to derive results on
the function value after a pre-specified number of function
evaluations, called budget. Our method starts with results
on the expected runtime (to find a solution of a given quality) that come with tail bounds and turns these into fixed-budget results by means of the inverse function. The method
has the advantage of allowing for the re-use of a large body
of existing work.
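To make the inversion step concrete, here is a small Python sketch (our own illustration, not code from the paper): `t_bound` plays the role of a monotone bound on the expected time to reach function value a, instantiated with the LeadingOnes-style expression ((n² − n)/2)((1 − 1/n)^{−a} − 1) appearing in our analysis, and `value_at_budget` inverts it numerically to read off a value guarantee at budget b.

```python
def t_bound(a, n):
    """Monotone (in a) runtime-style bound: expected time to reach value a.

    The concrete expression is the LeadingOnes-flavoured term used in our
    analysis; any monotonically increasing runtime bound works here.
    """
    return ((n * n - n) / 2.0) * ((1.0 - 1.0 / n) ** (-a) - 1.0)

def value_at_budget(b, n):
    """Invert t_bound: largest value a with t_bound(a, n) <= b (bisection)."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if t_bound(mid, n) <= b:
            lo = mid  # budget b suffices to reach value mid
        else:
            hi = mid - 1
    return lo

# Example: the value guarantee at budget t_bound(a, n) is exactly a.
n = 100
budget = t_bound(40, n)
reached = value_at_budget(budget, n)
```

The same bisection applies to any runtime bound that is strictly increasing in the target value; tail bounds on the runtime then turn into confidence statements about `reached`.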
We have tested our method on a simple test case, the
(1+1) EA on LO. The (1+1) EA was one of two algorithms
considered in the original paper introducing fixed budget
computations [10] and is more realistic than random local
search, the other of the two algorithms. We have been able
to improve over the bounds from [10] considerably. In Section 5 we use Chebyshev’s inequality to derive tail bounds
(a method that, as Doerr [3] points out, was rarely used in
the theory of randomised search heuristics; this may change
now). In Section 6 we provide asymptotically much stronger
results using Chernoff bounds. Both methods strongly rely
on properties of LO. In Section 7 we also prove strong tail
bounds now using an implicit variable drift theorem with
tail bounds. We are currently working on an explicit version of such a theorem, which is expected to be very useful to
obtain further fixed-budget results. The asymptotic nature
of the much more involved proofs from this section makes it
difficult, however, to derive numerical statements for concrete values of n. Numerical bounds based on Theorem 7
are also not strong enough for the values of n we consider.
Therefore, when we compare our bounds with the ones from
[10] we do so only with the results from Section 5.
We compare the bounds from Theorem 4 of [10] with our
result from Theorem 6 for n ∈ {200, 1000, 5000, 10000} and
plot the bounds in Figure 1. Our bounds are derived numerically using an exact expression for the variance Var (T (a))
in Chebyshev’s inequality. We choose d_a in such a way that
the function value stays between the upper and lower bound
with at least 90% probability.
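The following Python sketch mimics this procedure empirically (our own illustration; it estimates mean and variance of T(a) by simulation instead of using the exact variance expression): Chebyshev's inequality with deviation √10 · σ yields an interval that T(a) hits with probability at least 90%.

```python
import random

def leading_ones(x):
    """LO value: number of one-bits before the first zero."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def time_to_reach(n, a, rng):
    """Steps of the (1+1) EA (flip each bit with prob. 1/n) until LO >= a."""
    x = [rng.randint(0, 1) for _ in range(n)]
    steps = 0
    while leading_ones(x) < a:
        y = [bit ^ 1 if rng.random() < 1.0 / n else bit for bit in x]
        if leading_ones(y) >= leading_ones(x):  # accept if not worse
            x = y
        steps += 1
    return steps

def chebyshev_interval(samples):
    """Mean +- sqrt(10)*std: covers the outcome with prob. >= 90%."""
    m = sum(samples) / len(samples)
    var = sum((s - m) ** 2 for s in samples) / (len(samples) - 1)
    dev = (10.0 * var) ** 0.5
    return m - dev, m + dev

rng = random.Random(42)
runs = [time_to_reach(30, 15, rng) for _ in range(200)]
low, high = chebyshev_interval(runs)
```

Inverting such intervals over a grid of target values a gives the kind of fixed-budget band plotted in Figure 1.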
Remember that the bounds from [10] only make a statement for budgets b < n²/2. Therefore, the corresponding
curves in Figure 1 stop at this point. Our bounds allow
for statements for the whole range of budgets up to the expected optimisation time. For large values of n our bounds
are clearly much better. For smaller values of n and rather
small budgets this is not necessarily the case (see, e. g., the
curves for n = 200 and b ≤ 11000 in Figure 1). Note, however, that the bounds from [10] are asymptotic and it is not
clear that they are actually valid for such small values of n.
We see that our bounds are not only much sharper asymptotically but also allow for concrete (numerical) statements
for concrete values of n and b.
The presentation of our method allows for a wealth of
new results in the fixed budget framework. This requires
revisiting existing results on expected optimisation times,
generalising them to results on the expected time needed
to find results of a pre-defined quality (i. e., results on the
expected approximation time), and strengthening them by
adding tail bounds to the statements on the expected times.
While general tools for this are available (see, e. g., the drift
for constants C, c > 0, since k! ≥ (k/e)^k.
The bound on E(e^{λ∆_t}) is obtained in a very similar way. We have
\[
E(e^{\lambda \Delta_t}) = 1 + \lambda E(\Delta_t) + \sum_{k=2}^{\infty} \frac{\lambda^k E((\Delta_t)^k)}{k!} .
\]
Using Lemma 13, we get E(∆_t) = 1 + O(1/√n). The higher-order moments are bounded as before, such that E(e^{λ∆_t}) ≤ 1 + λ(1 + c/√n) follows by choosing c large enough.
With the previous bound on the moment-generating function, the proof of the desired tail bounds will mostly be applications of Markov’s inequality.
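As a toy numeric illustration of this exponential-moment use of Markov's inequality (our own example, unrelated to the potential g): for X a sum of 50 fair coin flips, Pr(X ≥ x) ≤ E(e^{λX}) e^{−λx}, and the resulting bound indeed dominates the empirical tail.

```python
import math
import random

lam = 0.1          # any lambda > 0 gives a valid bound
threshold = 35     # tail point to bound

# Exact moment-generating function of X = sum of 50 independent fair bits:
# E(e^{lam*X}) = ((1 + e^{lam}) / 2)^50.
mgf = ((1.0 + math.exp(lam)) / 2.0) ** 50

# Markov's inequality applied to e^{lam*X}:
markov_bound = mgf * math.exp(-lam * threshold)

# Empirical tail probability from 2000 samples.
rng = random.Random(1)
samples = [sum(rng.randint(0, 1) for _ in range(50)) for _ in range(2000)]
emp_tail = sum(1 for s in samples if s >= threshold) / 2000.0
```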
Proof of Theorem 9. Let T denote the random time until the distance from the target a is reduced to 0. We start with the upper tail. For any t > 0, we have
\[
\Pr(T > t) = \Pr(g(x_t) \ge g(1)) = \Pr\big(e^{\lambda g(x_t)} \ge e^{\lambda g(1)}\big) \le E\big(e^{\lambda g(x_t)} \mid x_0\big),
\]
where the inequality uses Markov's inequality along with e^{λg(1)} ≥ 1. Now,
\[
E\big(e^{\lambda g(x_t)} \mid x_0\big)
= E\Big(e^{\lambda g(x_{t-1})} \cdot E\big(e^{-\lambda \Delta_{t-1}} \mid x_{t-1}\big) \,\Big|\, x_0\Big)
= e^{\lambda g(x_0)} \cdot \prod_{s=0}^{t-1} E\big(e^{-\lambda \Delta_s} \mid x_s\big),
\]
where the last equality follows inductively (note that this does not assume independence of the ∆_s). Using Lemma 15, our bound is no larger than
\[
e^{\lambda g(x_0)} \cdot \big(1 - \lambda(1 - cn^{-1/2})\big)^{t} \le e^{\lambda g(x_0)}\, e^{-t\lambda(1 - cn^{-1/2})},
\]
which, using Lemma 14 with i = a, is at most
\[
e^{\lambda \frac{n^2 - n}{2}\left((1 - 1/n)^{-a} - 1\right)(1 + O(1/n)) \,-\, t\lambda(1 - cn^{-1/2})} .
\]
As (1 − cn^{−1/2})(1 + 2cn^{−1/2}) ≥ 1, plugging in t := ((n² − n)/2)((1 − 1/n)^{−a} − 1)(1 + C/n)(1 + 2cn^{−1/2}) + d for some sufficiently large constant C finally yields
\[
\Pr(T > t) \le e^{-\lambda d} = e^{-d n^{-3/2-\epsilon}} .
\]
Noting that (1 + O(1/n))(1 + 2cn^{−1/2}) = 1 + O(n^{−1/2}) and (1 − 1/n)^{−1} = 1 + 1/(n − 1), the first statement of the theorem follows.
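The elementary inequality invoked when plugging in t can be checked directly (a one-line calculation; it holds whenever cn^{−1/2} ≤ 1/2, i.e. for n ≥ 4c²):

```latex
(1 - cn^{-1/2})(1 + 2cn^{-1/2})
  = 1 + cn^{-1/2} - 2c^2 n^{-1}
  \ge 1
  \quad\text{iff}\quad cn^{-1/2} \ge 2c^2 n^{-1}
  \quad\text{iff}\quad n \ge 4c^2 .
```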
The second statement, the lower tail, is proved similarly. By a union bound,
\[
\Pr(T < t) \le \sum_{s=0}^{t-1} \Pr(g(x_s) = 0).
\]
Since g(x_s) = 0 is equivalent to e^{−λg(x_s)} ≥ e^{−λ·0} = 1, we get Pr(e^{−λg(x_s)} ≥ 1) ≤ E(e^{−λg(x_s)}) from Markov's inequality. Now,
\[
E\big(e^{-\lambda g(x_s)}\big) = e^{-\lambda g(x_0)} \cdot \prod_{r=0}^{s-1} E\big(e^{\lambda \Delta_r} \mid x_r\big).
\]
By Lemma 15, this is no larger than e^{−λg(x_0)} · e^{sλ(1+cn^{−1/2})}. Due to random initialisation, Pr(d(x_0) ≥ a − √n) = 1 − 2^{−Ω(√n)}. From Lemma 14, we get
\[
g(x_0) \ge \frac{n^2 - n}{2}\left((1 - 1/n)^{-a} - (1 - 1/n)^{-\sqrt{n}}\right)
\]
in this case. Let t := ((n² − n)/2)((1 − 1/n)^{−a} − (1 − 1/n)^{−√n})(1 − 2cn^{−1/2}) − d for some sufficiently large constant c.
[Figure 1 appears here: four panels (n = 200, 1000, 5000, 10000), each plotting function value against the number of function evaluations and showing our lower and upper bounds together with the lower and upper bounds from [10].]
Figure 1: Bounds from Theorem 4 from [10] and Theorem 6 for n = 200 (top left), n = 1000 (top right),
n = 5000 (bottom left) and n = 10000 (bottom right). The function value LO(xt ) stays between the upper
and lower bound with at least 90% probability.
theorem with tail bounds by Doerr and Goldberg [5]), it has
not often been done in the past. We expect to see the need
for more tools to allow for the derivation of such tail bounds
and a renewed interest in the development of such analytical methods. We hope that this line of research proves to
be motivating and intriguing for theoreticians and delivers
results that will be of interest to practitioners.
9. REFERENCES
[1] A. Auger and B. Doerr, editors. Theory of Randomized Search Heuristics. World Scientific, 2011.
[2] S. Böttcher, B. Doerr, and F. Neumann. Optimal fixed and adaptive mutation rates for the LeadingOnes problem. In Proc. of the 11th Int'l Conf. on Parallel Problem Solving From Nature (PPSN 2010), pages 1–10. Springer, 2010. LNCS 6238.
[3] B. Doerr. Analyzing randomized search heuristics: Tools from probability theory. In Auger and Doerr [1], pages 1–20.
[4] B. Doerr, M. Fouz, and C. Witt. Sharp bounds by probability-generating functions and variable drift. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2011), pages 2083–2090. ACM Press, 2011.
[5] B. Doerr and L. Goldberg. Drift analysis with tail bounds. In Proc. of the 11th Int'l Conf. on Parallel Problem Solving From Nature (PPSN 2010), pages 174–183. Springer, 2010. LNCS 6238.
[6] B. Doerr, T. Jansen, and C. Klein. Comparing global and local mutations on bit strings. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2008), pages 929–936. ACM Press, 2008.
[7] S. Droste, T. Jansen, and I. Wegener. On the analysis of the (1+1) evolutionary algorithm. Theoretical Comp. Science, 276(1–2):51–81, 2002.
[8] B. Hajek. Hitting and occupation time bounds implied by drift analysis with applications. Advances in Applied Probability, 14:502–525, 1982.
[9] T. Jansen. Analyzing Evolutionary Algorithms. The Computer Science Perspective. Springer, 2013.
[10] T. Jansen and C. Zarges. Fixed budget computations: A different perspective on run time analysis. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2012), pages 1325–1332. ACM Press, 2012.
[11] P. K. Lehre. Drift analysis (tutorial). In Companion to GECCO 2012, pages 1239–1258. ACM Press, 2012.
[12] F. Neumann and C. Witt. Bioinspired Computation in Combinatorial Optimization. Springer, 2010.
[13] P. S. Oliveto and C. Witt. Simplified drift analysis for proving lower bounds in evolutionary computation. Algorithmica, 59(3):369–386, 2011. Erratum in http://arxiv.org/abs/1211.7184.
[14] J. E. Rowe and D. Sudholt. The choice of the offspring population size in the (1,λ) EA. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2012), pages 1349–1356. ACM Press, 2012.
[15] C. Scheideler. Probabilistic Methods for Coordination Problems, volume 78 of HNI-Verlagsschriftenreihe. University of Paderborn, 2000. Habilitation thesis. Available at: www.cs.jhu.edu/~scheideler/papers/habil.ps.gz.
[16] N. P. Troutman, B. E. Eskridge, and D. F. Hougen. Is "best-so-far" a good algorithmic performance metric? In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2008), pages 1147–1148. ACM Press, 2008.