A Method to Derive Fixed Budget Results From Expected Optimisation Times

Benjamin Doerr, Max Planck Institute for Informatics, Saarbrücken, Germany, [email protected]
Thomas Jansen, Dep. of Computer Science, Aberystwyth University, Aberystwyth, SY23 3DB, UK, [email protected]
Carsten Witt, DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark, [email protected]
Christine Zarges, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK, [email protected]

ABSTRACT

At last year's GECCO a novel perspective for the theoretical performance analysis of evolutionary algorithms and other randomised search heuristics was introduced that concentrates on the expected function value after a pre-defined number of steps, called the budget. This is significantly different from the common perspective, where the expected optimisation time is analysed. While there is a huge body of work and a large collection of tools for the analysis of the expected optimisation time, the new fixed budget perspective introduces new analytical challenges. Here it is shown how results on the expected optimisation time that are strengthened by deviation bounds can be systematically turned into fixed budget results. We demonstrate our approach by considering the (1+1) EA on LeadingOnes and significantly improving previous results. We prove that deviating from the expected time by an additive term of ω(n^{3/2}) happens only with probability o(1). This is turned into tight bounds on the function value using the inverse function. We use three increasingly strong or general approaches to proving the deviation bounds, namely via Chebyshev's inequality, via Chernoff bounds for geometric random variables, and via variable drift analysis.

Keywords

runtime analysis, fixed budget computation, tail bounds, (1+1) EA, LeadingOnes

1. INTRODUCTION

Evolutionary algorithms (like all randomised search heuristics) are applied to find solutions for difficult problems in situations where no good problem-specific algorithms are available. They are incomplete optimisers in the sense that it is not guaranteed that an optimal solution is found in any given run of finite length. The general hope is that they find sufficiently good solutions in reasonable time. Note that even if an optimal solution is found, the user usually has no way of knowing this. In stark contrast to this way of applying evolutionary algorithms, their theoretical analysis often concentrates on the expected optimisation time [1]. For this scenario many results and analytical methods are known, for typical example problems [9] as well as for combinatorial optimisation problems [12]. This analytical perspective has contributed to the gap between theory and practice in the field.

Addressing this problem, Jansen and Zarges [10] have introduced a novel perspective, called fixed budget computations. They suggest considering the function value that has been achieved after a fixed number b of function evaluations. This perspective is similar to the consideration of best-so-far curves, a popular way of evaluating the performance of evolutionary algorithms in empirical studies [16]. In [10], two particularly simple randomised search heuristics, random local search and the (1+1) EA, are considered for two well-known example functions, OneMax and LeadingOnes. The analysis is rather long and complicated and does not lead to tight bounds in most cases. One of the conclusions and observed open problems is the need for the development of novel analytical methods to facilitate analysis in the fixed budget scenario.

We address this need here. We present a method that allows one to turn results on the expected time needed to reach a specific function value into a fixed budget result, provided that the original result is accompanied by tail bounds. Using commonly available results on the expected time, three major steps are involved.
Categories and Subject Descriptors

F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms

Algorithms, Design, Performance, Theory

GECCO'13, July 6–10, 2013, Amsterdam, The Netherlands. Copyright 2013 ACM 978-1-4503-1963-8/13/07.

First, the results on the expected optimisation time need to be generalised into results on the expected time needed to reach a solution of a pre-specified quality. This can usually be done easily. Second, the results need to be accompanied by tail bounds on the probability of deviating from the expected times. We present three concrete examples of doing this for LeadingOnes: based on Chebyshev's inequality, based on a rarely used Chernoff bound for sums of geometric random variables, and based on variable drift analysis. Third, the resulting bounds on the times are turned into bounds on the function value after a given budget by means of the inverse function.

Using Chebyshev's inequality we prove that the probability of deviating from the expected time to reach a leading 1-bits on LeadingOnes by an additive term of d is bounded above by O(n^3/d^2), where n is the number of bits. The exact result allows for concrete statements for rather small values of n. Via Chernoff bounds for sums of independent geometric random variables, we can reduce the above error probability to a much stronger exp(−Θ(d^2/n^3)), however giving weaker bounds for concrete small values of d^2/n^3. Both the Chebyshev and the Chernoff approach exploit the particular knowledge of the distribution of the random variables describing the runtime of the algorithm.
Since such knowledge may often not be present, we also develop a proof of concept for a more universal approach that obtains tail bounds from variable drift analysis. This allows us to bound the error probability by exp(−Ω(d/n^{3/2+ε})), which is still exponentially small if d = Ω(n^{3/2+2ε}).

In the next section we formally define the notions, algorithm, and function. Section 3 introduces our approach via the inverse function of the expected optimisation time. This is followed by our results obtained using Chebyshev's inequality (Section 5), Chernoff bounds (Section 6), and variable drift analysis (Section 7). We conclude with a comparison of our bounds with the previous results in Section 8.

2. FRAMEWORK AND DEFINITIONS

The example function we consider yields as function value the number of leading one bits, counted from left to right. Its formal definition is given below.

Definition 2. The function LO: {0,1}^n → R is defined by LO(x) = Σ_{i=1}^{n} Π_{j=1}^{i} x[j].

It has long been known that the expected optimisation time of the (1+1) EA on LO is Θ(n^2) and that the expectation is sharply concentrated [7]. Böttcher et al. [2] determined the exact expected optimisation time of the (1+1) EA on LO (even for arbitrary mutation probabilities). We will build on their results in Section 3.

We adopt the fixed budget perspective and analyse the function value that can be achieved with b function evaluations. For the (1+1) EA, this means we are interested in f(x_b). Since f(x_b) is a random variable we are interested in its expectation E(f(x_b)) as well as in properties of its distribution, e. g., bounds on the probabilities to achieve at least or at most a given function value.

3. FIXED-BUDGET RESULTS VS. OPTIMISATION TIMES

We want to obtain results on the function value after b function evaluations, f(x_b), for the (1+1) EA. Theoretical analysis usually concentrates on the time needed for optimisation.
It has been pointed out [9] that it is not difficult to generalise this to optimisation goals other than exact optimisation. The same methods used for analysing the expected optimisation time can still be applied without any change. Consider T_{A,f}(a), the number of function evaluations some specific algorithm A needs to reach function value a on some specific fitness function f. Formally, T_{A,f}(a) = min{t | f(x_t) ≥ a}.

Our main contribution is a general method to derive fixed budget results from results about the expected optimisation time. We demonstrate the approach using a specific simple evolutionary algorithm, the (1+1) EA, and a specific well-known example function, LO (often called LeadingOnes). This addresses an open problem posed by Jansen and Zarges [10] when introducing the fixed budget perspective. We concentrate on the simplest evolutionary algorithm, the (1+1) EA. It is similar to random local search since it also has a population of size only 1, creates one new search point in each generation, and deterministically replaces the current population with the new search point if this is of at least equal fitness. It differs from random local search in that the new search point is not selected from a neighbourhood but generated by means of standard bit mutations. This difference can make the two algorithms perform very differently [6]. Specifically in the fixed budget framework this difference can complicate the analysis drastically [10].

If the algorithm A is deterministic then the inverse function T_{A,f}^{-1}(b) yields the function value reached after b function evaluations have been made. However, we are dealing with randomised search heuristics and T_{A,f}(a) is a random variable.
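The algorithm just described is easy to state in code. The following is a minimal Python sketch of the (1+1) EA with standard bit mutation, together with the LeadingOnes fitness function; function and parameter names are ours, chosen for illustration.

```python
import random

def leading_ones(x):
    """LO(x): the number of leading 1-bits of the bit list x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def one_plus_one_ea(n, budget, f=leading_ones, seed=0):
    """Minimal (1+1) EA: standard bit mutation with rate 1/n, accepting the
    offspring if its fitness is at least as good. Returns the current search
    point and its fitness after `budget` fitness evaluations."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)                        # first evaluation
    for _ in range(budget - 1):      # remaining budget
        y = [1 - b if rng.random() < 1.0 / n else b for b in x]
        fy = f(y)
        if fy >= fx:                 # accept if not worse
            x, fx = y, fy
    return x, fx
```

The `budget` parameter counts fitness evaluations, matching the fixed budget perspective; by the acceptance rule, the returned fitness is the best-so-far value.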
Note that if we consider the expected number of function evaluations needed to reach function value a, E(T_{A,f}(a)), its inverse function does not necessarily coincide with the expected function value after b function evaluations, as the following tiny example shows. Consider an algorithm A and a function f where f(x_0) = 0 with probability 1. For t > 0, it is defined as follows. If f(x_{t−1}) = 0 then f(x_t) = 1 with probability 1/2 and f(x_t) = 0 otherwise. If f(x_{t−1}) = 1 then also f(x_t) = 1 with probability 1. Clearly, E(T_{A,f}(0)) = 0 and E(T_{A,f}(1)) = 2 hold. For the expected function value we have E(f(x_b)) = Pr(f(x_b) = 1) = 1 − (1/2)^b. We notice two things. First, the inverse function E(T_{A,f})^{-1}(b) of E(T_{A,f}(a)) is only defined for b ∈ {0, 2}. Second, if we consider E(T_{A,f})^{-1}(2) = 1, we see that this is different from E(f(x_2)) = 1 − (1/2)^2.

Things are different if we have bounds on the time needed. Suppose we know an "upper tail" function t_u(a) such that Pr(T_{A,f}(a) ≥ t_u(a)) ≤ p_u. Then Pr(f(x_b) ≤ t_u^{-1}(b)) ≤ p_u holds. Analogously, if we know a "lower tail" t_l(a) s. t. Pr(T_{A,f}(a) ≤ t_l(a)) ≤ p_l, then Pr(f(x_b) ≥ t_l^{-1}(b)) ≤ p_l.

Algorithm 1. (1+1) Evolutionary Algorithm ((1+1) EA)
1. t := 0; Select x_t ∈ {0,1}^n uniformly at random. Evaluate f(x_t).
2. While t + 1 < b do
3.   t := t + 1
4.   y := x_{t−1}. For each i ∈ {1, 2, ..., n}: with probability 1/n flip the i-th bit of y.
5.   Evaluate f(y).
6.   If f(y) ≥ f(x_{t−1}) then x_t := y else x_t := x_{t−1}.

Bounds for the probability of T_{A,f}(a) being above or below some value are typically derived as bounds on the probability that T_{A,f}(a) deviates from its expected value E(T_{A,f}(a)). Since the fixed budget perspective requires results to be as tight as possible, we are interested in determining the expectation E(T_{A,f}(a)) as exactly as we can and in deriving bounds for the probability to deviate from it that are as sharp as possible. For our work here we start with analysing T_{(1+1) EA,LO}(a), in the following abbreviated as T(a), exactly.
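The inversion of a tail function t_u or t_l described above can be carried out numerically whenever the tail function is monotone increasing in a. A small sketch (the callable `t` and the search range are our illustration, not from the paper):

```python
def invert_tail(t, b, n):
    """Largest a in {0, ..., n} with t(a) <= b: a discrete stand-in for the
    inverse t^{-1}(b) of a monotone increasing tail function t."""
    lo, hi = 0, n
    while lo < hi:                   # binary search over a
        mid = (lo + hi + 1) // 2
        if t(mid) <= b:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For example, with the toy tail function t(a) = a^2 and budget b = 50, the largest a with t(a) ≤ 50 is 7.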
4. EXACT OPTIMISATION TIME FOR LO

We use the precise analysis by Böttcher et al. [2] as our starting point. Let A_i denote the random number of mutations needed to increase the function value from i to something larger. Böttcher et al. [2] prove that E(A_i) = 1/((1 − p_i)^i p_i) holds (Lemma 2 and Theorem 1 in [2]), where p_i denotes the mutation probability. We consider the case of a constant mutation probability of 1/n, so this simplifies to E(A_i) = n/(1 − 1/n)^i.

Since our focus in this work is on results more precise than merely computing expectations, we shall now discuss the actual random variables. To this aim, we regard a modified optimisation process. It is identical to that of the (1+1) EA optimising LO with a random initial search point, apart from the fact that the mutation operator does not flip bits with index higher than the current LO value plus one. Since in both processes these bits are independent and uniformly distributed in {0,1}, both processes are equivalent in the sense that the random search point at any time t is equally distributed in both processes. In particular, all statements on runtimes and fitnesses reached after certain times hold for both processes. For this reason, let us now analyse the runtime of the modified process. Denote by x = (x[1], ..., x[n]) the random initial search point. Since in the modified process bits higher than the LO value plus one never change, we observe that to gain a fitness level of at least a, exactly those fitness levels i have to be left that satisfy i < a and x[i] = 0. The waiting time to leave the i-th level is a geometric random variable A_i with success probability q_i := (1 − 1/n)^i/n. Consequently, when conditioning on x, the time T(a) needed to reach a fitness level of a or higher is Σ_{i∈{j∈[a−1] | x[j]=0}} A_i. Since the x[j] are independent and uniformly distributed in {0,1}, we have the following.
Lemma 5. Let a ∈ {0, ..., n}. The random time T(a) taken by the (1+1) EA started with a random search point to find a solution with LO value at least a can be written as
T(a) = Σ_{i=0}^{a−1} X_i A_i,
where the X_i are uniformly distributed in {0,1}, the A_i are geometric random variables with parameter q_i = (1 − 1/n)^i/n, and all X_i and A_i are mutually independent.

Böttcher et al. carry on determining the expected number of mutations until the function value is increased from i to n (Lemma 3 and Theorem 2 in [2]). This is easily generalised to the expected number of mutations needed to increase the function value from i to at least j. If B_{i,j} denotes this random number of mutations, we have
E(B_{i,j}) = E(A_i) + (1/2) Σ_{k=i+1}^{j−1} E(A_k).

From this, we now derive an expression for E(T(a)). If LO(x_0) ≥ a, the expected number of mutations is 0. Hence,
E(T(a)) = Σ_{i=0}^{a−1} Pr(LO(x_0) = i) · E(B_{i,a}) = Σ_{i=0}^{a−1} (1/2)^{i+1} · (E(A_i) + (1/2) Σ_{j=i+1}^{a−1} E(A_j)) = (1/2) Σ_{i=0}^{a−1} E(A_i)
holds. From E(A_i) = n/(1 − 1/n)^i, we derive
E(T(a)) = ((n^2 − n)/2) · ((1 + 1/(n − 1))^a − 1).

5. BOUNDS WITH THE CHEBYSHEV INEQUALITY

We have demonstrated in Section 3 that knowledge of E(T_{A,f}(a)) and its inverse function is not sufficient to compute E(f(x_b)). Therefore, we want to derive bounds for the event that T(a) deviates from its expectation. The simplest way of doing this would be Markov's inequality [3]. However, since this only delivers bounds for deviations by at least constant factors (Pr(T(a) ≥ c·E(T(a))) ≤ 1/c), it is too weak for our purposes. Chebyshev's inequality [3] is of similar generality to Markov's inequality but delivers much tighter bounds. However, it requires at least bounds on the variance of the random variable.

Theorem 6. For any d = ω(n^{3/2}) and b ≤ E(T(n)) the following holds:
Pr(LO(x_b) = n·ln(1 + 2b/n^2) ± Θ(d/n)) = 1 − o(1).

Note that our approach via the inverse function of E(T(a)) only makes sense for budgets b ≤ E(T(n)).
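Lemma 5 and the closed form for E(T(a)) can be cross-checked against each other by simulation. The following sketch (sample sizes, parameters, and the seed are ours) estimates E(T(a)) by sampling the representation from Lemma 5 and compares it with the closed form:

```python
import random

def expected_T(n, a):
    """Closed form E(T(a)) = ((n^2 - n)/2) * ((1 + 1/(n - 1))^a - 1)."""
    return (n * n - n) / 2 * ((1 + 1 / (n - 1)) ** a - 1)

def sample_T(n, a, rng):
    """One sample of T(a) = sum_i X_i * A_i as in Lemma 5."""
    total = 0
    for i in range(a):
        if rng.randint(0, 1) == 1:            # X_i uniform on {0, 1}
            q = (1 - 1 / n) ** i / n          # success probability of A_i
            trials = 1
            while rng.random() >= q:          # geometric waiting time
                trials += 1
            total += trials
    return total

n, a, runs = 30, 20, 2000
rng = random.Random(1)
empirical = sum(sample_T(n, a, rng) for _ in range(runs)) / runs
```

The empirical mean also agrees with the sum formula (1/2) Σ_{i<a} n/(1 − 1/n)^i, which the closed form rewrites in geometric-series form.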
Larger budgets are beyond the domain of E(T(a)), so that its inverse function does not yield useful values. Therefore, we work with budgets b ≤ E(T(n)) in the following.

For the expected function values some results are known [10]. We cite these results here and show in the following how our methods improve over them.

Theorem 3 (Theorem 8 from [10]). Let the budget be b = (1 − β)n^2/α(n) for any β with (1/2) + β′ < β < 1, where β′ is a positive constant and α(n) = ω(1), α(n) ≥ 1. Then
E(LO(x_b)) = 1 + (2b/n) − O(b/(n·α(n))) = 1 + (2b/n) − o(b/n).

Theorem 4 (Theorem 9 from [10]). Let the budget be b = cn^2 with 0 < c < 1/2 for any constant c. Then
Pr( c·e^{−c}·(1 + e^{−c}) ≤ LO(x_b)/n ≤ c·(1 + e^{−ce^{−c}}) ) = 1 − o(1).

Note that Theorem 3 is already tight but only holds for budgets b = o(n^2). For budgets b = Θ(n^2) we have bounds in Theorem 4, but these are not tight.

Proof (of Theorem 6). We use E(T(a)) from Lemma 5. Since A_i is geometrically distributed with parameter q_i = (1 − 1/n)^i/n, E(A_i) = 1/q_i and Var(A_i) = 1/q_i^2 − 1/q_i are well-known. Consequently, we compute
E(A_i^2) = Var(A_i) + E(A_i)^2 = 2/q_i^2 − 1/q_i,
E((X_i A_i)^2) = E(X_i A_i^2) = (1/2)·E(A_i^2) = 1/q_i^2 − 1/(2q_i),
E(X_i A_i) = (1/2)·E(A_i) = 1/(2q_i),
Var(X_i A_i) = E((X_i A_i)^2) − E(X_i A_i)^2 = 1/q_i^2 − 1/(2q_i) − 1/(4q_i^2) = (3/4)·(1/q_i^2) − (1/2)·(1/q_i).
Since the X_i A_i are independent, we have Var(T(a)) = Σ_{i=0}^{a−1} Var(X_i A_i), which after a somewhat lengthy and tedious calculation gives
Var(T(a)) = n^3·(3/8)·((1 + 1/(n−1))^{2a−1} − 1) − Θ(n^2).
By Chebyshev's inequality, this yields
Pr(|T(a) − E(T(a))| ≥ d_a) ≤ (3/(8d_a^2))·((1 + 1/(n−1))^{2a−1} − 1)·n^3,
so that for any d_a = ω(n^{3/2}) we have upper and lower bounds for the time,
((n^2 − n)/2)·((1 + 1/(n−1))^a − 1) ± d_a,
that hold with probability converging to 1, i. e., 1 − o(1). To obtain bounds on the expected function value we need to compute the inverse function. If we use for d_a some d independent of a, we obtain
Pr( ln((n^2 − n)/(n^2 − n + 2b − 2d))/ln(1 − 1/n) ≤ LO(x_b) ≤ ln((n^2 − n)/(n^2 − n + 2b + 2d))/ln(1 − 1/n) ) = 1 − o(1)
for any d = ω(n^{3/2}). If n is not very small we have ln(1 − 1/n) ≈ −1/n (since e^x ≈ 1 + x for x ≈ 0), so that we obtain
Pr( LO(x_b) = n·ln(1 + 2b/(n^2 − n) ± Θ(d/n^2)) ) = 1 − o(1)
for any d = ω(n^{3/2}). We observe that 2b/(n^2 − n) = 2b/n^2 + O(b/n^3) holds, so we can replace 2b/(n^2 − n) by 2b/n^2: the introduced error is O(b/n^3) = O(1/n) and subsumed in Θ(d/n^2). To simplify further to n·ln(1 + 2b/n^2) ± Θ(d/n), we first observe that we can restrict our interest to d = O(n^2): for larger values of d the error term Θ(d/n^2) dominates and the theorem's statement becomes trivial. With d = O(n^2) we have 1 + (2b/n^2) ± Θ(d/n^2) = Θ(1), and the claim follows by applying the Taylor expansion of the natural logarithm.

6. USING CHERNOFF BOUNDS FOR SUMS OF GEOMETRIC RANDOM VARIABLES

The bounds on the optimisation time derived in the previous section hold with probability 1 − O(n^3/d^2). Note that the error term is polynomial in d, which is due to Chebyshev's inequality. However, it has already been proven using Chernoff bounds that the actual optimisation time of the (1+1) EA on LO is within the interval [c_1 n^2, c_2 n^2] for two constants c_1, c_2 with probability 1 − 2^{−Ω(n)} [7], i. e., with probability exponentially close to 1. Hence, we could hope for much stronger tail bounds than those proved via Chebyshev's inequality. Unfortunately, the result from [7] is too weak for our purposes, since it only bounds the probability of deviations d = Ω(n^2) from the expectation. As pointed out in the previous section, meaningful fixed budget results are only obtained if we bound the probability of deviations d = o(n^2), i. e., deviations by a lower-order term. In this section, we solve this problem and prove that the actual runtime deviates from the expected one by an additive term of d only with probability exp(−Θ(d^2/n^3)).

A main difficulty here is that the independent random variables X_i A_i have an unbounded range. This forbids a straightforward application of classic Chernoff bounds. To overcome this, we use a seemingly lesser-known Chernoff bound for sums of independent random variables having a geometric distribution, which is due to Scheideler [15] (see [3, Section 1.4.4] for an elementary proof of a similar result for identically distributed geometric random variables). It takes some more work to handle the fact that the X_i A_i are not truly geometric random variables. To this aim, we regard the conditional expectations E(T(a) | X_1, ..., X_n). All this leads to the following result.

Theorem 7. For all d ≤ 2n^2, the probability that T(a) deviates from its expectation of (1/2)(n^2 − n)((1 + 1/(n−1))^a − 1) by at least d is at most 4·exp(−d^2/(20e^2 n^3)).

To prove this result, let a ∈ [n]. Let X_1, ..., X_n and A_1, ..., A_n be as in Lemma 5. Write X = (X_1, ..., X_n). Assume that all these random variables are defined over a common probability space Ω. Let T = Σ_{i=0}^{a−1} X_i A_i be a random variable describing the distribution of T(a). The conditional expectation E(T|X) is the random variable defined over Ω by E(T|X)(ω) = E(T | X = X(ω)) = Σ_{ω′ ∈ X^{−1}(X(ω))} (Pr(ω′)/Pr(X^{−1}(X(ω))))·T(ω′). A simple application of the triangle inequality gives
|T − E(T)| ≤ |T − E(T|X)| + |E(T|X) − E(T)|.
Consequently, we have
Pr(|T − E(T)| ≥ 2d) ≤ Pr(|T − E(T|X)| ≥ d) + Pr(|E(T|X) − E(T)| ≥ d)    (1)
for all d. We show that both terms are small when d = ω(n^{3/2}).

Tail Bounds for |T − E(T|X)| via Chernoff Bounds for Geometric Random Variables. By the law of total probability, we have Pr(|T − E(T|X)| ≥ d) = Σ_{x∈{0,1}^n} Pr(X = x)·Pr(|T − E(T)| ≥ d | X = x). Let us condition on X = x. Let S = {i ∈ {0, ..., a−1} | x_i = 1}. Then T = Σ_{i∈S} A_i is a sum of independent geometric random variables, and E(T) = Σ_{i∈S} E(A_i) is its expectation.
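Sums of independent geometric random variables concentrate strongly. As a quick numerical illustration (parameters, seed, and sample size are ours, not from the paper), one can compare the empirical upper tail of such a sum with the bound the Scheideler-type inequality below (Theorem 8(a)) yields in the identically distributed case, where μq = s and the bound simplifies to ((1 + δ)e^{−δ})^s:

```python
import math
import random

def geometric(q, rng):
    """Geometric variable on {1, 2, ...} with success probability q."""
    trials = 1
    while rng.random() >= q:
        trials += 1
    return trials

# Empirical upper tail of a sum of s i.i.d. geometric variables versus the
# simplified Chernoff bound ((1 + delta) * exp(-delta))**s.
s, q, delta, runs = 50, 0.2, 0.3, 5000
rng = random.Random(2)
mu = s / q
hits = sum(sum(geometric(q, rng) for _ in range(s)) >= (1 + delta) * mu
           for _ in range(runs))
empirical = hits / runs
bound = ((1 + delta) * math.exp(-delta)) ** s
```

In this setting the empirical tail probability sits well below the bound, as expected from a Chernoff-type estimate.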
We use the following Chernoff bound for sums of such random variables due to Scheideler [15, Theorem 3.38].

Theorem 8. Let Y_1, Y_2, ..., Y_s be independent, geometrically distributed random variables with parameters q_1, q_2, ..., q_s ∈ (0,1). Let q = min{q_i | i ∈ {1, ..., s}}. Let Y := Σ_{i=1}^{s} Y_i and μ = E(Y).
(a) For all δ > 0, Pr(Y ≥ (1 + δ)μ) ≤ (1 + δμq/s)^s · exp(−δμq).
(b) For all δ ∈ [0, 1], Pr(Y ≤ (1 − δ)μ) ≤ exp(−δ^2 μq/2).

We apply this theorem to the A_i, i ∈ S, to q = min{q_i | i ∈ S}, and to d ≤ n^2. Note that thus d ≤ n/q. Using (i) that (1 + c/s)^s is increasing in s for all c ≥ 0, (ii) the estimate (1 + δ)/exp(δ) ≤ (1 + δ)/(1 + δ + δ^2/2) = 1 − (δ^2/2)/(1 + δ + δ^2/2) ≤ 1 − δ^2/5 ≤ exp(−δ^2/5), valid for all δ ∈ [0, 1], and (iii) finally q ≥ 1/(en), we compute
Pr(T ≥ E(T) + d) ≤ (1 + dq/s)^s · exp(−dq) ≤ ((1 + dq/n)·exp(−dq/n))^n ≤ (exp(−(dq/n)^2/5))^n ≤ exp(−d^2/(5e^2 n^3)).
For the lower tail, still conditioning on X = x and using μ ≤ (e − 1)n^2, we compute
Pr(T ≤ E(T) − d) ≤ exp(−d^2 q/(2μ)) ≤ exp(−d^2/(e(e − 1)n^3)).
Consequently, conditional on X = x, we have Pr(|T − E(T)| ≥ d) ≤ 2·exp(−d^2/(5e^2 n^3)), which is independent of x. Therefore,
Pr(|T − E(T|X)| ≥ d) = Σ_{x∈{0,1}^n} Pr(X = x)·Pr(|T − E(T)| ≥ d | X = x) ≤ Σ_{x∈{0,1}^n} 2^{−n}·2·exp(−d^2/(5e^2 n^3)) = 2·exp(−d^2/(5e^2 n^3)).    (2)

Tail Bounds for |E(T|X) − E(T)|. Note that the random variable E(T|X) by definition depends only on X_1, ..., X_n. We have E(T|X) − E(T) = Σ_{i=0}^{a−1} (X_i − 1/2)·E(A_i). Here, the (X_i − 1/2) are independent random variables uniformly distributed in {−1/2, +1/2}, and the E(A_i) are simply numbers of order Θ(n). Hence E(T|X) − E(T) is a sum of independent random variables, each with expectation zero and range at most E(A_{n−1}) ≤ en. Hence a standard Chernoff bound (like Theorem 1.11 of [3]) yields for all d ≥ 0
Pr(|E(T|X) − E(T)| ≥ d) ≤ 2·exp(−2d^2/(a(en)^2)) ≤ 2·exp(−2d^2/(e^2 n^3)).    (3)

From (1) to (3), Theorem 7 immediately follows.

Notes. We stated Theorem 7 only for d ≤ 2n^2, which is enough to obtain an exponentially small failure probability; larger d therefore seem not too interesting. They could be handled easily, though, by exploiting (1 + δ)/exp(δ) = exp(−δ(1 − o(1))) in the estimates for the upper tail bounds for the sums of geometric random variables. We did not try to optimise the constant in the exp(−Θ(d^2/n^3)) term.

7. USING DRIFT ANALYSIS WITH TAIL BOUNDS

The results from the previous two sections heavily rely on knowing the precise distribution of the random variable describing the runtime (Lemma 5). Such an exact description is usually not available. It is therefore worth investigating whether sharp tail bounds can be obtained by other, potentially more general, methods. We show that drift analysis, which has become the key method for the runtime analysis of randomised search heuristics, is suitable for our purposes as well. As a proof of concept, we derive in Theorem 9 sharp tail bounds on the optimisation time of the (1+1) EA on LO, with deviations in lower-order terms only, using drift analysis combined with the exponential-moment method (which was only recently introduced to the runtime analysis of evolutionary algorithms in [4]). Although this result is weaker than the one obtained in the previous section, two advantages in other respects can be identified: (1) the method does not need the discussed characterisation of the runtime, and (2) it can also be used if the (1+1) EA is initialised with a fixed starting point (instead of a uniform one).

Theorem 9. The time for the (1+1) EA to reach a LO value of at least a, where 0 < a ≤ n − log n, is bounded from above by
((n^2 − n)/2)·((1 + 1/(n − 1))^a − 1)·(1 + O(1/√n)) + d
with probability at least 1 − e^{−dn^{−3/2−ε}}, and from below by
((n^2 − n)/2)·((1 + 1/(n − 1))^a − (1 + √n/(n − 1)))·(1 − O(1/√n)) − d
with probability at least 1 − e^{−dn^{−3/2−ε}+O(log n)} − 2^{−Ω(√n)}, for an arbitrarily small constant ε > 0 and any d > 0.

Before we prove the theorem, we look into its consequences for the function value at time b. As long as the following bounds do not exceed n − log n, we obtain
Pr( ln((n^2 − n)/(n^2 − n + 2b − 2d − o(d)))/ln(1 − 1/n) ≤ LO(x_b) ≤ ln((n^2 − n)/(n^2 − n + 2b + 2d + o(d)))/ln(1 − 1/n) ) = 1 − 2^{−Ω(n^ε)}
for any d = Ω(n^{3/2+ε}), where ε is an arbitrarily small constant. This is slightly weaker than the bound from the previous section, since Theorem 7 only requires d = ω(n^{3/2}).

To prove Theorem 9, we use Lehre's [11] proof of the variable drift theorem and combine it with ideas implicit in Hajek's [8] paper on drift analysis, including the exponential-moment method. We start by defining some notation.

Definition 10. Let x_t denote the current search point of the (1+1) EA at time t. Its distance to the target a is defined by d(x_t) := max{0, a − LO(x_t)}. Given a search point x_t of distance d(x_t) > 0, we denote by δ(x_t) := d(x_t) − d(x_{t+1}) the decrease in distance in the next step. Slightly abusing notation, we also write δ(i) := i − d(x_{t+1} | d(x_t) = i) for the decrease in distance if d(x_t) = i.

We now bound the drift E(δ(i)) (assuming a random x_t satisfying d(x_t) = i, but omitting this condition in the expectation for the sake of readability). At a later point, a potential function based on δ(i) will be defined, and drift analysis will be carried out with respect to that potential function.

Lemma 11. For all i > 0, E(δ(i)) = (2 − 2^{−n+a−i+1})·(1 − 1/n)^{a−i}·(1/n).

Proof. The leftmost zero-bit is at position a − i + 1. To increase the LO value (it cannot decrease), it is necessary to flip this bit and not to flip the first a − i bits, which is reflected by the last two terms in the lemma. The first term is due to the expected number of free-rider bits. Note that there can be between 0 and n − a + i − 1 such bits.
By the usual argumentation using a geometric distribution, the expected number of free riders in an improving step equals
Σ_{k=0}^{n−a+i−1} k·(1/2)^{min{n−a+i−1, k+1}} = 1 − 2^{−n+a−i+1},
hence the expected progress in an improving step is 2 − 2^{−n+a−i+1}.

We now turn to the potential function, which is obtained from the drift δ(i) in a similar way as in the proof of the variable drift theorem (cf. [14]).

Definition 12. Given a search point x_t, its potential is defined by g(x_t) := Σ_{k=1}^{d(x_t)} 1/E(δ(k)). We also write g(i) := g(x_t | d(x_t) = i) for the potential at distance i. The random change of the potential is defined by ∆_t := g(x_t) − g(x_{t+1}). We also write ∆_t(i) := g(i) − g(x_{t+1} | d(x_t) = i).

The drift E(∆_t) of the potential (more precisely, we mean E(∆_t | x_0, ..., x_t), which is the same as E(∆_t | x_t) since we are dealing with a Markov chain) is crucial in the following. Note that E(δ(i)) is non-decreasing in i. Hence, as in the proof of the variable drift theorem, we can obtain E(∆_t) ≥ 1. We now make this explicit and prove an upper bound as well.

Lemma 13. If i > 0 then E(∆_t(i)) ≥ 1. If i > 0 and a ≤ n − log n then E(∆_t(i)) = 1 + O(1/√n).

Proof. By Lemma 11, E(δ(i)) is non-decreasing in its argument. Hence, if δ(i) = j, we get
∆_t = g(i) − g(i − j) = Σ_{k=0}^{j−1} 1/E(δ(i − k)) ≥ Σ_{k=0}^{j−1} 1/E(δ(i)) = j·(1/E(δ(i))).
This implies E(∆_t) ≥ Σ_{j=1}^{i} (j/E(δ(i)))·Pr(δ(i) = j) = E(δ(i))/E(δ(i)) = 1, which proves the lower bound.

For the upper bound, we study the distribution of δ(i). Using ideas from the proof of Lemma 11, we get Pr(δ(i) = j) ≤ (1 − 1/n)^{a−i}·(1/n)·2^{−j}. As a ≤ n − log n, we get E(δ(i)) = (2 − O(1/n))·(1 − 1/n)^{a−i}·(1/n) from Lemma 11. In the remainder of this proof, we omit the −O(1/n) term in all calculations involving E(δ(·)); it is clear that this can only introduce an additive error of O(1/n). We bound the change of potential, given δ(i) = j. Then
g(i) − g(i − j) = Σ_{k=0}^{j−1} 1/E(δ(i − k)) = (1/E(δ(i)))·Σ_{k=0}^{j−1} (1 − 1/n)^{−k} = (1/E(δ(i)))·((1 − 1/n)^{−j} − 1)/((1 − 1/n)^{−1} − 1) = (1/E(δ(i)))·(n − 1)·((1 + 1/(n−1))^j − 1) ≤ (1/E(δ(i)))·(j + j^2/(n−1)),
where the last inequality used (1 + x)^y ≤ e^{xy} ≤ 1 + (xy) + (xy)^2 for 0 ≤ xy ≤ 1. Simplifying the last bound and putting everything together, we get
E(∆_t) ≤ Σ_{j=1}^{i} Pr(δ(i) = j)·(j + j^2/(n−1))/E(δ(i)).
If j < √n, then j + j^2/(n−1) = j·(1 + O(1/√n)). If j ≥ √n, then Pr(δ(i) = j)·j^2/E(δ(i)) = 2^{−Ω(√n)}. Altogether, E(∆_t) = 1 + O(1/√n) follows.

We now bound the value of the potential function depending on the distance. As a by-product of additive drift analysis (using the result from Lemma 13), a bound on g(a) is also a bound on the expected time to reach at least the target value a.

Lemma 14. For all i ≥ 0 and a ≤ n − log n, it holds that
g(i) ≤ ((n^2 − n)/2)·((1 − 1/n)^{−a} − (1 − 1/n)^{−a+i})·(1 + O(1/n))
and
g(i) ≥ ((n^2 − n)/2)·((1 − 1/n)^{−a} − (1 − 1/n)^{−a+i}).
The proof of the lemma is elementary and has been omitted due to space restrictions.

Recall that our aim is to show tail bounds. To this end, we consider the moment-generating function of the drift. The following statement (where we again omit the condition " | x_t" for readability) bounds the moment-generating functions of the negated and unnegated drift away from 1.

Lemma 15. If λ = n^{−3/2−ε} for an arbitrarily small constant ε > 0 and g(x_t) > 0, then there is a constant c > 0 such that E(e^{−λ∆_t}) ≤ 1 − λ(1 − cn^{−1/2}) and E(e^{λ∆_t}) ≤ 1 + λ(1 + cn^{−1/2}).

Proof. We start with the bound on E(e^{−λ∆_t}). From Lemma 11, we get E(δ(i)) ≥ 1/(en) for any i > 0, which implies g(i + j) − g(i) ≤ ejn for any j > 0. Moreover, similarly as argued before, the probability of decreasing the distance (i. e., increasing the LO value) by at least j is at most 2^{−j+1}/n, as j − 1 free riders and a flip of the leftmost zero-bit are necessary. Hence, Pr(∆_t ≥ ejn) ≤ 2^{−j+1}/n. We use this exponential decay to bound the higher-order moments of ∆_t. More precisely, similarly as in the proof of the simplified drift theorem [13], we get
E(e^{−λ∆_t}) ≤ 1 − λ + Σ_{k=2}^{∞} λ^k E(|∆_t|^k)/k!,
where we have used that E(∆_t) ≥ 1 (Lemma 13). The sum for k ≥ 2 is at most
Σ_{k=2}^{∞} (λ^k/k!)·Σ_{j=0}^{n} (e(j+1)n)^k·Pr(∆_t^k ≥ (ejn)^k),
as ∆_t does not take on negative values and the maximum change of distance is at most n. Now, for any k, the inner sum is at most
Σ_{j=0}^{n} (e(j+1)n)^k·2^{−j+1}/n = (2(en)^k/n)·Σ_{j=0}^{n} (j+1)^k 2^{−j}.
To bound the sum over j, we distinguish several cases. If k is constant then obviously Σ_{j=0}^{n} (j+1)^k 2^{−j} = O(1). If k ≤ n^{1/2+ε/3}, then Σ_{j=0}^{n} (j+1)^k 2^{−j} = O(n^{(1/2+2ε/3)k}), as the terms (j+1)^k 2^{−j} become negligible for j ≥ n^{1/2+2ε/3}. Otherwise, we use the trivial bound Σ_{j=0}^{n} (j+1)^k 2^{−j} = O(n^k). This, along with our choice λ = n^{−3/2−ε}, bounds the outer sum by
Σ_{k=2}^{6/ε} C^k n^{−(1/2+ε)k}/(n·k!) + Σ_{k=6/ε+1}^{n^{1/2+ε/3}} C^k n^{−(1/2+ε)k} n^{(1/2+2ε/3)k}/(n·k!) + Σ_{k=n^{1/2+ε/3}+1}^{∞} C^k n^{−(1/2+ε)k} n^k/(n·k!)
≤ O(n^{−2−ε}) + Σ_{k=n^{1/2+ε/3}+1}^{∞} C^k n^{(1/2−ε)k}/n^{(1/2+ε/3)k} ≤ c/n^{2+ε} = cλ/√n
for constants C, c > 0, using k! ≥ (k/e)^k in the last steps. Hence E(e^{−λ∆_t}) ≤ 1 − λ + cλ/√n, as claimed.

8. CONCLUSIONS

We have presented a general method to derive results on the function value after a pre-specified number of function evaluations, called the budget. Our method starts with results on the expected runtime (to find a solution of a given quality) that come with tail bounds and turns these into fixed budget results by means of the inverse function. The method has the advantage of allowing for the re-use of a large body of existing work. We have tested our method on a simple test case, the (1+1) EA on LO. The (1+1) EA was one of the two algorithms considered in the original paper introducing fixed budget computations [10] and is more realistic than random local search, the other of the two algorithms. We have been able to improve over the bounds from [10] considerably.
In Section 5 we use Chebyshev’s inequality to derive tail bounds (a method that, as Doerr [3] points out, was rarely used in the theory of randomised search heuristics; this may change now). In Section 6 we provide asymptotically much stronger results using Chernoff bounds. Both methods strongly rely on properties of LO. In Section 7 we also prove strong tail bounds now using an implicit variable drift theorem with tail bounds. We are currently working on an explicit version of such a theorem, which is expected to be very useful to obtain further fixed-budget results. The asymptotic nature of the much more involved proofs from this section make it difficult, however, to derive numerical statements for concrete values of n. Numerical bounds based on Theorem 7 are also not strong enough for the values of n we consider. Therefore, when we compare our bounds with the ones from [10] we do so only with the results from Section 5. We depict the bounds from Theorem 4 from [10] with our result from Theorem 6 for n ∈ {200, 1000, 5000, 10000} and plot the bounds in Figure 1. Our bounds are derived numerically using an exact expression for the variance Var (T (a)) in Chebyshev’s inequality. We choose da in such a way that the function value stays between the upper and lower bound with at least 90% probability. Remember that the bounds from [10] only make a statement for budgets b < n2 /2. Therefore, the corresponding curves in Figure 1 stop at this point. Our bounds allow for statements for the whole range of budgets up to the expected optimisation time. For large values of n our bounds are clearly much better. For smaller values of n and rather small budgets this is not necessarily the case (see, e. g., the curves for n = 200 and b ≤ 11000 in Figure 1). Note, however, that the bounds from [10] are asymptotic and it is not clear that they are actually valid for such small values of n. 
We see that our bounds are not only much sharper asymptotically but also allow for concrete (numerical) statements for concrete values of n and b. The presentation of our method allows for a wealth of new results in the fixed budget framework. This requires revisiting existing results on expected optimisation times, generalising them to results on the expected time needed to find results of a pre-defined quality (i. e., results on the expected approximation time), and strengthening them by adding tail bounds to the statements on the expected times. While general tools for this are available (see, e. g., the drift for constants C, c > 0 since k! ≥ (k/e)k . The bound on E(eλ∆t ) is obtained in a very similar way. P λk E((∆t )k ) We have E(eλ∆t ) = 1 + λE(∆t ) + ∞ . Using k=2 k! √ Lemma 13, we get E(∆t ) = 1 + O(1/ n). The higherorder moments √ are bounded as before, such that E(∆t ) = 1 + λ(1 + c/ n) follows by choosing c large enough. With the previous bound on the moment-generating function, the proof of the desired tail bounds will mostly be applications of Markov’s inequality. Proof of Theorem 9. Let T denote the random time until the distance from the target a is reduced to 0. We start with the upper tail. For any t > 0, we have Pr(T > t) = Pr(g(xt ) ≥ g(1)) = Pr(eλg(xt ) ≥ eλg(1) ) ≤ E(eλg(xt ) | x0 ) where the inequality uses Markov’s inequality along with eλg(1) ≥ 1. Now, E(eλg(xt ) | x0 ) = E eλg(xt−1 ) · E(e−λ∆t−1 | xt−1 ) | x0 = eλg(x0 ) · t−1 Y s=0 E(e−λ∆s | xs ), where the last equality follows inductively (note that this does not assume independence of the ∆s ). Using Lemma 15, our bound is no larger than −1/2 eλg(x0 ) · (1 − λ(1 − cn−1/2 ))t ≤ eλg(x0 ) e−tλ(1−cn ) , which, using Lemma 14 with i = a, is at most 2 eλ((n −n)/2)((1−1/n)−a −1)(1+O(1/n))−tλ(1−cn−1/2 ) −1/2 . 
−1/2 As (1 − cn )(1 + 2cn ) ≥ 1, plugging in t := ((n2 − −a n)/2)((1 − 1/n) − 1)(1 + C/n)(1 + 2cn−1/2 ) + d for some sufficiently large constant C finally yields −3/2−ǫ Pr(T > t) ≤ e−λd = e−dn . −1/2 Noting that (1 + O(1/n))(1 + 2cn ) = 1 + O(n−1/2 ) and −1 (1−1/n) = 1+1/(n−1), the first statement of the theorem follows. The second statement, the lower tail, is proved similarly. By a union bound, Pr(T < t) ≤ t−1 X Pr(g(xs ) = 0). s=0 Since g(xs ) = 0 is equivalent to e−λg(xs ) ≥ e−λ0 = 1, we get Pr(e−λg(xs ) ≥ 1) ≤ E(e−λg(xs ) ) from Markov’s inequalQs−1 ity. Now, E(e−λg(xs ) ) = e−λg(x0 ) · r=0 E(eλ∆r | xr ). By −1/2 ) Lemma 15, this is no larger than e−λg(x0 ) ·esλ(1+cn .√ Due √ −Ω( n) to random initialisation, Pr(d(x0 ) ≤ a − n) = 1 − 2 . 2 −a From Lemma 14, we get g(x ) ≥ ((n − n)/2)((1 − 1/n) − 0 √ case. Let t := ((n2 − n)/2)((1 − (1 − 1/n)− n ) in this √ −a − n 1/n) −(1−1/n) )(1−2cn−1/2 )−d for some sufficiently 1587 lower bound upper bound lower bound from [10] upper bound from [10] 200 lower bound upper bound lower bound from [10] upper bound from [10] 1000 800 function value function value 150 100 600 400 50 200 0 0 0 5000 10000 15000 20000 25000 30000 35000 0 150000 300000 #function evaluations lower bound upper bound lower bound from [10] upper bound from [10] 5000 600000 750000 900000 lower bound upper bound lower bound from [10] upper bound from [10] 10000 8000 function value 4000 function value 450000 #function evaluations 3000 6000 2000 4000 1000 2000 0 0 0 5e+06 1e+07 1.5e+07 2e+07 0 #function evaluations 1e+07 2e+07 3e+07 4e+07 5e+07 6e+07 7e+07 8e+07 #function evaluations Figure 1: Bounds from Theorem 4 from [10] and Theorem 6 for n = 200 (top left), n = 1000 (top right), n = 5000 (bottom left) and n = 10000 (bottom right). The function value LO(xt ) stays between the upper and lower bound with at least 90% probability. theorem with tail bounds by Doerr and Goldberg [5]) it has not often been done in the past. 
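The inversion step at the heart of the method can be sketched in a few lines of code: given a concentrated bound h(a) on the expected time to reach function value a, the value reached within budget b is essentially h^{-1}(b). The sketch below is ours, not part of the paper; the function names are hypothetical, and it uses only the leading term ((n^2 − n)/2)((1 − 1/n)^{−a} − 1) for the expected time of the (1+1) EA to reach a LeadingOnes value of a (the quantity g(a) from the analysis above, with all lower-order terms dropped), inverting it by bisection.

```python
def expected_time(n, a):
    # Leading term of the expected number of evaluations the (1+1) EA
    # needs to reach LeadingOnes value a (the potential g(a) above);
    # lower-order terms are deliberately ignored in this sketch.
    return ((n * n - n) / 2) * ((1 - 1 / n) ** (-a) - 1)

def value_after_budget(n, b):
    # Fixed-budget view: the largest a with expected_time(n, a) <= b.
    # expected_time is strictly increasing in a, so bisection finds it;
    # by the tail bounds, larger deviations occur only with probability o(1).
    if expected_time(n, n) <= b:
        return n
    lo, hi = 0, n  # invariant: expected_time(n, lo) <= b < expected_time(n, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_time(n, mid) <= b:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, feeding the budget expected_time(n, a) back into value_after_budget recovers a, illustrating that the fixed-budget statement is exactly the inverse of the expected-time statement.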
We expect to see the need for more tools to allow for the derivation of such tail bounds and a renewed interest in the development of such analytical methods. We hope that this line of research proves to be motivating and intriguing for theoreticians and delivers results that will be of interest to practitioners.

9. REFERENCES

[1] A. Auger and B. Doerr, editors. Theory of Randomized Search Heuristics. World Scientific, 2011.
[2] S. Böttcher, B. Doerr, and F. Neumann. Optimal fixed and adaptive mutation rates for the LeadingOnes problem. In Proc. of the 11th Int'l Conf. on Parallel Problem Solving From Nature (PPSN 2010), pages 1–10. Springer, 2010. LNCS 6238.
[3] B. Doerr. Analyzing randomized search heuristics: Tools from probability theory. In Auger and Doerr [1], pages 1–20.
[4] B. Doerr, M. Fouz, and C. Witt. Sharp bounds by probability-generating functions and variable drift. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2011), pages 2083–2090. ACM Press, 2011.
[5] B. Doerr and L. Goldberg. Drift analysis with tail bounds. In Proc. of the 11th Int'l Conf. on Parallel Problem Solving From Nature (PPSN 2010), pages 174–183. Springer, 2010. LNCS 6238.
[6] B. Doerr, T. Jansen, and C. Klein. Comparing global and local mutations on bit strings. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2008), pages 929–936. ACM Press, 2008.
[7] S. Droste, T. Jansen, and I. Wegener. On the analysis of the (1+1) evolutionary algorithm. Theoretical Comp. Science, 276(1–2):51–81, 2002.
[8] B. Hajek. Hitting and occupation time bounds implied by drift analysis with applications. Advances in Applied Probability, 14:502–525, 1982.
[9] T. Jansen. Analyzing Evolutionary Algorithms. The Computer Science Perspective. Springer, 2013.
[10] T. Jansen and C. Zarges. Fixed budget computations: A different perspective on run time analysis. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2012), pages 1325–1332.
ACM Press, 2012.
[11] P. K. Lehre. Drift analysis (tutorial). In Companion to GECCO 2012, pages 1239–1258. ACM Press, 2012.
[12] F. Neumann and C. Witt. Bioinspired Computation in Combinatorial Optimization. Springer, 2010.
[13] P. S. Oliveto and C. Witt. Simplified drift analysis for proving lower bounds in evolutionary computation. Algorithmica, 59(3):369–386, 2011. Erratum in http://arxiv.org/abs/1211.7184.
[14] J. E. Rowe and D. Sudholt. The choice of the offspring population size in the (1, λ) EA. In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2012), pages 1349–1356. ACM Press, 2012.
[15] C. Scheideler. Probabilistic Methods for Coordination Problems, volume 78 of HNI-Verlagsschriftenreihe. University of Paderborn, 2000. Habilitation thesis. Available at: www.cs.jhu.edu/~scheideler/papers/habil.ps.gz.
[16] N. P. Troutman, B. E. Eskridge, and D. F. Hougen. Is "best-so-far" a good algorithmic performance metric? In Proc. of the Genetic and Evolutionary Comp. Conf. (GECCO 2008), pages 1147–1148. ACM Press, 2008.