LMBOPT – A limited memory method for bound-constrained optimization

Arnold Neumaier, Behzad Azmi
Fakultät für Mathematik, Universität Wien
Nordbergstr. 15, A-1090 Wien, Austria
email: [email protected]
WWW: http://www.mat.univie.ac.at/~neum/

/matlab/opt/LM/lmbopt1.tex
Preliminary version, June 27, 2014
confidential – do not distribute

Abstract. This is an early draft of a paper on the bound-constrained optimization problem solver LMBOPT and the theory on which it is based. This draft is incomplete and may still contain major errors.

Contents
1 Introduction
2 Smooth bound-constrained optimization problems
3 The line search
4 The bent search path
5 Convergence
6 Subspace steps
7 Implementation issues
8 Numerical results

1 Introduction

2 Smooth bound-constrained optimization problems

Let $f: C\subseteq\mathbb R^n\to\mathbb R$ be a continuously differentiable function with gradient $g(x):=f'(x)^T\in\mathbb R^n$. We consider the bound-constrained optimization problem
\[
\min\ f(x) \quad\hbox{s.t. } x\in\mathbf x:=\{x\in\mathbb R^n \mid \underline x\le x\le\overline x\}. \tag{1}
\]
Here one-sided or missing bounds are accounted for by allowing the value $-\infty$ for components of $\underline x$ and $+\infty$ for components of $\overline x$. To have a well-defined optimization problem, the box $\mathbf x$ must be part of the domain $C$ of definition of $f$. (In practice, one may allow a smaller domain of definition if $f$ satisfies the coercivity condition that, as $x$ approaches the boundary of the domain of definition, $f(x)$ exceeds the function value $f(x^0)$ at a known starting point $x^0$.)

$f$ is called the objective function. A point $x$ is called feasible if it belongs to the box $\mathbf x$. Given a feasible point $x$ and an index $i$, we call the bound $\underline x_i$ or $\overline x_i$ active if $x_i=\underline x_i$ or $x_i=\overline x_i$, respectively. In both cases, we also call the index $i$ and the component $x_i$ active. Otherwise, i.e., if $x_i\in\,]\underline x_i,\overline x_i[$, the index $i$, the component $x_i$, and the bounds $\underline x_i$ and $\overline x_i$ are called nonactive or free. We write
\[
I(x):=\{i \mid \underline x_i<x_i<\overline x_i\} \tag{2}
\]
for the set of free indices.

If the gradient $g=g(x)$ has a nonzero component $g_i$ at a nonactive index $i$, we may change $x_i$ slightly without leaving the feasible region. The value of the objective function is reduced by moving $x_i$ slightly to smaller or larger values, depending on whether $g_i>0$ or $g_i<0$, respectively. However, if $x_i$ is active, only changes of $x_i$ in one direction are possible without losing feasibility. The value of the objective function can then possibly be reduced by moving slightly in the feasible direction only when
\[
g_i\le 0 \ \hbox{ if } x_i=\underline x_i,\qquad g_i\ge 0 \ \hbox{ if } x_i=\overline x_i, \tag{3}
\]
and a decrease is guaranteed only if the slightly stronger condition
\[
g_i<0 \ \hbox{ if } x_i=\underline x_i,\qquad g_i>0 \ \hbox{ if } x_i=\overline x_i \tag{4}
\]
holds. We say that the active variable $x_i$ is weakly active if (3) holds, and strongly active if (3) is violated. Thus changing a single strongly active variable alone cannot lead to a better feasible point. If (4) holds, we call the active index $i$ freeable and say that the variable $x_i$ can be freed from its bound.

Combining the various cases, we see that a decrease is always possible unless the reduced gradient $g_{\rm red}(x)$ at $x$, with components
\[
g_{\rm red}(x)_i:=\begin{cases}
0 & \hbox{if } x_i=\underline x_i=\overline x_i,\\
\min(0,g_i(x)) & \hbox{if } x_i=\underline x_i<\overline x_i,\\
\max(0,g_i(x)) & \hbox{if } x_i=\overline x_i>\underline x_i,\\
g_i(x) & \hbox{otherwise},
\end{cases} \tag{5}
\]
vanishes. A feasible point $x$ with $g_{\rm red}(x)=0$ is called a stationary point of the optimization problem (1). By the above, a local minimizer $x$ of (1) must be a stationary point. This statement is a concise expression of the first order optimality conditions for bound-constrained optimization.
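For illustration, the case distinction (5) amounts to a few lines of Matlab. Following the storage convention of Section 7, the vectors low and upp hold $\underline x$ and $\overline x$; the function name redgrad is our own choice for this sketch and not part of LMBOPT.

  function gred = redgrad(x, g, low, upp)
  % reduced gradient (5) at a feasible point x with gradient g
  gred = g;
  gred(x<=low & g>0) = 0;   % at a lower bound, keep min(0,g_i)
  gred(x>=upp & g<0) = 0;   % at an upper bound, keep max(0,g_i)
  gred(low==upp) = 0;       % variables fixed by the bounds
  end

A feasible point x is then stationary for (1) exactly when redgrad(x,g,low,upp) is the zero vector.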
Note that although the gradient is assumed to be continuous, the reduced gradient need not be continuous, since it may change abruptly when a bound becomes active. If no bound is active, $g_{\rm red}(x)=g(x)$. The set of free or freeable indices can be written in terms of the reduced gradient as
\[
I_+(x):=\{i \mid \underline x_i<x_i<\overline x_i \ \hbox{ or } \ g_{\rm red}(x)_i\ne 0\}. \tag{6}
\]

3 The line search

Our optimization method improves an initial feasible point $x^0$ by constructing a sequence $x^0,x^1,x^2,\dots$ of feasible points with decreasing function values. To ensure this, we search in each iteration along an appropriate search path $x(\alpha)$ with $x(0)=x^\ell$, and take $x^{\ell+1}=x(\alpha_\ell)$, where $\alpha_\ell$ is determined by a line search. If the iteration index $\ell$ is fixed, we simply write $x$ for the current point $x^\ell$.

A line search proceeds by searching points $x(\alpha)$ on a curve of feasible points parameterized by a step size $\alpha>0$ starting at the current point $x=x(0)$. The goal is to find a value for the step size such that $f(x(\alpha))$ is sufficiently smaller than $f(x)$, with a notion of "sufficiently" to be made precise. If the gradient $g=g(x)$ is nonzero, the existence of such an $\alpha>0$ is guaranteed if the tangent vector $p:=x'(0)$ exists and satisfies
\[
g^Tp<0. \tag{7}
\]
In the unconstrained case, the curve is frequently taken to be a ray from $x$ in direction $p$, giving $x(\alpha)=x+\alpha p$.

A good and computationally useful measure of progress of a line search is the Goldstein quotient
\[
\mu(\alpha):=\frac{f(x(\alpha))-f(x)}{\alpha g^Tp} \quad\hbox{for }\alpha>0. \tag{8}
\]
$\mu$ can be extended to a continuous function on $[0,\infty[$ by defining $\mu(0):=1$ since, by l'Hôpital's rule,
\[
\lim_{\alpha\to 0}\mu(\alpha)=\lim_{\alpha\to 0}\frac{f'(x(\alpha))x'(\alpha)}{g^Tp}=\frac{f'(x)x'(0)}{g^Tp}=1.
\]
By assumption (7), we have $f(x(\alpha))<f(x)$ iff $\alpha>0$ and $\mu(\alpha)>0$.

Restrictions on the values of the Goldstein quotient define regions where sufficient descent is achieved. We consider here the sufficient descent condition
\[
\mu(\alpha)|\mu(\alpha)-1|\ge\beta \tag{9}
\]
with fixed $\beta>0$. This condition requires $\mu(\alpha)$ to be sufficiently positive, forcing $f(x(\alpha))<f(x)$ and forbidding steps that are too long, and to be not too close to 1, forbidding steps that are too short. Hence satisfying the condition promises a sensible decrease in the objective function. A step size with (9) can be found constructively, at least when the objective function is bounded below.

3.1 Theorem. Let $\beta\in\,]0,\frac14[$ and $g^Tp<0$. If $f(x(\alpha))$ is bounded below as $\alpha\to\infty$ then the equation $\mu(\widehat\alpha)=\frac12$ has a solution $\widehat\alpha>0$, and any $\alpha$ sufficiently close to $\widehat\alpha$ satisfies (9).

Proof. Let $\underline f:=\inf_{\alpha\ge 0}f(x(\alpha))$ and $\mu_0:=\inf_{\alpha\ge 0}\mu(\alpha)$. If $\mu_0>0$ then (8) implies for $\alpha>0$ the inequality
\[
\underline f-f(x)\le f(x(\alpha))-f(x)=\alpha g^Tp\,\mu(\alpha)\le\alpha g^Tp\,\mu_0, \tag{10}
\]
but since $g^Tp<0$, this is impossible for sufficiently large $\alpha$. Therefore $\mu_0\le 0$. By continuity, $\mu(\widehat\alpha)=\frac12$ has a solution $\widehat\alpha>0$. Since $\mu(\widehat\alpha)|\mu(\widehat\alpha)-1|=\frac14>\beta$, (9) holds for all $\alpha$ sufficiently close to $\widehat\alpha$. □

Near a local minimizer, twice continuously differentiable functions are bounded below and almost quadratic. For a linear search path and a quadratic function that is bounded below,
\[
f(x+\alpha p)=f(x)+\alpha g(x)^Tp+\frac{\alpha^2}2 p^TG(x)p =: f+\alpha a+\frac{\alpha^2}2 b \quad\hbox{with } a<0<b, \tag{11}
\]
so that $\mu(\alpha)=1+\alpha b/2a$ is linear in $\alpha$. Thus $\mu(\alpha)<1$ and
\[
\frac{\alpha}{2(1-\mu(\alpha))}=-\frac ab>0
\]
for all $\alpha>0$. The minimizer along the search line can therefore be computed from any $\alpha$,
\[
\widehat\alpha=\frac{\alpha}{2(1-\mu(\alpha))}, \tag{12}
\]
giving $\mu(\widehat\alpha)=\frac12$. In general, we therefore use the formula (12) after the first function evaluation.
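To see how (8), (9) and (12) interact, the following Matlab fragment evaluates the Goldstein quotient on a toy quadratic path of our own choosing (not from LMBOPT); for a quadratic, the estimate (12) lands exactly on the minimizer, where $\mu(\widehat\alpha)=\frac12$.

  fpath = @(a) (a-1).^2;          % toy path value f(x(alpha)), minimizer at a=1
  f0 = fpath(0); d0 = -2;         % f(x(0)) and directional derivative g'*p
  beta = 0.02; alpha = 0.5;
  mu = (fpath(alpha) - f0)/(alpha*d0);   % Goldstein quotient (8): here 0.75
  ok = mu*abs(mu-1) >= beta;             % sufficient descent test (9): true
  alphahat = alpha/(2*(1-mu));           % step estimate (12), valid for mu < 1
  fprintf('mu=%.3f ok=%d alphahat=%.3f\n', mu, ok, alphahat)   % alphahat = 1

Here alphahat is the exact minimizer, and indeed $\mu(1)=\frac12$.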
If the resulting value of $\mu$ does not satisfy the sufficient descent condition (9), the function is far from quadratic, and we proceed with a simple bisection scheme: we extrapolate by constant factors towards 0 or $\infty$ until we have a nontrivial bracket for $\widehat\alpha$; then we use geometric mean steps, since the bracket may span several orders of magnitude. However, we quit the line search once the stopping test is satisfied, and accept as final step size the $\alpha$ with the best function value.

3.2 Algorithm. Curved line search (CLS)
Purpose: find a step size $\alpha$ with $\mu(\alpha)|\mu(\alpha)-1|\ge\beta$
Input: $x(\alpha)$ (search path), $f_0=f(x(0))$, $d_0=g(x(0))^Tx'(0)$, $\alpha_{\rm init}$ (initial step size)
Requirements: $d_0<0$, $\alpha_{\rm init}>0$
Parameters: $\beta\in\,]0,\frac14[$, $q>1$

  start = true; α̲ = 0; ᾱ = ∞; α = α_init;
  µ = (f(x(α)) − f0)/(αd0);
  while start or µ|µ − 1| < β,
    if µ ≥ 1/2, α̲ = α; else ᾱ = α; end;
    if start,
      start = false;
      if µ < 1, α = α/(2(1 − µ)); else α = qα; end;
    elseif ᾱ = ∞, α = qα;
    elseif α̲ = 0, α = α/q;
    else α = √(α̲ᾱ);
    end;
    µ = (f(x(α)) − f0)/(αd0);
  end;
  return the α with the best f(x(α)) found.

Note that this idealized line search needs an extra stopping test to end after finitely many steps when $f$ is unbounded below along the search curve. In practice, one may use a small value such as $\beta=0.02$ and a large value of $q$ such as $q=50$. The best values depend on the particular algorithm calling the line search, and must be determined by calibration on a set of test problems.

The sufficient descent condition (9) is closely related to the so-called Goldstein condition
\[
f(x)-\alpha\mu''|g^Tp|\le f(x(\alpha))\le f(x)-\alpha\mu'|g^Tp|, \tag{13}
\]
where $0<\mu'<\mu''<1$. Indeed, (13) is equivalent to
\[
\mu'\le\mu(\alpha)\le\mu'', \tag{14}
\]
hence (9) holds with $\beta=\min(\mu'(1-\mu'),\mu''(1-\mu''))>0$. Conversely, (9) implies that either (13) holds with
\[
\mu'=\frac{2\beta}{1+\sqrt{1-4\beta}},\qquad \mu''=\frac{1+\sqrt{1-4\beta}}2,
\]
or the alternative fast descent condition
\[
\mu(\alpha)\ge\mu''' \tag{15}
\]
holds with
\[
\mu'''=\frac{1+\sqrt{1+4\beta}}2.
\]
The Goldstein condition (14) can be interpreted geometrically by drawing in the graph of $f(x(\alpha))$ the cone defined by the two lines through $(0,f)$ with slopes $\mu'g^Tp$ and $\mu''g^Tp$. This cone cuts out a section of the graph, which defines the admissible step size parameters. The Goldstein condition allows only a tiny and inefficient range of step sizes in cases where, for small $\alpha$, the graph of $f(x(\alpha))$ is concave and fairly flat, while for larger $\alpha$, $f(x(\alpha))$ is strongly increasing. This is one of the reasons why most currently used line searches are based instead on the so-called Wolfe condition, which needs gradient evaluations during the line search. Our line search is gradient-free but avoids the defects of the Goldstein condition. Indeed, in the above cases, the range allowed by (9) is considerably larger than that of the Goldstein condition, since it includes the values where (15) holds. Our line search is even more liberal, as it accepts the step size with the best known function value once a step size satisfying (9) was found; this step size may itself violate (9).
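One possible runnable realization of Algorithm 3.2 in Matlab is sketched below; the handle fpath for $\alpha\mapsto f(x(\alpha))$, the function name cls, and the safeguard nfmax (which implements the extra stopping test mentioned above) are our own conventions, not part of LMBOPT.

  function [alpha, fbest] = cls(fpath, f0, d0, alphainit, beta, q, nfmax)
  % curved line search (Algorithm 3.2): find alpha with mu*|mu-1| >= beta
  % requirements: d0 < 0, alphainit > 0, beta in ]0,1/4[, q > 1
  start = true; alow = 0; aupp = inf; alpha = alphainit;
  f = fpath(alpha); mu = (f - f0)/(alpha*d0);
  abest = alpha; fbest = f; nf = 1;
  while (start || mu*abs(mu-1) < beta) && nf < nfmax
    if mu >= 0.5, alow = alpha; else, aupp = alpha; end
    if start
      start = false;
      if mu < 1, alpha = 0.5*alpha/(1-mu); else, alpha = q*alpha; end
    elseif aupp == inf
      alpha = q*alpha;                % extrapolate towards infinity
    elseif alow == 0
      alpha = alpha/q;                % extrapolate towards zero
    else
      alpha = sqrt(alow*aupp);        % geometric mean bisection step
    end
    f = fpath(alpha); mu = (f - f0)/(alpha*d0); nf = nf + 1;
    if f < fbest, fbest = f; abest = alpha; end
  end
  alpha = abest;                      % the alpha with the best f found
  end

On the quadratic toy path $f(x(\alpha))=(\alpha-1)^2$, the call cls(@(a)(a-1).^2, 1, -2, 0.1, 0.02, 50, 100) returns $\alpha=1$ after two function evaluations, since the secant step (12) is exact for quadratics.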
3.3 Theorem. Suppose that the objective function has a Lipschitz continuous gradient, i.e.,
\[
\|g(y)-g(x)\|_2\le\gamma\|y-x\|_2 \quad\hbox{for } x,y\in\mathbb R^n. \tag{16}
\]
If the search path is a ray along the direction $p$ then
\[
\frac{(f(x)-f(x(\alpha')))\|p\|_2^2}{(g(x)^Tp)^2}\ge\frac{2\beta}{\gamma} \tag{17}
\]
holds for any step size $\alpha'$ with $f(x(\alpha'))\le f(x(\alpha))$ for some $\alpha>0$ satisfying the sufficient descent condition (9).

Proof. By assumption, $x(\alpha)=x+\alpha p$. The function $\psi$ defined by $\psi(\alpha):=f(x+\alpha p)-\alpha g(x)^Tp$ satisfies
\[
\psi'(\alpha)=g(x+\alpha p)^Tp-g(x)^Tp=(g(x+\alpha p)-g(x))^Tp.
\]
The Cauchy–Schwarz inequality gives
\[
|\psi'(\alpha)|\le\|g(x+\alpha p)-g(x)\|_2\|p\|_2\le\gamma\|\alpha p\|_2\|p\|_2=\gamma\alpha\|p\|_2^2,
\]
hence
\[
|f(x+\alpha p)-f(x)-\alpha g(x)^Tp|=|\psi(\alpha)-\psi(0)|=\Big|\int_0^\alpha\psi'(s)\,ds\Big|\le\int_0^\alpha|\psi'(s)|\,ds\le\int_0^\alpha\gamma s\|p\|_2^2\,ds=\frac{\gamma\alpha^2}2\|p\|_2^2.
\]
Therefore
\[
\frac{\alpha\gamma}2\,\frac{\|p\|_2^2}{|g(x)^Tp|}\ge\Big|\frac{f(x+\alpha p)-f(x)-\alpha g(x)^Tp}{\alpha g(x)^Tp}\Big|=|\mu(\alpha)-1|.
\]
On the other hand, since $g(x)^Tp<0$,
\[
\frac{f(x)-f(x(\alpha))}{|g(x)^Tp|}=\frac{f(x(\alpha))-f(x)}{g(x)^Tp}=\alpha\mu(\alpha).
\]
Taking the product, we conclude that
\[
\frac{(f(x)-f(x(\alpha')))\|p\|_2^2}{(g(x)^Tp)^2}\ge\frac{(f(x)-f(x(\alpha)))\|p\|_2^2}{(g(x)^Tp)^2}\ge\alpha\mu(\alpha)\,\frac{2|\mu(\alpha)-1|}{\alpha\gamma}=\frac{2\mu(\alpha)|\mu(\alpha)-1|}{\gamma}\ge\frac{2\beta}{\gamma}. \qquad\Box
\]

4 The bent search path

For an arbitrary point $x\in\mathbb R^n$, we define its feasible projection $\pi[x]$ by
\[
\pi[x]_i:=\max(\underline x_i,\min(\overline x_i,x_i))=\begin{cases}
\underline x_i & \hbox{if } x_i\le\underline x_i,\\
\overline x_i & \hbox{if } x_i\ge\overline x_i,\\
x_i & \hbox{otherwise}.
\end{cases} \tag{18}
\]
For solving the bound-constrained optimization problem (1), we do each line search along a bent search path
\[
x(\alpha):=\pi[x+\alpha p], \tag{19}
\]
obtained by projecting the ray $x+\alpha p$ ($\alpha\ge 0$) into the feasible set. The bent search path is piecewise linear, with breakpoints at the elements of the set
\[
S:=\Big(\Big\{\frac{\overline x_i-x_i}{p_i} : p_i>0\Big\}\cup\Big\{\frac{\underline x_i-x_i}{p_i} : p_i<0\Big\}\Big)\setminus\{0,\infty\}.
\]
If the breakpoints $\alpha_1,\dots,\alpha_m$ are ordered such that $0=\alpha_0<\alpha_1<\dots<\alpha_m<\alpha_{m+1}=\infty$, the bent search path is linear on each interval $[\alpha_{i-1},\alpha_i]$ ($i=1,\dots,m+1$). By setting some entries of $p$ to zero if necessary, we may assume that $x(\alpha)=x+\alpha p$ for $0\le\alpha\le\alpha_1$. This is equivalent to requiring
\[
p_i\ge 0 \ \hbox{ if } x_i=\underline x_i,\qquad p_i\le 0 \ \hbox{ if } x_i=\overline x_i. \tag{20}
\]
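In Matlab, the projection (18), the bent path (19), and the breakpoint set $S$ might be computed as in the following sketch; the toy data are ours, chosen so that the path bends twice.

  low = [0; 0; -inf]; upp = [1; 2; inf];   % bounds, with one-sided components
  x = [0.5; 1; 0]; p = [1; -1; 2];         % feasible point and direction
  proj = @(z) max(low, min(upp, z));       % feasible projection (18)
  xpath = @(a) proj(x + a*p);              % bent search path (19)
  s = inf(size(x));                        % step at which component i hits a bound
  s(p>0) = (upp(p>0) - x(p>0))./p(p>0);
  s(p<0) = (low(p<0) - x(p<0))./p(p<0);
  S = unique(s(s>0 & isfinite(s)));        % ordered breakpoints, here [0.5; 1]
  disp(xpath(2)')                          % beyond all breakpoints: [1 0 4]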
Algorithms unable to identify the set of optimal active bound constraints may alternately free and fix the same subset of variables over a large number of successive iterations. Such zigzagging behaviour is a major cause of inefficiency and must be avoided as far as possible. The most useful way to prevent it is to control the conditions under which variables are freed. This is done in the following algorithmic scheme.

4.1 Algorithm. (for bound-constrained optimization)
Given a feasible initial point $x^0$.

  x = x^0; I = I_+(x);
  while g_red(x) ≠ 0,
    if I = ∅, update I = I_+(x); % freeing step
    update x by performing the line search of Algorithm 3.2 along a
      bent search path satisfying B1–B2; update I = I(x);
    if no new bound has been fixed but some variable can be freed,
      update I = I_+(x); % freeing step
    end;
  end;

Suppose that in the previous step no variable has been fixed at one of its bounds; then we have to decide between performing a freeing step and an iteration step. To prevent zigzagging, we first perform an iteration step. Then we check whether a new variable became active. If so, we repeat the iteration step with respect to the new subspace of nonactive variables. After that, we check again whether an active variable can be freed. If a variable can be freed, we do a freeing step; otherwise, we repeat the iteration step. The tangent direction of the bent search path is determined by a subspace step method over the subspace of variables corresponding to $I$.

4.2 Proposition. Suppose that $p$ is a feasible direction at $x$ satisfying (20), $g_{\rm red}^Tp<0$, and
\[
p_i=0 \ \hbox{ if either } x_i=\underline x_i,\ g_i>0, \ \hbox{ or } x_i=\overline x_i,\ g_i<0.
\]
Then $x(\alpha):=x+\alpha p$ is feasible for sufficiently small $\alpha>0$, and we have $f(x(\alpha))<f(x)$ for such $\alpha$.

Proof. Due to the definition of the reduced gradient and the second condition,
\[
g_ip_i=g_{\rm red}(x)_ip_i \quad\hbox{for all } i=1,\dots,n. \tag{21}
\]
Hence we have
\[
g^Tp=g_{\rm red}^Tp<0. \tag{22}
\]
Moreover, (20) guarantees that $x+\alpha p$ stays in the box for sufficiently small $\alpha>0$. Consequently, for such $\alpha$,
\[
f(x(\alpha))=f(x+\alpha p)=f(x)+\alpha g^Tp+o(\alpha)=f(x)+\alpha(g^Tp+o(1))<f(x). \qquad\Box
\]

5 Convergence

According to our convergence theory below, we impose the following conditions on the line search:

B1 If $S=\emptyset$ or $\alpha<\min S$, the line search is efficient, i.e. (cf. Theorem 3.3), it guarantees
\[
\frac{(f(x)-f(\bar x))\|p\|_2^2}{(g^Tp)^2}\ge\rho
\]
for some fixed $\rho>0$.

B2 If $\alpha\ge\min S$, the condition
\[
f(x(\alpha))\le f(x)+\mu'\alpha g^Tp \tag{23}
\]
is satisfied for a fixed $\mu'\in\,]0,1[$.

We now prove that if conditions B1–B2 are satisfied, the algorithm converges to a stationary point of the problem; consequently the first order optimality conditions hold for this point. The convergence analysis consists of an analysis of how the activities may change, together with an analysis of what happens in the free affine subspaces consisting of all points in which the active variables have fixed values.

From now on, we drop the index $l$ of the outer iteration for simplicity and write $x$ for $x^l$, $f$ for $f^l=f(x^l)$, $g$ for the gradient $g(x^l)$, $\bar x$ for $x^{l+1}$, $\bar f$ for $f^{l+1}=f(x^{l+1})$, and $\bar g$ for the gradient $g(x^{l+1})$.

5.1 Theorem. Suppose that all search directions $p$ used in Algorithm 4.1 satisfy
\[
p_i=0 \ \hbox{ if either } x_i=\underline x_i,\ g_i>0, \ \hbox{ or } x_i=\overline x_i,\ g_i<0, \tag{24}
\]
and the reduced angle condition
\[
\frac{g_{\rm red}^Tp}{\|g_{\rm red}\|_2\|p\|_2}\le-\delta<0. \tag{25}
\]
If the sequence of iteration points $x^l$ ($l=0,1,\dots$) is bounded then:

(i) The reduced gradients $g_{\rm red}^l$ satisfy
\[
\inf_{l\ge 0}\|g_{\rm red}^l\|_2=0.
\]
(ii) If the iteration does not stop but $x^l\to\widehat x$ for $l\to\infty$ then $\widehat x$ satisfies the first order optimality conditions $g_{\rm red}(\widehat x)=0$, and for all $i$ and sufficiently large $l$, we have
\[
x_i^l=\widehat x_i=\underline x_i \ \hbox{ if } g_i(\widehat x)>0, \tag{26}
\]
\[
x_i^l=\widehat x_i=\overline x_i \ \hbox{ if } g_i(\widehat x)<0. \tag{27}
\]

Proof. Everything is trivial when the algorithm stops at a stationary point; hence we may assume that infinitely many iterations are taken. By removing the components of $x$ fixed by the bound constraints, we may also assume that $\underline x_i<\overline x_i$ for all $i$.

(i) Suppose that the line search is efficient only finitely often. Then there is an integer $L$ such that in all iterations $l>L$ the line search is not efficient; by B1, the set of breakpoints is then nonempty and $\alpha\ge\min S$, so at least one new bound is fixed. Thus ultimately, each iteration fixes some new bound, and by the structure of Algorithm 4.1 no bound is freed in these iterations. Since the number of activities is finite, this can happen only a finite number of times, a contradiction. Thus the line search is efficient infinitely often, and, by the definition of efficiency, there is a number $\rho>0$ such that infinitely often
\[
\frac{(f-\bar f)\|p\|_2^2}{(g^Tp)^2}\ge\rho. \tag{28}
\]
Now by (24), for each $i$ either $(g_{\rm red})_i=g_i$, or $(g_{\rm red})_i=0$ and $p_i=0$. Therefore, in both cases, $(g_{\rm red})_ip_i=g_ip_i$, and summing gives
\[
g^Tp=g_{\rm red}^Tp.
\]
Since the $x^l$ are bounded and $f$ is continuous, the $f^l=f(x^l)$ are bounded, hence $\widehat f:=\inf f^l$ is finite. Since we have a descent sequence, $\lim f^l=\widehat f$. In the infinitely many iterations satisfying (28), we conclude from (25) that
\[
\delta\|g_{\rm red}\|_2\le-\frac{g_{\rm red}^Tp}{\|p\|_2}=\frac{|g^Tp|}{\|p\|_2}\le\sqrt{\frac{f-\bar f}{\rho}}\to 0,
\]
so that $\inf_l\|g_{\rm red}^l\|_2=0$.

(ii) By continuity of the gradient, $\widehat g:=g(\widehat x)=\lim_{l\to\infty}g^l$. Let $i$ be an index for which $\widehat g_i>0$. Then there is a number $L$ such that $g_i^l>0$ for $l>L$, and the definition (5) of the reduced gradient implies that for $l>L$,
\[
(g_{\rm red}^l)_i=\begin{cases}0 & \hbox{if } x_i^l=\underline x_i,\\ g_i^l & \hbox{otherwise}.\end{cases}
\]
By part (i), a subsequence of the $g_{\rm red}^l$ converges to zero, and we conclude that $x_i^l=\underline x_i$ for infinitely many $l>L$. Since also $g_i^l>0$ for $l>L$, (4) implies that the bound cannot be freed anymore. Thus $x_i^l=\underline x_i$ for all sufficiently large $l$, and $\widehat x_i=\lim_{l\to\infty}x_i^l=\underline x_i$, so that (26) holds for sufficiently large $l$. Similarly, if $i$ is an index for which $\widehat g_i<0$ then (27) holds for sufficiently large $l$. Using (26), (27), and the definition of the reduced gradient, we may now conclude that $g_{\rm red}(\widehat x)=\lim_{l\to\infty}g_{\rm red}^l=0$. Thus the first order optimality conditions hold. □

Thus all strongly active variables will ultimately be fixed. As a consequence, the algorithm is asymptotically zigzag-free for all problems whose stationary points have no activities with zero gradient.

6 Subspace steps

The solution $x$ of a linear system $Gx=b$ with symmetric, positive definite coefficient matrix $G\in\mathbb R^{n\times n}$ and right hand side $b\in\mathbb R^n$ may be viewed as the minimizer of the strictly convex quadratic function
\[
f(x):=\frac12 x^TGx-b^Tx+f_0,
\]
since its gradient
\[
g(x)=Gx-b \tag{29}
\]
vanishes precisely at the solution. For any $s\in\mathbb R^n$, we may compute the Hessian-vector product
\[
y:=Gs=g(x+s)-g(x)
\]
in terms of gradient information only; here the choice of $x\in\mathbb R^n$ is completely arbitrary. If we have a list $S:=[s^1,\dots,s^m]$ of vectors $s^\ell$ for which
\[
y^\ell:=Gs^\ell
\]
is available, we may form the matrix $Y:=[y^1,\dots,y^m]=GS$, and find for any $z\in\mathbb R^m$ the subspace expansion
\[
f(x-Sz)=f(x)-g(x)^TSz+\frac12(Sz)^TGSz=f-c^Tz+\frac12 z^THz,
\]
where
\[
f:=f(x),\qquad c:=S^Tg(x),\qquad H:=S^TGS=S^TY=Y^TS.
\]
We assume that the columns of $S$ are linearly independent; then $H$ is positive definite. The minimum of $f(x-Sz)$ with respect to $z$ is attained at
\[
\widehat z:=H^{-1}c=H^{-1}S^Tg(x),
\]
and the associated point and gradient are
\[
\widehat x=x-S\widehat z,\qquad \widehat g:=g(x)-Y\widehat z.
\]
If we calculate $y=Gs$ at an additional direction $s\ne 0$, we have the consistency relations
\[
h:=S^TGs=Y^Ts=S^Ty,\qquad 0<\gamma:=s^TGs=y^Ts=\frac{f(x+\alpha s)-f-\alpha g^Ts}{\alpha^2/2} \quad\hbox{for all } \alpha\ne 0.
\]
We may now form the augmented matrices
\[
S:=[S,s],\qquad Y:=GS=[Y,y],\qquad H:=S^TGS=\begin{pmatrix}H & h\\ h^T & \gamma\end{pmatrix}.
\]
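The formulas of this section are easily checked numerically; the following Matlab sketch builds $S$, $Y$, $H$ for a random positive definite toy quadratic (the data are ours) and verifies that the subspace gradient $S^T\widehat g$ vanishes at the subspace minimizer.

  n = 6; m = 3;
  A = randn(n); G = A'*A + eye(n);   % symmetric positive definite Hessian
  b = randn(n,1); x = randn(n,1);
  g = G*x - b;                       % gradient (29)
  S = randn(n,m);                    % previous steps, columns assumed independent
  Y = G*S;                           % in general obtained as y = g(x+s) - g(x)
  c = S'*g; H = S'*Y;                % H = S'*G*S is positive definite
  zhat = H\c;                        % subspace minimizer
  xhat = x - S*zhat; ghat = g - Y*zhat;
  disp(norm(S'*ghat))                % vanishes up to roundoff
  disp(norm(G*xhat - b - ghat))      % ghat is indeed the gradient at xhat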
7 Implementation issues

In the program, the $i$th components of $\underline x$ and $\overline x$ are stored in low(i) and upp(i), respectively.

The initial point. Conjugate direction methods and their limited memory versions produce search directions that are linear combinations of the previous steps and the preconditioned gradient. This may lead to inefficiencies, e.g., for minimizing the quadratic function
\[
f(x):=(x_1-1)^2+\sum_{i=2}^n(x_i-x_{i-1})^2
\]
when started from $x^0=0$ with a diagonal preconditioner. It is easy to see by induction that, for any method that chooses its search directions as linear combinations of the previously computed preconditioned gradients, the $i$th iteration point has zeros in all coordinates $k>i$, and its gradient has zeros in all coordinates $k>i+1$. Since the solution is the all-one vector, this implies that at least $n$ iterations are needed to reduce the maximal error in the components of $x$ to below one.

Thus for large-scale problems, some attention should be given to choosing the initial point not too special, so that the gradient contains significant information about all components. This is especially important in the bound-constrained case, where the signs of the gradient components determine which variables may be freed.

The following piece of Matlab code moves a user-given initial point x slightly into the relative interior of the feasible domain. deltax is a number in $[0,1[$; choosing it as 0 just projects the starting point into the feasible box.

  % force interior starting point when deltax>0
  ind=(x<=low); x(ind)=low(ind)+min(1,deltax*(upp(ind)-low(ind)));
  ind=(x>=upp); x(ind)=upp(ind)-min(1,deltax*(upp(ind)-low(ind)));

Avoiding zigzagging. Zigzagging is the main source of inefficiency of simple methods such as steepest descent. Any search direction $p$ must satisfy $g^Tp<0$. In order to avoid zigzagging, we choose the search direction $p$ as the vector with a fixed value $g^Tp=-\gamma<0$ closest (with respect to the 2-norm) to the previous search direction.

7.1 Theorem. Among all $p\in\mathbb R^n$ with $g^Tp=-\gamma<0$, the distance $\|p-p_0\|_2^2=(p-p_0)^T(p-p_0)$ becomes minimal for
\[
p=p_0-\lambda g, \tag{30}
\]
where
\[
\lambda=\frac{\gamma+g^Tp_0}{g^Tg}. \tag{31}
\]

Proof. This optimization problem can be solved using Lagrange multipliers. We have to find a stationary point of the Lagrange function
\[
L(p):=\frac12(p-p_0)^T(p-p_0)+\lambda g^Tp,
\]
giving the condition $p-p_0+\lambda g=0$, hence (30). The Lagrange multiplier $\lambda$ is determined from the constraint $g^Tp=-\gamma$, and yields (31). □

Enforcing the angle condition. In general, if we have a direction $q$, we may always add a multiple of the gradient to enforce the angle condition for the modified direction
\[
p=q-tg \tag{32}
\]
with a suitable factor $t\ge 0$; the case $t=0$ corresponds to the case where $q$ already satisfies the angle condition. The choice of $t$ depends on the three numbers
\[
\sigma_1:=g^TB^{-1}g>0,\qquad \sigma_2:=q^TB^{-1}q>0,\qquad \sigma:=g^TB^{-1}q;
\]
these are related by the Cauchy–Schwarz inequality,
\[
c:=\frac{\sigma}{\sqrt{\sigma_1\sigma_2}}\in[-1,1].
\]
We want to choose $t$ such that the angle condition
\[
\frac{g^TB^{-1}p}{\sqrt{g^TB^{-1}g\cdot p^TB^{-1}p}}\le-\delta_{\rm angle} \tag{33}
\]
holds. In terms of the $\sigma_i$, this reads
\[
\frac{\sigma-t\sigma_1}{\sqrt{\sigma_1(\sigma_2-2t\sigma+t^2\sigma_1)}}\le-\delta_{\rm angle}.
\]
If $c\le-\delta_{\rm angle}$, this holds for $t=0$, and we make this choice. Otherwise we enforce equality. Squaring, multiplying with the denominator, and subtracting $\delta_{\rm angle}^2(\sigma-t\sigma_1)^2$ gives
\[
(1-\delta_{\rm angle}^2)(\sigma-t\sigma_1)^2=\delta_{\rm angle}^2\sigma_1(\sigma_2-2t\sigma+t^2\sigma_1)-\delta_{\rm angle}^2(\sigma-t\sigma_1)^2=\delta_{\rm angle}^2(\sigma_1\sigma_2-\sigma^2),
\]
hence
\[
|\sigma-t\sigma_1|=\delta_{\rm angle}\sqrt w,\qquad\hbox{where}\quad w:=\frac{\sigma_1\sigma_2(1-c^2)}{1-\delta_{\rm angle}^2}.
\]
(To ensure that $w\ge 0$ even in finite precision arithmetic, one should use $\max(\varepsilon,1-c^2)$ in place of $1-c^2$, where $\varepsilon$ is the machine precision.) For the larger solution,
\[
t=\frac{\sigma+\delta_{\rm angle}\sqrt w}{\sigma_1},
\]
$p$ satisfies the angle condition with equality.
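In code, the choice of $t$ may be realized as in the following Matlab sketch. For simplicity we assume that the preconditioner $B$ is the identity (for general $B$, replace the inner products below by $B^{-1}$-inner products, e.g. g'*(B\g)); the function name enforce_angle is our own choice.

  function [p,t] = enforce_angle(g, q, delta)
  % enforce the angle condition (33) for p = q - t*g, with delta in ]0,1[;
  % B is assumed to be the identity here
  sigma1 = g'*g;                     % sigma_1 = g'*inv(B)*g
  sigma2 = q'*q;                     % sigma_2 = q'*inv(B)*q
  sigma  = g'*q;                     % sigma   = g'*inv(B)*q
  c = sigma/sqrt(sigma1*sigma2);
  if c <= -delta
    t = 0;                           % q already satisfies the angle condition
  else
    w = sigma1*sigma2*max(eps,1-c^2)/(1-delta^2);  % finite precision guard
    t = (sigma + delta*sqrt(w))/sigma1;            % larger root: equality in (33)
  end
  p = q - t*g;
  end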
Taking account of finite precision effects.

Stopping. Algorithm 4.1 is appropriate for a rigorous convergence analysis. In practice, one needs additional stopping criteria that guarantee an approximate result after finitely many steps. In our implementation, we terminate the algorithm whenever one of the following conditions is satisfied:
• The norm of the reduced gradient is sufficiently small.
• A large number of iterations have been done.

8 Numerical results

References