Methods of Optimization
Fall 2009
Prof. S. Ta'Asan; notes by Chris Almost ([email protected])

Contents
1 Review and Definitions
  1.1 Convex sets
  1.2 Convex functions
  1.3 Convex optimization
2 Unconstrained Optimization
  2.1 Inequalities
  2.2 Descent methods
  2.3 Newton's method
  2.4 Non-differentiable optimization
3 Constrained Optimization
  3.1 Separating hyperplanes
  3.2 Conjugate functions
  3.3 Duality
  3.4 Complementary slackness
  3.5 Alternatives
4 Algorithms
Index

1 Review and Definitions

1.1 Convex sets

The line through x_1, x_2 ∈ R^n is the set of points {(1 − θ)x_1 + θx_2 | θ ∈ R}. An affine set is a set that contains the entire line through any pair of points in it, i.e. C is affine if (1 − θ)x_1 + θx_2 ∈ C for all x_1, x_2 ∈ C and θ ∈ R. Notice that if C is an affine set and x_0 ∈ C is chosen, then C − x_0 = {x − x_0 | x ∈ C} is a vector space. Indeed, (x_1 − x_0) + (x_2 − x_0) = (2(½x_1 + ½x_2) − x_0) − x_0 ∈ C − x_0, since 2(½x_1 + ½x_2) − x_0 lies on the line through x_0 and ½x_1 + ½x_2 ∈ C, and θ(x − x_0) = (θx + (1 − θ)x_0) − x_0 ∈ C − x_0. If A is a matrix then the set of solutions x to Ax = b is an affine set. A convex set is a set that contains the line segment joining any pair of points in it, i.e.
C is convex if (1 − θ)x_1 + θx_2 ∈ C for all x_1, x_2 ∈ C and θ ∈ [0, 1]. A cone is a set that contains the ray emanating from the origin through any point in it, i.e. K is a cone if θx ∈ K for all x ∈ K and θ > 0. A hyperplane is {x | aᵀx − b = 0} and a half space is {x | aᵀx − b ≤ 0}, where a is a vector and b is a scalar. The norm cone is {(x, t) | ‖x‖ ≤ t} ⊆ R^(n+1), where ‖·‖ is any norm on R^n.

The set of symmetric n × n matrices is denoted S^n. Define

    S^n_+  = {A ∈ S^n | xᵀAx ≥ 0 for all x ∈ R^n} and
    S^n_++ = {A ∈ S^n | xᵀAx > 0 for all x ≠ 0}.

Note that S^n_+ and S^n_++ are convex cones in the convex set S^n ⊆ R^(n×n).

1.2 Convex functions

A convex function is a function whose graph lies below the line segment joining any two points of it. More precisely, f is convex if its domain is convex and f((1 − θ)x_1 + θx_2) ≤ (1 − θ)f(x_1) + θf(x_2) for all x_1, x_2 ∈ dom(f) and θ ∈ [0, 1]. Any norm is a convex function with domain R^n.

Recall from multi-variable calculus that if f is a differentiable convex function then f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f). In fact,

1.2.1 Proposition. If f is differentiable and dom(f) is convex then f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f).

PROOF: Let f be differentiable. On R, if f is convex then f(θy + (1 − θ)x) ≤ θf(y) + (1 − θ)f(x), so we get

    f(y) − f(x) ≥ (1/θ)(f(x + θ(y − x)) − f(x)) → f′(x)(y − x) as θ → 0,

which is the required inequality. Conversely, still on R, let x, y ∈ dom(f), θ ∈ [0, 1], and z = θx + (1 − θ)y. We have f(x) ≥ f(z) + f′(z)(x − z) and f(y) ≥ f(z) + f′(z)(y − z), so multiplying by θ and 1 − θ respectively and adding we get

    θf(x) + (1 − θ)f(y) ≥ f(z) + f′(z)(θx + (1 − θ)y − z) = f(z).

In R^n, let g(t) := f(ty + (1 − t)x). Then g is a convex function on its domain, a subset of R. Note that g′(t) = ∇f(ty + (1 − t)x)ᵀ(y − x), so the one dimensional case implies the n dimensional case.

1.2.2 Proposition.
If f is twice differentiable and dom(f) is convex then f is convex if and only if ∇²f(x) ⪰ 0.

PROOF: Let f be twice differentiable. We have seen that f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom(f). On R, if f is convex, for x, y ∈ dom(f) we have f(y) ≥ f(x) + f′(x)(y − x) and f(x) ≥ f(y) + f′(y)(x − y). Adding and dividing by (x − y)²,

    0 ≤ (f′(x) − f′(y))/(x − y) → f″(x) as y → x.

Conversely, still on R, let x, y ∈ dom(f) with x < y, and note that the following is non-negative:

    ∫_x^y f″(z)(y − z) dz = [f′(z)(y − z)]_x^y + ∫_x^y f′(z) dz = −f′(x)(y − x) + f(y) − f(x).

In R^n, for x, y ∈ dom(f) fixed, define g(t) = g_{x,y}(t) := f(x + t(y − x)). Then

    g′(t) = ∇f(x + t(y − x))ᵀ(y − x), and
    g″(t) = (y − x)ᵀ∇²f(x + t(y − x))(y − x).

Now g = g_{x,y} is convex for every x, y ∈ dom(f) if and only if f is convex, and g″(t) ≥ 0 for all t ∈ [0, 1] and all x, y ∈ dom(f) if and only if (y − x)ᵀ∇²f(x + t(y − x))(y − x) ≥ 0 for all such t, x, y. It follows that f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ dom(f).

Let f be convex in its first variable for every value of its second variable in some set A (not necessarily convex). Let g(x) := sup_{y∈A} f(x, y). Then g is convex. Indeed,

    g(θx_1 + (1 − θ)x_2) = sup_{y∈A} f(θx_1 + (1 − θ)x_2, y)
      ≤ sup_{y∈A} (θf(x_1, y) + (1 − θ)f(x_2, y))
      ≤ sup_{y∈A} θf(x_1, y) + sup_{y∈A} (1 − θ)f(x_2, y)
      = θ sup_{y∈A} f(x_1, y) + (1 − θ) sup_{y∈A} f(x_2, y)
      = θg(x_1) + (1 − θ)g(x_2).

Let C ⊆ R^n be a convex set and let f : R^m × C → R be convex. Then g(x) := inf_{y∈C} f(x, y) is convex.
Indeed, for x_1, x_2 ∈ R^m and θ ∈ [0, 1],

    g(θx_1 + (1 − θ)x_2) = inf_{y∈C} f(θx_1 + (1 − θ)x_2, y)
      = inf_{y_1,y_2∈C} f(θx_1 + (1 − θ)x_2, θy_1 + (1 − θ)y_2)
      ≤ inf_{y_1,y_2∈C} (θf(x_1, y_1) + (1 − θ)f(x_2, y_2))
      = θ inf_{y_1∈C} f(x_1, y_1) + (1 − θ) inf_{y_2∈C} f(x_2, y_2)
      = θg(x_1) + (1 − θ)g(x_2),

since the sets C and θC + (1 − θ)C are equal.

1.3 Convex optimization

The constrained optimization problem we consider is that of minimizing a function f_0 : R^n → R subject to inequality constraints f_i(x) ≤ 0 for i = 1, …, m, and equality constraints h_j(x) = 0 for j = 1, …, p, where the f_i are convex and the h_j are affine. Call the set of points satisfying the constraints X. If f_0 is convex and differentiable then f_0(y) ≥ f_0(x) + ∇f_0(x)ᵀ(y − x) for all x, y ∈ X. If x* ∈ X is a minimizer then ∇f_0(x*)ᵀ(y − x*) ≥ 0 for all y ∈ X. These are the first order conditions. Note that for the unconstrained problem X = R^n, the first order conditions are ∇f_0(x*) = 0, and they are necessary and sufficient.

Here is a useful trick: if you can write f(x + tv) = f(x) + twᵀv + O(t²) then w = ∇f(x). This follows from Taylor's theorem. For example, if P ∈ S^n_+ and f(x) := ½xᵀPx + qᵀx + r then

    f(x + tv) = ½(x + tv)ᵀP(x + tv) + qᵀ(x + tv) + r = f(x) + t(q + Px)ᵀv + O(t²),

so ∇f(x) = Px + q.  (Q)

1.3.1 Example. Solve the unconstrained problem min_x ‖Ax − b‖₂². This is a general form of the problem of finding a least-squares approximation. Note that

    ‖Ax − b‖₂² = xᵀ(AᵀA)x − 2(Aᵀb)ᵀx + bᵀb,

so the gradient is 2(AᵀAx − Aᵀb) by (Q). Setting the gradient equal to zero gives the normal equations, AᵀAx = Aᵀb.

2 Unconstrained Optimization

2.1 Inequalities

Consider the unconstrained problem min_x f(x), where f is a convex function. Let x* = argmin_x f(x). We assume that x* is unknown and p* = f(x*) is unknown, but ∇f(x) exists and is known.
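As a quick numerical check of Example 1.3.1 above: for a two-column A (fitting y = c_0 + c_1 t), the 2×2 normal equations can be solved in closed form. This is a toy sketch in pure Python; the helper name normal_equations_fit is ours, not from the notes.

```python
# Check of Example 1.3.1: the minimizer of ||Ax - b||_2^2 solves the
# normal equations A^T A x = A^T b. Here A = [[1, t] for t in ts], so
# A^T A is 2x2 and can be inverted by Cramer's rule.

def normal_equations_fit(ts, ys):
    n = len(ts)
    s_t = sum(ts)
    s_tt = sum(t * t for t in ts)
    s_y = sum(ys)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    # Solve [[n, s_t], [s_t, s_tt]] [c0, c1] = [s_y, s_ty].
    det = n * s_tt - s_t * s_t
    c0 = (s_y * s_tt - s_t * s_ty) / det
    c1 = (n * s_ty - s_t * s_y) / det
    return c0, c1

# Data lying exactly on y = 3 + 2t: the residual can be driven to zero,
# so the normal equations must recover (3, 2).
ts = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 + 2.0 * t for t in ts]
c0, c1 = normal_equations_fit(ts, ys)
```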
We would like to find x (our guess at the minimum) such that |p* − f(x)| is small or ‖x − x*‖ is small. Assume that f is twice differentiable. By Taylor's theorem,

    f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)

for some z on the line segment between x and y. Assume that, for some m > 0, ∇²f(x) ⪰ mI for all x ∈ R^n. Then

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖².

This expression is a quadratic function of y, so it has a minimum. Differentiating with respect to y, for the minimizer y* we get ∇f(x) + my* − mx = 0, i.e. y* − x = −(1/m)∇f(x). Then

    f(y) ≥ f(x) − (1/m)∇f(x)ᵀ∇f(x) + (1/2m)‖∇f(x)‖² = f(x) − (1/2m)‖∇f(x)‖².

Whence p* = min_y f(y) ≥ f(x) − (1/2m)‖∇f(x)‖², so

    0 ≤ f(x) − p* ≤ (1/2m)‖∇f(x)‖².  (1)

Under the same assumptions, for any x, y ∈ R^n, as before,

    f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)
      ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖²,

so

    p* = f(x*) ≥ f(x) + ∇f(x)ᵀ(x* − x) + (m/2)‖x* − x‖²
      ≥ f(x) − ‖∇f(x)‖‖x* − x‖ + (m/2)‖x* − x‖².

Therefore 0 ≥ p* − f(x) ≥ −‖∇f(x)‖‖x* − x‖ + (m/2)‖x* − x‖², so

    ‖x* − x‖ ≤ (2/m)‖∇f(x)‖.  (2)

Now assume that, for some M > 0, ∇²f(x) ⪯ MI for all x ∈ R^n. Then

    f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)
      ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖²,

so

    p* = min_y f(y) ≤ min_y { f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖² }.

We can solve the minimization problem on the right hand side, as above, by setting the derivative with respect to y to zero: ∇f(x) + My* − Mx = 0, so y* − x = −(1/M)∇f(x). Plugging this in,

    p* − f(x) ≤ −(1/M)∇f(x)ᵀ∇f(x) + (1/2M)‖∇f(x)‖² = −(1/2M)‖∇f(x)‖²,

so

    (1/2M)‖∇f(x)‖² ≤ f(x) − p*.  (3)

2.2 Descent methods

Let {x^(n)}_{n=1}^∞ be a sequence of vectors. If x^(n+1) = ρx^(n) for some 0 < ρ < 1 then x^(n) → 0, for any starting value x^(0), and we say that x^(n) has linear convergence with constant ρ.
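Inequalities (1), (2) and (3) can be checked numerically on a simple quadratic f(x) = ½ Σ_i d_i x_i², for which ∇²f = diag(d), m = min_i d_i, M = max_i d_i, p* = 0 and x* = 0. A toy check, not part of the notes:

```python
# Check (1), (2), (3) on f(x) = (1/2) sum d_i x_i^2, whose Hessian is
# diag(d): m = min(d), M = max(d), p* = 0, x* = 0.

d = [1.0, 4.0, 10.0]
m, M = min(d), max(d)

def f(x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(x):
    return [di * xi for di, xi in zip(d, x)]

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

x = [3.0, -2.0, 0.5]
gap = f(x) - 0.0          # f(x) - p*
gnorm = norm(grad(x))     # ||grad f(x)||_2
dist = norm(x)            # ||x - x*||_2
```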
The general descent method is as follows.

    def descent(f, df, x0):
        x = x0
        while norm(df(x)) > delta:          # delta is a stopping tolerance
            dx = compute_dir(f, df, x)      # choose a descent direction
            t = line_search(f, df, x, dx)
            x = x + t * dx
        return x

Gradient descent

The gradient descent method chooses Δx = −∇f(x). The line search for this method is either the exact line search (i.e. solve min_t f(x + tΔx) exactly for t) or the backtracking line search, as follows. The function g(t) := f(x + tΔx) is convex, and the line {(t, f(x) + t∇f(x)ᵀΔx) | t ∈ R} supports the graph of g at the point t = 0. Choose an α < ½ and form the reference line {(t, f(x) + αt∇f(x)ᵀΔx) | t ∈ R}, which intersects the graph of g at t = 0 and at some t_0 > 0. See Figure 1. Choose a reduction factor β < 1 and run the following.

    t = 1
    while f(x + t * dx) > f(x) + alpha * t * dot(df(x), dx):
        t = beta * t
    return t

Observe that this algorithm terminates with t ∈ [βt_0, min{t_0, 1}], where t_0 is the other point at which the reference line intersects the graph of g.

Figure 1: Backtracking line search

2.2.1 Theorem. Let f be twice continuously differentiable and such that mI ⪯ ∇²f(x) ⪯ MI for all x ∈ R^n. Assume that m > 0. Then the gradient descent method applied to f converges linearly, with constant 1 − m/M using exact line search and constant 1 − 2αβm/M using backtracking line search.

PROOF: Let {x^(n)}_{n=1}^∞ be the sequence of points encountered by the gradient descent method. For exact line search, let t_exact := argmin_t f(x − t∇f(x)). Then

    f(x − t∇f(x)) ≤ f(x) − t‖∇f(x)‖² + t²(M/2)‖∇f(x)‖²
    f(x − t_exact∇f(x)) ≤ f(x) − (1/2M)‖∇f(x)‖²

so

    f(x^(n+1)) − p* ≤ f(x^(n)) − p* − (1/2M)‖∇f(x^(n))‖²
      ≤ (1 − m/M)(f(x^(n)) − p*).

The second line follows from minimizing the right hand side of the first with respect to t, and the last line follows from inequality (1), written as −‖∇f(x)‖² ≤ −2m(f(x) − p*). This proves convergence since m < M.
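The descent framework and backtracking loop above can be assembled into a runnable sketch. The names gradient_descent and backtracking, and the test function, are ours; gradients are supplied by the caller, as in the pseudocode.

```python
# Runnable version of the descent framework, specialized to gradient
# descent (dx = -grad f(x)) with backtracking line search (alpha < 1/2,
# beta < 1).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backtracking(f, df, x, dx, alpha=0.25, beta=0.5):
    t = 1.0
    g = dot(df(x), dx)  # directional derivative; negative for a descent direction
    while f([xi + t * di for xi, di in zip(x, dx)]) > f(x) + alpha * t * g:
        t *= beta
    return t

def gradient_descent(f, df, x0, delta=1e-8):
    x = list(x0)
    while dot(df(x), df(x)) ** 0.5 > delta:
        dx = [-gi for gi in df(x)]          # steepest descent for the 2-norm
        t = backtracking(f, df, x, dx)
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x

# Minimize f(x, y) = (x - 1)^2 + 10 y^2, so m = 2 and M = 20; the
# minimizer is (1, 0).
f = lambda v: (v[0] - 1.0) ** 2 + 10.0 * v[1] ** 2
df = lambda v: [2.0 * (v[0] - 1.0), 20.0 * v[1]]
xstar = gradient_descent(f, df, [5.0, 3.0])
```

With these constants the theorem guarantees a contraction factor of at most 1 − 2αβm/M = 0.975 per iteration for the backtracking variant.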
For the backtracking method, as above,

    f(x − t∇f(x)) ≤ f(x) − t‖∇f(x)‖² + t²(M/2)‖∇f(x)‖²
      = f(x) + (−t + (M/2)t²)‖∇f(x)‖²
      ≤ f(x) − ½t‖∇f(x)‖²
      ≤ f(x) − αt‖∇f(x)‖²,

where the third line holds for all 0 ≤ t ≤ 1/M. It follows that the backtracking step always terminates with a non-zero t, and further, either t = 1 and f(x − ∇f(x)) ≤ f(x) − α‖∇f(x)‖², or β/M ≤ t ≤ 1 and

    f(x − t∇f(x)) ≤ f(x) − αt‖∇f(x)‖² ≤ f(x) − (αβ/M)‖∇f(x)‖².

Therefore, for the iteration,

    f(x^(n+1)) − p* ≤ f(x^(n)) − p* − α min{1, β/M}‖∇f(x^(n))‖²
      ≤ (f(x^(n)) − p*)(1 − 2mα min{1, β/M})
      = (f(x^(n)) − p*)(1 − 2αβm/M),

the second line again by (1), and the last equality when β/M ≤ 1.

Data processing

Here is our first application. Given data {u*_j}_{j=1}^N, interpreted as coming from some "rough" or "noisy" source, find {u_j}_{j=1}^N such that u is "nice" and u is "close to" u*. If by "close to" we mean that ‖u − u*‖₂² is small, and if by "nice" we mean that u has a reasonably sized quadratic variation, then the problem becomes

    min_u Σ_j (u_j − u*_j)² + β Σ_j (u_{j+1} − u_j)².

The term on the right is the roughness penalty and it involves an arbitrary parameter β controlling the relative importance of the two goals. Call this objective function f(u). It is convex (as a sum of convex functions) and twice continuously differentiable, with

    ∂f/∂u_k = 2(u_k − u*_k) + 2β(2u_k − u_{k−1} − u_{k+1})

for 1 ≤ k ≤ N (where u_k := 0 if k = 0 or k = N + 1). Setting ∇f(u) = 0, we need to solve the tridiagonal linear system

    [2 + 4β   −2β                        ]
    [ −2β    2 + 4β   −2β               ]
    [          ⋱        ⋱        ⋱      ]  u = 2u*.
    [              −2β    2 + 4β   −2β  ]
    [                       −2β   2 + 4β]

We can of course do the same thing when u* is two-dimensional (e.g. an image). Then the problem becomes

    min_u Σ_{ij} (u*_{ij} − u_{ij})² + β ( Σ_{ij} (u_{i+1,j} − u_{i,j})² + Σ_{ij} (u_{i,j+1} − u_{i,j})² ).

Steepest Descent

Given z ∈ R^n, z* : R^n → R : x ↦ zᵀx defines a linear functional. We write ‖z‖_* := max_{‖x‖=1} |zᵀx|, and call ‖·‖_* the dual norm of ‖·‖.
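Returning briefly to the data-processing application above: its tridiagonal system can be solved directly in O(N) time by Gaussian elimination specialized to tridiagonal matrices (the Thomas algorithm). A pure-Python sketch; the function name smooth and the test data are ours, and the boundary convention u_0 = u_{N+1} = 0 follows the notes.

```python
# Solve (2 + 4*beta) u_k - 2*beta (u_{k-1} + u_{k+1}) = 2 u*_k,
# with u_0 = u_{N+1} = 0, via the Thomas algorithm.

def smooth(ustar, beta):
    n = len(ustar)
    diag = [2.0 + 4.0 * beta] * n
    off = -2.0 * beta
    rhs = [2.0 * u for u in ustar]
    # Forward elimination.
    for k in range(1, n):
        w = off / diag[k - 1]
        diag[k] -= w * off
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        u[k] = (rhs[k] - off * u[k + 1]) / diag[k]
    return u

noisy = [0.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.0]
smoothed = smooth(noisy, beta=1.0)
```

With beta = 0 the system reduces to 2u = 2u*, i.e. no smoothing at all; larger beta weights the roughness penalty more heavily.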
The dual norm of the Euclidean norm is itself. Recall that the operator norm of a matrix A acting on (R^n, ‖·‖₁) is the largest of the ℓ¹-norms of the columns of A, and that the operator norm of A acting on (R^n, ‖·‖_∞) is the largest of the ℓ¹-norms of the rows of A. It follows that ‖z‖_{1,*} = ‖z‖_∞ and ‖z‖_{∞,*} = ‖z‖₁.

The steepest descent method chooses Δx so that f(x + tΔx) decreases in t most rapidly, for some definition of "rapidly". Recall from Taylor's theorem that f(x + tΔx) ≈ f(x) + t∇f(x)ᵀΔx. Given a norm ‖·‖ on R^n, we would like to find a direction vector Δx_nsd (i.e. ‖Δx_nsd‖ = 1) that minimizes ∇f(x)ᵀΔx (this minimum will be negative). That is to say, for steepest descent,

    Δx_nsd := argmin{∇f(x)ᵀΔx : ‖Δx‖ = 1}.

Let P ∈ S^n_++ and consider that

    ‖x‖_P² = xᵀPx = xᵀP^(1/2)P^(1/2)x = (P^(1/2)x)ᵀ(P^(1/2)x) = ‖P^(1/2)x‖₂².

Therefore we need to solve

    min{∇f(x)ᵀΔx : ‖P^(1/2)Δx‖₂ = 1}
      = min{∇f(x)ᵀP^(−1/2)(P^(1/2)Δx) : ‖P^(1/2)Δx‖₂ = 1}
      = min{(P^(−1/2)∇f(x))ᵀ(P^(1/2)Δx) : ‖P^(1/2)Δx‖₂ = 1}.

The solution to the final problem is known to satisfy P^(1/2)Δx = −P^(−1/2)∇f(x) (up to positive scaling), so the steepest descent direction for ‖·‖_P is Δx = −P^(−1)∇f(x).

2.2.2 Example. Let f(x, y) = εx² + y². Then f is twice continuously differentiable, ∇f(x, y) = (2εx, 2y), and ∇²f(x, y) = diag(2ε, 2). If we take P = diag(2ε, 2) then the steepest descent direction is

    Δx = −diag(2ε, 2)^(−1) (2εx, 2y) = −(x, y).

2.2.3 Example. For the norm ‖·‖₁, the steepest descent direction is Δx = −sign(∇f(x)_i) e_i, where i = argmax_i |∇f(x)_i|.

2.2.4 Theorem. Let f be twice continuously differentiable and such that mI ⪯ ∇²f(x) ⪯ MI for all x ∈ R^n. Assume that m > 0. Fix a norm ‖·‖ on R^n and choose γ ∈ (0, 1) such that ‖x‖ ≥ γ‖x‖₂ for all x ∈ R^n. The steepest descent method with backtracking line search applied to f converges linearly with constant 1 − 2mαγ² min{1, βγ²/M}.
PROOF: Define Δx_nsd := argmin{∇f(x)ᵀΔx : ‖Δx‖ = 1} and Δx_sd := ‖∇f(x)‖_* Δx_nsd. Then

    f(x + tΔx_sd) ≤ f(x) + t∇f(x)ᵀΔx_sd + t²(M/2)‖Δx_sd‖₂²     (Taylor's theorem)
      ≤ f(x) + t∇f(x)ᵀΔx_sd + t²(M/2γ²)‖Δx_sd‖²               (by choice of γ)
      = f(x) − t‖∇f(x)‖_*² + t²(M/2γ²)‖∇f(x)‖_*²              (definition of Δx_sd)
      = f(x) − t(1 − tM/2γ²)‖∇f(x)‖_*²
      ≤ f(x) − ½t‖∇f(x)‖_*²                                    (see below)
      ≤ f(x) − αt‖∇f(x)‖_*²                                    (α < ½)

The second last line follows because the minimum value of −t + t²M/2γ² is −γ²/2M, attained at t = γ²/M; the line joining the vertex of this parabola and the origin has slope −½ and lies above the parabola for t ∈ (0, γ²/M]. It follows that the result of the backtracking line search satisfies t ≥ min{1, βγ²/M}, so by (1),

    f(x^(n+1)) − p* ≤ (f(x^(n)) − p*) − α min{1, βγ²/M}‖∇f(x^(n))‖_*²
      ≤ (f(x^(n)) − p*) − αγ² min{1, βγ²/M}‖∇f(x^(n))‖₂²
      ≤ (f(x^(n)) − p*)(1 − 2mαγ² min{1, βγ²/M}).

2.3 Newton's method

Let {x_n}_{n=1}^∞ be a sequence of real numbers. If x_{n+1} = Kx_n² then, if x_n → 0, we say that x_n has quadratic convergence. Unfortunately, the conditions under which such a sequence converges depend on both K and x_0.

Newton's method is steepest descent where, at each step, we move in the steepest descent direction for the norm ‖·‖_{∇²f(x^(n))}. We will see that Newton's method has quadratic convergence. Another approach to Newton's method is as follows. Recall

    f(x + tv) ≈ f(x) + t∇f(x)ᵀv + ½t²vᵀ∇²f(x)v.

The minimum of ∇f(x)ᵀv + ½vᵀ∇²f(x)v over v occurs when ∇²f(x)v = −∇f(x), by (Q). Newton's method takes Δx = −∇²f(x)^(−1)∇f(x). A third approach is to note that f(x*) = min_x f(x) if and only if ∇f(x*) = 0. We can approximate the latter by solving ∇f(x + v) = 0, which is approximately solved by v satisfying ∇f(x) + ∇²f(x)v = 0, by Taylor's theorem.

2.3.1 Theorem.
Let f be twice continuously differentiable and such that mI ⪯ ∇²f(x) ⪯ MI for all x ∈ R^n. Assume that m > 0 and that there is L (the Lipschitz constant) such that ‖∇²f(x) − ∇²f(y)‖₂ ≤ L‖x − y‖₂ for all x, y ∈ R^n. Newton's method with backtracking line search applied to f satisfies the following. There are 0 < η ≤ m²/L and γ = γ(η) > 0 such that

(i) if ‖∇f(x^(n))‖ ≥ η then f(x^(n+1)) − f(x^(n)) ≤ −γ;
(ii) if ‖∇f(x^(n))‖ ≤ η then (L/2m²)‖∇f(x^(n+1))‖₂ ≤ ((L/2m²)‖∇f(x^(n))‖₂)² and the line search terminates with t = 1.

In fact, η = min{3(1 − 2α), 1}m²/L works.

Remark. It follows from (i) that the number of iterations needed, from any starting point x^(0), to get to case (ii) is about (f(x^(0)) − p*)/γ. If we are in case (ii) then ‖∇f(x^(n))‖ ≤ η implies ‖∇f(x^(n+1))‖ ≤ η, so we remain in case (ii). Further, for any ℓ > k,

    (L/2m²)‖∇f(x^(ℓ))‖₂ ≤ ((L/2m²)‖∇f(x^(k))‖₂)^(2^(ℓ−k)) ≤ (½)^(2^(ℓ−k)).

By (1),

    f(x^(n)) − p* ≤ (1/2m)‖∇f(x^(n))‖₂² ≤ (1/2m)(2m²/L)²(½)^(2^(n+1)) = (2m³/L²)(½)^(2^(n+1)).

Note that in this case the number of steps required so that f(x^(n)) − p* < ε is on the order of log₂ log₂ (1/ε). For all practical purposes, at most 6 steps will be required!

PROOF: Define Δx_nt := −(∇²f(x))^(−1)∇f(x) and let

    λ²(x) := ∇f(x)ᵀ(∇²f(x))^(−1)∇f(x) > 0.

Note that λ²(x) = −∇f(x)ᵀΔx_nt. From the linear approximation, this is the amount we decrease our approximation with a step of unit size. Call λ(x) Newton's decrement. Further,

    ‖Δx_nt‖₂² = ∇f(x)ᵀ(∇²f(x))^(−1)(∇²f(x))^(−1)∇f(x) ≤ (1/m)λ²(x).

As usual,

    f(x + tΔx_nt) ≤ f(x) + t∇f(x)ᵀΔx_nt + t²(M/2)‖Δx_nt‖₂²
      ≤ f(x) − tλ²(x) + t²(M/2m)λ²(x)
      = f(x) + (−t + t²M/2m)λ²(x).

The parabola is minimized at t̂ = m/M, and the slope of the line joining (0, 0) with the vertex (m/M, −m/2M) is −½. So when 0 < t ≤ m/M,

    f(x + tΔx_nt) ≤ f(x) − ½tλ²(x) ≤ f(x) − αtλ²(x) = f(x) + αt∇f(x)ᵀΔx_nt  (α < ½).
Therefore any t ∈ (0, m/M] satisfies the backtracking condition, so the t returned by the backtracking algorithm is at least βm/M. What is the reduction in f?

    f(x + t̂Δx_nt) ≤ f(x) − αβ(m/M)λ²(x), so f(x^(n+1)) − f(x^(n)) ≤ −αβ(m/M)λ²(x).

But λ²(x) = ∇f(x)ᵀ(∇²f(x))^(−1)∇f(x), so

    (1/M)‖∇f(x)‖₂² ≤ λ²(x) ≤ (1/m)‖∇f(x)‖₂².

Whence f(x^(n+1)) − f(x^(n)) ≤ −αβ(m/M²)‖∇f(x)‖₂² ≤ −αβ(m/M²)η² =: −γ.

Now for the second, more difficult, part. Let f̃(t) = f(x + tΔx_nt). The idea is to derive an estimate for f̃″, then for f̃′, and then for f̃.

    f̃′(t) = ∇f(x + tΔx_nt)ᵀΔx_nt        f̃″(t) = Δx_ntᵀ∇²f(x + tΔx_nt)Δx_nt
    f̃′(0) = ∇f(x)ᵀΔx_nt = −λ²(x)        f̃″(0) = Δx_ntᵀ∇²f(x)Δx_nt = λ²(x)

It follows from the Lipschitz condition that

    |f̃″(t) − f̃″(0)| = |Δx_ntᵀ(∇²f(x + tΔx_nt) − ∇²f(x))Δx_nt| ≤ tL‖Δx_nt‖₂³.

Whence

    f̃″(t) ≤ f̃″(0) + tL‖Δx_nt‖₂³ ≤ λ²(x) + t(L/m^(3/2))λ³(x),

so, integrating,

    f̃′(t) ≤ f̃′(0) + λ²(x)t + t²(L/2m^(3/2))λ³(x) = −λ²(x) + λ²(x)t + t²(L/2m^(3/2))λ³(x).

Integrating again,

    f̃(t) ≤ f̃(0) − tλ²(x) + ½t²λ²(x) + t³(L/6m^(3/2))λ³(x),
    f(x + tΔx_nt) − f(x) ≤ −tλ²(x) + ½t²λ²(x) + t³(L/6m^(3/2))λ³(x).

We need to check the exit condition for the backtracking line search:

    f(x + tΔx_nt) ≤ f(x) + αt∇f(x)ᵀΔx_nt  ⟺  f̃(t) ≤ f̃(0) − αtλ²(x).

We want to exit the line search algorithm on the first step, with t = 1. The following steps are reversible.

    −λ²(x) + ½λ²(x) + (L/6m^(3/2))λ³(x) ≤ −αλ²(x)
    −½ + (L/6m^(3/2))λ(x) ≤ −α
    λ(x) ≤ 3(1 − 2α)m^(3/2)/L

We have the inequality √m λ(x) ≤ ‖∇f(x)‖₂, so if we choose η ≤ 3(1 − 2α)m²/L then, since ‖∇f(x)‖ ≤ η in case (ii), λ(x) ≤ η/√m ≤ 3(1 − 2α)m^(3/2)/L, the inequality is satisfied and the line search terminates with t = 1.

It remains to show that

    (L/2m²)‖∇f(x^(n+1))‖₂ ≤ ((L/2m²)‖∇f(x^(n))‖₂)².

Consider the following.
Since ∇²f(x^(n))Δx_nt = −∇f(x^(n)),

    ‖∇f(x^(n+1))‖₂ = ‖∇f(x^(n) + Δx_nt) − ∇f(x^(n)) − ∇²f(x^(n))Δx_nt‖₂
      = ‖ ∫₀¹ (∇²f(x^(n) + tΔx_nt) − ∇²f(x^(n))) Δx_nt dt ‖₂
      ≤ ∫₀¹ ‖∇²f(x^(n) + tΔx_nt) − ∇²f(x^(n))‖₂ ‖Δx_nt‖₂ dt
      ≤ L ∫₀¹ t‖Δx_nt‖₂² dt = (L/2)‖Δx_nt‖₂² ≤ (L/2m²)‖∇f(x^(n))‖₂².

Multiplying both sides by L/2m² proves the inequality. The last step follows from

    ‖Δx_nt‖₂ = ‖(∇²f)^(−1)∇f(x)‖₂ ≤ ‖(∇²f)^(−1)‖₂‖∇f(x)‖₂ ≤ (1/m)‖∇f(x)‖₂.

2.4 Non-differentiable optimization

In applications the objective functions, while convex, may not always be differentiable. Whether or not f is differentiable at a point x, define the sub-gradient of f at x to be

    ∂f(x) = {g ∈ R^n | f(y) ≥ f(x) + gᵀ(y − x) for all y ∈ R^n}.

For example, ∂|0| = [−1, 1], and

    ∂(|x| + |y|) = {(sign(x), sign(y))}      if x ≠ 0 and y ≠ 0,
                   {sign(x)} × [−1, 1]       if x ≠ 0 and y = 0,
                   [−1, 1] × {sign(y)}       if x = 0 and y ≠ 0,
                   [−1, 1] × [−1, 1]         if (x, y) = (0, 0).

Note that the sub-gradient is always a convex set.

2.4.1 Lemma. x* = argmin_x f(x) if and only if 0 ∈ ∂f(x*).

PROOF: If 0 ∈ ∂f(x) then by definition f(y) ≥ f(x) + 0ᵀ(y − x) = f(x) for all y ∈ R^n, so x is a minimizer. Conversely, if 0 ∉ ∂f(x) then there is y ∈ R^n such that f(y) < f(x) + 0ᵀ(y − x) = f(x), so x is not a minimizer.

The sub-gradient method is described below. Choose a step size α_k and a direction g^(k) ∈ ∂f(x^(k)) and let x^(k+1) = x^(k) − α_k g^(k).

    x = xbest = x0
    while not done:
        choose g in the sub-gradient of f at x
        choose a step size alpha
        x = x - alpha * g
        if f(x) < f(xbest):
            xbest = x

Since g^(k) ∈ ∂f(x^(k)), it follows that f(x*) ≥ f(x^(k)) + g^(k)ᵀ(x* − x^(k)), so

    ‖x^(k+1) − x*‖₂² = ‖x^(k) − α_k g^(k) − x*‖₂²
      = ‖x^(k) − x*‖₂² − 2α_k g^(k)ᵀ(x^(k) − x*) + α_k²‖g^(k)‖₂²
      ≤ ‖x^(k) − x*‖₂² − 2α_k(f(x^(k)) − f(x*)) + α_k²‖g^(k)‖₂²
      ≤ ‖x^(1) − x*‖₂² − 2 Σ_{j=1}^k α_j(f(x^(j)) − f(x*)) + Σ_{j=1}^k α_j²‖g^(j)‖₂².

In particular we conclude the following.
    2 Σ_{j=1}^k α_j(f(x^(j)) − f(x*)) ≤ ‖x^(1) − x*‖₂² + Σ_{j=1}^k α_j²‖g^(j)‖₂²
    2(f(x_best) − f(x*)) Σ_{j=1}^k α_j ≤ ‖x^(1) − x*‖₂² + Σ_{j=1}^k α_j²‖g^(j)‖₂²
    f(x_best) − f(x*) ≤ ( ‖x^(1) − x*‖₂² + Σ_{j=1}^k α_j²‖g^(j)‖₂² ) / ( 2 Σ_{j=1}^k α_j )

There are several methods for choosing the sequence {α_k}_{k=1}^∞.

(i) Constant step size: α_k = h for all k ≥ 1.
(ii) Variable step size, scaled by g^(k): α_k = h/‖g^(k)‖₂ for all k ≥ 1.
(iii) ℓ² sequence: Σ_k α_k² < ∞ but Σ_k α_k = ∞.
(iv) c₀ sequence: lim_{k→∞} α_k = 0 but Σ_k α_k = ∞.

We assume that ∂f(x) is bounded for all x ∈ R^n by the same radius G. We analyze each case below.

(i) f(x_best) − f(x*) ≤ (‖x^(1) − x*‖₂² + kh²G²)/(2kh) → hG²/2 as k → ∞.
(ii) f(x_best) − f(x*) ≤ (‖x^(1) − x*‖₂² + kh²)/(2 Σ_{j=1}^k h/‖g^(j)‖₂) ≤ (‖x^(1) − x*‖₂² + kh²)G/(2hk) → hG/2.
(iii) f(x_best) − f(x*) ≤ (‖x^(1) − x*‖₂² + Σ_{j=1}^k α_j²‖g^(j)‖₂²)/(2 Σ_{j=1}^k α_j) → 0.
(iv) Exercise.

2.4.2 Example. In a data fitting problem we can choose our error function. Let's compare a sum-of-squares smoothness error

    Σ_j (u_j − u*_j)² + β Σ_j (u_{j+1} − u_j)²

with the non-differentiable

    Σ_j (u_j − u*_j)² + β Σ_j |u_{j+1} − u_j|.

Let h = 1/N and consider the two possible fits {u_j = jh} and {v_j = δ_{j,N}} on the grid {jh | j = 0, …, N}. Then Σ_j |u_{j+1} − u_j| = Σ_j h = Nh = 1 and Σ_j |v_{j+1} − v_j| = 1. But Σ_j (u_{j+1} − u_j)² = Σ_j h² = Nh² = 1/N while Σ_j (v_{j+1} − v_j)² is still 1. If the "true" curve is expected to have jumps then we would prefer the non-differentiable error function, because the sum-of-squares error function gives too much preference to smoother fits, i.e. we lose "sharpness."

3 Constrained Optimization

Let f : R^n → R be a differentiable convex function. Recall that min_x f(x) has a solution at x* if and only if ∇f(x*) = 0. This is the unconstrained case. Consider the problem min_x f(x) subject to constraints such as Ax = b or x ≥ 0.
The minimizer may occur at some x* such that ∇f(x*) ≠ 0. Recall that f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ R^n, and that x* is a minimizer if and only if ∇f(x*)ᵀ(y − x*) ≥ 0 for all y satisfying the constraints.

For an equality constraint Ax = b we require that Ay = b. The minimizer must of course satisfy the constraints, so we obtain A(y − x*) = 0. Furthermore, every element of ker(A) is obtained as y − x* for some such y. Therefore, for any v ∈ ker(A), ∇f(x*)ᵀv ≥ 0. Again, this is a linear functional bounded below on a linear space, so it must be the case that ∇f(x*)ᵀ ≡ 0 on ker(A). It is a fact that R^n = ker(A) ⊕ range(Aᵀ), so ∇f(x*) ∈ range(Aᵀ). Write ∇f(x*) = −Aᵀv. It follows that x* is a minimizer if and only if Ax* = b and ∇f(x*) + Aᵀv = 0 for some v.

3.0.3 Example. We solve min_x ½‖x − x_0‖₂² subject to Ax = b. Then ∇f(x) = x − x_0, so the optimality condition becomes Ax* = b and Aᵀv + x* − x_0 = 0. We can write this as the system

    [A  0 ] [x]   [b  ]
    [I  Aᵀ] [v] = [x_0].

For the non-negativity constraint x ≥ 0, recall that ∇f(x*)ᵀ(y − x*) ≥ 0 for all y ∈ X. Then ∇f(x*)ᵀy ≥ ∇f(x*)ᵀx* for all y ≥ 0, so at y = 0 we get ∇f(x*)ᵀx* ≤ 0. Further, ∇f(x*)ᵀ is a linear form bounded below on the positive cone, so ∇f(x*) ≥ 0, and hence also ∇f(x*)ᵀx* ≥ 0. Combining these, we find that x*_i (∇f(x*))_i = 0 for all i = 1, …, n. This is also called complementary slackness.

3.1 Separating hyperplanes

3.1.1 Theorem. Let C, D ⊆ R^n be disjoint convex sets. Then there are a ∈ R^n and b ∈ R such that aᵀc ≤ b for all c ∈ C and aᵀd ≥ b for all d ∈ D. If C is closed and D is compact then the inequalities may be taken to be strict. If C and D are both open then the inequalities may be taken to be strict.

Consider the inequalities Ax ≤ b. This system is equivalent to b − Ax ≥ 0, and it has a solution if and only if C = {b − Ax | x ∈ R^n} and D = {y | y ≥ 0} have a non-empty intersection.
If C ∩ D = ∅ then by the separating hyperplane theorem there are λ ∈ R^n and μ ∈ R, not both zero, such that λᵀz + μ ≤ 0 for all z ∈ C and λᵀz + μ ≥ 0 for all z ∈ D. The first condition, λᵀ(b − Ax) + μ ≤ 0 for all x ∈ R^n, implies that the linear functional x ↦ −λᵀAx is bounded above. A linear functional is bounded above if and only if it is identically zero, so λᵀA = 0, i.e. Aᵀλ = 0; and then λᵀb + μ ≤ 0. The second condition, λᵀy + μ ≥ 0 for all y ≥ 0, implies in particular that μ ≥ 0 (taking y = 0). It follows that λᵀb ≤ −μ ≤ 0. Further, if λ had a negative coordinate then the inequality λᵀy + μ ≥ 0 would fail for a suitable y ≥ 0, so λ ≥ 0 (and λ ≠ 0). Therefore Ax ≤ b does not have a solution if and only if there is λ ≥ 0, λ ≠ 0, such that Aᵀλ = 0 and λᵀb ≤ 0.

3.2 Conjugate functions

Let f : R^n → R be any function and let f*(y) = sup_{x∈dom(f)} (yᵀx − f(x)), defined for all y for which this supremum is finite. We say that f* is the conjugate function of f.

3.2.1 Examples.

(i) If f(x) = ax + b then f*(y) = sup_x (yx − ax − b) = sup_x (x(y − a) − b) = ∞ if y ≠ a. Therefore dom(f*) = {a} and f*(a) = −b.

(ii) If f(x) = −log(x) then dom(f) = R_++ and

    f*(y) = sup_{x∈R_++} (yx + log x) = { ∞               if y ≥ 0,
                                          −1 − log(−y)    if y < 0. }

(iii) Let Q ∈ S^n_++ and let f(x) = ½xᵀQx. Then

    f*(y) = sup_{x∈R^n} (yᵀx − ½xᵀQx) = ½yᵀQ^(−1)y.

The maximum occurs at the x for which y − Qx = 0, by (Q).

(iv) Let f(x) = ‖x‖, where ‖·‖ is some norm. If ‖y‖_* > 1 then there is z ∈ R^n with ‖z‖ = 1 such that yᵀz > 1; in this case yᵀ(tz) − ‖tz‖ = (yᵀz − 1)t is unbounded as t → ∞. Suppose that ‖y‖_* ≤ 1. Then yᵀx − ‖x‖ ≤ ‖y‖_*‖x‖ − ‖x‖ ≤ 0, with a maximum of 0 attained at x = 0. Therefore

    f*(y) = sup_{x∈R^n} (yᵀx − ‖x‖) = { 0    if ‖y‖_* ≤ 1,
                                        ∞    if ‖y‖_* > 1. }

(v) If f(x) = ½‖x‖² then

    f*(y) = sup_{x∈R^n} (yᵀx − ½‖x‖²) ≤ sup_{x∈R^n} (‖y‖_*‖x‖ − ½‖x‖²) = ½‖y‖_*²,

where equality is attained at x = argmax{yᵀx | ‖x‖ = ‖y‖_*}.

3.3 Duality

The general problem is min_x f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m, and h_j(x) = 0, j = 1, …
, p, where the f_i are convex and the h_j are affine. Let X be the set of points satisfying the constraints, the feasible points, and D = ∩_i dom(f_i). We assume ∇²f_0 ⪰ mI and let p* be the optimal value. Define the Lagrangian to be

    L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^p ν_j h_j(x)

and g(λ, ν) := inf_{x∈D} L(x, λ, ν). We show that g(λ, ν) ≤ f_0(x̃) for λ ≥ 0 and any feasible point x̃. Indeed,

    g(λ, ν) = inf_{x∈D} L(x, λ, ν) ≤ L(x̃, λ, ν) = f_0(x̃) + Σ_{i=1}^m λ_i f_i(x̃) + Σ_{j=1}^p ν_j h_j(x̃) ≤ f_0(x̃).

It follows that g(λ, ν) ≤ p* for all λ ≥ 0 and ν arbitrary. The problem max_{λ≥0,ν} g(λ, ν) is the dual problem.

3.3.1 Examples.

(i) For the problem min_x xᵀx subject to Ax = b,

    L(x, ν) = xᵀx + νᵀ(Ax − b)
    g(ν) = inf_x (xᵀx + νᵀ(Ax − b)) = −νᵀb − ¼νᵀAAᵀν.

(ii) For the problem min_x cᵀx subject to Ax = b and x ≥ 0,

    L(x, λ, ν) = cᵀx − λᵀx + νᵀ(Ax − b)
    g(λ, ν) = inf_x (cᵀx − λᵀx + νᵀ(Ax − b)) = −νᵀb + inf_x ((c − λ + Aᵀν)ᵀx)
      = { −νᵀb    if c − λ + Aᵀν = 0,
          −∞      otherwise. }

(iii) For the problem min_x f_0(x) subject to Ax = b and Cx ≤ d,

    L(x, λ, ν) = f_0(x) + λᵀ(Cx − d) + νᵀ(Ax − b)
    g(λ, ν) = −νᵀb − λᵀd + inf_x (f_0(x) + (Cᵀλ + Aᵀν)ᵀx)
      = −νᵀb − λᵀd − sup_x ((−Aᵀν − Cᵀλ)ᵀx − f_0(x))
      = −νᵀb − λᵀd − f_0*(−Aᵀν − Cᵀλ).

(iv) To solve the problem min_x ‖x‖ subject to Ax = b, note that

    g(ν) = −νᵀb − f_0*(−Aᵀν) = { −νᵀb    if ‖Aᵀν‖_* ≤ 1,
                                 −∞      otherwise. }

We have the following relationships:

    g(λ, ν) ≤ f_0(x) for any x ∈ X
    g(λ, ν) ≤ inf_{x∈X} f_0(x) = p*
    sup_{λ≥0,ν} g(λ, ν) ≤ inf_{x∈X} f_0(x).

This is weak duality. The problem sup_{λ≥0,ν} g(λ, ν) is called the dual problem to the primal problem inf_{x∈X} f_0(x).

3.3.2 Examples.

(i) For the problem min_x cᵀx subject to Ax = b and x ≥ 0,

    g(λ, ν) = { −νᵀb    if c − λ + Aᵀν = 0,
                −∞      otherwise, }

so the dual problem is max_ν −bᵀν subject to Aᵀν + c ≥ 0.
(ii) For the problem min_x cᵀx subject to Cx ≤ d,

    g(λ) = { −λᵀd    if c + Cᵀλ = 0,
             −∞      otherwise, }

so the dual problem is max_λ −dᵀλ subject to Cᵀλ + c = 0 and λ ≥ 0.

Under some conditions,

    d* = sup_{λ≥0,ν} g(λ, ν) = inf_{x∈X} f_0(x) = p*.

This is referred to as strong duality. Suppose for now that there are no equality constraints. Then g(λ) = inf_x L(x, λ) and

    sup_{λ≥0} L(x, λ) = sup_{λ≥0} ( f_0(x) + Σ_{i=1}^m λ_i f_i(x) ) = f_0(x) for feasible x.

Therefore

    inf_x sup_{λ≥0} L(x, λ) ≤ inf_{x∈X} f_0(x) = p*.

On the other hand, d* = sup_{λ≥0} g(λ) = sup_{λ≥0} inf_x L(x, λ). Finally,

    d* = sup_{λ≥0} inf_x L(x, λ) ≤ inf_x sup_{λ≥0} L(x, λ) ≤ p*.

3.3.3 Proposition (Slater condition). If the following two conditions hold then d* = p*, i.e. strong duality holds.

(i) There is x̃ ∈ int(D) such that f_i(x̃) < 0 for all i = 1, …, m and h_j(x̃) = 0 for all j = 1, …, p; and
(ii) rank(A) = p, where Ax = b is the matrix representation of the affine equality constraints.

PROOF: Define the sets

    G := {(f_1(x), …, f_m(x), h_1(x), …, h_p(x), f_0(x)) | x ∈ D}
    A := {(u, v, t) ∈ R^(m+p+1) | there is x ∈ D with f_i(x) ≤ u_i, h_j(x) = v_j for all i, j, and f_0(x) ≤ t}
    B := {(0, 0, t) ∈ R^(m+p+1) | t < p*}

Then A is convex and B is convex, and p* = inf{t | (0, 0, t) ∈ A}. It follows that A and B are disjoint. If p* = −∞ then d* = p*, so we may assume that B is non-empty, and by assumption A is non-empty. By the separating hyperplane theorem there are (λ, ν, μ) ≠ 0 and α ∈ R such that (λ, ν, μ)ᵀ(u, v, t) ≥ α for all (u, v, t) ∈ A and (λ, ν, μ)ᵀ(u, v, t) ≤ α for all (u, v, t) ∈ B. From the latter, μt ≤ α for all (0, 0, t) ∈ B, so μp* ≤ α. From the former, λ ≥ 0 and μ ≥ 0, since the u_i and t can be taken to +∞ in A. If μ > 0 then we may take μ = 1 without loss of generality, and then for every x ∈ D,

    f_0(x) + Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x) ≥ α ≥ p*,

so d* ≥ g(λ, ν) ≥ α ≥ p*. Assume now that μ = 0. For any x ∈ D,

    Σ_i λ_i f_i(x) + νᵀ(Ax − b) ≥ α ≥ 0.

For the particular point x̃, Σ_i λ_i f_i(x̃) ≥ 0; since f_i(x̃) < 0 for every i and λ ≥ 0, this forces λ = 0.
Therefore $\nu^T(Ax - b) \ge 0$ for all $x \in D$, and $\nu \ne 0$ (since $(\lambda,\nu,\mu) \ne 0$). Since $\tilde{x}$ is in the interior of $D$, if $\nu^T A \ne 0$ then there is a point $x^- \in \operatorname{int}(D)$ such that $\nu^T(Ax^- - b) < 0$. This contradiction implies that $A^T\nu = 0$. But this too is a contradiction since the rank of $A$ is $p$. Therefore $\mu > 0$.

We know that $g(\lambda,\nu) \le p^*$ for any $\lambda \ge 0$ and $\nu$. Therefore
$$f_0(x) - p^* \le f_0(x) - g(\lambda,\nu). \tag{4}$$
This estimate is very important in applications.

3.4 Complementary slackness

Assume for now that strong duality holds. Let $x^*$ be primal optimal and $(\lambda^*,\nu^*)$ be dual optimal. Recall that
$$f_0(x^*) = g(\lambda^*,\nu^*) = \inf_{x \in D}\Big\{f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) + \sum_{j=1}^p \nu_j^* h_j(x)\Big\} \le f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) \le f_0(x^*).$$
It follows that equality holds at every step, and in particular $\lambda_i^* f_i(x^*) = 0$ for all $i = 1, \dots, m$. If $f_i(x^*) < 0$ then $\lambda_i^* = 0$, and we say that this constraint is an inactive constraint.

3.4.1 Proposition (Karush–Kuhn–Tucker conditions). Assume that $f_0$ and $f_i$, $i = 1, \dots, m$, are convex and $h_j$, $j = 1, \dots, p$, are affine, and all functions are differentiable. Then $x^*$ is primal optimal, $(\lambda^*,\nu^*)$ is dual optimal, and strong duality holds if and only if
$$f_i(x^*) \le 0, \quad i = 1, \dots, m$$
$$h_j(x^*) = 0, \quad j = 1, \dots, p$$
$$\lambda_i^* \ge 0, \quad i = 1, \dots, m$$
$$\lambda_i^* f_i(x^*) = 0, \quad i = 1, \dots, m$$
$$\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0.$$

PROOF: Note that the function
$$x \mapsto \mathcal{L}(x,\lambda^*,\nu^*) = f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) + \sum_{j=1}^p \nu_j^* h_j(x)$$
is convex. If strong duality holds then, by the complementary slackness argument above, $x^* = \operatorname{argmin}_x \mathcal{L}(x,\lambda^*,\nu^*)$, and therefore $\nabla_x \mathcal{L}(x^*,\lambda^*,\nu^*) = 0$, which is the last condition; the remaining conditions are primal and dual feasibility and complementary slackness. Conversely, if the conditions hold then $x^*$ minimizes the convex function $\mathcal{L}(\cdot,\lambda^*,\nu^*)$ (its gradient vanishes there), so $g(\lambda^*,\nu^*) = \mathcal{L}(x^*,\lambda^*,\nu^*) = f_0(x^*)$ by complementary slackness and $h_j(x^*) = 0$, whence strong duality holds and both points are optimal.

3.4.2 Example. Solve $\min_x \tfrac12 x^T P x + q^T x + r$ subject to $Ax = b$. Here $\mathcal{L}(x,\nu) = \tfrac12 x^T P x + q^T x + r + \nu^T(Ax - b)$. Therefore $0 = \nabla\mathcal{L}$ implies $Px + q + A^T\nu = 0$ and $Ax = b$, so the original problem is equivalent to the linear system
$$\begin{pmatrix} P & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ \nu \end{pmatrix} = \begin{pmatrix} -q \\ b \end{pmatrix}.$$

3.4.3 Example. Solve $\min_x f_0(x) = \sum_{i=1}^n f_i(x_i)$ subject to $a^T x = b$, i.e. the constraint that $x$ lies on a hyperplane.
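Example 3.4.2 can be carried out numerically: the equality-constrained QP reduces to a single linear solve of the KKT system. A minimal NumPy sketch with assumed illustrative data $P$, $q$, $A$, $b$ (not from the notes):

```python
import numpy as np

# Minimize (1/2) x^T P x + q^T x + r subject to Ax = b,
# via the KKT system  [P A^T; A 0][x; nu] = [-q; b].
P = np.array([[2.0, 0.0], [0.0, 4.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, p = P.shape[0], A.shape[0]
KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:n], sol[n:]

assert np.allclose(A @ x_star, b)                        # primal feasibility
assert np.allclose(P @ x_star + q + A.T @ nu_star, 0.0)  # stationarity
```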
The dual function is
$$g(\nu) = \inf_x \mathcal{L}(x,\nu) = \inf_x\Big(\sum_{i=1}^n f_i(x_i) + \nu(a^T x - b)\Big) = -\nu b + \sum_{i=1}^n \inf_{x_i}\big(f_i(x_i) + \nu a_i x_i\big) = -\nu b - \sum_{i=1}^n \sup_{x_i}\big(-\nu a_i x_i - f_i(x_i)\big) = -\nu b - \sum_{i=1}^n f_i^*(-\nu a_i).$$
The dual problem is to maximize $g(\nu)$ over $\nu \in \mathbb{R}$. This should be easy, so say the maximum occurs at $\nu^*$. Now to find the optimizer for the primal problem, we must solve the unconstrained problem $\min_x \sum_{i=1}^n (f_i(x_i) + \nu^* a_i x_i)$.

3.4.4 Example (Optimal transport). Given a list of locations with given amounts of material $(x_i, p_i)$, we would like to find a scheme to move it to locations with given capacities $(y_j, q_j)$, minimizing some distance quantity. Let the flows be $\mu_{i,j} \ge 0$, so $\sum_j \mu_{i,j} = p_i$ and $\sum_i \mu_{i,j} = q_j$. The distance quantity is $\sum_{i,j} \mu_{i,j} |x_i - y_j|^2$ (Wasserstein distance). This problem has the $\mu_{i,j}$ as the quantities to be found, and there are on the order of $N^2$ unknowns.

The dual problem is found as follows. Notice that
$$\sum_{i,j} \mu_{i,j} |x_i - y_j|^2 = \sum_i |x_i|^2 p_i + \sum_j |y_j|^2 q_j - 2\sum_{i,j} \mu_{i,j}\, x_i \cdot y_j,$$
so it suffices to minimize $-\sum_{i,j} \mu_{i,j}\, x_i \cdot y_j$. The Lagrangian is
$$\mathcal{L}(\mu,a,b,\lambda) = -\sum_{i,j} \mu_{i,j}\, x_i \cdot y_j + \sum_i a_i\Big(\sum_j \mu_{i,j} - p_i\Big) + \sum_j b_j\Big(\sum_i \mu_{i,j} - q_j\Big) - \sum_{i,j} \lambda_{i,j}\mu_{i,j}$$
$$= \sum_{i,j} \mu_{i,j}\big(-x_i \cdot y_j + a_i + b_j - \lambda_{i,j}\big) - \sum_i a_i p_i - \sum_j b_j q_j,$$
so
$$g(a,b,\lambda) = \inf_\mu \mathcal{L}(\mu,a,b,\lambda) = \begin{cases} -\sum_i a_i p_i - \sum_j b_j q_j & -x_i \cdot y_j + a_i + b_j - \lambda_{i,j} = 0 \text{ for all } i,j \\ -\infty & \text{otherwise.} \end{cases}$$
The supremum of $g(a,b,\lambda)$ over $\lambda \ge 0$ will occur when $-x_i \cdot y_j + a_i + b_j - \lambda_{i,j} = 0$ for all $i,j$. Then $0 \le \lambda_{i,j} = a_i + b_j - x_i \cdot y_j$. The dual problem becomes $-\min_{a,b}\big(\sum_i a_i p_i + \sum_j b_j q_j\big)$ subject to $a_i + b_j \ge x_i \cdot y_j$. But consider the constraints: $b_j \ge x_i \cdot y_j - a_i$ for all $i$, so we may take $b_j = \max_i (x_i \cdot y_j - a_i)$. In some sense $b = a^*$, and the dual problem reduces to the unconstrained problem $\min_a\big(\sum_i a_i p_i + \sum_j a_j^* q_j\big)$.
If strong duality holds (which it likely does), the minimum distance is $2d^* + \sum_i |x_i|^2 p_i + \sum_j |y_j|^2 q_j$. To actually implement this we need an algorithm for calculating $a^*$ given $a$, and we need some descent method. From the KKT conditions, $\mu_{i,j}\lambda_{i,j} = 0$, so if $\lambda_{i,j} > 0$ then $\mu_{i,j} = 0$, and it can be seen that many of the primal variables are actually zero.

3.5 Alternatives

Either there is a solution to $Ax = b$ with $x \succ 0$, or there is $\lambda$ such that $A^T\lambda \ge 0$, $A^T\lambda \ne 0$, and $b^T\lambda \le 0$.

Let $C = \{x \in \mathbb{R}^n \mid x \succ 0\}$ and $D = \{x \in \mathbb{R}^n \mid Ax = b\}$. Then $C \cap D = \emptyset$ if and only if there is no solution with $x \succ 0$. $C$ and $D$ are convex and non-empty, so there are $c \in \mathbb{R}^n$ and $d \in \mathbb{R}$ such that $c^T x \ge d$ on $C$ and $c^T x \le d$ on $D$. The fact that $c^T x \ge d$ on $C$ implies $c \ge 0$, since the coördinates of $x \in C$ may be taken to positive infinity. Taking $x \to 0$, it is seen that $d \le 0$. $D$ is an affine space, so $c^T x$ is constant on $D$ since it is bounded above there. Rename $d$, if necessary, to be equal to this constant; then $c^T x = d$ on $D$. Write $D = \{x = x_0 + Fy \mid y \in \mathbb{R}^r\}$, where $r = \dim\ker A$ and $F$ is a matrix whose columns are a basis of $\ker A$. Then $c^T(x_0 + Fy) = d$ for any $y \in \mathbb{R}^r$, which is only possible if $c^T F = 0$, so $c \perp \ker A$. But $\ker A \oplus \operatorname{range} A^T = \mathbb{R}^n$, so $c \in \operatorname{range} A^T$, i.e. $c = A^T\lambda$ for some $\lambda$. Finally, $A^T\lambda = c \ge 0$, $A^T\lambda = c \ne 0$, and $b^T\lambda = \lambda^T A x_0 = c^T x_0 = d \le 0$. This was problem 2.20 in the textbook.

3.5.1 Example. Consider the system $f_i(x) \le 0$ for $i = 1, \dots, m$ and $h_j(x) = 0$ for $j = 1, \dots, p$. We would like to find an alternative for this collection of constraints. The idea is to consider the problem $\min_x 0$ subject to $f_i(x) \le 0$ and $h_j(x) = 0$. If there is any feasible $x$ then the minimum is $p^* = 0$; otherwise $p^* = \infty$. The dual function is
$$g(\nu,\lambda) = \inf_x\Big(0 + \sum_{j=1}^p \nu_j h_j(x) + \sum_{i=1}^m \lambda_i f_i(x)\Big).$$
If $\alpha > 0$ then $g(\alpha\nu,\alpha\lambda) = \alpha g(\nu,\lambda)$. Whence, if there is $(\nu,\lambda)$ with $\lambda \ge 0$ such that $g(\nu,\lambda) > 0$, then rescaling shows $d^* = \infty$, so $p^* \ge d^* = \infty$ by weak duality. If there is no solution to $\lambda \ge 0$, $g(\nu,\lambda) > 0$, then $d^* = 0$.
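The alternative stated at the start of this section can be checked on a toy instance (assumed data, not from the notes):

```python
# Primal system: Ax = b with x > 0, for A = [1 1] and b = -1.
# It is infeasible, since x1 + x2 > 0 for any x > 0.
# Certificate: lambda = 1 gives A^T lambda = (1, 1) >= 0 (and nonzero)
# with b^T lambda = -1 <= 0, so the alternative holds.
A = (1.0, 1.0)
b = -1.0
lam = 1.0
At_lam = (A[0] * lam, A[1] * lam)

assert all(v >= 0 for v in At_lam) and any(v != 0 for v in At_lam)
assert b * lam <= 0
```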
Summarizing, if the dual system is feasible (i.e. there is $(\nu,\lambda)$ such that $\lambda \ge 0$ and $g(\nu,\lambda) > 0$) then the primal system is not feasible. This is a weak alternative because it could be the case that neither is feasible.

3.5.2 Example. Consider the system $f_i(x) < 0$ for $i = 1, \dots, m$ and $h_j(x) = 0$ for $j = 1, \dots, p$. We would like to show that $\lambda \ge 0$, $\lambda \ne 0$, $g(\nu,\lambda) \ge 0$ is an alternative. Consider the problem $\min_x 0$ subject to $f_i(x) \le 0$ and $h_j(x) = 0$. The dual function is
$$g(\nu,\lambda) = \inf_x\Big(\sum_{j=1}^p \nu_j h_j(x) + \sum_{i=1}^m \lambda_i f_i(x)\Big).$$
If the primal is feasible then $g(\nu,\lambda) < 0$ for all $\lambda \ge 0$, $\lambda \ne 0$, so the proposed alternative cannot hold. (Not quite right.)

If strong duality holds then $g(\lambda^*,\nu^*) = d^* = p^* = f_0(x^*)$. We assume for now that strong duality holds.

3.5.3 Example. For the system $f_i(x) < 0$, $i = 1, \dots, m$, and $Ax = b$, the alternative is $\lambda \ge 0$, $\lambda \ne 0$, and $g(\lambda,\nu) \ge 0$. Consider the problem $\min_{x,s} s$ subject to $f_i(x) - s \le 0$ and $Ax = b$. Then $p^* < 0$ if and only if there is $x$ such that $f_i(x) < 0$ for all $i$ and $Ax = b$. The dual function is
$$g(\lambda,\nu) = \inf_{x,s}\Big(s + \sum_{i=1}^m \lambda_i(f_i(x) - s) + \nu^T(Ax - b)\Big) = -\nu^T b + \inf_{x,s}\Big(s\Big(1 - \sum_{i=1}^m \lambda_i\Big) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T Ax\Big).$$
This is $-\infty$ unless $\mathbf{1}^T\lambda = 1$. The dual problem hence is
$$\max_{\lambda \ge 0,\ \mathbf{1}^T\lambda = 1,\ \nu}\ g(\lambda,\nu).$$
By strong duality there are $\lambda^*$ and $\nu^*$ optimizing the dual. Now, $p^* < 0$ if and only if there is $x$ such that $Ax = b$ and $f_i(x) < 0$ for all $i$. In this case $g(\lambda^*,\nu^*) = d^* = p^* < 0$, so the alternative does not hold; in particular $g(\lambda,\nu) < 0$ for all $\lambda \ge 0$ and $\nu$. Similarly, if there is no such $x$ then the alternative holds.

4 Algorithms

Now we move to chapters 10 and 11, concerning problems of the form $\min f_0(x)$ subject to $Ax = b$, and $\min f_0(x)$ subject to $Ax = b$ and $f_i(x) \le 0$, respectively. One way to solve the problem with only equality constraints is to use the dual problem.
$$g(\nu) := \inf_x\big(f_0(x) + \nu^T(Ax - b)\big) = -\nu^T b + \inf_x\big(f_0(x) + (A^T\nu)^T x\big) = -\nu^T b - \sup_x\big((-A^T\nu)^T x - f_0(x)\big) = -\nu^T b - f_0^*(-A^T\nu).$$
Solving the primal is equivalent (under strong duality) to solving the unconstrained problem $\sup_\nu\big(-\nu^T b - f_0^*(-A^T\nu)\big)$. Suppose $\nu^*$ is the solution. Then plugging this into the Lagrangian, we can solve the unconstrained problem $\inf_x\big(f_0(x) + \nu^{*T}(Ax - b)\big)$ for $x^*$, the solution to the original problem.

Another way is to use Newton's method. Suppose $x$ approximates the solution and we wish to improve it. Approximate $f_0(x+v)$ up to the quadratic term,
$$f_0(x+v) \approx f_0(x) + \nabla f_0(x)^T v + \tfrac12 v^T \nabla^2 f_0(x) v,$$
and use this approximation to find the direction in which to move to improve $x$. This is also known as sequential quadratic programming. We solve the following problem for the best direction $v$:
$$\min_v\ f_0(x) + \nabla f_0(x)^T v + \tfrac12 v^T \nabla^2 f_0(x) v \quad\text{subject to}\quad Av = b - Ax.$$
The optimality conditions are as follows. The Lagrangian is
$$\mathcal{L} = \tfrac12 v^T \nabla^2 f_0(x) v + \nabla f_0(x)^T v + \nu^T(Av - b + Ax),$$
so the KKT condition is that $\nabla^2 f_0(x) v + \nabla f_0(x) + A^T\nu = 0$, and of course the constraint $Av = b - Ax$ must be satisfied. We get the following system of equations for the best direction $v$ and the dual variable $\nu$:
$$\begin{pmatrix} \nabla^2 f_0(x) & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} v_{nt} \\ \nu_{nt} \end{pmatrix} = -\begin{pmatrix} \nabla f_0(x) \\ Ax - b \end{pmatrix}.$$
A second derivation of this system comes from the optimality conditions for the original problem. They are $\nabla f_0(x) + A^T\nu = 0$ and $Ax = b$. We wish to find $x + \Delta x$ so that the optimality conditions still hold (but at which some other quantity is improved). Linearizing,
$$\nabla f_0(x) + \nabla^2 f_0(x)\Delta x + A^T\nu + A^T\Delta\nu = 0, \qquad Ax + A\Delta x = b,$$
i.e.
$$\begin{pmatrix} \nabla^2 f_0(x) & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} \Delta x_{nt} \\ \Delta\nu_{nt} \end{pmatrix} = -\begin{pmatrix} \nabla f_0(x) + A^T\nu \\ Ax - b \end{pmatrix}.$$
In the unconstrained Newton's method we want $\nabla f_0(x+v) \approx 0$, so we pick $v$ so that $\nabla^2 f_0(x) v = -\nabla f_0(x)$. Then
$$f_0(x+tv) = f_0(x) + t\nabla f_0(x)^T v + O(t^2) = f_0(x) - t\, v^T \nabla^2 f_0(x) v + O(t^2),$$
so
$$\frac{d}{dt} f_0(x+tv)\Big|_{t=0} = -\lambda^2(x) < 0$$
since $\nabla^2 f_0(x)$ is positive definite (here $\lambda(x)$ is the Newton decrement).
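The Newton (SQP) step amounts to one KKT solve. A minimal NumPy sketch using the assumed smooth convex test function $f_0(x) = \sum_i e^{x_i}$ (the function and data are illustrative, not from the notes):

```python
import numpy as np

# One Newton step for  min f0(x) subject to Ax = b,  via the KKT system
#   [H  A^T; A  0][v; nu] = -[grad f0(x); Ax - b].
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
x = np.array([0.5, 0.0, -0.5])   # infeasible start: Ax = 0 != 1 = b

grad = np.exp(x)                 # gradient of f0
H = np.diag(np.exp(x))           # Hessian of f0

KKT = np.block([[H, A.T], [A, np.zeros((1, 1))]])
rhs = -np.concatenate([grad, A @ x - b])
step = np.linalg.solve(KKT, rhs)
v, nu = step[:3], step[3:]

# The second block row enforces Av = b - Ax, so a full step (t = 1)
# lands exactly on the constraint set.
assert np.allclose(A @ (x + v), b)
```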
With the backtracking method $f_0(x)$ decreases with every iteration. With equality constraints, however, $f_0(x)$ will not necessarily decrease if $Ax \ne b$. We need to find some other quantity that decreases with every iteration to use as the backtracking quantity. From the Newton step,
$$\frac{d}{dt} f_0(x + t\Delta x)\Big|_{t=0} = \nabla f_0(x)^T \Delta x_{nt} = -\Delta x_{nt}^T \nabla^2 f_0(x) \Delta x_{nt} - (\nu + \Delta\nu_{nt})^T (b - Ax),$$
using $\nabla^2 f_0(x)\Delta x_{nt} + A^T(\nu + \Delta\nu_{nt}) = -\nabla f_0(x)$ and $A\Delta x_{nt} = b - Ax$.

We show now that
$$r = r(x,\nu) = \begin{pmatrix} r_{\text{dual}} \\ r_{\text{primal}} \end{pmatrix} = \begin{pmatrix} \nabla f_0(x) + A^T\nu \\ Ax - b \end{pmatrix}$$
strictly decreases in norm, and that $r = 0$ at the optimum. Consider the new variable $y = (x,\nu)$ and do Newton's method on $r(y)$. The linear approximation is $r(y+z) \approx r(y) + Dr(y)z$, so we solve $r(y) + Dr(y)z = 0$ for $z$. Here $r$ is called the residual. This material is in §10.3.
$$\frac{d}{dt} \|r(y + t\Delta y_{nt})\|_2^2\Big|_{t=0} = \frac{d}{dt}\big(r(y) + tDr(y)\Delta y_{nt},\ r(y) + tDr(y)\Delta y_{nt}\big)\Big|_{t=0} = 2\big(Dr(y)\Delta y_{nt},\ r(y)\big) = -2\, r(y)^T r(y) = -2\|r(y)\|_2^2.$$
Similarly,
$$\frac{d}{dt} \|r(y + t\Delta y_{nt})\|_2\Big|_{t=0} = \frac{d}{dt}\sqrt{\|r(y + t\Delta y_{nt})\|_2^2}\Big|_{t=0} = \frac{-2\|r(y)\|_2^2}{2\sqrt{\|r(y)\|_2^2}} = -\|r\|_2.$$
Hence the tangent line to the graph of $t \mapsto \|r(y + t\Delta y_{nt})\|_2$ at $t = 0$ is $(1-t)\|r(y)\|_2$. We build our line search as follows: find $t$ such that $\|r(y + t\Delta y_{nt})\|_2 \le (1 - \alpha t)\|r\|_2$.

The infeasible start Newton's method is as follows. (See §10.3.)

input: (x0, v0), e > 0, 0 < a < 0.5, 0 < b < 1
x, v = x0, v0
while |r(x, v)| > e:
    dxnt, dvnt = compute_dir(x, v, etc.)
    t = 1.0
    while |r(x + t dxnt, v + t dvnt)| > (1 - a t) |r(x, v)|:
        t = b * t
    x, v = x + t dxnt, v + t dvnt

Some observations. If we accept the full Newton step, i.e. if $t = 1$, then the next iterate is feasible, i.e. $A(x + \Delta x_{nt}) = b$. Otherwise, if we step by $t < 1$ then we have $b - A(x + t\Delta x_{nt}) = (1-t)(b - Ax)$, so we "improve the feasibility" by a factor of $1 - t$.

Now we move on to the problem with inequality constraints. Consider the problem $\min f_0(x)$ subject to $Ax = b$ and $f_i(x) \le 0$, $i = 1, \dots, m$.
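The infeasible start Newton pseudocode above can be made runnable. A sketch on the assumed test problem $\min \sum_i e^{x_i}$ subject to $Ax = b$ (the objective and data are illustrative, not from the notes):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

def residual(x, nu):
    # r(x, nu) = (r_dual, r_primal) = (grad f0(x) + A^T nu, Ax - b)
    return np.concatenate([np.exp(x) + A.T @ nu, A @ x - b])

x, nu = np.zeros(3), np.zeros(1)
eps, alpha, beta = 1e-10, 0.25, 0.5
for _ in range(50):
    r = residual(x, nu)
    if np.linalg.norm(r) <= eps:
        break
    H = np.diag(np.exp(x))                      # Hessian of f0
    KKT = np.block([[H, A.T], [A, np.zeros((1, 1))]])
    step = np.linalg.solve(KKT, -r)             # Dr(y) dy = -r(y)
    dx, dnu = step[:3], step[3:]
    t = 1.0                                     # backtrack on ||r||_2
    while np.linalg.norm(residual(x + t * dx, nu + t * dnu)) > (1 - alpha * t) * np.linalg.norm(r):
        t *= beta
    x, nu = x + t * dx, nu + t * dnu

assert np.linalg.norm(residual(x, nu)) <= eps   # converged
assert np.allclose(A @ x, b)                    # and feasible
```

Once a full step is accepted the iterates stay feasible, consistent with the observation above.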
Introduce the functional $\varphi(x) = -\sum_{i=1}^m \log(-f_i(x))$ and solve
$$\min_x\ t f_0(x) + \varphi(x) \quad\text{subject to}\quad Ax = b.$$
We hope that the solutions get close to the solution of the original problem for large $t$. Compute
$$\nabla\varphi(x) = -\sum_{i=1}^m \frac{1}{f_i(x)}\nabla f_i(x), \qquad \nabla^2\varphi(x) = \sum_{i=1}^m\Big(\frac{1}{f_i(x)^2}\nabla f_i(x)\nabla f_i(x)^T - \frac{1}{f_i(x)}\nabla^2 f_i(x)\Big).$$
Combining with the dual, the optimality conditions are
$$t\nabla f_0(x^*) - \sum_{i=1}^m \frac{1}{f_i(x^*)}\nabla f_i(x^*) + A^T\nu^* = 0, \qquad Ax^* - b = 0,$$
where the first line is $t\nabla f_0(x) + \nabla\varphi(x) + A^T\nu = 0$. These are the equations for the barrier method. We know that $-f_i(x^*) > 0$, so write $\lambda_i^* := -\frac{1}{t f_i(x^*)} > 0$. Rescaling $\nu^*$, we can write the above equations as
$$\nabla f_0(x^*) + \sum_i \lambda_i^* \nabla f_i(x^*) + A^T\nu^* = 0, \qquad Ax^* - b = 0, \qquad -\lambda_i^* f_i(x^*) = \frac{1}{t}.$$
These are the equations for the primal-dual method. The Lagrangian for the problem is
$$\mathcal{L}(x,\nu,\lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T(Ax - b), \qquad \nabla_x\mathcal{L}(x,\nu,\lambda) = \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T\nu.$$
Note that $\nabla_x\mathcal{L}(x^*,\nu^*,\lambda^*) = 0$. With the dual function $g$ we have
$$g(\lambda^*,\nu^*) = f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \nu^{*T}(Ax^* - b),$$
so
$$f_0(x^*) - p^* \le f_0(x^*) - g(\lambda^*,\nu^*) = -\sum_{i=1}^m \lambda_i^* f_i(x^*) = \sum_{i=1}^m \frac{1}{t} = \frac{m}{t} \to 0 \text{ as } t \to \infty.$$
For the barrier method we have the variables $(x,\nu)$. Use Newton's method with line search on the residuals to find $\Delta x$ and $\Delta\nu$:
$$\begin{pmatrix} t\nabla^2 f_0 + \nabla^2\varphi & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta\nu \end{pmatrix} = -\begin{pmatrix} t\nabla f_0 + \nabla\varphi + A^T\nu \\ Ax - b \end{pmatrix}.$$
How do we start the barrier method? We need to start with a feasible point because otherwise $\varphi(x)$ is not defined. The first phase is to solve the auxiliary problem
$$\min_{x,s}\ s \quad\text{subject to}\quad f_i(x) \le s \ \text{and}\ Ax = b.$$
This problem is always feasible: choose $x^{(0)}$ such that $Ax^{(0)} = b$ and set $s^{(0)} = \max_i f_i(x^{(0)})$. Use any method to get a sequence of approximate solutions to this problem. If we find $s^{(n)} < 0$ for some $n$ then we know $p^* < 0$ for the auxiliary problem, so the original problem is feasible and we have obtained a feasible point $x^{(n)}$.
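The bound $f_0(x^*(t)) - p^* \le m/t$ can be seen exactly on a one-dimensional toy problem (an assumed example, not from the notes): minimize $x$ subject to $-x \le 0$, whose barrier problem $\min_x\, tx - \log x$ has the closed-form minimizer $x(t) = 1/t$.

```python
# Central-path illustration: p* = 0 and m = 1 constraint, so the gap
# bound f0(x(t)) - p* <= m/t holds with equality along the path.
gaps = []
for t in [1.0, 10.0, 100.0, 1000.0]:
    x_t = 1.0 / t              # minimizer of t*x - log(x), from t - 1/x = 0
    gaps.append(x_t - 0.0)     # f0(x(t)) - p*

assert gaps == [1.0, 0.1, 0.01, 0.001]
```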
However, if we find that $p^*$ is probably positive, then the original problem is probably not feasible. In this case we hope to find $(\lambda,\nu)$ for the dual of the auxiliary problem with $g(\lambda,\nu) > 0$, proving infeasibility of the original problem. Another way to get started is to instead consider the problem
$$\min_{x,s}\ \mathbf{1}^T s \quad\text{subject to}\quad f_i(x) \le s_i,\ i = 1, \dots, m,\ \text{and}\ Ax = b.$$
For the primal-dual method we apply Newton's method with line search on the residuals to the modified KKT equations
$$r_{\text{dual}} = \nabla f_0(x) + \sum_i \lambda_i \nabla f_i(x) + A^T\nu = 0$$
$$r_{\text{cent}} = -\lambda_i f_i(x) - \tfrac1t = 0, \quad i = 1, \dots, m$$
$$r_{\text{prim}} = Ax - b = 0$$
to find $\Delta x$, $\Delta\lambda$, and $\Delta\nu$:
$$\begin{pmatrix} \nabla^2 f_0 + \sum_i \lambda_i \nabla^2 f_i & [\nabla f_1 | \cdots | \nabla f_m] & A^T \\ -[\lambda_1 \nabla f_1 | \cdots | \lambda_m \nabla f_m]^T & -\operatorname{diag}(f_1, \dots, f_m) & 0 \\ A & 0 & 0 \end{pmatrix} \begin{pmatrix} \Delta x_{nt} \\ \Delta\lambda_{nt} \\ \Delta\nu_{nt} \end{pmatrix} = -\begin{pmatrix} r_{\text{dual}} \\ r_{\text{cent}} \\ r_{\text{prim}} \end{pmatrix},$$
i.e. $Dr(y)\,\Delta y_{nt} = -r(y)$.

Recall the inequality $f_0(x) - p^* \le \frac{m}{t}$. Define the surrogate gap $\hat\eta = -\sum_i \lambda_i f_i(x) \approx \frac{m}{t}$. Since $\lambda_i > 0$ and $-\lambda_i f_i(x) = \frac1t > 0$, we must have $f_i(x) < 0$. Write $y^+ = y + s\Delta y_{nt}$. The line search consists of the following steps:
(i) Choose $s_1 \le 1$ so that $\lambda^+ \ge 0$.
(ii) Starting with $s = s_1$, set $s \leftarrow \beta s$ until $f_i(x^+) < 0$ for all $i$. Call the result $s_2$.
(iii) Starting with $s = s_2$, set $s \leftarrow \beta s$ until $\|r(y^+)\|_2 \le (1 - \alpha s)\|r(y)\|_2$.

What follows are fragments of the proof that this all works. Let
$$t_{\max} = \inf\{t > 0 : \|r(y + t\Delta y_{nt})\|_2 > \|r(y)\|_2\}.$$
This is the first $t$ for which moving along the Newton direction by this much will not improve the value of $\|r(y)\|_2$. Let
$$S := \{t : \|r(y + t\Delta y_{nt})\|_2 \le \|r(y)\|_2\} \supseteq [0, t_{\max}],$$
and assume that $S$ is closed. Note that we have $Dr(y)\Delta y = -r(y)$ because it is the definition of the Newton increment.
$$r(y + t\Delta y) = r(y) + \int_0^1 Dr(y + t\tau\Delta y)\, t\Delta y\, d\tau = r(y) + tDr(y)\Delta y + \int_0^1 \big[Dr(y + t\tau\Delta y) - Dr(y)\big]\, t\Delta y\, d\tau = (1-t)r(y) + e,$$
so
$$\|r(y + t\Delta y)\|_2 \le (1-t)\|r(y)\|_2 + \|e\|_2.$$
Now, if $Dr$ is Lipschitz, say $\|Dr(y) - Dr(y')\|_2 \le L\|y - y'\|_2$, then
$$\|e\|_2 = \Big\|\int_0^1 \big[Dr(y + t\tau\Delta y) - Dr(y)\big]\, t\Delta y\, d\tau\Big\|_2 \le t\int_0^1 \|Dr(y + t\tau\Delta y) - Dr(y)\|_2\, \|\Delta y\|_2\, d\tau \le Lt^2 \|\Delta y\|_2^2 \int_0^1 \tau\, d\tau = \tfrac12 Lt^2 \|\Delta y\|_2^2 = \tfrac12 Lt^2 \|Dr(y)^{-1} r(y)\|_2^2.$$
Suppose as well that $\|Dr(y)^{-1}\|_2 \le K$ for all $y$. Then for all $t \in [0, 1 \wedge t_{\max}]$,
$$\|r(y + t\Delta y)\|_2 \le (1-t)\|r(y)\|_2 + \tfrac12 t^2 LK^2 \|r(y)\|_2^2. \tag{5}$$
This parabola in $t$ is minimized at $\bar t = 1/(LK^2\|r(y)\|_2)$. It is clear that $1/(LK^2)$ is a critical value of $\|r(y)\|_2$. There are hence two cases: (i) $\|r\|_2 > 1/(K^2 L)$ and (ii) $\|r\|_2 \le 1/(K^2 L)$.

In the first case $\bar t < 1 \wedge t_{\max}$, so $\bar t$ is a valid step size and we get
$$\|r(y + \bar t\Delta y)\|_2 \le (1 - \bar t)\|r(y)\|_2 + \tfrac12 \bar t^2 LK^2 \|r(y)\|_2^2 = \|r(y)\|_2 - \frac{1}{2K^2 L} \le \|r(y)\|_2 - \frac{\alpha}{K^2 L} = (1 - \alpha\bar t)\|r(y)\|_2$$
since $\alpha < \tfrac12$ and $\bar t\,\|r(y)\|_2 = 1/(K^2 L)$. Therefore the line search terminates with some $t \ge \beta\bar t$, and
$$\|r(y + t\Delta y)\|_2 \le (1 - \alpha t)\|r(y)\|_2 \le \|r(y)\|_2 - \alpha\beta\bar t\,\|r(y)\|_2 = \|r(y)\|_2 - \frac{\alpha\beta}{K^2 L},$$
so the reduction after the line search is at least $\frac{\alpha\beta}{K^2 L}$. Therefore, after finitely many steps, we will end up in case (ii).

In the second case $\|r(y)\|_2 \le 1/(K^2 L)$, so
$$\|r(y + t\Delta y)\|_2 \le (1-t)\|r(y)\|_2 + \tfrac12 t^2 LK^2 \|r(y)\|_2^2 \le (1 - t + \tfrac12 t^2)\|r(y)\|_2.$$
From this it follows that $t_{\max} > 1$, and for $t = 1$ we have
$$\|r(y + \Delta y)\|_2 \le \tfrac12 \|r(y)\|_2 \le (1 - \alpha)\|r(y)\|_2,$$
so the line search exits with $t = 1$. Finally, plugging $t = 1$ into (5),
$$\|r(y + \Delta y)\|_2 \le \tfrac{LK^2}{2}\|r(y)\|_2^2, \qquad\text{i.e.}\qquad \tfrac{LK^2}{2}\|r(y + \Delta y)\|_2 \le \Big(\tfrac{LK^2}{2}\|r(y)\|_2\Big)^2,$$
so we have quadratic convergence in case (ii).
Index

affine set, 2
alternative, 23
backtracking line search, 6
barrier method, 27
complementary slackness, 16
cone, 2
conjugate function, 17
constrained optimization problem, 4
convex function, 2
convex set, 2
descent method, 6
dual norm, 9
dual problem, 18, 19
equality constraint, 16
equality constraints, 4
exact line search, 6
feasible, 18
first order conditions, 4
gradient descent method, 6
half space, 2
hyperplane, 2
inactive constraint, 21
inequality constraints, 4
infeasible start Newton's method, 26
Lagrangian, 18
line, 2
linear convergence, 6
Lipschitz constant, 11
Newton's decrement, 11
Newton's method, 10
non-negativity constraint, 16
norm cone, 2
normal equations, 5
primal problem, 19
primal-dual method, 27
quadratic convergence, 10
residual, 26
roughness penalty, 8
sequential quadratic programming, 25
steepest descent method, 9
strong duality, 19
sub-gradient, 13
unconstrained problem, 4
Wasserstein distance, 22
weak alternative, 23
weak duality, 19