Math Appendix 3: Convex Optimization: From Convex Sets to KKT

Mathematical Modelling for Computer Networks- Part I
Spring 2013 (Period 4)
Math Appendix 3: Convex Optimization: From Convex Sets to KKT rules
Lecturers: Laila Daniel and Krishnan Narayanan
Date: 15th March 2013
If you optimize everything, you will always be unhappy.
- Donald Knuth
Abstract
This lesson provides a concise account of the bare minimum needed to state
the KKT rules for convex optimization problems. On the road to the KKT rules,
we describe several useful concepts such as convex sets, convex functions,
polyhedral convex sets, hyperplanes, halfspaces, local vs global minima and
convexity, the Lagrangian function, Lagrange multipliers, saddle points and
optimality, the KKT rules and KKT points.
The lifeblood of modern optimization (also known as mathematical programming) is the notion of convexity, as brilliantly portrayed in the quotation below by
Rockafellar, one of the founding fathers of the theory of Convex Optimization.
In fact, the great watershed in optimization isn't between linearity and
nonlinearity, but convexity and nonconvexity.
- R. Tyrrell Rockafellar
3.1  Convex set, Convex function
Definition 3.1 Convex set:
A subset X of Rn is convex, if for any pair of points x1 and x2 ∈ X , the line
segment joining x1 and x2 lies entirely within the set X .
Notation: A subset X of Rn is convex if x1 , x2 ∈ X implies λx1 + (1 − λ)x2 ∈ X
for all λ ∈ [0, 1]. So a set of real vectors is convex if the entire line segment
joining any pair of elements in the set lies entirely within the set.
Definition 3.2 Convex function:
Informally, a function is convex if the graph of the function resembles a cup, i.e., ∪.
We can state this idea precisely as follows.
A function f : R → R is convex if the graph of the function between any
two points a and b, where a < b, lies entirely below the chord joining the points
(a, f (a)) and (b, f (b)). This geometric idea is written formally as
f (λa + (1 − λ)b) ≤ λf (a) + (1 − λ)f (b),   for all λ ∈ [0, 1].
We can extend this idea to define a convex function on an arbitrary convex set that
lies in Rn . Let X be a convex set in Rn . A function f : X → R is convex if for
any x1 and x2 ∈ X , it holds that f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ), for
every λ ∈ (0, 1). The function is strictly convex if the inequality is strict for any
distinct x1 and x2 and any λ ∈ (0, 1).
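To make this definition concrete, here is a minimal numerical sketch in Python (the sample functions, the random sampling and the tolerance are illustrative assumptions, not part of the notes) that tests the inequality f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) on random points. Such a test can only refute convexity, never prove it.

```python
import numpy as np

def looks_convex(f, dim=2, trials=2000, seed=0, tol=1e-9):
    """Sample random pairs x1, x2 and lambda in [0, 1] and test the
    convexity inequality of Definition 3.2; this can refute convexity
    on the sampled points but can never prove it."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x1, x2 = rng.normal(size=dim), rng.normal(size=dim)
        lam = rng.uniform()
        if f(lam * x1 + (1 - lam) * x2) > lam * f(x1) + (1 - lam) * f(x2) + tol:
            return False
    return True

print(looks_convex(lambda x: np.dot(x, x)))    # True: ||x||^2 is convex
print(looks_convex(lambda x: np.sin(x[0])))    # typically False: sin is not convex
```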
A function f is concave if its negation −f is convex. A function f is strictly
concave if its negation −f is strictly convex. So a concave function resembles a
cap, i.e., ∩.
Note: The notions of a convex function and a concave function are dual concepts, and
we pass from one to the other by taking the 'negative'; the key thing about the duality
is that by taking the dual of the dual we end up with the original (the function we
started with). Things which are dual to each other are roughly like 'opposites' of
each other. A set and its complement are duals of each other. A nice consequence
of duality is that for every result we prove, we get another for 'free' by invoking
duality. For example, from (A ∪ B)c = (Ac ∩ B c ), by invoking duality we get
(A ∩ B)c = (Ac ∪ B c ). So the well-known pair of De Morgan laws in set theory
can be regarded as duals of each other. In the linear algebra lesson, we saw that
ker(A) = ran(AT )⊥ , so ker(A) and ran(AT ) are duals (orthogonal complements) of each other.
Duality is central to mathematical programming. When we study duality in linear programming and convex programming, we shall see that linear and convex
programs come in pairs called primal and dual, which are duals of each other.
The only functions that are both convex and concave are affine functions. An
affine function is a 'shifted' linear function, that is, a linear function plus a fixed
displacement. So in particular a linear function and a constant function are convex
as well as concave.
Hyperplane: An equation of the form aᵀx = b, where a is a nonzero vector in Rn ,
b is an arbitrary scalar (b ∈ R) and x ∈ Rn , defines a hyperplane in Rn . It is the
set of all the points in Rn (also called vectors) which satisfy the above equation.
Here the vector a appearing in the equation of the hyperplane is called the normal
defining the hyperplane.
Let H(a, b) denote the hyperplane defined by the nonzero vector a ∈ Rn and b ∈ R.
The above hyperplane defines two closed halfspaces H + (a, b) and H − (a, b) which
are defined as follows.
H + (a, b) = {x ∈ Rn | aᵀx ≥ b}
H − (a, b) = {x ∈ Rn | aᵀx ≤ b}
Geometrically, H + (a, b) contains the vectors which lie on the side of the hyperplane
pointed to by the vector a, and H − (a, b) contains the vectors which lie on the opposite
side, i.e., the side pointed to by −a. Note that the hyperplane H(a, b) divides
the space Rn into two halfspaces, H + (a, b) and H − (a, b).
Note: H + (a, b) and H − (a, b) are convex sets, and the hyperplane H(a, b) is a convex set. Moreover,
Rn = H + (a, b) ∪ H − (a, b) and H(a, b) = H + (a, b) ∩ H − (a, b).
That H(a, b) is a convex set can be seen in two ways: by directly applying the definition
of convexity to the set, or by noting that the intersection of (arbitrarily many)
convex sets is a convex set.
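As a small illustration of these definitions, the following Python sketch (the particular normal vector a, scalar b and test points are made-up assumptions) checks membership in H(a, b), H + (a, b) and H − (a, b), and spot-checks that the halfspace H − (a, b) is convex.

```python
import numpy as np

a, b = np.array([1.0, 2.0]), 3.0        # normal vector a and scalar b (made-up example)

def in_H(x):      return bool(np.isclose(a @ x, b))   # hyperplane H(a, b)
def in_Hplus(x):  return bool(a @ x >= b)             # closed halfspace H+(a, b)
def in_Hminus(x): return bool(a @ x <= b)             # closed halfspace H-(a, b)

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # two points of H-(a, b)
print(in_Hminus(x1), in_Hminus(x2))                    # True True
for lam in np.linspace(0.0, 1.0, 11):
    assert in_Hminus(lam * x1 + (1 - lam) * x2)        # H-(a, b) is convex
x0 = np.array([3.0, 0.0])                              # a point of the hyperplane itself
print(in_H(x0), in_Hplus(x0) and in_Hminus(x0))        # True True: H = H+ ∩ H-
```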
3.1.1  How to recognize Convex functions
1. The notion of a convex function is intimately related to the notion of a convex
set. For a convex function f , the set that lies 'above' the graph of the
function, called the epigraph of the function, is a convex set. This epigraph
viewpoint characterizes a convex function. We state this formally as follows:
The epigraph of the function f is defined by
epi(f ) = {(x, y) | x ∈ domain(f ) and y ≥ f (x)}.
The function f is convex if and only if (iff) epi(f ) is a convex set.
2. The graph of a convex function f : R → R between any two points a and
b in domain(f ), such that a < b, lies below the chord joining the points
(a, f (a)) and (b, f (b)).
3. From the graph of a convex function f : R → R , for any a < b < c, the slope
of the chord joining (a, f (a)) and (b, f (b)) is not greater than the slope of the
chord joining (b, f (b)) and (c, f (c)). We can state this as follows:
(f (b) − f (a))/(b − a) ≤ (f (c) − f (a))/(c − a) ≤ (f (c) − f (b))/(c − b)
4. When a function is once or twice differentiable, convexity can be characterized entirely in terms of its derivatives.
5. Given a differentiable function f defined on an open convex set, then the
function is convex (strictly convex) if and only if its derivative Df (also
denoted by f ′ ) is a monotone (strictly monotone) increasing function on its
domain.
6. Analogously, for a function which has a second derivative throughout its
domain of definition given by an open convex set, f is convex iff its second
derivative, f ′′ ≥ 0 on its domain. If the second derivative satisfies f ′′ (x) > 0 everywhere, then
f is strictly convex.
Characterizing a convex function by the behaviour of the first or second order
derivatives can be easily generalised to differentiable convex functions defined
in Rn .
7. For a twice differentiable function f : Rn → R, the above result has an analog
which is expressed in terms of the Hessian matrix. f is convex iff the Hessian
matrix ∇2 f is positive semidefinite, denoted by ∇2 f ⪰ 0, and if the Hessian
matrix is positive definite (∇2 f ≻ 0), then f is a strictly convex function.
Note: Here the Hessian matrix ∇2 f is an n × n matrix of second-order partial
derivatives: ∇2 f = (fi,j ) where fi,j = ∂²f /∂xi ∂xj .
8. The tangent drawn to a convex function at any point on its graph lies below
the graph of the function. This elementary result leads to a very useful
inequality that can be stated as follows. Assuming that a convex function f
defined on an open set U ⊂ Rn is differentiable at a point x0 , we obtain
f (x) ≥ f (x0 ) + ∇f (x0 )ᵀ(x − x0 ).
Here ∇f (x0 ) denotes the gradient of the function f at the point x0 .
Note that this simple inequality leads to the dramatic conclusion that for a
convex function, a local minimum is a global minimum (by taking a point
x0 where ∇f (x0 ) = 0, so that f (x) ≥ f (x0 ) for every x).
9. We can specialize the above result to a function defined on an open interval
in R:
f (x) ≥ f (x0 ) + f ′ (x0 )(x − x0 ).
10. And finally, the most famous inequality of convex analysis, Jensen's inequality:
For a convex function f defined on an open (possibly infinite) interval (a, b),
let xi ∈ (a, b) for i = 1, . . . , n. If αi ≥ 0 for all i and Σ_{i=1}^n αi = 1, then
f ( Σ_{i=1}^n αi xi ) ≤ Σ_{i=1}^n αi f (xi ).
Note: Dualizing the above, we get corresponding results for a concave function. For
example, for a differentiable concave function, the derivative decreases monotonically,
and so on.
The exponential function ex is perhaps the most important convex function. The
log function log x (for x > 0) is an important example of a concave function. The
functions x², x⁴ and, more generally, xⁿ for even n are convex on R (and xⁿ for
n ≥ 1 is convex for x > 0), whereas the function −1/x is concave for x > 0. Often
the characterization of convex functions using their derivatives can be used to verify
that these example functions are either convex or concave.
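For instance, a quick numerical check of the second-derivative test (a Python sketch; the grid and step size are arbitrary choices, not from the notes) confirms that ex and x² have nonnegative second derivatives while log x has a nonpositive one on x > 0.

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

grid = np.linspace(0.1, 5.0, 50)      # stay in x > 0 so that log x is defined
print(all(second_derivative(np.exp, x) >= 0 for x in grid))          # True: e^x is convex
print(all(second_derivative(lambda t: t**2, x) >= 0 for x in grid))  # True: x^2 is convex
print(all(second_derivative(np.log, x) <= 0 for x in grid))          # True: log x is concave
```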
3.1.2  Some important examples of convex/concave functions on R
Convex
• affine: ax + b on R, for any a, b ∈ R
• exponential: eax , for any a ∈ R
• powers: xα on R++ , for α ≥ 1 or α ≤ 0
• powers of absolute value: |x|p on R, for p ≥ 1
• negative entropy: x log x on R++
Concave
• affine: ax + b on R, for any a, b ∈ R
• powers: xα on R++ , for 0 ≤ α ≤ 1
• logarithm: log x on R++
• entropy: −x log x on R++
3.1.3  Examples of convex/concave functions on Rn and Rm×n
• affine functions are convex and concave
• all norms are convex
• Examples on Rn
– affine function f (x) = aT x + b
– norms: ‖x‖p = ( Σ_{i=1}^n |xi|^p )^{1/p} for p ≥ 1; ‖x‖∞ = maxk |xk |
• Examples on Rm×n ( m × n matrices)
– affine function
f (X) = tr(AᵀX) + b = Σ_{i=1}^m Σ_{j=1}^n Ai,j Xi,j + b
– spectral norm (maximum singular value):
f (X) = ‖X‖2 = σmax (X) = (λmax (XᵀX))^{1/2}  (a numerical check appears in the sketch below)
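The following numpy sketch (the random matrix sizes and the value of λ are arbitrary assumptions) checks that the spectral norm computed by np.linalg.norm agrees with σmax (X) and with (λmax (XᵀX))^{1/2}, and spot-checks the convexity of the norm via ‖λX + (1 − λ)Y‖ ≤ λ‖X‖ + (1 − λ)‖Y‖.

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))

spec  = np.linalg.norm(X, 2)                          # spectral norm of X
sigma = np.linalg.svd(X, compute_uv=False)[0]         # largest singular value sigma_max(X)
eig   = np.sqrt(np.max(np.linalg.eigvalsh(X.T @ X)))  # (lambda_max(X^T X))^(1/2)
print(np.allclose([spec, sigma], eig))                # True: all three agree

lam = 0.3                                             # convexity spot-check for the norm
lhs = np.linalg.norm(lam * X + (1 - lam) * Y, 2)
rhs = lam * np.linalg.norm(X, 2) + (1 - lam) * np.linalg.norm(Y, 2)
print(bool(lhs <= rhs + 1e-12))                       # True
```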
Constrained optimization problems
In constrained optimization problems, we optimize (maximize or minimize) an arbitrary real-valued function over an arbitrary set in Rn which may be implicitly
defined using functional constraints. The function to be optimized may have multiple local optima, and algorithms designed to find the global optimum may
find it difficult to get past a local optimum they encounter during the search. Solutions to nonlinear optimization problems can be difficult to find in general. In
constrained optimization problems, the set over which a real-valued function has to
be maximized or minimized is often specified by means of inequality constraints
(such as gi (x) ≤ 0) based on functions (such as gi : Rn → R).
Convexity is a special structure that gives convex optimization problems a well-developed theory, efficient algorithms and a variety of applications spanning numerous fields such as communication system design, optimal control, machine learning,
statistics, etc.
In a convex optimization problem we minimize a convex function (or equivalently
maximize a concave function) defined over a convex set. This convex set is often
implicitly defined by means of inequalities involving convex functions (such as the
functions gi described earlier being convex).
Example: Let X be a convex set in Rn . Suppose the functions fi : X → R are
convex over X and bi are real numbers for i = 1, . . . , n. Then the set {fi ≤ bi , i ∈
[n]} = {x ∈ X | fi (x) ≤ bi , i ∈ [n]} is a convex set. Here [n] denotes the set
{1, . . . , n}.
Example: This example is a special case of the last example. If A is an m × n
matrix with real entries, (we write this as A ∈ Mmn (R) ) and b ∈ Rm , then
{x ∈ Rn | Ax ≤ b} is a convex set. This set is called a polyhedral convex set and is
denoted by P (A, b).
Note: Geometrically, each inequality defined by a row of the matrix A yields a half
space in Rn and the set given by P (A, b) is the intersection of these half spaces.
P (A, b) is convex as it is a finite intersection of half spaces which are convex sets.
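As a small illustration (the matrix A, vector b and test points below are made-up data, not from the notes), the following numpy sketch tests membership in P (A, b) and spot-checks that convex combinations of points of P (A, b) remain in P (A, b).

```python
import numpy as np

# P(A, b) = {x in R^2 : A x <= b}; A and b are made-up example data
A = np.array([[ 1.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
b = np.array([2.0, 0.0, 0.0])

def in_P(x, tol=1e-12):
    return bool(np.all(A @ x <= b + tol))

x1, x2 = np.array([0.5, 0.5]), np.array([2.0, 0.0])
print(in_P(x1), in_P(x2))                         # True True: both points are feasible
for lam in np.linspace(0.0, 1.0, 11):
    assert in_P(lam * x1 + (1 - lam) * x2)        # convex combinations stay in P(A, b)
print(in_P(np.array([2.0, 1.0])))                 # False: violates x1 + x2 <= 2
```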
3.2  Local and Global Optima
Definition 3.3 Feasible solution vs Optimal solution: In the problem of minimizing a function f : X → R, any x belonging to the domain X is a feasible solution, and an element x∗ in X is a global optimal solution (or just a solution) if
f (x∗ ) ≤ f (x) ∀x ∈ X .
Convexity implies that a local minimum is a global minimum: if a local minimum x0 of a convex f were not global, there would be a feasible x with f (x) < f (x0 ); by convexity, f (λx + (1 − λ)x0 ) ≤ λf (x) + (1 − λ)f (x0 ) < f (x0 ) for every λ ∈ (0, 1], so points arbitrarily close to x0 have smaller objective value, contradicting local minimality. If f is strictly convex over
X, then the local minimum is the unique global minimum.
3.3  The Karush-Kuhn-Tucker (KKT) Conditions
In dealing with a convex optimization problem, we minimize a convex objective
function (equivalently maximize a concave objective function) subject to constraints which are given by convex functions. The constraints together define a
convex set, and so a convex optimization problem is one of minimizing a convex
function over a convex set. Any point that lies within this convex set is called a
feasible point or a feasible solution of the convex optimization problem. An optimal
solution is a feasible solution which attains the minimum of the objective over the set
of all feasible solutions.
Primal Problem
min f (x)
subject to gi (x) ≤ 0,   1 ≤ i ≤ m,
x ∈ Rn
Assumption: f : Rn → R and gi : Rn → R are all convex and differentiable
functions over Rn .
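As an aside, problems of this form can be handed to an off-the-shelf numerical solver. The sketch below is a minimal illustration with made-up data (it is not part of the original notes); note that SciPy's 'ineq' constraints expect fun(x) ≥ 0, so we pass −gi for each constraint gi (x) ≤ 0.

```python
import numpy as np
from scipy.optimize import minimize

# A small instance of the primal form (made-up data):
#   minimize (x1 - 3)^2 + (x2 - 2)^2
#   subject to x1 + x2 - 2 <= 0, -x1 <= 0, -x2 <= 0
f = lambda x: (x[0] - 3)**2 + (x[1] - 2)**2
g = [lambda x: x[0] + x[1] - 2,
     lambda x: -x[0],
     lambda x: -x[1]]

# SciPy's 'ineq' constraints require fun(x) >= 0, so pass -g_i for g_i(x) <= 0
cons = [{'type': 'ineq', 'fun': (lambda x, gi=gi: -gi(x))} for gi in g]
res = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=cons)
print(res.x)   # approximately [1.5, 0.5], the minimizer over the feasible set
```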
The Lagrangian function we introduce helps to convert the constrained optimization problem into an unconstrained optimization problem which can be simpler to
solve, at the expense of introducing additional variables called Lagrange multipliers, denoted by λi ∈ R+ , i = 1, . . . , m. The Lagrange multiplier λi corresponds
to the ith constraint gi (x) ≤ 0. We form the Lagrangian function L(x, λ) given
below, and the basic philosophy of this approach is to obtain the optimal solution as
the saddle point of the Lagrangian function. This requires the stationarity condition S and the complementary slackness (CS) condition to be satisfied at a feasible
point x∗ . The conditions S and CS together are called the KKT conditions.
L(x, λ) = f + λg = f (x) + Σ_{i=1}^m λi gi (x)
Given a feasible x∗ ∈ Rn , suppose there exists λ∗ ∈ Rm+ (i.e., λ∗i ≥ 0 for all i) such that
(S)   ∇f (x∗ ) + Σ_{i=1}^m λ∗i ∇gi (x∗ ) = 0
(CS)  Σ_{i=1}^m λ∗i gi (x∗ ) = 0.
Then x∗ is a global optimal solution for the primal problem.
Here S is the Stationarity condition and CS is the Complementary Slackness condition.
The conditions of stationarity and complementary slackness together are called the
KKT conditions. If x∗ satisfies the KKT conditions, then x∗ is a KKT point.
The vector λ∗ which figures in the equations labelled S and CS is called a KKT
vector. Therefore x∗ in conjunction with λ∗ yields the saddle point (x∗ , λ∗ ) of the
Lagrangian function L.
The complementary slackness condition implies that if at x∗ the primal constraint
gi (x∗ ) ≤ 0 is met with a strict inequality (gi (x∗ ) < 0), i.e., there is slack in the ith
constraint, then the corresponding λ∗i = 0, as λ∗i is non-negative. Conversely, if the
corresponding λ∗i > 0, then the primal constraint must be met with equality at x∗
(gi (x∗ ) = 0). Hence the name complementary slackness. A constraint that is satisfied
with equality (gi (x∗ ) = 0) is called binding or active. Complementary slackness
requires that, corresponding to the saddle point (x∗ , λ∗ ) of the Lagrangian,
a component λ∗i of the vector λ∗ can be positive only when the constraint gi is tight
(gi (x∗ ) = 0), and λ∗i = 0 when the constraint gi is inactive (gi (x∗ ) < 0).
So at optimality:
Slack in primal constraint gi (gi (x∗ ) < 0) =⇒ corresponding λ∗i = 0
Constraint gi met with equality (gi (x∗ ) = 0) =⇒ corresponding λ∗i ≥ 0
Example
This example is taken from ’Appendix C: Convex Optimization’ of the book [KMK04]
min (x1 − 5)² + (x2 − 5)²
subject to
x1² + x2² − 5 ≤ 0
(1/2) x1 + x2 − 2 ≤ 0
−x1 ≤ 0
−x2 ≤ 0
x ∈ R2
Plugging this into the standard primal form, we can see that
m = 4, n = 2
f (x1 , x2 ) = (x1 − 5)² + (x2 − 5)²
g1 (x1 , x2 ) = x1² + x2² − 5
g2 (x1 , x2 ) = (1/2) x1 + x2 − 2
g3 (x1 , x2 ) = −x1 ,   g4 (x1 , x2 ) = −x2
Also we can find the gradients
∇f (x) = (2(x1 − 5), 2(x2 − 5))ᵀ
∇g1 (x) = (2x1 , 2x2 )ᵀ ,   ∇g2 (x) = (1/2, 1)ᵀ
∇g3 (x) = (−1, 0)ᵀ ,   ∇g4 (x) = (0, −1)ᵀ
Consider the point x∗ = (2, 1)ᵀ .
At this point the first and the second constraints are binding. We have
∇f (x∗ ) = (−6, −8)ᵀ ,   ∇g1 (x∗ ) = (4, 2)ᵀ ,   ∇g2 (x∗ ) = (1/2, 1)ᵀ
Take λ∗ = (2/3, 20/3, 0, 0)ᵀ . We can see that at the point x∗ :
∇f (x∗ ) + λ∗1 ∇g1 (x∗ ) + λ∗2 ∇g2 (x∗ ) = 0 (Stationarity condition), and
Σ_{i=1}^4 λ∗i gi (x∗ ) = 0, because g1 (x∗ ) = g2 (x∗ ) = 0 and λ∗3 = λ∗4 = 0 (Complementary slackness).
So the KKT conditions are satisfied at the point (x∗ , λ∗ ) with x∗ = (2, 1)ᵀ .
Since the problem is convex and x∗ is a KKT point, x∗ is a global optimal solution,
i.e., f (x∗ ) ≤ f (x) for every feasible x. The optimum value of f (x) is 25. From
Figure 3.1 we can see that the contour f (x) = 25 passes through x∗ .
As f (x) is strictly convex, x∗ is the unique global optimum.
[Figure 3.1 here: contours f (x) = h for h = 1, 4, 9, 16, 25 centred at (5,5), the constraint curves g1 (x) = 0, g2 (x) = 0, g3 (x) = 0, g4 (x) = 0, the point x∗ , and the gradient vectors ∇f (x∗ ), ∇g1 (x∗ ), ∇g2 (x∗ ).]
Figure 3.1: Geometry of the convex optimization example problem. The quarter
circles centred at (5,5) are portions of contours of constant value of the
objective function; that is, they have equations (x1 − 5)² + (x2 − 5)² = h, for
various values of h ≥ 0; shown are these curves for h = 1, 4, 9, 16, 25.
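To close the loop on this example, the following numpy sketch (not part of the original notes; it simply re-checks the numbers above, using the problem data as reconstructed here) verifies feasibility, stationarity and complementary slackness at x∗ = (2, 1) with λ∗ = (2/3, 20/3, 0, 0).

```python
import numpy as np

x   = np.array([2.0, 1.0])              # candidate point x*
lam = np.array([2/3, 20/3, 0.0, 0.0])   # multipliers lambda*

g = np.array([x[0]**2 + x[1]**2 - 5,    # g1(x*)
              0.5 * x[0] + x[1] - 2,    # g2(x*)
              -x[0],                    # g3(x*)
              -x[1]])                   # g4(x*)
grad_g = np.array([[2 * x[0], 2 * x[1]],
                   [0.5,      1.0],
                   [-1.0,     0.0],
                   [0.0,     -1.0]])
grad_f = np.array([2 * (x[0] - 5), 2 * (x[1] - 5)])

stationarity = grad_f + lam @ grad_g    # should be the zero vector  (condition S)
comp_slack   = lam @ g                  # should be zero              (condition CS)
feasible     = bool(np.all(g <= 1e-12) and np.all(lam >= 0))

print(np.allclose(stationarity, 0.0), np.isclose(comp_slack, 0.0), feasible)  # True True True
```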
References
[BV]     Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2004. http://www.stanford.edu/~boyd/cvxbook/
[KMK04]  Anurag Kumar, D. Manjunath, and Joy Kuri, Communication Networking: An Analytical Approach, Morgan Kaufmann, 2004.