Machine Learning Lecture 7
A cursory introduction to nonlinear optimization
Hyeong In Choi
In this lecture we give a cursory introduction to some of the important aspects of nonlinear
optimization, with an eye toward applications in machine learning in general and the support vector
machine in particular. The general form of the optimization problem we deal with is as follows.
Problem. (General Problem)
minimize    f0(x)
subject to  fi(x) ≤ 0,   for i = 1, · · · , m
            hj(x) = 0,   for j = 1, · · · , p,
where x ∈ Rn, and f0, fi, and hj are continuously differentiable (C1) functions defined on Rn.
This is of course a very difficult problem for which there is no complete general solution. But
under some mild regularity conditions, there are powerful results giving necessary conditions at a
minimum point. The most famous is the celebrated Karush-Kuhn-Tucker (KKT) condition, which is
stated and proved in Theorem 1 below. Before we proceed, let us clarify the setup and some
terminology. First, in our formulation, we assume the functions f0, fi, and hj are defined on the
whole of Rn. This assumption can be relaxed by requiring them to be defined only on some subset
of Rn; our choice is made for simplicity of presentation. We say a point x ∈ Rn is a feasible point
of the Problem if x satisfies the constraints, i.e., fi(x) ≤ 0 for i = 1, · · · , m and hj(x) = 0 for
j = 1, · · · , p. A point x̃ is called a local minimum point if x̃ is a feasible point and there is a
neighborhood N of x̃ on which f0(x̃) ≤ f0(x) for every feasible point x ∈ N. In this case, we also
say that f0 has a local minimum at x̃, and f0(x̃) is called the local minimum value or local minimum.
In what follows, we frequently borrow the ideas and results from multi-variate (vector) calculus
with which the reader is assumed to be familiar. Any reader who has some basic knowledge of
differentiable manifold theory and transversality results will find our presentation intuitive and
easy to understand. But of course the reader is not assumed to know any of these beyond vector
calculus. We need the following definition.
Definition 1. Let x̃ ∈ Rn be a feasible point. We define the active set AC(x̃) and the inactive set
IC(x̃) of indices by
AC(x̃) = {i | fi (x̃) = 0}
IC(x̃) = {i | fi (x̃) < 0}.
A feasible point x̃ is called a regular point if the vectors ∇fi (x̃) for i ∈ AC(x̃) and ∇hj (x̃) for
j = 1, · · · , p are linearly independent.
The geometric meaning of this definition can be seen as follows. Let N be a sufficiently small
neighborhood of x̃ and let
M = {x ∈ N | fi(x) = 0 for i ∈ AC(x̃), hj(x) = 0 for j = 1, · · · , p}.
Then x̃ being regular means that M is a C1 manifold, which is the same thing as saying that M,
after a suitable local coordinate change in Rn, looks like a piece of Rn−a in Rn, where a = p + |AC(x̃)|.
And of course the vectors ∇fi (x̃) for i ∈ AC(x̃) and ∇hj (x̃) for j = 1, · · · , p are vectors spanning
the normal space of M at x̃. The situation is depicted in Figure 1.
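To make the regularity condition concrete, here is a minimal numerical sketch (not part of the original notes; the constraint functions and the point are made up for illustration). It stacks the gradients of the active inequality constraints and of the equality constraints as rows of a matrix and checks linear independence via the rank.

```python
import numpy as np

# Illustrative toy problem in R^3 (not from the lecture):
#   f1(x) = x1^2 + x2^2 - 1 <= 0,  f2(x) = x3 <= 0,  h1(x) = x1 + x2 + x3 = 0.
x = np.array([1.0, 0.0, -1.0])       # a feasible point: f1(x) = 0 (active), f2(x) = -1 (inactive), h1(x) = 0

f_vals = np.array([x[0]**2 + x[1]**2 - 1.0, x[2]])
active = np.where(np.isclose(f_vals, 0.0))[0]      # AC(x): indices of active inequality constraints

grad_f = np.array([[2 * x[0], 2 * x[1], 0.0],      # grad f1
                   [0.0, 0.0, 1.0]])               # grad f2
grad_h = np.array([[1.0, 1.0, 1.0]])               # grad h1

G = np.vstack([grad_f[active], grad_h])            # gradients entering the regularity condition
is_regular = np.linalg.matrix_rank(G) == G.shape[0]
print("active set:", active, "regular point:", is_regular)   # expected: [0], True
```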
The most important result in nonlinear optimization is the following.
Theorem 1. (KKT Condition) Let x̃ be a local minimum point of the General Problem. If x̃ is
regular, there exist λi for i = 1, · · · , m, and νj for j = 1, · · · , p such that
∇f0(x̃) + ∑_{i=1}^{m} λi ∇fi(x̃) + ∑_{j=1}^{p} νj ∇hj(x̃) = 0,
where λi ≥ 0 for i = 1, · · · , m.
This is a very geometric result that becomes intuitively obvious once the geometric picture is
clarified. Before embarking on the proof, let us look at some special cases to gain understanding.
The first is the following problem with no inequality constraints.
Figure 1: M and its normal vectors at x̃
Lemma 1. (Lagrange Multiplier) Let x̃ be a local minimum point of the following problem:
minimize    f0(x)
subject to  hj(x) = 0,   for j = 1, · · · , p.
Then there exist νj for j = 1, · · · , p such that
∇f0(x̃) + ∑_{j=1}^{p} νj ∇hj(x̃) = 0.
Proof. As above, let N be a sufficiently small neighborhood of x̃ and let M = {x ∈ N | hj(x) = 0
for j = 1, · · · , p}. Denote by f0|M the function f0(x) restricted to M. Since f0|M has a local
minimum at x̃, we must have
∇f0(x̃) · γ′(0) = (d/dt)|_{t=0} f0(γ(t)) = 0,
for any C1 curve γ(t) in M with γ(0) = x̃. Since this is true for any such curve, ∇f0(x̃) is
perpendicular to every tangent vector of M at x̃. Thus it must be a normal vector. But
∇h1(x̃), · · · , ∇hp(x̃) span the normal space of M at x̃. Therefore ∇f0(x̃) must be a linear
combination of ∇h1(x̃), · · · , ∇hp(x̃).
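As a quick sanity check of Lemma 1, the following sketch works out a toy instance (the problem and the numbers are illustrative, not from the lecture): minimizing x² + y² subject to x + y = 1, whose minimizer is (1/2, 1/2), and recovering the multiplier ν = −1 numerically.

```python
import numpy as np

# Toy check of Lemma 1 (illustrative only):
#   minimize f0(x, y) = x^2 + y^2   subject to   h(x, y) = x + y - 1 = 0.
# The minimizer is x_tilde = (1/2, 1/2); we verify grad f0 + nu * grad h = 0 for some nu.

x_tilde = np.array([0.5, 0.5])

grad_f0 = 2.0 * x_tilde          # gradient of x^2 + y^2 at x_tilde
grad_h = np.array([1.0, 1.0])    # gradient of x + y - 1

# Solve grad_f0 + nu * grad_h = 0 in the least-squares sense.
nu, *_ = np.linalg.lstsq(grad_h.reshape(-1, 1), -grad_f0, rcond=None)
residual = grad_f0 + nu[0] * grad_h

print("nu =", nu[0])                                 # expected: -1.0
print("residual norm =", np.linalg.norm(residual))   # expected: ~0
```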
The next case to look at is the problem with no equality constraints.
Lemma 2. (Problem with no equality constraints) Let x̃ be a local minimum point of the
following problem:
minimize    f0(x)
subject to  fi(x) ≤ 0,   for i = 1, · · · , m.
Then there exist λi ≥ 0 for i = 1, · · · , m such that
∇f0(x̃) + ∑_{i=1}^{m} λi ∇fi(x̃) = 0.    (1)
This lemma can be understood intuitively as follows. Define the domain Di in Rn by Di = {x ∈
Rn | fi(x) ≤ 0} for i = 1, · · · , m, and let D = ∩_{i=1}^{m} Di. Let x be a point on the boundary of Di.
Then ∇fi(x) is the normal vector at x to the level set {fi = 0} = {x | fi(x) = 0}, pointing outward
from Di.
Figure 2: ∇fi (x̃) and the outward normal cone
Suppose x̃ is a local minimum point of f0(x). If x̃ is in the interior of D, (1) holds with all λi = 0.
So assume x̃ is on the boundary of D and let AC(x̃) be the active set.
It is instructive to look at Figure 2. Since x̃ is a local minimum point of f0 in D, which is the
shaded region, its gradient vector ∇f0(x̃) points inward. It means that −∇f0(x̃) is in the outward
normal cone, which is the positive linear span of ∇f1(x̃) and ∇f2(x̃). This is an intuitive argument
for the proof of Lemma 2.
The proof for the general case can be given along this line, but its rigorous proof is contained
in that of Theorem 1.
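Before turning to the general proof, here is a minimal numerical illustration of the KKT condition of Lemma 2 on a made-up problem: minimizing (x1 − 2)² + (x2 − 2)² subject to x1 + x2 − 2 ≤ 0. The minimizer is (1, 1), and the sketch recovers a nonnegative multiplier λ = 2.

```python
import numpy as np

# Illustrative check of the KKT condition (toy problem, not from the lecture):
#   minimize f0(x) = (x1-2)^2 + (x2-2)^2   subject to   f1(x) = x1 + x2 - 2 <= 0.
# The minimizer is x_tilde = (1, 1), the projection of (2, 2) onto the feasible half-plane.

x_tilde = np.array([1.0, 1.0])

grad_f0 = 2.0 * (x_tilde - np.array([2.0, 2.0]))   # = (-2, -2)
grad_f1 = np.array([1.0, 1.0])

# Solve grad_f0 + lam * grad_f1 = 0 in the least-squares sense and check lam >= 0.
lam, *_ = np.linalg.lstsq(grad_f1.reshape(-1, 1), -grad_f0, rcond=None)
stationarity = grad_f0 + lam[0] * grad_f1

print("lambda =", lam[0])                                       # expected: 2.0 (nonnegative)
print("stationarity residual:", np.linalg.norm(stationarity))   # expected: ~0
print("f1(x_tilde) =", x_tilde.sum() - 2.0)                     # active constraint: 0
```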
Proof of Theorem 1
Proof. Let x̃ be a local minimum point of the General Problem. Let AC(x̃) and IC(x̃) be its active
and inactive sets respectively, and define the following Reduced Problem:
Problem. (Reduced Problem)
minimize    f0(x)
subject to  fi(x) = 0,   for i ∈ AC(x̃)
            hj(x) = 0,   for j = 1, · · · , p.
Note that, near x̃, the feasible set of this Reduced Problem is contained in the feasible set of the
General Problem (by continuity, the inactive constraints remain strictly negative in a small
neighborhood of x̃). Therefore x̃ is also a local minimum point of the Reduced Problem. As x̃ is
regular, by Lemma 1, there exist λi and νj for i ∈ AC(x̃) and j = 1, · · · , p such that
∇f0(x̃) + ∑_{i∈AC(x̃)} λi ∇fi(x̃) + ∑_{j=1}^{p} νj ∇hj(x̃) = 0.    (2)
Setting λi = 0 for i ∈ IC(x̃), we have
∇f0(x̃) + ∑_{i=1}^{m} λi ∇fi(x̃) + ∑_{j=1}^{p} νj ∇hj(x̃) = 0.
It remains to prove λi ≥ 0 for i ∈ AC(x̃). Suppose, for example, that 1 ∈ AC(x̃) and λ1 < 0. Define
M by
M = {x ∈ Rn | fi(x) = 0 for i ∈ AC(x̃), i ≠ 1, and hj(x) = 0 for j = 1, · · · , p},
i.e., M is the manifold defined by all equality constraints and all active constraints except for i = 1.
Now the regularity condition can be reinterpreted as saying that neither the tangent plane at x̃ of
{f1 = 0} nor that of M can contain the other. (In the jargon of differentiable manifold theory,
{f1 = 0} and M are said to intersect transversally at x̃.) Therefore there exists a vector v tangent
to M at x̃ that cuts through the hypersurface {f1 = 0}, pointing strictly into the interior of the
region D1 = {x | f1(x) ≤ 0}. In other words, v satisfies
∇f1(x̃) · v < 0.    (3)
Let γ(t) be a curve on M such that γ(0) = x̃ and γ′(0) = v. Taking the inner product of (2)
with v, we have
∇f0(x̃) · v + ∑_{i∈AC(x̃)} λi ∇fi(x̃) · v + ∑_{j=1}^{p} νj ∇hj(x̃) · v = 0.    (4)
Since v is tangent to M at x̃, and ∇fi(x̃) (for i ∈ AC(x̃), i ≠ 1) and ∇hj(x̃) (for j = 1, · · · , p)
are normal to M at x̃, the equality (4) becomes
∇f0(x̃) · v + λ1 ∇f1(x̃) · v = 0.
Thus
(d/dt)|_{t=0} f0(γ(t)) = ∇f0(x̃) · γ′(0) = ∇f0(x̃) · v = −λ1 ∇f1(x̃) · v.
Using (3) and the assumption that λ1 < 0, we conclude that
(d/dt)|_{t=0} f0(γ(t)) < 0.    (5)
It is easy to see that γ(t) is a feasible point for 0 ≤ t ≤ ε for some ε > 0. But then by (5), x̃ = γ(0)
cannot be a local minimum point. Therefore no λi can be negative.
Remark. In the course of the proof of Theorem 1, the λi's are set so that if i ∉ AC(x̃), i.e.,
fi(x̃) < 0, then λi = 0. Since fi(x̃) = 0 for i ∈ AC(x̃), we must have
∑_{i=1}^{m} λi fi(x̃) = 0.    (6)
On the other hand, assume that λi ≥ 0 and that (6) holds at a feasible point x̃. Then since fi(x̃) ≤ 0,
it is easy to conclude that
• if λi > 0, then fi(x̃) = 0;
• if fi(x̃) < 0, then λi = 0.
These relations are called the complementary slackness relations.
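The complementary slackness relations are easy to check mechanically. The following tiny sketch (the multipliers and constraint values are made-up numbers) verifies relation (6) term by term, together with the two implications above.

```python
import numpy as np

# Checking complementary slackness on made-up numbers (illustrative only).
lam = np.array([2.0, 0.0, 0.0])        # hypothetical multipliers (all >= 0)
f_at_x = np.array([0.0, -1.5, -0.3])   # hypothetical constraint values f_i(x_tilde) (all <= 0)

assert np.allclose(lam * f_at_x, 0.0)                                 # relation (6), term by term
assert all(f == 0.0 for lam_i, f in zip(lam, f_at_x) if lam_i > 0)    # lam_i > 0  =>  f_i = 0
assert all(lam_i == 0.0 for lam_i, f in zip(lam, f_at_x) if f < 0)    # f_i < 0    =>  lam_i = 0
print("complementary slackness holds for this example")
```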
Lagrangian Duality Theory
Lagrangian duality is a method of setting up a derived problem from the General Problem, and it
has a wide variety of applications in theoretical as well as practical situations. For the rest of
this lecture, we follow Boyd and Vandenberghe's book entitled Convex Optimization.
Following the naming convention of Lagrangian duality theory, let us call the General Problem
the Primal Problem (P). We first need the following definition.
Definition 2. (Lagrangian)
L(x, λ, ν) = f0(x) + ∑_{i=1}^{m} λi fi(x) + ∑_{j=1}^{p} νj hj(x).
Thus L is a function defined on Rn × Rm × Rp, and the λi's and νj's are called the dual variables.
Out of this we define the following:
Definition 3. (Dual Function) Define
g(λ, ν) = inf_x L(x, λ, ν).
Thus g is a function defined on Rm × Rp.
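To see the definition in action, here is a sketch computing the dual function of a toy one-dimensional problem (made up for illustration): minimize x² subject to 1 − x ≤ 0. A short calculation gives g(λ) = λ − λ²/4, and the code compares a crude numerical infimum with this closed form.

```python
import numpy as np

# Dual function of a toy problem (illustration, not from the lecture):
#   minimize f0(x) = x^2   subject to   f1(x) = 1 - x <= 0     (i.e. x >= 1).
# L(x, lam) = x^2 + lam * (1 - x); minimizing over x gives x = lam/2 and g(lam) = lam - lam^2/4.

def lagrangian(x, lam):
    return x**2 + lam * (1.0 - x)

def g(lam, grid=np.linspace(-10.0, 10.0, 200001)):
    # crude numerical infimum over x, just to illustrate the definition of g
    return lagrangian(grid, lam).min()

for lam in [0.0, 1.0, 2.0, 4.0]:
    print(lam, g(lam), lam - lam**2 / 4.0)   # numerical vs closed-form value of g(lam)
```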
Remark. (Notation Convention)
We set up a few notational conventions to be used throughout this lecture. First, p∗ always denotes
the optimal value of the Primal Problem. For a vector quantity like λ = (λ1, · · · , λm), we say λ ≥ 0
if every component is non-negative, i.e., λi ≥ 0 for i = 1, · · · , m. Similarly, for two vectors
a = (a1, · · · , an) and b = (b1, · · · , bn), we write a ≥ b if ai ≥ bi for i = 1, · · · , n.
The following lemma is trivial, but nonetheless is very important.
Lemma 3. For any λ ≥ 0 and for any ν, we have
g(λ, ν) ≤ p∗.
Proof. Let x be a feasible point. Then fi(x) ≤ 0 for i = 1, · · · , m and hj(x) = 0 for j = 1, · · · , p.
Therefore
L(x, λ, ν) ≤ f0(x)    (7)
for any feasible point x. Then by the definition of g(λ, ν),
g(λ, ν) ≤ L(x, λ, ν) ≤ f0(x).
Since the left-hand side of the above inequality is independent of x, the proof follows by taking the
infimum of the right-hand side over all feasible points x.
We now set up a dual problem.
Problem. (Dual Problem: DP)
maximize    g(λ, ν)
subject to  λ ≥ 0.
This dual problem is in many ways easier to deal with. Let us see why. Note first that L(x, λ, ν) is
a linear (hence concave) function of λ and ν when x is regarded as a parameter. Since the pointwise
infimum of a family of concave functions is still concave, g(λ, ν) is a concave function. Hence the
Dual Problem is equivalent to a convex optimization problem of minimizing the convex function
−g(λ, ν) under linear constraints, for which there is a general theory and there are efficient
algorithms.
Let d∗ be the optimal value of the Dual Problem, i.e.,
d∗ = max g(λ, ν),
under the constraint λ ≥ 0. Since the right-hand side of the inequality in Lemma 3 is independent
of λ and ν, the inequality still holds after the maximum of g(λ, ν) is taken. Thus we have
d∗ ≤ p∗.
This is called Weak Duality.
Remark. This weak duality holds even if p∗ = −∞ or d∗ = ∞. Suppose that p∗ = −∞. Then using
(7), it is easily seen that d∗ = −∞. Similarly, d∗ = +∞ implies that p∗ = +∞.
A word on the optimal value is in order in case there is no feasible point. First, recall the
standard convention that the supremum of the empty set is −∞, while the infimum of the empty
set is +∞. In our context, if there is no feasible point for the primal problem, then p∗ = +∞, while
the existence of some feasible point for it implies that p∗ cannot be +∞. Similarly, if there is no
feasible point for the dual problem, then d∗ = −∞, while the existence of some feasible point for it
implies that d∗ cannot be −∞.
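As a numerical illustration of weak duality, the following sketch continues the toy problem from the previous code example (minimize x² subject to 1 − x ≤ 0, so p∗ = 1) and checks that the best dual value never exceeds p∗; in this particular example the two actually coincide.

```python
import numpy as np

# Weak duality on the toy problem above (illustrative only):
#   minimize x^2 subject to 1 - x <= 0, so p* = 1 (attained at x = 1),
#   and g(lam) = lam - lam^2 / 4 for lam >= 0.

p_star = 1.0
lams = np.linspace(0.0, 10.0, 100001)   # grid over the dual feasible set lam >= 0
g_vals = lams - lams**2 / 4.0

d_star = g_vals.max()
print("d* =", d_star, "p* =", p_star)   # here d* = 1.0 = p*, and in general d* <= p*
assert d_star <= p_star + 1e-12
```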
Strong Duality
If the inequality in the weak duality becomes an equality, i.e.,
d∗ = p∗,
we say that Strong Duality holds. If so, one can find the optimal value of the Primal Problem
by solving the Dual Problem. Since dual problems are convex optimization problems, they are
easier to handle and hence are preferred.
But in general, strong duality does not hold. In what follows, we will prove that strong duality
holds if the primal problem is of a certain specific form. Fortunately, the problem that comes out
of the support vector machine is of this form. In particular, we will prove strong duality for the
following primal problem under a mild condition on the inequality constraints.
Problem. (Primal Problem in Special Form)
minimize    f0(x)
subject to  fi(x) ≤ 0,   for i = 1, · · · , m
            Ax = b,
where x ∈ Rn, f0, f1, · · · , fm are convex functions, A is a p × n matrix, and b ∈ Rp.
We need the following definition.
Definition 4. The Primal Problem in Special Form satisfies the Slater condition if there exists
a feasible point x ∈ Rn at which the inequality constraints are strict. In other words, fi (x) < 0, for
i = 1, · · · , m, and Ax = b.
Theorem 2. Assume the Slater condition is satisfied for the Primal Problem in Special Form. Then
the strong duality holds for this problem, i.e., d∗ = p∗ .
Proof. Since there is a feasible point, p∗ < ∞. If p∗ = −∞, then by weak duality d∗ = −∞, and
strong duality holds trivially. So we may assume p∗ is finite. We may also assume that the
constituent linear equations of Ax = b are linearly independent, i.e., rank(A) = p. If not, we may
combine the linear equations (for example, by Gaussian elimination) to obtain a possibly smaller
number of equivalent, linearly independent ones.
Define the sets A and B by
A = {(u, v, t) | ∃x ∈ Rn such that f (x) ≤ u, Ax − b = v, f0 (x) ≤ t},
where f (x) = (f1 (x), · · · , fm (x)), u = (u1 , · · · , um ) ∈ Rm , v = (v1 , · · · , vp ) ∈ Rp , and
B = {(0, 0, s) ∈ Rm × Rp × R | s < p∗ }.
Homework. Prove A and B are non-empty convex sets.
We claim A and B are disjoint. Suppose there exists some (u, v, t) ∈ A ∩ B. Since (u, v, t) ∈ B, we
have u = 0 and v = 0, so by the definition of A there exists x̃ ∈ Rn such that fi(x̃) ≤ 0 and Ax̃ = b;
that is, x̃ is feasible. Since (0, 0, t) ∈ B, we have t < p∗. And (0, 0, t) ∈ A implies that f0(x̃) ≤ t.
So f0(x̃) < p∗, which is absurd since x̃ is feasible. Thus A ∩ B = ∅.
Then, applying the separating hyperplane theorem, which says that any two disjoint convex sets
can be separated by some hyperplane, we can find (λ̄, ν̄, µ) ≠ 0 and α such that
λ̄ᵀu + ν̄ᵀv + µt ≥ α   for all (u, v, t) ∈ A,    (8)
and
λ̄ᵀu + ν̄ᵀv + µt ≤ α   for all (u, v, t) ∈ B.    (9)
Since each ui is unbounded from above and the left-hand side of (8) is bounded below by the finite
number α, no λ̄i can be negative. Similarly for µ, since t is unbounded from above in A. Therefore,
we can conclude that
λ̄ ≥ 0 and µ ≥ 0.    (10)
Now for any t < p∗, (0, 0, t) ∈ B. Thus by (9), µt ≤ α. Therefore, we have
µp∗ ≤ α.    (11)
Now obviously, for any x ∈ Rn, (f(x), Ax − b, f0(x)) ∈ A. Thus plugging this into (8) and using
(11), we have
∑_{i=1}^{m} λ̄i fi(x) + ν̄ᵀ(Ax − b) + µ f0(x) ≥ µp∗,    (12)
for any x ∈ Rn. By (10), µ ≥ 0. Let us first look at the case where µ > 0. Dividing (12) by µ, we
get
∑_{i=1}^{m} (λ̄i/µ) fi(x) + (ν̄/µ)ᵀ(Ax − b) + f0(x) ≥ p∗.
Define λi = λ̄i/µ and ν = ν̄/µ. Then the above inequality becomes
L(x, λ, ν) ≥ p∗.
Therefore
g(λ, ν) = inf_x L(x, λ, ν) ≥ p∗,
which implies that
d∗ ≥ p∗.
Combined with the weak duality d∗ ≤ p∗, the strong duality
d∗ = p∗
is proved.
Now let us look at the case where µ = 0. In this case (12) becomes
∑_{i=1}^{m} λ̄i fi(x) + ν̄ᵀ(Ax − b) ≥ 0,    (13)
which holds true for any x ∈ Rn. Now let x̃ be a point at which the Slater condition is satisfied.
Since fi(x̃) < 0 and Ax̃ − b = 0, (13) becomes
∑_{i=1}^{m} λ̄i fi(x̃) ≥ 0.
Since fi(x̃) < 0 and λ̄i ≥ 0 by (10), we must have all λ̄i = 0. Now that µ = 0 is assumed and λ̄ = 0,
ν̄ must be a non-zero vector; for otherwise (λ̄, ν̄, µ) = 0. Thus (13) implies that
ν̄ᵀ(Ax − b) ≥ 0
for any x ∈ Rn. Now if ν̄ᵀA ≠ 0, then there must be some x0 such that
ν̄ᵀ(Ax0 − b) < 0.
Therefore, we must have
ν̄ᵀA = 0.    (14)
(14) means that
Aᵀν̄ = 0.
Since nullity(Aᵀ) + rank(Aᵀ) = p and ν̄ ≠ 0, we must have rank(Aᵀ) ≤ p − 1. This contradicts the
fact rank(A) = p that was put forth at the beginning of this proof. Therefore µ ≠ 0.
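As a numerical illustration of Theorem 2, the following sketch (the problem is made up for illustration) sets up a small convex problem in the special form that satisfies the Slater condition, computes its dual function in closed form, and checks that the dual optimal value matches p∗.

```python
import numpy as np

# Strong duality on a toy convex problem satisfying the Slater condition (illustrative only):
#   minimize f0(x) = x1^2 + x2^2
#   subject to f1(x) = 1 - x1 <= 0   and   A x = b with A = [0 1], b = [0].
# The point x = (2, 0) is strictly feasible, so Slater holds; p* = 1, attained at x = (1, 0).
# A short computation gives g(lam, nu) = lam - lam^2/4 - nu^2/4.

p_star = 1.0

lam = np.linspace(0.0, 10.0, 2001)      # dual variable for the inequality constraint (lam >= 0)
nu = np.linspace(-10.0, 10.0, 2001)     # dual variable for the equality constraint (unconstrained)
LAM, NU = np.meshgrid(lam, nu)
g = LAM - LAM**2 / 4.0 - NU**2 / 4.0    # closed-form dual function on the grid

d_star = g.max()
print("d* =", d_star, "p* =", p_star)   # expect d* = 1.0 = p* (strong duality)
```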
We close this lecture by giving one case in which there is a unique global minimum point.
Proposition 1. The Primal Problem in Special Form has a unique global minimum point if the
objective function f0 is strictly convex.
Homework. Prove this Proposition.
Caveat lector
This is a class note for the Machine Learning course given during the fall semester of 2014 by
Professor Hyeong In Choi, Department of Mathematics, Seoul National University. The note was taken
by Mr. Jun-Haeng Park, who prepared the LaTeX draft from which this note was made. It may contain
some typos or even some material errors, for which of course Prof. Choi is entirely responsible. If
you find any, please send a message to the instructor. Your kind cooperation is appreciated.