Mathematical Programming and Research
Methods (Part II)
4. Convexity and Optimization
Massimiliano Pontil
(based on previous lecture by Andreas Argyriou)
Today’s Plan
• Convex sets and functions
• Types of convex programs
• Algorithms
• Convex learning problems
Convexity
• The intuition is simple and originates from simple geometric shapes (e.g. polygons)
(Figure: examples of convex shapes and a shape that is NOT convex.)
• Convexity plays a very important role in optimization
Convex Sets
Definition 1. A set C ⊆ IR^d is called convex if
λx + (1 − λ)y ∈ C
for all x, y ∈ C, λ ∈ [0, 1]
• I.e. if x and y are in the set C, then the whole line segment
{λx + (1 − λ)y : λ ∈ [0, 1]} also lies in C
Convex Sets (contd.)
• We call λx + (1 − λ)y a convex combination of x and y whenever
λ ∈ [0, 1]
• Generally, for any k ∈ IN, the sum
∑_{i=1}^k λ_i x_i
is called a convex combination of the points x_1, . . . , x_k ∈ IR^d whenever
λ_1, . . . , λ_k ≥ 0 and ∑_{i=1}^k λ_i = 1
Convex Sets (contd.)
• Clearly, if a set C is convex, all convex combinations of points in C (for
all k ∈ IN) belong to C
• This set of all convex combinations is called the convex hull of C
• In general, given a set S ⊆ IR^d (S need not be convex), the convex hull
of S is the set
conv(S) := { ∑_{i=1}^k λ_i x_i : x_i ∈ S, λ_1, . . . , λ_k ≥ 0, ∑_{i=1}^k λ_i = 1, k ∈ IN }
• The convex hull is the smallest convex set containing S
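• For a finite set S, conv(S) can be computed directly; below is a minimal sketch using scipy (the point set is arbitrary, chosen for illustration):

```python
import numpy as np
from scipy.spatial import ConvexHull

# An arbitrary finite point set S in the plane (illustrative data)
S = np.array([[0, 0], [2, 0], [1, 1], [0, 2], [2, 2], [1, 0.5]])

hull = ConvexHull(S)      # computes conv(S) for a finite S
print(S[hull.vertices])   # its extreme points; interior points drop out
```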
Convex Sets (contd.)
• S = conv(S) if and only if S is convex
Examples of Convex Sets
• Affine sets, i.e. sets of solutions of linear equations {x : Ax = b}
• Convex cones, i.e. sets containing any nonnegative combination
∑_{i=1}^k θ_i x_i, θ_1, . . . , θ_k ≥ 0, of their points
(Figure: a convex cone in three dimensions.)
Examples of Convex Sets (contd.)
• Hyperplanes, i.e. sets of the form {x : a⊤x = b}, where
a ∈ IR^d, a ≠ 0, b ∈ IR (since they are special cases of affine sets)
• Halfspaces, i.e. sets of the form {x : a⊤x ≤ b}, where
a ∈ IR^d, a ≠ 0, b ∈ IR
Polyhedra
• A polyhedron is a set defined by a finite number of affine equalities and
inequalities
P = {x : a_i⊤x ≤ b_i, i = 1, . . . , m,  c_j⊤x = d_j, j = 1, . . . , p}
where a_1, . . . , a_m, c_1, . . . , c_p ∈ IR^d, b_1, . . . , b_m, d_1, . . . , d_p ∈ IR
• Polyhedra are convex sets
Polyhedra (contd.)
• A bounded polyhedron is called a polytope
• A set is a polytope if and only if it is the convex hull of a finite set of
points
The Positive Semidefinite Cone
• We use the notations
X ⪰ 0
X ≻ 0
to denote that a d × d matrix X is positive semidefinite and positive
definite, respectively
• The sets
S^d_+ := {X ∈ IR^{d×d} : X ⪰ 0}
and
S^d_{++} := {X ∈ IR^{d×d} : X ≻ 0}
are called the positive semidefinite cone and positive definite cone,
respectively
The Positive Semidefinite Cone (contd.)
• S^d_+ and S^d_{++} are convex cones
Proof. For any A, B ∈ S^d_+, θ_1, θ_2 ≥ 0, the matrix θ_1A + θ_2B is psd. □
• E.g. in IR^{2×2}, the positive semidefinite cone consists of the matrices of
the form
( x y )
( y z )
such that x, z ≥ 0, xz ≥ y²
(Figure: the boundary of the positive semidefinite cone for 2 × 2 matrices, plotted in (x, y, z).)
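• A minimal numerical check of these facts (the matrices are arbitrary examples; eigenvalues are used in place of the x, z, y conditions):

```python
import numpy as np

def is_psd(X, tol=1e-10):
    """Numerically test X ⪰ 0 via the smallest eigenvalue (X symmetric)."""
    return np.linalg.eigvalsh(X).min() >= -tol

A = np.array([[2.0, 1.0], [1.0, 1.0]])   # x=2, z=1, xz=2 ≥ y²=1  -> psd
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # xz=1 < y²=4            -> not psd
print(is_psd(A), is_psd(B))              # True False

# The psd cone is a convex cone: nonnegative combinations stay psd
print(is_psd(0.3 * A + 2.0 * np.eye(2)))  # True
```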
Norms
• A norm, denoted by ‖·‖, is a function from IR^d to IR_+ such that
1. ‖w‖ ≥ 0, for all w ∈ IR^d
2. ‖w‖ = 0 if and only if w = 0
3. ‖aw‖ = |a| ‖w‖, for all a ∈ IR, w ∈ IR^d (homogeneity)
4. ‖w + z‖ ≤ ‖w‖ + ‖z‖, for all w, z ∈ IR^d (triangle inequality)
• Important example: the Lp norm
‖w‖_p := ( ∑_{i=1}^d |w_i|^p )^{1/p}
where p ∈ [1, +∞)
Norms (contd.)
• We have already seen the L2 norm – it is the regularizer in ridge
regression, SVM etc.
‖w‖_2 = ( ∑_{i=1}^d w_i² )^{1/2} = (w⊤w)^{1/2}
• The L1 norm
‖w‖_1 = ∑_{i=1}^d |w_i|
• Letting p → +∞, we get the L∞ norm
‖w‖_∞ = max_{i=1,...,d} |w_i|
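• For concreteness, the three norms can be evaluated with numpy (a minimal sketch):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(w, 1))       # L1 norm: |3| + |-4| + |0| = 7
print(np.linalg.norm(w, 2))       # L2 norm: sqrt(9 + 16) = 5
print(np.linalg.norm(w, np.inf))  # L∞ norm: max |w_i| = 4
```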
Norm Balls
• The unit ball for a norm is the set {w : ‖w‖ ≤ 1}
(Figure: the L1, L2 and L∞ unit balls in IR², a diamond, a disc and a square, respectively.)
Norm Balls (contd.)
• In general, any norm ball of the form
{w : ‖w − c‖ ≤ r}
where c ∈ IR^d and r ≥ 0 are the center and radius of the ball,
respectively, is a convex set
Convex Functions
• A function f : IR^d → IR is called convex if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ IR^d, λ ∈ [0, 1]
• Intuition: the line segment connecting any two points on the graph of f
lies on or above the graph
(Figure: a convex function and a non-convex function.)
Convex Functions (contd.)
• Similarly, a function f is called concave if
f (λx + (1 − λ)y) ≥ λf (x) + (1 − λ)f (y)
for all x, y ∈ IR^d, λ ∈ [0, 1]
or, equivalently, if −f is convex
Strict Convexity
• A function f : IR^d → IR is called strictly convex if
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)
for all x, y ∈ IR^d, x ≠ y, λ ∈ (0, 1)
• I.e. the line segment connecting any two points on the graph of f lies
strictly above the graph
• Equivalently, the graph of a strictly convex function does not contain
any line segments
Strict Convexity (contd.)
(Figure: a function that is convex but not strictly convex, whose graph contains a line segment, next to a strictly convex one.)
Jensen’s Inequality
• If f is convex, it follows by induction that
f( ∑_{i=1}^k λ_i x_i ) ≤ ∑_{i=1}^k λ_i f(x_i)
for all k ∈ IN, x_1, . . . , x_k ∈ IR^d, λ_1, . . . , λ_k ≥ 0 such that ∑_{i=1}^k λ_i = 1
• It can be generalised to integrals and expected values
(it is used e.g. to derive the EM algorithm)
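• A quick numerical illustration of Jensen's inequality (a sketch; f = exp and the points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp                      # a convex function on IR

x = rng.normal(size=5)          # arbitrary points x_1, ..., x_k
lam = rng.random(5)
lam /= lam.sum()                # λ_i ≥ 0, ∑ λ_i = 1

lhs = f(lam @ x)                # f(∑ λ_i x_i)
rhs = lam @ f(x)                # ∑ λ_i f(x_i)
print(lhs <= rhs)               # True: Jensen's inequality
```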
Continuity / Differentiability
Theorem 1. If a function f is convex on IR^d then it is also continuous
on IR^d.
Proof. Not easy. □
• There are convex functions which are not differentiable everywhere
(and others which are)
Second Order Condition
Theorem 2. Assume that a function f is twice differentiable on IR^d.
Then f is convex if and only if its Hessian is psd:
∇²f(w) ⪰ 0   for all w ∈ IR^d
• Recall that the Hessian is the matrix formed by the second partial
derivatives
∇²f(w) := ( ∂²f/(∂w_i ∂w_j)(w) )_{i,j=1}^d
• Note: the condition ∇²f ≻ 0 implies strict convexity, but the converse
is not true – see [Boyd & Vandenberghe]
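• For instance, for f(w) = w⊤Aw with A symmetric the Hessian is the constant matrix 2A, so the condition can be checked from eigenvalues (a minimal sketch with an arbitrary A):

```python
import numpy as np

# f(w) = w⊤Aw with A symmetric has constant Hessian ∇²f(w) = 2A,
# so convexity amounts to the smallest eigenvalue of 2A being ≥ 0
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
hessian = 2 * A
print(np.linalg.eigvalsh(hessian))            # all ≥ 0 -> f is convex
print(np.linalg.eigvalsh(hessian).min() > 0)  # > 0 -> strictly convex
```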
Examples of Convex Functions
• Affine functions w 7→ a⊤w + b are both convex and concave
• Exponentials, powers, log-sum-exp
(Figures: plots of an exponential and of a power |w|^p on [−1, 1].)
– w ∈ IR ↦ e^{aw}, for any a ∈ IR
– w ∈ IR ↦ |w|^p, for any p ≥ 1
– w ∈ IR^d ↦ log ∑_{i=1}^d e^{w_i} (log-sum-exp)
Examples of Convex Functions (contd.)
• Psd quadratic functions
f(w) = w⊤Aw + a⊤w + b,   for all w ∈ IR^d
where A ∈ S^d_+, a ∈ IR^d, b ∈ IR
Examples of Convex Functions (contd.)
• Max function w 7→ max{w1, . . . , wd}
Norms Are Convex Functions
• Every norm is a convex function
Proof. Let w, z ∈ IR^d and λ ∈ (0, 1). Then
‖λw + (1 − λ)z‖ ≤ ‖λw‖ + ‖(1 − λ)z‖   (triangle inequality)
= λ‖w‖ + (1 − λ)‖z‖   (homogeneity) □
• No norm is strictly convex (to see this, select z = aw with a > 0, a ≠ 1)
• The square of every norm, w ↦ ‖w‖², is a convex function
Proof. Do it as an exercise. □
Operations that Preserve Convexity
Question: If f_1, . . . , f_q : IR^d → IR are convex functions, which operations
F can we apply so that f := F(f_1, . . . , f_q) is also convex?
• Nonnegative weighted sums
f = ∑_{i=1}^q θ_i f_i
where θ_1, . . . , θ_q ≥ 0
Proof. Easy from the definition. □
Operations that Preserve Convexity (contd.)
• Composition of a convex function with an affine map
f(x) = g(Ax + b)   for all x ∈ IR^d
where g : IR^n → IR is convex, A ∈ IR^{n×d} is a matrix and b ∈ IR^n
Proof. Let x, y ∈ IR^d, λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = g(λAx + (1 − λ)Ay + b)
= g(λ(Ax + b) + (1 − λ)(Ay + b))
≤ λg(Ax + b) + (1 − λ)g(Ay + b) = λf(x) + (1 − λ)f(y) □
Operations that Preserve Convexity (contd.)
• Maximum of convex functions
f = max{f_1, . . . , f_q}
Proof (for q = 2; easily generalised to any q). Let x, y ∈ IR^d,
λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = max{f_1(λx + (1 − λ)y), f_2(λx + (1 − λ)y)}
≤ max{λf_1(x) + (1 − λ)f_1(y), λf_2(x) + (1 − λ)f_2(y)}
≤ max{λf_1(x), λf_2(x)} + max{(1 − λ)f_1(y), (1 − λ)f_2(y)}
= λf(x) + (1 − λ)f(y) □
• Extends also to infinite sets of convex functions
Proving Convexity
• Thus, to prove convexity of a function f , there are several approaches
– From the definition of convexity
– Compute the Hessian of f and show that it is psd.
– Decompose f as a nonnegative weighted sum of convex functions
– Decompose f as the composition of a convex and an affine function
– Decompose f as a maximum of convex functions
Examples
• Show that the function
f(w) = w⊤w = ‖w‖²
is strictly convex
Proof. The Hessian of f equals ∇²f(w) = 2I_d, which is positive
definite. □
Examples (contd.)
• Show that the quadratic function
f(w) = w⊤Aw + a⊤w + b
where A ∈ S^d_+, a ∈ IR^d, b ∈ IR, is convex
Proof. Write A = R⊤R for some matrix R ∈ IR^{d×d}. Then
w⊤Aw = (Rw)⊤(Rw), which is a composition of the convex function
w ↦ w⊤w and a linear map. The term w ↦ a⊤w + b is affine and
hence convex. Thus f is convex, as the sum of convex functions. □
Alternatively, we may compute the Hessian, which equals 2A and is psd.
Proving Strict Convexity
• To prove strict convexity of a function f
– Use the definition (with strict inequality, x ≠ y and λ ∈ (0, 1))
– Compute the Hessian of f and show that it is positive definite
– Decompose f as a sum of a convex and a strictly convex function
(easy to prove this property)
Note: When does the convex-affine composition operation apply?
• Example: the quadratic function f (w) = w⊤Aw + a⊤w + b is strictly
convex if and only if A ≻ 0
(since the Hessian equals 2A)
Convex Optimization
• The problem
min_{w∈IR^d}  f(w)
subject to  f_i(w) ≤ 0,   i = 1, . . . , M
            a_j⊤w = b_j,   j = 1, . . . , P      (1)
where f, f1, . . . , fM are convex functions, is called a convex program or
convex optimization problem
• The function f whose value we wish to minimise is called the objective
function
Remarks
• The set of points w satisfying the constraints f_i(w) ≤ 0, a_j⊤w = b_j is
called the feasible set
• The feasible set is convex: for any feasible points x, y,
f_i(λx + (1 − λ)y) ≤ λf_i(x) + (1 − λ)f_i(y) ≤ 0 and
a_j⊤(λx + (1 − λ)y) = λb_j + (1 − λ)b_j = b_j
• In general, if we minimize a convex objective function over a convex set,
the problem can be rewritten in form (1) (in principle at least;
sometimes not practically possible)
• Many problems of interest can be rewritten in the form (1); they do not
necessarily appear in that form however
Remarks (contd.)
• The minimum in (1) does not always exist! (the infimum may not be
attained, or may be −∞)
(Figure: the exponential function, whose infimum 0 is not attained.)
• The set of solutions (minimisers) of problem (1) is convex (easy to
show)
• In particular, if the function f is strictly convex, then there is a unique
minimiser (if any exists)
Remarks (contd.)
• There are no local minima outside the set of minimisers
• This is important because it implies that algorithms will not get stuck
away from the solution(s)
• Thus, the great appeal of convex programs is that they can be solved!
(many of them in polynomial time)
Examples
min_{w∈IR^d}  w⊤Aw + a⊤w
subject to  w⊤Bw + b⊤w + c ≤ 0
            d⊤w = e
where A, B ⪰ 0

min_{w∈IR^d}  a⊤w
subject to  b_1⊤w ≤ c_1
            b_2⊤w ≤ c_2
            d_1⊤w = e_1
Regularization
min_{w∈IR^d}  ∑_{i=1}^m E(w⊤x_i, y_i) + γ‖w‖²      (R)
• Assume that E(·, y) is a convex function for every y ∈ IR
• Then, problem (R) is a convex program; this program is unconstrained
• Indeed, the objective function is convex, as a sum of convex functions:
E(w⊤x_i, y_i) is convex as a convex-affine composition and ‖w‖² is also
convex, as we have already seen
• It can be shown that the minimum exists (under mild assumptions on
E)
Regularization (contd.)
• Example 1: ridge regression
min_{w∈IR^d}  ∑_{i=1}^m (y_i − w⊤x_i)² + γ‖w‖²
• Convex program, since the function z ↦ (z − y)² is convex for every
y ∈ IR
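• Since the problem is an unconstrained convex quadratic, setting the gradient to zero gives the linear system (X⊤X + γI)w = X⊤y; a minimal numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # rows are the inputs x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
gamma = 0.1

# Gradient of ∑(y_i − w⊤x_i)² + γ‖w‖² vanishes at (X⊤X + γI)w = X⊤y
w = np.linalg.solve(X.T @ X + gamma * np.eye(3), X.T @ y)
print(w)
```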
Regularization (contd.)
• Example 2: we have seen (in Lecture 1) that SVM is equivalent to the
regularization problem
min_{w∈IR^d}  ∑_{i=1}^m max{1 − y_i(w⊤x_i), 0} + γ‖w‖²
with γ = 1/(2C)
• This is a convex program, since the function z ↦ max{1 − yz, 0} (the
hinge loss) is convex for every y ∈ {−1, 1}; indeed, it is a maximum of
convex (in particular, affine) functions
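• A minimal sketch of this regularization problem with a generic convex solver (cvxpy; the data are synthetic and the variable names are illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=40))   # labels in {−1, 1}
gamma = 0.5

w = cp.Variable(2)
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w)))  # ∑ max{1 − y_i w⊤x_i, 0}
prob = cp.Problem(cp.Minimize(hinge + gamma * cp.sum_squares(w)))
prob.solve()
print(w.value)
```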
Regularization (contd.)
• How about the SVM primal
min_{w∈IR^d, ξ∈IR^m}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
subject to  y_i(w⊤x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  for i = 1, . . . , m
• It is also easy to see that this is a convex program, but now the
variables include the ξ_i as well as w
• The objective function is convex (quadratic in w, linear in ξ_i); the
functions 1 − ξ_i − y_i(w⊤x_i) and −ξ_i in the inequality constraints are
also convex
Regularization (contd.)
• Similarly, the SVM and ridge regression dual problems can be seen to be
convex problems
• In general, the dual of regularization
min_{c∈IR^m}  ∑_{i=1}^m E(c⊤g_i, y_i) + γ c⊤Gc      (C)
is a convex problem (assuming as before that the loss function is
convex)
• Indeed, the quadratic form c⊤Gc is a convex function of c since the
Gram matrix G is positive semidefinite
Regularization (contd.)
min_{w∈IR^d}  ∑_{i=1}^m E(w⊤x_i, y_i) + γ‖w‖²      (R)

min_{c∈IR^m}  ∑_{i=1}^m E(c⊤g_i, y_i) + γ c⊤Gc      (C)
• Problem (R) has a unique solution; indeed, the term ‖w‖² is strictly
convex, hence the objective function is also strictly convex
• However, problem (C) has a unique solution only if G ≻ 0, i.e. only if
the feature vectors φ(x_i) are linearly independent;
otherwise, there are infinitely many optimal ĉ, but the corresponding
ŵ = ∑_{i=1}^m ĉ_i φ(x_i) is unique
Convex Programs with Linear Equality Constraints
• The following special type of convex program can be solved using
Lagrange multipliers
min_{w∈IR^d}  f(w)
subject to  a_j⊤w = b_j,   j = 1, . . . , P      (2)
where f is a convex and differentiable function
• Set the gradient of the Lagrangian to zero: ∇f(w) = ∑_{j=1}^P c_j a_j, for
some c_j ∈ IR; together with the equality constraints, the set of solutions
of this equation is the same as the set of minimisers of (2) (by a
theorem); a dual problem can also be obtained
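• For a quadratic f, stationarity plus feasibility is one linear (KKT) system; a minimal sketch assuming f(w) = ½w⊤Qw − p⊤w with Q ≻ 0 (all data hypothetical):

```python
import numpy as np

# min ½ w⊤Qw − p⊤w  subject to  Aw = b   (illustrative data, Q ≻ 0)
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
p = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])        # single constraint: w_1 + w_2 = 1
b = np.array([1.0])

# Stationarity ∇f(w) = A⊤c and feasibility Aw = b as one linear system:
# [Q  −A⊤] [w]   [p]
# [A   0 ] [c] = [b]
KKT = np.block([[Q, -A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(KKT, np.concatenate([p, b]))
w, c = sol[:2], sol[2:]
print(w, c)                        # minimiser and Lagrange multiplier
```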
Important Convex Optimization Problems
• Linear Programming
• Quadratic Programming
• Semidefinite Programming
• Dedicated off-the-shelf algorithms exist for each of the above categories
• In machine learning, algorithms have been developed for special
subtypes of such problems
Linear Programming (LP)
min_{w∈IR^d}  c⊤w
subject to  d_i⊤w ≤ e_i,   i = 1, . . . , M
            a_j⊤w = b_j,   j = 1, . . . , P      (3)
• The feasible set is a polyhedron (bounded or not)
• Problem (3) may have one, none, or infinitely many solutions
• Interesting fact: the dual problem is also a linear program
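• A minimal sketch of an LP solved with scipy's off-the-shelf solver (the instance is arbitrary):

```python
import numpy as np
from scipy.optimize import linprog

# min c⊤w  s.t.  Dw ≤ e, w ≥ 0   (a small arbitrary instance)
c = np.array([-1.0, -2.0])
D = np.array([[1.0, 1.0], [1.0, 0.0]])
e = np.array([4.0, 3.0])

res = linprog(c, A_ub=D, b_ub=e, bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # an optimal vertex and the optimal value
```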
Linear Programming (contd.)
• It can be shown that the solution (if unique) will be one of the vertices
• The simplex algorithm is one of the oldest optimization algorithms
(Dantzig in the 40s)
Linear Programming (contd.)
• Intuition of simplex: find a vertex to start from; from each vertex, move
to a neighbour so that the objective function decreases; terminate if
there is no such neighbour
• Time complexity: very good in almost all cases, but very bad
(exponential) in the worst case
• In practice, very fast for typical problems and can be applied to large
data sets
• Methods developed in the 80s (interior-point methods) have been
applied to linear programming and are of polynomial-time complexity
Quadratic Programming (QP)
min_{w∈IR^d}  w⊤Aw + c⊤w
subject to  d_i⊤w ≤ e_i,   i = 1, . . . , M
            a_j⊤w = b_j,   j = 1, . . . , P      (4)
where A ⪰ 0
• If A ≻ 0, the solution (if any) is unique (due to strict convexity)
• The dual is also a quadratic program
Quadratic Programming (contd.)
• The idea behind simplex does not apply here; in fact, the minimiser
could be anywhere in the feasible set (on the boundary or in the interior)
• The difficulty in solving QP is due to the fact that the solution may lie
on the boundary of the feasible set
Interior-Point Methods
• Idea: change the objective function by adding to it a barrier function
• The barrier depends on the constraints and is controlled by a
parameter t
• Unconstrained minimisation of the barrier function gives a solution in
the interior of the feasible set
• Changing t appropriately, the algorithm converges to the solution of (4)
in polynomial time
• These methods can handle problems of reasonably large size
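• A toy illustration of the barrier idea (not a real interior-point method: there is no Newton centering step, and all data are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

# min w⊤Aw + c⊤w  s.t.  Dw ≤ e, via the log-barrier −∑ log(e_i − d_i⊤w)
A = np.array([[1.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -4.0])
D = np.array([[1.0, 1.0]])
e = np.array([1.0])

def barrier_obj(w, t):
    slack = e - D @ w
    if np.any(slack <= 0):
        return np.inf                    # outside the interior
    return t * (w @ A @ w + c @ w) - np.log(slack).sum()

w = np.array([0.0, 0.0])                 # a strictly feasible start
for t in [1, 10, 100, 1000]:             # increase t, re-centre each time
    w = minimize(barrier_obj, w, args=(t,), method="Nelder-Mead").x
print(w)                                 # approaches the constrained optimum
```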
Ridge Regression as QP
min_{w∈IR^d}  ∑_{i=1}^m (y_i − w⊤x_i)² + γ‖w‖²
• Ridge regression is an unconstrained QP
• Just need to solve a linear system using standard methods
SVM as QP
min_{w∈IR^d, ξ∈IR^m}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(w⊤x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  for i = 1, . . . , m

max_{α∈IR^m}  −(1/2) α⊤Aα + ∑_{i=1}^m α_i
s.t.  0 ≤ α_i ≤ C,  for i = 1, . . . , m
• SVM is a QP with inequality constraints
• The SVM dual is a QP with “box” constraints
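• A minimal sketch of the dual QP with a generic solver (cvxpy), with A_ij = y_i y_j x_i⊤x_j built from synthetic data:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))   # labels in {−1, 1}
C = 1.0

Yx = y[:, None] * X
A = Yx @ Yx.T + 1e-8 * np.eye(20)   # A_ij = y_i y_j x_i⊤x_j, plus a tiny
                                    # jitter so the form is numerically psd

alpha = cp.Variable(20)
objective = cp.Maximize(-0.5 * cp.quad_form(alpha, A) + cp.sum(alpha))
problem = cp.Problem(objective, [alpha >= 0, alpha <= C])  # box constraints
problem.solve()

w = (alpha.value * y) @ X           # recover the primal w = ∑ α_i y_i x_i
print(w)
```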
Algorithms for SVM
• One approach to solve SVMs is with interior-point methods
• For large datasets (say m > 10³) it is practically impossible to solve the
dual problem with such methods (the matrix A is dense!)
• A typical approach is to iteratively optimize with respect to an 'active
set' A of dual variables, keeping the rest fixed. Set α = 0, choose q ≤ m
and an active set A of q variables. Repeat the following steps until
convergence:
– Solve the problem with respect to the variables in A
– Remove from A one variable which satisfies the KKT conditions and
add one variable, if any, which violates the KKT conditions. If no
such variable exists, stop
QCQP
min_{w∈IR^d}  w⊤Aw + c⊤w
subject to  w⊤Bw + d_i⊤w ≤ e_i,   i = 1, . . . , M
            a_j⊤w = b_j,   j = 1, . . . , P
where both A and B are psd.
• It is called a quadratically constrained quadratic program (QCQP)
• Larger family; contains the family of QP
• The dual problem is not a QCQP, in general
QCQP (contd.)
• The feasible set is the intersection of ellipsoids and/or a polyhedron
• It is faster to solve a QP with a dedicated method than to use a QCQP
solver
SDP
min_{w∈IR^d}  c⊤w
subject to  w_1F_1 + · · · + w_nF_n + G ⪯ 0
            a_j⊤w = b_j,   j = 1, . . . , P
• There is a linear matrix inequality (LMI) constraint
• Multiple LMIs reduce to an equivalent problem with just one LMI
• The dual problem of an SDP is also an SDP
• LP ⊆ QP ⊆ QCQP ⊆ · · · ⊆ SDP (LPs, QPs, QCQPs can be rewritten
as SDPs)
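• A minimal cvxpy sketch of an SDP with a single LMI constraint (the matrices F_1, F_2, G are arbitrary choices for illustration):

```python
import cvxpy as cp
import numpy as np

c = np.array([-1.0, -1.0])
F1 = np.array([[1.0, 0.0], [0.0, 0.0]])
F2 = np.array([[0.0, 0.0], [0.0, 1.0]])
G = np.array([[-1.0, 0.3], [0.3, -1.0]])

w = cp.Variable(2)
lmi = w[0] * F1 + w[1] * F2 + G << 0    # the LMI constraint (⪯ 0)
problem = cp.Problem(cp.Minimize(c @ w), [lmi])
problem.solve()
print(w.value)                          # ≈ (0.7, 0.7) for this data
```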
Bibliography
Lectures available at:
http://www.cs.ucl.ac.uk/staff/M.Pontil/courses/index-ATML10.htm
See also Boyd and Vandenberghe, Convex Optimization, 2004,
http://www.stanford.edu/boyd/cvxbook/
Secs. 2.1.4-2.2.5, 3.1.1, 3.1.5, 3.1.8, 3.2.1-3.2.3, 4.1.1, 4.2.1, 4.2.2, 4.3, 4.4, 4.6.2