Introduction to
Machine Learning
236756
Tutorial 3-4:
Intro to Convex Optimization
Convex Sets
• A subset $C \subseteq \mathbb{R}^d$ is convex if $\forall x, y \in C,\ \theta \in [0,1]$:
  $\theta x + (1-\theta)y \in C$
Convex Functions
• A function $f: C \mapsto \mathbb{R}$ is convex if for all $x, y \in C$ and $\theta \in [0,1]$:
  $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta) f(y)$
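A quick numerical sanity check of this inequality: the sketch below samples random pairs of points and values of $\theta$ and tests the chord inequality. The helper name and the sample functions are illustrative only, and passing the check is of course not a proof of convexity.

```python
import numpy as np

def violates_convexity(f, xs, ys, thetas):
    """Return True if some sampled chord lies strictly below f
    (a numerical sanity check only, not a proof of convexity)."""
    for x, y in zip(xs, ys):
        for t in thetas:
            lhs = f(t * x + (1 - t) * y)        # f(theta*x + (1-theta)*y)
            rhs = t * f(x) + (1 - t) * f(y)     # value on the chord
            if lhs > rhs + 1e-9:                # small tolerance for float error
                return True
    return False

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(2, 100, 3))           # random pairs of points in R^3
thetas = np.linspace(0, 1, 11)
print(violates_convexity(lambda v: np.sum(v ** 2), xs, ys, thetas))   # False: ||v||^2 is convex
print(violates_convexity(lambda v: -np.sum(v ** 2), xs, ys, thetas))  # True: -||v||^2 is not
```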
Convex Functions
• A function $f$ is convex iff its epigraph is a convex set
Operations that preserve convexity
• Nonnegative weighted sum
• Affine composition: $f(Ax + b)$
• Pointwise supremum (equivalent to intersecting epigraphs)
  • Example: the sum of the $r$ largest entries of $x$
• Composition: $f(x) = h(g(x))$ is convex if $h$ is convex and non-decreasing and $g$ is convex,
  or if $h$ is convex and non-increasing and $g$ is concave
Some more can be found in the literature.
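For instance, the composition rule says that $e^{x^2}$ is convex, since $h(t) = e^t$ is convex and non-decreasing and $g(x) = x^2$ is convex. A small numerical sanity check (not a proof; the grid of points below is illustrative):

```python
import numpy as np

# Composition rule sketch: h(t) = exp(t) is convex and non-decreasing,
# g(x) = x**2 is convex, so f(x) = exp(x**2) should be convex.
# Numerical check of the chord inequality on random 1-D pairs.
f = lambda x: np.exp(x ** 2)
x, y = np.random.default_rng(1).normal(size=(2, 1000))
for t in np.linspace(0, 1, 11):
    assert np.all(f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9)
print("no chord violations found")
```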
Some examples without proofs
• In $\mathbb{R}$:
  • Affine functions (the only functions that are both convex and concave)
  • Negative log: $-\log x$
• In $\mathbb{R}^n$ and $\mathbb{R}^{m \times n}$:
  • Norms
  • Trace
  • Maximum eigenvalue of a symmetric matrix
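A numerical sanity check (not a proof) of the last example, testing the chord inequality for the maximum eigenvalue on random symmetric matrices; the matrix size and number of trials are illustrative:

```python
import numpy as np

# Check lam_max(theta*A + (1-theta)*B) <= theta*lam_max(A) + (1-theta)*lam_max(B)
# on random symmetric matrices (sanity check only).
rng = np.random.default_rng(0)
lam_max = lambda M: np.linalg.eigvalsh(M)[-1]   # eigvalsh returns eigenvalues in ascending order
for _ in range(100):
    A = rng.normal(size=(5, 5)); A = (A + A.T) / 2
    B = rng.normal(size=(5, 5)); B = (B + B.T) / 2
    for t in np.linspace(0, 1, 11):
        assert lam_max(t * A + (1 - t) * B) <= t * lam_max(A) + (1 - t) * lam_max(B) + 1e-9
print("no violations found")
```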
Convexity of Differentiable Functions
• If $f$ is differentiable, it is convex iff it lies above all of its tangents: $f(x) \ge f(w) + \langle \nabla f(w), x - w \rangle$ for all $x, w \in C$
[figure: a convex function with its tangent at $w$]
• Can we generalize this property to the non-differentiable case?
Gradients and Sub-Gradients
• For any $w \in C$ we can construct a tangent to $f$ at $w$ that lies below $f$ everywhere:
  $\exists v \in \mathbb{R}^d\ \forall x \in C: \quad f_v(x) \triangleq f(w) + \langle v, x - w \rangle \le f(x)$
• Such a vector $v$ is called a sub-gradient
• The set of sub-gradients at $w$ is the "differential set" $\partial f(w)$
[figure: $f(x)$ with a sub-gradient tangent at $w$]
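A concrete 1-D illustration, assuming the standard example $f(x) = |x|$: at $w = 0$ the function is not differentiable, and every $v \in [-1, 1]$ is a sub-gradient. The check below is only a numerical consistency test on a grid:

```python
import numpy as np

# Sub-gradient sanity check for f(x) = |x| at w = 0:
# every v in [-1, 1] satisfies f(0) + v*(x - 0) <= f(x) for all x,
# so the differential set at 0 is the whole interval [-1, 1].
xs = np.linspace(-5, 5, 1001)
for v in np.linspace(-1, 1, 21):
    assert np.all(v * xs <= np.abs(xs) + 1e-12)
# a slope outside [-1, 1] violates the inequality somewhere:
assert np.any(1.5 * xs > np.abs(xs))
print("consistent with ∂f(0) = [-1, 1] for f(x) = |x|")
```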
Proving convexity
• Simple case: $C \subseteq \mathbb{R}$, $f$ twice differentiable. The following are equivalent:
  • $f$ convex
  • $f''(w) \ge 0$ for all $w \in C$
  • $f'(w)$ monotonically non-decreasing in $w$
• If $f$ is twice differentiable on a convex $C \subseteq \mathbb{R}^d$, the following are equivalent:
  • $f$ convex
  • The Hessian $\nabla^2 f(w)$ is PSD for all $w \in C$ (numerical sketch after this slide)
• If $f$ is differentiable on $C$ and convex, then the sub-gradient at each $w$ is unique:
  $\partial f(w) = \{\nabla f(w)\} = \left\{ \left( \dfrac{\partial f}{\partial w_1}, \dots, \dfrac{\partial f}{\partial w_d} \right) \right\}$
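A minimal sketch of the Hessian criterion, assuming the illustrative example $f(w) = \|Aw - b\|^2$: its Hessian is $2A^\top A$ for every $w$ (independent of $b$), and checking its eigenvalues confirms it is PSD, hence $f$ is convex.

```python
import numpy as np

# Hessian criterion sketch for f(w) = ||Aw - b||^2: the Hessian 2*A^T A
# is constant in w, so it suffices to check its eigenvalues once.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
H = 2 * A.T @ A                          # Hessian of f
eigs = np.linalg.eigvalsh(H)             # eigenvalues of the symmetric matrix H
print(eigs.min() >= -1e-10)              # True: H is PSD -> f is convex
```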
Example 1
• $f(x) = -\log \dfrac{e^{x/2}}{e^{x/2} + e^{-x/2}} = \log(1 + e^{-x})$
• $f'(x) = \dfrac{1}{1 + e^{-x}} \cdot (-e^{-x}) = \dfrac{-1}{1 + e^{x}}$
• $f'(x)$ monotonically increasing
⟹ $f$ convex
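A numerical cross-check of this example: the sketch below compares the derivative formula against central finite differences and confirms that $f'$ is non-decreasing on a grid (a sanity check, not a proof; the grid and tolerances are illustrative).

```python
import numpy as np

# Example 1 check: f(x) = log(1 + exp(-x)), f'(x) should equal -1 / (1 + exp(x))
# and should be monotonically non-decreasing.
f = lambda x: np.log1p(np.exp(-x))
fprime = lambda x: -1.0 / (1.0 + np.exp(x))
xs = np.linspace(-10, 10, 2001)
h = 1e-6
fd = (f(xs + h) - f(xs - h)) / (2 * h)          # central finite differences
print(np.allclose(fd, fprime(xs), atol=1e-6))   # derivative formula checks out
print(np.all(np.diff(fprime(xs)) >= 0))         # f' non-decreasing -> f convex
```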
Example 2
• Suppose $g: \mathbb{R}^d \mapsto \mathbb{R}$ is given as
  $g(x) = f(\langle w, x \rangle + b)$
  where $f$ is convex and $w \in \mathbb{R}^d$, $b \in \mathbb{R}$ are parameters
⟹ $g$ convex
• Example:
  $g(x) = \log(1 + e^{-\langle w, x \rangle - b})$
Example 3
• For $y \in \{\pm 1\}$: is $x \mapsto \ell_{\mathrm{hinge}}(y, x) = \max\{0,\ 1 - y \cdot x\}$ convex?
• Yes, it is convex: it is the pointwise maximum of two affine functions of $x$.
• This also holds in $\mathbb{R}^d$ if we take $f(w) = \ell_{\mathrm{hinge}}(y, \langle w, x \rangle)$
Which of the following losses is convex?
Which is convex in the prediction $z \in \mathbb{R}$:
• For $y \in \{\pm 1\}$:
  • $\mathrm{loss}(z; y) = \mathbb{1}[y \ne z]$
  • $\mathrm{loss}(z; y) = \begin{cases} 1 & \text{if } yz < 0 \\ 1 - yz & \text{if } 0 \le yz \le 1 \\ 0 & \text{if } yz > 1 \end{cases}$
  • $\mathrm{loss}(z; y) = [1 - yz]_+^2$
  • $\mathrm{loss}(z; y) = [1 - yz]_+^{1/2}$
Which of the following losses is convex?
Which is convex in the prediction $z \in \mathbb{R}$:
• For $y \in \{\pm 1\}$:
  • No: $\mathrm{loss}(z; y) = \mathbb{1}[y \ne z]$
  • No: $\mathrm{loss}(z; y) = \begin{cases} 1 & \text{if } yz < 0 \\ 1 - yz & \text{if } 0 \le yz \le 1 \\ 0 & \text{if } yz > 1 \end{cases}$
  • Yes: $\mathrm{loss}(z; y) = [1 - yz]_+^2$
  • No: $\mathrm{loss}(z; y) = [1 - yz]_+^{1/2}$
Minimizing Convex Functions
• Any local minimum is also a global minimum
Gradient Descent
Input: convex $f: C \mapsto \mathbb{R}$ (via oracle), small step size $\eta$
Set $t = 1$, $w^{(1)} \in C$ arbitrarily
While ¬(stopping condition):
  $w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})$
  $t = t + 1$
Return $\bar{w} = \dfrac{1}{t} \sum_{t'=1}^{t} w^{(t')}$

Alternative options:
• Return $w^{(t)}$
• Return $\operatorname{argmin}_{t'} f(w^{(t')})$
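A minimal Python sketch of this pseudocode with a fixed step size, returning the averaged iterate; the quadratic objective, target vector, and constants below are illustrative only.

```python
import numpy as np

def gradient_descent(grad, w1, eta, num_steps):
    """Sketch of the GD pseudocode above: fixed step size eta,
    returns the average of the iterates (one of the output options)."""
    w = np.array(w1, dtype=float)
    iterates = [w.copy()]
    for _ in range(num_steps):
        w = w - eta * grad(w)            # w^(t+1) = w^(t) - eta * grad f(w^(t))
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)     # w_bar = (1/t) * sum_t' w^(t')

# Usage on f(w) = ||w - a||^2, whose gradient is 2*(w - a); 'a' is an illustrative target.
a = np.array([1.0, -2.0, 3.0])
w_bar = gradient_descent(lambda w: 2 * (w - a), w1=np.zeros(3), eta=0.1, num_steps=500)
print(w_bar)   # close to a
```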
Step Size $\eta$
• Too big can be bad
• Too small can be bad
• Tradeoff between the desire to reduce $f$ and the desire not to go too far…
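A tiny illustration of this trade-off, assuming the toy objective $f(w) = w^2$ (gradient $2w$) started from $w = 10$; the specific step sizes are illustrative.

```python
# Step-size trade-off sketch on f(w) = w^2, starting from w = 10.
# For this quadratic the update is w <- (1 - 2*eta) * w, so eta > 1 diverges.
def run_gd(eta, steps=50, w=10.0):
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(run_gd(eta=1.1))    # too big: |1 - 2*eta| > 1, the iterates blow up
print(run_gd(eta=1e-4))   # too small: barely moves away from 10
print(run_gd(eta=0.1))    # reasonable: close to the minimizer w = 0
```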
Controlling the Change: Lipschitz
• A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is $\rho$-Lipschitz if $\forall x, y$:
  $|f(x) - f(y)| \le \rho \|x - y\|_2$
  (The general definition of Lipschitz continuity is broader than this.)
• Lemma: For $f$ convex:
  $f$ is $\rho$-Lipschitz $\iff$ $\forall w,\ \forall v \in \partial f(w): \|v\| \le \rho$
• Proof:
  $\Leftarrow$: Fix $x, w$ and $v \in \partial f(w)$. By the sub-gradient inequality,
  $f(w) - f(x) \le \langle v, w - x \rangle \le \|v\| \cdot \|w - x\| \le \rho \|w - x\|$
  Repeating the argument with a sub-gradient at $x$ gives
  $f(x) - f(w) \le \dots \le \rho \|w - x\|$, so $|f(x) - f(w)| \le \rho \|x - w\|$
  $\Rightarrow$: In the book
Guarantee of GD
• Theorem:
  Let $f$ be convex and $\rho$-Lipschitz, and let
  $w^\star \in \operatorname{argmin}_{w:\ \|w\| \le B} f(w)$.
  Then the output $\bar{w}$ of GD (started from $w^{(1)} = 0$) after $T$ steps with $\eta = \sqrt{\dfrac{B^2}{\rho^2 T}}$ satisfies:
  $f(\bar{w}) \le f(w^\star) + \dfrac{B\rho}{\sqrt{T}}$
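An empirical look at this bound (a sketch, not a proof), assuming the illustrative objective $f(w) = \|w - a\|_2$, which is convex and $1$-Lipschitz ($\rho = 1$); with $\|a\| \le B$ its minimizer is $w^\star = a$ and $f(w^\star) = 0$. All constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
a /= np.linalg.norm(a)                                  # ||a|| = 1 <= B
B, rho, T = 3.0, 1.0, 10_000
eta = np.sqrt(B ** 2 / (rho ** 2 * T))                  # step size from the theorem

w = np.zeros(5)                                         # w^(1) = 0
iterates = [w.copy()]
for _ in range(T):
    d = w - a
    n = np.linalg.norm(d)
    g = d / n if n > 0 else np.zeros_like(w)            # a valid (sub-)gradient of ||w - a||
    w = w - eta * g
    iterates.append(w.copy())
w_bar = np.mean(iterates, axis=0)
# f(w_bar) - f(w*) = ||w_bar - a|| should be at most B*rho/sqrt(T)
print(np.linalg.norm(w_bar - a), "<=", B * rho / np.sqrt(T))
```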
GD on sub-gradients
• Gradient descent doesn't need differentiability
• It is enough that $f$ is convex and Lipschitz: use any sub-gradient in place of the gradient
• E.g. if $f(x) = \max\{g(x), h(x)\}$ then we can take
  $v \in \begin{cases} \partial g(x) & \text{if } g(x) \ge h(x) \\ \partial h(x) & \text{if } h(x) > g(x) \end{cases}$
• E.g. the hinge loss $\ell_{\mathrm{hinge}}(y, z) = \max\{0,\ 1 - y \cdot z\}$ for $y \in \{\pm 1\}$. We can take:
  $v = \begin{cases} 0 & \text{if } 0 \ge 1 - y \cdot z \\ -y & \text{if } 0 < 1 - y \cdot z \end{cases}$
• Holds in $\mathbb{R}^d$ if we take $f(w) = \ell_{\mathrm{hinge}}(y, \langle w, x \rangle)$ (see the sketch below)
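A minimal sketch of sub-gradient descent with this rule, applied to the average hinge loss of a linear classifier; the synthetic dataset, step size, and iteration count below are illustrative only.

```python
import numpy as np

# Sub-gradient descent on (1/n) * sum_i max{0, 1 - y_i * <w, x_i>},
# using the rule above: per example, the sub-gradient is 0 if y*<w,x> >= 1, else -y*x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]))                  # linearly separable labels

w = np.zeros(2)
eta = 0.1
for _ in range(1000):
    margins = y * (X @ w)
    g = -(y[:, None] * X)[margins < 1].sum(axis=0) / len(y)   # average sub-gradient
    w = w - eta * g
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```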