
Introduction to Machine Learning 236756
Tutorial 3-4: Intro to Convex Optimization
Convex Sets
β€’ A subset 𝐢 βŠ† ℝ^𝑑 is convex if βˆ€π‘₯, 𝑦 ∈ 𝐢, πœ† ∈ [0,1]:
πœ†π‘₯ + (1 βˆ’ πœ†)𝑦 ∈ 𝐢
Convex Functions
β€’ A function 𝑓: 𝐢 ↦ ℝ (with 𝐢 convex) is convex if for all π‘₯, 𝑦 ∈ 𝐢 and πœ† ∈ [0,1]:
𝑓(πœ†π‘₯ + (1 βˆ’ πœ†)𝑦) ≀ πœ†π‘“(π‘₯) + (1 βˆ’ πœ†)𝑓(𝑦)
Convex Functions
β€’ A function 𝑓 is convex iff its epigraph {(π‘₯, 𝑑) : π‘₯ ∈ 𝐢, 𝑑 β‰₯ 𝑓(π‘₯)} is a convex set
Operations that preserve convexity
β€’ Nonnegative weighted sum: 𝛼𝑓 + 𝛽𝑔 is convex for 𝑓, 𝑔 convex and 𝛼, 𝛽 β‰₯ 0
β€’ Affine composition: π‘₯ ↦ 𝑓(𝐴π‘₯ + 𝑏) is convex for 𝑓 convex
β€’ Pointwise sup of convex functions: equivalent to intersecting epigraphs (see the numerical sketch after this list)
β€’ Example: the sum of the π‘Ÿ largest components of π‘₯ (a pointwise max over all choices of π‘Ÿ coordinates)
β€’ Composition: 𝑓 = β„Ž ∘ 𝑔 is convex if β„Ž is convex and non-decreasing and 𝑔 is convex,
or if β„Ž is convex and non-increasing and 𝑔 is concave
β€’ Some more can be found in the literature
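The pointwise-max rule can be illustrated numerically: a maximum of randomly drawn affine functions should never violate the convexity inequality. A minimal sketch, assuming NumPy (the matrix A and vector b are arbitrary illustrative data):

import numpy as np

rng = np.random.default_rng(1)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
f = lambda x: np.max(A @ x + b)          # pointwise max of 5 affine functions on R^3

for _ in range(1000):
    x, y = rng.normal(size=(2, 3))
    lam = rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9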
Some examples without proofs
β€’ In ℝ:
β€’ Affine functions (the only functions that are both convex and concave)
β€’ Negative log: 𝑓(π‘₯) = βˆ’log(π‘₯) on (0, ∞)
β€’ In ℝ^𝑛 and ℝ^(π‘šΓ—π‘›):
β€’ Norms
β€’ Trace (linear, hence convex)
β€’ Maximum eigenvalue of a symmetric matrix (see the sketch after this list)
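For the last example, a small numerical illustration (a sketch assuming NumPy; the maximum eigenvalue of a symmetric matrix is computed with eigvalsh, and the random matrices are illustrative):

import numpy as np

rng = np.random.default_rng(2)
lam_max = lambda S: np.linalg.eigvalsh(S)[-1]    # largest eigenvalue of a symmetric matrix

def rand_sym(n=4):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2.0

for _ in range(500):
    Xs, Ys = rand_sym(), rand_sym()
    t = rng.uniform()
    # lambda_max(t*X + (1-t)*Y) <= t*lambda_max(X) + (1-t)*lambda_max(Y)
    assert lam_max(t * Xs + (1 - t) * Ys) <= t * lam_max(Xs) + (1 - t) * lam_max(Ys) + 1e-9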
Convexity of Differentiable Functions
β€’ If 𝑓 is differentiable and convex, the tangent at any 𝑀 lies below 𝑓: 𝑓(π‘₯) β‰₯ 𝑓(𝑀) + βŸ¨π›»π‘“(𝑀), π‘₯ βˆ’ π‘€βŸ© for all π‘₯ [figure: tangent line at 𝑀 lying below the graph of 𝑓]
β€’ Can we generalize that property to the non-differentiable case?
Gradients and Sub-Gradients
β€’ For any 𝑀 ∈ 𝐢 we can construct a tangent to 𝑓 that
lies below 𝑓 everywhere:
βˆƒπ‘£ ∈ ℝ^𝑑 βˆ€π‘₯ ∈ 𝐢: 𝑔_𝑣(π‘₯) ≔ 𝑓(𝑀) + βŸ¨π‘£, π‘₯ βˆ’ π‘€βŸ© ≀ 𝑓(π‘₯)
β€’ Such a vector 𝑣 is called a sub-gradient of 𝑓 at 𝑀
β€’ The set of all sub-gradients at 𝑀 is called the β€œdifferential set” πœ•π‘“(𝑀)
𝑓π‘₯)
𝑀
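A concrete illustration (not from the slides): for 𝑓(π‘₯) = |π‘₯| at 𝑀 = 0, every 𝑣 ∈ [βˆ’1, 1] is a sub-gradient, and the corresponding tangent 𝑔_𝑣 indeed stays below 𝑓. A minimal NumPy sketch:

import numpy as np

f = lambda x: np.abs(x)
w, xs = 0.0, np.linspace(-5.0, 5.0, 1001)
for v in np.linspace(-1.0, 1.0, 21):      # every v in [-1, 1] is a sub-gradient of |.| at w = 0
    g = f(w) + v * (xs - w)               # the tangent g_v(x) = f(w) + <v, x - w>
    assert np.all(g <= f(xs) + 1e-12)     # g_v lies below f everywhere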
Proving convexity
β€’ Simple case: 𝐢 βŠ† ℝ, 𝑓 twice differentiable on 𝐢. The following are equivalent:
β€’ 𝑓 convex
β€’ 𝑓′′(𝑀) β‰₯ 0 for all 𝑀 ∈ 𝐢
β€’ 𝑓′(𝑀) monotonically non-decreasing in 𝑀
β€’ If 𝑓 twice differentiable on convex 𝐢 βŠ† ℝ^𝑑, the following are equivalent:
β€’ 𝑓 convex
β€’ Hessian 𝐻_𝑓(𝑀) PSD for all 𝑀 ∈ 𝐢 (a numerical sketch follows below)
β€’ If 𝑓 differentiable on 𝐢 and convex then the sub-gradient is unique:
πœ•π‘“(𝑀) = {𝛻𝑓(𝑀)}, where 𝛻𝑓(𝑀) = (βˆ‚π‘“/βˆ‚π‘€_1, …, βˆ‚π‘“/βˆ‚π‘€_𝑑)
Example 1
β€’ 𝑓 π‘₯ = βˆ’π‘™π‘œπ‘”
β€’
𝑓′
π‘₯ =
𝑒 π‘₯/2
π‘₯
π‘₯
βˆ’
𝑒 2 +𝑒 2
= log(1 + 𝑒 βˆ’π‘₯ )
1
βˆ’π‘₯ )
(βˆ’π‘’
1+𝑒 βˆ’π‘₯
=
βˆ’1
1+𝑒 π‘₯
β€’ 𝑓 β€² (π‘₯) monotonically increasing
οƒž 𝑓 convex
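A numerical sanity check of the monotonicity claim (a sketch, assuming NumPy; the grid is arbitrary, and a tiny tolerance absorbs floating-point rounding):

import numpy as np

x = np.linspace(-20.0, 20.0, 4001)
fprime = -1.0 / (1.0 + np.exp(x))            # f'(x) for f(x) = log(1 + e^(-x))
print(np.all(np.diff(fprime) >= -1e-12))     # True: f' is monotonically non-decreasing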
Example 2
β€’ Suppose 𝑓: ℝ^𝑑 ↦ ℝ is given as
𝑓(π‘₯) = 𝑔(βŸ¨π‘€, π‘₯⟩ + 𝑏)
where 𝑔 is convex and 𝑀 ∈ ℝ^𝑑, 𝑏 ∈ ℝ are parameters
β‡’ 𝑓 convex (affine composition)
β€’ Example:
𝑓(π‘₯) = log(1 + 𝑒^(βˆ’βŸ¨π‘€,π‘₯⟩ βˆ’ 𝑏))
Example 3
β€’ For 𝑦 ∈ {Β±1}, is π‘₯ ↦ β„“_hinge(𝑦, π‘₯) = max{0, 1 βˆ’ 𝑦 β‹… π‘₯} convex?
β€’ Yes, it's convex: it is the pointwise max of the two affine functions π‘₯ ↦ 0 and π‘₯ ↦ 1 βˆ’ 𝑦 β‹… π‘₯
β€’ Holds in ℝ^𝑑 if we take 𝑓(𝑀) = β„“_hinge(𝑦, βŸ¨π‘€, π‘₯⟩)
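A minimal numerical check of the last bullet (assuming NumPy; the fixed π‘₯ and the random 𝑀's are illustrative): 𝑓(𝑀) = β„“_hinge(𝑦, βŸ¨π‘€, π‘₯⟩) satisfies the convexity inequality in 𝑀:

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=4)                               # a fixed point in R^d
f = lambda y, w: max(0.0, 1.0 - y * (w @ x))         # f(w) = hinge(y, <w, x>)

for _ in range(1000):
    y = rng.choice([-1.0, 1.0])
    w1, w2 = rng.normal(size=(2, 4))
    lam = rng.uniform()
    lhs = f(y, lam * w1 + (1 - lam) * w2)
    assert lhs <= lam * f(y, w1) + (1 - lam) * f(y, w2) + 1e-9   # convex in w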
Which of the following losses is convex?
β€’ Which is convex in the prediction 𝑧 ∈ ℝ, for 𝑦 ∈ {Β±1}:
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = 𝟙[𝑦 β‰  𝑧]
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = { 1 if 𝑦𝑧 < 0;  1 βˆ’ 𝑦𝑧 if 0 ≀ 𝑦𝑧 ≀ 1;  0 if 𝑦𝑧 > 1 }
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^2
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^(1/2)
Which of the following losses is convex?
β€’ Which is convex in the prediction 𝑧 ∈ ℝ, for 𝑦 ∈ {Β±1}:
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = 𝟙[𝑦 β‰  𝑧]
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = { 1 if 𝑦𝑧 < 0;  1 βˆ’ 𝑦𝑧 if 0 ≀ 𝑦𝑧 ≀ 1;  0 if 𝑦𝑧 > 1 }
β€’ Yes: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^2
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^(1/2)
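A numerical illustration of these answers (a sketch assuming NumPy, and interpreting the first loss as the sign-based 0-1 loss 𝟙[𝑦𝑧 < 0]); it searches random chords for violations of the convexity inequality, and only the squared hinge should come out clean:

import numpy as np

rng = np.random.default_rng(6)

def ramp(z, y):
    m = y * z
    return 1.0 if m < 0 else (1.0 - m if m <= 1 else 0.0)

losses = {
    "zero-one":      lambda z, y: float(y * z < 0),
    "ramp":          ramp,
    "squared hinge": lambda z, y: max(0.0, 1.0 - y * z) ** 2,
    "sqrt hinge":    lambda z, y: max(0.0, 1.0 - y * z) ** 0.5,
}

for name, loss in losses.items():
    violated = False
    for _ in range(20000):
        y = rng.choice([-1.0, 1.0])
        z1, z2 = 3.0 * rng.normal(size=2)
        lam = rng.uniform()
        lhs = loss(lam * z1 + (1 - lam) * z2, y)
        rhs = lam * loss(z1, y) + (1 - lam) * loss(z2, y)
        if lhs > rhs + 1e-9:
            violated = True
            break
    print(name, "violates convexity:", violated)   # only "squared hinge" should print False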
Minimizing Convex Functions
β€’ Any local minimum is also a global minimum
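A one-line justification, written out as a math sketch (using only the convexity inequality from the earlier slides):

% Why a local minimum of a convex f is global (sketch):
% suppose w is a local minimum but f(u) < f(w) for some u in C; then for every lambda in (0,1]
\[
  f\bigl((1-\lambda)\,w + \lambda\,u\bigr)
    \;\le\; (1-\lambda)\,f(w) + \lambda\,f(u)
    \;<\; f(w),
\]
% so there are points arbitrarily close to w (take lambda -> 0) with strictly smaller value,
% contradicting local minimality of w; hence no such u exists and w is a global minimum.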
Gradient Descent
Input: convex 𝑓: 𝐢 ↦ ℝ (via oracle), small step size πœ‚
Set 𝑑 = 1, 𝑀^(1) ∈ 𝐢 arbitrarily
While ¬(stopping condition):
  𝑀^(𝑑+1) = 𝑀^(𝑑) βˆ’ πœ‚ 𝛻𝑓(𝑀^(𝑑))
  𝑑 = 𝑑 + 1
Return 𝑀̄ = (1/𝑑) Ξ£_{𝑑′=1}^{𝑑} 𝑀^(𝑑′)
Alternative options:
β€’ Return 𝑀^(𝑑)
β€’ Return argmin_{𝑑′} 𝑓(𝑀^(𝑑′))
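A minimal NumPy sketch of this loop (the function gradient_descent, the fixed iteration budget T, and the quadratic example are illustrative choices, not part of the slides); it returns the averaged iterate and notes the alternatives in comments:

import numpy as np

def gradient_descent(grad_f, w0, eta, T):
    """T fixed-step gradient steps; return the averaged iterate."""
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(T):
        w = w - eta * grad_f(w)              # w^(t+1) = w^(t) - eta * grad f(w^(t))
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)         # alternatives: return w (last iterate)
                                             # or the iterate with smallest f value

# example: f(w) = ||w - c||^2, with gradient 2(w - c)
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0], eta=0.1, T=500))   # ~ c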
Step Size πœ‚
β€’ Too big can be bad (overshooting, even divergence)
β€’ Too small can be bad (very slow progress)
β€’ Tradeoff between the desire to reduce 𝑓 and the desire not
to go too far…
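A tiny illustration of the tradeoff on 𝑓(𝑀) = 𝑀², where the update is 𝑀 ← (1 βˆ’ 2πœ‚)𝑀 (a sketch; the particular step sizes are arbitrary):

def run_gd(eta, T=30, w0=5.0):
    """Gradient descent on f(w) = w^2 (so f'(w) = 2w) with a fixed step size."""
    w = w0
    for _ in range(T):
        w = w - eta * 2 * w
    return w

print(run_gd(eta=1.1))     # too big: |1 - 2*eta| > 1, the iterates blow up
print(run_gd(eta=1e-4))    # too small: after 30 steps w has barely moved from 5.0
print(run_gd(eta=0.3))     # reasonable: w is essentially at the minimizer 0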
Controlling the Change: Lipschitz
β€’ A function 𝑓: ℝ^𝑑 ↦ ℝ is 𝜌-Lipschitz if βˆ€π‘₯, 𝑦:
|𝑓(π‘₯) βˆ’ 𝑓(𝑦)| ≀ 𝜌 β€–π‘₯ βˆ’ 𝑦‖_2
β€’ Lemma: for 𝑓 convex:
𝑓 is 𝜌-Lipschitz ⟺ βˆ€π‘€, βˆ€π‘£ ∈ πœ•π‘“(𝑀): ‖𝑣‖ ≀ 𝜌
(The definition of Lipschitz is more general than this.)
β€’ Proof:
⟸ (bounded sub-gradients β‡’ Lipschitz): Fix π‘₯, 𝑀 and take 𝑣 ∈ πœ•π‘“(π‘₯). The sub-gradient inequality gives
𝑓(π‘₯) βˆ’ 𝑓(𝑀) ≀ βŸ¨π‘£, π‘₯ βˆ’ π‘€βŸ© ≀ ‖𝑣‖ β‹… β€–π‘₯ βˆ’ 𝑀‖ ≀ 𝜌 β€–π‘₯ βˆ’ 𝑀‖
and symmetrically, with a sub-gradient at 𝑀:
𝑓(𝑀) βˆ’ 𝑓(π‘₯) ≀ β‹― ≀ 𝜌 β€–π‘₯ βˆ’ 𝑀‖
β‡’: In the book
Guarantee of GD
β€’ Theorem:
Let 𝑓 be convex and 𝜌-Lipschitz, and let
𝑀^βˆ— ∈ argmin_{𝑀: ‖𝑀‖ ≀ 𝐡} 𝑓(𝑀)
Then the output 𝑀̄ of GD run for 𝑇 steps with πœ‚ = √(𝐡² / (𝜌² 𝑇)) satisfies:
𝑓(𝑀̄) ≀ 𝑓(𝑀^βˆ—) + 𝐡𝜌 / βˆšπ‘‡
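An empirical sanity check of this bound (a sketch, assuming NumPy; the test function 𝑓(𝑀) = |𝑀 βˆ’ 0.7|, which is convex and 1-Lipschitz, and the constants 𝐡 = 𝜌 = 1 are illustrative). It runs the sub-gradient version of GD from 𝑀^(1) = 0 and compares the averaged iterate against the bound π΅πœŒ/βˆšπ‘‡:

import numpy as np

# f(w) = |w - 0.7| is convex and 1-Lipschitz; w* = 0.7 satisfies |w*| <= B = 1
B, rho, w_star = 1.0, 1.0, 0.7
f = lambda w: abs(w - w_star)

for T in (100, 1000, 10000):
    eta = np.sqrt(B**2 / (rho**2 * T))
    w, iterates = 0.0, []
    for _ in range(T):
        iterates.append(w)
        w = w - eta * np.sign(w - w_star)     # np.sign(w - w*) is a sub-gradient of f at w
    w_bar = np.mean(iterates)
    print(T, f(w_bar) - f(w_star) <= B * rho / np.sqrt(T))   # the bound should hold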
GD on sub-gradients
β€’ Gradient descent doesn't need differentiability
β€’ Convex and Lipschitz is enough: use any sub-gradient 𝑣 ∈ πœ•π‘“(𝑀^(𝑑)) in place of the gradient
β€’ E.g. if 𝑓(π‘₯) = max{𝑔(π‘₯), β„Ž(π‘₯)} then can take
πœ•π‘”(π‘₯) 𝑔 π‘₯ β‰₯ β„Ž(π‘₯)
π‘£βˆˆ
βŠ† πœ•π‘“(π‘₯)
πœ•β„Ž(π‘₯) β„Ž π‘₯ > 𝑔(π‘₯)
β€’ E.g. hinge loss β„“β„Žπ‘–π‘›π‘”π‘’ 𝑦, 𝑧 = max{0,1 βˆ’ 𝑦 β‹… 𝑧} for 𝑦 ∈
±1 . Can take:
0 0β‰₯1βˆ’π‘¦β‹…π‘§
𝑣=
βˆ’π‘¦ 0 < 1 βˆ’ 𝑦 β‹… 𝑧
β€’ Holds in ℝ𝑑 if we take 𝑓 𝑀 = β„“β„Žπ‘–π‘›π‘”π‘’ (𝑦, 𝑀, π‘₯ )
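A minimal sub-gradient descent sketch for the averaged hinge loss over a toy dataset (assuming NumPy; the synthetic data, step size, and iteration count are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(7)
y = rng.choice([-1.0, 1.0], size=200)
X = rng.normal(size=(200, 2))
X[:, 0] += 2.0 * y                       # roughly separable along the first coordinate

def hinge_subgrad(w):
    """A sub-gradient of f(w) = mean_i max(0, 1 - y_i <w, x_i>)."""
    active = y * (X @ w) < 1.0           # examples where 1 - y<w,x> > 0: sub-gradient -y*x
    return -(y[active, None] * X[active]).sum(axis=0) / len(X)

w, eta = np.zeros(2), 0.1
for _ in range(1000):
    w = w - eta * hinge_subgrad(w)

# averaged hinge loss after training: well below its initial value of 1.0 at w = 0
print(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))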