
Introduction to Machine Learning 236756
Tutorial 3-4: Intro to Convex Optimization
Convex Sets
β€’ A subset 𝐢 βŠ† ℝ^𝑑 is convex if βˆ€π‘₯, 𝑦 ∈ 𝐢, πœ† ∈ [0,1]:
πœ†π‘₯ + (1 βˆ’ πœ†)𝑦 ∈ 𝐢
Convex Functions
β€’ A function 𝑓: 𝐢 ↦ ℝ (with 𝐢 convex) is convex if for all π‘₯, 𝑦 ∈ 𝐢 and πœ† ∈ [0,1]:
𝑓(πœ†π‘₯ + (1 βˆ’ πœ†)𝑦) ≀ πœ†π‘“(π‘₯) + (1 βˆ’ πœ†)𝑓(𝑦)
Convex Functions
β€’ A function 𝑓 is convex iff its epigraph {(π‘₯, 𝑑) : π‘₯ ∈ 𝐢, 𝑑 β‰₯ 𝑓(π‘₯)} is a convex set
Operations that preserve convexity
β€’ Nonnegative weighted sum: 𝛼𝑓 + 𝛽𝑔 is convex for 𝑓, 𝑔 convex and 𝛼, 𝛽 β‰₯ 0
β€’ Affine composition: π‘₯ ↦ 𝑓(𝐴π‘₯ + 𝑏) is convex for 𝑓 convex
β€’ Pointwise sup of convex functions: equivalent to intersecting epigraphs (see the numerical sketch after this list)
β€’ Example: the sum of the π‘Ÿ largest components of π‘₯ (a pointwise max over all choices of π‘Ÿ coordinates)
β€’ Composition: 𝑓 = β„Ž ∘ 𝑔 is convex if β„Ž is convex and non-decreasing and 𝑔 is convex,
or if β„Ž is convex and non-increasing and 𝑔 is concave
β€’ Some more can be found in the literature
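The pointwise-max rule can be illustrated numerically: a maximum of randomly drawn affine functions should never violate the convexity inequality. A minimal sketch, assuming NumPy (the matrix A and vector b are arbitrary illustrative data):

import numpy as np

rng = np.random.default_rng(1)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
f = lambda x: np.max(A @ x + b)          # pointwise max of 5 affine functions on R^3

for _ in range(1000):
    x, y = rng.normal(size=(2, 3))
    lam = rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9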
Some examples without proofs
β€’ In ℝ:
β€’ Affine functions (the only functions that are both convex and concave)
β€’ Negative log: 𝑓(π‘₯) = βˆ’log(π‘₯) on (0, ∞)
β€’ In ℝ^𝑛 and ℝ^(π‘šΓ—π‘›):
β€’ Norms
β€’ Trace (linear, hence convex)
β€’ Maximum eigenvalue of a symmetric matrix (see the sketch after this list)
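For the last example, a small numerical illustration (a sketch assuming NumPy; the maximum eigenvalue of a symmetric matrix is computed with eigvalsh, and the random matrices are illustrative):

import numpy as np

rng = np.random.default_rng(2)
lam_max = lambda S: np.linalg.eigvalsh(S)[-1]    # largest eigenvalue of a symmetric matrix

def rand_sym(n=4):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2.0

for _ in range(500):
    Xs, Ys = rand_sym(), rand_sym()
    t = rng.uniform()
    # lambda_max(t*X + (1-t)*Y) <= t*lambda_max(X) + (1-t)*lambda_max(Y)
    assert lam_max(t * Xs + (1 - t) * Ys) <= t * lam_max(Xs) + (1 - t) * lam_max(Ys) + 1e-9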
Convexity of Differentiable Functions
β€’ If 𝑓 is differentiable and convex, the tangent at any 𝑀 lies below 𝑓: 𝑓(π‘₯) β‰₯ 𝑓(𝑀) + βŸ¨π›»π‘“(𝑀), π‘₯ βˆ’ π‘€βŸ© for all π‘₯ [figure: tangent line at 𝑀 lying below the graph of 𝑓]
β€’ Can we generalize that property to the non-differentiable case?
Gradients and Sub-Gradients
β€’ For any 𝑀 ∈ 𝐢 we can construct a tangent to 𝑓 that
lies below 𝑓 everywhere:
βˆƒπ‘£ ∈ ℝ^𝑑 βˆ€π‘₯ ∈ 𝐢: 𝑔_𝑣(π‘₯) ≔ 𝑓(𝑀) + βŸ¨π‘£, π‘₯ βˆ’ π‘€βŸ© ≀ 𝑓(π‘₯)
β€’ Such a vector 𝑣 is called a sub-gradient of 𝑓 at 𝑀
β€’ The set of all sub-gradients at 𝑀 is called the β€œdifferential set” πœ•π‘“(𝑀)
𝑓π‘₯)
𝑀
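A concrete illustration (not from the slides): for 𝑓(π‘₯) = |π‘₯| at 𝑀 = 0, every 𝑣 ∈ [βˆ’1, 1] is a sub-gradient, and the corresponding tangent 𝑔_𝑣 indeed stays below 𝑓. A minimal NumPy sketch:

import numpy as np

f = lambda x: np.abs(x)
w, xs = 0.0, np.linspace(-5.0, 5.0, 1001)
for v in np.linspace(-1.0, 1.0, 21):      # every v in [-1, 1] is a sub-gradient of |.| at w = 0
    g = f(w) + v * (xs - w)               # the tangent g_v(x) = f(w) + <v, x - w>
    assert np.all(g <= f(xs) + 1e-12)     # g_v lies below f everywhere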
Proving convexity
β€’ Simple case: 𝐢 βŠ† ℝ, 𝑓 twice differentiable on 𝐢. The following are equivalent:
β€’ 𝑓 convex
β€’ 𝑓′′(𝑀) β‰₯ 0 for all 𝑀 ∈ 𝐢
β€’ 𝑓′(𝑀) monotonically non-decreasing in 𝑀
β€’ If 𝑓 twice differentiable on convex 𝐢 βŠ† ℝ^𝑑, the following are equivalent:
β€’ 𝑓 convex
β€’ Hessian 𝐻_𝑓(𝑀) PSD for all 𝑀 ∈ 𝐢 (a numerical sketch follows below)
β€’ If 𝑓 differentiable on 𝐢 and convex then the sub-gradient is unique:
πœ•π‘“(𝑀) = {𝛻𝑓(𝑀)}, where 𝛻𝑓(𝑀) = (βˆ‚π‘“/βˆ‚π‘€_1, …, βˆ‚π‘“/βˆ‚π‘€_𝑑)
Example 1
β€’ 𝑓 π‘₯ = βˆ’π‘™π‘œπ‘”
β€’
𝑓′
π‘₯ =
𝑒 π‘₯/2
π‘₯
π‘₯
βˆ’
𝑒 2 +𝑒 2
= log(1 + 𝑒 βˆ’π‘₯ )
1
βˆ’π‘₯ )
(βˆ’π‘’
1+𝑒 βˆ’π‘₯
=
βˆ’1
1+𝑒 π‘₯
β€’ 𝑓 β€² (π‘₯) monotonically increasing
οƒž 𝑓 convex
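A numerical sanity check of the monotonicity claim (a sketch, assuming NumPy; the grid is arbitrary, and a tiny tolerance absorbs floating-point rounding):

import numpy as np

x = np.linspace(-20.0, 20.0, 4001)
fprime = -1.0 / (1.0 + np.exp(x))            # f'(x) for f(x) = log(1 + e^(-x))
print(np.all(np.diff(fprime) >= -1e-12))     # True: f' is monotonically non-decreasing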
Example 2
β€’ Suppose 𝑓: ℝ^𝑑 ↦ ℝ is given as
𝑓(π‘₯) = 𝑔(βŸ¨π‘€, π‘₯⟩ + 𝑏)
where 𝑔 is convex and 𝑀 ∈ ℝ^𝑑, 𝑏 ∈ ℝ are parameters
β‡’ 𝑓 convex (affine composition)
β€’ Example:
𝑓(π‘₯) = log(1 + 𝑒^(βˆ’βŸ¨π‘€,π‘₯⟩ βˆ’ 𝑏))
Example 3
β€’ For 𝑦 ∈ {Β±1}, is π‘₯ ↦ β„“_hinge(𝑦, π‘₯) = max{0, 1 βˆ’ 𝑦 β‹… π‘₯} convex?
β€’ Yes, it's convex: it is the pointwise max of the two affine functions π‘₯ ↦ 0 and π‘₯ ↦ 1 βˆ’ 𝑦 β‹… π‘₯
β€’ Holds in ℝ^𝑑 if we take 𝑓(𝑀) = β„“_hinge(𝑦, βŸ¨π‘€, π‘₯⟩)
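A minimal numerical check of the last bullet (assuming NumPy; the fixed π‘₯ and the random 𝑀's are illustrative): 𝑓(𝑀) = β„“_hinge(𝑦, βŸ¨π‘€, π‘₯⟩) satisfies the convexity inequality in 𝑀:

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=4)                               # a fixed point in R^d
f = lambda y, w: max(0.0, 1.0 - y * (w @ x))         # f(w) = hinge(y, <w, x>)

for _ in range(1000):
    y = rng.choice([-1.0, 1.0])
    w1, w2 = rng.normal(size=(2, 4))
    lam = rng.uniform()
    lhs = f(y, lam * w1 + (1 - lam) * w2)
    assert lhs <= lam * f(y, w1) + (1 - lam) * f(y, w2) + 1e-9   # convex in w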
Which of the following losses is convex?
β€’ Which is convex in the prediction 𝑧 ∈ ℝ, for 𝑦 ∈ {Β±1}:
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = 𝟙[𝑦 β‰  𝑧]
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = { 1 if 𝑦𝑧 < 0;  1 βˆ’ 𝑦𝑧 if 0 ≀ 𝑦𝑧 ≀ 1;  0 if 𝑦𝑧 > 1 }
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^2
β€’ π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^(1/2)
Which of the following losses is convex?
β€’ Which is convex in the prediction 𝑧 ∈ ℝ, for 𝑦 ∈ {Β±1}:
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = 𝟙[𝑦 β‰  𝑧]
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = { 1 if 𝑦𝑧 < 0;  1 βˆ’ 𝑦𝑧 if 0 ≀ 𝑦𝑧 ≀ 1;  0 if 𝑦𝑧 > 1 }
β€’ Yes: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^2
β€’ No: π‘™π‘œπ‘ π‘ (𝑧; 𝑦) = ([1 βˆ’ 𝑦𝑧]_+)^(1/2)
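A numerical illustration of these answers (a sketch assuming NumPy, and interpreting the first loss as the sign-based 0-1 loss 𝟙[𝑦𝑧 < 0]); it searches random chords for violations of the convexity inequality, and only the squared hinge should come out clean:

import numpy as np

rng = np.random.default_rng(6)

def ramp(z, y):
    m = y * z
    return 1.0 if m < 0 else (1.0 - m if m <= 1 else 0.0)

losses = {
    "zero-one":      lambda z, y: float(y * z < 0),
    "ramp":          ramp,
    "squared hinge": lambda z, y: max(0.0, 1.0 - y * z) ** 2,
    "sqrt hinge":    lambda z, y: max(0.0, 1.0 - y * z) ** 0.5,
}

for name, loss in losses.items():
    violated = False
    for _ in range(20000):
        y = rng.choice([-1.0, 1.0])
        z1, z2 = 3.0 * rng.normal(size=2)
        lam = rng.uniform()
        lhs = loss(lam * z1 + (1 - lam) * z2, y)
        rhs = lam * loss(z1, y) + (1 - lam) * loss(z2, y)
        if lhs > rhs + 1e-9:
            violated = True
            break
    print(name, "violates convexity:", violated)   # only "squared hinge" should print False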
Minimizing Convex Functions
β€’ Any local minimum is also a global minimum
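A one-line justification, written out as a math sketch (using only the convexity inequality from the earlier slides):

% Why a local minimum of a convex f is global (sketch):
% suppose w is a local minimum but f(u) < f(w) for some u in C; then for every lambda in (0,1]
\[
  f\bigl((1-\lambda)\,w + \lambda\,u\bigr)
    \;\le\; (1-\lambda)\,f(w) + \lambda\,f(u)
    \;<\; f(w),
\]
% so there are points arbitrarily close to w (take lambda -> 0) with strictly smaller value,
% contradicting local minimality of w; hence no such u exists and w is a global minimum.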
Gradient Descent
Input: convex 𝑓: 𝐢 ↦ ℝ (via oracle), small step size πœ‚
Set 𝑑 = 1, 𝑀^(1) ∈ 𝐢 arbitrarily
While ¬(stopping condition):
  𝑀^(𝑑+1) = 𝑀^(𝑑) βˆ’ πœ‚ 𝛻𝑓(𝑀^(𝑑))
  𝑑 = 𝑑 + 1
Return 𝑀̄ = (1/𝑑) Ξ£_{𝑑′=1}^{𝑑} 𝑀^(𝑑′)
Alternative options:
β€’ Return 𝑀^(𝑑)
β€’ Return argmin_{𝑑′} 𝑓(𝑀^(𝑑′))
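A minimal NumPy sketch of this loop (the function gradient_descent, the fixed iteration budget T, and the quadratic example are illustrative choices, not part of the slides); it returns the averaged iterate and notes the alternatives in comments:

import numpy as np

def gradient_descent(grad_f, w0, eta, T):
    """T fixed-step gradient steps; return the averaged iterate."""
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(T):
        w = w - eta * grad_f(w)              # w^(t+1) = w^(t) - eta * grad f(w^(t))
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)         # alternatives: return w (last iterate)
                                             # or the iterate with smallest f value

# example: f(w) = ||w - c||^2, with gradient 2(w - c)
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0], eta=0.1, T=500))   # ~ c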
Step Size πœ‚
β€’ Too big can be bad (overshooting, even divergence)
β€’ Too small can be bad (very slow progress)
β€’ Tradeoff between the desire to reduce 𝑓 and the desire not
to go too far…
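A tiny illustration of the tradeoff on 𝑓(𝑀) = 𝑀², where the update is 𝑀 ← (1 βˆ’ 2πœ‚)𝑀 (a sketch; the particular step sizes are arbitrary):

def run_gd(eta, T=30, w0=5.0):
    """Gradient descent on f(w) = w^2 (so f'(w) = 2w) with a fixed step size."""
    w = w0
    for _ in range(T):
        w = w - eta * 2 * w
    return w

print(run_gd(eta=1.1))     # too big: |1 - 2*eta| > 1, the iterates blow up
print(run_gd(eta=1e-4))    # too small: after 30 steps w has barely moved from 5.0
print(run_gd(eta=0.3))     # reasonable: w is essentially at the minimizer 0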
Controlling the Change: Lipschitz
β€’ A function 𝑓: ℝ^𝑑 ↦ ℝ is 𝜌-Lipschitz if βˆ€π‘₯, 𝑦:
|𝑓(π‘₯) βˆ’ 𝑓(𝑦)| ≀ 𝜌 β€–π‘₯ βˆ’ 𝑦‖_2
β€’ Lemma: for 𝑓 convex:
𝑓 is 𝜌-Lipschitz ⟺ βˆ€π‘€, βˆ€π‘£ ∈ πœ•π‘“(𝑀): ‖𝑣‖ ≀ 𝜌
(The definition of Lipschitz is more general than this.)
β€’ Proof:
⟸ (bounded sub-gradients β‡’ Lipschitz): Fix π‘₯, 𝑀 and take 𝑣 ∈ πœ•π‘“(π‘₯). The sub-gradient inequality gives
𝑓(π‘₯) βˆ’ 𝑓(𝑀) ≀ βŸ¨π‘£, π‘₯ βˆ’ π‘€βŸ© ≀ ‖𝑣‖ β‹… β€–π‘₯ βˆ’ 𝑀‖ ≀ 𝜌 β€–π‘₯ βˆ’ 𝑀‖
and symmetrically, with a sub-gradient at 𝑀:
𝑓(𝑀) βˆ’ 𝑓(π‘₯) ≀ β‹― ≀ 𝜌 β€–π‘₯ βˆ’ 𝑀‖
β‡’: In the book
Guarantee of GD
β€’ Theorem:
Let 𝑓 be convex and 𝜌-Lipschitz, and let
𝑀^βˆ— ∈ argmin_{𝑀: ‖𝑀‖ ≀ 𝐡} 𝑓(𝑀)
Then the output 𝑀̄ of GD run for 𝑇 steps with πœ‚ = √(𝐡² / (𝜌² 𝑇)) satisfies:
𝑓(𝑀̄) ≀ 𝑓(𝑀^βˆ—) + 𝐡𝜌 / βˆšπ‘‡
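An empirical sanity check of this bound (a sketch, assuming NumPy; the test function 𝑓(𝑀) = |𝑀 βˆ’ 0.7|, which is convex and 1-Lipschitz, and the constants 𝐡 = 𝜌 = 1 are illustrative). It runs the sub-gradient version of GD from 𝑀^(1) = 0 and compares the averaged iterate against the bound π΅πœŒ/βˆšπ‘‡:

import numpy as np

# f(w) = |w - 0.7| is convex and 1-Lipschitz; w* = 0.7 satisfies |w*| <= B = 1
B, rho, w_star = 1.0, 1.0, 0.7
f = lambda w: abs(w - w_star)

for T in (100, 1000, 10000):
    eta = np.sqrt(B**2 / (rho**2 * T))
    w, iterates = 0.0, []
    for _ in range(T):
        iterates.append(w)
        w = w - eta * np.sign(w - w_star)     # np.sign(w - w*) is a sub-gradient of f at w
    w_bar = np.mean(iterates)
    print(T, f(w_bar) - f(w_star) <= B * rho / np.sqrt(T))   # the bound should hold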
GD on sub-gradients
β€’ Gradient descent doesn't need differentiability
β€’ Convex and Lipschitz is enough: use any sub-gradient 𝑣 ∈ πœ•π‘“(𝑀^(𝑑)) in place of the gradient
β€’ E.g. if 𝑓(π‘₯) = max{𝑔(π‘₯), β„Ž(π‘₯)} then can take
πœ•π‘”(π‘₯) 𝑔 π‘₯ β‰₯ β„Ž(π‘₯)
π‘£βˆˆ
βŠ† πœ•π‘“(π‘₯)
πœ•β„Ž(π‘₯) β„Ž π‘₯ > 𝑔(π‘₯)
β€’ E.g. hinge loss β„“β„Žπ‘–π‘›π‘”π‘’ 𝑦, 𝑧 = max{0,1 βˆ’ 𝑦 β‹… 𝑧} for 𝑦 ∈
±1 . Can take:
0 0β‰₯1βˆ’π‘¦β‹…π‘§
𝑣=
βˆ’π‘¦ 0 < 1 βˆ’ 𝑦 β‹… 𝑧
β€’ Holds in ℝ𝑑 if we take 𝑓 𝑀 = β„“β„Žπ‘–π‘›π‘”π‘’ (𝑦, 𝑀, π‘₯ )
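A minimal sub-gradient descent sketch for the averaged hinge loss over a toy dataset (assuming NumPy; the synthetic data, step size, and iteration count are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(7)
y = rng.choice([-1.0, 1.0], size=200)
X = rng.normal(size=(200, 2))
X[:, 0] += 2.0 * y                       # roughly separable along the first coordinate

def hinge_subgrad(w):
    """A sub-gradient of f(w) = mean_i max(0, 1 - y_i <w, x_i>)."""
    active = y * (X @ w) < 1.0           # examples where 1 - y<w,x> > 0: sub-gradient -y*x
    return -(y[active, None] * X[active]).sum(axis=0) / len(X)

w, eta = np.zeros(2), 0.1
for _ in range(1000):
    w = w - eta * hinge_subgrad(w)

# averaged hinge loss after training: well below its initial value of 1.0 at w = 0
print(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))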