CIS 700:
“Algorithms for Big Data”
Lecture 8:
Gradient Descent
Slides at http://grigory.us/big-data-class.html
Grigory Yaroslavtsev
http://grigory.us
Smooth Convex Optimization
• Minimize $f$ over $\mathbb{R}^n$:
  – $f$ admits a minimizer $x^*$ ($\nabla f(x^*) = 0$)
  – $f$ is continuously differentiable and convex on $\mathbb{R}^n$:
    $\forall x, y \in \mathbb{R}^n: f(x) - f(y) \ge (x - y)^T \nabla f(y)$
  – $f$ is smooth ($\nabla f$ is $\beta$-Lipschitz):
    $\forall x, y \in \mathbb{R}^n: \|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$
• Example:
  – $f = \frac{1}{2} x^T A x - b^T x$
  – $\nabla f = Ax - b \Rightarrow x^* = A^{-1} b$
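A minimal numeric sketch of this example (the random instance and variable names are illustrative, not from the slides): it builds a p.s.d. matrix $A$, forms $f(x) = \frac{1}{2}x^T A x - b^T x$, checks that the gradient vanishes at $x^* = A^{-1}b$, and takes $\beta$ as the largest eigenvalue of $A$, a valid Lipschitz constant for $\nabla f = Ax - b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # symmetric positive definite (illustrative)
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b           # gradient of f

x_star = np.linalg.solve(A, b)       # minimizer x* = A^{-1} b
beta = np.linalg.eigvalsh(A).max()   # beta-smoothness: largest eigenvalue of A

print(np.linalg.norm(grad(x_star)))  # ~0: gradient vanishes at the minimizer
print(beta)
```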
Gradient Descent Method
• Gradient descent method:
  – Start with an arbitrary $x_1$
  – Iterate $x_{s+1} = x_s - \eta \cdot \nabla f(x_s)$
• Thm. If $\eta = 1/\beta$ then:
  $f(x_t) - f(x^*) \le \frac{2\beta \|x_1 - x^*\|^2}{t+3}$
• “Linear” convergence rate $O(1/t)$; can be improved to a “quadratic” rate $O(1/t^2)$ using Nesterov's accelerated descent
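A short sketch of the method and the theorem's bound on the same kind of quadratic instance (setup assumed, not from the slides): it runs $x_{s+1} = x_s - \frac{1}{\beta}\nabla f(x_s)$ and checks the suboptimality against $2\beta\|x_1 - x^*\|^2/(t+3)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                  # illustrative quadratic f = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
beta = np.linalg.eigvalsh(A).max()       # step size eta = 1/beta

x = np.zeros(n)                          # arbitrary starting point x_1
gap0 = np.linalg.norm(x - x_star) ** 2
for t in range(1, 51):
    x = x - grad(x) / beta               # x_{s+1} = x_s - (1/beta) * grad f(x_s)
    bound = 2 * beta * gap0 / (t + 3)    # theorem's guarantee after t steps
    assert f(x) - f(x_star) <= bound + 1e-12
print(f(x) - f(x_star))
```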
Gradient Descent: Analysis
• Lemma 1: If $f$ is $\beta$-smooth then $\forall x, y \in \mathbb{R}^n$:
  $f(x) \le f(y) + \nabla f(y)^T (x - y) + \frac{\beta}{2} \|x - y\|^2$
• $f(x) - f(y) - \nabla f(y)^T (x - y) = \int_0^1 \left(\nabla f(y + t(x - y)) - \nabla f(y)\right)^T (x - y)\, dt$
  $\le \int_0^1 \beta t \|x - y\|^2 \, dt = \frac{\beta}{2} \|x - y\|^2$
Gradient Descent: Analysis
• Lemma 2: If $f$ is convex and $\beta$-smooth then $\forall x, y \in \mathbb{R}^n$:
  $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{1}{2\beta} \|\nabla f(x) - \nabla f(y)\|^2$
• Cor: $(\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{1}{\beta} \|\nabla f(x) - \nabla f(y)\|^2$
• $\phi_x(y) = f(y) - \nabla f(x)^T y$
• $\nabla \phi_x(y) = \nabla f(y) - \nabla f(x)$
• $\phi_x$ is convex, $\beta$-smooth and minimized at $x$:
  $\phi_x(x) - \phi_x(y) = f(x) - \nabla f(x)^T x - f(y) + \nabla f(x)^T y \ge (x - y)^T \nabla \phi_x(y)$
  $\|\nabla \phi_x(y_1) - \nabla \phi_x(y_2)\| = \|\nabla f(y_1) - \nabla f(y_2)\| \le \beta \|y_1 - y_2\|$
Gradient Descent: Analysis
• Lemma 2: If $f$ is convex and $\beta$-smooth then $\forall x, y \in \mathbb{R}^n$:
  $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{1}{2\beta} \|\nabla f(x) - \nabla f(y)\|^2$
• $\phi_x(y) = f(y) - \nabla f(x)^T y$
• $\nabla \phi_x(y) = \nabla f(y) - \nabla f(x)$
• $f(x) - f(y) - \nabla f(x)^T (x - y) = \phi_x(x) - \phi_x(y)$
  $\le \phi_x\!\left(y - \frac{1}{\beta} \nabla \phi_x(y)\right) - \phi_x(y)$    (since $x$ minimizes $\phi_x$)
  $\le -\frac{1}{\beta} \|\nabla \phi_x(y)\|^2 + \frac{\beta}{2} \cdot \frac{1}{\beta^2} \|\nabla \phi_x(y)\|^2$    (by Lemma 1)
  $= -\frac{1}{2\beta} \|\nabla \phi_x(y)\|^2 = -\frac{1}{2\beta} \|\nabla f(x) - \nabla f(y)\|^2$
Gradient Descent: Analysis
• Gradient descent: $x_{s+1} = x_s - \frac{1}{\beta} \nabla f(x_s)$
• Thm: $f(x_t) - f(x^*) \le \frac{2\beta \|x_1 - x^*\|^2}{t+3}$
• $f(x_{s+1}) - f(x_s) \le \nabla f(x_s)^T (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 = -\frac{1}{2\beta} \|\nabla f(x_s)\|^2$
• Let $\delta_s = f(x_s) - f(x^*)$. Then $\delta_{s+1} \le \delta_s - \frac{1}{2\beta} \|\nabla f(x_s)\|^2$
• $\delta_s \le \nabla f(x_s)^T (x_s - x^*) \le \|x_s - x^*\| \cdot \|\nabla f(x_s)\|$
• Lem: $\|x_s - x^*\|$ is decreasing with $s$.
• $\delta_{s+1} \le \delta_s - \frac{\delta_s^2}{2\beta \|x_1 - x^*\|^2}$
Gradient Descent: Analysis
• $\delta_{s+1} \le \delta_s - \omega \delta_s^2$, where $\omega = \frac{1}{2\beta \|x_1 - x^*\|^2}$
• $\omega \delta_s^2 + \delta_{s+1} \le \delta_s \Leftrightarrow \omega \frac{\delta_s}{\delta_{s+1}} + \frac{1}{\delta_s} \le \frac{1}{\delta_{s+1}} \Rightarrow \frac{1}{\delta_{s+1}} - \frac{1}{\delta_s} \ge \omega \Rightarrow \frac{1}{\delta_t} \ge \omega(t - 1) + \frac{1}{f(x_1) - f(x^*)}$
• $f(x_1) - f(x^*) \le \nabla f(x^*)^T (x_1 - x^*) + \frac{\beta}{2} \|x_1 - x^*\|^2 = \frac{\beta}{2} \|x_1 - x^*\|^2 \Rightarrow \frac{1}{f(x_1) - f(x^*)} \ge \frac{4}{2\beta \|x_1 - x^*\|^2} = 4\omega$
• $\delta_t \le \frac{1}{\omega(t+3)}$
Gradient Descent: Analysis
• Lem: $\|x_s - x^*\|$ is decreasing with $s$.
• $(\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{1}{\beta} \|\nabla f(x) - \nabla f(y)\|^2$
  $\Rightarrow \nabla f(y)^T (y - x^*) \ge \frac{1}{\beta} \|\nabla f(y)\|^2$    (taking $x = x^*$, $\nabla f(x^*) = 0$)
• $\|x_{s+1} - x^*\|^2 = \left\|x_s - \frac{1}{\beta} \nabla f(x_s) - x^*\right\|^2$
  $= \|x_s - x^*\|^2 - \frac{2}{\beta} \nabla f(x_s)^T (x_s - x^*) + \frac{1}{\beta^2} \|\nabla f(x_s)\|^2$
  $\le \|x_s - x^*\|^2 - \frac{1}{\beta^2} \|\nabla f(x_s)\|^2$
  $\le \|x_s - x^*\|^2$
Nesterov's Accelerated Gradient Descent
• Params: $\lambda_0 = 0$, $\lambda_s = \frac{1 + \sqrt{1 + 4\lambda_{s-1}^2}}{2}$, $\gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}$
• Accelerated Gradient Descent ($x_1 = y_1$):
  – $y_{s+1} = x_s - \frac{1}{\beta} \nabla f(x_s)$
  – $x_{s+1} = (1 - \gamma_s) y_{s+1} + \gamma_s y_s$
• Optimal convergence rate $O(1/t^2)$
• Thm. If $f$ is convex and $\beta$-smooth then:
  $f(y_t) - f(x^*) \le \frac{2\beta \|x_1 - x^*\|^2}{t^2}$
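A sketch of the recursion above on an illustrative quadratic (instance and constants assumed, not from the slides): it tracks $\lambda_s$, $\gamma_s$ and the two-step update, and compares the final suboptimality with the $O(1/t^2)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                  # illustrative beta-smooth convex quadratic
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
beta = np.linalg.eigvalsh(A).max()

T = 50
x = np.zeros(n)                          # x_1 = y_1
y = x.copy()
lam = 0.0                                # lambda_0 = 0
for s in range(1, T + 1):
    lam_s = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2        # lambda_s
    lam_next = (1 + np.sqrt(1 + 4 * lam_s ** 2)) / 2   # lambda_{s+1}
    gamma = (1 - lam_s) / lam_next                     # gamma_s
    y_new = x - grad(x) / beta                         # y_{s+1}: plain gradient step
    x = (1 - gamma) * y_new + gamma * y                # x_{s+1}: extrapolation step
    y = y_new
    lam = lam_s

gap0 = np.linalg.norm(np.zeros(n) - x_star) ** 2
print(f(y) - f(x_star), 2 * beta * gap0 / T ** 2)      # suboptimality vs the O(1/t^2) bound
```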
Accelerated Gradient Descent: Analysis
β€’ 𝑓 π‘₯
1
βˆ’ 𝛻f
𝛽
x
βˆ’π‘“ 𝑦 ≀
1
≀ 𝑓 π‘₯ βˆ’ 𝛻f x
𝛽
βˆ’ 𝑓 π‘₯ + 𝛻𝑓 π‘₯
𝑇
π‘₯βˆ’π‘¦
2
≀ 𝛻𝑓 π‘₯
𝑇
1
π‘₯ βˆ’ 𝛻f x βˆ’
𝛽
𝛻𝑓 π‘₯ 𝑇 π‘₯ βˆ’ 𝑦
1
=βˆ’
𝛻𝑓 π‘₯
2𝛽
π‘₯
𝛽
+
2
π‘₯ βˆ’
1
𝛻f
𝛽
x βˆ’π‘₯
(by Lemma 1)
2
+ 𝛻f x T (x βˆ’ y)
+
2
Accelerated Gradient Descent: Analysis
β€’ 𝑓 π‘₯
1
βˆ’ 𝛻f
𝛽
x
βˆ’π‘“ 𝑦 ≀
1
βˆ’
2𝛽
𝛻𝑓 π‘₯
2
+ 𝛻f x
T
x βˆ’y
β€’ Apply to π‘₯ = π‘₯𝑠 , 𝑦 = 𝑦𝑠 :
𝑓 𝑦𝑠+1 βˆ’ 𝑓 𝑦𝑠
1
= 𝑓 π‘₯𝑠 βˆ’ 𝛻f xs
𝛽
1
β‰€βˆ’
𝛻𝑓 π‘₯𝑠
2𝛽
2
+ 𝛻𝑓 π‘₯𝑠 π‘₯𝑠 βˆ’ 𝑦𝑠
2
𝛽
βˆ’
2
=
𝑦𝑠+1 βˆ’ π‘₯𝑠
βˆ’ 𝛽 𝑦𝑠+1 βˆ’ π‘₯𝑠
β€’ Apply to π‘₯ = π‘₯𝑠 , 𝑦 = π‘₯ βˆ— :
𝑓 𝑦𝑠+1 βˆ’ 𝑓
(2)
π‘₯βˆ—
≀
𝛽
βˆ’
2
βˆ’ 𝑓(𝑦𝑠 )
𝑦𝑠+1 βˆ’ π‘₯𝑠
𝑇
2
π‘₯𝑠 βˆ’ 𝑦𝑠
βˆ’
𝛽
2
(1)
𝑦𝑠+1 βˆ’ π‘₯𝑠 𝑇 (π‘₯𝑠 βˆ’ π‘₯ βˆ— )
Accelerated Gradient Descent: Analysis
• $(1) \times (\lambda_s - 1) + (2)$, for $\delta_s = f(y_s) - f(x^*)$:
  $\lambda_s \delta_{s+1} - (\lambda_s - 1)\delta_s \le -\frac{\beta}{2} \lambda_s \|y_{s+1} - x_s\|^2 - \beta (y_{s+1} - x_s)^T (\lambda_s x_s - (\lambda_s - 1) y_s - x^*)$
• Multiply by $\lambda_s$ and use $\lambda_{s-1}^2 = \lambda_s^2 - \lambda_s$:
  $\lambda_s^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \le -\frac{\beta}{2} \left( \lambda_s^2 \|y_{s+1} - x_s\|^2 + 2\lambda_s (y_{s+1} - x_s)^T (\lambda_s x_s - (\lambda_s - 1) y_s - x^*) \right)$
• It holds that:
  $\lambda_s^2 \|y_{s+1} - x_s\|^2 + 2\lambda_s (y_{s+1} - x_s)^T (\lambda_s x_s - (\lambda_s - 1) y_s - x^*)$
  $= \|\lambda_s y_{s+1} - (\lambda_s - 1) y_s - x^*\|^2 - \|\lambda_s x_s - (\lambda_s - 1) y_s - x^*\|^2$
Accelerated Gradient Descent: Analysis
• By definition of AGD:
  $x_{s+1} = y_{s+1} + \gamma_s (y_s - y_{s+1}) \Leftrightarrow$
  $\lambda_{s+1} x_{s+1} = \lambda_{s+1} y_{s+1} + (1 - \lambda_s)(y_s - y_{s+1}) \Leftrightarrow$
  $\lambda_{s+1} x_{s+1} - (\lambda_{s+1} - 1) y_{s+1} = \lambda_s y_{s+1} - (\lambda_s - 1) y_s$
• Putting the last three facts together, for $u_s = \lambda_s x_s - (\lambda_s - 1) y_s - x^*$ we have:
  $\lambda_s^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \le \frac{\beta}{2} \left( \|u_s\|^2 - \|u_{s+1}\|^2 \right)$
• Adding up over $s = 1$ to $s = t - 1$:
  $\delta_t \le \frac{\beta}{2 \lambda_{t-1}^2} \|u_1\|^2$
• By induction $\lambda_{t-1} \ge \frac{t}{2}$. Q.E.D.
Constrained Convex Optimization
• Non-convex optimization is NP-hard:
  $\sum_i x_i^2 (1 - x_i)^2 = 0 \Leftrightarrow \forall i: x_i \in \{0, 1\}$
• Knapsack:
  – Minimize $\sum_i c_i x_i$
  – Subject to: $\sum_i w_i x_i \le W$
• Convex optimization can often be solved by the ellipsoid algorithm in $poly(n)$ time, but this is too slow in practice
Convex multivariate functions
• Convexity:
  – $\forall x, y \in \mathbb{R}^n: f(x) \ge f(y) + (x - y)^T \nabla f(y)$
  – $\forall x, y \in \mathbb{R}^n, 0 \le \lambda \le 1$:
    $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$
• If higher derivatives exist:
  $f(x) = f(y) + \nabla f(y)^T (x - y) + \frac{1}{2} (x - y)^T \nabla^2 f(y) (x - y) + \cdots$
• $\nabla^2 f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix
• $f$ is convex iff its Hessian is positive semidefinite:
  $y^T \nabla^2 f(x)\, y \ge 0$ for all $x, y$.
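A quick numeric illustration of the Hessian criterion on a made-up quadratic: for $f(x) = \frac{1}{2}x^T A x - b^T x$ the Hessian is $A$, so checking its eigenvalues certifies convexity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T                                 # symmetric p.s.d. by construction (illustrative)
# Hessian of f(x) = 0.5 x^T A x - b^T x is A itself
eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)                          # all >= 0  =>  f is convex
print(bool(np.all(eigenvalues >= -1e-12)))  # True
```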
Examples of convex functions
• $\ell_p$-norm is convex for $1 \le p \le \infty$:
  $\|\lambda x + (1 - \lambda) y\|_p \le \|\lambda x\|_p + \|(1 - \lambda) y\|_p = \lambda \|x\|_p + (1 - \lambda) \|y\|_p$
• $f(x) = \log(e^{x_1} + e^{x_2} + \cdots + e^{x_n})$:
  $\max(x_1, \ldots, x_n) \le f(x) \le \max(x_1, \ldots, x_n) + \log n$
• $f(x) = x^T A x$ where $A$ is a p.s.d. matrix, $\nabla^2 f = 2A$
• Examples of constrained convex optimization:
  – (Linear equations with a p.s.d. matrix): minimize $\frac{1}{2} x^T A x - b^T x$ (the solution satisfies $Ax = b$)
  – (Least squares regression): minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2 (Ax)^T b + b^T b$
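A tiny check of the soft-max bounds above on arbitrary illustrative values:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.5, 0.7])         # arbitrary values
n = len(x)
f = np.log(np.sum(np.exp(x)))               # soft-max / log-sum-exp
print(x.max() <= f <= x.max() + np.log(n))  # True: max <= f <= max + log n
```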
Constrained Convex Optimization
• General formulation for convex $f$ and a convex set $K$:
  minimize $f(x)$ subject to $x \in K$
• Example (SVMs):
  – Data: $X_1, \ldots, X_N \in \mathbb{R}^n$ labeled by $y_1, \ldots, y_N \in \{-1, 1\}$ (spam / non-spam)
  – Find a linear model:
    $W \cdot X_i \ge 1 \Rightarrow X_i$ is spam
    $W \cdot X_i \le -1 \Rightarrow X_i$ is non-spam
    $\forall i: 1 - y_i\, W \cdot X_i \le 0$
• More robust version:
  minimize $\sum_i \mathrm{Loss}(1 - y_i\, W \cdot X_i) + \lambda \|W\|_2^2$
  – E.g. hinge loss $\mathrm{Loss}(t) = \max(0, t)$
  – Another regularizer: $\lambda \|W\|_1$ (favors sparse solutions)
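A sketch of the regularized hinge-loss objective on synthetic data (the data generator, $\lambda$, and names are illustrative): the subgradient of $\max(0, 1 - y_i\, W \cdot X_i)$ is $-y_i X_i$ on margin-violating examples and $0$ elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, lam = 200, 10, 0.1                        # illustrative sizes and regularization
X = rng.standard_normal((N, n))                 # data points X_1..X_N
y = np.sign(X @ rng.standard_normal(n))         # labels in {-1, +1}

def svm_objective(W):
    margins = 1 - y * (X @ W)
    return np.maximum(0.0, margins).sum() + lam * W @ W    # hinge loss + lambda ||W||_2^2

def svm_subgradient(W):
    violated = (1 - y * (X @ W)) > 0                        # examples with positive hinge loss
    return -(y[violated, None] * X[violated]).sum(axis=0) + 2 * lam * W

W = np.zeros(n)
print(svm_objective(W), np.linalg.norm(svm_subgradient(W)))
```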
Gradient Descent for Constrained
Convex Optimization
• (Projection): $x \notin K \to y \in K$
  $y = \mathrm{argmin}_{z \in K} \|z - x\|_2$
• Easy to compute for the unit $\ell_2$-ball: $y = x / \|x\|_2$
• Let $\|\nabla f(x)\|_2 \le G$ and $\max_{x, y \in K} \|x - y\|_2 \le D$.
• Let $T = \frac{4 D^2 G^2}{\epsilon^2}$
• Gradient descent (gradient + projection oracles):
  – Let $\eta = D / (G \sqrt{T})$
  – Repeat for $i = 0, \ldots, T$:
    • $y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
    • $x^{(i+1)}$ = projection of $y^{(i+1)}$ onto $K$
  – Output $z = \frac{1}{T} \sum_i x^{(i)}$
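A minimal sketch of the projected method above with $K$ taken to be the unit $\ell_2$-ball, so the projection is just rescaling (the instance and the choices of $D$, $G$, $T$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                          # illustrative objective f = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)
grad = lambda x: A @ x - b                       # gradient oracle

def project(x):                                  # projection oracle for K = unit l2-ball
    norm = np.linalg.norm(x)
    return x if norm <= 1 else x / norm

D = 2.0                                          # diameter of the unit ball
G = np.linalg.norm(A, 2) + np.linalg.norm(b)     # bound on ||grad f|| over K
T = 500
eta = D / (G * np.sqrt(T))

x = np.zeros(n)
iterates = []
for i in range(T):
    y = x - eta * grad(x)                        # gradient step
    x = project(y)                               # projection step
    iterates.append(x)
z = np.mean(iterates, axis=0)                    # output the average iterate z
print(z)
```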
Gradient Descent for Constrained
Convex Optimization
• $\|x^{(i+1)} - x^*\|_2^2 \le \|y^{(i+1)} - x^*\|_2^2$
  $= \|x^{(i)} - \eta \nabla f(x^{(i)}) - x^*\|_2^2$
  $= \|x^{(i)} - x^*\|_2^2 - 2\eta \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) + \eta^2 \|\nabla f(x^{(i)})\|_2^2$
• Using definition of $G$:
  $\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta} \left( \|x^{(i)} - x^*\|_2^2 - \|x^{(i+1)} - x^*\|_2^2 \right) + \frac{\eta}{2} G^2$
• $f(x^{(i)}) - f(x^*) \le \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta} \left( \|x^{(i)} - x^*\|_2^2 - \|x^{(i+1)} - x^*\|_2^2 \right) + \frac{\eta}{2} G^2$
• Sum over $i = 1, \ldots, T$:
  $\sum_{i=1}^{T} \left( f(x^{(i)}) - f(x^*) \right) \le \frac{1}{2\eta} \left( \|x^{(0)} - x^*\|_2^2 - \|x^{(T)} - x^*\|_2^2 \right) + \frac{T\eta}{2} G^2$
Gradient Descent for Constrained
Convex Optimization
• $\sum_{i=1}^{T} \left( f(x^{(i)}) - f(x^*) \right) \le \frac{1}{2\eta} \|x^{(0)} - x^*\|_2^2 + \frac{T\eta}{2} G^2 \le \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2 = DG\sqrt{T}$ for $\eta = D/(G\sqrt{T})$
• By convexity: $f(z) - f(x^*) \le \frac{1}{T} \sum_{i=1}^{T} \left( f(x^{(i)}) - f(x^*) \right) \le \frac{DG}{\sqrt{T}} \le \frac{\epsilon}{2}$ for $T = \frac{4 D^2 G^2}{\epsilon^2}$
Online Gradient Descent
• Gradient descent works in a more general case:
• $f \to$ sequence of convex functions $f_1, f_2, \ldots, f_T$
• At step $i$ need to output $x^{(i)} \in K$
• Let $x^*$ be the minimizer of $\sum_i f_i(x)$
• Minimize regret:
  $\sum_i \left( f_i(x^{(i)}) - f_i(x^*) \right)$
• The same analysis as before works in the online case.
Stochastic Gradient Descent
• (Expected gradient oracle): returns $g$ such that $\mathbb{E}[g] = \nabla f(x)$.
• Example: for SVM pick randomly one term from the loss function.
• Let $g_i$ be the gradient returned at step $i$
• Let $f_i(x) = g_i^T x$ be the (linear) function used in the $i$-th step of OGD
• Let $z = \frac{1}{T} \sum_i x^{(i)}$ and $x^*$ be the minimizer of $f$.
Stochastic Gradient Descent
• Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{DG}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle.
• $\mathbb{E}[f(z)] - f(x^*) \le \frac{1}{T} \sum_i \mathbb{E}\left[ f(x^{(i)}) - f(x^*) \right]$    (convexity)
  $\le \frac{1}{T} \sum_i \mathbb{E}\left[ \nabla f(x^{(i)})^T (x^{(i)} - x^*) \right]$
  $= \frac{1}{T} \sum_i \mathbb{E}\left[ g_i^T (x^{(i)} - x^*) \right]$    (gradient oracle)
  $= \frac{1}{T} \sum_i \mathbb{E}\left[ f_i(x^{(i)}) - f_i(x^*) \right]$
  $= \frac{1}{T} \mathbb{E}\left[ \sum_i \left( f_i(x^{(i)}) - f_i(x^*) \right) \right]$
• $\mathbb{E}[\cdot]$ = regret of OGD; by the earlier analysis $\frac{1}{T}\,\mathbb{E}[\text{regret}]$ is always $\le \epsilon$.
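A sketch of the stochastic oracle for the SVM example on synthetic data (names, step size, and horizon are illustrative; the projection onto $K$ is omitted for brevity): each step picks one random term of the loss, scales its subgradient by $N$ so that its expectation is a subgradient of the full objective, and the average iterate is returned as in the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, lam = 200, 10, 0.1
X = rng.standard_normal((N, n))                   # synthetic data (illustrative)
y = np.sign(X @ rng.standard_normal(n))           # labels in {-1, +1}

def full_objective(W):
    return np.maximum(0.0, 1 - y * (X @ W)).sum() + lam * W @ W

def stochastic_gradient(W):
    i = rng.integers(N)                           # pick one random term of the loss
    g = 2 * lam * W
    if 1 - y[i] * (X[i] @ W) > 0:                 # subgradient of the i-th hinge term, scaled by N
        g = g - N * y[i] * X[i]
    return g                                      # E[g] is a subgradient of the full objective

T, eta = 5000, 1e-4                               # illustrative horizon and step size
W = np.zeros(n)
iterates = []
for _ in range(T):
    W = W - eta * stochastic_gradient(W)
    iterates.append(W)
z = np.mean(iterates, axis=0)                     # average iterate, as in the theorem
print(full_objective(np.zeros(n)), full_objective(z))
```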